Field of Science


BLAST problem solved



The reason BLAST was finding no variation at the central positions of the 39 nt sequences was that I had set its 'word' length to 20. This told it to begin searching by looking for perfect matches to an initial sequence that was 20 nt long. This meant that it could never find a mismatch at the central position because such a mismatch would have been flanked by matched segments that were, at best, 19 nt long.

So I tested word lengths of 10 and of 8. Both eliminated the central mismatch problem, at the trivial expense of increasing search time to at worst 10 seconds for the whole genome.
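
To make the word-length effect concrete, here's a little sketch (not my actual BLAST run) showing why a 39 nt query whose only mismatch is at the central position can never provide a 20 nt perfect-match seed, while word sizes of 10 or 8 can:

```python
# A minimal sketch (not the real BLAST search): why a word size of 20 can
# never seed a hit to a 39 nt query with a mismatch at its central position.
def longest_exact_run(query, subject):
    """Length of the longest stretch of consecutive matching bases."""
    best = run = 0
    for q, s in zip(query, subject):
        run = run + 1 if q == s else 0
        best = max(best, run)
    return best

query   = "A" * 39                   # stand-in for a 39 nt query sequence
subject = "A" * 19 + "C" + "A" * 19  # the same sequence with a central mismatch

for word_size in (20, 10, 8):
    print(word_size, longest_exact_run(query, subject) >= word_size)
# word size 20 -> False (the best perfect match is only 19 nt); 10 and 8 -> True
```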

I'm gradually getting a better understanding of how BLAST searches work (it takes me a while to absorb the complexities), so I've also improved the searches in other ways - allowing higher "E-values" so I get sequences with more than one mismatch, and setting the maximum number of results to a value appropriate for my database size. I've also improved my Word/Excel shuffle methods, so I get a cleaner dataset. And I now carefully note the numbers of sequences at the various steps.
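
For anyone running this from the command line rather than through an interface, the settings I describe correspond roughly to flags in the NCBI BLAST+ blastn program; this is only a sketch, and the query file and database names are placeholders, not my actual files:

```python
# A rough command-line equivalent of the adjusted search settings, sketched
# with Python's subprocess and the NCBI BLAST+ blastn program.
import subprocess

subprocess.run([
    "blastn",
    "-query", "uss_queries.fasta",  # placeholder: the 39 nt query sequences
    "-db", "Hinf_genome",           # placeholder: the formatted genome database
    "-word_size", "10",             # short word so central mismatches can seed hits
    "-evalue", "10",                # permissive E-value to keep multi-mismatch hits
    "-max_target_seqs", "5000",     # cap on reported hits, sized to the database
    "-outfmt", "6",                 # tabular output for the Word/Excel shuffle
    "-out", "uss_hits.tsv",
], check=True)
```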

The graph above is the control analysis for only one orientation of only one of the genome sequences. So now I'm ready to search all three genomes against the control dataset and, if this looks good, against the USS dataset.

Pushing for more open science

I'm going to take advantage of my SMBE talk to proselytize for open science and research blogs.

The idea of open-access scientific publication is spreading. It's relatively easy for scientists to see the benefits of making the results of their research accessible to more people. But hardly anybody seems to consider the potential benefits of opening up the process of doing the science (as I'm trying to do with this blog). The reason is competition; most researchers are afraid that they'll be scooped if their competitors find out what they're doing.

Although competition can be a positive force in science, stimulating us to try to do better work than our competitors, I think the secrecy that arises from fear of competitors is a very negative force. Competitors are all potential collaborators, and a focus on competition shuts out the potential for collaboration. I was impressed by an NIH grant-writing workshop that told me that the first thing NIH does when people call to ask about funding opportunities for a new line of investigation is to connect them with people doing related work, encouraging them to contact these people as possible collaborators.

Of course it's easy for us, because we're working on a problem whose importance isn't widely recognized....

Stringency of Gibbs analysis

In revising our USS manuscript I've been thinking again about the stringency of the Gibbs motif searches. The stringency is set by telling the program how many sites it should expect to find. For most analyses we set this at 2000 (and it would find about 1350) but for the analysis of covariation between positions we set it at 3000 so we would have many poorer matches (it would find about 1650) and thus have more variation to analyze.

The Gibbs analysis assigns a score to each site it finds. With 'expect 2000' most sites have scores close to 1.0, but there are always some (~40 in the set I counted) with scores of zero and a few with scores between 0.5 and 0.9.

I only yesterday discovered how to get Excel to make a histogram; here's the histogram showing the distribution of scores for one of the 'expect 3000' searches. As with the 'expect 2000' analyses, most of the USS sites have scores close to 1.0, and few have scores in the middle range. But there are lots more sites with scores of 0, and today I've been checking out how bad these sites are.
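
(For anyone who'd rather script it, here's a minimal sketch of the same histogram in Python, assuming the Gibbs site scores have been pasted into a one-column text file called scores.txt.)

```python
# A minimal sketch of the score histogram outside Excel, assuming the Gibbs
# site scores are in a one-column text file 'scores.txt' (placeholder name).
import matplotlib.pyplot as plt

with open("scores.txt") as f:
    scores = [float(line) for line in f if line.strip()]

plt.hist(scores, bins=20, range=(0.0, 1.0), edgecolor="black")
plt.xlabel("Gibbs site score")
plt.ylabel("Number of sites")
plt.title("Score distribution for an 'expect 3000' search")
plt.show()
```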

I had feared that they were garbage included by mistake, just increasing the background noise but not really resembling the USS consensus. So I was pleasantly surprised to see that these sites have a strong USS signal when displayed as a sequence logo (the top logo in the figure). For comparison I made another logo using only the sites that scored 1.0; that's the bottom logo in the figure.
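
For anyone who wants to reproduce this kind of comparison programmatically, here's a rough sketch using the logomaker Python package; the sequences below are short made-up placeholders, not the real zero-score sites (which are 39 nt long).

```python
# A rough sketch of building the same kind of sequence logo with logomaker.
# The alignment here is a placeholder, not the real set of zero-score sites.
import logomaker
import matplotlib.pyplot as plt

zero_score_sites = [    # placeholder alignment (equal-length strings)
    "AAAGTGCGGT",
    "AAAGTGCGGT",
    "AAGGTGCGGT",
    "AAAGCGCGGT",
]

info_matrix = logomaker.alignment_to_matrix(zero_score_sites,
                                            to_type="information")
logomaker.Logo(info_matrix)
plt.show()
```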

Of course this analysis doesn't tell us whether the low-scoring sites found by the Gibbs analysis actually function as USS in DNA uptake. Most of them miss the consensus at one or more core positions, but the ~25% with perfect cores also have stronger flanking consensus (I'm not sure how to interpret this).

Not that kind of framing

I'm working on the revisions for the USS paper, writing up the new analysis of reading frames and constraints. I put the data in the table on the left; the "Relative to best codons" values estimate how easily the USS-encoded tripeptide would be translated. I'm puzzling over why only 49 USSs are in reading frame A.

The graph below shows USS number plotted as a function of the postulated cause, proteome number (this is plotted the opposite way to that in my previous post on this analysis). This plot more clearly shows that one point (frame A) is an outlier; it has a lot fewer USSs than we would expect given how often the tripeptide it specifies appears in the proteome.

Could this be because of codon constraints, i.e. because USSs in frame A require the tripeptide KVR to be encoded by inconvenient codons? No, the codon score is quite high (0.77) meaning that this frame's USS-specified codons are commonly used in the proteome.

If anything there should be more USSs in this frame than predicted by the proteome abundance of its tripeptide, because the consensus is weak for the first and second positions of the first codon and the second position of the second codon.

Reading frames vs USSs

Sorry for the dead air; I was away unexpectedly with no internet access.

We got the reviews back for our USS manuscript. Not too bad. Both reviewers asked for more analysis of the role of reading frames in USS locations. (See previous posts here and here.) This turned out to be both easy and fun to do.

Many USSs are in the protein-coding parts of the genome, and these can be sorted by which of the 6 possible reading frames they fall into relative to the proteins encoded there. The first two figures show the relationships of the USSs to the reading frames.

The frames aren't equally used. Frames A, B and C have 49, 125 and 425 USSs respectively, and frames D, E and F have 474, 125 and 157. The differences are too large to be explained by chance alone, and we think they arise because USSs in the less-used reading frames impose more severe constraints on protein function, so that many USSs arising in these frames are eliminated by natural selection.
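
As a quick sanity check (a sketch, not the analysis that will go in the manuscript), a chi-square goodness-of-fit test against equal use of the six frames shows just how non-uniform these counts are:

```python
# A quick sanity check: chi-square goodness-of-fit test of the frame counts
# against the null expectation that all six frames are used equally.
from scipy.stats import chisquare

observed = [49, 125, 425, 474, 125, 157]  # USS counts in frames A-F
print(chisquare(observed))                # expected counts are uniform by default
# The p-value is vanishingly small, so the frames clearly aren't used equally.
```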

The new analysis considers the two factors likely to contribute to this disparity. (Because the flanking segments exert only modest constraints on amino acid sequence, I've limited the new analysis to only the most frequent tripeptides encoded by the USS core in each frame.)

The first factor likely to affect USS reading frame usage is the differing biochemical properties of the tripeptides that USS cores in these reading frames will encode. Some combinations are intrinsically more versatile than others, useful at many different locations in a wide variety of proteins, whereas other tripeptide combinations will only be useful in particular contexts.

The second factor is 'codon bias'. Most amino acids are specifiable by at least 2 (often 4 and sometimes 6) different codons, some of which are more efficiently translated than others, and cells preferentially use the easiest-to-translate codons for proteins they need to make a lot of.

The new analysis evaluates the first factor by comparing the number of USSs in each frame to the total usage of the six tripeptides in all the proteins of the genome (the 'proteome'). If the differing versatilities of the tripeptides are responsible for their differing use in USSs, we should see a correlation between total number and USS-encoded number. The results are shown by the red symbols and line in the graph below. The predicted correlation exists. (The line fitted to the points has a confidence score of 0.86. I know enough statistics to know that this is a reasonably strong correlation.)
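
Here's a sketch of how that comparison can be set up. Only the USS counts per frame come from the data above; the proteome tripeptide counts are placeholders, so the numbers this prints are not the real fit:

```python
# A sketch of the first comparison: USS counts per frame vs. proteome counts
# of the corresponding tripeptides. The proteome counts are placeholders.
from scipy.stats import linregress

uss_counts      = [49, 125, 425, 474, 125, 157]        # frames A-F
proteome_counts = [800, 1500, 4200, 5100, 1600, 2100]  # placeholder values

fit = linregress(proteome_counts, uss_counts)
print(f"slope = {fit.slope:.3f}, r^2 = {fit.rvalue ** 2:.2f}")
```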

The blue symbols in the graph show the results of the other analysis. To look for a correlation between codon bias and USS reading frame usage, I needed a crude score indicating how easily each USS-encoded tripeptide would be translated. I was able to get a table of codon usage for the H. influenzae proteome from the TIGR website; this gave the percent usage of each codon. So for each tripeptide I calculated a 'USS-codons' score as the sum of the percentages of the codons specified by USSs in that reading frame, and a 'best-codons' score as the sum of the percentages of the most commonly used codons for its three amino acids. Then I calculated a 'codon cost' as the ratio of these scores, and multiplied it by 100 so it would fit neatly on the graph.
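
In case the arithmetic isn't clear, here's a sketch of the 'codon cost' calculation for a single tripeptide (KVR is used as the example); the codon-usage percentages are made-up placeholders, not the TIGR values, and the codon choices are only illustrative:

```python
# A sketch of the 'codon cost' arithmetic for one tripeptide (KVR as an
# example). Usage percentages are placeholders, not the TIGR values.
codon_usage = {"AAA": 4.0, "AAG": 1.2,   # Lys codons (percent usage)
               "GTT": 2.5, "GTG": 1.5,   # two of the Val codons
               "CGT": 2.2, "CGC": 1.8}   # two of the Arg codons

uss_codons  = ["AAG", "GTG", "CGC"]      # illustrative codons fixed by the USS
best_codons = ["AAA", "GTT", "CGT"]      # most-used codon for each amino acid

uss_score  = sum(codon_usage[c] for c in uss_codons)
best_score = sum(codon_usage[c] for c in best_codons)
codon_cost = 100 * uss_score / best_score   # scaled by 100 to fit on the graph
print(round(codon_cost, 1))
```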

If codon cost contributes to the disparity in reading frame usage by USS, we expect the blue points to show an inverse relationship; the highest codon costs should be for the least-used reading frames, so the blue line should slope down to the right. Instead we see no correlation at all. This tells us that codon bias makes little or no contribution to persistence of USSs in different reading frames.

We're not surprised by this second result. (In fact one of the postdocs was so sure of it that she didn't think the analysis was worth doing.) Other analysis we've done has indicated that USSs are rarely found in categories of genes known to be subject to strong codon bias. But that's for a different paper, and this new analysis will nicely address concerns raised by both reviewers of our present paper.

Next?

OK, so the CRP-S gene induction experiments established clearly that the ppdA gene's CRP-S promoter isn't being induced by the growth conditions I tried, and that changes in beta-gal activity don't necessarily result from changes in promoter activity. What to do now?

To make progress in these experiments, I need better tools for manipulating genes. I want to put the ppdD::lacZ fusion into the chromosome (it's on a plasmid now) and I want to test the effect of knocking out various genes, and I want to try a 'wild' E. coli strain. All of these require recombining desired genes into the chromosome, so I need to get the recombineering technique working for me. One of the post-docs was working on this, so I just need to take up where she left off.

I also need to get back to the tests I was setting up of components of the laser-tweezers project. I was just about ready to test attaching DNA to beads (have the beads with the bound streptavidin, have the biotinylated nucleotides to put on the ends of the DNA, have the DNA prep). Separately, I need to work with the new post-doc on an antibody-based method to attach H. influenzae cells to other beads, so they can easily be pushed around under the microscope.