Originally I had set the probability cutoff for the BLAST searches to 10e-10, because I wanted to exclude alignments to USS sequences that weren't true homologs (i.e. to sequences from unrelated positions in the genome that were similar because they were also USSs, not because they both descended from the same ancestral USS). This constraint meant that I got sequences with only 1 (or no) internal mismatches, and I was quite confident that these would all be true homologs. (In retrospect I might have been overconfident.)
After I did the first control analysis (of randomly selected non-USS sequences) and discovered that my BLAST searches were producing patterns unrelated to USS, I lowered the probability cutoff to 10e-4. This meant that I often got several different genomic sequences aligned to the same 39 bp USS sequence. I know that only one of these could be the true homolog, and I assumed that the best match would indeed be the true homolog (I had worked out an easy way to eliminate all but the best match). This would have been true if (1) all of the USS sequences in the H. influenzae Rd genome did indeed have true homologs in each of the other H. influenzae strain genomes, and if (2) I hadn't been searching the two strands of the genome separately. But (1) a small fraction almost certainly don't, and (2) I was, so a significant fraction of the alignments I was scoring were to the wrong sequences.
This wouldn't have been a big deal if including wrong sequences just introduced noise into the analysis, lowering its ability to detect differences caused by the USSs. But a long and heated discussion with one of the postdocs had helped me realize that including non-homologous USSs would have precisely the effect I was attributing to homologous USSs - reducing the amount of variation at USS positions relative to non-USS positions.

I had been thinking that the low probability cutoff (really 'high', as 10e-4 is higher than 10e-10) was good because it produced a set of alignments with more mismatched positions and thus more variation to feed into the analysis. But some of this variation was due to mistaken alignments, so I decided to go back to the original sets of alignments and redo the analysis, using only those that had no more than three mismatched positions. This eliminated almost all of the non-homologous alignments.
The graphs above show the difference between the two analyses for one of the three non-Rd genomes. I've cut off the 5 non-USS positions at each side, so these are only 29 bp wide. The 'better analysis' has lower genetic variation (Y axis goes to 3, not 4), because the alignments I had eliminated were the ones with the most variation (some of it due to non-homologous alignments), but the pattern is the same.
This is reassuring for two reasons. First, the pattern I saw in the earlier analyses (low variation at USS-motif positions) is still there, and still very strong, so I know it wasn't due to including non-homologous alignments. Second, because eliminating most of the problem alignments hasn't caused the pattern to become much weaker, I can be confident that any non-homologous alignments that might have slipped past the new more stringent cutoff aren't responsible for the pattern.
I redid the controls too with the new cutoff, and they give baseline variation of about 1.3%, with standard deviation of about 0.15%. So the least variable USS positions have much less variation than the rest of the genome, and the most variable have about twice as much. This high variation may be simply because the set of USS sequences has overall more variation than the rest of the genome. This is expected to be the case because USS are preferentially located in poorly constrained genomic positions such as non-coding regions and non-essential genes.
Now we just have to get the revised manuscript back to the journal editors before they lose patience with us.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
No comments:
Post a Comment
Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS