The reason BLAST was finding no variation at the central positions of the 39 nt sequences was that I had set its 'word' length to 20. This told it to begin each search by looking for a perfect match to a 20 nt stretch of the sequence. So it could never find a mismatch at the central position (position 20 of 39), because such a mismatch would be flanked by matched segments that were, at best, 19 nt long, too short to provide the initial 20 nt perfect match.
So I tested word lengths of 10 and of 8. Both eliminated the central mismatch problem, at the trivial expense of increasing search time to at worst 10 seconds for the whole genome.
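The word-length arithmetic can be checked with a small sketch. It is a simplified stand-in for BLAST's seeding step, not BLAST itself: it assumes the two sequences are already aligned without gaps, and the 39 nt sequence and the function names are made up for illustration.

```python
def longest_exact_run(query, subject):
    """Length of the longest run of consecutive matching positions."""
    best = run = 0
    for q, s in zip(query, subject):
        run = run + 1 if q == s else 0
        best = max(best, run)
    return best


def can_seed(query, subject, word_size):
    """True if some perfectly matched stretch is at least word_size long."""
    return longest_exact_run(query, subject) >= word_size


# Hypothetical 39 nt sequence (not taken from any real genome).
query = "ACGT" * 9 + "ACG"
centre = len(query) // 2  # index 19, i.e. the 20th of 39 positions

# Same sequence with a single mismatch at the central position.
new_base = "A" if query[centre] != "A" else "C"
subject = query[:centre] + new_base + query[centre + 1:]

for word_size in (20, 10, 8):
    print(word_size, can_seed(query, subject, word_size))
```

With the mismatch in the centre, the longest perfectly matched stretch is 19 nt, so a word length of 20 finds no seed while 10 and 8 both do.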
I'm gradually getting a better understanding of how BLAST searches work (it takes me a while to absorb the complexities), so I've also improved the searches in other ways - allowing higher "E-values" so I get sequences with more than one mismatch, and setting the maximum number of results to a value appropriate for my database size. I've also improved my Word/Excel shuffle methods, so I get a cleaner dataset. And I now carefully note the numbers of sequences at the various steps.
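For anyone running this from the command line, here is a rough sketch of how those three settings might be passed to the BLAST+ `blastn` program. This is an assumption on my part about the setup: other BLAST versions and the web interface name these options differently, and the file name, database name and numeric values below are placeholders, not the ones I actually used.

```python
def build_blastn_command(query_fasta, database, word_size=8,
                         evalue=10.0, max_target_seqs=5000):
    """Assemble a BLAST+ blastn argument list (all values are placeholders)."""
    return [
        "blastn",
        "-task", "blastn",          # plain blastn accepts short word sizes
        "-query", query_fasta,
        "-db", database,
        "-word_size", str(word_size),              # shorter seed word
        "-evalue", str(evalue),                    # more permissive cutoff
        "-max_target_seqs", str(max_target_seqs),  # cap on reported hits
        "-outfmt", "6",                            # tab-separated table
    ]


# Hypothetical file and database names, for illustration only.
cmd = build_blastn_command("control_39nt.fasta", "genome_db")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the database has been built; the tabular output opens directly in Excel.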
The graph above is the control analysis for only one orientation of only one of the genome sequences. So now I'm ready to search all three genomes against the control dataset and, if this looks good, against the USS dataset.