A few suggestions from the post-doc ("Try keeping all the BLAST files in the same folder.") got my Unix problems solved. Commenters on my last post have provided lots of suggestions, some of which are useful and some of which address problems I've already solved. Neil has even set up a NodalPoint page about this problem, but I haven't yet figured out how to edit in my comments.
I've now used standalone (local) BLAST to search the database of 2136 different 39 nt segments of the H. influenzae Rd genome with the source genome (Rd) sequence. This is the 'positive control' for the searches I want to do with the non-Rd genome sequences. This worked, and let me do some trials to work out which options I should specify. Then I did the same searches with the three other genomes I have. This showed me other problems, some of which I haven't worked out yet.
I have two big remaining problems.
First, I need to understand BLAST searches well enough to optimize the alignments. I understand that mismatches close to the ends of the sequences will be under-represented, because of how BLAST works. I think this appeared in the search results - instead of alignments with single mismatches near the ends I think I got alignments that had been shortened by 4 nt. I may be able to minimize this by setting the options appropriately, but I probably can't eliminate it. Luckily the first and last 5 nt are the least important for this analysis.
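The 4 nt shortening is what one would expect if a mismatch costs more than the few matching bases beyond it can earn back. Here is a minimal sketch of that arithmetic, not of BLAST itself (which uses word seeds and drop-off extension). The scores of +1 per match and -3 per mismatch are an assumption, as are the function names and the made-up 39 nt test sequence.

```python
def best_ungapped_segment(query, subject, match=1, mismatch=-3):
    """Return (start, end, score) of the highest-scoring ungapped stretch.

    Coordinates are 0-based, end-exclusive. The default scores are an
    assumption for illustration, not values read from any BLAST run.
    """
    best = (0, 0, 0)
    run_start, run_score = 0, 0
    for i, (q, s) in enumerate(zip(query, subject)):
        run_score += match if q == s else mismatch
        if run_score <= 0:
            # A stretch that has dropped to zero is not worth keeping.
            run_start, run_score = i + 1, 0
        elif run_score > best[2]:
            # Extend only when the score strictly improves.
            best = (run_start, i + 1, run_score)
    return best


def mutate(seq, index):
    """Return seq with the base at index swapped for a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return seq[:index] + swap[seq[index]] + seq[index + 1:]


# Hypothetical 39 nt segment, used only to show the effect.
segment = "ACGT" * 9 + "ACG"

near_end = best_ungapped_segment(segment, mutate(segment, 35))
internal = best_ungapped_segment(segment, mutate(segment, 20))
print(near_end)  # (0, 35, 35): alignment stops 4 nt short
print(internal)  # (0, 39, 35): full length, mismatch kept inside
```

With these scores, a mismatch four bases from the end is followed by only three matches, which exactly cancel its penalty, so the extension gains nothing and the alignment is reported 4 nt shorter. A less severe mismatch penalty would shrink this blind zone but not remove it.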
Second, the information I need to get from the non-Rd searches is the locations of mismatches within the 39 nt segments. For this I found that representing the output as pairwise alignments made it easiest to extract the positions of the mismatches within the alignments. This is relatively straightforward (yes Neil, with Word and Excel) provided the mismatches are internal to full-length 39 nt alignments, but it will need some more sophisticated tricks for alignments that are truncated at one end or the other. Another problem for extracting the position info is that half of the 39 nt sequences are in the opposite orientation to the others, and so they are reversed in the output.
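For anyone who would rather script this step than do it in Word and Excel, here is a minimal sketch in Python. It assumes the two aligned strings and the segment's start and end coordinates have already been copied out of one pairwise alignment, and that a reversed segment shows up with its start coordinate larger than its end coordinate. The function names and the example values are hypothetical.

```python
def mismatch_positions(genome_aln, segment_aln, seg_start, seg_end):
    """Return the 1-based positions within the segment that mismatch.

    genome_aln and segment_aln are the two aligned strings from one
    pairwise alignment. seg_start and seg_end are the segment coordinates
    printed beside its line; seg_start > seg_end is taken to mean the
    segment is in the reverse orientation.
    """
    step = 1 if seg_end >= seg_start else -1
    pos = seg_start
    mismatches = []
    for g, s in zip(genome_aln.upper(), segment_aln.upper()):
        if s == "-":
            # Extra base in the genome: no segment position is used up.
            continue
        if g != s:
            # Covers substitutions and bases missing from the genome.
            mismatches.append(pos)
        pos += step
    return sorted(mismatches)


def uncovered_positions(seg_start, seg_end, seg_len=39):
    """Return segment positions that a truncated alignment left out."""
    low, high = min(seg_start, seg_end), max(seg_start, seg_end)
    return [p for p in range(1, seg_len + 1) if p < low or p > high]


# Hypothetical alignment fragments, forward and then reversed.
print(mismatch_positions("acgta", "accta", 3, 7))
print(mismatch_positions("acgta", "accta", 39, 35))
print(uncovered_positions(5, 39))
```

Counting downward for reversed segments puts every mismatch in the segment's own numbering regardless of orientation, and the second function lists the end positions where a truncated alignment says nothing either way.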
The guy in the next office strongly recommended BBEdit, so I bought that. He said it does a lot of the editing chores that I'd otherwise need to write Perl scripts to do. Sounded great. But BBEdit has wandered far from its "Bare Bones Software" (="BB") roots, and learning how to use it will take some time...