Commenters are offering help and advice! Damn, I love open science! But I need to clarify what I'm trying to do.
I have a FASTA file with about 2000 segments of the H. influenzae strain Rd genome, each exactly 39 bp long. A few are shown on the left (yes, I know this is not FASTA format).
GenBank has genome sequences of 12 other H. influenzae strains (shown on the left). I want to take each of the 2000 Rd sequences and find the sequences of its homologs in the other strains' genomes. If all the genomes have homologs of all the sequences, that will give me, in principle, 24,000 sequences of 39 bp each (2000 queries x 12 genomes).
I've done this manually for a few of the 39 bp sequences. Often BLAST says it found no matches, and I don't really understand why.
When BLAST agrees to do what I want, it produces alignments like those on the left. The upper one shows an Rd sequence to which all the genomes have perfect matches; the lower one shows an Rd sequence for which several of the homologs have a single mismatch.
What I need to do is automate this process, so BLAST searches the 12 genome sequences with each of the 2000 queries. Ideally the output will be in a form that I can easily use for the next step.
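Here is a minimal sketch of how the automation might look, assuming the standalone BLAST+ command-line tools are installed and each of the 12 genomes is in its own FASTA file. The file names are hypothetical. The `blastn-short` task is BLAST+'s preset for short queries; it may or may not cure the "no matches" problem, since that is only my guess at the cause. The function only builds the command lines, one per genome, asking for tabular output that includes the aligned query and subject sequences.

```python
from pathlib import Path

# Columns requested from BLAST's tabular output (format 6).
# qseq and sseq are the aligned query and subject sequences.
OUTFMT = "6 qseqid sseqid pident length mismatch qstart qend sstart send qseq sseq"


def blastn_command(query_fasta, genome_fasta, out_tsv):
    """Build (but do not run) a blastn command for one genome."""
    return [
        "blastn",
        "-task", "blastn-short",      # preset intended for short queries
        "-query", str(query_fasta),   # all ~2000 Rd segments in one file
        "-subject", str(genome_fasta),
        "-outfmt", OUTFMT,
        "-out", str(out_tsv),
    ]


def commands_for_genomes(query_fasta, genome_fastas, out_dir):
    """One command per genome; output goes to <out_dir>/<genome name>.tsv."""
    out_dir = Path(out_dir)
    commands = []
    for genome in genome_fastas:
        out_tsv = out_dir / (Path(genome).stem + ".tsv")
        commands.append(blastn_command(query_fasta, genome, out_tsv))
    return commands
```

Each command in the returned list could then be run with `subprocess.run(cmd, check=True)`, giving one tab-separated table per genome for the next step.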
Because each of my ~2000 query sequences is centered on one of the ~2000 USS motifs in the Rd genome, they can all be aligned with each other (look back at the top figure to see this). What I want to do is create a meta-alignment of all the 24,000 homologs to these 2000 sequences. Then I will use this meta-alignment to score, for each of the 39 aligned positions, how often one of the homologous sequences has a mismatch to its query sequence. That is, I want to align all 24,000 sequences (or all 24,000 patterns of dots), and create a histogram showing how often each position differs from the Rd sequence.
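The scoring step could be sketched like this, assuming each homolog has already been paired with its Rd query as two ungapped strings of the same length. The function name and the decision to skip pairs of the wrong length (partial or gapped hits) are my own assumptions, not anything BLAST does for you.

```python
def mismatch_counts(pairs, length=39):
    """Count, for each aligned position, how many homologs differ from their query.

    pairs: iterable of (query_seq, homolog_seq) strings.
    Pairs that are not exactly `length` long are skipped.
    """
    counts = [0] * length
    for query, homolog in pairs:
        if len(query) != length or len(homolog) != length:
            continue  # partial or gapped hit; would need separate handling
        for i, (q, h) in enumerate(zip(query.upper(), homolog.upper())):
            if q != h:
                counts[i] += 1
    return counts


def text_histogram(counts):
    """Return a crude text histogram, one line per position (1-based)."""
    return "\n".join(f"{i + 1:2d} {'#' * c} {c}" for i, c in enumerate(counts))
```

With the real data, `pairs` would hold up to 24,000 entries and `counts` would have 39 values, one per position of the USS-centered alignment.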
Only some of the 39 positions make significant contributions to the USS motif; seven internal positions and the five at each end are not part of the motif. I hope this analysis will show me whether positions that contribute to the USS motif are more or less variable than nearby USS-independent positions.
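A rough sketch of that comparison, given a list of per-position mismatch counts and the number of homologs scored. The five positions at each end are taken from the description above; the seven internal non-motif positions are not listed here, so the caller has to supply them (0-based) as `internal_nonmotif`. All names are hypothetical.

```python
def group_mismatch_rates(counts, n_homologs, internal_nonmotif, flank=5):
    """Return (motif_rate, non_motif_rate): mean mismatches per position per homolog.

    counts: mismatch count at each aligned position.
    n_homologs: number of homolog sequences that were scored.
    internal_nonmotif: 0-based internal positions that are not part of the motif.
    flank: number of non-motif positions at each end.
    """
    length = len(counts)
    flanks = set(range(flank)) | set(range(length - flank, length))
    non_motif = flanks | set(internal_nonmotif)
    motif = set(range(length)) - non_motif
    if not motif or not non_motif or n_homologs <= 0:
        raise ValueError("need at least one position in each group and one homolog")

    def rate(positions):
        return sum(counts[i] for i in positions) / (len(positions) * n_homologs)

    return rate(motif), rate(non_motif)
```

A lower rate for the motif positions would suggest they are less variable than the nearby USS-independent positions, though a proper statistical test would be needed before concluding anything.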