Field of Science

Clarification of the analysis I want to do

Commenters are offering help and advice! Damn, I love open science! But I need to clarify what I'm trying to do.

I have a FASTA file with about 2000 segments of the H. influenzae strain Rd genome, each exactly 39bp (a few are shown on the left (yes, I know this is not FASTA format)).





Genbank has genome sequences of 12 other H. influenzae strains (shown on the left). I want to take each of the 2000 Rd sequences and find the sequences of its homologs in the other strains' genomes. If all the genomes have homologs of all the sequences, that will give me, in principle, 24,000 39bp sequences.






I've done this manually for a few of the 39bp sequences. Often BLAST says it found no matches - I don't really understand why.

When BLAST agrees to do what I want, it produces alignments like those on the left. The upper one shows an Rd sequence to which all the genomes have perfect matches; the lower one shows an Rd sequence several of whose homologs have a single mismatch to (yikes, is that syntax correct?).

What I need to do is automate this process, so BLAST searches the 12 genome sequences with each of the 2000 queries. Ideally the output will be in a form that I can easily use for the next step.

Because each of my ~2000 query sequences is centered on one of the ~2000 USS motifs in the Rd genome, they can all be aligned with each other (look back at the top figure to see this). What I want to do is create a meta-alignment of all the 24,000 homologs to these 2000 sequences. Then I will use this meta-alignment to score, for each of the 39 aligned positions how often one of the homologous sequences has a mismatch to its query sequence. That is, I want to align all 24,000 sequences (or all 24,000 patterns of dots), and create a histogram showing how often each position differs from its Rd homolog.

Only some of the 39 positions make significant contributions to the USS motif; seven internal positions and the five at each end are not part of the motif. I hope this analysis will show me whether positions that contribute to the USS motif are more or less variable than nearby USS-independent positions.

4 comments:

  1. I think the matching should be fairly straightforward. I tried a proof-of-concept thing with just one influenzae genome (PitEE) and two USS sequences. It took a few minutes to download the genome and put it into a database, but the scan itself took less than a half second (probably less than a tenth of a second) and yielded two hits as expected (or at least, as I expected). Stripping away the extraneous info the output looks like this (It will probably look like crap on the screen, though):


    Query: 39 GCAATAAAAGTTTGACTATTCTAACCGCACTTTTTAATA 1
    |||||||||||||||||||||||||||||||||||||||
    Sbjct: 376046 GCAATAAAAGTTTGACTATTCTAACCGCACTTTTTAATA 376084


    Query: 39 TATTAAATCCTCAGCAAATTTCAACCGCACTTTTTCCTT 1
    ||||||||||||||||||| |||||||||||||||||||
    Sbjct: 375304 TATTAAATCCTCAGCAAATATCAACCGCACTTTTTCCTT 375342

    Is this useful?

    Ideally the output will be in a form that I can easily use for the next step.

    The output can be tweaked some, but I think to get pairwise alignments there are a certain number of defaults that come along with it. It could be put into any other format with a bit of scripting, though, I presume.

    ReplyDelete
  2. All true, but the big hurdle is the 'bit of scripting'. I'm hoping the guy in the enxt office can help me with this.

    ReplyDelete
  3. The "script" to run each sequence against the genomes is basically one line in Linux, as long as the databases are set locally and you have all the query sequences in one directory. The tricky part is to define the parameters to run blast and the post-processing, depending on the type of output is selected..

    If you already have all the sequences and now that they share a motif that is highly conserved an alignment using local alignment (L-INS-i) with Mafft might be enough.

    Send me an email if you want more info.

    ReplyDelete
  4. I need to align each 39nt sequence with only its true homolog to score positions of mismatches. Then I want to create a histogram showing the frequencies of mismatches at each positions, summed over all the USSs in the genome. I think I'm nearly ready to do it (see today's post).

    ReplyDelete

Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS