Commenters are offering help and advice! Damn, I love open science! But I need to clarify what I'm trying to do.
I have a FASTA file with about 2000 segments of the H. influenzae strain Rd genome, each exactly 39bp (a few are shown on the left (yes, I know this is not FASTA format)).
Genbank has genome sequences of 12 other H. influenzae strains (shown on the left). I want to take each of the 2000 Rd sequences and find the sequences of its homologs in the other strains' genomes. If all the genomes have homologs of all the sequences, that will give me, in principle, 24,000 39bp sequences.
I've done this manually for a few of the 39bp sequences. Often BLAST says it found no matches - I don't really understand why.
When BLAST agrees to do what I want, it produces alignments like those on the left. The upper one shows an Rd sequence to which all the genomes have perfect matches; the lower one shows an Rd sequence several of whose homologs have a single mismatch to (yikes, is that syntax correct?).
What I need to do is automate this process, so BLAST searches the 12 genome sequences with each of the 2000 queries. Ideally the output will be in a form that I can easily use for the next step.
Because each of my ~2000 query sequences is centered on one of the ~2000 USS motifs in the Rd genome, they can all be aligned with each other (look back at the top figure to see this). What I want to do is create a meta-alignment of all the 24,000 homologs to these 2000 sequences. Then I will use this meta-alignment to score, for each of the 39 aligned positions how often one of the homologous sequences has a mismatch to its query sequence. That is, I want to align all 24,000 sequences (or all 24,000 patterns of dots), and create a histogram showing how often each position differs from its Rd homolog.
Only some of the 39 positions make significant contributions to the USS motif; seven internal positions and the five at each end are not part of the motif. I hope this analysis will show me whether positions that contribute to the USS motif are more or less variable than nearby USS-independent positions.
- Home
- Angry by Choice
- Catalogue of Organisms
- Chinleana
- Doc Madhattan
- Games with Words
- Genomics, Medicine, and Pseudoscience
- History of Geology
- Moss Plants and More
- Pleiotropy
- Plektix
- RRResearch
- Skeptic Wonder
- The Culture of Chemistry
- The Curious Wavefunction
- The Phytophactor
- The View from a Microbiologist
- Variety of Life
Field of Science
-
-
From Valley Forge to the Lab: Parallels between Washington's Maneuvers and Drug Development4 weeks ago in The Curious Wavefunction
-
Political pollsters are pretending they know what's happening. They don't.4 weeks ago in Genomics, Medicine, and Pseudoscience
-
-
Course Corrections5 months ago in Angry by Choice
-
-
The Site is Dead, Long Live the Site2 years ago in Catalogue of Organisms
-
The Site is Dead, Long Live the Site2 years ago in Variety of Life
-
Does mathematics carry human biases?4 years ago in PLEKTIX
-
-
-
-
A New Placodont from the Late Triassic of China5 years ago in Chinleana
-
Posted: July 22, 2018 at 03:03PM6 years ago in Field Notes
-
Bryophyte Herbarium Survey7 years ago in Moss Plants and More
-
Harnessing innate immunity to cure HIV8 years ago in Rule of 6ix
-
WE MOVED!8 years ago in Games with Words
-
-
-
-
post doc job opportunity on ribosome biochemistry!9 years ago in Protein Evolution and Other Musings
-
Growing the kidney: re-blogged from Science Bitez9 years ago in The View from a Microbiologist
-
Blogging Microbes- Communicating Microbiology to Netizens10 years ago in Memoirs of a Defective Brain
-
-
-
The Lure of the Obscure? Guest Post by Frank Stahl12 years ago in Sex, Genes & Evolution
-
-
Lab Rat Moving House13 years ago in Life of a Lab Rat
-
Goodbye FoS, thanks for all the laughs13 years ago in Disease Prone
-
-
Slideshow of NASA's Stardust-NExT Mission Comet Tempel 1 Flyby13 years ago in The Large Picture Blog
-
in The Biology Files
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
4 comments:
Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS
Subscribe to:
Post Comments (Atom)
I think the matching should be fairly straightforward. I tried a proof-of-concept thing with just one influenzae genome (PitEE) and two USS sequences. It took a few minutes to download the genome and put it into a database, but the scan itself took less than a half second (probably less than a tenth of a second) and yielded two hits as expected (or at least, as I expected). Stripping away the extraneous info the output looks like this (It will probably look like crap on the screen, though):
ReplyDeleteQuery: 39 GCAATAAAAGTTTGACTATTCTAACCGCACTTTTTAATA 1
|||||||||||||||||||||||||||||||||||||||
Sbjct: 376046 GCAATAAAAGTTTGACTATTCTAACCGCACTTTTTAATA 376084
Query: 39 TATTAAATCCTCAGCAAATTTCAACCGCACTTTTTCCTT 1
||||||||||||||||||| |||||||||||||||||||
Sbjct: 375304 TATTAAATCCTCAGCAAATATCAACCGCACTTTTTCCTT 375342
Is this useful?
Ideally the output will be in a form that I can easily use for the next step.
The output can be tweaked some, but I think to get pairwise alignments there are a certain number of defaults that come along with it. It could be put into any other format with a bit of scripting, though, I presume.
All true, but the big hurdle is the 'bit of scripting'. I'm hoping the guy in the enxt office can help me with this.
ReplyDeleteThe "script" to run each sequence against the genomes is basically one line in Linux, as long as the databases are set locally and you have all the query sequences in one directory. The tricky part is to define the parameters to run blast and the post-processing, depending on the type of output is selected..
ReplyDeleteIf you already have all the sequences and now that they share a motif that is highly conserved an alignment using local alignment (L-INS-i) with Mafft might be enough.
Send me an email if you want more info.
I need to align each 39nt sequence with only its true homolog to score positions of mismatches. Then I want to create a histogram showing the frequencies of mismatches at each positions, summed over all the USSs in the genome. I think I'm nearly ready to do it (see today's post).
ReplyDelete