In between thinking about the sxy manuscript (yes, it's getting closer to being done), I'm making progress on the motif analysis.
One big issue arises because about 70% of the USS are in sequences that also code for proteins. So these sequences have two different evolutionary forces acting on them. First, they must continue to code for proteins that function well - any new USS that compromise the function of the part of the proteins where they arise will be eliminated by natural selection. Second, the USS that function well in DNA uptake are preferentially replacing sequences that function less well.
We already know that USS in coding regions are preferentially found in particular reading frames. The figure to the left shows the number of perfect USS cores in each of the 6 possible reading frames of their proteins, and also shows the amino acids the USS would specify in each reading frame (source Karlin et al. 1996 NAR 24:4263-4272). You can see that most of the 395 'forward' USS are in reading frame 3, where the USS core specifies the tripeptide SAV (amino acids serine alanine valine). A similar fraction of the 571 'reverse' USSs are in reading frame -2, where their cores specify TAL (threonine alanine leucine). In both orientations the favoured frame specifies a central versatile amino acid flanked by one hydrophilic and one hydrophobic amino acid. Thus USS whose cores encode SAL and TAV have been suggested to be common because they are better tolerated than the others.
This analysis considered only perfect USS cores. Now I'm using the Gibbs sampler to get better datasets, and I want to find out whether being in coding regions affects the USS motif. I've finally managed to get the sampler to work on sets of genes, by splitting the gene sequence file into four short files it could handle and by specifying the expected motif structure with a strict 'prior' file.
The TIGR database provides links where I could download a file with all the coding sequences and a fine with all the non-coding sequences. Originally I thought I could only use these to get single motifs for (i) all the USS that code in the forward direction, (ii) all the USS that code in the reverse direction, and (iii) all the USS in non-coding parts of the genome. I know that it would be much more informative if I could sort out the sequences by reading frame (as in the drawing above), but I thought that would take some sophisticated bioinformatics skills.
But today I realized that it would actually be simple. The Gibbs output specifies, for each site that fits the motif, the position of its first base relative to the first base of the sequence. Because the first base is the start of the protein, I can find the reading frame simply by dividing the position number by 3. If the position of the first base of the motif is an even multiple of three (e.g. position = 93) then the motif is in frame 1. If it's not an even multiple but the remainder is 1, the motif is in frame 2. If the remainder is 2, the motif is in frame 3.
So all I need to do is paste the list of output sites into Excel (after some massaging in Word to get rid of excess spaces and separate the columns by tabs). Then I take the column of positions, use Excel's MOD function to get the remainder after division by 3, and SORT the sequences by the value in this column.
I had originally searched for USS coding in the reverse direction by searching the forward-coding sequences for the reverse motif. This isn't very satisfactory as it gives me a reverse-orientation motif, which is hard to compare with the forward one. But now I've found a web page that will take batches of sequences and convert them into their reverse complements (the sequence of the other strand), so I can analyze these for the forward motif. (Is your head spinning yet, dear reader?) Unfortunately it can only handle about 100 sequences at once, so I have to do the whole set (about 1730) in parts.
So that's what I'm about to do. I think I have all the Gibbs output files I need for this. The outputs aren't always centered on the same USS position so they're not all equally useful, but I think I have complete sets with the same center. And I have a few more queue'd up just in case.
- Home
- Angry by Choice
- Catalogue of Organisms
- Chinleana
- Doc Madhattan
- Games with Words
- Genomics, Medicine, and Pseudoscience
- History of Geology
- Moss Plants and More
- Pleiotropy
- Plektix
- RRResearch
- Skeptic Wonder
- The Culture of Chemistry
- The Curious Wavefunction
- The Phytophactor
- The View from a Microbiologist
- Variety of Life
Field of Science
-
-
Don't tell me they found Tyrannosaurus rex meat again!2 weeks ago in Genomics, Medicine, and Pseudoscience
-
-
-
Course Corrections4 months ago in Angry by Choice
-
-
The Site is Dead, Long Live the Site2 years ago in Catalogue of Organisms
-
The Site is Dead, Long Live the Site2 years ago in Variety of Life
-
Does mathematics carry human biases?4 years ago in PLEKTIX
-
-
-
-
A New Placodont from the Late Triassic of China5 years ago in Chinleana
-
Posted: July 22, 2018 at 03:03PM6 years ago in Field Notes
-
Bryophyte Herbarium Survey7 years ago in Moss Plants and More
-
Harnessing innate immunity to cure HIV8 years ago in Rule of 6ix
-
WE MOVED!8 years ago in Games with Words
-
-
-
-
post doc job opportunity on ribosome biochemistry!9 years ago in Protein Evolution and Other Musings
-
Growing the kidney: re-blogged from Science Bitez9 years ago in The View from a Microbiologist
-
Blogging Microbes- Communicating Microbiology to Netizens10 years ago in Memoirs of a Defective Brain
-
-
-
The Lure of the Obscure? Guest Post by Frank Stahl12 years ago in Sex, Genes & Evolution
-
-
Lab Rat Moving House13 years ago in Life of a Lab Rat
-
Goodbye FoS, thanks for all the laughs13 years ago in Disease Prone
-
-
Slideshow of NASA's Stardust-NExT Mission Comet Tempel 1 Flyby13 years ago in The Large Picture Blog
-
in The Biology Files
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
2 comments:
Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS
Subscribe to:
Post Comments (Atom)
You could really use some standalone software for doing basic, everyday stuff like reverse complementing without the restrictions imposed by web servers.
ReplyDeleteI really like EMBOSS. It's a package with hundreds of tools for sequence processing and analysis. Once installed, reverse complementing would be as easy as typing "revseq", then your sequence filename.
Are you guys Mac OSX users? If so you should check out ebiotools, which is an easy way to install EMBOSS and heaps of other bioinformatics tools. It's mentioned in this blog thread.
(Disclaimer: I can't address Mac OSX technical queries as I'm a Linux user).
It's true, we often find it easier to do things with the inappropriate tools we know than to learn to use new and better tools. This is partly because we're usually doing something we haven't done before; all the bioinformatics I'm doing now is a new direction for me, and I may rarely or never need to do these exact things again.
ReplyDeleteBut the EMBOSS package looks like it will be useful for lots of things, and well-documented enough that we can use it easily.
And yes, we're committed Mac people, so I'll try ebiotools.
Thanks!