I'm drafting a manuscript that will compare the sequence preference of the H. influenzae DNA uptake machinery to the consensus sequence of the USS-related repeats in the H. influenzae genome. The previous posts (Reanalysis of old uptake data and New wine in old bottles) were about assembling old uptake-preference data. This post is about work towards an unbiased consensus of the USS-related repeats in the genome.
As I discussed at the end of Representation matters, our present understanding of the USS-like repeats comes from analyses that were intrinsically biased. For example, the analyses presented in our USS-2006 manuscript (see link in sidebar) began with the assumption that the USS would be best defined by a unbroken string of approximately 9 bases. However some of these analyses led to consensus sequences where the flanking AT-rich segments had nearly as strong a consensus as the core we were searching for, leading me to suspect that, if we had instead searched for the flanking consensus, we might have gotten a different but equally valid (or equally invalid) USS consensus. (I'm not happy with this description of the problem, but I'll continue anyway.)
So my goal is to use a motif-search program to search the genome for highly-repeated patterns, starting the searches with as few assumptions as possible, and varying those assumptions to include other possibilities. Several such programs are available on-line; one of the grad students has quite a bit of experience with them. Usually you can just paste the sequences you want analyzed into a web form, click a few boxes, and the analysis appears on the screen (or is sent to your in-box). My kind of bioinformatics.
The grad student advises using the Gibbs Motif Sampler, which is available in simplified form through RSA-Tools and in more sophisticated form through the Wadsworth Institute. The FASTA problem is solved (thanks to helpful readers), so I first tried just giving both versions the first 25000bp of the genome sequence, in five 5000bp sequences, and asking them to find one 9bp motif. The Gibbs analysis is built on Bayesian probabilities, so it asks you to specify a 'prior' expected number of occurrences of the motif. (I'm glad that I now understand a bit about what Bayesian means, if not how to do it - see posts early this month.)
The results come back in various forms; tables of frequencies of each base at each position, and an alignment of the occurrences of the motif it found. The best way to visualize the results is to hand then alignment over to the sequence logo program Weblogo (see Representation matters).
Here are a couple of examples of the logos. The one on the left comes from a search for a 10bp motif, and the one on the right from a search for a 30bp motif.
The first thing I learned (of course) was that this analysis wasn't going to be as simple as I'd hoped.
I already know that the long version of the USS consensus is a bit 'palindromic', because Ham Smith pointed this out when he discovered the long version in 1995. ...Brief interruption while I looked for a decent explanation of 'DNA palindrome' on-line, found nothing, and then drew the accompanying illustration. These base-paired DNA sequences are palindromic. In each, the sequence of the upper strand, read from its 5' end to its 3' end (red arrows) is the same as the sequence of the lower strand, read from its 5' end to its 3' end (blue arrows). (The analogy is to a text palindrome like "Madam, I'm Adam".
The USS consensus is far from being a perfect palindrome, but looking at the logo of the long version above shows that there are AAAs at the left end, TTTs at the right end, and a TTTAAA bit in the middle. For the Gibbs motif searches, this is enough that, if the expected number of occurrences is high (i.e. the search stringency is low), it aligns "+"-orientation USSs to "-"-orientation ones.
This only makes sense when we remember that the USS core is not at all palindromic, so on one strand it reads 5'-AAGTGCGGT and on the other strand it reads 5'-TTCACGCCA. About half the USSs in the genome point in one direction (e.g. clockwise around the genome, we can call this "+") and the other half point the other way ("-").
The tendency of Gibbs to align some of them back-to-front can be reduced (maybe even eliminated) by setting the expected number of occurrences quite low (no higher than the number of perfect USS cores we would really expect). This doesn't prevent the Gibbs program from finding more than this number. I need to think more about whether this does prevent it from finding occurrences that match quite imperfectly - my difficulty is that my own brain has a strong tendency to think mainly about the USS core.
I think this is how science is supposed to work - a kind of dialectic between reality and our biases. We start each analysis with expectations that don't match reality, then our biased analysis produces results that, although still wrong, are nevertheless closer to reality than the expectations we started with. We then take these modified expectations into the next analysis, and again get results that are a bit closer to reality. We never perfectly comprehend reality, but with good science we get closer and closer.
- Home
- Angry by Choice
- Catalogue of Organisms
- Chinleana
- Doc Madhattan
- Games with Words
- Genomics, Medicine, and Pseudoscience
- History of Geology
- Moss Plants and More
- Pleiotropy
- Plektix
- RRResearch
- Skeptic Wonder
- The Culture of Chemistry
- The Curious Wavefunction
- The Phytophactor
- The View from a Microbiologist
- Variety of Life
Field of Science
-
-
From Valley Forge to the Lab: Parallels between Washington's Maneuvers and Drug Development4 weeks ago in The Curious Wavefunction
-
Political pollsters are pretending they know what's happening. They don't.4 weeks ago in Genomics, Medicine, and Pseudoscience
-
-
Course Corrections5 months ago in Angry by Choice
-
-
The Site is Dead, Long Live the Site2 years ago in Catalogue of Organisms
-
The Site is Dead, Long Live the Site2 years ago in Variety of Life
-
Does mathematics carry human biases?4 years ago in PLEKTIX
-
-
-
-
A New Placodont from the Late Triassic of China5 years ago in Chinleana
-
Posted: July 22, 2018 at 03:03PM6 years ago in Field Notes
-
Bryophyte Herbarium Survey7 years ago in Moss Plants and More
-
Harnessing innate immunity to cure HIV8 years ago in Rule of 6ix
-
WE MOVED!8 years ago in Games with Words
-
-
-
-
post doc job opportunity on ribosome biochemistry!9 years ago in Protein Evolution and Other Musings
-
Growing the kidney: re-blogged from Science Bitez9 years ago in The View from a Microbiologist
-
Blogging Microbes- Communicating Microbiology to Netizens10 years ago in Memoirs of a Defective Brain
-
-
-
The Lure of the Obscure? Guest Post by Frank Stahl12 years ago in Sex, Genes & Evolution
-
-
Lab Rat Moving House13 years ago in Life of a Lab Rat
-
Goodbye FoS, thanks for all the laughs13 years ago in Disease Prone
-
-
Slideshow of NASA's Stardust-NExT Mission Comet Tempel 1 Flyby13 years ago in The Large Picture Blog
-
in The Biology Files
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
2 comments:
Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS
Subscribe to:
Post Comments (Atom)
I am interested to hear the details of the analyses you're doing.....like what different size motifs you are looking for (you mentioned 10-bp and 30bp in the blog) and what sort of numbers you get out at the end.
ReplyDeleteHelpful post.
ReplyDelete