Field of Science

The postdoc's new analysis saves us work

Just before he left for a brief Christmas vacation, the postdoc did a detailed analysis of the genomic uptake sequences picked out by (i) the genomic USS motif identified by the Gibbs Motif Sampler and (ii) the DNA uptake motif identified by his sequencing experiment.  The two motifs look quite different, and if we applied them both to the same long random DNA sequence we would expect them to pick out different sub-sequences, each more or less corresponding to its own motif.  But what will they identify in the H. influenzae genome?

We expect the sequences picked out by the genomic USS motif to resemble the search motif (because that's how the motif was identified in the first place).  But what will the uptake motif find?  It's much simpler, so will it find mainly sequences that just have the four-base inner core GCGG motif?

The analysis is done by sliding the motif across the genome, at each position using the motif to calculate a score for the 32 bases lined up with the motif.  This is done with each strand of the 1,830,138 bp genome, so a total of 3,660,276 scores are generated with each motif.  The postdoc then plotted a histogram of the scores for each motif.

At full resolution each histogram looks no different from what you would get for a random DNA sequence.  But if we zoom in on the bottom right corner of each graph, we see little blips of about 2000 high-scoring positions.  As expected, the sequences of the 1793 positions in the genomic-motif scoring blip give a motif that looks just like the genomic motif we searched with.  Unexpectedly, the sequences of the 1892 positions found with the much simpler uptake motif also give a motif a lot like the genomic motif, much more complex than the uptake motif.  In fact, the two searches found mostly the same positions; 1689 of the positions in each blip were also present in the other blip.
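
For readers who want to see the sliding-motif scan concretely, here is a minimal Python sketch. It is not the postdoc's actual code: the post doesn't say how each score is calculated, so the log-odds scoring against a uniform 0.25 background is an assumption, and the 4-position toy motif, the toy sequence, and every function name are made up for illustration (the real motifs are 32 positions wide).

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds_matrix(freqs, background=0.25):
    """Turn per-position base frequencies into log2 odds scores.

    ASSUMPTION: a uniform background; a real analysis would more likely
    use the genome's own base composition.
    """
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]


def scan_strand(seq, matrix):
    """Score every window of len(matrix) bases along one linear strand."""
    width = len(matrix)
    return [
        sum(matrix[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


def scan_both_strands(seq, matrix):
    """Return (forward scores, reverse-complement scores)."""
    return scan_strand(seq, matrix), scan_strand(reverse_complement(seq), matrix)


# Hypothetical toy motif: strongly prefers the inner-core bases GCGG.
toy_freqs = [
    {b: (0.85 if b == core else 0.05) for b in BASES} for core in "GCGG"
]
toy_matrix = log_odds_matrix(toy_freqs)

# Toy "genome" with GCGG on the forward strand and CCGC (GCGG on the
# other strand) further along.
toy_genome = "ATGCGGTTCCGCAT"
forward_scores, reverse_scores = scan_both_strands(toy_genome, toy_matrix)

best_forward = forward_scores.index(max(forward_scores))
best_reverse = reverse_scores.index(max(reverse_scores))
print("best forward window starts at", best_forward)
print("best reverse-strand window starts at", best_reverse)
```

Collecting the scores from both strands into one list and plotting a histogram would give the kind of graph described above, with any genuine uptake sequences showing up as a small blip at the high-score end.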

One of the explanations we were considering for the differences between the two motifs is that the Gibbs Motif Sampler might have unrecognized biases, so that the sequences it identified were not properly representative of the sequences in the genome.  (The most likely candidate is the way we specified the search frame for the Gibbs analysis.)  We were going to test this possibility by simulating the evolution of some genomes using each of the motifs in turn, and then testing whether Gibbs searches of these evolved genomes gave back the original motifs.

But this new result tells us that this possibility is not the explanation for the discrepancy between the two motifs. The uptake sequences in the genome really do look like the full genomic motif, even though the bias of the uptake machinery only cares strongly about the four inner-core bases.  I confess that I like this result partly because it saves me from having to run a bunch of USS-evolution simulations to generate sequences for Gibbs analysis.

We suggest three other explanations.  First, the steps leading from uptake to recombination might have sequence biases, so that only sequences with the complex motif efficiently recombine.  Second, there might be functional constraints on the sequences after they've recombined, so that the complex ones are more likely to become fixed in the population.  Although it's certainly likely that some sequence biases and functional constraints do exist, to me it seems very unlikely that they would generate such a complex motif.  Thus I prefer our final possibility, that the uptake motif produced by the data is incomplete because it neglects the effects of interactions between the different positions that contributed to uptake (that is, because it incorrectly assumes that each base in the motif acts independently of the others).

We then go on to describe the interaction analysis we've done and the tests we've made (well, the postdoc's about to make) using defined sequences.


  1. "This is done with each strand of the 1,830,138 bp genome, so a total of 3,660,276 scores are generated with each motif."

    As a side note, since the scoring matrix is 32 positions long, there are only 1,830,107 alignment frames for a linear sequence. To get all frames, I pseudo-circularized the genome by duplicating the first 31 bases of the genome and concatenating them to the end.

  2. I have also preferred the final possibility (the interactions explain the discrepancy). But we can't entirely discount the other two without some other work.

    I think the use of "complex" above is confusing. Sequences with high scores using the genomic motif have LESS potential complexity than sequences with high scores using the uptake motif. More "specific" or more "informative" might make more sense.
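
The frame arithmetic in comment 1 can be checked with a short Python sketch. This is only an illustration of the idea described there, not the commenter's code; the function names and the toy sequence are hypothetical.

```python
def linear_frame_count(genome_length, motif_width):
    """Number of full-width windows that fit on a linear sequence."""
    return genome_length - motif_width + 1


def pseudo_circularize(seq, motif_width):
    """Append the first (motif_width - 1) bases to the end of the sequence,
    so that windows starting near the end wrap around to the beginning."""
    return seq + seq[: motif_width - 1]


# The numbers from the comment: a 32-position matrix on a 1,830,138 bp genome.
genome_length = 1_830_138
motif_width = 32
frames_linear = linear_frame_count(genome_length, motif_width)
frames_circular = linear_frame_count(genome_length + motif_width - 1, motif_width)
print(frames_linear, frames_circular, 2 * frames_circular)

# Toy check on a 10-base sequence with a 4-base window.
toy = "ACGTACGTAC"
toy_circular = pseudo_circularize(toy, 4)
toy_windows = [toy_circular[i : i + 4] for i in range(len(toy))]
```

After pseudo-circularization there is one window per base of the original genome on each strand, which is where the total of 3,660,276 scores per motif comes from.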

