Just before he left for a brief Christmas vacation, the postdoc did a detailed analysis of the genomic sequences picked out by (i) the genomic USS motif identified by the Gibbs Motif Sampler and (ii) the DNA uptake motif identified by his sequencing experiment. The two motifs look quite different, and if we applied them both to the same long random DNA sequence we would expect them to pick out different sub-sequences, each more-or-less corresponding to its motif. But what will they identify in the H. influenzae genome?
We expect the sequences picked out by the genomic USS motif to resemble the search motif (because that's how the motif was identified in the first place). But what will the uptake motif find? It's much simpler, so will it mainly find sequences that have just the four-base inner-core GCGG motif?
The analysis is done by sliding the motif across the genome, at each position using the motif to calculate a score for the 32 bases lined up with the motif. This is done with each strand of the 1,830,138 bp genome, so a total of 3,660,276 scores are generated with each motif. The postdoc then plotted a histogram of the scores for each motif.
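To make the sliding-and-scoring step concrete, here is a minimal Python sketch. It is not the postdoc's actual code. It assumes the motif is represented as a position-specific matrix of base frequencies, that each window's score is the sum of log-odds against a uniform background, and that the genome is treated as circular, so that each strand yields one score per position (consistent with 2 × 1,830,138 = 3,660,276). The toy four-column motif, the toy genome and all function names are made up for illustration; the real motifs are 32 columns wide.

```python
import math
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def column(preferred, p=0.85):
    """Hypothetical motif column: one favoured base, the rest share 1 - p."""
    rest = (1.0 - p) / 3.0
    return {b: (p if b == preferred else rest) for b in BASES}


def log_odds_matrix(freqs, background=0.25):
    """Convert per-position base frequencies into log2-odds scores.

    Assumes a uniform background; a real analysis would likely use the
    genome's own base composition.
    """
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]


def score_window(window, pssm):
    """Sum the per-position scores for the bases lined up with the motif."""
    return sum(col[base] for col, base in zip(pssm, window))


def slide_scores(seq, pssm, circular=True):
    """Score every window of motif width along one strand."""
    width = len(pssm)
    # Assumption: wrap around the origin so a circular genome gives
    # exactly len(seq) windows per strand.
    padded = seq + seq[: width - 1] if circular else seq
    return [
        score_window(padded[i : i + width], pssm)
        for i in range(len(padded) - width + 1)
    ]


def score_both_strands(genome, pssm, circular=True):
    """Scores for the forward strand followed by the reverse strand."""
    forward = slide_scores(genome, pssm, circular)
    reverse = slide_scores(reverse_complement(genome), pssm, circular)
    return forward + reverse


def histogram(scores, bin_width=1.0):
    """Count scores per bin; keys are the lower edge of each bin."""
    return Counter(math.floor(s / bin_width) * bin_width for s in scores)


# Toy example: a four-column motif that favours GCGG.
toy_pssm = log_odds_matrix([column("G"), column("C"), column("G"), column("G")])
toy_genome = "AAGCGGTTAC"
toy_scores = score_both_strands(toy_genome, toy_pssm)

if __name__ == "__main__":
    for lower_edge, count in sorted(histogram(toy_scores).items()):
        print(f"{lower_edge:6.1f}  {'#' * count}")
```

In this toy case the single GCGG window on the forward strand gets the top score and everything else falls into lower bins, which is the kind of separation a score histogram is meant to show.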
One of the explanations we were considering for the differences between the two motifs is that the Gibbs Motif Sampler might have unrecognized biases that made the sequences it identified unrepresentative of the uptake sequences in the genome. (The most likely candidate is the way we specified the search frame for the Gibbs analysis.) We were going to test this possibility by simulating the evolution of some genomes using each of the motifs in turn, and then testing whether Gibbs searches of these evolved genomes gave back the original motifs.
But this new result tells us that this possibility is not the explanation for the discrepancy between the two motifs. The uptake sequences in the genome really do look like the full genomic motif, even though the bias of the uptake machinery only cares strongly about the four inner-core bases. I confess that I like this result partly because it saves me from having to run a bunch of USS-evolution simulations to generate sequences for Gibbs analysis.
We suggest three other explanations. First, the steps leading from uptake to recombination might have sequence biases, so that only sequences with the complex motif recombine efficiently. Second, there might be functional constraints on the sequences after they've recombined, so that the complex ones are more likely to become fixed in the population. Although some sequence biases and functional constraints very likely do exist, to me it seems very unlikely that they would generate such a complex motif. Thus I prefer our third possibility: that the uptake motif produced from the data is incomplete because it neglects the effects of interactions between the different positions that contribute to uptake (that is, because it incorrectly assumes that each base in the motif acts independently of the others).
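To illustrate what "acts independently" means here: under an independent-position model, the frequency of any pair of bases at two positions should equal the product of the two single-position frequencies. One common way to measure departures from that is the mutual information between two columns of a set of aligned sequences. The sketch below is only an illustration of that idea, not the interaction analysis we actually did; the function name and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def mutual_information(seqs, i, j):
    """Mutual information (in bits) between positions i and j.

    Zero means the bases at the two positions vary independently in
    this set of sequences; larger values mean they are coupled.
    """
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # Expected pair frequency if the positions were independent.
        p_expected = (count_i[a] / n) * (count_j[b] / n)
        mi += p_ab * math.log2(p_ab / p_expected)
    return mi


# Toy data: in the first set the two positions vary independently;
# in the second, the base at position 0 fully predicts position 1.
independent_set = ["AA", "AC", "CA", "CC"]
coupled_set = ["AA", "AA", "CC", "CC"]

mi_independent = mutual_information(independent_set, 0, 1)
mi_coupled = mutual_information(coupled_set, 0, 1)
```

A motif built from single-position frequencies alone would look identical for the two toy sets' individual columns, even though only the second set has any dependence between positions.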
We then go on to describe the interaction analysis we've done and the tests we've made (well, the postdoc's about to make) using defined sequences.