This analysis is prompted by the disparity (dissonance? discrepancy? disagreement?) between the uptake bias the post-doc has measured with his degenerate DNA fragments and the sequences overrepresented in the genome. The top part of the figure is a diagram of one of the DNA sequences that H. influenzae cells prefer to take up. The middle part is a 'sequence logo' based on the related sequences found in the H. influenzae genome, and the bottom part is a logo based on the uptake biases he measured.
Because the two logos were drawn from very different datasets, we can't directly compare their overall 'importance' (the technical term is 'information content', indicated by the height of each column of letters); I've instead just shrunk the height of the genomic logo image so its overall importance appears similar to that of the uptake logo.
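For readers who want to see what 'information content' means for a single column of a logo, here is a minimal sketch. It assumes the four bases are equally frequent in the background and applies no small-sample correction, so it is only the textbook version of the calculation, not necessarily what the logo software did. The function name and the example counts are made up for illustration.

```python
import math

def column_information(counts):
    """Information content (in bits) of one logo column, given base counts.

    Assumes a uniform background over A, C, G and T (an assumption made
    here for simplicity) and applies no small-sample correction.
    """
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    # 2 bits is the maximum for a four-letter alphabet
    return 2.0 - entropy

# Hypothetical columns: fully conserved, two bases equally common, no preference
conserved = {"A": 100, "C": 0, "G": 0, "T": 0}
two_way = {"A": 50, "C": 50, "G": 0, "T": 0}
uniform = {"A": 25, "C": 25, "G": 25, "T": 25}

for name, col in [("conserved", conserved), ("two_way", two_way), ("uniform", uniform)]:
    print(name, column_information(col))
```

A fully conserved position gets the full 2 bits and a position with no preference gets 0, which is why the height of a column can be read as its 'importance'.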
The two logos still look very different, even though their consensus sequences are both identical to that shown on the double helix above them. The genome logo has a block of nine 'core' bases on the left, all of roughly equal importance (indicated by their height), and two 'T-tracts' on the right. But the core bases in the DNA-uptake logo have very different importances, with four (GCGG) being much more important than the others. The T-tracts in the uptake logo also appear much less important than those in the genomic logo.
We think the sequences in the genome accumulated (over many millions of years) due to the sequence bias of the cells' DNA-uptake machinery, so we don't understand why the two patterns are so different. Maybe other cellular processes contribute additional sequence biases, or maybe the difference is just an artefact of the way the genome sequences were identified. One way to (maybe) clarify the issues is to simulate the accumulation process in a computer program. We already have such a program (described in this research paper), and have used it with the data matrix that specifies the genomic logo. So in principle all we need to do is run this program with the new uptake-based matrix.
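To make the idea of 'simulating the accumulation process' concrete, here is a toy sketch of one way such a simulation could be structured. This is not our actual program: the function names, the four-position matrix (preferred bases GCGG, everything else weighted 0.1), the scoring rule (product of per-position weights, best window on one strand only) and all the parameter values are assumptions chosen to keep the example short.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Replace each base with a random base (possibly the same one) at the given rate."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)

def best_score(fragment, matrix):
    """Highest product of per-position weights over all windows in the fragment."""
    width = len(matrix)
    best = 0.0
    for start in range(len(fragment) - width + 1):
        score = 1.0
        for pos in range(width):
            score *= matrix[pos][fragment[start + pos]]
        best = max(best, score)
    return best

def run_cycle(genome, matrix, mut_rate, frag_len, n_frags, rng):
    """One cycle: mutate the genome, then let some fragments recombine back in."""
    genome = mutate(genome, mut_rate, rng)
    for _ in range(n_frags):
        start = rng.randrange(len(genome) - frag_len + 1)
        # The fragment stands in for DNA from a diverged relative of the genome
        fragment = mutate(genome[start:start + frag_len], mut_rate, rng)
        # Treat the score as the probability of uptake and recombination
        if rng.random() < best_score(fragment, matrix):
            genome = genome[:start] + fragment + genome[start + frag_len:]
    return genome

rng = random.Random(0)
# Hypothetical uptake matrix: weight 1.0 for the preferred base, 0.1 otherwise
toy_matrix = [{b: (1.0 if b == best else 0.1) for b in BASES} for best in "GCGG"]
genome = "".join(rng.choice(BASES) for _ in range(500))
before = genome.count("GCGG")
for _ in range(20):
    genome = run_cycle(genome, toy_matrix, 0.01, 20, 10, rng)
print("GCGG count before:", before, "after:", genome.count("GCGG"))
```

Swapping in a different matrix (genomic or uptake-based) only changes the `matrix` argument, which is the sense in which rerunning the simulation with new data should be simple in principle.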
In practice, not so easy. The model is quite complicated even though the processes it simulates are treated very simply, and I've forgotten all the details about how it works. Luckily it's quite well documented, and the paper describing it is very clearly written (I'm patting myself on the back for this). One thing I do remember is that the program ran very slowly when dealing with the big genomic matrix (29 positions) rather than the short fake matrix we used for most runs. I can help it out by specifying a fast mutation rate, a short genome and short DNA fragments, and by seeding the genome with some partial matches to the uptake sequence consensus.
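As an illustration of the last trick, here is a rough sketch of how a short random genome might be seeded with partial matches to a consensus. The function name, the number of seeds and mismatches, and the nine-base string are stand-ins for illustration, not values taken from our program.

```python
import random

BASES = "ACGT"

def seeded_genome(length, consensus, n_seeds, n_mismatches, rng):
    """Random genome with n_seeds imperfect copies of the consensus planted in it."""
    genome = [rng.choice(BASES) for _ in range(length)]
    width = len(consensus)
    # Pick non-overlapping slots so planted sites never overwrite each other
    slots = rng.sample(range(length // width), n_seeds)
    for slot in slots:
        site = list(consensus)
        # Change n_mismatches distinct positions to a different base
        for pos in rng.sample(range(width), n_mismatches):
            site[pos] = rng.choice([b for b in BASES if b != site[pos]])
        start = slot * width
        genome[start:start + width] = site
    return "".join(genome)

# Stand-in consensus; substitute the real uptake sequence consensus
CONSENSUS = "AAGTGCGGT"
genome = seeded_genome(900, CONSENSUS, n_seeds=10, n_mismatches=2, rng=random.Random(1))
print(len(genome), genome[:45])
```

Starting from partial matches means the simulation only has to improve existing sites rather than wait for them to arise by chance, which is where most of the run time would otherwise go.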