One of the results of the postdoc's lovely analysis of DNA uptake specificity is that one block of four positions in the 32 nt 'uptake sequence' is critically important for uptake. This is 5'-GCGG-3' (and 5'-CCGC-3' in the other strand).
The H. influenzae genome contains 10,044 occurrences of this sequence, but a random sequence of the same length and base composition is expected to contain only 4107. This suggests that the molecular drive arising from biased DNA uptake may have caused the excess ~6000 occurrences to accumulate in the genome. We know that about 2000 of them have strong matches to the full uptake sequence motif, but what about the rest? Might they also have more-or-less-weak matches, because they are under weaker drive?
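For anyone who wants to repeat this kind of observed-versus-expected comparison outside a word processor, here is a minimal Python sketch. It is not how the numbers above were obtained; the function names are made up for illustration, and it assumes the simplest null model, in which each base is drawn independently at the genome's own base frequencies. It also counts overlapping hits (GCGGCGG counts as two), which a Find/Replace count would not.

```python
from collections import Counter

def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count

def expected_count(seq, motif):
    """Expected motif count in a random sequence of the same length and
    base composition, assuming independent positions (an assumption)."""
    n = len(seq)
    freqs = {base: c / n for base, c in Counter(seq).items()}
    p = 1.0
    for base in motif:
        p *= freqs.get(base, 0.0)
    # One chance of a match at every possible start position.
    return p * (n - len(motif) + 1)

def observed_and_expected(seq, motif="GCGG", rc_motif="CCGC"):
    """Sum the counts for the motif and its reverse complement,
    both read along the one strand stored in seq."""
    seq = seq.upper()
    observed = count_overlapping(seq, motif) + count_overlapping(seq, rc_motif)
    expected = expected_count(seq, motif) + expected_count(seq, rc_motif)
    return observed, expected
```

Calling `observed_and_expected(genome)` on the genome read in as one uppercase string would give the two numbers to compare. A different null model (one that preserves dinucleotide frequencies, say) would give a different expectation.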
The postdoc could do a thorough test of this using R, but he's busy with the final polishing of his uptake-motif paper (to be submitted by Monday, we hope). So I just did a quick test using Word.
I had used Word's Find/Replace function to count the GCGGs and CCGCs, so I did it again, this time highlighting the occurrences. I copied the sequences around the first 30 GCGGs in the genome sequence, and around 30 CCGCs from the middle of the genome, aligned them by hand, and used WebLogo to look for any patterns in the flanking sequences.
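The copy-and-align-by-hand step could be scripted along these lines. This is only a sketch under my own assumptions: the function names are hypothetical, the 14 nt flank is an arbitrary choice, and it simply takes the first hits it meets rather than sampling 30 from the start and 30 from the middle of the genome. CCGC hits are reverse-complemented so that every window reads GCGG at its centre, giving equal-length sequences that a logo tool can take as an alignment.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def aligned_windows(genome, core="GCGG", flank=14, limit=30):
    """Collect up to `limit` windows centred on the core in either
    orientation, all reported with the core reading as `core`."""
    genome = genome.upper()
    rc_core = reverse_complement(core)
    windows = []
    # Skip hits too close to either end to have full flanks.
    for i in range(flank, len(genome) - len(core) - flank + 1):
        site = genome[i:i + len(core)]
        if site != core and site != rc_core:
            continue
        window = genome[i - flank:i + len(core) + flank]
        if site == rc_core:
            # Flip so the core sits in the same orientation and position.
            window = reverse_complement(window)
        windows.append(window)
        if len(windows) == limit:
            break
    return windows
```

Note that this does not set aside the sites already known to be strong matches to the full motif; those would need to be filtered out separately to ask about the remainder.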