One issue that's just come up is how interactions between bases at different positions of the preferred sequence motif will affect what sequences accumulate in the genome.
The top part of the figure below is a drawing of a double helix of DNA, with a specific sequence drawn on it, and below that are two 'sequence logos'. The first one is the pattern derived from the uptake sequences in the genome, and below that is the pattern derived from the sequences that were preferentially taken up by the cells' uptake machinery. The overall difference in height of the two logos isn't significant (they use sequences derived in very different ways), but the differences in the relative heights of the individual positions are. For example, in the genomic logo all of the Gs on the left are about the same height, but in the uptake logo the first G is much smaller than the others.
One issue our paper needs to address is why these two logos are so different.
Both of these logos are derived by considering only how frequent each base (A, G, C or T) is at each position in the set of sequences being analyzed. The analysis doesn't consider which bases occur together in the same sequence. For example, the two sets of sequences in the figure below (made using WebLogo) give the same logo. But the two sets of sequences are different; in the first set (Set 1) we have only strings of six As or six Ts, whereas in the second (Set 2) the As and Ts are often interspersed or in strings of different lengths.
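To make this concrete, here is a minimal Python sketch of the per-position counting that a logo is built from. The two sets below are made-up stand-ins for the sets in the figure, not the actual sequences, and `position_frequencies` is just an illustrative helper name, not part of WebLogo.

```python
from collections import Counter

def position_frequencies(seqs):
    """Return, for each position, the fraction of sequences with each base."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({base: counts.get(base, 0) / len(seqs) for base in "ACGT"})
    return freqs

# Hypothetical Set 1: only runs of six As or six Ts.
set1 = ["AAAAAA"] * 4 + ["TTTTTT"] * 4

# Hypothetical Set 2: As and Ts interspersed, chosen so that every
# position still has half As and half Ts.
set2 = ["ATATAT", "TATATA", "AATTAA", "TTAATT",
        "ATTAAT", "TAATTA", "AAATTT", "TTTAAA"]

# The per-position frequencies (and so the logos) are identical,
# even though the sequences themselves are not.
print(position_frequencies(set1) == position_frequencies(set2))
print(set(set1) == set(set2))
```

The first comparison is true and the second is false: anything computed only from the per-position frequencies cannot tell these two sets apart.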
But one of the reviewers of the version we originally submitted said that we were wrong: "If the consensus in the genome reflects only the incoming DNA and the filtering at the outer membrane (as the authors state) then the two consensus should be similar with or without interaction effects because the genomic consensus is the simple result of the initial consensus." I've thought about this today, and I now think the reviewer is correct.
Let's consider two simple situations for an imaginary uptake machinery whose preferred sequences gave the A&T logo above. In Situation 1, the actual sequences were those in Set 1, and we would conclude that there were strong interaction effects between the positions because the machinery preferred a sequence where six Ts in one strand were basepaired with six As in the other strand. In Situation 2, the actual sequences were those in Set 2, and we would conclude that the uptake machinery preferred a string of six A:T basepairs but didn't care which base was in which strand at any position.
Now let's imagine that species exist with each of these uptake biases, and that each uptake bias is causing its preferred sequences to accumulate in its species' genome (because these sequences come in as part of longer DNA fragments that often replace homologous sequences in the genome by recombination - this is our molecular drive model). In Situation 1 the genome will accumulate strings of six As on one strand paired with six Ts on the other. In Situation 2 the genome will accumulate strings of six A:T pairs in various orders.
Now we sequence the evolved genomes, collecting sets of the overrepresented sequences in each, and make logos of the sequences. Both logos will look like the logo above. To see how the interaction effects in the uptake bias affected the accumulated sequences in the genome, we'd have to do an interaction analysis of the genomic sequences.
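As a rough illustration of what an interaction analysis can pick up that a logo can't, the sketch below computes the mutual information between pairs of positions. This is only one common way to measure dependence between positions; I'm not claiming it is the method used in our analyses. The sequence sets are again made-up stand-ins for the two situations, and `mutual_information` is an illustrative helper.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(seqs, i, j):
    """Mutual information (in bits) between positions i and j of aligned sequences."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Hypothetical genomic sets that give the same logo (see Situations 1 and 2).
set1 = ["AAAAAA"] * 4 + ["TTTTTT"] * 4
set2 = ["ATATAT", "TATATA", "AATTAA", "TTAATT",
        "ATTAAT", "TAATTA", "AAATTT", "TTTAAA"]

for name, seqs in (("Set 1", set1), ("Set 2", set2)):
    for i, j in combinations(range(6), 2):
        print(name, i, j, round(mutual_information(seqs, i, j), 3))
```

In the made-up Set 1, knowing the base at one position tells you the base at every other position, so every pair of positions scores a full bit. In the made-up Set 2, positions 0 and 1 score zero: the base at one says nothing about the base at the other, even though the logo is unchanged.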
Years ago we did an interaction analysis of the genome sequences; you can see the results in the last figure in this post from 2006. It found only weak interactions, and only between adjacent or near-neighbour positions, very different from the interactions the postdoc has identified in the uptake bias. More recently the postdoc applied his interaction analysis to the set of genomic uptake sequences, and he's now repeating it (that's easier than digging through his notes to find what it showed).