I've been looking more at the spacings of the uptake sequences that arise in our model genomes and that are present in the real H. influenzae genome. First, in the panel on the left, is what we see in our simulated genomes. I only have data for simulations done with relatively small recombination fragments because the others take a very long time to run.
The notable thing (which initially surprised me) is that almost all of the uptake sequences in these genomes are at least one fragment length apart. That is, if the fragments that recombined with the genome were 25 bp (top histogram), none of the resulting evolved uptake sequences fall in the 0-10 or 10-20 bins. If the fragments were 500 bp (bottom histogram, not to the same scale), none fall in any bin smaller than 500 bp.
For the 100 bp fragments I've plotted both the spacings between perfect US (middle green histogram) and the spacings between pooled perfect and singly mismatched US (blue histogram below the green one). When the mismatched US are included the distribution is tighter, but there are still very few closer than 500 bp.
This result makes sense to me, because the uptake bias only considers the one best-matched US in a fragment. So a fragment that contains two well-matched US is no better off than a fragment with one, and selection will be very weak on any US that lies within a fragment's length of another US.
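To make that reasoning concrete, here is a minimal Python sketch of a "best match only" score. It is not the simulation's actual code: the function name `best_match_score`, the use of the 9 bp core AAGTGCGGT as the motif, and scoring by simple count of matching bases are all assumptions made for illustration.

```python
# Hypothetical motif and scoring scheme, for illustration only;
# the model's real uptake bias may be defined differently.
CORE = "AAGTGCGGT"

def best_match_score(fragment: str, motif: str = CORE) -> int:
    """Return the highest number of matching bases over all windows of the fragment."""
    width = len(motif)
    best = 0
    for start in range(len(fragment) - width + 1):
        window = fragment[start:start + width]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches)  # only the single best window counts
    return best

filler = "C" * 20
one_us = filler + CORE + filler
two_us = filler + CORE + filler + CORE + filler

# Both fragments get the same score, so the second US adds nothing.
print(best_match_score(one_us), best_match_score(two_us))
```

Under a rule like this, a mutation that damages the second US in `two_us` leaves the fragment's score unchanged, which is the sense in which selection on a nearby US would be weak.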
I talked this result over with my statistics advisor, and we agreed that there's not much point in worrying about whether the distributions are more even than expected for random spacing, because these are so obviously not random spacing.
After getting these results I decided I should take my own look at the spacings of the real uptake sequences in the H. influenzae genome, and these are shown in the graphs in the lower panel. I first checked the spacings of all the uptake sequences that are in the same orientation; the upper graph shows the pooled data for spacings between 'forward' USS and spacings between 'reverse' USS. Let's consider this graph after the one below it.
The lower graph shows the spacings between all of the USSs, regardless of which orientation each is in. These are the data that were previously analyzed and found to be non-random (in a mathematical analysis that's too sophisticated for me to follow).
We don't see the same abrupt drop-off at close spacings that's in the simulated genomes, but the statistics advisor thought the drop-off at close spacings might be significant. Rather than making a theoretical prediction, he suggested just generating randomly spaced positions (the same number of positions as there are real USS, in a sequence the same length as the real genome) and comparing their spacing distributions to the real ones in my graph. I'm asking my co-author to write me a little Perl script to do this (yes, I probably could write it myself, but she can write it faster). I'll use it to generate 10 random pseudo-genomes' worth of spacings, and compare their distributions to that of my real genome.
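For what it's worth, the random-positions control could look something like the sketch below. It is in Python rather than Perl and is not the script mentioned above. `GENOME_LENGTH` and `N_SITES` are round placeholder numbers, not the real genome length or USS count, and treating the genome as circular (so the last site wraps around to the first) is my assumption.

```python
import random

def circular_spacings(positions, genome_length):
    """Distances between neighbouring positions on a circular genome."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(genome_length - ordered[-1] + ordered[0])  # wrap-around gap
    return gaps

def random_spacings(n_sites, genome_length, rng):
    """Spacings for n_sites distinct positions drawn uniformly at random."""
    positions = rng.sample(range(genome_length), n_sites)
    return circular_spacings(positions, genome_length)

def bin_counts(spacings, bin_width, n_bins):
    """Histogram counts; the final extra bin collects everything larger."""
    counts = [0] * (n_bins + 1)
    for s in spacings:
        counts[min(s // bin_width, n_bins)] += 1
    return counts

GENOME_LENGTH = 1_800_000  # placeholder, not the real genome length
N_SITES = 1_500            # placeholder, not the real USS count

rng = random.Random(1)     # fixed seed so the run is repeatable
for replicate in range(10):
    gaps = random_spacings(N_SITES, GENOME_LENGTH, rng)
    print(replicate, bin_counts(gaps, bin_width=100, n_bins=10))
```

Each printed row is one pseudo-genome's histogram of spacings in 100 bp bins, which could be set beside the same binning of the real USS spacings.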
There are good reasons to expect the real distribution to not be random. First, uptake sequences are preferentially found outside of genes. In H. influenzae 30% of the USS are in the 10% of the genome that's non-coding, and this would be expected to cause some clustering, increasing the frequencies of close spacings. Second, USS often occur in closely spaced, oppositely oriented pairs that act as transcriptional terminators. This is also expected to increase the frequencies of close spacings. I can't easily correct for the first effect, but I can correct for the second one by only considering USS that are in the same orientation, and that's what I've done in the upper graph.
There we see a much stronger tendency to avoid close spacings. The numbers aren't really comparable because I'm only considering half as many USS in each orientation, but the overall shapes of the two distributions suggest fewer close spacings in the same-orientation graph.
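The two ways of measuring spacings can be sketched as follows. This is an illustrative Python sketch, not the analysis actually used for the graphs: the input format (a list of `(position, strand)` pairs), the function names, the toy coordinates, and the circular-genome wrap-around are all assumptions.

```python
def circular_spacings(positions, genome_length):
    """Distances between neighbouring positions on a circular genome."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(genome_length - ordered[-1] + ordered[0])  # wrap-around gap
    return gaps

def spacings_by_orientation(sites, genome_length):
    """Return (all-site spacings, pooled same-orientation spacings).

    sites is a list of (position, strand) pairs with strand '+' or '-'.
    """
    all_gaps = circular_spacings([p for p, _ in sites], genome_length)
    same_gaps = []
    for strand in "+-":
        subset = [p for p, s in sites if s == strand]
        if len(subset) > 1:  # a lone site has no same-strand neighbour
            same_gaps.extend(circular_spacings(subset, genome_length))
    return all_gaps, same_gaps

def fraction_below(gaps, threshold):
    """Fraction of spacings shorter than threshold (a proportion, not a count)."""
    return sum(g < threshold for g in gaps) / len(gaps)

# Toy coordinates, invented for illustration: one close opposite-strand pair.
toy_sites = [(100, "+"), (130, "-"), (5000, "+"), (9000, "-")]
all_gaps, same_gaps = spacings_by_orientation(toy_sites, 10_000)
print(all_gaps)   # [30, 4870, 4000, 1100]
print(same_gaps)  # [4900, 5100, 8870, 1130]
```

In the toy data the 30 bp spacing from the oppositely oriented pair disappears once only same-orientation neighbours are considered. Comparing proportions with `fraction_below` sidesteps the difference in raw counts, but each orientation still has about half the site density, so a random control with the same number of sites is probably the fairer comparison.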
So, two things to do: First, predict the spacings of random positions, using the Perl script that isn't written yet. Second, get better data for the simulated evolved genomes, because these were initiated with fake genomes consisting of pure strings of uptake sequences, and the uptake sequences in the resulting evolved genomes had not originated de novo but instead had been preserved from some of the originating ones. I need to redo this with evolved genomes that began as random sequences, but these are taking longer to simulate.
Added later: And a third thing to do: Reread the paper by Treangen et al., who investigated the spacings of the Neisseria DUSs and compared them to the average lengths of recombination tracts.