The co-author needed only about 20 minutes to write the little Perl script I'd asked for, and here are the results I generated with it. The blue bars show the spacing distribution of the real H. influenzae uptake sequence cores, and the red bars show the spacing distribution of the same number of random positions in a sequence of the same length (averaged over 10 replicates).
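For readers who want to try the random comparison themselves, here is a minimal sketch in Python (not the Perl script described above, which isn't shown here). It assumes the sequence is treated as linear, with no wraparound spacing, and that random positions are drawn without replacement; the function names, the bin width and the seed are all placeholders chosen for illustration.

```python
import random
from collections import Counter

def spacings(positions):
    """Distances between consecutive positions (linear, no wraparound)."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def binned(spacing_list, bin_width):
    """Count spacings per bin; each key is the lower edge of its bin."""
    return Counter((s // bin_width) * bin_width for s in spacing_list)

def random_expectation(n_sites, genome_length, bin_width, replicates=10, seed=0):
    """Mean binned spacing counts over several random placements.

    Assumption: positions are sampled uniformly without replacement.
    """
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(replicates):
        positions = rng.sample(range(genome_length), n_sites)
        totals.update(binned(spacings(positions), bin_width))
    return {edge: count / replicates for edge, count in sorted(totals.items())}
```

Running `binned(spacings(real_positions), bin_width)` on the real positions and `random_expectation(len(real_positions), genome_length, bin_width)` with the same bin width would give the two sets of bars to compare.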
So the statistical advisor was correct; the real distributions have far fewer close spacings than expected for a random sequence. This is despite the presence of two forces expected to increase close spacings - USS pairs serving as transcription terminators, and general clustering of USS in non-coding sequences. As hypothesized in yesterday's post, the discrepancy between real and random is stronger when we eliminate the terminator effect by considering only spacings between USS in the same orientation.
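One way to read "spacings between USS in the same orientation" is to split the sites by strand and measure spacings within each strand separately. The sketch below does that in Python; this reading, the `(position, strand)` input format and the function name are assumptions for illustration, not a description of the actual script.

```python
def same_orientation_spacings(sites):
    """Spacings between consecutive sites that share an orientation.

    `sites` is a list of (position, strand) pairs, strand being '+' or '-'.
    Assumption: spacings are measured within each strand separately,
    ignoring any opposite-strand sites that lie in between.
    """
    by_strand = {}
    for position, strand in sites:
        by_strand.setdefault(strand, []).append(position)

    result = []
    for positions in by_strand.values():
        ordered = sorted(positions)
        result.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return sorted(result)
```

An alternative reading would keep only adjacent pairs that happen to share an orientation; that gives a different distribution, so the choice should be stated when the two are compared.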
And the previous mathematical analysis was right: the real spacing distribution is more even than the random one. Not only are there fewer close USS than expected, there are fewer far-apart USS too. (This is more obvious in the upper graph.)
Now I need to modify the Perl script that locates uptake sequences so it will find the Neisseria DUS, and then do the same analysis for the N. meningitidis genome. The previously published analysis of these spacings simply compared the mean spacing of DUS with the mean length of apparent recombination tracts, concluding that the similarity of the two lengths was consistent with the hypothesis that DUS arise or spread by recombination.
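Locating a motif in both orientations might look roughly like the Python sketch below (again, not the Perl script itself). The function names are hypothetical, and the motif is passed in as a parameter; the 10-mer GCCGTCTGAA used in the usage note is the commonly cited Neisseria DUS and is an assumption here, since the sequence is not given above.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(genome, motif):
    """Return sorted (position, strand) pairs for exact motif matches.

    Positions are 0-based start coordinates on the forward strand.
    A '-' site is a match to the motif's reverse complement.
    Note: a palindromic motif would be reported on both strands.
    """
    genome = genome.upper()
    motif = motif.upper()
    sites = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = genome.find(pattern)
        while start != -1:
            sites.append((start, strand))
            # Step forward by one so overlapping matches are not missed.
            start = genome.find(pattern, start + 1)
    return sorted(sites)
```

With the genome loaded as a single string, `find_sites(genome, "GCCGTCTGAA")` would return the positions and orientations needed for the spacing analysis.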