The US-variation manuscript is coming together nicely at last, not so much the writing as the data and the overall organization. I've even figured out how to combine all the odds and ends of cool results that I was afraid would have to be left out (the BLAST analysis of USS variation, the covariation analysis, the DNA uptake data). One important part still to do is the analysis of spacing of uptake sequences in real and simulated genomes.
I wrote about this in a couple of posts last June (Spacing of uptake sequences, Real uptake sequence spacings are not random). I had first checked the spacings of the generic uptake sequences selected by our Perl model of uptake sequence evolution, and found that these were far from random, with uptake sequences very rarely found within one fragment length of each other. This was true for a wide range of fragment lengths. Then I compared the spacings of real uptake sequences in the H. influenzae genome with those expected for random locations, and found that both closely-spaced and far-spaced USSs were underrepresented. This had been previously reported for perfect-consensus USS cores, but I did it with the positions found by the Gibbs motif sampler.
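For concreteness, here is a minimal sketch of what "spacings" means in this kind of analysis: sort the site positions, take the distance between each neighbouring pair, and bin the distances for a histogram. This is not the code used for the earlier posts. The function names, the bin width, and the example positions are all made up for illustration, and the sketch treats the genome as linear (it ignores the wrap-around gap of a circular chromosome).

```python
from collections import Counter


def spacings(positions):
    """Return distances between neighbouring sites, after sorting by position."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]


def histogram(gaps, bin_width):
    """Count gaps per bin; keys are the lower edge of each non-empty bin."""
    counts = Counter(gap // bin_width for gap in gaps)
    return {b * bin_width: counts[b] for b in sorted(counts)}


# Hypothetical site positions (bp), NOT real Gibbs output.
example_positions = [5200, 120, 900, 2500, 2650]

gaps = spacings(example_positions)      # [780, 1600, 150, 2550]
hist = histogram(gaps, bin_width=1000)  # {0: 2, 1000: 1, 2000: 1}
```

The same two functions could be applied to real positions and to randomly placed positions, so that the two histograms can be compared bin by bin.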
What I need to do now:
1. I need to redo the analysis of the generic uptake sequences from the Perl model, because the model has changed slightly. I expect these results to be just about identical to the previous ones.
2. I need to analyze the N. meningitidis genome just as I did the H. influenzae one. Again, this should be done with the positions found by the Gibbs search, not just by counting perfects and one-offs. The only problem is that I need to find the original Gibbs output for the run whose sequences I've been using for various analyses in the manuscript. (I do have the file with all the sequences, but their positions have been deleted.)
The plan is to present the spacing analysis of the real genomes early in the manuscript, with the other Gibbs analyses, and then come back to this with the analysis of the simulated sequences.
Update: I found the missing N. meningitidis Gibbs data and extracted the spacings for both the uptake sequences in the same orientations and the uptake sequences in both orientations. The post-doc taught me how to make histograms in Excel. I found the program that creates the control data (the same-as-US number of random positions in a same-length sequence), and created a random set of both-orientation spacings and of same-orientation spacings. But I forgot that I need to create at least 10 random sets to smooth out the noise. Now I'd better start writing the first part of the text about this.
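The control step described above (the same number of random positions in a same-length sequence, repeated for at least 10 sets) can be sketched roughly as follows. This is an illustrative stand-in, not the program mentioned in the post: `random_control`, its parameters, and the numbers in the usage lines are assumptions, and it again treats the sequence as linear. Averaging the per-bin counts over the sets is one simple way to smooth out the noise of a single random set.

```python
import random


def spacings(positions):
    """Return distances between neighbouring sites, after sorting by position."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]


def random_control(n_sites, genome_length, bin_width, n_sets=10, seed=None):
    """Average spacing histogram over n_sets random placements of n_sites sites.

    Returns a list in which element i is the mean number of spacings that
    fall in the bin [i * bin_width, (i + 1) * bin_width).
    """
    rng = random.Random(seed)
    n_bins = genome_length // bin_width + 1
    totals = [0] * n_bins
    for _ in range(n_sets):
        # Distinct random positions, as many as there are real sites.
        positions = rng.sample(range(genome_length), n_sites)
        for gap in spacings(positions):
            totals[gap // bin_width] += 1
    return [t / n_sets for t in totals]


# Hypothetical numbers for illustration only (not the real genome or site count).
control = random_control(n_sites=50, genome_length=100_000,
                         bin_width=1_000, n_sets=10, seed=1)
```

Each random set contributes `n_sites - 1` spacings, so the averaged bins always sum to that value, which gives a quick sanity check before plotting the control next to the real histogram.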