But let's assume that they do belong in this manuscript, and consider the next question: where would they fit best, and how would they connect with the other parts? They have to go after the first major section, which describes the analyses of the H. influenzae and N. meningitidis genomes with the Gibbs Motif Sampler, because they use the datasets those analyses generated. Should they go before the other major section, which describes the Perl model of uptake sequence evolution, or after it? These two major sections fit well together, because the model makes predictions that can be tested with the Gibbs datasets.
What if the model went first, using the generic uptake sequence? It could then be followed by the Gibbs analyses, applied both to the model's output and to the real genomes. And then by more versions of the model, using the position-weight matrices generated from the real genomes. We could then do the other analyses of the genomic data....
Here's an attempt at an outline using this order:
Emphasize the goal of understanding the cause of uptake sequences. Summarize what we know of their properties and the evidence that they arise by point mutation and spread by transformation more efficiently than other sequences because they are preferentially taken up (cartoon figure contrasting drive and beneficial-variation models). We have developed a model of this evolutionary process, which we describe and test below.
- The model (how it works, with a cartoon figure).
- We run the simulations to equilibrium (evaluated as score or US count), using a 10 bp generic position-weight matrix.
- Characterization of model: Effects of (i) genome length (each cycle takes longer); (ii) mutation rate (getting to equilibrium takes more cycles); and (iii) recombination rate (determines equilibrium score, below saturation).
- Properties of equilibrium sequences (i): Proportions of perfect and mismatched occurrences. Use Gibbs to find them? Compare with direct counts of perfects, one-offs and two-offs? Explain why Gibbs is, in principle, better?
- Properties of equilibrium sequences (ii): Spacings between occurrences (found by Gibbs). These are more even than random, as are the spacings between real uptake sequences. The spacing depends on the length of the recombining fragments.
- Use Gibbs to reanalyze the Perl output sequences.
- Use Gibbs to reanalyze the genome sequences.
- Repeat the key Perl runs using the position-weight matrices from the Gibbs analyses of real genomes.
- Compare what we find.
- Do other reanalyses and new analyses of the Gibbs output of the genome analyses: variation in subsets of genome sequences (no interesting results, but worthwhile anyway). This would include the covariation analysis and the BLAST analysis of within-species sequence variation.
A couple of hours later: I don't think it is. For now I'm going to treat the manuscript as two distinct parts. First, the Gibbs analysis of the genome and sub-genome sequences, and analyses using the occurrences these identify, including the covariation and within-species variation analyses. Second, the Perl simulation of uptake sequence evolution, and comparison of its results with the genome analyses.
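For the direct counts mentioned in the outline (perfects, one-offs and two-offs) and the spacings between occurrences, here is a minimal Python sketch of what such a tally might look like. It is not the Perl model or the Gibbs analysis. The function names `count_by_mismatches` and `spacings` are hypothetical, the motif is whatever core sequence is passed in, and only one strand is scanned, so a real analysis would also need to handle the reverse complement.

```python
from collections import Counter


def count_by_mismatches(genome: str, motif: str, max_mismatches: int = 2) -> Counter:
    """Tally windows of `genome` by their number of mismatches to `motif`.

    Only windows with at most `max_mismatches` mismatches are counted.
    Key 0 holds perfect matches, key 1 one-offs, key 2 two-offs.
    Assumption: single-strand scan; overlapping windows are all counted.
    """
    counts = Counter()
    k = len(motif)
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            counts[mismatches] += 1
    return counts


def spacings(positions):
    """Return the distances between consecutive occurrence positions."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]
```

A tally like this treats every position of the motif as equally important, which is one reason the Gibbs position-weight approach is, in principle, better.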