I've spend the last 48 hours watching simulations of uptake sequence evolution run (very slowly) and fiddling with code and queue-submission files and settings, but now it all seems to be falling into place.
First the former post-doc said that she thought the int was indeed an error, and that it probably arose when the coding-expert undergrad copied the code from another probabilistic step to the recombination step. She also made the excellent suggestion that we use the additive matrix in the US-variation paper.
Now that the erroneous int has been eliminated from the simulation code, runs with additive scoring do indeed work. So do runs with the less-extreme tenfold bias, and runs using the Gibbs position weight matrices derived from the H. influenzae and N. meningitidis genomes. None of this give as strong accumulation as the 100-fold bias matrix I have been using, but that's OK.
Testing the additive scoring led to several other improvements. I increased the resolution of the recombination decision by introducing a lower-cutoff for the random number range, as I mentioned in the previous post. I also found that when using the additive scoring we no longer needed to have the bias gradually decrease during each cycle, but could just leave it fixed. And I rediscovered a problem with scoring the genome additively. [Aside: scoring is done in two situations - first to decide how well the different positions in an individual fragment correspond to the matrix, so a decision about recombination can be made, and second to track the accumulation of preferred sequences in the whole genome, to follow the progress of the simulation.] When individual fragments are scored, only the highest position-score is used, and the scores of all other positions are discarded. But when the genome is scored we add up the scores of all the positions in the genome. With multiplicative scoring the low scores are many orders of magnitude lower than the good scores, so including them in the sum has little effect. But with additive scoring even really lousy matches to the matrix have significant scores, and the sum of these swamps out the effect of the rarer good scores. The solution to this problem is to only use additive scoring for the individual fragments, and to always score the whole genomes by using the matrix multiplicatively.
I also spent some time relearning how to run the simulation on the computer cluster in Nova Scotia that a colleague kindly gave me access to, and modifying the program to run properly in this different situation (just trivial fiddly stuff). I kept queueing runs on this system, thinking they were running OK, and then looking at the error file )"just in case" and finding that they had aborted at the start because forgot to specify a file, or specified it wrong, or specified it in the wrong order. But now I have about 15 runs going overnight!
My new plan for this part of the US-variation manuscript (it's most of the Results) is to start with the additively scored matrix, both because that's how position-weight matrices are commonly used in other situations and because this means we don't have to explain about reducing the bias during the cycles. I'll start with the 100-fold matrix, and show how we detect equilibrium, and that genome length doesn't affect the outcome (score per kb). Then I'll explain about mutation rates (genomic and fragment-specific). The preliminary runs I did today indicated that the mutation rate of the fragments doesn't affect the equilibrium score, which I continue to find surprising. I had suspected that this was somehow due to using the bias-reduction feature with the multiplicative scoring, but now I see it with additive scoring and fixed bias. But I have a more detailed set of runs going overnight which should make the situation clearer. Then I'll consider how the amount of recombination (amount of genome recombined in each cycle) affects the outcome, and the effect of fragment length on number and spacing of uptake sequences
Then I'll consider the effects of different matrices, first the ten-fold matrix I tried today, and then the multiplicatively scored 100-fold and ten-fold matrices, and then the Gibbs matrices. This last step links the model back to the Gibbs analyses that the Results starts with.
Drones, Silicon Valley and biology: The future isn't here yet
1 hour ago in The Curious Wavefunction