Yes, the Perl model has progressed to the point where it's now a research tool. But I now need to focus my use of it, to get useful data rather than just noodling around to see what happens.
One remaining uncertainty is the decision that a simulation has reached an equilibrium, where forces increasing the frequency of USS-like sequences are balanced by forces decreasing it. So far I've been running simulations for a specified number of cycles instead of 'to equilibrium', because I'm not confident that they will indeed correctly identify equilibrium conditions. Now I guess I should take the settings I used for runs that did reach what I consider to be equilibrium, and rerun them 'to equilibrium' instead of to the specified number of cycles.
A problem is that the runs still take quite a long time. For example, last night I started a run using a 50kb genome, and taking up 100bp fragments. Although it was close to equilibrium after about 6 hours, the equilibrium criterion hasn't been met yet (because this criterion is quite conservative). Maybe we should use a less-conservative criterion, at least for now, because we're really mainly interested in order-of-magnitude differences at this initial stage.
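One way a less-conservative criterion might look, sketched in Python rather than the model's Perl. This is not the criterion the model currently uses; the function name, the `window` and `tolerance` parameters, and the idea of comparing two consecutive windows of scores are all assumptions made for illustration.

```python
from collections import deque

def make_equilibrium_check(window=50, tolerance=0.05):
    """Return a function that takes one genome score per cycle and reports
    whether the run looks like it has levelled off.

    Hypothetical criterion: the mean score over the latest `window` cycles
    differs from the mean over the preceding `window` cycles by no more
    than `tolerance` (as a fraction). Both values are illustrative.
    """
    history = deque(maxlen=2 * window)

    def at_equilibrium(score):
        history.append(score)
        if len(history) < 2 * window:
            return False  # not enough cycles observed yet
        scores = list(history)
        older = sum(scores[:window]) / window
        recent = sum(scores[window:]) / window
        return abs(recent - older) <= tolerance * older

    return at_equilibrium
```

If only order-of-magnitude differences matter, a fairly loose `tolerance` would probably do, at the cost of sometimes stopping a run that is still creeping upward.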
One useful pair of runs I've done examined the effect of having a non-zero genome mutation rate. This is of course the only realistic treatment, but in the 'testing' runs we've had the genome mutation rate set to zero, with mutations occurring only in the fragments being considered for recombination, because otherwise USS-like sequences didn't accumulate. Both of the new runs considered a 20kb genome and 200bp fragments, with a fragment mutation rate of 0.05 per cycle. One of these runs had a genome mutation rate of zero; the equilibrium genome score was 3 x 10^9, 100-fold higher than the starting score. The other run had a genome mutation rate of 0.001; its final score was only 4 x 10^8.
This isn't surprising because mutations are much more likely to make good USS-matches worse than better, and this degeneration is only countered by selection for match-improving mutations (and against match-worsening mutations) in those fragments that recombine. So targeting mutation to fragments that might recombine increases the ratio of selected match-improving mutations to unselected mutations. Another way to look at it is that the whole genome gets mutated at its rate every generation, and none of these mutations is selected for or against unless it subsequently changes due to a new mutation arising in a fragment.
It may be that setting lower mutation rates for genomes than for fragments is equivalent to assuming that, on average, fragments are from genomes of close relatives separated by R generations from the genome under consideration (where R is the ratio of fragment rate to genome rate). This is probably a reasonable assumption.
Another issue is how much of the genome can be replaced by recombination each cycle. I've been keeping this down to about 10%, but any value can be justified by having each 'cycle' represent more or fewer generations. So if we want a cycle to represent 100 generations, we should have the amount of recombination equivalent to 100 times the amount of recombination we might expect in a single generation. As we don't even know what this number should be, I guess there's no reason not to have 100% of the genome replaced each cycle.
I don't think there's any benefit to having more than 100% replaced, as each additional recombination event would undo the effect of a previous one. Hmm, could this be viewed as a variant of the genome-coverage problems that arise in planning shotgun-sequencing projects? They want to maximize the genome coverage while minimizing the amount of sequencing they do. Here we want to maximize the amount of the genome replaced while minimizing the amount of wasteful multiple replacements. The difference is that, for genome projects, it's important to cover almost all the genome - covering 99% is MUCH better than covering only 90%, so it's worth doing a lot more sequencing. For us, the emphasis is more on avoiding wasteful recombination, and the difference between replacing 99% and replacing 90% is worth only 9% more fragment screening. I guesstimate that the best compromise will be replacing about 50-75% of the genome in each cycle.
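The shotgun-coverage analogy can be made concrete with the standard Poisson approximation from sequencing-coverage calculations. This is a sketch in Python, not part of the Perl model, and it assumes that recombining fragments land at uniformly random positions and are short relative to the genome; the function names and the quantity `c` (total recombined length divided by genome length) are introduced here for illustration.

```python
import math

def fraction_replaced(c):
    """Expected fraction of the genome replaced at least once.

    c = total length of recombined fragments / genome length.
    Assumes uniformly random fragment positions (an assumption).
    """
    return 1.0 - math.exp(-c)

def recombination_needed(target):
    """Amount of recombination c needed to replace `target` of the genome."""
    return -math.log(1.0 - target)

def wasted_fraction(c):
    """Share of recombined bases that fell on already-replaced positions."""
    return 1.0 - fraction_replaced(c) / c

for target in (0.50, 0.75, 0.90, 0.99):
    c = recombination_needed(target)
    print(f"replace {target:.0%}: c = {c:.2f} genome-equivalents, "
          f"wasted = {wasted_fraction(c):.0%}")
```

Under these assumptions, replacing 99% of the genome takes twice as much recombination as replacing 90% (ln 100 versus ln 10), while 50% needs about 0.69 genome-equivalents and 75% about 1.39, so the redundant share climbs quickly as the target approaches 100%.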
I've raised this issue before (point 2 in this post): One problem is that, as the genome evolves to have more USS-like sequences, the number of fragments that pass the recombination criterion increases. So the above discussion applies mainly at equilibrium, when the genome will have the most USS-like sequences. We control the number of fragments that recombine by specifying the number of fragments to be considered (F) and by explicitly setting a limit (M) (e.g. max of 10 fragments can recombine each cycle). Early in the run F needs to be high, or it will be many cycles before a significant number of fragments has recombined. But a high F late in the run has the simulation wasting time scoring many fragments that will never get a chance to recombine. At present I've been setting F to be 5 or 10 times larger than M, but maybe I should try reducing F and increasing M.
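The trade-off between F and M can be seen in a toy version of one cycle, sketched in Python rather than the model's Perl. It assumes that all F fragments are scored before the limit M is applied, and it replaces the real USS scoring with a fixed `pass_probability`; that parameter and the function name `run_cycle` are hypothetical.

```python
import random

def run_cycle(F, M, pass_probability, rng):
    """Score F fragments and let at most M of the passing ones recombine.

    `pass_probability` is a stand-in for the real scoring step (the chance
    that a fragment passes the recombination criterion); treating it as a
    fixed number is a simplifying assumption.
    Returns (recombined, passed_but_discarded).
    """
    passed = sum(rng.random() < pass_probability for _ in range(F))
    recombined = min(passed, M)
    return recombined, passed - recombined

rng = random.Random(1)
# Low, middle and high pass rates, standing in for early, mid and late run.
for p in (0.01, 0.1, 0.5):
    recombined, discarded = run_cycle(F=100, M=10, pass_probability=p, rng=rng)
    print(f"p = {p}: recombined = {recombined}, discarded = {discarded}")
```

With a low pass rate the limit M is rarely reached, so F sets the pace; with a high pass rate most passing fragments are discarded, which is the wasted scoring effort described above.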