The simplest version of our new Perl model of USS evolution has progressed to the state where it runs correctly. This afternoon I've been doing lots of runs, both with a 'positive control' version that replaces a random genome position with a single perfect USS core in every cycle, and with a test version that mutates random fragments, scores them for goodness of match to the USS motif, and then recombines the best-matched one back into the genome. Tomorrow the undergrad and I are going to create a modified version, to try a different way of having the fragments' scores determine whether they recombine with the genome.
With the positive-control version I've been examining the effect of changing the genomic mutation rate. If the mutation rate is zero, the only limit to USS accumulation is the way insertion of new USSs disrupts existing ones. (This happens only because each 10bp fragment is changed to a perfect USS before recombination, and so bears no relation to the original sequence at that position.) Not surprisingly, more USSs are present at equilibrium when the mutation rate is zero, and fewer when the mutation rate is 0.01 or 0.05 changes per position per cycle. The rate of increase in genome score is largely independent of the mutation rate. Because only a single USS is inserted per cycle, the number of cycles to equilibrium depends on the length of the genome.
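For anyone who wants to play with the same idea, here is a minimal sketch of a positive-control run. It is not our actual program: the 10bp motif string is a stand-in, the sub names are made up for illustration, and mutating the genome before the insertion in each cycle is an assumption. It counts only perfect copies of the motif at the end of the run.

```perl
use strict;
use warnings;

my $motif = 'AAGTGCGGTA';   # hypothetical 10bp stand-in for the USS motif
my @bases = qw(A C G T);

# Build a random starting genome of the requested length.
sub random_sequence {
    my ($len) = @_;
    return join '', map { $bases[int rand 4] } 1 .. $len;
}

# Change each position to a different base with probability $rate.
sub mutate {
    my ($seq, $rate) = @_;
    for my $i (0 .. length($seq) - 1) {
        next unless rand() < $rate;
        my $old     = substr($seq, $i, 1);
        my @choices = grep { $_ ne $old } @bases;
        substr($seq, $i, 1) = $choices[int rand @choices];
    }
    return $seq;
}

# Overwrite a random position with a perfect copy of the motif.
sub insert_perfect_uss {
    my ($genome, $motif) = @_;
    my $pos = int rand(length($genome) - length($motif) + 1);
    substr($genome, $pos, length($motif)) = $motif;
    return $genome;
}

# Count perfect (possibly overlapping) copies of the motif.
sub count_perfect {
    my ($genome, $motif) = @_;
    my ($count, $pos) = (0, 0);
    while (($pos = index($genome, $motif, $pos)) != -1) {
        $count++;
        $pos++;
    }
    return $count;
}

# One run: mutate the genome, then insert one perfect USS, every cycle.
sub run_control {
    my ($length, $rate, $cycles) = @_;
    my $genome = random_sequence($length);
    for (1 .. $cycles) {
        $genome = mutate($genome, $rate);
        $genome = insert_perfect_uss($genome, $motif);
    }
    return count_perfect($genome, $motif);
}

for my $rate (0, 0.01, 0.05) {
    printf "rate %.2f: %d perfect USSs after 500 cycles\n",
        $rate, run_control(1000, $rate, 500);
}
```

The genome length (1000) and cycle count (500) are arbitrary choices to keep the run quick; the counts will differ from run to run because everything is random.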
Wait - good idea! I think we need to add some code to give us the frequency of each sliding-window score at the end of the run. This would let us make a histogram of how many USS-like sequences the genome has at the beginning of the run, and how many it has accumulated at the end. Basically, as the sliding window scores the match to the motif at each position, it should record the score in a tally (number of sites scoring 0, number scoring 1, number scoring 2, ... number scoring 10). I could write some inefficient code to do this (barring about a thousand syntax errors - I really should go back and reread the first few chapters of Beginning Perl for Bioinformatics), but this sounds like something the undergrad might have learned an efficient way to do.
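A rough sketch of what such a tally might look like is below. It assumes the score is simply the number of positions in a 10bp window that match the motif, and it scans only one strand; the motif string and the sub names are placeholders, not what our program actually uses.

```perl
use strict;
use warnings;

my $motif = 'AAGTGCGGTA';   # hypothetical 10bp stand-in for the USS motif

# Number of positions at which a window agrees with the motif.
sub window_score {
    my ($window, $motif) = @_;
    my $score = 0;
    for my $i (0 .. length($motif) - 1) {
        $score++ if substr($window, $i, 1) eq substr($motif, $i, 1);
    }
    return $score;
}

# Slide a motif-length window along the genome and count how many
# windows get each possible score (element 0 .. element 10).
sub tally_scores {
    my ($genome, $motif) = @_;
    my $w     = length($motif);
    my @tally = (0) x ($w + 1);
    for my $pos (0 .. length($genome) - $w) {
        my $score = window_score(substr($genome, $pos, $w), $motif);
        $tally[$score]++;
    }
    return @tally;
}

# Toy genome: one perfect site flanked by Cs.
my $genome = 'CCCCC' . $motif . 'CCCCC';
my @tally  = tally_scores($genome, $motif);
for my $score (0 .. $#tally) {
    printf "%2d matches: %d windows\n", $score, $tally[$score];
}
```

Calling `tally_scores` once on the starting genome and once on the final genome would give the two histograms to compare. An array indexed by score is enough here because the scores are small integers.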
Did I learn anything else from the positive control runs? If the genome is very short the program runs very fast but the scores are noisy (no surprise there). I learned that I have no practical insight into how a sequence's USS 'score' reflects the quality of its matches to the motif - that's why we need the tally. I played around with the 'threshold' we use as a cutoff for insignificant matches to the USS consensus, but I think we can't really understand what this accomplishes until we have the score tally.
I also did some runs with the test version (not the positive control). The results of these mostly served to reinforce the importance of the genomic mutations. Under the present recombination system, USSs can't accumulate in the genome because they mutate away faster than they're improved by recombination. I tried turning off the mutation of the genome, so that mutation only happens to fragments that are about to be scored for possible uptake. Even with this 'cheating', the genome's USS score crept up slowly and then plateaued at what looks (without the tally) to be a genome with only weak USS sites.
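To make the test-version cycle concrete, here is a hedged sketch of one way it could be written. Several things are assumptions for illustration only: fragments are the same length as the motif, the score is the number of matching positions, the best fragment goes back to the position it came from, and the motif string, sub names, rates and fragment count are all made up. Setting the genome mutation rate to zero reproduces the 'cheating' setup, where only the fragments mutate.

```perl
use strict;
use warnings;

my $motif = 'AAGTGCGGTA';   # hypothetical 10bp stand-in for the USS motif
my @bases = qw(A C G T);

sub random_sequence {
    my ($len) = @_;
    return join '', map { $bases[int rand 4] } 1 .. $len;
}

# Change each position to a different base with probability $rate.
sub mutate {
    my ($seq, $rate) = @_;
    for my $i (0 .. length($seq) - 1) {
        next unless rand() < $rate;
        my $old     = substr($seq, $i, 1);
        my @choices = grep { $_ ne $old } @bases;
        substr($seq, $i, 1) = $choices[int rand @choices];
    }
    return $seq;
}

# Number of positions at which a fragment agrees with the motif.
sub match_score {
    my ($frag, $motif) = @_;
    my $score = 0;
    for my $i (0 .. length($motif) - 1) {
        $score++ if substr($frag, $i, 1) eq substr($motif, $i, 1);
    }
    return $score;
}

# One cycle: mutate the genome, copy out random fragments, mutate and
# score each copy, then write the best-scoring copy back in place.
sub test_cycle {
    my ($genome, $genome_rate, $fragment_rate, $n_fragments) = @_;
    $genome = mutate($genome, $genome_rate);
    my $w = length($motif);
    my ($best_score, $best_pos, $best_frag) = (-1, undef, undef);
    for (1 .. $n_fragments) {
        my $pos   = int rand(length($genome) - $w + 1);
        my $frag  = mutate(substr($genome, $pos, $w), $fragment_rate);
        my $score = match_score($frag, $motif);
        if ($score > $best_score) {
            ($best_score, $best_pos, $best_frag) = ($score, $pos, $frag);
        }
    }
    return ($genome, $best_score) unless defined $best_pos;
    substr($genome, $best_pos, $w) = $best_frag;
    return ($genome, $best_score);
}

# Demo with genome mutation switched off (the 'cheating' setup).
my $genome = random_sequence(500);
my $best;
for (1 .. 200) {
    ($genome, $best) = test_cycle($genome, 0, 0.05, 20);
}
print "best fragment score in the final cycle: $best\n";
```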