Our Perl model of USS evolution in a genome runs well, but USS-like sequences accumulate only slightly. I've been playing around with various factors that might be responsible but haven't really gotten anywhere. I need to get these factors clear in my mind, so it's time for a blog post about them.
What the model does (before I started fiddling with it on Friday): (I should clarify that this version (USSv5.pl) doesn't yet have all the formatting bells and whistles that would make it graceful to use, but it does have the basic functionality we think we want.) It first creates a random 'genome' sequence of specified length and base composition, whose evolution the model is going to simulate. In each evolutionary cycle it first takes random segments of this genome, mutates them according to a specified mutation rate, and scores them for sequences similar to a sequence motif specified in the form of a matrix (see figure in this post for examples). This version of the model uses a new multiplicative scoring procedure rather than our original additive procedure. Each segment's sequence has a chance to replace its original sequence, with probability proportional to its score expressed as a fraction of some preset maximum score. The modified genome is scored by the same procedure used for the segments, and then becomes the starting sequence for the next cycle. (We had intended that the genome would undergo mutation in each cycle but this step has been bypassed because it was causing the USS-like motifs to degenerate faster than they accumulated.)
We first tested the additive and multiplicative scoring procedures to see how much difference a perfect USS sequence made to the score of otherwise random-sequence genomes. As we already knew, the additive procedure gave sequences with USS scores that were at best only about 1% higher than sequences without USS - the precise difference depends on the length of the sequence (we tested 100, 1000 and 10000 bp) and on the numbers in the scoring matrix .
The scores obtained with the multiplicative procedure were far more sensitive to the presence of a USS. For the 100bp test sequences, scores with USS were from 2-fold to 700-fold higher than for the same sequence without a USS, depending on how much higher USS-consensus bases scored than non-consensus bases. The lowest sensitivities were seen when this ratio was 2 for all positions, with higher sensitivities when the ratios were from 5-fold to 50-fold.
So this looked very promising - with such a sensitive scoring system I expected USS-like sequences to accumulate rapidly and to a high frequency. But this didn't happen. The genome scores did go up dramatically, but this turned out to be due to the much more sensitive scoring system acting on only a few base changes.
I played around with different genome sizes and fragment sizes and numbers and mutation rates and matrix values, but nothing seemed to make much difference. "Seemed" is the right word here, as I didn't keep careful records. I created or reinstated various reporting steps, so I could get a better idea of what was going on. I also replaced the preset maximum segment score with a variable score (= 5% of the previous cycle's genome score), so that the strength of the simulated bias would increase ass USS-like sequences accumulated in the genome.
But USS-like sequences didn't accumulate much at all, and I don't know why. There could be a problem with the code, but none of the informal checks I've done has set off any alarm bells. There could instead be a fundamental problem with the design of the simulation, so that what we are telling the simulation to do cannot lead to the outcome we expect. Or perhaps only a very small portion of 'parameter-space' allows USS-like sequences to accumulate.
The post-doc and I came up with two approaches. One is to meticulously investigate the effects of the various factors we can manipulate, keeping everything else as simple and constant as possible. The other is to use our brains to think through what should be happening. While we're doing this our programming assistant will be adding a few more sophisticated tricks to the program, to make our analysis easier.
I'll end with a list of factors we need to method
2016 Nobel Prize predictions
27 minutes ago in The Curious Wavefunction