Speedy simulations

Our programming assistant has finished his work for us.  At least, he's finished the period of his employment with us, but he's gotten the research bug and wants to stay involved to see how the program he developed for us works out.

At present it works great - MUCH faster than before.  One of the reasons is that it no longer scores the genome in every cycle.  I'm doing a run now with a big genome (100 kb), and I can see it pause at every 100th cycle while it takes the time to score the genome.  This is a sensible improvement, as the genome score isn't needed for the progress of the simulation - it just lets us (the users) monitor what's happening, and provides the information used to decide whether the run has reached equilibrium.
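The change is basically just a modulus test wrapped around the expensive scoring call. A minimal sketch of the idea (the variable and subroutine names below are my own stand-ins, not the ones in the actual script):

    # Score the genome only every 100th cycle (illustrative names only;
    # run_one_cycle() and score_genome() stand in for the real subroutines).
    my $score_interval = 100;
    for my $cycle (1 .. $total_cycles) {
        run_one_cycle($genome);                    # mutation + recombination
        next unless $cycle % $score_interval == 0;
        my $score = score_genome($genome);         # the slow sliding-window scan
        print "Cycle $cycle: genome score = $score\n";
    }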

I'm using this run to test whether the final accumulated USS motif precisely matches the biases specified in the recombination decisions (i.e. in the matrix used to score each fragment), by using a matrix where one position has substantially weaker bias than the others.  If the final motif matches the matrix bias, this position should show a weaker consensus in the accumulated uptake sequences.  
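For concreteness, this is the kind of matrix I mean - the values and the number of positions below are made up for illustration, not taken from the real scoring matrix; the second position is the deliberately weak one:

    # Illustrative position-weight matrix (made-up values): each position
    # strongly favours one base except the second, whose bias is weak.
    my @matrix = (
        { A => 1.0, C => 0.1, G => 0.1, T => 0.1 },   # strong bias for A
        { A => 0.4, C => 0.2, G => 0.2, T => 0.2 },   # weak bias for A
        { A => 0.1, C => 0.1, G => 1.0, T => 0.1 },   # strong bias for G
        { A => 0.1, C => 0.1, G => 0.1, T => 1.0 },   # strong bias for T
    );

    # Multiplicative score of one window of sequence against the matrix.
    sub score_window {
        my ($window) = @_;
        my $score = 1;
        $score *= $matrix[$_]{ substr($window, $_, 1) } for 0 .. $#matrix;
        return $score;
    }

If the accumulated motif really does track the matrix, the consensus at that weakly biased position should end up noticeably weaker than at the others.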

Of course to test this I may have to recall how to run the Gibbs Motif Sampler on the Westgrid server.  The alternative is to chop my 100 kb output genome into 10 pieces and run each separately on the Gibbs web server, which has a sequence limit of 10 kb.
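Chopping the output genome into pieces is at least trivial - something like this would do it (the file names are invented; they'd need adjusting to the real output file):

    # Split a FASTA genome into 10 kb pieces for the Gibbs web server.
    use strict;
    use warnings;

    my $piece_len = 10_000;
    my $seq = '';
    open my $in, '<', 'output_genome.fasta' or die "Can't open input: $!";
    while (<$in>) {
        chomp;
        next if /^>/;                    # skip the FASTA header line
        $seq .= $_;
    }
    close $in;

    my $n = 0;
    for (my $start = 0; $start < length $seq; $start += $piece_len) {
        $n++;
        open my $out, '>', "piece_$n.fasta" or die "Can't open output: $!";
        print $out ">piece_$n\n", substr($seq, $start, $piece_len), "\n";
        close $out;
    }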

Sex and Recombination in Minneapolis?

The meeting on Sex and Recombination that was to be held in Iowa City next week has been canceled because of severe flooding! (And just when I was getting my talk pulled into shape too...)

I suspect I'm not the only one who planned to combine that meeting with the big Evolution meetings in Minneapolis, June 20-24. I'm flying into Minneapolis on June 15. If you'd like to get together with me or other sex-refugees, post a comment below and we'll see what we can organize.

Preparing talks

I've been struggling to pull together ideas and data for the two talks I'm giving (next week and the week after) at evolution meetings.  Yesterday was my turn to give lab meeting, and I used it to get help from the post-docs.  My draft talks were a mess, but the post-docs had lots of excellent suggestions.  Both talks will use the same introduction to the evolutionary significance of bacterial DNA uptake, and will then diverge. 

On the USS-evolution simulation front, the model is running nicely and I'm using it to quickly collect data for my first talk (20 minutes, for the Sex and Recombination meeting).  But I have to trade off statistical significance against run time, as the large genomes needed to get lots of sequence data take a long time to simulate.  

On the proteome-evolution front, my bioinformatics collaborator just sent me the last of the data, including a control set needed for comparison with the analysis of how divergent USS-encoded peptides are from their homologs.

Creeping up to equilibrium?

One other issue about equilibrium:

In our previous (unrealistic) model we found that USS initially accumulated very quickly, as singly and doubly mismatched sites were converted to perfectly matched sites. But this happened at the expense of depleting the genome of those mismatched sites, and further accumulation of perfect sites required waiting a long time for mutation of worse sites to regenerate the singly and doubly mismatched ones, which would then slowly allow further increase in the number of perfect matches. So achieving true equilibrium took a long time.

I expect this phenomenon to also apply in this new model. So I'm not at all confident that an early leveling-off of the rate of increase indicates closeness to the true equilibrium.

In the graphs to the left, the upper simulation probably hasn't reached equilibrium after 90,000 cycles (because the blue points indicating genome scores are still increasing), but the lower one has (because the blue points seem to be scattered around a stable mean).

I'm not sure why the lower run reached equilibrium so much faster than the upper one. Several factors differed - this is why I need to be more systematic. My excuse is that it's easier to motivate a systematic approach when individual tests are fast to do, and there are so many variables to test that I hate to spend a lot of time on just one. But it's time to treat this like real science.

Simulating USS evolution

Yes, the Perl model has progressed to the point where it's now a research tool. But I now need to focus my use of it, to get useful data rather than just noodling around to see what happens.

One remaining uncertainty is how to decide that a simulation has reached an equilibrium, where forces increasing the frequency of USS-like sequences are balanced by forces decreasing it. So far I've been running simulations for a specified number of cycles instead of 'to equilibrium', because I'm not confident that the runs will correctly identify equilibrium conditions. Now I guess I should take the settings I used for runs that did reach what I consider to be equilibrium, and rerun them 'to equilibrium' instead of to a specified number of cycles.

A problem is that the runs still take quite a long time. For example, last night I started a run using a 50kb genome, and taking up 100bp fragments. Although it was close to equilibrium after about 6 hours, the equilibrium criterion hasn't been met yet (because this criterion is quite conservative). Maybe we should use a less-conservative criterion, at least for now, because we're really mainly interested in order-of-magnitude differences at this initial stage.

One useful pair of runs I've done examined the effect of having a non-zero genome mutation rate. This is of course the only realistic treatment, but in the 'testing' runs we've had the genome mutation rate set to zero, with mutations occurring only in the fragments being considered for recombination, because otherwise USS-like sequences didn't accumulate. Both of the new runs used a 20kb genome and 200bp fragments, with a fragment mutation rate of 0.05 per cycle. One of these runs had a genome mutation rate of zero; its equilibrium genome score was 3 x 10^9, 100-fold higher than the starting score. The other run had a genome mutation rate of 0.001; its final score was only 4 x 10^8.

This isn't surprising because mutations are much more likely to make good USS-matches worse than better, and this degeneration is only countered by selection for match-improving mutations (and against match-worsening mutations) in those fragments that recombine. So targeting mutation to fragments that might recombine increases the ratio of selected match-improving mutations to unselected mutations. Another way to look at it is that the whole genome gets mutated at its rate every generation, and none of these mutations is selected for or against unless it subsequently changes due to a new mutation arising in a fragment.

It may be that setting lower mutation rates for genomes than for fragments is equivalent to assuming that, on average, fragments are from genomes of close relatives separated by R generations from the genome under consideration (where R is the ratio of fragment rate to genome rate). This is probably a reasonable assumption.
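To put a number on it: with the rates used in the runs above, R = 0.05/0.001 = 50, so under this interpretation the incoming fragments behave as if they came from relatives about 50 generations removed.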

Another issue is how much of the genome can be replaced by recombination each cycle. I've been keeping this down to about 10%, but any value can be justified by having each 'cycle' represent more or fewer generations. So if we want a cycle to represent 100 generations, we should have the amount of recombination equivalent to 100 times the amount of recombination we might expect in a single generation. As we don't even know what this number should be, I guess there's no reason not to have 100% of the genome replaced each cycle.
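For example (with a made-up number, since we don't know the real one): if transformation replaced 0.1% of the genome per generation, a cycle standing for 100 generations should replace about 10%, and a cycle standing for 1000 generations essentially the whole genome.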

I don't think there's any benefit to having more than 100% replaced, as each additional recombination event would undo the effect of a previous one. Hmm, could this be viewed as a variant of the genome-coverage problems that arise in planning shotgun-sequencing projects? They want to maximize the genome coverage while minimizing the amount of sequencing they do. Here we want to maximize the amount of the genome replaced while minimizing the amount of wasteful multiple replacements. The difference is that, for genome projects, it's important to cover almost all the genome - covering 99% is MUCH better than covering only 90%, so it's worth doing a lot more sequencing. For us, the emphasis is more on avoiding wasteful recombination, and the difference between replacing 99% and replacing 90% is worth only 9% more fragment screening. I guesstimate that the best compromise will be replacing about 50-75% of the genome in each cycle.
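One back-of-the-envelope way to see this (assuming the recombining fragments land at independent random positions): if the fragments taken up add to c genome-equivalents of sequence, the expected fraction of the genome replaced at least once is about 1 - e^-c, the same Poisson argument used for shotgun coverage. Replacing 50% of the genome then needs c of about 0.7, 75% needs about 1.4, 90% needs about 2.3, and 99% needs about 4.6 - so each step past roughly 75% replacement about doubles the fragment screening for a progressively smaller gain.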

I've raised this issue before (point 2 in this post): One problem is that, as the genome evolves to have more USS-like sequences, the number of fragments that pass the recombination criterion increases. So the above discussion applies mainly at equilibrium, when the genome will have the most USS-like sequences. We control the number of fragments that recombine by specifying the number of fragments to be considered (F) and by explicitly setting a limit (M) on how many can recombine (e.g. a maximum of 10 fragments per cycle). Early in the run F needs to be high, or it will be many cycles before a significant number of fragments has recombined. But a high F late in the run means the simulation wastes time scoring many fragments that will never get a chance to recombine. At present I've been setting F to be 5 or 10 times larger than M, but maybe I should try reducing F and increasing M.
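In caricature, the per-cycle logic is something like the sketch below (all the names are mine, standing in for the real subroutines and parameters):

    # Caricature of one cycle's recombination step (hypothetical names).
    # All $F fragments get scored; at most $M of the passing ones recombine.
    use List::Util qw(min);

    my @passing;
    for (1 .. $F) {
        my $frag = random_fragment($genome, $fragment_length);
        mutate($frag, $fragment_mutation_rate);
        push @passing, $frag if passes_uptake_criterion(score_fragment($frag));
    }
    for my $frag (@passing[0 .. min($#passing, $M - 1)]) {
        recombine_into_genome($genome, $frag);
    }
    # Early in the run few fragments pass the criterion, so a large $F is needed;
    # near equilibrium most of them pass, so $M becomes the limit and much of the
    # scoring effort is wasted.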

Means, arithmetic and geometric (= additive and multiplicative)

Now that we've replaced our old additive scoring algorithm with a multiplicative one (see this post), the algorithm that decides whether the simulation has reached its equilibrium isn't working well. The post-doc suggests that this is because we need to track the changing score of the genome using a multiplicative mean rather than an additive mean.

The test for equilibrium works as follows: The USS-score of the genome is calculated each generation, and is used to calculate both a "recent" mean score (over the interval between status updates, usually specified as a percent of the elapsed cycles) and a "grand" mean score (over the entire run). Both means are calculated as simple averages (sum of the scores divided by the number of scores). Early in the run the grand mean is much smaller than the recent mean. Equilibrium conditions are satisfied when the % difference between these means becomes smaller than some pre-specified threshold.
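In outline (with my own names - the real code is organized differently), the check amounts to:

    # Sketch of the equilibrium test: compare the mean score over the most
    # recent interval with the mean over the whole run so far.
    use List::Util qw(sum);

    sub at_equilibrium {
        my ($recent_scores, $all_scores, $threshold) = @_;  # threshold as a fraction, e.g. 0.01
        my $recent_mean = sum(@$recent_scores) / @$recent_scores;
        my $grand_mean  = sum(@$all_scores)    / @$all_scores;
        return abs($recent_mean - $grand_mean) / $grand_mean < $threshold;
    }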

With additive scoring this worked well regardless of the length of the genome. But with multiplicative scoring, a single base change can cause a dramatic change in the score (ten-fold or more, depending on the scoring matrix used), especially when using short genomes. The post-doc, who is much more statistically sophisticated than I am, says that when numbers differ by large values, their means should be calculated geometrically rather than arithmetically.

Wikipedia explains that arithmetic means are the typical 'average' values, and geometric means are calculated using products and roots rather than sums and division. I'll say that another way: a geometric mean is calculated by first taking the product of all n values (rather than their sum) and then taking the n-th root of this product. Luckily this can easily be done using logarithms.
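Here's the log trick in miniature (a sketch of the idea, not necessarily what the post-doc actually wrote):

    # Geometric mean via logarithms: average the logs, then exponentiate.
    use List::Util qw(sum);

    sub geometric_mean {
        my @values = @_;                  # all values must be positive
        my $mean_log = sum(map { log $_ } @values) / @values;
        return exp $mean_log;
    }

    # For example, geometric_mean(10, 1000) is 100,
    # whereas the arithmetic mean of the same two numbers is 505.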

The post-doc has modified the code to do this, but she's now gone off to Barcelona (!) for the Society for Molecular Biology and Evolution meeting. I haven't yet tried out her version, but checking whether it finds better equilibria will be much easier now that the runs go so much faster.

Thank you for the comments!

The profiling I did yesterday, using DProf as suggested in a comment from Keith, showed that most of the runtime was spent in the Switch statements that are the heart of the sliding-window scoring algorithm. In new comments, Keith and Conrad explained that 'Switch' is not the fastest way to do the scoring, and that replacing it with a cascade of if/else statements could be a lot faster. (Faithful commenter Neil had also pointed out that Switch is buggy, but it wasn't setting off any alarms.)

So I've just replaced all three of the switch scoring steps with if/else cascades, and here are the spectacular results. Thanks, guys!
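For the record, the change amounts to something like this (schematic, with my own variable names rather than the ones in the actual scoring code):

    # Before: dispatching on the base with Switch.pm (slow, and buggy too).
    # use Switch;
    # switch ($base) {
    #     case 'A' { $score *= $matrix[$pos]{A} }
    #     case 'C' { $score *= $matrix[$pos]{C} }
    #     case 'G' { $score *= $matrix[$pos]{G} }
    #     case 'T' { $score *= $matrix[$pos]{T} }
    # }

    # After: a plain if/elsif cascade doing the same thing, much faster.
    if    ($base eq 'A') { $score *= $matrix[$pos]{A} }
    elsif ($base eq 'C') { $score *= $matrix[$pos]{C} }
    elsif ($base eq 'G') { $score *= $matrix[$pos]{G} }
    elsif ($base eq 'T') { $score *= $matrix[$pos]{T} }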