Our Perl model of USS evolution now runs fine and does just about everything we want it to. Unfortunately it takes a long time to do this. To date we haven't bothered much about speed, being more concerned with just being able to do the job. But now it's time to improve its efficiency. I see several points where big improvements can be made, but first I should describe the steps the simulation presently takes.
1. Create genome sequence. T
his is done only once per run so speed isn't important.In each cycle: (
Efficiency is much more important for steps done every cycle.)
2. Mutate the genome, and choose and mutate a specified number of fragments of the genome.
I think the mutation step is already quite efficient. The number of fragments that need to be processed each cycle can probably be reduced - I'll describe this below.3. Score each fragment using a sliding window. Score the whole genome too.
This scoring is probably by far the most time-consuming step.4. Decide whether a fragment should recombine back into the genome, and do the recombination.
Not a big time-suck, I think.
5. Make decisions about whether results should be reported and whether the simulation has reached equilibrium. Then report, etc. Not a big time-suck, I think.
I can see three ways to reduce the amount of time used by the sliding-window scoring:
1. Don't score the genome every cycle. The genome score is used to give the operator a sense of how the run is progressing - this needs only be done at the reporting intervals, rather than after every cycle - and to decide whether equilibrium has been reached. To make this decision the simulation calculates a mean score over specified intervals and over the whole run. I think we could devise a way to get most of the information the equilibrium decision needs using genome scores calculated much less often, perhaps even calculated only at the reporting intervals.
2. Score only the number of fragments needed for recombination. The specified number of fragments to be mutated and scored in each run (F) is usually much larger than the maximum number of fragments allowed to recombine with the genome (M); at present these values are set at F=100 and M=10. When the genome has few or no strong matches to the USS consensus (at the start of a run), few or no fragments meet the recombination criterion, so every fragment needs to be considered. But later in the run much time is wasted scoring fragments that won't be considered for recombination.
Rather than mutating and scoring the full F fragments, then considering them for recombination until the M limit is reached, the simulation could just get M fragments, mutate, score and consider those, and then get another M fragments, until a total of M fragments have recombined.
3. Stop scoring each fragment once its score reaches a preset maximum. The program already has an implicit maximum score for fragments, set as the range of the probability a fragment will recombine. But at present the sliding window scores the whole fragment. It could instead stop scoring each fragment once this limit is reached. As in improvement 2, this won't have much effect at the start of a run, when few USS-like sequences are present, but should speed things up a lot later in the run.
4. Improve the efficiency of the sliding-window scoring algorithm. This is where we'd appreciate advice from any programing masters who read this blog. I'm going to paste the code for this bit below, but unfortunately I don't know how to format it so it's easily readable (all the tabs collapse). Maybe I'll first put the code without comments (tidy and compact) and then put it again with the comments.
while( $slideCount> 0)
{
my $t = 0;
my $forwardScore = 1;
while($t < $windowSize) {
my $currentBaseInWindow = substr($fragment,($counter+$t),1);
switch ($currentBaseInWindow)
{
case "A" { $forwardScore = $forwardScore*$matrix[$t][0]; }
case "T" { $forwardScore = $forwardScore*$matrix[$t][1]; }
case "C" { $forwardScore = $forwardScore*$matrix[$t][2]; }
case "G" { $forwardScore = $forwardScore*$matrix[$t][3]; }
}
$t++;
}
$windowScore = $forwardScore;
if($windowScore> $fragmentScore)
{
$fragmentScore = $windowScore;
}
$slideCount--;
$counter++;
Here's the version with comments:
while( $slideCount > 0) # Keep scoring window positions until slideCount falls to zero .
{ # Prepare to score the 10 bases in the window at its present position.
my $t = 0; # t is the position of each base contained in the window, with respect to the 5' end of the window.
my $forwardScore = 1; # Score for the 10 bp sequence at each window position, compared to forward USS.
# This code scores the 10 bases of the fragment that are in the window, from t=0 to t=9.
# It first gets the base at each position t, and then gets a score for this base from the scoring matrix.
while($t < $windowSize)
{
my $currentBaseInWindow = substr($fragment,($counter+$t),1); # The base will be A or T or G or C.
switch ($currentBaseInWindow)
{
case "A" { $forwardScore = $forwardScore*$matrix[$t][0]; }
case "T" { $forwardScore = $forwardScore*$matrix[$t][1]; }
case "C" { $forwardScore = $forwardScore*$matrix[$t][2]; }
case "G" { $forwardScore = $forwardScore*$matrix[$t][3]; }
}
$t++; # Prepare to consider the next base in the window.
}
$windowScore = $forwardScore;
if($windowScore > $fragmentScore)
{
$fragmentScore = $windowScore; # Save only the highest score.
}
$slideCount--;
$counter++; # Now we slide the window to the next position in the fragment sequence.