Benchmarking(??) the USS model

(I don't think 'benchmarking' is actually the right term for what I've been doing...)

The comments on my last post recommended using the Perl profiling function to find out which parts of our simulation are indeed responsible for its slow runtime. So I Googled "DProf" and ran it on the simulation, and then poked around to find out how to use 'dprofpp' to turn the DProf output into a table listing where most of the time was being spent. Pretty soon I discovered that I just needed to type 'dprofpp' at the usual prompt. (duh).

Then I spent quite a while trying to figure out what the profile entries meant. Here's an example of the output.

  • Total Elapsed Time = 65.63472 Seconds
  • User+System Time = 63.53472 Seconds
  • Exclusive Times
  • %Time ExclSec CumulS #Calls sec/call Csec/c Name
  • 41.6 26.44 37.689 543882 0.0000 0.0000 Switch::case
  • 28.9 18.37 18.373 241893 0.0000 0.0000 Switch::switch
  • 17.7 11.24 11.243 543882 0.0000 0.0000 Switch::__ANON__
  • 16.5 10.52 38.066 12 0.8768 3.1722 main::tallyScores
  • 0.25 0.156 0.195 3065 0.0001 0.0001 Text::Balanced::_match_quotelike
  • 0.23 0.146 0.227 3692 0.0000 0.0001 Text::Balanced::_match_variable
  • 0.23 0.143 0.608 44 0.0032 0.0138 Switch::filter_blocks
  • 0.15 0.095 0.128 1599 0.0001 0.0001 Text::Balanced::_match_codeblock
  • 0.06 0.040 0.040 7432 0.0000 0.0000 Text::Balanced::_failmsg
  • 0.04 0.027 0.027 13146 0.0000 0.0000 Regexp::DESTROY
  • 0.03 0.020 0.029 9 0.0022 0.0032 Switch::BEGIN
  • 0.02 0.010 0.010 1 0.0100 0.0100 main::createRandomGenome
  • 0.02 0.010 0.010 1 0.0100 0.0100 DynaLoader::dl_load_file
  • 0.00 0.000 0.000 1 0.0000 0 Exporter::Heavy::heavy_export_ok_tags
  • 0.00 0.000 0.000 1 0.0000 0.0000 Exporter::Heavy::heavy_export

(I tried putting the above into a monospaced font for clarity, but the editor collapsed all the extra spaces so that wasn't much help. Putting it in point form stops the editor from collapsing all the single line breaks into one big paragraph.)

The problem is that our program wasn't written with this profiler in mind, and the big time hogs aren't separate subroutines but loops within the main program. We could change this, but here 'we' means our programming assistant, not me.

I did figure out that the various 'switch' entries are steps in the sliding-window scoring. But this scoring happens to different components of the program and in different ways. I tried running the profiler on different versions of the simulation (large or small genome sequences, large or small fragments, many or few fragments) but I couldn't disentangle the effects in the resulting profiles.
So instead I began just recording the runtimes taken by sets of runs in which I systematically varied only one of the input parameters, with the simulation set to run for only 10 cycles. The first graph shows the effect of changing the genome size but holding the fragment size and number of fragments constant (at values so low that they don't contribute much to the runtime).

The first discovery is that the 'tally' we added to help us see how the genome evolves almost doubles the runtime for each genome length (compare the blue line to the red line). This is because obtaining the tally requires a sliding-window scoring step. Even though this scoring doesn't use the position-weight matrix, the time it takes is similar to the time needed for the formal scoring of the genome. The good news is that this step is completely dispensable.

The second discovery is that the steps that mutate the genome and the fragments add almost nothing to the runtime (compare the green and red lines). So we won't bother trying to further optimize the mutagenesis.

The third discovery is that the runtime is proportional to the length of the genome. Because the fragments are few and small, this is almost certainly due to the time needed for the sliding-window scoring. This tells us that reducing the number of times the genome needs to be scored should dramatically speed up the simulations.

The second graph shows how runtime depends on the number of fragments being tested. The fragment size was held at 200bp (ten times more than in the previous set of runs), and the genome length was 10kb.

Here we see that including the tally step increases all the run's times by about 20 seconds (compare blue and red lines). This is just what we expect for a 10kb genome from the previous graph. And again the mutation steps in each cycle contribute almost nothing to the runtime.

The runtime is directly proportional to the number of fragments, consistent with the sliding-window scoring being the major factor.

To confirm that other steps in processing fragments aren't a big problem, I also varied fragment length while holding fragment number constant. This third graph shows that runtime is linearly dependent on fragment length.

Finally, I held the total length of fragments to be screened constant (the product of fragment length and number of fragments screened each cycle). If the time needed for the sliding-window scoring of the fragments is the main factor responsible for the increasing runtime, then holding the total length constant should give a constant runtime even when fragment length and number are reciprocally varied.

That's exactly what we see, provided the fragments are larger than about 100bp. (As the fragment size decreases to approach the size of the window, the number of positions the window takes decreases faster than fragment size (a 10bp fragment can only be scored at one window position). The runtime is proportional to the length of sequence being scored each cycle.

This suggests one more test. I've been keeping either the genome length or the total fragment length constant. What if I plot runtime as a function of the total length of sequence (genome plus fragments) being scored each cycle.

If the sliding-window scoring is indeed the major factor affecting runtime, then there should be a linear relationship between total sequence and runtime. And voila, here it is.

6 comments:

  1. Your DProf results point to the culprit. Notice that the majority of the time is spent in code that is not yours! This is what I mean about profiling showing up unexpected bottlenecks. Your graphical approach is useful too, although as it treats the program as a "black box" and so cannot give you the complete answer that Switch seems to be devastating your efficiency.

    The vanilla alternative to Switch is if/else

    if ($currentBaseInWindow eq "A") {
    $forwardScore = $forwardScore*$matrix[$t][0];
    } elsif ($currentBaseInWindow eq "T") {
    $forwardScore = $forwardScore*$matrix[$t][1];
    } elsif ($currentBaseInWindow eq "C") {
    $forwardScore = $forwardScore*$matrix[$t][2];
    } elsif ($currentBaseInWindow eq "G") {
    $forwardScore = $forwardScore*$matrix[$t][3];
    } else {
    die "Unknown base $currentBaseInWindow\n";
    }

    Notice that I check for bases that are not ACGT rather than falling through the statement.

    As a quick test I went back to that code I posted on your blog a few weeks ago and inserted into the scoring loop a similar statement using Switch that calculates an arbitrary, meaningless score. Then I replaced the Switch with an equivalent if/else. You may find the results interesting!

    Genome size: 1000000 bp
    Time is the average of 3 runs

    Switch version:
    1 min 46.31 sec

    if/else version:
    0 min 9.83 sec

    ReplyDelete
  2. This is a very nice analysis that shows why you want to profile your code before attempting to optimize it. You're making excellent progress.

    I would follow keith's advice by eliminating the use of the Switch module. Using if/else statements is considered inelegant by most programmers (because it makes the code more difficult to understand), but it appears here that elegance and clarity are having a huge negative impact on the efficiency of your program.

    Regarding including code in your posts: The <pre> tag is useful for this, if blogger allows it. I just tested this, however, and the <pre> tag is not allowed in comments.

    ReplyDelete
  3. Keith, what do you mean by 'code that is not yours'? I'm afraid I don't even know what 'switch' does - the programming undergrad is responsible for it.

    But I gather that switch may be a very inefficient way to do the sliding-window scoring, and that an if/else approach may be more efficient. I'll try implementing it now, based on your example.

    ReplyDelete
  4. By "not yours" I mean code that is part of a library that is used (directly or indirectly) by your code.

    For example, the top line in the profiler output shows that a lot of time is spent in the 'case' function within the 'Switch' package. This is code written by Damian Conway.

    Switch is a library that aims to allow Perl programmers to use 'switch' or 'case' conditional statements in their code. Perl doesn't support such syntax, so the Switch library acts as what is known as a "source filter" and effectively rewrites your code into legal Perl.

    Or, should I say, ineffectively... Ba-dum Tish! Thankyou, I'll be here all week.

    ReplyDelete
  5. You can get rid of all of that switching with a hash lookup:

    my %base_indices = (
    A => 0,
    T => 1,
    C => 2,
    G => 3,
    );

    my $index = $base_indices{$currentBaseInWindow};
    die "Unknown base $currentBaseInWindow\n" unless $index;

    $forwardScore *= $matrix[$t][$index];

    ReplyDelete
  6. @chromatic - nice suggestion. It's tempting to use arrays of arrays to code matrices in Perl, as the programming guides tell you that it's the most similar data structure. However, I often use hashes of hashes, e.g.:

    %matrix = (0 => {'A' => 0.39, 'C' => 0.17, 'G' => 0.28, 'T' => 0.15},
    ...)

    where the first key (0) is the first column, the other keys (A..T) are the rows. This way, you can make an array from your sequence window and do:

    $score += $matrix{$i}{$oligo[$i]}

    i.e. $i is the column position and $oligo[$i] maps directly to $matrix{$i}.

    ReplyDelete

Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS