Benchmarking(??) the USS model

(I don't think 'benchmarking' is actually the right term for what I've been doing...)

The comments on my last post recommended using the Perl profiling function to find out which parts of our simulation are indeed responsible for its slow runtime. So I Googled "DProf" and ran it on the simulation, and then poked around to find out how to use 'dprofpp' to turn the DProf output into a table listing where most of the time was being spent. Pretty soon I discovered that I just needed to type 'dprofpp' at the usual prompt. (duh).

Then I spent quite a while trying to figure out what the profile entries meant. Here's an example of the output.

  • Total Elapsed Time = 65.63472 Seconds
  • User+System Time = 63.53472 Seconds
  • Exclusive Times
  • %Time ExclSec CumulS #Calls sec/call Csec/c Name
  • 41.6 26.44 37.689 543882 0.0000 0.0000 Switch::case
  • 28.9 18.37 18.373 241893 0.0000 0.0000 Switch::switch
  • 17.7 11.24 11.243 543882 0.0000 0.0000 Switch::__ANON__
  • 16.5 10.52 38.066 12 0.8768 3.1722 main::tallyScores
  • 0.25 0.156 0.195 3065 0.0001 0.0001 Text::Balanced::_match_quotelike
  • 0.23 0.146 0.227 3692 0.0000 0.0001 Text::Balanced::_match_variable
  • 0.23 0.143 0.608 44 0.0032 0.0138 Switch::filter_blocks
  • 0.15 0.095 0.128 1599 0.0001 0.0001 Text::Balanced::_match_codeblock
  • 0.06 0.040 0.040 7432 0.0000 0.0000 Text::Balanced::_failmsg
  • 0.04 0.027 0.027 13146 0.0000 0.0000 Regexp::DESTROY
  • 0.03 0.020 0.029 9 0.0022 0.0032 Switch::BEGIN
  • 0.02 0.010 0.010 1 0.0100 0.0100 main::createRandomGenome
  • 0.02 0.010 0.010 1 0.0100 0.0100 DynaLoader::dl_load_file
  • 0.00 0.000 0.000 1 0.0000 0 Exporter::Heavy::heavy_export_ok_tags
  • 0.00 0.000 0.000 1 0.0000 0.0000 Exporter::Heavy::heavy_export

(I tried putting the above into a monospaced font for clarity, but the editor collapsed all the extra spaces so that wasn't much help. Putting it in point form stops the editor from collapsing all the single line breaks into one big paragraph.)

The problem is that our program wasn't written with this profiler in mind, and the big time hogs aren't separate subroutines but loops within the main program. We could change this, but here 'we' means our programming assistant, not me.

I did figure out that the various 'switch' entries are steps in the sliding-window scoring. But this scoring happens to different components of the program and in different ways. I tried running the profiler on different versions of the simulation (large or small genome sequences, large or small fragments, many or few fragments) but I couldn't disentangle the effects in the resulting profiles.
So instead I began just recording the runtimes taken by sets of runs in which I systematically varied only one of the input parameters, with the simulation set to run for only 10 cycles. The first graph shows the effect of changing the genome size but holding the fragment size and number of fragments constant (at values so low that they don't contribute much to the runtime).

The first discovery is that the 'tally' we added to help us see how the genome evolves almost doubles the runtime for each genome length (compare the blue line to the red line). This is because obtaining the tally requires a sliding-window scoring step. Even though this scoring doesn't use the position-weight matrix, the time it takes is similar to the time needed for the formal scoring of the genome. The good news is that this step is completely dispensable.

The second discovery is that the steps that mutate the genome and the fragments add almost nothing to the runtime (compare the green and red lines). So we won't bother trying to further optimize the mutagenesis.

The third discovery is that the runtime is proportional to the length of the genome. Because the fragments are few and small, this is almost certainly due to the time needed for the sliding-window scoring. This tells us that reducing the number of times the genome needs to be scored should dramatically speed up the simulations.

The second graph shows how runtime depends on the number of fragments being tested. The fragment size was held at 200bp (ten times more than in the previous set of runs), and the genome length was 10kb.

Here we see that including the tally step increases all the run's times by about 20 seconds (compare blue and red lines). This is just what we expect for a 10kb genome from the previous graph. And again the mutation steps in each cycle contribute almost nothing to the runtime.

The runtime is directly proportional to the number of fragments, consistent with the sliding-window scoring being the major factor.

To confirm that other steps in processing fragments aren't a big problem, I also varied fragment length while holding fragment number constant. This third graph shows that runtime is linearly dependent on fragment length.

Finally, I held the total length of fragments to be screened constant (the product of fragment length and number of fragments screened each cycle). If the time needed for the sliding-window scoring of the fragments is the main factor responsible for the increasing runtime, then holding the total length constant should give a constant runtime even when fragment length and number are reciprocally varied.

That's exactly what we see, provided the fragments are larger than about 100bp. (As the fragment size decreases to approach the size of the window, the number of positions the window takes decreases faster than fragment size (a 10bp fragment can only be scored at one window position). The runtime is proportional to the length of sequence being scored each cycle.

This suggests one more test. I've been keeping either the genome length or the total fragment length constant. What if I plot runtime as a function of the total length of sequence (genome plus fragments) being scored each cycle.

If the sliding-window scoring is indeed the major factor affecting runtime, then there should be a linear relationship between total sequence and runtime. And voila, here it is.

Speeding up the simulation of USS evolution

Our Perl model of USS evolution now runs fine and does just about everything we want it to. Unfortunately it takes a long time to do this. To date we haven't bothered much about speed, being more concerned with just being able to do the job. But now it's time to improve its efficiency. I see several points where big improvements can be made, but first I should describe the steps the simulation presently takes.

1. Create genome sequence. This is done only once per run so speed isn't important.

In each cycle: (Efficiency is much more important for steps done every cycle.)

2. Mutate the genome, and choose and mutate a specified number of fragments of the genome. I think the mutation step is already quite efficient. The number of fragments that need to be processed each cycle can probably be reduced - I'll describe this below.

3. Score each fragment using a sliding window. Score the whole genome too. This scoring is probably by far the most time-consuming step.

4. Decide whether a fragment should recombine back into the genome, and do the recombination.  Not a big time-suck, I think.

5. Make decisions about whether results should be reported and whether the simulation has reached equilibrium. Then report, etc.  Not a big time-suck, I think.

I can see three ways to reduce the amount of time used by the sliding-window scoring:

1.  Don't score the genome every cycle.  The genome score is used to give the operator a sense of how the run is progressing - this needs only be done at the reporting intervals, rather than after every cycle - and to decide whether equilibrium has been reached.  To make this decision the simulation calculates a mean score over specified intervals and over the whole run.  I think we could devise a way to get most of the information the equilibrium decision needs using genome scores calculated much less often, perhaps even calculated only at the reporting intervals.

2.  Score only the number of fragments needed for recombination.  The specified number of fragments to be mutated and scored in each run (F) is usually much larger than the maximum number of fragments allowed to recombine with the genome (M); at present these values are set at F=100 and M=10.  When the genome has few or no strong matches to the USS consensus (at the start of a run), few or no fragments meet the recombination criterion, so every fragment needs to be considered.  But later in the run much time is wasted scoring fragments that won't be considered for recombination.  

Rather than mutating and scoring the full F fragments, then considering them for recombination until the M limit is reached, the simulation could just get M fragments, mutate, score and consider those, and then get another M fragments, until a total of M fragments have recombined.

3.  Stop scoring each fragment once its score reaches a preset maximum.  The program already has an implicit maximum score for fragments, set as the range of the probability a fragment will recombine.  But at present the sliding window scores the whole fragment.  It could instead stop scoring each fragment once this limit is reached.  As in improvement 2, this won't have much effect at the start of a run, when few USS-like sequences are present, but should speed things up a lot later in the run.

4.  Improve the efficiency of the sliding-window scoring algorithm.  This is where we'd appreciate advice from any programing masters who read this blog.  I'm going to paste the code for this bit below, but unfortunately I don't know how to format it so it's easily readable (all the tabs collapse).  Maybe I'll first put the code without comments (tidy and compact) and then put it again with the comments.

while( $slideCount> 0) 

my $t = 0; 
my $forwardScore = 1; 
while($t < $windowSize) 
my $currentBaseInWindow = substr($fragment,($counter+$t),1); 
switch ($currentBaseInWindow)
 { 
case "A" { $forwardScore = $forwardScore*$matrix[$t][0]; }
case "T" { $forwardScore = $forwardScore*$matrix[$t][1]; }
case "C" { $forwardScore = $forwardScore*$matrix[$t][2]; }
case "G" { $forwardScore = $forwardScore*$matrix[$t][3]; }
$t++;
}
$windowScore = $forwardScore;
if($windowScore> $fragmentScore)
{
$fragmentScore = $windowScore; 
}
$slideCount--;
$counter++; 

Here's the version with comments:

while( $slideCount > 0) # Keep scoring window positions until slideCount falls to zero .
{ # Prepare to score the 10 bases in the window at its present position.

my $t = 0; # t is the position of each base contained in the window, with respect to the 5' end of the window.
my $forwardScore = 1; # Score for the 10 bp sequence at each window position, compared to forward USS.

# This code scores the 10 bases of the fragment that are in the window, from t=0 to t=9.
# It first gets the base at each position t, and then gets a score for this base from the scoring matrix.

while($t < $windowSize) 
my $currentBaseInWindow = substr($fragment,($counter+$t),1); # The base will be A or T or G or C. 
switch ($currentBaseInWindow) 
case "A" { $forwardScore = $forwardScore*$matrix[$t][0]; }
case "T" { $forwardScore = $forwardScore*$matrix[$t][1]; }
case "C" { $forwardScore = $forwardScore*$matrix[$t][2]; }
case "G" { $forwardScore = $forwardScore*$matrix[$t][3]; }
$t++; # Prepare to consider the next base in the window. 
}
$windowScore = $forwardScore;
if($windowScore > $fragmentScore)
{
$fragmentScore = $windowScore; # Save only the highest score.
}

$slideCount--;
$counter++; # Now we slide the window to the next position in the fragment sequence.


Why won't USSs accumulate in our model?

Our Perl model of USS evolution in a genome runs well, but USS-like sequences accumulate only slightly. I've been playing around with various factors that might be responsible but haven't really gotten anywhere. I need to get these factors clear in my mind, so it's time for a blog post about them.

What the model does (before I started fiddling with it on Friday): (I should clarify that this version (USSv5.pl) doesn't yet have all the formatting bells and whistles that would make it graceful to use, but it does have the basic functionality we think we want.) It first creates a random 'genome' sequence of specified length and base composition, whose evolution the model is going to simulate. In each evolutionary cycle it first takes random segments of this genome, mutates them according to a specified mutation rate, and scores them for sequences similar to a sequence motif specified in the form of a matrix (see figure in this post for examples). This version of the model uses a new multiplicative scoring procedure rather than our original additive procedure. Each segment's sequence has a chance to replace its original sequence, with probability proportional to its score expressed as a fraction of some preset maximum score. The modified genome is scored by the same procedure used for the segments, and then becomes the starting sequence for the next cycle. (We had intended that the genome would undergo mutation in each cycle but this step has been bypassed because it was causing the USS-like motifs to degenerate faster than they accumulated.)

We first tested the additive and multiplicative scoring procedures to see how much difference a perfect USS sequence made to the score of otherwise random-sequence genomes. As we already knew, the additive procedure gave sequences with USS scores that were at best only about 1% higher than sequences without USS - the precise difference depends on the length of the sequence (we tested 100, 1000 and 10000 bp) and on the numbers in the scoring matrix .

The scores obtained with the multiplicative procedure were far more sensitive to the presence of a USS. For the 100bp test sequences, scores with USS were from 2-fold to 700-fold higher than for the same sequence without a USS, depending on how much higher USS-consensus bases scored than non-consensus bases. The lowest sensitivities were seen when this ratio was 2 for all positions, with higher sensitivities when the ratios were from 5-fold to 50-fold.

So this looked very promising - with such a sensitive scoring system I expected USS-like sequences to accumulate rapidly and to a high frequency. But this didn't happen. The genome scores did go up dramatically, but this turned out to be due to the much more sensitive scoring system acting on only a few base changes.

I played around with different genome sizes and fragment sizes and numbers and mutation rates and matrix values, but nothing seemed to make much difference. "Seemed" is the right word here, as I didn't keep careful records. I created or reinstated various reporting steps, so I could get a better idea of what was going on. I also replaced the preset maximum segment score with a variable score (= 5% of the previous cycle's genome score), so that the strength of the simulated bias would increase ass USS-like sequences accumulated in the genome.

But USS-like sequences didn't accumulate much at all, and I don't know why. There could be a problem with the code, but none of the informal checks I've done has set off any alarm bells. There could instead be a fundamental problem with the design of the simulation, so that what we are telling the simulation to do cannot lead to the outcome we expect. Or perhaps only a very small portion of 'parameter-space' allows USS-like sequences to accumulate.

The post-doc and I came up with two approaches. One is to meticulously investigate the effects of the various factors we can manipulate, keeping everything else as simple and constant as possible. The other is to use our brains to think through what should be happening. While we're doing this our programming assistant will be adding a few more sophisticated tricks to the program, to make our analysis easier.

I'll end with a list of factors we need to method

Progress on USS bioinformatics

Yesterday my bioinformatics collaborator came by for a collaborational visit - she's in town for a few days. I've been quite frustrated by our slow sporadic email correspondence so it was great to discuss the work face-to-face.

We clarified a number of issues about the data - because she's generating it and I'm doing the writing about it, this is where all our problems lie. There's only one data set still to generate, a comparison of the peptides encoded by uptake sequences with their homologs in species that don't have uptake sequences. But there's still lots of writing to be done. I'm hoping she'll be able to come work with us for a couple of weeks early in the summer, so we can work on the writing together.

Biological relevance of USS scoring systems

In response to a recent post about how our USS-evolution model will score USS-like sequences, a commenter (Neil) says "I see ROC curves and cross-validation in your future." Google tells me that ROC (receiver operating characteristics) curves are graphs representing the relationship between a signal receiver's sensitivity and its specificity.  They thus represent the receiver's ability to detect true positive signals and its tendency to falsely report events that aren't true signals.

Does this apply to USS?  That's actually a scientific question about the nature of USS - are they really signals?  USS stands for 'uptake signal sequence', a name chosen because they were assumed to have evolved so competent bacteria could distinguish closely related 'self' DNA fragments from 'foreign' DNA.  Under this view the uptake machinery could be viewed as a signal receiver that needs to distinguish true USS (those on self DNA) from false-positive USS (chance occurrences of USS-like sequences in foreign DNA.  (Note for new readers: the conventional 'core' USS of Haemophilus influenzae is the sequence AAGTGCGGT.)

But we don't think that USS are signals, at least not in this sense.  Rather, our working hypothesis (null hypothesis) is that USS-like sequences accumulate in genomes as an accidental consequence of biases in the DNA-uptake machinery and recombination of 'uptaken' DNA with the genome.  (I put 'uptaken' in quotes because it's not a real word; I'm using it because it's clearer than 'incoming', the word we usually use.)  So rather than wanting a perfect scoring system to distinguish 'true' USS from 'false' ones, we would want it to reflect the real properties of the receiver (the DNA uptake machinery of real cells).

Unfortunately we don't know nearly enough about the uptake machinery to accurately describe it in a model.  We know that some positions of the USS are more important than others, and that sequences outside the core matter.  We have detailed analyses of the USS-like sequences that have accumulated in real genomes (see all my old posts about the Gibbs motif sampler), but we don't know how these sequences reflect the properties of the uptake machinery that caused them to accumulate.  That's one of the things we hope to use the model to clarify. 

For now, we don't really want to use a 'perfect' scoring system in our model.  Instead, we can treat different scoring systems as different hypotheses about how differences in DNA sequences affect uptake (different hypotheses about how the machinery works).  So we will start with very simple systems, such as those described in the previous post.  Once we know how those affect simulated uptake and the long-term accumulation of high-scoring sequences in the genome, we can then clarify how what we know about the real uptake machinery constrains our ideas of how these sequences evolve.

Testing the new scoring system

The undergrad is creating a new version of the USS model program with the scoring done multiplicatively rather than additively (see previous post). I was originally thinking we'd start by just playing around with it, to see what happens. But after a conversation with the post-doc I realize that we should start out being more systematic.

What we need to do first is just find out what scores are produced by different sequences using this system:
  1. What score does a random sequence of a specified length (and base composition) produce? We'll test 100, 1000 and 10000bp.
  2. What score does a sequence produce that differs only in containing one USS perfectly matched to the matrix consensus?
And we'll do the same tests with differently weighted matrices, using both additive and multiplicative scoring:
  1. Additively scored matrix with all consensus bases worth 1 and all non-consensus bases worth 0 (like the yellow one in the previous post).
  2. Additively scored matrix with consensus bases at different positions weighted differently, according to our measures of their contribution to uptake. For example, some consensus bases might be worth 1 and some 3 or 5 or 10.
  3. Multiplicatively scored matrix with all consensus bases worth the same value (say 2 or 5), and all non-consensus bases worth 1.
  4. Multiplicatively scored matrix with consensus bases at different positions weighted differently, but all non-consensus bases still weighted 1.
  5. Multiplicatively scored matrix with the different non-consensus bases also weighted differently, perhaps with some values smaller than 1.
Only after we've done these basic tests of the different scoring systems will we decide whether to test their effects on USS accumulation. These preliminary tests shouldn't take very long. It's just a matter of generating the 6 test sequences and the 5-10 test matrices in a text file (needing maybe half an hour), and then pasting the different permutations into the two (additive and multiplicative) test versions of the program. The actual scoring runs will only take a few seconds each.

Scoring USS-like sequences in our model (so blind!)

Months ago (last fall?) a post-doc and I spent what seemed like a lot of time at the whiteboard in the hall, considering different ways our planned USS model might score the sequences it was considering for their similarity to a USS motif.

We eventually settled on the crude system shown on the left (yellow table). It evaluates how well the DNA sequence in a 10-base window matches the USS core consensus. Each match to the consensus earns a point, with the total score for the sequence being the sum of the points it's earned. At the time, we realized that this way of scoring had two (or three?) big problems, but we needed something simple to get the model working so we settled for this.

The first problem is that the score is not very sensitive to how good the match is. The yellow numbers beside the table show the scores earned by specific sequences. A sequence matching at all 10 positions is only 11% better than a sequence matching at 9 positions, even though we know from real uptake experiments that some single base changes can reduce uptake by more than 95%. The second problem is that this method treats all 10 positions in the motif equally. But again our uptake experiments have shown that some positions in the motif affect uptake much more strongly than others.
The third problem is that random sequences have very high scores, and adding a single perfect-match USS to it increases this baseline score only slightly.

This morning the post-doc and I reconsidered the scoring system. We expected that finding a solution to these problems would be very difficult, but we quickly came up with a much better way, illustrated by the blue table on the right of the figure. The new method is to multiply the scores of the individual positions rather than summing them. This causes the scores of well-matched sequences to be dramatically higher than those of poorer matches. And we expect (though we haven't tested this yet), that the baseline score of a random sequence will be much smaller. For now we've given all but the consensus base scores of 1, but these could be larger or smaller; for example some bases at some positions could be worth only 0.1 of a point.

Now that the program is working, implementing a multiplicative scoring system should be simple. I'm tempted to try it right now, but I have lots of other things I should be doing, and I'd probably just get bogged down in technical problems anyway.

The Perl code had a bug, but I found it!

The bells-and-whistles version of the Perl model of USS evolution still had a bug, which became apparent once I fiddled the fragment scoring system to strongly favour good matches, and turned off mutation of the genome sequence (so only the fragments mutated). The bug manifested itself in the program cycles stopping, at fairly random points in the run (never stopping twice at the same cycle number or genome score, as far as I could tell).

After a LOT of careful detective work on my part, entirely unencumbered by knowledge of any Perl debugging tools, I found that a 'while' counter was being incremented at the wrong place (inside an 'if' instruction that was inside its 'while' loop, instead of just inside its 'while' loop). I still don't understand why this would cause the runs to stick at random points, but maybe the undergrad can explain it to me tomorrow.

(Confession added later: Solving the problem was not just the result of my careful detective work. The final discovery was helped by luck. I had added an 'else' statement to print a report that the next step had happened, but had incorrectly inserted one too many } brackets. In solving this I accidentally removed a different bracket than the one I had incorrectly inserted, which moved the while counter outside of the 'if' loop and, I discovered, eliminated the stopping problem.)

As usual, the Perl problem was the line feeds

Somehow, in being emailed to me, the USSv4a.pl and settings.txt files both acquired nasty Mac carriage returns instead of nice well-behaved Unix line feeds. This kind of problem has arisen often enough in the past that I knew to suspect it, but I had to figure out how Komodo deals with line feed issues before I could confirm that this was the problem and correct it.

Now the program runs fine, but it prints some reporting lines probably created by the undergrad while he was debugging the new bells and whistles....

...Three hours later.... I've found, understood and removed the unwanted reporting lines, and found and fixed a big mistake in how the program decided whether it was time to print out a report. And found and fixed several minor problems....

Not as simple a task as it should have been

Before he left last night the undergrad sent me what should be a fully functional version of our USS model, with the desired new features all working (they're described here).  But I just tried to run it on my home computer and absolutely nothing happens.  I enter "perl USSv4a.pl" at the Terminal prompt, and I just get another terminal prompt.

The problem is likely with the computer, not the program.  This is the first program I've tried to run on it since I replaced the hard drive a few months ago - as everything on the old drive was lost I was effectively starting from scratch.  I don't think I need to install perl - I've never had to do that before.  The USSv4a.pl file is in the Rosie directory, as is the settings.txt file it calls.  I could create a 'Hello World' perl file to test that perl is working but my perl book is at my office; maybe I can find one online ('Hello World' file, not perl book).

Yes I can (it's just one line), but now I discover that TextEdit will only save files as rtf, html, word and wordxml!  Wtf?  Guess I'd better install Komodo Edit.  OK, I remember that finding a Komodo Edit download site took a bit of looking (Komodo IDE was easy to find but costs a bundle), but Google found a site for downloading Komodo Edit.  Unfortunately what this has given me has a '.msi' suffix, and my Mac doesn't know what to do with it.  On looking more carefully at the source site, I see that it says something about bittorrent, which I don't know how to use.  Back to Google - OK, a site offering a Mac version, which is coming in (very slowly) as '.dmg'.

OK, got Komodo Edit, in Applications folder, opening for first time involves 'pre-loading' various bells and whistles, Mac gets VERY HOT (Fan Control maxed out).

OK, I created the Hello World file, and it ran fine.  Hmm, maybe there IS something wrong with the USSv4a.pl program.  Well, now I have Komodo Edit so I can take a look at it.  It looks fine, so I created a 'test' version with Hello World print lines at various places.  None of them gets printed.  I see that the undergrad has inserted a 'die' reporting step if the program can't open its settings file, so I think that's not the problem.  Instead Terminal acts as if the USSv4a.pl file and the new USSv4atest.pl file don't exist.

Back to the original USSv4a.pl file; now Komodo is claiming that every line fails to end with the expected "LF" end-of-line marker, but when I have it show the markers they all look fine.  The USSv4atest.pl file markers look exactly the same, and Komodo has no problem with them.

Time to quit working on this for now.  Maybe I'll send an email to the undergrad, or maybe I'll just wait till I get to my office (maybe not till tomorrow) and try to run it there.

Time to start preparing some talks

I have three talks to give in the next month and a half, so I need to start preparing them now.

First a 20- or 25-minute one at the annual workshop of the new CIfAR Program in Integrated Microbial Diversity, held somewhere not far from here, sometime close to the end of May. The guy in the next office invited me but he's out of town so I can't recheck the details. This talk will describe what we know about how natural selection has acted on the genes that cause bacterial genetic exchange. I think I can probably do this with slides I already have prepared.

Next, a 20-minute talk at a conference titled "Sex and Recombination: In Theory and In Practice", at the University of Iowa in mid-June. This talk will begin by introducing everything that the above talk will take 20 minutes to cover, and will then go on to explain how we are using computer simulations to understand how uptake sequences can accumulate in the genomes of competent bacteria. I hope to discuss results from our Perl model; the undergraduate is adding necessary embellishments (an oxymoron). (He thought he would have them all in place by late today but he left without passing the improved model on to me, so I suspect they're not quite debugged yet.) I don't yet have any model-specific slides for the talk, nor even a good idea of what the results will be.

And a few days after that, a 15-minute talk at the big Evolution meeting at the University of Minnesota. This will be on my work with the bioinformatician, on how the accumulation of uptake sequences in bacterial genomes has affected the ability of their genes to code for well-adapted proteins. Almost all the work is done here, and we have nice figures prepared for our manuscript. Unfortunately my collaborator has just redone some of the analysis and sent me new figures I don't completely understand. And her email response time has gotten very slow (though I guess this is only fair payback for my long silence while I was swamped with teaching).

My very old 'analytical' USS evolution model

I found my old USS evolution model in the files with my 1999 NIH grant proposal. It's not a computer simulation of evolution but analysis of equations describing an equilibrium. It uses only very simple algebra, so calling it an 'analytical' model is probably giving it more credit than it deserves. The introductory text gives an excellent description of the background, so here it is:
This model starts with the following assumptions:
  1. H. influenzae cells have a preexisting DNA uptake system that preferentially transports fragments containing a USS0 (the 9bp core: 5'AAGTGCGGT). Fragments with imperfect (USS1) sites are not favoured.
  2. Fragments of H. influenzae DNA are frequently brought into cells by this system.
  3. Once inside the cell these fragments recombine with and replace homologous regions of the chromosome.
  4. The DNA in the cells' environment comes from lysed cells ('donors') having the same distribution of USS0 and USS1 sites as the cells that are taking up DNA (the 'recipients').
  5. Random mutation acts on both USS0 and USS1 sites to destroy and create new USS.
How these assumptions cause USS0 to accumulate:
Because the DNA uptake mechanism is biased in favour of fragments containing USS0 sites, any donor DNA fragment containing a new mutation that has created a USS0 will be taken up more efficiently than the wildtype version of the fragment. Similarly, donor fragments containing new mutations that have eliminated a USS0 will be taken up less efficiently than their wildtype counterparts. Consequently the DNA fragments within cells will be enriched for USSs relative to the external DNA, and recombination between these fragments and the resident chromosome will increase the number of genomic USS0s more frequently than it will decrease it. The bias will similarly affect the fate of new mutations arising within the recipient cell; mutations removing a USS0 will often be replaced by donor fragments carrying the wildtype USS0, whereas mutations creating new USS0s in the recipient will less frequently be replaced. This biased gene conversion can thus both compensate for mutational loss and amplify mutational gain of USS0s, and will cause them to be maintained at a higher equilibrium number than would be expected for a random sequence of the same base composition.

How the model works:
This model addresses processes acting within a single genome. The model uses the observed equilibrium numbers of perfect and imperfect USS sites to derive an equation relating transformation frequency to the bias of the DNA uptake system. This equation tells us the values that the transformation frequency and uptake bias parameters would have to take in order to be responsible for the maintenance of perfect and imperfect USS at the high equilibrium ratio that we observe.
The model then defines its terms ('variables'? 'parameters'?). It assumes that the frequencies of USS0 and USS1 observed in the real H. influenzae genome represent an equilibrium between forces that create USS0s and forces that convert them into USS1s. It then derives an equation relating the bias of the DNA uptake machinery (enrichment of USS0 over USS1) to the transformation frequency (the probability that a USS site will be replaced a fragment brought in by the uptake machinery in each generation). Conveniently, the mutation rate drops out of the equation.

Again quoting from what I wrote ten years ago:
What the final equation means: This equation tells us how frequent transformation must be (T), given a specified bias B of the DNA uptake system in favor of the USS, in order to fully account for the observed numbers of USS0 and USS1 sites in the Rd genome. The range of values is given below.












This is a pleasingly tidy result. It makes good biological sense, and the values are not unreasonable. Estimates of the actual bias vary, no doubt partly because they have been determined using USSs with different flanking sequences, but are usually between 10 and 100. We have no good estimate of actual transformation frequencies for H. influenzae in its natural environment, but if cells grow slowly, and frequently take up DNA as food then an average of 1% transformation per generation seems plausible, and even 10% not impossible.
I'm not very sure that this work is sound. The mutation rate dropping out is a bit suspicious, and I'm not sure how to interpret the transformation frequency when there are no other terms depending on generations (such as mutation rate).

Gene Transfer Agent evolution

My GTA colleague suggests that GTA may persist because of species-level selection. 'Species' is a tricky concept because these are bacteria, but we can simplify this to consider selection of lineages.

The basic idea is like that proposed to explain the surprisingly high frequency of bacterial lineages with defective mismatch-repair genes. Like most mutations, most GTA-mediated recombinational events will probably be deleterious. But some will be beneficial. Each time a beneficial recombination event occurs the lineage of cells descending from it will all contain GTA as well as the new combination. Provided the short-term and long-term costs of GTA don't cause the new lineage to go extinct or lose its GTA genes before the GTA-mediated beneficial change, lineages with GTA could take over.

Evolution-of-sex theorists have produced theoretical evaluations of such 'hitchhiking' processes, treating the allele in question as a 'recombination-modifier' (in this case the 'allele' would be the block of GTA genes). I haven't looked at the literature recently, but I think the general conclusion is that hitchhiking is common but weak; it is very unlikely to push an otherwise harmful allele to fixation. But these analyses weren't done for bacteria but for sexually reproducing eukaryotes, with the modifier controlling the frequency of meiotic crossovers between two other loci. I don't know how they would apply to bacteria.

Nevertheless, we know the ability of relatively rare beneficial events to preserve the GTA gene cluster must depend on how frequent the beneficial events are, how beneficial they are, how often GTA genes themselves undergo mutations that block their function, and how much (if any) harm the GTA genes cause. For example, if beneficial events are very rare, functional GTA genes may be lost by random mutation in the interval between beneficial events. Subsequent selection might cause the lineage to then go extinct, but it wouldn't bring the GTA genes back.

The important question is, how could we tell if this sort of thing is responsible for the persistence of the GTA genes? Short answer: I don't know.

USS don't accumulate because the bias is much too weak

In our very-preliminary version of the USS evolution model, we've been using a very simple scheme to score the similarity of DNA sequences to the USS motif. We just count the number of matches to the 10bp core USS sequence. Right now I'm keeping everything simple by running the model with DNA fragments that are only 13 bp long (and a genome that's 200-1000 bp long).

So a fragment with a perfect 10 bp match to the motif is only twice as likely to recombine back into the genome as a fragment with only 5 bp matching. We know from our earlier model, and from a calculation I did years ago, that the bias favouring USS needs to be much stronger than this if it is to overcome the randomizing effects of mutation. For example, (ignoring the effects of base composition) a sequence that matches the motif at 9 positions has 27 different ways to mutate into a fragment that matches at only 8 positions, and only one way to mutate into a fragment that matches at 10 positions. To overcome this disproportion, the bias favouring 9 over 8 (or is it 10 over 9?) has to be proportionally strong (i.e. 27-fold).

Now I need to find that old calculation - I think it might be with my 1999 NIH proposal.

Why don't USS accumulate in our model?

The undergraduate is working at home, making improvements to the use-ability of our Perl model of USS evolution, while I'm here doing some test runs with the slightly unwieldy version we have now.

This is the version that incorporates the first major improvement. Instead of a single "best-score" fragment recombining with the genome in each cycle, each of the scored fragments can recombine with the genome, with a probability proportional to its score. In principle this should allow the recombination to introduce new USS-like sequences faster than they are lost from the genome by mutation. But in practice the USS-score of the genome drifts up and down but doesn't consistently increase.

"When in doubt, run more controls."

So I've made a modified version of this program that has the same feature as our PositiveControl.pl program (it's name is PositiveControlv2.pl). Instead of mutating random fragments and recombining them back into the genome, it replaces their sequences with perfect USS cores. Of course the fragments need to be the same length as the core.

OK, this positive control tells me that the program is doing what it should. 10 fragments are recombining into the genome each cycle, which is what should happen to the 10 fragments the program considers each generation. Now I'll test PositiveControlv3, which puts in an imperfect USS instead of a perfect one. If the program is doing what I want, on average 8 of the 10 fragments will recombine each generation, and the genome score will not get as high as with perfect USSs.

OK, that worked too. So I think the problem may not be with the implementation of the code, but with the design of this version of the model. Back to the drawing board... (literally, back to the big whiteboard in the hall outside my office).