Field of Science

Base composition analysis supports gene transfer

Both our results and those recently published by another group suggest that uptake sequences are less common in genes that don't have homologs in related genomes.  I'm using 'related' loosely here, as our analysis searched for homologs in genes from other families of bacteria whereas the other group searched in members of the same genus.  But both results are best explained by the hypothesis that genes lacking uptake sequences are often those that a genome has recently acquired by lateral transfer from genomes that don't share that species' uptake sequence. 

To test this, we looked at gene properties that might give evidence of lateral transfer.  The simplest such property is base composition (%G+C), a property that differs widely between different bacterial groups and changes only slowly after a sequence has been transferred to a different genome.  My bioinformatics collaborator had already tabulated the %G+C of all the genes in the H. influenzae and N. meningitidis genomes, noting whether each gene had (1) uptake sequences and (2) homologs in the test genomes (see previous two posts for more explanation).  She'd then calculated the means and standard deviations of the base compositions of genes in each of the four classes.  

This analysis showed that, on average, the genes with no uptake sequences and no homologs also had anomalously low base compositions and larger standard deviations.  But I thought the effect might be clearer with a graphical presentation, so I plotted the actual %G+C for each gene as a function of which class it is in (presence/absence of uptake sequence and homologs).  I also noted sample sizes, as some classes have a lot more genes than others.  This gave a nice visualization of the effects, showing, not surprisingly, that genes with homologs in the control genomes have more homogeneous base compositions than those without homologs, and that the effect is quite a bit stronger for those genes that don't have uptake sequences.

Order of Results

We're still working out the order in which we present the different results in our draft manuscript.  The first parts of the Results are now the analysis of how uptake sequence accumulation has changed the frequencies of different tripeptides in the proteome, and of why some reading frames are preferred.

The next decision is how we present the results that use alignments of uptake-sequence-encoded proteins with homologs from three 'control'  genomes that do not have uptake sequences.  We had tentatively planned to first describe the finding that genes that don't have uptake sequences are less likely to have homologs in the control genomes (call this analysis A), then move to the genes that do have homologs in all three control genomes, first considering the overall degree of similarity (call this analysis B) and then considering the differences between the places where the uptake sequences are and the rest of the proteins (call this analysis C). 

But I was having a hard time seeing how to motivate the first analysis - it didn't easily connect with what came before. Now I think a better plan is to first describe analysis B, which shows that genes with more uptake sequences have lower similarity scores (actually identity scores - my collaborator and I disagree about the relative value of using similarity and identity scores). 

By describing analysis B first, we can more logically introduce the strategy of using alignments to homologs from a standard set of control genomes, before discussing the genes that lack homologs.  And the results of analysis B then motivate analysis A, looking at the genes that had to be omitted from analysis B because they didn't have all three homologs.

Organization of the Results

My bioinformatics collaborator and I are making good progress on both the writing of our manuscript and on improving the data that's going into it.  One improvement of the latter kind came when I redid an analysis I first did about a year ago, this time being more meticulous about the input data and more thoughtful about how I used it.

The question the data address is why uptake sequences in coding regions are preferentially located in particular reading frames, specifically the effect of the codons they use on how efficiently the mRNA can be translated.  I had used the overall frequencies of different codons in the proteome (all the proteins the genome encodes) to estimate codon 'quality', by summing the frequencies of the three codons specified by the uptake sequence in a specific reading frame, and dividing this by the sum of the frequencies of the best (most-used) codons for the same amino acids.  This analysis showed no relationship between codon quality and the reading frame bias.  

In retrospect I should have multiplied the frequencies rather than adding them.  This both gives a more realistic approximation of how the codons' effects interact, and eliminates the differences caused by the differing frequencies of different amino acids in the proteome.  I should also have excluded those uptake sequence positions that allowed more than one codon. 
And I probably should have done the analysis for different species rather than only for H. influenzae (though, to be fair to myself, at the time I was working on an H. influenzae USS analysis).

So now I've redone the analysis for the three species that exemplify the different uptake sequence types (H. influenzae, A. pleuropneumoniae and N. meningitidis), and find a substantial effect of codon quality.  That is, in each species, uptake sequences are mostly found in reading frames that allow use of the most common codons for those amino acids.  As codon frequency is thought to reflect the abundance of the corresponding tRNAs, this means that differences in the efficiency of translation explain some of the differences in reading frame use.
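The corrected codon-quality calculation can be sketched as follows (a sketch in Python; the codon frequencies below are invented for illustration, since the real values come from tabulating the proteome):

```python
# Hypothetical per-codon usage frequencies, grouped by amino acid.
codon_freq = {
    "CTG": 0.40, "TTA": 0.05,   # Leu (two of the six codons shown)
    "GGT": 0.35, "GGA": 0.15,   # Gly
    "AAA": 0.70, "AAG": 0.30,   # Lys
}
# Best (most-used) codon frequency for each amino acid, among those shown.
best_freq = {"Leu": 0.40, "Gly": 0.35, "Lys": 0.70}

def codon_quality(uss_codons, amino_acids):
    """Product of the USS-specified codons' frequencies, divided by the
    product of the best codons' frequencies for the same amino acids
    (multiplying, not summing, so amino-acid abundance cancels out)."""
    used = 1.0
    best = 1.0
    for codon, aa in zip(uss_codons, amino_acids):
        used *= codon_freq[codon]
        best *= best_freq[aa]
    return used / best

# A frame whose USS-specified codons are all preferred ones scores 1.0;
# a frame that forces rare codons scores much lower.
print(codon_quality(["CTG", "GGT", "AAA"], ["Leu", "Gly", "Lys"]))  # 1.0
print(codon_quality(["TTA", "GGA", "AAA"], ["Leu", "Gly", "Lys"]))
```

Because the ratio is a product over positions, it no longer depends on how common each amino acid is in the proteome, only on how preferred the forced codons are.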

Collaborative writing

My bioinformatics collaborator is in town till next Friday, so we can work together to finish up our uptake sequence manuscript.  We have done (she has done) a lot of different analyses addressing various questions about how uptake sequence accumulation has affected protein evolution and vice versa, and I'm having a hard time keeping them all straight (partly because I haven't been thinking about this project for a while).  

The manuscript is mostly written already, but some parts of the Results are still up in the air because they were waiting for some final data.  It also needs polishing and some revising to incorporate results from a Neisseria uptake sequence paper that came out a few months ago.

To cope with all the data we've been occasionally creating 'Flow of Results' pages that summarize our results and conclusions, in the order we think the paper should present them.  Yesterday we began our final push by going through the most recent version of the 'Flow of Results'.  I'm not sure we have everything in the perfect order, but we have an order that's good enough, with each result triggering a question that's addressed by the next result.

Speedy simulations

Our programming assistant has finished his work for us.  At least, he's finished the period when he's employed by us, but now he's gotten the research bug and wants to stay involved to see how the program he's developed for us works out.

At present it works great - MUCH faster than before.  One of the reasons is that it no longer scores the genome in every cycle.  I'm doing a run now with a big genome (100 kb) and I can see it pause at every 100th cycle, as it takes the time to score the genome.  This is a sensible improvement, as the genome score isn't needed for the progress of the simulation - it just lets the user monitor what's happening, and provides the information used to decide whether the run has reached equilibrium.
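The change amounts to gating the expensive scoring step on the cycle counter; a minimal Python sketch of the idea (not the actual Perl, and the interval and function names are invented):

```python
import random

SCORE_INTERVAL = 100  # hypothetical; the real interval is a run setting

def score_genome():
    # Stand-in for the expensive sliding-window scoring of the genome.
    return random.random()

def run(n_cycles, score_interval=SCORE_INTERVAL):
    scores = []  # genome scores, now recorded only at scoring cycles
    for cycle in range(1, n_cycles + 1):
        # ... mutate fragments, score them, recombine ... (omitted)
        if cycle % score_interval == 0:
            scores.append(score_genome())
    return scores

print(len(run(1000)))  # the genome is scored 10 times instead of 1000
```

The simulation itself never consults the genome score, so skipping it changes nothing except how often the operator (and the equilibrium test) get updates.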

I'm using this run to test whether the final accumulated USS motif precisely matches the biases specified in the recombination decisions (i.e. in the matrix used to score each fragment), by using a matrix where one position has substantially weaker bias than the others.  If the final motif matches the matrix bias, this position should show a weaker consensus in the accumulated uptake sequences.  

Of course to test this I may have to recall how to run the Gibbs Motif Sampler on the Westgrid server.  The alternative is to chop my 100 kb output genome into 10 pieces and run each separately on the Gibbs web server, which has a sequence limit of 10 kb.

Sex and Recombination in Minneapolis?

The meeting on Sex and Recombination that was to be held in Iowa City next week has been canceled because of severe flooding! (And just when I was getting my talk pulled into shape too...)

I suspect I'm not the only one who planned to combine that meeting with the big Evolution meetings in Minneapolis, June 20-24. I'm flying into Minneapolis on June 15. If you'd like to get together with me or other sex-refugees, post a comment below and we'll see what we can organize.

Preparing talks

I've been struggling to pull together ideas and data for the two talks I'm giving (next week and the week after) at evolution meetings.  Yesterday was my turn to give lab meeting, and I used it to get help from the post-docs.  My draft talks were a mess, but the post-docs had lots of excellent suggestions.  Both talks will use the same introduction to the evolutionary significance of bacterial DNA uptake, and will then diverge. 

On the USS-evolution simulation front, the model is running nicely and I'm using it to quickly collect data for my first talk (20 minutes, for the Sex and Recombination meeting).  But I have to balance statistical power against run time, as the large genomes needed to get lots of sequence data take a long time to simulate.

On the proteome-evolution front, my bioinformatics collaborator just sent me the last of the data, including a control set needed for comparison with the analysis of how divergent USS-encoded peptides are from their homologs.

Creeping up to equilibrium?

One other issue about equilibrium:

In our previous (unrealistic) model we found that USS initially accumulated very quickly, as singly and doubly mismatched sites were converted to perfectly matched sites. But this happened at the expense of depleting the genome of those mismatched sites, and further accumulation of perfect sites required waiting a long time for mutation of worse sites to regenerate the singly and doubly mismatched ones, which would then slowly allow further increase in the number of perfect matches. So achieving true equilibrium took a long time.

I expect this phenomenon to also apply in this new model. So I'm not at all confident that an early leveling-off of the rate of increase indicates closeness to the true equilibrium.

In the graphs to the left, the upper simulation probably hasn't reached equilibrium after 90,000 cycles (because the blue points indicating genome scores are still increasing), but the lower one has (because the blue points seem to be scattered around a stable mean).

I'm not sure why the lower run reached equilibrium so much faster than the upper one. Several factors differed - this is why I need to be more systematic. My excuse is that it's easier to motivate a systematic approach when individual tests are fast to do, and there are so many variables to test that I hate to spend a lot of time on just one. But it's time to treat this like real science.

Simulating USS evolution

Yes, the Perl model has progressed to the point where it's now a research tool. But I now need to focus my use of it, to get useful data rather than just noodling around to see what happens.

One remaining uncertainty is deciding when a simulation has reached equilibrium, where forces increasing the frequency of USS-like sequences are balanced by forces decreasing it. So far I've been running simulations for a specified number of cycles rather than 'to equilibrium', because I'm not confident that the stopping criterion will correctly identify equilibrium conditions. Now I guess I should take the settings from runs that did reach what I consider to be equilibrium, and rerun them 'to equilibrium' instead of to the specified number of cycles.

A problem is that the runs still take quite a long time. For example, last night I started a run using a 50kb genome, and taking up 100bp fragments. Although it was close to equilibrium after about 6 hours, the equilibrium criterion hasn't been met yet (because this criterion is quite conservative). Maybe we should use a less-conservative criterion, at least for now, because we're really mainly interested in order-of-magnitude differences at this initial stage.

One useful pair of runs I've done examined the effect of having a non-zero genome mutation rate. This is of course the only realistic treatment, but in the 'testing' runs we've had the genome mutation rate set to zero, with mutations occurring only in the fragments being considered for recombination, because otherwise USS-like sequences didn't accumulate. Both of the new runs considered a 20kb genome and 200bp fragments, with a fragment mutation rate of 0.05 per cycle. One of these runs had a genome mutation rate of zero; the equilibrium genome score was 3 x 10^9, 100-fold higher than the starting score. The other run had a genome mutation rate of 0.001; its final score was only 4 x 10^8.

This isn't surprising because mutations are much more likely to make good USS-matches worse than better, and this degeneration is only countered by selection for match-improving mutations (and against match-worsening mutations) in those fragments that recombine. So targeting mutation to fragments that might recombine increases the ratio of selected match-improving mutations to unselected mutations. Another way to look at it is that the whole genome gets mutated at its rate every generation, and none of these mutations is selected for or against unless it subsequently changes due to a new mutation arising in a fragment.

It may be that setting lower mutation rates for genomes than for fragments is equivalent to assuming that, on average, fragments are from genomes of close relatives separated by R generations from the genome under consideration (where R is the ratio of fragment rate to genome rate). This is probably a reasonable assumption.

Another issue is how much of the genome can be replaced by recombination each cycle. I've been keeping this down to about 10%, but any value can be justified by having each 'cycle' represent more or fewer generations. So if we want a cycle to represent 100 generations, we should have the amount of recombination equivalent to 100 times the amount of recombination we might expect in a single generation. As we don't even know what this number should be, I guess there's no reason not to have 100% of the genome replaced each cycle.

I don't think there's any benefit to having more than 100% replaced, as each additional recombination event would undo the effect of a previous one. Hmm, could this be viewed as a variant of the genome-coverage problems that arise in planning shotgun-sequencing projects? They want to maximize the genome coverage while minimizing the amount of sequencing they do. Here we want to maximize the amount of the genome replaced while minimizing the amount of wasteful multiple replacements. The difference is that, for genome projects, it's important to cover almost all the genome - covering 99% is MUCH better than covering only 90%, so it's worth doing a lot more sequencing. For us, the emphasis is more on avoiding wasteful recombination, and the difference between replacing 99% and replacing 90% is worth only 9% more fragment screening. I guesstimate that the best compromise will be replacing about 50-75% of the genome in each cycle.
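The shotgun-coverage analogy can be made quantitative. Assuming recombining fragments land at random, the expected fraction of the genome replaced follows the same 1 - e^(-c) curve used in planning sequencing projects, where c is the total recombined length in genome-equivalents (a sketch, under that random-placement assumption):

```python
import math

def expected_fraction_replaced(c):
    """Lander-Waterman-style estimate: if randomly placed fragments total
    c genome-equivalents of sequence, the expected fraction of the genome
    replaced at least once is 1 - e^(-c)."""
    return 1.0 - math.exp(-c)

for c in (0.5, 1.0, 2.0, 3.0):
    print(f"{c:.1f} genome-equivalents -> "
          f"{expected_fraction_replaced(c):.0%} replaced")
```

One genome-equivalent of recombination replaces only about 63% of the genome; getting to 90% or 99% takes roughly 2.3 or 4.6 genome-equivalents, which is where the diminishing returns from multiple replacements show up.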

I've raised this issue before (point 2 in this post): One problem is that, as the genome evolves to have more USS-like sequences, the number of fragments that pass the recombination criterion increases. So the above discussion applies mainly at equilibrium, when the genome will have the most USS-like sequences. We control the number of fragments that recombine by specifying the number of fragments to be considered (F) and by explicitly setting a limit (M) (e.g. max of 10 fragments can recombine each cycle). Early in the run F needs to be high, or it will be many cycles before a significant number of fragments has recombined. But a high F late in the run has the simulation wasting time scoring many fragments that will never get a chance to recombine. At present I've been setting F to be 5 or 10 times larger than M, but maybe I should try reducing F and increasing M.

Means, arithmetic and geometric (= additive and multiplicative)

Now that we've replaced our old additive scoring algorithm with a multiplicative one (see this post), the algorithm that decides whether the simulation has reached its equilibrium isn't working well. The post-doc suggests that this is because we need to track the changing score of the genome using a multiplicative mean rather than an additive mean.

The test for equilibrium works as follows: The USS-score of the genome is calculated each generation, and is used to calculate both a "recent" mean score (over the interval between status updates, usually specified as a percent of the elapsed cycles) and a "grand" mean score (over the entire run). Both means are calculated as simple averages (sum of the scores divided by the number of scores). Early in the run the grand mean is much smaller than the recent mean. Equilibrium conditions are satisfied when the % difference between these means becomes smaller than some pre-specified threshold.
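A minimal sketch of that equilibrium test (in Python for illustration; the names and the threshold value are invented):

```python
def at_equilibrium(recent_scores, all_scores, threshold=0.01):
    """Compare the mean over the recent interval to the grand mean over
    the whole run; declare equilibrium when their fractional difference
    falls below a preset threshold."""
    recent_mean = sum(recent_scores) / len(recent_scores)
    grand_mean = sum(all_scores) / len(all_scores)
    return abs(recent_mean - grand_mean) / grand_mean < threshold

# Early in a run the grand mean lags far behind the recent mean:
print(at_equilibrium([100, 102, 98], [1, 5, 20, 60, 100, 102, 98]))  # False
# Once scores plateau the two means converge:
print(at_equilibrium([100, 102, 98], [99, 101, 100, 100, 102, 98]))  # True
```

Both means here are simple arithmetic averages, which is exactly the part that misbehaves under multiplicative scoring.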

With additive scoring this worked well regardless of the length of the genome. But with multiplicative scoring, a single base change can cause a dramatic change in the score (ten-fold or more, depending on the scoring matrix used), especially when using short genomes. The post-doc, who is much more statistically sophisticated than I am, says that when numbers differ by large values, their means should be calculated geometrically rather than arithmetically.

Wikipedia explains that arithmetic means are the typical 'average' values, and geometric means are calculated using products and roots rather than sums and division. I'll say that another way: a geometric mean is calculated by first calculating the product of all the n values (rather than their sum) and then taking the n-th root of this product. Luckily this can be easily done using logarithms.
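The log trick, in a sketch (Python for illustration), along with why the geometric mean behaves better when a single base change makes one score jump:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # The n-th root of the product, computed via logarithms:
    # exp(mean(log(x))) == (x1 * x2 * ... * xn) ** (1/n),
    # which avoids overflowing the product for long score lists.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# One ten-thousand-fold jump, of the kind a single base change can cause
# under multiplicative scoring:
scores = [1e2, 1e2, 1e2, 1e6]
print(arithmetic_mean(scores))  # 250075.0 -- dominated by the outlier
print(geometric_mean(scores))   # ~1000 -- reflects the typical score
```

The arithmetic mean is dragged four orders of magnitude toward the one outlier, while the geometric mean stays near the typical score, which is why the equilibrium comparison should be done on geometric means when scores are multiplicative.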

The post-doc has modified the code to do this, but she's now gone off to Barcelona (!) for the Society for Molecular Biology and Evolution meeting. I haven't yet tried out her version, but checking whether it finds better equilibria will be much easier now that the runs go so much faster.

Thank you for the comments!

The profiling I did yesterday, using DProf as suggested in a comment from Keith, showed that most of the runtime was spent in the Switch statements that are the heart of the sliding-window scoring algorithm. In new comments, Keith and Conrad explained that 'Switch' is not the fastest way to do the scoring, and that replacing it with a cascade of if/else statements could be a lot faster. (Faithful commenter Neil had also pointed out that Switch is buggy, but it wasn't setting off any alarms.)

So I've just replaced all three of the switch scoring steps with if/else cascades, and here are the spectacular results.  Thanks, guys!
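The shape of the fix, sketched in Python for illustration (the real code is Perl, and the matrix values here are invented): the Switch dispatch becomes a plain if/elif cascade that looks up the base's column in the position-weight matrix.

```python
# Each row is one window position; columns hold scores for A, T, C, G.
matrix = [
    [0.9, 0.03, 0.03, 0.04],   # position 0
    [0.05, 0.05, 0.05, 0.85],  # position 1
]

def base_score(base, column):
    # The if/elif cascade that replaces the Switch::case dispatch.
    if base == "A":
        return column[0]
    elif base == "T":
        return column[1]
    elif base == "C":
        return column[2]
    else:  # "G"
        return column[3]

score = 1.0
for t, base in enumerate("AG"):
    score *= base_score(base, matrix[t])
print(score)  # product of the per-position scores for "AG"
```

The behaviour is identical to the switch version; the win is purely that a bare comparison chain avoids the per-call overhead of the Switch machinery in the innermost loop.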

Benchmarking(??) the USS model

(I don't think 'benchmarking' is actually the right term for what I've been doing...)

The comments on my last post recommended using the Perl profiling function to find out which parts of our simulation are indeed responsible for its slow runtime. So I Googled "DProf" and ran it on the simulation, and then poked around to find out how to use 'dprofpp' to turn the DProf output into a table listing where most of the time was being spent. Pretty soon I discovered that I just needed to type 'dprofpp' at the usual prompt. (duh).

Then I spent quite a while trying to figure out what the profile entries meant. Here's an example of the output.

  • Total Elapsed Time = 65.63472 Seconds
  • User+System Time = 63.53472 Seconds
  • Exclusive Times
  • %Time ExclSec CumulS #Calls sec/call Csec/c Name
  • 41.6 26.44 37.689 543882 0.0000 0.0000 Switch::case
  • 28.9 18.37 18.373 241893 0.0000 0.0000 Switch::switch
  • 17.7 11.24 11.243 543882 0.0000 0.0000 Switch::__ANON__
  • 16.5 10.52 38.066 12 0.8768 3.1722 main::tallyScores
  • 0.25 0.156 0.195 3065 0.0001 0.0001 Text::Balanced::_match_quotelike
  • 0.23 0.146 0.227 3692 0.0000 0.0001 Text::Balanced::_match_variable
  • 0.23 0.143 0.608 44 0.0032 0.0138 Switch::filter_blocks
  • 0.15 0.095 0.128 1599 0.0001 0.0001 Text::Balanced::_match_codeblock
  • 0.06 0.040 0.040 7432 0.0000 0.0000 Text::Balanced::_failmsg
  • 0.04 0.027 0.027 13146 0.0000 0.0000 Regexp::DESTROY
  • 0.03 0.020 0.029 9 0.0022 0.0032 Switch::BEGIN
  • 0.02 0.010 0.010 1 0.0100 0.0100 main::createRandomGenome
  • 0.02 0.010 0.010 1 0.0100 0.0100 DynaLoader::dl_load_file
  • 0.00 0.000 0.000 1 0.0000 0 Exporter::Heavy::heavy_export_ok_tags
  • 0.00 0.000 0.000 1 0.0000 0.0000 Exporter::Heavy::heavy_export

(I tried putting the above into a monospaced font for clarity, but the editor collapsed all the extra spaces so that wasn't much help. Putting it in point form stops the editor from collapsing all the single line breaks into one big paragraph.)

The problem is that our program wasn't written with this profiler in mind, and the big time hogs aren't separate subroutines but loops within the main program. We could change this, but here 'we' means our programming assistant, not me.

I did figure out that the various 'switch' entries are steps in the sliding-window scoring.  But this scoring happens in different components of the program and in different ways.  I tried running the profiler on different versions of the simulation (large or small genome sequences, large or small fragments, many or few fragments) but I couldn't disentangle the effects in the resulting profiles.
So instead I began just recording the runtimes taken by sets of runs in which I systematically varied only one of the input parameters, with the simulation set to run for only 10 cycles. The first graph shows the effect of changing the genome size but holding the fragment size and number of fragments constant (at values so low that they don't contribute much to the runtime).

The first discovery is that the 'tally' we added to help us see how the genome evolves almost doubles the runtime for each genome length (compare the blue line to the red line). This is because obtaining the tally requires a sliding-window scoring step. Even though this scoring doesn't use the position-weight matrix, the time it takes is similar to the time needed for the formal scoring of the genome. The good news is that this step is completely dispensable.

The second discovery is that the steps that mutate the genome and the fragments add almost nothing to the runtime (compare the green and red lines). So we won't bother trying to further optimize the mutagenesis.

The third discovery is that the runtime is proportional to the length of the genome. Because the fragments are few and small, this is almost certainly due to the time needed for the sliding-window scoring. This tells us that reducing the number of times the genome needs to be scored should dramatically speed up the simulations.

The second graph shows how runtime depends on the number of fragments being tested. The fragment size was held at 200bp (ten times more than in the previous set of runs), and the genome length was 10kb.

Here we see that including the tally step increases all the runs' times by about 20 seconds (compare blue and red lines).  This is just what we expect for a 10kb genome from the previous graph.  And again the mutation steps in each cycle contribute almost nothing to the runtime.

The runtime is directly proportional to the number of fragments, consistent with the sliding-window scoring being the major factor.

To confirm that other steps in processing fragments aren't a big problem, I also varied fragment length while holding fragment number constant. This third graph shows that runtime is linearly dependent on fragment length.

Finally, I held the total length of fragments to be screened constant (the product of fragment length and number of fragments screened each cycle). If the time needed for the sliding-window scoring of the fragments is the main factor responsible for the increasing runtime, then holding the total length constant should give a constant runtime even when fragment length and number are reciprocally varied.

That's exactly what we see, provided the fragments are larger than about 100bp.  (As the fragment size decreases to approach the size of the window, the number of positions the window can take decreases faster than the fragment size; a 10bp fragment can be scored at only one window position.)  The runtime is proportional to the length of sequence being scored each cycle.
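The window-position arithmetic behind that parenthetical can be made explicit (a sketch; the 10 bp width comes from the USS motif):

```python
WINDOW = 10  # width of the sliding window (the USS motif length)

def window_positions(seq_len, window=WINDOW):
    """Number of positions a sliding window of the given width can take
    on a sequence of length seq_len."""
    return max(0, seq_len - window + 1)

# A 10bp fragment can be scored at only one window position, so for small
# fragments the scored length shrinks faster than the fragment length:
for length in (10, 20, 100, 200):
    print(length, window_positions(length))
```

For fragments much longer than the window, positions ~ length, so the total scored length (and hence runtime) scales with total fragment length; below ~100bp the "- window + 1" correction starts to matter, which is the bend seen in the graph.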

This suggests one more test.  I've been keeping either the genome length or the total fragment length constant.  What if I plot runtime as a function of the total length of sequence (genome plus fragments) being scored each cycle?

If the sliding-window scoring is indeed the major factor affecting runtime, then there should be a linear relationship between total sequence and runtime. And voila, here it is.

Speeding up the simulation of USS evolution

Our Perl model of USS evolution now runs fine and does just about everything we want it to. Unfortunately it takes a long time to do this. To date we haven't bothered much about speed, being more concerned with just being able to do the job. But now it's time to improve its efficiency. I see several points where big improvements can be made, but first I should describe the steps the simulation presently takes.

1. Create genome sequence. This is done only once per run so speed isn't important.

In each cycle: (Efficiency is much more important for steps done every cycle.)

2. Mutate the genome, and choose and mutate a specified number of fragments of the genome. I think the mutation step is already quite efficient. The number of fragments that need to be processed each cycle can probably be reduced - I'll describe this below.

3. Score each fragment using a sliding window. Score the whole genome too. This scoring is probably by far the most time-consuming step.

4. Decide whether a fragment should recombine back into the genome, and do the recombination.  Not a big time-suck, I think.

5. Make decisions about whether results should be reported and whether the simulation has reached equilibrium. Then report, etc.  Not a big time-suck, I think.

I can see four ways to reduce the amount of time used by the sliding-window scoring:

1.  Don't score the genome every cycle.  The genome score is used to give the operator a sense of how the run is progressing - this needs only be done at the reporting intervals, rather than after every cycle - and to decide whether equilibrium has been reached.  To make this decision the simulation calculates a mean score over specified intervals and over the whole run.  I think we could devise a way to get most of the information the equilibrium decision needs using genome scores calculated much less often, perhaps even calculated only at the reporting intervals.

2.  Score only the number of fragments needed for recombination.  The specified number of fragments to be mutated and scored in each run (F) is usually much larger than the maximum number of fragments allowed to recombine with the genome (M); at present these values are set at F=100 and M=10.  When the genome has few or no strong matches to the USS consensus (at the start of a run), few or no fragments meet the recombination criterion, so every fragment needs to be considered.  But later in the run much time is wasted scoring fragments that won't be considered for recombination.  

Rather than mutating and scoring the full F fragments and then considering them for recombination until the M limit is reached, the simulation could just get M fragments, mutate, score and consider those, and then get another batch, repeating until a total of M fragments have recombined.

3.  Stop scoring each fragment once its score reaches a preset maximum.  The program already has an implicit maximum score for fragments, set as the range of the probability a fragment will recombine.  But at present the sliding window scores the whole fragment.  It could instead stop scoring each fragment once this limit is reached.  As in improvement 2, this won't have much effect at the start of a run, when few USS-like sequences are present, but should speed things up a lot later in the run.

4.  Improve the efficiency of the sliding-window scoring algorithm.  This is where we'd appreciate advice from any programming masters who read this blog.  I'm going to paste the code for this bit below, but unfortunately I don't know how to format it so it's easily readable (all the tabs collapse).  Maybe I'll first put the code without comments (tidy and compact) and then put it again with the comments.

while ($slideCount > 0)
{
    my $t = 0;
    my $forwardScore = 1;
    while ($t < $windowSize)
    {
        my $currentBaseInWindow = substr($fragment, ($counter + $t), 1);
        switch ($currentBaseInWindow)
        {
            case "A" { $forwardScore = $forwardScore * $matrix[$t][0]; }
            case "T" { $forwardScore = $forwardScore * $matrix[$t][1]; }
            case "C" { $forwardScore = $forwardScore * $matrix[$t][2]; }
            case "G" { $forwardScore = $forwardScore * $matrix[$t][3]; }
        }
        $t++;
    }
    $windowScore = $forwardScore;
    if ($windowScore > $fragmentScore)
    {
        $fragmentScore = $windowScore;
    }
    $slideCount--;
    $counter++;
}

Here's the version with comments:

while ($slideCount > 0)    # Keep scoring window positions until slideCount falls to zero.
{
    # Prepare to score the 10 bases in the window at its present position.
    my $t = 0;             # t is the position of each base contained in the window, with respect to the 5' end of the window.
    my $forwardScore = 1;  # Score for the 10 bp sequence at each window position, compared to forward USS.

    # This code scores the 10 bases of the fragment that are in the window, from t=0 to t=9.
    # It first gets the base at each position t, and then gets a score for this base from the scoring matrix.
    while ($t < $windowSize)
    {
        my $currentBaseInWindow = substr($fragment, ($counter + $t), 1);  # The base will be A or T or G or C.
        switch ($currentBaseInWindow)
        {
            case "A" { $forwardScore = $forwardScore * $matrix[$t][0]; }
            case "T" { $forwardScore = $forwardScore * $matrix[$t][1]; }
            case "C" { $forwardScore = $forwardScore * $matrix[$t][2]; }
            case "G" { $forwardScore = $forwardScore * $matrix[$t][3]; }
        }
        $t++;              # Prepare to consider the next base in the window.
    }
    $windowScore = $forwardScore;
    if ($windowScore > $fragmentScore)
    {
        $fragmentScore = $windowScore;  # Save only the highest score.
    }
    $slideCount--;
    $counter++;            # Now we slide the window to the next position in the fragment sequence.
}
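Improvement 3 above can be grafted onto this loop; here's a sketch in Python for illustration (the real code is Perl, and the demo matrix values are invented):

```python
BASE_INDEX = {"A": 0, "T": 1, "C": 2, "G": 3}

def fragment_score(fragment, matrix, max_score):
    """Best sliding-window score for the fragment, returning early once
    the preset maximum is reached (improvement 3): no later window can
    make the fragment's score any more useful."""
    window = len(matrix)
    best = 0.0
    for start in range(len(fragment) - window + 1):
        score = 1.0
        for t in range(window):
            score *= matrix[t][BASE_INDEX[fragment[start + t]]]
        if score > best:
            best = score
            if best >= max_score:  # no point sliding further
                return best
    return best

# A toy 2-position matrix favouring "AT"; values invented.
demo_matrix = [[2, 1, 1, 1], [1, 2, 1, 1]]
print(fragment_score("ATTA", demo_matrix, max_score=1e9))  # prints 4.0
```

As the post notes, the early return saves little at the start of a run (few windows hit the maximum) but should save a lot late in the run, when many fragments carry strong USS matches near their 5' ends.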


Why won't USSs accumulate in our model?

Our Perl model of USS evolution in a genome runs well, but USS-like sequences accumulate only slightly. I've been playing around with various factors that might be responsible but haven't really gotten anywhere. I need to get these factors clear in my mind, so it's time for a blog post about them.

What the model does (before I started fiddling with it on Friday): (I should clarify that this version (USSv5.pl) doesn't yet have all the formatting bells and whistles that would make it graceful to use, but it does have the basic functionality we think we want.)

It first creates a random 'genome' sequence of specified length and base composition, whose evolution the model is going to simulate.  In each evolutionary cycle it first takes random segments of this genome, mutates them according to a specified mutation rate, and scores them for sequences similar to a sequence motif specified in the form of a matrix (see figure in this post for examples).  This version of the model uses a new multiplicative scoring procedure rather than our original additive procedure.  Each segment's sequence has a chance to replace its original sequence, with probability proportional to its score expressed as a fraction of some preset maximum score.  The modified genome is scored by the same procedure used for the segments, and then becomes the starting sequence for the next cycle.  (We had intended that the genome would undergo mutation in each cycle, but this step has been bypassed because it was causing the USS-like motifs to degenerate faster than they accumulated.)
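One such cycle can be sketched as a toy in Python (this is not the real USSv5.pl; all names are invented, and the stand-in scoring function just favours A-rich fragments where the real model uses the sliding-window matrix score):

```python
import random

random.seed(1)
BASES = "ATCG"

def mutate(seq, rate):
    # Each base mutates to a random different base with probability `rate`.
    return "".join(
        random.choice(BASES.replace(b, "")) if random.random() < rate else b
        for b in seq)

def cycle(genome, n_fragments, frag_len, mut_rate, score, max_score):
    """One evolutionary cycle: take random fragments, mutate them, score
    them, and let each replace its original sequence with probability
    proportional to score / max_score."""
    genome = list(genome)
    for _ in range(n_fragments):
        start = random.randrange(len(genome) - frag_len + 1)
        fragment = mutate("".join(genome[start:start + frag_len]), mut_rate)
        if random.random() < score(fragment) / max_score:
            genome[start:start + frag_len] = fragment  # biased recombination
    return "".join(genome)

genome = "".join(random.choice(BASES) for _ in range(1000))
for _ in range(20):
    genome = cycle(genome, n_fragments=10, frag_len=50, mut_rate=0.05,
                   score=lambda f: f.count("A"), max_score=50)
print(len(genome))  # recombination replaces sequence but preserves length
```

The key structural point is that selection acts only at the recombination step: mutations that never ride in on an accepted fragment are invisible to the bias.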

We first tested the additive and multiplicative scoring procedures to see how much difference a perfect USS sequence made to the score of otherwise random-sequence genomes. As we already knew, the additive procedure gave sequences with USS scores that were at best only about 1% higher than sequences without USS - the precise difference depends on the length of the sequence (we tested 100, 1000 and 10000 bp) and on the numbers in the scoring matrix.

The scores obtained with the multiplicative procedure were far more sensitive to the presence of a USS. For the 100bp test sequences, scores with USS were from 2-fold to 700-fold higher than for the same sequence without a USS, depending on how much higher USS-consensus bases scored than non-consensus bases. The lowest sensitivities were seen when this ratio was 2 for all positions, with higher sensitivities when the ratios were from 5-fold to 50-fold.

So this looked very promising - with such a sensitive scoring system I expected USS-like sequences to accumulate rapidly and to a high frequency. But this didn't happen. The genome scores did go up dramatically, but this turned out to be due to the much more sensitive scoring system acting on only a few base changes.

I played around with different genome sizes and fragment sizes and numbers and mutation rates and matrix values, but nothing seemed to make much difference. "Seemed" is the right word here, as I didn't keep careful records. I created or reinstated various reporting steps, so I could get a better idea of what was going on. I also replaced the preset maximum segment score with a variable score (= 5% of the previous cycle's genome score), so that the strength of the simulated bias would increase as USS-like sequences accumulated in the genome.

But USS-like sequences didn't accumulate much at all, and I don't know why. There could be a problem with the code, but none of the informal checks I've done has set off any alarm bells. There could instead be a fundamental problem with the design of the simulation, so that what we are telling the simulation to do cannot lead to the outcome we expect. Or perhaps only a very small portion of 'parameter-space' allows USS-like sequences to accumulate.

The post-doc and I came up with two approaches. One is to meticulously investigate the effects of the various factors we can manipulate, keeping everything else as simple and constant as possible. The other is to use our brains to think through what should be happening. While we're doing this our programming assistant will be adding a few more sophisticated tricks to the program, to make our analysis easier.

I'll end with a list of the factors we need to examine methodically.

Progress on USS bioinformatics

Yesterday my bioinformatics collaborator came by for a collaborative visit - she's in town for a few days. I've been quite frustrated by our slow, sporadic email correspondence, so it was great to discuss the work face-to-face.

We clarified a number of issues about the data - because she's generating it and I'm doing the writing about it, this is where all our problems lie. There's only one data set still to generate, a comparison of the peptides encoded by uptake sequences with their homologs in species that don't have uptake sequences. But there's still lots of writing to be done. I'm hoping she'll be able to come work with us for a couple of weeks early in the summer, so we can work on the writing together.

Biological relevance of USS scoring systems

In response to a recent post about how our USS-evolution model will score USS-like sequences, a commenter (Neil) says "I see ROC curves and cross-validation in your future." Google tells me that ROC (receiver operating characteristic) curves are graphs representing the relationship between a signal receiver's sensitivity and its specificity.  They thus represent the receiver's ability to detect true positive signals and its tendency to falsely report events that aren't true signals.

Does this apply to USS?  That's actually a scientific question about the nature of USS - are they really signals?  USS stands for 'uptake signal sequence', a name chosen because they were assumed to have evolved so competent bacteria could distinguish closely related 'self' DNA fragments from 'foreign' DNA.  Under this view the uptake machinery could be viewed as a signal receiver that needs to distinguish true USS (those on self DNA) from false-positive USS (chance occurrences of USS-like sequences in foreign DNA).  (Note for new readers: the conventional 'core' USS of Haemophilus influenzae is the sequence AAGTGCGGT.)

But we don't think that USS are signals, at least not in this sense.  Rather, our working hypothesis (null hypothesis) is that USS-like sequences accumulate in genomes as an accidental consequence of biases in the DNA-uptake machinery and recombination of 'uptaken' DNA with the genome.  (I put 'uptaken' in quotes because it's not a real word; I'm using it because it's clearer than 'incoming', the word we usually use.)  So rather than wanting a perfect scoring system to distinguish 'true' USS from 'false' ones, we would want it to reflect the real properties of the receiver (the DNA uptake machinery of real cells).

Unfortunately we don't know nearly enough about the uptake machinery to accurately describe it in a model.  We know that some positions of the USS are more important than others, and that sequences outside the core matter.  We have detailed analyses of the USS-like sequences that have accumulated in real genomes (see all my old posts about the Gibbs motif sampler), but we don't know how these sequences reflect the properties of the uptake machinery that caused them to accumulate.  That's one of the things we hope to use the model to clarify. 

For now, we don't really want to use a 'perfect' scoring system in our model.  Instead, we can treat different scoring systems as different hypotheses about how differences in DNA sequences affect uptake (different hypotheses about how the machinery works).  So we will start with very simple systems, such as those described in the previous post.  Once we know how those affect simulated uptake and the long-term accumulation of high-scoring sequences in the genome, we can then clarify how what we know about the real uptake machinery constrains our ideas of how these sequences evolve.

Testing the new scoring system

The undergrad is creating a new version of the USS model program with the scoring done multiplicatively rather than additively (see previous post). I was originally thinking we'd start by just playing around with it, to see what happens. But after a conversation with the post-doc I realize that we should start out being more systematic.

What we need to do first is just find out what scores are produced by different sequences using this system:
  1. What score does a random sequence of a specified length (and base composition) produce? We'll test 100, 1000 and 10000bp.
  2. What score does a sequence produce that differs only in containing one USS perfectly matched to the matrix consensus?
And we'll do the same tests with differently weighted matrices, using both additive and multiplicative scoring:
  1. Additively scored matrix with all consensus bases worth 1 and all non-consensus bases worth 0 (like the yellow one in the previous post).
  2. Additively scored matrix with consensus bases at different positions weighted differently, according to our measures of their contribution to uptake. For example, some consensus bases might be worth 1 and some 3 or 5 or 10.
  3. Multiplicatively scored matrix with all consensus bases worth the same value (say 2 or 5), and all non-consensus bases worth 1.
  4. Multiplicatively scored matrix with consensus bases at different positions weighted differently, but all non-consensus bases still weighted 1.
  5. Multiplicatively scored matrix with the different non-consensus bases also weighted differently, perhaps with some values smaller than 1.
Only after we've done these basic tests of the different scoring systems will we decide whether to test their effects on USS accumulation. These preliminary tests shouldn't take very long. It's just a matter of generating the 6 test sequences and the 5-10 test matrices in a text file (needing maybe half an hour), and then pasting the different permutations into the two (additive and multiplicative) test versions of the program. The actual scoring runs will only take a few seconds each.

Scoring USS-like sequences in our model (so blind!)

Months ago (last fall?) a post-doc and I spent what seemed like a lot of time at the whiteboard in the hall, considering different ways our planned USS model might score the sequences it was considering for their similarity to a USS motif.

We eventually settled on the crude system shown on the left (yellow table). It evaluates how well the DNA sequence in a 10-base window matches the USS core consensus. Each match to the consensus earns a point, with the total score for the sequence being the sum of the points it's earned. At the time, we realized that this way of scoring had two (or three?) big problems, but we needed something simple to get the model working so we settled for this.

The first problem is that the score is not very sensitive to how good the match is. The yellow numbers beside the table show the scores earned by specific sequences. A sequence matching at all 10 positions is only 11% better than a sequence matching at 9 positions, even though we know from real uptake experiments that some single base changes can reduce uptake by more than 95%. The second problem is that this method treats all 10 positions in the motif equally. But again our uptake experiments have shown that some positions in the motif affect uptake much more strongly than others.
The third problem is that random sequences have very high scores, and adding a single perfect-match USS to it increases this baseline score only slightly.

This morning the post-doc and I reconsidered the scoring system. We expected that finding a solution to these problems would be very difficult, but we quickly came up with a much better way, illustrated by the blue table on the right of the figure. The new method is to multiply the scores of the individual positions rather than summing them. This causes the scores of well-matched sequences to be dramatically higher than those of poorer matches. And we expect (though we haven't tested this yet) that the baseline score of a random sequence will be much smaller. For now we've given all non-consensus bases scores of 1, but these could be larger or smaller; for example, some bases at some positions could be worth only 0.1 of a point.
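The contrast between the two methods can be made concrete with a sketch. This is Python rather than Perl, and the match weights (1/0 additive, 5/1 multiplicative) are illustrative stand-ins for the yellow and blue tables, not their actual values; I've also used the 9bp core for simplicity:

```python
CONSENSUS = "AAGTGCGGT"  # the conventional core, used here for illustration

def additive_score(window, match=1, mismatch=0):
    """Yellow-table style: one point per position matching the consensus."""
    return sum(match if b == c else mismatch for b, c in zip(window, CONSENSUS))

def multiplicative_score(window, match=5, mismatch=1):
    """Blue-table style: multiply per-position scores instead of summing."""
    s = 1
    for b, c in zip(window, CONSENSUS):
        s *= match if b == c else mismatch
    return s

perfect = "AAGTGCGGT"
one_off = "AAGTGCGGA"  # mismatch at the last position
# Additive: 9 vs 8 - a perfect match scores only ~12% higher.
# Multiplicative: 5**9 vs 5**8 - a perfect match scores 5-fold higher.
```

The multiplicative rule makes each additional mismatch cost a constant factor rather than a constant increment, which is why well-matched sequences pull so far ahead.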

Now that the program is working, implementing a multiplicative scoring system should be simple. I'm tempted to try it right now, but I have lots of other things I should be doing, and I'd probably just get bogged down in technical problems anyway.

The Perl code had a bug, but I found it!

The bells-and-whistles version of the Perl model of USS evolution still had a bug, which became apparent once I fiddled with the fragment scoring system to strongly favour good matches, and turned off mutation of the genome sequence (so only the fragments mutated). The bug manifested itself in the program cycles stopping at fairly random points in the run (never stopping twice at the same cycle number or genome score, as far as I could tell).

After a LOT of careful detective work on my part, entirely unencumbered by knowledge of any Perl debugging tools, I found that a 'while' counter was being incremented at the wrong place (inside an 'if' instruction that was inside its 'while' loop, instead of just inside its 'while' loop). I still don't understand why this would cause the runs to stick at random points, but maybe the undergrad can explain it to me tomorrow.

(Confession added later: Solving the problem was not just the result of my careful detective work. The final discovery was helped by luck. I had added an 'else' statement to print a report that the next step had happened, but had incorrectly inserted one too many } brackets. In solving this I accidentally removed a different bracket than the one I had incorrectly inserted, which moved the while counter outside of the 'if' loop and, I discovered, eliminated the stopping problem.)

As usual, the Perl problem was the line feeds

Somehow, in being emailed to me, the USSv4a.pl and settings.txt files both acquired nasty Mac carriage returns instead of nice well-behaved Unix line feeds. This kind of problem has arisen often enough in the past that I knew to suspect it, but I had to figure out how Komodo deals with line feed issues before I could confirm that this was the problem and correct it.

Now the program runs fine, but it prints some reporting lines probably created by the undergrad while he was debugging the new bells and whistles....

...Three hours later.... I've found, understood and removed the unwanted reporting lines, and found and fixed a big mistake in how the program decided whether it was time to print out a report. And found and fixed several minor problems....

Not as simple a task as it should have been

Before he left last night the undergrad sent me what should be a fully functional version of our USS model, with the desired new features all working (they're described here).  But I just tried to run it on my home computer and absolutely nothing happens.  I enter "perl USSv4a.pl" at the Terminal prompt, and I just get another terminal prompt.

The problem is likely with the computer, not the program.  This is the first program I've tried to run on it since I replaced the hard drive a few months ago - as everything on the old drive was lost I was effectively starting from scratch.  I don't think I need to install perl - I've never had to do that before.  The USSv4a.pl file is in the Rosie directory, as is the settings.txt file it calls.  I could create a 'Hello World' perl file to test that perl is working but my perl book is at my office; maybe I can find one online ('Hello World' file, not perl book).

Yes I can (it's just one line), but now I discover that TextEdit will only save files as rtf, html, word and wordxml!  Wtf?  Guess I'd better install Komodo Edit.  OK, I remember that finding a Komodo Edit download site took a bit of looking (Komodo IDE was easy to find but costs a bundle), but Google found a site for downloading Komodo Edit.  Unfortunately what this has given me has a '.msi' suffix, and my Mac doesn't know what to do with it.  On looking more carefully at the source site, I see that it says something about bittorrent, which I don't know how to use.  Back to Google - OK, a site offering a Mac version, which is coming in (very slowly) as '.dmg'.

OK, got Komodo Edit, in Applications folder, opening for first time involves 'pre-loading' various bells and whistles, Mac gets VERY HOT (Fan Control maxed out).

OK, I created the Hello World file, and it ran fine.  Hmm, maybe there IS something wrong with the USSv4a.pl program.  Well, now I have Komodo Edit so I can take a look at it.  It looks fine, so I created a 'test' version with Hello World print lines at various places.  None of them gets printed.  I see that the undergrad has inserted a 'die' reporting step if the program can't open its settings file, so I think that's not the problem.  Instead Terminal acts as if the USSv4a.pl file and the new USSv4atest.pl file don't exist.

Back to the original USSv4a.pl file; now Komodo is claiming that every line fails to end with the expected "LF" end-of-line marker, but when I have it show the markers they all look fine.  The USSv4atest.pl file markers look exactly the same, and Komodo has no problem with them.

Time to quit working on this for now.  Maybe I'll send an email to the undergrad, or maybe I'll just wait till I get to my office (maybe not till tomorrow) and try to run it there.

Time to start preparing some talks

I have three talks to give in the next month and a half, so I need to start preparing them now.

First a 20- or 25-minute one at the annual workshop of the new CIfAR Program in Integrated Microbial Diversity, held somewhere not far from here, sometime close to the end of May. The guy in the next office invited me but he's out of town so I can't recheck the details. This talk will describe what we know about how natural selection has acted on the genes that cause bacterial genetic exchange. I think I can probably do this with slides I already have prepared.

Next, a 20-minute talk at a conference titled "Sex and Recombination: In Theory and In Practice", at the University of Iowa in mid-June. This talk will begin by introducing everything that the above talk will take 20 minutes to cover, and will then go on to explain how we are using computer simulations to understand how uptake sequences can accumulate in the genomes of competent bacteria. I hope to discuss results from our Perl model; the undergraduate is adding necessary embellishments (an oxymoron). (He thought he would have them all in place by late today but he left without passing the improved model on to me, so I suspect they're not quite debugged yet.) I don't yet have any model-specific slides for the talk, nor even a good idea of what the results will be.

And a few days after that, a 15-minute talk at the big Evolution meeting at the University of Minnesota. This will be on my work with the bioinformatician, on how the accumulation of uptake sequences in bacterial genomes has affected the ability of their genes to code for well-adapted proteins. Almost all the work is done here, and we have nice figures prepared for our manuscript. Unfortunately my collaborator has just redone some of the analysis and sent me new figures I don't completely understand. And her email response time has gotten very slow (though I guess this is only fair payback for my long silence while I was swamped with teaching).

My very old 'analytical' USS evolution model

I found my old USS evolution model in the files with my 1999 NIH grant proposal. It's not a computer simulation of evolution but analysis of equations describing an equilibrium. It uses only very simple algebra, so calling it an 'analytical' model is probably giving it more credit than it deserves. The introductory text gives an excellent description of the background, so here it is:
This model starts with the following assumptions:
  1. H. influenzae cells have a preexisting DNA uptake system that preferentially transports fragments containing a USS0 (the 9bp core: 5'AAGTGCGGT). Fragments with imperfect (USS1) sites are not favoured.
  2. Fragments of H. influenzae DNA are frequently brought into cells by this system.
  3. Once inside the cell these fragments recombine with and replace homologous regions of the chromosome.
  4. The DNA in the cells' environment comes from lysed cells ('donors') having the same distribution of USS0 and USS1 sites as the cells that are taking up DNA (the 'recipients').
  5. Random mutation acts on both USS0 and USS1 sites to destroy and create new USS.
How these assumptions cause USS0 to accumulate:
Because the DNA uptake mechanism is biased in favour of fragments containing USS0 sites, any donor DNA fragment containing a new mutation that has created a USS0 will be taken up more efficiently than the wildtype version of the fragment. Similarly, donor fragments containing new mutations that have eliminated a USS0 will be taken up less efficiently than their wildtype counterparts. Consequently the DNA fragments within cells will be enriched for USSs relative to the external DNA, and recombination between these fragments and the resident chromosome will increase the number of genomic USS0s more frequently than it will decrease it. The bias will similarly affect the fate of new mutations arising within the recipient cell; mutations removing a USS0 will often be replaced by donor fragments carrying the wildtype USS0, whereas mutations creating new USS0s in the recipient will less frequently be replaced. This biased gene conversion can thus both compensate for mutational loss and amplify mutational gain of USS0s, and will cause them to be maintained at a higher equilibrium number than would be expected for a random sequence of the same base composition.

How the model works:
This model addresses processes acting within a single genome. The model uses the observed equilibrium numbers of perfect and imperfect USS sites to derive an equation relating transformation frequency to the bias of the DNA uptake system. This equation tells us the values that the transformation frequency and uptake bias parameters would have to take in order to be responsible for the maintenance of perfect and imperfect USS at the high equilibrium ratio that we observe.
The model then defines its terms ('variables'? 'parameters'?). It assumes that the frequencies of USS0 and USS1 observed in the real H. influenzae genome represent an equilibrium between forces that create USS0s and forces that convert them into USS1s. It then derives an equation relating the bias of the DNA uptake machinery (enrichment of USS0 over USS1) to the transformation frequency (the probability that a USS site will be replaced by a fragment brought in by the uptake machinery in each generation). Conveniently, the mutation rate drops out of the equation.

Again quoting from what I wrote ten years ago:
What the final equation means: This equation tells us how frequent transformation must be (T), given a specified bias B of the DNA uptake system in favor of the USS, in order to fully account for the observed numbers of USS0 and USS1 sites in the Rd genome. The range of values is given below.

[Figure: the final equation and the table of corresponding B and T values, shown as an image in the original post.]
This is a pleasingly tidy result. It makes good biological sense, and the values are not unreasonable. Estimates of the actual bias vary, no doubt partly because they have been determined using USSs with different flanking sequences, but are usually between 10 and 100. We have no good estimate of actual transformation frequencies for H. influenzae in its natural environment, but if cells grow slowly and frequently take up DNA as food, then an average of 1% transformation per generation seems plausible, and even 10% not impossible.
I'm not very sure that this work is sound. The mutation rate dropping out is a bit suspicious, and I'm not sure how to interpret the transformation frequency when there are no other terms depending on generations (such as mutation rate).

Gene Transfer Agent evolution

My GTA colleague suggests that GTA may persist because of species-level selection. 'Species' is a tricky concept because these are bacteria, but we can simplify this to consider selection of lineages.

The basic idea is like that proposed to explain the surprisingly high frequency of bacterial lineages with defective mismatch-repair genes. Like most mutations, most GTA-mediated recombinational events will probably be deleterious. But some will be beneficial. Each time a beneficial recombination event occurs the lineage of cells descending from it will all contain GTA as well as the new combination. Provided the short-term and long-term costs of GTA don't cause the new lineage to go extinct or lose its GTA genes before the next beneficial GTA-mediated change occurs, lineages with GTA could take over.

Evolution-of-sex theorists have produced theoretical evaluations of such 'hitchhiking' processes, treating the allele in question as a 'recombination-modifier' (in this case the 'allele' would be the block of GTA genes). I haven't looked at the literature recently, but I think the general conclusion is that hitchhiking is common but weak; it is very unlikely to push an otherwise harmful allele to fixation. But these analyses weren't done for bacteria but for sexually reproducing eukaryotes, with the modifier controlling the frequency of meiotic crossovers between two other loci. I don't know how they would apply to bacteria.

Nevertheless, we know the ability of relatively rare beneficial events to preserve the GTA gene cluster must depend on how frequent the beneficial events are, how beneficial they are, how often GTA genes themselves undergo mutations that block their function, and how much (if any) harm the GTA genes cause. For example, if beneficial events are very rare, functional GTA genes may be lost by random mutation in the interval between beneficial events. Subsequent selection might cause the lineage to then go extinct, but it wouldn't bring the GTA genes back.

The important question is, how could we tell if this sort of thing is responsible for the persistence of the GTA genes? Short answer: I don't know.

USS don't accumulate because the bias is much too weak

In our very-preliminary version of the USS evolution model, we've been using a very simple scheme to score the similarity of DNA sequences to the USS motif. We just count the number of matches to the 10bp core USS sequence. Right now I'm keeping everything simple by running the model with DNA fragments that are only 13 bp long (and a genome that's 200-1000 bp long).

So a fragment with a perfect 10 bp match to the motif is only twice as likely to recombine back into the genome as a fragment with only 5 bp matching. We know from our earlier model, and from a calculation I did years ago, that the bias favouring USS needs to be much stronger than this if it is to overcome the randomizing effects of mutation. For example, (ignoring the effects of base composition) a sequence that matches the motif at 9 positions has 27 different ways to mutate into a fragment that matches at only 8 positions, and only one way to mutate into a fragment that matches at 10 positions. To overcome this disproportion, the bias favouring 9 over 8 (or is it 10 over 9?) has to be proportionally strong (i.e. 27-fold).
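The 27-to-1 disproportion is easy to confirm by brute force. A sketch in Python (the motif and test sequence here are hypothetical 10bp examples, not our actual matrix):

```python
def n_matches(seq, consensus):
    """Number of positions at which seq matches the consensus."""
    return sum(a == b for a, b in zip(seq, consensus))

def count_transitions(seq, consensus):
    """Tally all single-base mutants of `seq` by their new match count."""
    bases = "ACGT"
    tally = {}
    for i, b in enumerate(seq):
        for m in bases:
            if m == b:
                continue
            mutant = seq[:i] + m + seq[i + 1:]
            k = n_matches(mutant, consensus)
            tally[k] = tally.get(k, 0) + 1
    return tally

consensus = "AAGTGCGGTA"   # hypothetical 10bp motif
nine_of_ten = "AAGTGCGGTC"  # matches at 9 of the 10 positions
# Of the 30 possible single-base mutants: 27 drop to 8 matches,
# 1 rises to 10 matches, and 2 stay at 9.
```

Each of the 9 matching positions has 3 ways to mutate away from the consensus, while the single mismatched position has only 1 way to mutate back to it, which is where the 27:1 ratio comes from.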

Now I need to find that old calculation - I think it might be with my 1999 NIH proposal.

Why don't USS accumulate in our model?

The undergraduate is working at home, making improvements to the usability of our Perl model of USS evolution, while I'm here doing some test runs with the slightly unwieldy version we have now.

This is the version that incorporates the first major improvement. Instead of a single "best-score" fragment recombining with the genome in each cycle, each of the scored fragments can recombine with the genome, with a probability proportional to its score. In principle this should allow the recombination to introduce new USS-like sequences faster than they are lost from the genome by mutation. But in practice the USS-score of the genome drifts up and down but doesn't consistently increase.

"When in doubt, run more controls."

So I've made a modified version of this program that has the same feature as our PositiveControl.pl program (its name is PositiveControlv2.pl). Instead of mutating random fragments and recombining them back into the genome, it replaces their sequences with perfect USS cores. Of course the fragments need to be the same length as the core.

OK, this positive control tells me that the program is doing what it should. 10 fragments are recombining into the genome each cycle, which is what should happen to the 10 fragments the program considers each generation. Now I'll test PositiveControlv3, which puts in an imperfect USS instead of a perfect one. If the program is doing what I want, on average 8 of the 10 fragments will recombine each generation, and the genome score will not get as high as with perfect USSs.
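The expected counts follow directly from score-proportional recombination. A trivial sketch (Python; the helper name is mine):

```python
def expected_recombinations(n_fragments, score, max_score):
    """Expected number of fragments recombining per cycle when each
    recombines with probability score/max_score."""
    return n_fragments * (score / max_score)

# Perfect USS (score 10 of 10): all 10 fragments recombine on average.
# Imperfect USS (score 8 of 10): on average 8 of the 10 recombine.
```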

OK, that worked too. So I think the problem may not be with the implementation of the code, but with the design of this version of the model. Back to the drawing board... (literally, back to the big whiteboard in the hall outside my office).

Komodo rocks!

Today we started using Komodo Edit as our editor for Perl.  It's much better than Mi:  better colouring (the numbers are all in red), more reliable indentation, and it catches syntax errors on the fly.  And, very cute, when I typed "if (" , it automatically added the second bracket ")" on the far side of my cursor, to make sure I remembered that I needed to close the (if condition) brackets.

Our most immediate goal is to transfer some of the features of our old/abandoned program into this new one.  This includes some basic/sensible features: reading the parameter settings for each run from a separate file, rather than changing the code of the program, and having a mechanism to neatly end the program and save the interim work when it's interrupted with a control-C.

There are also a couple of quite clever features specific to the issues our model raises.  One is when to print interim reports on the model genome's status.  Depending on the parameters being tested, runs can take anywhere from a few hundred to a hundred thousand or more cycles, with rapid change only at the beginning.  Rather than printing reports at a fixed interval, we have the reports printed whenever the number of cycles increases by a specified percentage.  This gives very frequent updates at the beginning of a run, and reports at increasingly long intervals as the number of cycles gets large.  

The other clever feature is how the model decides that the evolving features of the genome have reached an equilibrium.  This is checked by comparing the recent state of the genome (mean of some attribute in the most recent print interval) with the mean of the same attribute in the previous interval and over the whole run.  By using the increasing print interval as the unit of measure, we get sensitivity appropriate to the rate of change. 
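Both features are simple to sketch. This is an illustration in Python rather than the program's Perl, with my own function names and a hypothetical 1% tolerance for the equilibrium test:

```python
def report_cycles(max_cycles, pct=10):
    """Cycle numbers at which to print reports: each report comes once the
    cycle count has grown by `pct` percent since the last report."""
    cycles, c = [], 1
    while c <= max_cycles:
        cycles.append(c)
        c = max(c + 1, int(c * (1 + pct / 100.0)))  # strictly increasing
    return cycles

def at_equilibrium(recent_mean, previous_mean, tolerance=0.01):
    """Declare equilibrium when the attribute's mean over the most recent
    print interval is within `tolerance` of the previous interval's mean."""
    if previous_mean == 0:
        return recent_mean == 0
    return abs(recent_mean - previous_mean) / abs(previous_mean) <= tolerance
```

Because the print interval itself grows geometrically, the equilibrium comparison automatically averages over more cycles late in the run, when change is slow.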

No time for Perl today

But I did spend some time with a post-doc working on the Discussion of her manuscript. I had forgotten that last time we worked on it we tore the existing Discussion into shreds and came up with a new organization. So this morning I was discouraged to see that we had improved our Discussion out of existence, but our new organization is so much better that we soon had at least half of the text in place.

On the Perl side of things, a very helpful commenter (Neil) pointed out that finding the missing/extraneous curly bracket would have been easy if we were using an editor that highlights and validates syntax. We're using an editor called 'mi'. It lets us specify that our text is Perl, and uses colours to distinguish between different kinds of text (comments are red, text to be printed is grey, functions are green, operators are sort of purplish, 'while's and 'if's and 'my's are blue), but it doesn't sort out hierarchical stuff like the levels of brackets, and the levels of indentation keep going to hell (possibly my own fault). I'd love to hear about a better Perl editor for Macs, if any reader knows of one.

Who knew that 'while' loops can't be nested...

Substantial progress on the Perl model of USS evolution.

First the undergrad and I added the code that tallies up the scores of every sliding-window position in the genome. It didn't take us long to get it running (a few stray semicolons, etc.). However, after Keith's comment on yesterday's post I think there are more efficient ways to do what we've done.

Then we created code that does the recombination a completely different way, so each of the fragments being tested has its own chance to recombine (probability of recombination depends on its score). The code wasn't tricky, but getting it running took ages of tracking down stray curly brackets and discovering that I can't nest two 'while' loops (I changed one of them to an internal 'if' test).

I still need to add in the feature I came up with yesterday, that writes out the genome sequence and the tally of scores only at specified intervals, but now I have to dash off to a 'visioning education' meeting imposed on us by the administration.

Positive control progress on the USS model

The simplest version of our new Perl model of USS evolution has progressed to the state where it runs correctly. This afternoon I've been doing lots of runs, both with a 'positive control' version that replaces a random genome position with a single perfect USS core in every cycle, and with a test version that mutates random fragments, scores them for goodness of match to the USS motif, and then recombines the best-matched one back into the genome. Tomorrow the undergrad and I are going to create a modified version, to try a different way of having the fragments' scores determine whether they recombine with the genome.

With the positive-control version I've been examining the effect of changing the genomic mutation rate. If the mutation rate is zero, the only limit to USS accumulation is the way insertion of new USS disrupts existing USSs. (This happens only because each 10bp fragment is changed to a perfect USS before recombination, and so bears no relation to the original sequence at that position.) Not surprisingly, more USSs are present at equilibrium when the mutation rate is zero, and fewer when the mutation rate is 0.01 or 0.05 changes per position per cycle. The rate of increase in genome score is largely independent of the mutation rate. Because only a single USS is inserted per cycle, the number of cycles to equilibrium depends on the length of the genome.

Wait - good idea! I think we need to add some code to give us the frequency of each sliding-window score at the end of the run. This would let us make a histogram of how many USS-like sequences the genome has at the beginning of the run, and how many it has accumulated at the end. Basically, as the sliding-window is scoring the match to the motif at each position, it should record the score in a tally (number of sites scoring 0, number scoring 1, number scoring 2, ...... number scoring 10). I could write some inefficient code to do this (barring about a thousand syntax errors - I really should go back and reread the first few chapters of Beginning Perl for Bioinformatics), but this sounds like something the undergrad might have learned an efficient way to do.
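The tally itself is easy to sketch (my own simple version, using a per-base match count in place of the real motif scoring; 'AAGTGCGGT' is the H. influenzae USS core, and the toy genome is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slide a window along the genome, score each position's match to the USS
# core (0..9 matching bases), and count how many positions get each score.
my $core   = 'AAGTGCGGT';
my $genome = 'TTTTAAGTGCGGTTTTTAAGTGCGTT';    # toy sequence with one perfect site
my @tally  = (0) x (length($core) + 1);

for my $i (0 .. length($genome) - length($core)) {
    my $window = substr($genome, $i, length $core);
    my $score  = 0;
    for my $j (0 .. length($core) - 1) {
        $score++ if substr($window, $j, 1) eq substr($core, $j, 1);
    }
    $tally[$score]++;    # one increment per window position
}

print "sites scoring $_: $tally[$_]\n" for 0 .. $#tally;
```

Printing the whole @tally array at the end of a run gives exactly the histogram described above: mostly low scores from random sequence, plus a bump at the top end if USSs have accumulated.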

Did I learn anything else from the positive control runs? If the genome is very short the program runs very fast but the scores are noisy (no surprise there). I learned that I have no practical insight into how a sequence's USS 'score' reflects the quality of its matches to the motif - that's why we need the tally. I played around with the 'threshold' we use as a cutoff for insignificant matches to the USS consensus, but I think we can't really understand what this accomplishes until we have the score tally.

I also did some runs with the test version (not the positive control). The results of these mostly served to reinforce the importance of the genomic mutations. Under the present recombination system, USSs can't accumulate in the genome because they mutate away faster than they're improved by recombination. I tried turning off the mutation of the genome, so that mutation only happens to fragments that are about to be scored for possible uptake. Even with this 'cheating', the genome's USS score crept up slowly and then plateaued at what looks (without the tally) to be a genome with only weak USS sites.

Model systems in evolutionary biology

I spent the weekend, with the rest of my lab, at the Evo-WIBO meeting of evolutionary biologists.  Over breakfast we got into a discussion of the role (non-role?) of 'model systems'.  I did my usual rant about how evolutionary biologists don't even understand the concept of model systems, but I'm wondering whether the problem is partly just the nature of evolution research.

Molecular biology, biochemistry, cell biology and physiology have made dramatic advances, largely because many researchers in those fields work on the same organisms, so that the findings of one study can be directly used as the groundwork for more studies.  Evolutionary biologists (and ecologists) almost always work on different organisms, and although they publish lots of nice papers these rarely can be applied to studies by other research groups.

This is partly tradition - a mark of academic independence in evolutionary biology seems to be choosing your own research system (organism+field site+questions of interest), but it might also partly arise from the nature of the field.  The process of evolution is intrinsically tied more to variation than to shared properties - natural selection acts on differences, not similarities.  So maybe choosing to work on different systems just looks like the sensible thing to do.

But its consequences are unfortunate, because although every nice bit of research claims to have big-picture implications, the lack of transferability means we haven't really gotten anywhere.

Outline of the perl program

(in response to good advice in the comments)

Below is just a list of the main sections of the program, in its present 'test' incarnation.

MAIN PROGRAM:

1.  Get parameter settings from a file (except it doesn't, the settings are hard-coded in this version).

2. Create a random-sequence 'genome' of the specified length and base composition.

3.  Simulate a set number of cycles (presently 100), each consisting of genome mutation, fragment creation, mutagenesis and scoring, and recombination.    

     3A.  Mutate the genome by randomly changing bases with a specified probability.  (This step should be later in the cycle, not here.)

     3B.  Select a specified number of segments of the same length, from random positions in the genome.  These will represent fragments in the external gene pool.

     3C.  Record each fragment's sequence and 5' end position.

     3D.  Mutate each fragment's sequence.

     3E.  Score each fragment's sequence for goodness of match to the uptake sequence motif, using a sliding window.  I think the sliding-window scores are not being summed correctly.

     3F.  Choose the fragment with the highest score.  Put its sequence at the corresponding genome position, replacing the original genome sequence of this fragment.  (This is a simulated form of recombination by gene conversion.)  Any mutations of this fragment that occurred in step 3D will thus become changes in the genome sequence.

     3G.  Score the genome for how well its sequence matches the USS motif.

4.  At the end of all the specified cycles, stop and report.
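For concreteness, the cycle above can be written as a runnable skeleton (the subroutine names, parameter values and scoring rule here are my own toy stand-ins, not the real program's; in particular, this score_uss just reports the best single-window match rather than a cumulative score):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $core = 'AAGTGCGGT';    # H. influenzae USS core

# Toy scoring: best per-base match of any window to the core (0..9).
sub score_uss {
    my ($seq) = @_;
    my $best = 0;
    for my $i (0 .. length($seq) - length($core)) {
        my $score = 0;
        for my $j (0 .. length($core) - 1) {
            $score++ if substr($seq, $i + $j, 1) eq substr($core, $j, 1);
        }
        $best = $score if $score > $best;
    }
    return $best;
}

# Toy mutation: int(length * rate) changes at random positions, new base
# drawn uniformly (the real code preserves base composition instead).
sub mutate {
    my ($seq_ref, $rate) = @_;
    for (1 .. int(length($$seq_ref) * $rate)) {
        substr($$seq_ref, int rand length $$seq_ref, 1) =
            ('A', 'C', 'G', 'T')[int rand 4];
    }
}

my $genome = join '', map { ('A', 'C', 'G', 'T')[int rand 4] } 1 .. 1000;
my ($cycles, $n_frags, $frag_len) = (100, 10, 50);
my $genome_score;

for my $cycle (1 .. $cycles) {
    mutate(\$genome, 0.001);                                   # step 3A
    my @frags;
    for (1 .. $n_frags) {                                      # steps 3B-3C
        my $pos = int rand(length($genome) - $frag_len);
        push @frags, { pos => $pos, seq => substr($genome, $pos, $frag_len) };
    }
    for my $f (@frags) {                                       # steps 3D-3E
        mutate(\$f->{seq}, 0.001);   # note: int(50 * 0.001) == 0, so at these
                                     # toy settings fragments escape mutation
        $f->{score} = score_uss($f->{seq});
    }
    my ($best) = sort { $b->{score} <=> $a->{score} } @frags;  # step 3F
    substr($genome, $best->{pos}, $frag_len) = $best->{seq};
    $genome_score = score_uss($genome);                        # step 3G
}
print "final best-window score: $genome_score\n";
```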

-----------------------------------------------------

SUBROUTINES:

I.  Creating the original random genome sequence:  This is pretty simple; it just picks bases randomly, with probabilities specified by the base composition.  
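Subroutine I might look roughly like this (an illustrative sketch, not the actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a random genome, picking each base with a probability set by the
# desired base composition (fraction G+C).
sub random_genome {
    my ($length, $gc) = @_;
    my $genome = '';
    for (1 .. $length) {
        if (rand() < $gc) { $genome .= rand() < 0.5 ? 'G' : 'C' }
        else              { $genome .= rand() < 0.5 ? 'A' : 'T' }
    }
    return $genome;
}

my $genome = random_genome(1000, 0.38);    # H. influenzae is about 38% G+C
print length($genome), " bases\n";
```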

II. Mutating the genome or fragment sequence:  This is more complex, partly because the mutations need to maintain the base composition (see subroutine III), but mainly because it does this in a relatively non-obvious but more efficient way.  It first decides how many mutations to make, by multiplying the genome length by the mutation rate and taking the integer value.  (Oops, this will only work if the genome or fragment is big enough to expect at least one mutation per cycle. The 'test' version has only a 100nt genome and a specified mutation rate of 0.001, so it has a real mutation rate of zero.)   The subroutine then randomly chooses positions for this number of mutations, and makes the mutations at these positions.

III. Doing the calculations for the mutagenesis probabilities:  This creates arrays holding the mutation probabilities for each base (A or G or C or T) to mutate to each other base.
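A sketch of how subroutines II and III fit together (the matrix values and all names here are illustrative only, not taken from the real program): the number of mutations is the sequence length times the per-position rate, truncated to an integer, and each mutated base draws its replacement from a substitution matrix chosen so that the overall base composition is roughly preserved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative substitution matrix: row = old base, columns = probability of
# each replacement base (rows sum to 1; values chosen to keep %G+C stable).
my %substitute = (
    A => { A => 0,    C => 0.19, G => 0.19, T => 0.62 },
    T => { A => 0.62, C => 0.19, G => 0.19, T => 0    },
    G => { A => 0.31, C => 0.38, G => 0,    T => 0.31 },
    C => { A => 0.31, C => 0,    G => 0.38, T => 0.31 },
);

sub mutate_seq {
    my ($seq_ref, $rate) = @_;
    my $n_mutations = int(length($$seq_ref) * $rate);  # rounds to 0 if too short
    for (1 .. $n_mutations) {
        my $pos = int rand length $$seq_ref;
        my $old = substr($$seq_ref, $pos, 1);
        my $r   = rand();
        my $new = 'A';                                 # fallback for round-off
        for my $base (qw(A C G T)) {
            $r -= $substitute{$old}{$base};
            if ($r < 0) { $new = $base; last; }
        }
        substr($$seq_ref, $pos, 1) = $new;
    }
}

my $seq = 'A' x 10_000;
mutate_seq(\$seq, 0.01);              # expect about 100 changes
my $changed = ($seq =~ tr/A//c);      # count positions no longer A
print "$changed positions changed\n";
```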

I've got to stop this and work on my course's final exam for a while.

Still reading code

I'm working my way through the undergraduate's Perl code, annotating it with detailed comments to explain to myself what I think is going on.

I'm only about 30% through the code, but I think I've found a number of problems, one of them big.  Tomorrow morning I'll tie down the undergraduate and go over everything with him.  Or maybe tomorrow early afternoon - tomorrow morning I'm supposed to help one of the post-docs polish her talk for this weekend's Evo-WIBO meeting.

Reading code

Our Perl-programming undergrad has just sent me a copy of the latest version of his program simulating the evolution of uptake sequences by molecular drive.  So far I've gotten to about line 100 and found several trivial typos and one source of confusion (to me).  I had thought that the order of steps in each cycle was as follows:
  1. Choose random fragments of a specified length from genome and mutate them (as if they came from different daughter cells).
  2. Score each fragment for its match to the USS consensus.
  3. Mutate the original genome according to the same procedure used on the fragments.
  4. Replace the corresponding segment of the genome with the fragment that has the best USS score.
But the standard version of this code seems to instead do the following:
  1. Mutate the whole genome.
  2. Choose random fragments and mutate them (again).
  3. Score each fragment for its match to the USS consensus.
  4. Replace the corresponding segment of the genome with the fragment that has the best USS score.
So the fragments are getting mutated twice.

In actuality, this 'test' version of the code has a couple of steps commented out, and short-circuits the fragment-generation and mutation steps by simply specifying the sequence of every fragment (as a perfect USS).  I think this makes it a lot easier to confirm that the code that does the scoring is working as intended.  Tomorrow I'll sit down with the undergrad and go over it.

Why "Expelled" is a bad movie

The guy in the next office just got off the phone with a colleague, arguing about whether atheists should rise up and express their views, a discussion triggered by the pending visit of Richard Dawkins to UBC. This horrible movie is one of the reasons I think we should:

Expelled

Classes are over!

Let's see if I can get back to posting every day.  First let's see if I can remember what's been on the back burner.

The USS-evolution computer simulation is up and running, and the programming assistant will be able to spend more time working with me and the post-doc to get it doing what we want.  It's close, so I'm hoping for lots of advances and discoveries.  I've signed up to give a short talk about this at an upcoming meeting on sex and recombination (in Iowa, just before the big evolution meeting in Minnesota at the end of June).

The "How do USS constrain genome function" project with the out-of-town bioinformaticist is ready for its final polishing.  A couple of weeks ago she sent me an email which (I think) contains the final data, and it really shouldn't take long now to have the manuscript ready for submission.  I've signed up to give a short talk on this work at the evolution meeting.

The post-doc who's been analysing the variation in competence in a diverse set of H. influenzae strains now has a manuscript that doesn't need a lot more work.  She's going to give a short talk on this work at next week's Evo-WIBO meeting (evolutionary biologists in Washington, Idaho, British Columbia and Oregon).

Both of the molecular biology post-docs have data that has yet to be put into manuscripts (at least I have yet to see the manuscripts).  One has been analysing how CRP binds to recognition sites, and the other has been doing microarrays to find out how CRP and Sxy regulate genes in E. coli.

On the teaching front, I still have to:  fix up the final exam so it's a valid assessment tool for our homework-research project as well as for students' understanding; grade 17 term papers that are evaluating intelligent design as a scientific alternative to natural selection; help the graders grade the other ~75 project reports; help our homework grader finish grading the last two homeworks, and prepare detailed keys for these; analyse and post the marks for the 'clicker' questions the students have been answering in classes; administer and help grade final exams for about 360 students; and get all the grades analysed and submitted.

And on the homework-research project front, I (and the wonderful teaching fellow I'm working with) still have to finish a proposal for a small grant to hire assistants to assess the quality of writing in papers and exams by this year's and last year's students; find someone with sufficient Excel skills to transform our clicker-collected survey data into something we can work with; read the literature (I'm hoping the teaching fellow will point me to the appropriate papers); analyse the data; and write the paper.

Insights from a visitor

We are blessed this week by a visit from a potential collaborator - a computer scientist who's done work on uptake sequences.  When we described our computer-simulation model of uptake sequence evolution, he quickly discovered a serious problem arising from the way we have set up the uptake bias function to work.

In our model, a pool of DNA fragments is created each cycle, and the fragment with the best uptake sequence score gets to replace the homologous segment of the genome whose evolution is being simulated.  The uptake sequence score is determined by using a sliding window to compare the fragment's sequence to the designated uptake sequence.  For a long fragment this score is expected to reflect the combined effect of multiple uptake sequences.  But at present the model is using short fragments, so the score is likely to just be that of a single uptake sequence. 

This means that, once the genome contains a 'perfect' uptake sequence at one position, a fragment homologous to that position is expected to always out-score any other fragments in the pool, and thus always be the one that replaces the resident sequence.  Thus one good match prevents the gradual evolution of other not-quite-as-good matches at other positions.

There's a different way to do the competition that doesn't have this problem.  Rather than having multiple fragments 'compete' for the best score, with the winner taking the only opportunity to recombine with the genome, we can give each fragment in the pool an independent chance of being taken up.  Uptake of one fragment would not affect the chance of any other fragment being taken up in the same cycle.

This will require some rewriting of the code. But it shouldn't be a big deal, and luckily our undergraduate programmer will have more time to work for us once exams are over at the end of the month.

Can I remember how our USS-evolution model works?

Today we're having the first lab meeting in weeks. (When my turn came around a few weeks ago I just kept canceling them, but now I'm starting to see the light at the end of the teaching tunnel.)  We're going to discuss an issue that's arisen in the USS-modeling work being done by an undergrad research assistant, but first I promised to introduce this project.  What can I remember (or rediscover by reading my old blog posts about it)?

The big goal is to simulate how uptake sequences accumulate in genomes of competent bacteria, under the combination of mutation pressure (a randomizing force) and biased uptake preferring fragments containing these sequences.  The model follows a single genome-sized sequence through repeated cycles in which 
  1. Random segments of the genome are treated as if they were fragments in an external DNA pool released by descendants of the 'index genome'.
  2. These fragments are scored for quality of their match to the ideal uptake sequence.  The best fragment is chosen for the uptake step.
  3. In the conceptual meantime, the index genome itself undergoes random mutation, becoming the descendant index genome.
  4. The chosen fragment's sequence replaces the homologous sequence in the descendant index genome.
  5. This recombinant sequence becomes the new index sequence and the cycle starts again at step 1.
There have been lots of issues to resolve along the way (how the mutation steps maintain the base composition of the sequence, how the uptake sequences are scored), but we finally have a program that runs.  It seems to be working correctly, but the undergraduate who's done most of the work tells me that it isn't causing any uptake sequences to accumulate.  He's quite a sophisticated undergraduate - he has a Biochemistry degree under his belt and is nearly finished a second degree, in computer science - and he's done a lot of statistical analysis to look for the expected accumulation.

I suspect that the problem is that so far the model is using inappropriate parameters (mutation rates too low or too high? uptake bias settings too weak or too fussy? numbers of cycles too short?).  Today's goal is to figure out what these might be.



Research about teaching

It's been so long since I posted that Blogger had a seizure when I clicked on 'New Post'.
I'm still struggling to find a few spare synapses to think about scientific research, but the teaching demands will begin to diminish soon. In the meantime here's a post about the only research I'm currently paying much attention to.

I'm teaching freshman biology to about 380 students, in two sections. With one of our wonderful teaching fellows (supported by UBC's Carl Wieman Science Education Initiative), I'm carrying out an experiment to find out how homework might improve students' understanding of the course material and ability to explain their understanding in writing.

It's been widely assumed, but perhaps never explicitly tested, that doing homework helps students understand course material. In this experiment we're going to examine whether students whose homework required them to formulate their ideas in correctly written sentences, and who got detailed feedback on their errors, perform better on midterm and final exams than students whose homework required only that they recognize correct answers.

The course I teach (BIOL 121, Genetics, Evolution, Ecology) has no tutorials and no TAs, just graders for midterms and finals; it's never had homework. This term we've split the students randomly into two homework groups that both get weekly homework assignments with very similar content but different requirements. Each homework is built around a single theme; more of a 'case study' than a series of unrelated questions. Students are given some information, asked one or two questions, given a bit more information, asked more questions, etc.

Group B's questions are in formats that can be automatically graded by our BlackBoard course management system - mostly multiple-choice questions, with some matching and fill-in-the-blanks. Group A is asked many of the same questions, but they have to think up their own answers and write them out. If the question does not require writing (e.g. has a numerical answer) the students are asked to give a written explanation of their answer.

Group B students can check the grading of their answers through the online system but get no specific feedback about their errors. Group A students get individual feedback about both their writing errors and their content errors. Part of this feedback is a very detailed grading key that gives, for each question, a sample answer, a numbered list of points a good answer should contain and errors that should have been avoided, and often a reference to lecture notes, textbook pages, or other sources of clarifying information. For each student's submission, each answer that did not earn full marks is commented with numbers indicating which problems the answer contained. For example, Writing error A is 'grammar errors', and Content error 4a is 'misinterpreting pedigree symbols or relationships'.

At the end of term we will compare the performance of the two groups on both the midterm and the final exam. Both these assessments include questions with written answers, allowing us to evaluate both students' mastery of course material and their ability to write clearly and correctly. Students were given a pre-quiz at the first class, and some of these questions are repeated on the midterm or final, allowing direct before and after comparison.

To make sure students feel they have been treated fairly, the course grades will be normalized across the two groups before being officially submitted to the Registrar's Office. The group with the lower course mean will have its grades raised to match the mean of the other group. This seems to be having the desired effect: we had expected some students to protest the unequal treatment, complaining either that Group A had to work harder or that Group A was going to learn more, but neither complaint has materialized.