Our visiting grad student is working with Gallibacterium, a Pasteurellacean relative of Haemophilus. To help her optimize transformation we would like to find out about its uptake bias. As a first step, we'd like to find out whether its genome has repeats that resemble the known Pasteurellacean uptake signal sequences (USS) - fortunately a Gallibacterium genome sequence is available. I've done this analysis for all the other sequenced Pasteurellacean genomes, so I said I'd do this one too. Should be easy...
My first approach was to give the genome sequence to our Perl program that simulates USS evolution, not because I want to run a simulation, but because the program's first step is to count the numbers of full and partial USS matches in the starting sequence. The program was set up to do that for the H. influenzae USS (AAGTGCGGT), but when it didn't find many of these in the genome I modified it to find the other type of Pasteurellacean USS (ACAAGCGGT). It didn't find many of those either.
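For anyone wanting to reproduce just the counting step, here's a minimal Perl sketch (not our actual program; the FASTA file name is made up) that tallies exact matches to both USS cores on both strands:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the genome from a single-sequence FASTA file (file name is made up).
my $genome = '';
open my $fh, '<', 'gallibacterium.fa' or die "Can't open gallibacterium.fa: $!";
while (<$fh>) {
    next if /^>/;        # skip the FASTA header
    chomp;
    $genome .= uc $_;    # concatenate sequence lines, uppercased
}
close $fh;

# The two known Pasteurellacean USS cores.
my %uss = ('Hin-type' => 'AAGTGCGGT', 'Apl-type' => 'ACAAGCGGT');

for my $type (sort keys %uss) {
    my $core = $uss{$type};
    (my $rc = reverse $core) =~ tr/ACGT/TGCA/;    # reverse complement
    my $fwd = () = $genome =~ /$core/g;           # count exact forward matches
    my $rev = () = $genome =~ /$rc/g;             # and reverse-strand matches
    print "$type ($core): $fwd forward, $rev reverse-strand matches\n";
}
```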
So, perhaps Gallibacterium has a previously unknown version of the USS. Or perhaps it has an unrelated USS. Or perhaps it doesn't have a USS at all, which would suggest that it has weak or no uptake bias. What was needed was analysis with the Gibbs motif sampler, which would look for any common repeat in the genome. OK, I did lots of those last summer, so I can do it again.
I remembered how to submit a sequence for analysis, but I didn't bother to carefully check what the different settings do before submitting the run. That was stupid, because 36 hours later I've received two emails from the system, telling me that my requested run failed. One says "ERROR:: Mismatched width ranges" and the other "ERROR:: Palandrome (sic) subscript overflow". Guess I'd better buckle down and sort it out.
CRP manuscript revisions submitted, on to Gallibacterium...
As usual, it took me about 3 tries to get the manuscript resubmitted with all the files in their correct forms. But it's done.
I'm going to try to get back to the bench next week, doing some competence-induction experiments with Gallibacterium, brought to the lab by our visiting grad student. Oh boy!
Manuscript almost ready to go back
Our manuscript about how the CRP proteins of E. coli and H. influenzae differ in their sequence specificity has been provisionally accepted, and the revised version is almost ready to send back to the journal. We weren't able to do the one experiment requested by the editor, but we make what we think is a pretty good argument about why it isn't needed.
The only remaining problem is that some of the figures look a bit weird, I think due to being shuffled between different formats. The dark grey shading in some of the bar-graph bars has turned into a dark grey check pattern. My former-postdoc-coauthor converted the figures into high-resolution PDFs so he could email them to me, but maybe he should instead post them to one of those file-sharing sites where I can download them. I know Google Groups works for this, but I think there are also sites dedicated to this.
The Response to Reviewers letter has been written and revised and polished, so once the figures are sorted out I think I can sit down and do the on-line submission.
Where are they now? Part 2
Two more lines of research that we're no longer working on:
3. When did eukaryote sexual reproduction begin? During the first 10 years that I was working on competence, I fully intended to switch to studying the origins of meiosis in eukaryotes. The plan, and the reasons I set it aside, are explained in this post from last summer. (Fortunately John Logsdon has taken up the torch.)
As the first steps in this project Joel Dacks, then a M.Sc. student in my lab, and I published two papers on the phylogeny of early-diverging eukaryotes. These results have been since confirmed by more detailed analyses, although the deep phylogeny of eukaryotes is still rather obscure.
4. Quorum sensing and/or diffusion sensing: Most bacteria secrete small more-or-less inert molecules into their micro-environments and monitor the external concentrations of these molecules. When this autoinducer secretion was first discovered it was proposed to be a means of cell-cell communication, evolved to enable bacteria to monitor the cell density of the population they are living in and to respond with appropriate changes in gene expression. This "quorum sensing" explanation quickly became dogma, despite having serious theoretical/evolutionary problems. In retrospect, this acceptance was partly because there were no alternative explanations for the evolution of autoinducer secretion and sensing, and partly because the idea that bacteria are secretly talking to each other is very appealing.
In 2002 I published an opinion piece (in Trends in Microbiology) proposing a much simpler explanation, that the secreted molecules serve as inexpensive sensors of the diffusional properties of each cell's microenvironment, and thus allow cells to secrete expensive effector molecules (such as degradative enzymes) only when they and their products will not be lost by diffusion. This ‘diffusion sensing’ hypothesis was welcomed by evolutionary biologists but largely ignored by the many researchers actively investigating quorum sensing. My lab initially tried to develop experimental systems to demonstrate that isolated cells use secreted autoinducers for gene regulation, but gave up because of the technical problems of monitoring gene expression at the scale of single isolated cells.
However the paper now gets regular citations in reviews of quorum sensing, and several other research groups have produced evidence validating the importance of diffusion in autoinducer regulation. The latest is a study of Pseudomonas cells on leaves (Dulla and Lindow PNAS 2007), which found that diffusion and other physical factors in cells' microenvironments are major determinants of this regulation. They pointed out that my proposal "has received little attention despite the extensive study of QS in many species", and even quoted approvingly my sentences about what research is needed.
Where are they now?
In the course of updating my CV I've been checking what's become of hypotheses and projects we initiated but are no longer working on. The good news is that all of them are still active areas of research, and the ones I consider most important are getting increasing attention. Here's a quick overview of two of them.
1. Mutation rates in males vs females: In response to a paper reporting that point mutation rates are much higher in males than females (because sequences on X chromosomes evolve slower than sequences on Y chromosomes), I used a computer simulation model to show that the excess mutations in male lineages usually canceled out the benefits of sexual recombination for females (Redfield Nature 1994). This paper made a big media splash when it came out; Natalie Angier wrote it up for the New York Times, Jay Leno made a joke about it, and it even got a paragraph in Cosmopolitan! This was partly because the title was full of buzzwords 'sex', 'male', 'female', 'mutation', and partly because I wrote up a very clear useful press release.
It didn't make much of a scientific splash, and it hasn't had much impact on subsequent work on the evolution of sex, but the number of citations continues to increase. Many citations are from a European group of theoretical physicists who publish mainly in physics journals, but others are from evolutionary biologists. One 2007 review discusses the implications of my work, referring to it as 'a seminal study' (which I choose to interpret as not just a bad pun).
2. The hotspot paradox: Most meiotic crossing-over happens at chromosomal sites called recombination hotspots; the largest influence on the activity of these sites is the DNA sequence at the site. While I was still a grad student I realized that, over evolutionary time, active hotspot sequences should disappear from genomes, being replaced first by less-active and then by inactive sequences. This is because the mechanism by which hotspots cause recombination also causes more-active hotspot sequences to be physically replaced by less-active sequences. At that time the genetic evidence for this mechanism was strong, although little was known about the molecular details. This creates a paradox, because hotspots have not disappeared (each chromosome has many of them).
About 10 years later I returned to this problem, using detailed computer simulations to model the evolution of hotspots. We first created a deterministic model of a single hotspot, and showed that none of the forces opposing hotspot elimination (evolutionary benefits of recombination, benefits of correct chromosome segregation, direct fitness benefits of hotspots that also act as promoters, singly or in combination) were strong enough to maintain hotspots against their self-destructive activity. Several years later we created a better, stochastic, model that followed multiple hotspots on a chromosome - this confirmed and strengthened the previous conclusions.
The first paper (Boulton et al, PNAS 1997) was ignored by just about everyone, particularly the molecular biologists whose work might be expected to resolve the paradox. By the time the second paper was published (Pineda-Krch and Redfield 2005), evidence from human genetics had confirmed that the hotspot destruction originally studied in fungi also occurs in humans. Now, the increasing ability to examine individual crossover events at base-pair resolution has focused attention on the paradox, and most papers about hotspots in natural populations (including humans) mention it as a sign that the evolutionary history of recombination hotspots remains perplexing.
I'll write up a couple more of these projects tomorrow.
I decided to do a USS-model run with the values in the DNA-uptake matrix increased five-fold, thinking this meant that drive favouring uptake sequences would be much stronger. This should have meant that, when beginning with a random sequence, there would initially be very few fragments that had sufficiently high scores to pass the probability filter and recombine with the genome. So I was surprised when the frequency of recombination at the beginning of the run looked to be about the same as with the normal matrix.
But then I remembered that one nice feature of the latest version of the model is that it uses whatever values are in the DNA-uptake matrix to calculate an appropriate value for the recombination probability filter, always equal to the score earned by a single site that perfectly matches the matrix-specified preferences. This is good under some circumstances, but in the present case it entirely defeats what I was trying to do.
So I guess it's time to tweak the model one more time, introducing a coefficient by which the probability filter can be multiplied. Ideally we'd like to be able to sometimes have this coefficient change during the run, as this would allow us to simulate evolution of the uptake specificity itself. I'm confident that I can introduce a constant coefficient (read in with the other run settings); I wonder if I can easily make it a variable...
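Here's roughly what I have in mind for the constant version, as a sketch rather than the model's real code (the matrix values, the example coefficient, and the variable names are all made up, and I'm assuming the multiplicative scoring we switched to earlier):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# A toy 9-position uptake matrix (consensus AAGTGCGGT); one row of
# weights per position, columns A, C, G, T.  The real matrix comes from
# the run-settings file.
my @rows = ([10,1,1,1],[10,1,1,1],[1,1,10,1],[1,1,1,10],[1,1,10,1],
            [1,10,1,1],[1,1,10,1],[1,1,10,1],[1,1,1,10]);
my @matrix;
for my $r (@rows) {
    push @matrix, { A => $r->[0], C => $r->[1], G => $r->[2], T => $r->[3] };
}

# Currently the filter equals the score of one perfect site: with
# multiplicative scoring, the product of the best weight at each position.
my $perfect_score = 1;
$perfect_score *= max(values %$_) for @matrix;

# Proposed tweak: scale the filter by a coefficient read in with the
# other run settings (0.2 here is just an example value).
my $coefficient = 0.2;
my $probability_filter = $coefficient * $perfect_score;

# To make the coefficient a variable, it would be recomputed each cycle
# inside the main loop, e.g. as some function of the cycle number.
print "perfect-site score: $perfect_score; filter: $probability_filter\n";
```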
Test post from iphone
The culture media problems we were having last fall are back. This time the postdocs suspect the medium, so they've ordered a new stock of the Bacto stuff.
I think I won't try to post any more from my phone until my iPhone typing speed improves.
Sent from my iPhone (by email to Blogger, because the iPhone doesn't support MMS messaging).
How arbitrary is a position weight matrix?
Recently I've run some USS-evolution simulations that started with a 50 kb segment of the H. influenzae genome rather than with a random sequence of base pairs. I used the position weight matrix derived by Gibbs analysis of the whole genome, thinking that this would be a non-arbitrary measure of the over-representation of the uptake sequence pattern. I was hoping that this bias would be strong enough to maintain the uptake sequences already in the genome, but the genome score fell to about half (or less) of its original value after 50,000 or 100,000 cycles.
That started me wondering whether the position weight matrix should be treated as a fixed set of values, or just as telling us the relative value that should be applied to each base at each position. Said another way, could different settings of the Gibbs analysis have given a very different matrix? The answer is No, but only because the Gibbs analysis reports the weight matrix as the frequency of each base at each position, so the sum of the weights of the four bases at each position must add up to 1. So if we want to consider stronger or weaker matrices, there's no reason not to multiply all the values by any factor we want to test.
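The scaling itself is a one-liner once the Gibbs output has been parsed; here's a sketch with made-up frequencies (only two motif positions shown):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Gibbs reports per-position base frequencies (each position sums to 1).
# Two positions with made-up numbers; the real matrix has one row per
# motif position.
my @pwm = (
    { A => 0.90, C => 0.03, G => 0.04, T => 0.03 },
    { A => 0.85, C => 0.05, G => 0.05, T => 0.05 },
);

# Multiply every weight by the factor being tested (e.g. a 5x stronger bias).
my $factor = 5;
for my $position (@pwm) {
    $_ *= $factor for values %$position;
}
printf "position 1 A weight is now %.2f\n", $pwm[0]{A};
```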
So I think the first goal is to see what strength of matrix is needed to maintain the uptake sequences in the genome at their natural frequencies. Then we can see whether the uptake sequences are still in the places they started at, or whether these have decayed and new ones arisen in new places.
Some calculations and implications
In playing around with our Perl model of uptake sequence evolution, I've been surprised by how weak the effect of molecular drive seemed to be. But I just did some back-of-the envelope approximations and realized that our settings may have been unreasonable.
We can use the relationship between the genomic mutation rate the model is assuming and known real mutation rates for a rough calibration. Real mutation rates are on the order of 10^-9 per base pair per generation. Using such a low rate per simulation cycle would make our simulations take forever, so we've been using much higher rates (usually 10^-4 or higher) and treating each simulation cycle as collapsing the effects of many generations. But we didn't take the implications of this seriously.
If we do, we have each simulation cycle representing 10^5 generations. How much DNA uptake should have happened over 10^5 generations? We could assume that real bacteria take up about one fragment of about 1 kb every generation. The length is consistent with a recent estimate of the length of gene-conversion tracts in Neisseria, but the frequency is just a guess. I don't know whether it's a conservative guess, but if bacteria are taking up DNA as food it's probably on the low side of reality.
How much of this DNA recombines with the chromosome? For now let's assume it all does. This means that, in each cycle of our simulation, a simulated full-size genome would replace 10^5 kb of itself by recombination with incoming DNA. Because a real genome is only about 2 x 10^3 kb, each part of it would be replaced about 50 times in each cycle. We could cause this much recombination to happen in our model, but it wouldn't simulate reality because there wouldn't be any reproduction or DNA uptake between the multiple replacements of any one position.
We can make things more realistic by (1) assuming that only about 10% of fragments taken up recombine with the genome, and (2) decreasing the genomic mutation rate by a factor of 10, so each cycle only represents 10^4 generations. Now most of the genome gets replaced once in each cycle.
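Here's the same back-of-the-envelope arithmetic as a few lines of Perl, using the guesses above (all the per-generation numbers are assumptions, not measurements):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough calibration of simulation cycles against real generations.
my $real_mu     = 1e-9;   # real mutation rate per bp per generation
my $sim_mu      = 1e-5;   # per-cycle rate used in the simulation
my $generations = $sim_mu / $real_mu;    # generations represented by one cycle

my $uptake_kb_per_gen = 1;     # kb of DNA taken up per generation (a guess)
my $recomb_fraction   = 0.1;   # fraction of taken-up DNA that recombines
my $genome_kb         = 2e3;   # approximate real genome size in kb

my $replaced_kb = $generations * $uptake_kb_per_gen * $recomb_fraction;
printf "One cycle = %g generations; about %.1f genome-equivalents recombined per cycle\n",
       $generations, $replaced_kb / $genome_kb;    # ~0.5 with these guesses
```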
What about the fragment mutation rate? We might assume that, on average, the fragments that a cell takes up come from cells that are separated from it by about 5 generations. That is, the cell taking up the DNA and the cell the DNA came from had a common ancestor 5 generations back. This means that 10 generations of mutation separate the genome from the DNA it is taking up, so the fragment-specific mutation rate should be 10 times higher than the genomic mutation rate.
So I have a simulation running that uses a genome mutation rate of 10^-5 and a fragment mutation rate of 10^-4. The fragments are 1 kb long, and the cell considers 100 of these each cycle.
One other modification of the program matters here. We've now tweaked the program so that it can either start a run with a random-sequence 'genome' it's just created, or with a DNA sequence we give it, such as one taken from a real genome with real uptake sequences.
So the run I'm trying now starts not with a random sequence but with a 50 kb segment of the H. influenzae genome. This sequence already has lots of uptake sequences in it, so about half of the 1 kb fragments the model considers in each cycle pass the test and are recombined into the genome. I'm hoping the new conditions will enable the genome to maintain these uptake sequences over the 50,000 cycle run I have going overnight.
The USS-Perl project becomes the USS-drive paper
I think I'm going to start using the label "USS-drive" for the manuscript that describes our "USS-Perl" computer simulation model. That's because the focus of the manuscript will be on understanding the nature of the molecular drive process that we think is responsible for the accumulation of uptake sequences in genomes.
The plan is to combine the modeling analysis with our unpublished results on the bias of the uptake machinery and on the nature of the motif that has accumulated in the genome.
The broad outline will be as follows:
We want to understand how the bias of the uptake machinery can affect evolution of sequences in the genome, assuming that cells sometimes take up and recombine homologous sequences from closely related cells. And we want to examine this in the context of what is seen in real cells - the real biases of the DNA uptake machineries and the real patterns of uptake-related sequences in genomes.
So we will begin by properly characterizing the genome sequences using the Gibbs Motif Sampler. I've already done this analysis for all of the genomes that have uptake sequences. And we've done Gibbs analysis on different subsets of the H. influenzae genome (coding, non-coding, potential terminators, different reading frames), and looked for evidence of covariation between different positions.
We will also collate the published uptake data for H. influenzae and N. meningitidis and N. gonorrhoeae, adding our unpublished H. influenzae data.
And then we will present the model as a way to investigate the relationship between uptake bias and genome accumulation. A key feature of the model is that it models uptake bias using a position-weight matrix that treats uptake sequences as motifs rather than as elements. That is, it specifies the value of each base at each position of the motif. This means that we can evaluate both uptake-bias data and the genome-frequency data as inputs into the model. The uptake-bias data isn't really good enough for this, and I anticipate that the main focus will be using the genome frequency data to specify uptake bias in the model.
Because the model allows the matrix to be of any length, we can use it with the full-length H. influenzae motif (30 bp), not just the core. And because the model lets us specify base composition, we can also use it for the Neisseria DUS.
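To illustrate what 'motifs rather than elements' means in practice, here is a sketch of window-by-window scoring with a position-weight matrix (a toy 4-position matrix and a made-up fragment, not the model's real settings; how window scores are combined into a fragment score is a separate modelling choice, and here I just report the best window):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy 4-position weight matrix (consensus AAGT) and a made-up fragment.
my @matrix = (
    { A => 10, C => 1, G => 1,  T => 1 },
    { A => 10, C => 1, G => 1,  T => 1 },
    { A => 1,  C => 1, G => 10, T => 1 },
    { A => 1,  C => 1, G => 1,  T => 10 },
);
my $fragment = 'CCAAGTTTAAGTAC';

# Slide the matrix along the fragment, multiplying per-base weights in
# each window; report the best-scoring window.
my ($best_score, $best_pos) = (0, -1);
for my $start (0 .. length($fragment) - @matrix) {
    my $score = 1;
    for my $i (0 .. $#matrix) {
        $score *= $matrix[$i]{ substr($fragment, $start + $i, 1) };
    }
    ($best_score, $best_pos) = ($score, $start) if $score > $best_score;
}
print "best window starts at position $best_pos with score $best_score\n";
```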
On to other manuscripts!
My bioinformatics colleague is headed back home, and our uptake sequence manuscript is waiting for the last bits of data, and for me to write a coherent and thoughtful Discussion. It's not nearly as far along as I'd hoped, but it's a much better piece of science than it was shaping up to be a week ago. One key step was switching the axes of a couple of the figures. Seems like a minor thing, but putting "% sequence identity" on the X-axis rather than the Y-axis transformed the data from meaningless to crystal clear.
The only big piece of analysis we still need is an examination of which of the H. influenzae genes that don't have BLAST hits in our three standard gamma-proteobacterial genomes also don't have hits in the more closely related A. pleuropneumoniae genome. Those that don't are stronger candidates for having entered the H. influenzae genome by horizontal gene transfer, and we predict they will also have relatively few uptake sequences.
So what's the problem with the Discussion? I don't seem to be able to see the big picture for the trees, and I'm hoping that setting the whole thing aside for a couple of weeks will let the fog dissipate.
Lord knows I have lots of other stuff to work on. I wish I was going to do some benchwork, but I'm afraid for now it's two other manuscripts.
One of these is the educational research I've been referring to as "the homework project". It's been on the back burner while the graders we hired scored the quality of students' writing on various bits of coursework and the final exam. I think this is all finished now, though I don't even know where the data is. The teaching-research post-doc who was working closely with me on this project has started her new job back east, but she'll continue to be involved in the data analysis and writing. I'm also blessed with involvement of a new teaching-research post-doc, who'll be working with me here. (Both post-docs are part of our CWSEI program, not part of my own group.) The first step to getting the analysis and manuscript-writing under way is to email both of them and get the work organized. I'll do that this morning.
The other manuscript is the project I've been referring to as "USS-Perl". The basic computer simulation model now works well, our programming assistant has gone on to other things, and the post-doc involved with this project has done a lot of runs of the simulation to pin down some basic relationships between parameter settings and results. (I forget what these are, so I need to talk to her today about this.) I have a plan for the first manuscript, which I'll spell out in a separate post later today.
Divergence of genes in different COG functional categories?
The bioinformatics manuscript is coming along nicely (though of course a lot more slowly than I had hoped).
One of the things it does is show that the densities of uptake sequences in genes do not correlate well with the 18 'COG functional categories' that genes have been assigned to. This is a significant result because a previous paper claimed a strong correlation between high uptake sequence density and assignment to a modified COG functional category containing 'genome maintenance genes'. This result was considered to provide strong support for the hypothesis that uptake sequences exist to help bacteria get useful new genes, a hypothesis I think is dead wrong.
Our hypothesis is that the distribution of uptake sequences among different types of genes (with different functions) should only reflect how strongly these genes' sequences are constrained by their functions. Our analysis would be quite a bit more impressive if we showed a positive result - that uptake sequence density in different COG functional categories correlates well with the degree of conservation of the genes in these groups. My first idea was to ask my bioinformatics collaborator to do this analysis. But I suspect it might be a lot of work, because she's only done COG analysis with the A. pleuropneumoniae genome, whereas we would want the analysis done for the H. influenzae and/or N. meningitidis genes in each COG functional category, looking at the % identity to the three 'control' homologs we've used for our other analysis.
So I'm wondering whether someone might have already done a version of this analysis. Not with H. influenzae or N. meningitidis, and not with the control homologs we've used, but any general estimate of levels of conservation of genes in the 18 COG functional categories. I searched Google Scholar for papers about COG functional group divergence and found a good review of analyses one can do in the COG framework. This got me hoping that maybe there was a web server that would let me do the analysis myself, but the paper didn't describe anything that would do the job.
But I looked deeper in Google Scholar's hits and found something that looks very promising. It examines rates of sequence evolution across genes in the gamma-proteobacteria. H. influenzae and the control genomes we used with it are all in the gamma-proteobacteria, and I think the paper has looked specifically at the relative rates of evolution of genes in different COG functional categories, so this might be exactly what I'm looking for. The only problem is, the paper appeared in PLoS Genetics, and their site is down right now! I'm trying to read the paper in Google's cached version, but the page is all greyed out and it can't show me the figures. Guess I'll just have to be patient and hope the site is back up soon.
Base composition analysis supports gene transfer
Both our results and those recently published by another group suggest that uptake sequences are less common in genes that don't have homologs in related genomes. I'm using 'related' loosely here, as our analysis searched for homologs in genes from other families of bacteria whereas the other group searched in members of the same genus. But both results are best explained by the hypothesis that genes lacking uptake sequences are often those that a genome has recently acquired by lateral transfer from genomes that don't share that species' uptake sequence.
To test this, we looked at gene properties that might give evidence of lateral transfer. The simplest such property is base composition (%G+C), a property that differs widely between different bacterial groups and changes only slowly after a sequence has been transferred to a different genome. My bioinformatics collaborator had already tabulated the %G+C of all the genes in the H. influenzae and N. meningitidis genomes, noting whether each gene had (1) uptake sequences and (2) homologs in the test genomes (see previous two posts for more explanation). She'd then calculated the means and standard deviations of the base compositions of genes in each of the four classes.
This analysis showed that, on average, the genes with no uptake sequences and no homologs also had anomalously low base compositions and larger standard deviations. But I thought the effect might be clearer with a graphical presentation, so I plotted the actual %G+C for each gene as a function of which class it is in (presence/absence of uptake sequence and homologs). I also noted sample sizes, as some classes have a lot more genes than others. This gave a nice visualization of the effects, showing, not surprisingly, that genes with homologs in the control genomes have more homogeneous base compositions than those without homologs, and that the effect is quite a bit stronger for those genes that don't have uptake sequences.
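The tabulation behind this plot is simple; here's a sketch with placeholder gene sequences and class labels (the real input is my collaborator's table of genes with their uptake-sequence/homolog classifications):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder input: gene => [class, sequence]; the real input is the
# tabulated gene list with its USS/homolog classifications.
my %genes = (
    geneA => [ 'USS+ homolog+', 'ATGGCGCATTTGGCCGAA' ],
    geneB => [ 'USS- homolog-', 'ATGAATAAATTTATTAAA' ],
);

my %gc_by_class;    # class => list of per-gene %G+C values
for my $gene (sort keys %genes) {
    my ($class, $seq) = @{ $genes{$gene} };
    my $gc = () = $seq =~ /[GC]/g;                    # count G + C bases
    push @{ $gc_by_class{$class} }, 100 * $gc / length $seq;
}

for my $class (sort keys %gc_by_class) {
    my @values = @{ $gc_by_class{$class} };
    my $mean = 0;
    $mean += $_ / @values for @values;
    printf "%-15s n=%d  mean %%G+C = %.1f\n", $class, scalar @values, $mean;
}
```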
Order of Results
We're still working out the order in which we present the different results in our draft manuscript. The first parts of the Results are now the analysis of how uptake sequence accumulation has changed the frequencies of different tripeptides in the proteome, and of why some reading frames are preferred.
The next decision is how we present the results that use alignments of uptake-sequence-encoded proteins with homologs from three 'control' genomes that do not have uptake sequences. We had tentatively planned to first describe the finding that genes that don't have uptake sequences are less likely to have homologs in the control genomes (call this analysis A), then move to the genes that do have homologs in all three control genomes, first considering the overall degree of similarity (call this analysis B) and then considering the differences between the places where the uptake sequences are and the rest of the proteins (call this analysis C).
But I was having a hard time seeing how to motivate the first analysis - it didn't easily connect with what came before. Now I think a better plan is to first describe analysis B, which shows that genes with more uptake sequences have lower similarity scores (actually identity scores - my collaborator and I disagree about the relative value of using similarity and identity scores).
By describing analysis B first, we can more logically introduce the strategy of using alignments to homologs from a standard set of control genomes, before discussing the genes that lack homologs. And the results of analysis B then motivate analysis A, looking at the genes that had to be omitted from analysis B because they didn't have all three homologs.
Organization of the Results
My bioinformatics collaborator and I are making good progress on both the writing of our manuscript and on improving the data that's going into it. One of the latter improvements came when I redid an analysis I first did about a year ago, this time being more meticulous about the input data and more thoughtful about how I used it.
The question the data address is why uptake sequences in coding regions are preferentially located in particular reading frames, specifically the effect of the codons they use on how efficiently the mRNA can be translated. I had used the overall frequencies of different codons in the proteome (all the proteins the genome encodes) to estimate codon 'quality', by summing the frequencies of the three codons specified by the uptake sequence in a specific reading frame, and dividing this by the sum of the frequencies of the best (most-used) codons for the same amino acids. This analysis showed no relationship between codon quality and the reading frame bias.
In retrospect I should have multiplied the frequencies rather than adding them. This both gives a more realistic approximation of how the codons' effects interact, and eliminates the differences caused by the differing frequencies of different amino acids in the proteome. I should also have excluded those uptake sequence positions that allowed more than one codon.
And I probably should have done the analysis for different species rather than only for H. influenzae (though, to be fair on myself, at the time I was working on a H. influenzae USS analysis).
So now I've redone the analysis for the three species that exemplify the different uptake sequence types (H. influenzae, A. pleuropneumoniae and N. meningitidis), and find a substantial effect of codon quality. That is, in each species, uptake sequences are mostly found in reading frames that allow use of the most common codons for those amino acids. As codon frequency is thought to reflect the abundance of the corresponding tRNAs, this means that differences in the efficiency of translation explain some of the differences in reading frame use.
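Here's the corrected multiplicative calculation for one reading frame of the H. influenzae core (AAGTGCGGT read as AAG-TGC-GGT, i.e. Lys-Cys-Gly), as a sketch; the codon frequencies below are placeholders, not the real proteome values:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder per-amino-acid codon usage fractions; the real numbers come
# from the H. influenzae proteome codon-usage table.
my %freq = (
    AAG => 0.25, AAA => 0.75,                              # Lys
    TGC => 0.40, TGT => 0.60,                              # Cys
    GGT => 0.35, GGC => 0.20, GGA => 0.15, GGG => 0.30,    # Gly
);
my %best = ( AAG => 'AAA', TGC => 'TGT', GGT => 'GGT' );   # most-used synonym

# Reading frame where the core AAGTGCGGT is read AAG-TGC-GGT.
my @uss_codons = qw(AAG TGC GGT);

# Multiplicative quality: product over codons of (USS codon frequency /
# best synonymous codon frequency), instead of the old ratio of sums.
my $quality = 1;
$quality *= $freq{$_} / $freq{ $best{$_} } for @uss_codons;
printf "codon quality for this reading frame: %.3f\n", $quality;
```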
Collaborative writing
My bioinformatics collaborator is in town till next Friday, so we can work together to finish up our uptake sequence manuscript. We have done (she has done) a lot of different analyses addressing various questions about how uptake sequence accumulation has affected protein evolution and vice versa, and I'm having a hard time keeping them all straight (partly because I haven't been thinking about this project for a while).
The manuscript is mostly written already; but some parts of the Results are still up in the air because they were waiting for some final data. It also needs polishing and some revising to incorporate results from a Neisseria uptake sequence paper that came out a few months ago.
To cope with all the data we've been occasionally creating 'Flow of Results' pages that summarize our results and conclusions, in the order we think the paper should present them. Yesterday we began our final push by going through the most recent version of the 'Flow of Results'. I'm not sure we have everything in the perfect order, but we have an order that's good enough, with each result triggering a question that's addressed by the next result.
Speedy simulations
Our programming assistant has finished his work for us. At least, he's finished the period when he's employed by us, but now he's gotten the research bug and wants to stay involved to see how the program he's developed for us works out.
At present it works great - MUCH faster than before. One of the reasons is that it no longer scores the genome in every cycle. I'm doing a run now with a big genome (100 kb) and I can see it pause at every 100th cycle, as it takes the time to score the genome. This is a sensible improvement, as the genome score isn't needed for the progress of the simulation - it just lets the user monitor what's happening, and provides the information used when the run needs to decide whether it has reached equilibrium.
I'm using this run to test whether the final accumulated USS motif precisely matches the biases specified in the recombination decisions (i.e. in the matrix used to score each fragment), by using a matrix where one position has substantially weaker bias than the others. If the final motif matches the matrix bias, this position should show a weaker consensus in the accumulated uptake sequences.
Of course to test this I may have to recall how to run the Gibbs Motif Sampler on the Westgrid server. The alternative is to chop my 100 kb output genome into 10 pieces and run each separately on the Gibbs web server, which has a sequence limit of 10 kb.
Sex and Recombination in Minneapolis?
The meeting on Sex and Recombination that was to be held in Iowa City next week has been canceled because of severe flooding! (And just when I was getting my talk pulled into shape too...)
I suspect I'm not the only one who planned to combine that meeting with the big Evolution meetings in Minneapolis, June 20-24. I'm flying into Minneapolis on June 15. If you'd like to get together with me or other sex-refugees, post a comment below and we'll see what we can organize.
Preparing talks
I've been struggling to pull together ideas and data for the two talks I'm giving (next week and the week after) at evolution meetings. Yesterday was my turn to give lab meeting, and I used it to get help from the post-docs. My draft talks were a mess, but the post-docs had lots of excellent suggestions. Both talks will use the same introduction to the evolutionary significance of bacterial DNA uptake, and will then diverge.
On the USS-evolution simulation front, the model is running nicely and I'm using it to quickly collect data for my first talk (20 minutes, for the Sex and Recombination meeting). But I have to compromise statistical significance with run time, as the large genomes needed to get lots of sequence data take a long time to simulate.
On the proteome-evolution front, my bioinformatics collaborator just sent me the last of the data, including a control set needed for comparison with the analysis of how divergent USS-encoded peptides are from their homologs.
Creeping up to equilibrium?
One other issue about equilibrium:
In our previous (unrealistic) model we found that USS initially accumulated very quickly, as singly and doubly mismatched sites were converted to perfectly matched sites. But this happened at the expense of depleting the genome of those mismatched sites, and further accumulation of perfect sites required waiting a long time for mutation of worse sites to regenerate the singly and doubly mismatched ones, which would then slowly allow further increase in the number of perfect matches. So achieving true equilibrium took a long time.
I expect this phenomenon to also apply in this new model. So I'm not at all confident that an early leveling-off of the rate of increase indicates closeness to the true equilibrium.
In the graphs to the left, the upper simulation probably hasn't reached equilibrium after 90,000 cycles (because the blue points indicating genome scores are still increasing), but the lower one has (because the blue points seem to be scattered around a stable mean).
I'm not sure why the lower run reached equilibrium so much faster than the upper one. Several factors differed - this is why I need to be more systematic. My excuse is that it's easier to motivate a systematic approach when individual tests are fast to do, and there are so many variables to test that I hate to spend a lot of time on just one. But it's time to treat this like real science.
Simulating USS evolution
Yes, the Perl model has progressed to the point where it's now a research tool. But I now need to focus my use of it, to get useful data rather than just noodling around to see what happens.
One remaining uncertainty is the decision that a simulation has reached an equilibrium, where forces increasing the frequency of USS-like sequences are balanced by forces decreasing it. So far I've been running simulations for a specified number of cycles instead of 'to equilibrium', because I'm not confident that they will indeed correctly identify equilibrium conditions. Now I guess I should take the settings I used for runs that did reach what I consider to be equilibrium, and rerun them 'to equilibrium' instead of to the specified number of cycles.
A problem is that the runs still take quite a long time. For example, last night I started a run using a 50kb genome, and taking up 100bp fragments. Although it was close to equilibrium after about 6 hours, the equilibrium criterion hasn't been met yet (because this criterion is quite conservative). Maybe we should use a less-conservative criterion, at least for now, because we're really mainly interested in order-of-magnitude differences at this initial stage.
One useful pair of runs I've done examined the effect of having a non-zero genome mutation rate. This is of course the only realistic treatment, but in the 'testing' runs we've had the genome mutation rate set to zero, with mutations occurring only in the fragments being considered for recombination, because otherwise USS-like sequences didn't accumulate. Both of the new runs considered a 20kb genome and 200bp fragments, with a fragment mutation rate of 0.05 per cycle. One of these runs had a genome mutation rate of zero; the equilibrium genome score was 3 x 10^9, 100-fold higher than the starting score. The other run had a genome mutation rate of 0.001; its final score was only 4 x 10^8.
This isn't surprising because mutations are much more likely to make good USS-matches worse than better, and this degeneration is only countered by selection for match-improving mutations (and against match-worsening mutations) in those fragments that recombine. So targeting mutation to fragments that might recombine increases the ratio of selected match-improving mutations to unselected mutations. Another way to look at it is that the whole genome gets mutated at its rate every generation, and none of these mutations is selected for or against unless it subsequently changes due to a new mutation arising in a fragment.
It may be that setting lower mutation rates for genomes than for fragments is equivalent to assuming that, on average, fragments are from genomes of close relatives separated by R generations from the genome under consideration (where R is the ratio of fragment rate to genome rate). This is probably a reasonable assumption.
Another issue is how much of the genome can be replaced by recombination each cycle. I've been keeping this down to about 10%, but any value can be justified by having each 'cycle' represent more or fewer generations. So if we want a cycle to represent 100 generations, we should have the amount of recombination equivalent to 100 times the amount of recombination we might expect in a single generation. As we don't even know what this number should be, I guess there's no reason not to have 100% of the genome replaced each cycle.
I don't think there's any benefit to having more than 100% replaced, as each additional recombination event would undo the effect of a previous one. Hmm, could this be viewed as a variant of the genome-coverage problems that arise in planning shotgun-sequencing projects? They want to maximize the genome coverage while minimizing the amount of sequencing they do. Here we want to maximize the amount of the genome replaced while minimizing the amount of wasteful multiple replacements. The difference is that, for genome projects, it's important to cover almost all the genome - covering 99% is MUCH better than covering only 90%, so it's worth doing a lot more sequencing. For us, the emphasis is more on avoiding wasteful recombination, and the difference between replacing 99% and replacing 90% is worth only 9% more fragment screening. I guesstimate that the best compromise will be replacing about 50-75% of the genome in each cycle.
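The coverage analogy can be made concrete: if recombination tracts land at random (the same Poisson assumption used in shotgun-coverage calculations), then recombining a total of c genome-equivalents replaces a fraction 1 - e^-c of the genome at least once. A quick sketch of the trade-off, not part of the model itself:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# If recombination tracts land at random, recombining a total of c
# genome-equivalents replaces a fraction 1 - exp(-c) of the genome at
# least once; the rest of c is "wasted" on multiple replacements.
for my $c (0.5, 0.75, 1, 1.5, 2) {
    my $replaced = 1 - exp(-$c);
    printf "c = %-4s  replaced at least once: %.2f   wasted: %.2f\n",
           $c, $replaced, $c - $replaced;
}
```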
I've raised this issue before (point 2 in this post): One problem is that, as the genome evolves to have more USS-like sequences, the number of fragments that pass the recombination criterion increases. So the above discussion applies mainly at equilibrium, when the genome will have the most USS-like sequences. We control the number of fragments that recombine by specifying the number of fragments to be considered (F) and by explicitly setting a limit (M) (e.g. a maximum of 10 fragments can recombine each cycle). Early in the run F needs to be high, or it will be many cycles before a significant number of fragments has recombined. But a high F late in the run has the simulation wasting time scoring many fragments that will never get a chance to recombine. At present I've been setting F to be 5 or 10 times larger than M, but maybe I should try reducing F and increasing M.
Means, arithmetic and geometric (= additive and multiplicative)
Now that we've replaced our old additive scoring algorithm with a multiplicative one (see this post), the algorithm that decides whether the simulation has reached its equilibrium isn't working well. The post-doc suggests that this is because we need to track the changing score of the genome using a multiplicative mean rather than an additive mean.
The test for equilibrium works as follows: The USS-score of the genome is calculated each generation, and is used to calculate both a "recent" mean score (over the interval between status updates, usually specified as a percent of the elapsed cycles) and a "grand" mean score (over the entire run). Both means are calculated as simple averages (sum of the scores divided by the number of scores). Early in the run the grand mean is much smaller than the recent mean. Equilibrium conditions are satisfied when the % difference between these means becomes smaller than some pre-specified threshold.
With additive scoring this worked well regardless of the length of the genome. But with multiplicative scoring, a single base change can cause a dramatic change in the score (ten-fold or more, depending on the scoring matrix used), especially when using short genomes. The post-doc, who is much more statistically sophisticated than I am, says that when values can differ by orders of magnitude, their means should be calculated geometrically rather than arithmetically.
Wikipedia explains that arithmetic means are the typical 'average' values, and geometric means are calculated using products and roots rather than sums and division. I'll say that another way: a geometric mean is calculated by first calculating the product of all the n values (rather than their sum) and then taking the n-th root of this product. Luckily this can be easily done using logarithms.
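Here's a minimal sketch of how this can be done (the variable and function names are mine, not necessarily the ones in her code): sum the logarithms of the scores and exponentiate the average, which is mathematically the same as taking the n-th root of the product but avoids overflowing on huge multiplicative scores.
use strict;
use warnings;
use List::Util qw(sum);
# Geometric mean via logarithms: exp( mean( log(score_i) ) ).
# Equivalent to the n-th root of the product of the scores, but numerically
# safer when the scores span many orders of magnitude.
sub geometric_mean {
my @scores = @_;
die "all scores must be positive" if grep { $_ <= 0 } @scores;
my $mean_log = sum(map { log($_) } @scores) / scalar(@scores);
return exp($mean_log);
}
my @genome_scores = (4e8, 6e8, 3e9, 5e8);
printf "arithmetic mean: %.3g\n", sum(@genome_scores) / scalar(@genome_scores);
printf "geometric mean: %.3g\n", geometric_mean(@genome_scores);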
The post-doc has modified the code to do this, but she's now gone off to Barcelona (!) for the Society for Molecular Biology and Evolution meeting. I haven't yet tried out her version, but checking whether it finds better equilibria will be much easier now that the runs go so much faster.
Thank you for the comments!
The profiling I did yesterday, using DProf as suggested in a comment from Keith, showed that most of the runtime was spent in the Switch statements that are the heart of the sliding-window scoring algorithm. In new comments, Keith and Conrad explained that 'Switch' is not the fastest way to do the scoring, and that replacing it with a cascade of if/else statements could be a lot faster. (Faithful commenter Neil had also pointed out that Switch is buggy, but it wasn't setting off any alarms.)
So I've just replaced all three of the switch scoring steps with if/else cascades, and here are the spectacular results. Thanks, guys!
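For anyone curious, here's roughly what the replacement looks like (a sketch, not the exact code now in the program), using the variable names from the sliding-window scoring loop posted earlier: the switch on the current base becomes a plain if/elsif cascade doing the same matrix lookups.
if ($currentBaseInWindow eq "A") { $forwardScore = $forwardScore*$matrix[$t][0]; }
elsif ($currentBaseInWindow eq "T") { $forwardScore = $forwardScore*$matrix[$t][1]; }
elsif ($currentBaseInWindow eq "C") { $forwardScore = $forwardScore*$matrix[$t][2]; }
elsif ($currentBaseInWindow eq "G") { $forwardScore = $forwardScore*$matrix[$t][3]; }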
Benchmarking(??) the USS model
(I don't think 'benchmarking' is actually the right term for what I've been doing...)
The comments on my last post recommended using the Perl profiling function to find out which parts of our simulation are indeed responsible for its slow runtime. So I Googled "DProf" and ran it on the simulation, and then poked around to find out how to use 'dprofpp' to turn the DProf output into a table listing where most of the time was being spent. Pretty soon I discovered that I just needed to type 'dprofpp' at the usual prompt. (duh).
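For my own future reference (and anyone else trying this), the whole procedure was just two commands, something like this (USSv5.pl stands for whichever version of the simulation is being profiled):
perl -d:DProf USSv5.pl   # runs the simulation and writes a profile to tmon.out
dprofpp                  # reads tmon.out and prints the table of time spent per subroutine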
Then I spent quite a while trying to figure out what the profile entries meant. Here's an example of the output.
- Total Elapsed Time = 65.63472 Seconds
- User+System Time = 63.53472 Seconds
- Exclusive Times
- %Time ExclSec CumulS #Calls sec/call Csec/c Name
- 41.6 26.44 37.689 543882 0.0000 0.0000 Switch::case
- 28.9 18.37 18.373 241893 0.0000 0.0000 Switch::switch
- 17.7 11.24 11.243 543882 0.0000 0.0000 Switch::__ANON__
- 16.5 10.52 38.066 12 0.8768 3.1722 main::tallyScores
- 0.25 0.156 0.195 3065 0.0001 0.0001 Text::Balanced::_match_quotelike
- 0.23 0.146 0.227 3692 0.0000 0.0001 Text::Balanced::_match_variable
- 0.23 0.143 0.608 44 0.0032 0.0138 Switch::filter_blocks
- 0.15 0.095 0.128 1599 0.0001 0.0001 Text::Balanced::_match_codeblock
- 0.06 0.040 0.040 7432 0.0000 0.0000 Text::Balanced::_failmsg
- 0.04 0.027 0.027 13146 0.0000 0.0000 Regexp::DESTROY
- 0.03 0.020 0.029 9 0.0022 0.0032 Switch::BEGIN
- 0.02 0.010 0.010 1 0.0100 0.0100 main::createRandomGenome
- 0.02 0.010 0.010 1 0.0100 0.0100 DynaLoader::dl_load_file
- 0.00 0.000 0.000 1 0.0000 0 Exporter::Heavy::heavy_export_ok_tags
- 0.00 0.000 0.000 1 0.0000 0.0000 Exporter::Heavy::heavy_export
(I tried putting the above into a monospaced font for clarity, but the editor collapsed all the extra spaces so that wasn't much help. Putting it in point form stops the editor from collapsing all the single line breaks into one big paragraph.)
The problem is that our program wasn't written with this profiler in mind, and the big time hogs aren't separate subroutines but loops within the main program. We could change this, but here 'we' means our programming assistant, not me.
I did figure out that the various 'switch' entries are steps in the sliding-window scoring. But this scoring happens to different components of the program and in different ways. I tried running the profiler on different versions of the simulation (large or small genome sequences, large or small fragments, many or few fragments) but I couldn't disentangle the effects in the resulting profiles.
So instead I began just recording the runtimes taken by sets of runs in which I systematically varied only one of the input parameters, with the simulation set to run for only 10 cycles. The first graph shows the effect of changing the genome size but holding the fragment size and number of fragments constant (at values so low that they don't contribute much to the runtime).
The first discovery is that the 'tally' we added to help us see how the genome evolves almost doubles the runtime for each genome length (compare the blue line to the red line). This is because obtaining the tally requires a sliding-window scoring step. Even though this scoring doesn't use the position-weight matrix, the time it takes is similar to the time needed for the formal scoring of the genome. The good news is that this step is completely dispensable.
The second discovery is that the steps that mutate the genome and the fragments add almost nothing to the runtime (compare the green and red lines). So we won't bother trying to further optimize the mutagenesis.
The third discovery is that the runtime is proportional to the length of the genome. Because the fragments are few and small, this is almost certainly due to the time needed for the sliding-window scoring. This tells us that reducing the number of times the genome needs to be scored should dramatically speed up the simulations.

The second graph shows how runtime depends on the number of fragments being tested. The fragment size was held at 200bp (ten times more than in the previous set of runs), and the genome length was 10kb.
Here we see that including the tally step increases all the run's times by about 20 seconds (compare blue and red lines). This is just what we expect for a 10kb genome from the previous graph. And again the mutation steps in each cycle contribute almost nothing to the runtime.
The runtime is directly proportional to the number of fragments, consistent with the sliding-window scoring being the major factor.

To confirm that other steps in processing fragments aren't a big problem, I also varied fragment length while holding fragment number constant. This third graph shows that runtime is linearly dependent on fragment length.
Finally, I held the total length of fragments to be screened constant (the product of fragment length and number of fragments screened each cycle). If the time needed for the sliding-window scoring of the fragments is the main factor responsible for the increasing runtime, then holding the total length constant should give a constant runtime even when fragment length and number are reciprocally varied.
That's exactly what we see, provided the fragments are larger than about 100bp. (As the fragment size decreases to approach the size of the window, the number of positions the window takes decreases faster than the fragment size; a 10bp fragment can only be scored at one window position.) The runtime is proportional to the length of sequence being scored each cycle.
This suggests one more test. I've been keeping either the genome length or the total fragment length constant. What if I plot runtime as a function of the total length of sequence (genome plus fragments) being scored each cycle?
If the sliding-window scoring is indeed the major factor affecting runtime, then there should be a linear relationship between total sequence and runtime. And voila, here it is.
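One way to summarize all of this (my own back-of-the-envelope model, not something in the program): each cycle the sliding window visits roughly L - w + 1 positions in the genome plus F x (f - w + 1) positions in the fragments, where w is the window size (10bp), L the genome length, f the fragment length and F the number of fragments scored per cycle. Runtime should be roughly proportional to that total, which also explains why very short fragments fall off the line - f - w + 1 shrinks faster than f does.
# Rough count of window positions scored per cycle (w = window size, 10bp here).
sub windows_scored_per_cycle {
my ($genome_length, $fragment_length, $num_fragments, $window_size) = @_;
return ($genome_length - $window_size + 1)
+ $num_fragments * ($fragment_length - $window_size + 1);
}
# e.g. a 10kb genome and 100 fragments of 200bp:
print windows_scored_per_cycle(10_000, 200, 100, 10), "\n"; # 29091 window positions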
Speeding up the simulation of USS evolution
Our Perl model of USS evolution now runs fine and does just about everything we want it to. Unfortunately it takes a long time to do this. To date we haven't bothered much about speed, being more concerned with just being able to do the job. But now it's time to improve its efficiency. I see several points where big improvements can be made, but first I should describe the steps the simulation presently takes.
1. Create genome sequence. This is done only once per run so speed isn't important.
In each cycle: (Efficiency is much more important for steps done every cycle.)
2. Mutate the genome, and choose and mutate a specified number of fragments of the genome. I think the mutation step is already quite efficient. The number of fragments that need to be processed each cycle can probably be reduced - I'll describe this below.
3. Score each fragment using a sliding window. Score the whole genome too. This scoring is probably by far the most time-consuming step.
4. Decide whether a fragment should recombine back into the genome, and do the recombination. Not a big time-suck, I think.
5. Make decisions about whether results should be reported and whether the simulation has reached equilibrium. Then report, etc. Not a big time-suck, I think.
I can see four ways to reduce the amount of time used by the sliding-window scoring:
1. Don't score the genome every cycle. The genome score is used to give the operator a sense of how the run is progressing - this only needs to be done at the reporting intervals, rather than after every cycle - and to decide whether equilibrium has been reached. To make this decision the simulation calculates a mean score over specified intervals and over the whole run. I think we could devise a way to get most of the information the equilibrium decision needs using genome scores calculated much less often, perhaps even calculated only at the reporting intervals.
2. Score only the number of fragments needed for recombination. The specified number of fragments to be mutated and scored in each run (F) is usually much larger than the maximum number of fragments allowed to recombine with the genome (M); at present these values are set at F=100 and M=10. When the genome has few or no strong matches to the USS consensus (at the start of a run), few or no fragments meet the recombination criterion, so every fragment needs to be considered. But later in the run much time is wasted scoring fragments that won't be considered for recombination.
Rather than mutating and scoring the full F fragments, then considering them for recombination until the M limit is reached, the simulation could just get M fragments, mutate, score and consider those, and then get another M fragments, until a total of M fragments have recombined.
3. Stop scoring each fragment once its score reaches a preset maximum. The program already has an implicit maximum score for fragments, set as the range of the probability a fragment will recombine. But at present the sliding window scores the whole fragment. It could instead stop scoring each fragment once this limit is reached. As in improvement 2, this won't have much effect at the start of a run, when few USS-like sequences are present, but should speed things up a lot later in the run. (A minimal sketch of this appears after the code below.)
4. Improve the efficiency of the sliding-window scoring algorithm. This is where we'd appreciate advice from any programing masters who read this blog. I'm going to paste the code for this bit below, but unfortunately I don't know how to format it so it's easily readable (all the tabs collapse). Maybe I'll first put the code without comments (tidy and compact) and then put it again with the comments.
while( $slideCount> 0)
{
my $t = 0;
my $forwardScore = 1;
while($t < $windowSize)
{
my $currentBaseInWindow = substr($fragment,($counter+$t),1);
switch ($currentBaseInWindow)
{
case "A" { $forwardScore = $forwardScore*$matrix[$t][0]; }
case "T" { $forwardScore = $forwardScore*$matrix[$t][1]; }
case "C" { $forwardScore = $forwardScore*$matrix[$t][2]; }
case "G" { $forwardScore = $forwardScore*$matrix[$t][3]; }
}
$t++;
}
$windowScore = $forwardScore;
if($windowScore> $fragmentScore)
{
$fragmentScore = $windowScore;
}
$slideCount--;
$counter++;
}
Here's the version with comments:
while( $slideCount > 0) # Keep scoring window positions until slideCount falls to zero .
{ # Prepare to score the 10 bases in the window at its present position.
my $t = 0; # t is the position of each base contained in the window, with respect to the 5' end of the window.
my $forwardScore = 1; # Score for the 10 bp sequence at each window position, compared to forward USS.
# This code scores the 10 bases of the fragment that are in the window, from t=0 to t=9.
# It first gets the base at each position t, and then gets a score for this base from the scoring matrix.
while($t < $windowSize)
{
my $currentBaseInWindow = substr($fragment,($counter+$t),1); # The base will be A or T or G or C.
switch ($currentBaseInWindow)
{
case "A" { $forwardScore = $forwardScore*$matrix[$t][0]; }
case "T" { $forwardScore = $forwardScore*$matrix[$t][1]; }
case "C" { $forwardScore = $forwardScore*$matrix[$t][2]; }
case "G" { $forwardScore = $forwardScore*$matrix[$t][3]; }
}
$t++; # Prepare to consider the next base in the window.
}
$windowScore = $forwardScore;
if($windowScore > $fragmentScore)
{
$fragmentScore = $windowScore; # Save only the highest score.
}
$slideCount--;
$counter++; # Now we slide the window to the next position in the fragment sequence.
} # End of the sliding-window loop.
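And while I'm pasting code, here's a minimal sketch of improvement 3 from the list above (my illustration, not tested code). It's just the tail of the outer loop shown above with one added line; $maxScore stands for whatever value the program already uses as the ceiling when converting a fragment's score into a recombination probability.
$windowScore = $forwardScore;
if($windowScore > $fragmentScore)
{
$fragmentScore = $windowScore; # Save only the highest score.
last if $fragmentScore >= $maxScore; # Improvement 3: no point scoring the rest of this fragment.
}
$slideCount--;
$counter++;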
Why won't USSs accumulate in our model?
Our Perl model of USS evolution in a genome runs well, but USS-like sequences accumulate only slightly. I've been playing around with various factors that might be responsible but haven't really gotten anywhere. I need to get these factors clear in my mind, so it's time for a blog post about them.
What the model does (before I started fiddling with it on Friday): (I should clarify that this version (USSv5.pl) doesn't yet have all the formatting bells and whistles that would make it graceful to use, but it does have the basic functionality we think we want.) It first creates a random 'genome' sequence of specified length and base composition, whose evolution the model is going to simulate. In each evolutionary cycle it first takes random segments of this genome, mutates them according to a specified mutation rate, and scores them for sequences similar to a sequence motif specified in the form of a matrix (see figure in this post for examples). This version of the model uses a new multiplicative scoring procedure rather than our original additive procedure. Each segment's sequence has a chance to replace its original sequence, with probability proportional to its score expressed as a fraction of some preset maximum score. The modified genome is scored by the same procedure used for the segments, and then becomes the starting sequence for the next cycle. (We had intended that the genome would undergo mutation in each cycle but this step has been bypassed because it was causing the USS-like motifs to degenerate faster than they accumulated.)
We first tested the additive and multiplicative scoring procedures to see how much difference a perfect USS sequence made to the score of otherwise random-sequence genomes. As we already knew, the additive procedure gave sequences with USS scores that were at best only about 1% higher than sequences without USS - the precise difference depends on the length of the sequence (we tested 100, 1000 and 10000 bp) and on the numbers in the scoring matrix .
The scores obtained with the multiplicative procedure were far more sensitive to the presence of a USS. For the 100bp test sequences, scores with USS were from 2-fold to 700-fold higher than for the same sequence without a USS, depending on how much higher USS-consensus bases scored than non-consensus bases. The lowest sensitivities were seen when this ratio was 2 for all positions, with higher sensitivities when the ratios were from 5-fold to 50-fold.
So this looked very promising - with such a sensitive scoring system I expected USS-like sequences to accumulate rapidly and to a high frequency. But this didn't happen. The genome scores did go up dramatically, but this turned out to be due to the much more sensitive scoring system acting on only a few base changes.
I played around with different genome sizes and fragment sizes and numbers and mutation rates and matrix values, but nothing seemed to make much difference. "Seemed" is the right word here, as I didn't keep careful records. I created or reinstated various reporting steps, so I could get a better idea of what was going on. I also replaced the preset maximum segment score with a variable score (= 5% of the previous cycle's genome score), so that the strength of the simulated bias would increase as USS-like sequences accumulated in the genome.
But USS-like sequences didn't accumulate much at all, and I don't know why. There could be a problem with the code, but none of the informal checks I've done has set off any alarm bells. There could instead be a fundamental problem with the design of the simulation, so that what we are telling the simulation to do cannot lead to the outcome we expect. Or perhaps only a very small portion of 'parameter-space' allows USS-like sequences to accumulate.
The post-doc and I came up with two approaches. One is to meticulously investigate the effects of the various factors we can manipulate, keeping everything else as simple and constant as possible. The other is to use our brains to think through what should be happening. While we're doing this our programming assistant will be adding a few more sophisticated tricks to the program, to make our analysis easier.
I'll end with a list of factors we need to methodically investigate.
Progress on USS bioinformatics
Yesterday my bioinformatics collaborator came by for a collaborational visit - she's in town for a few days. I've been quite frustrated by our slow sporadic email correspondence so it was great to discuss the work face-to-face.
We clarified a number of issues about the data - because she's generating it and I'm doing the writing about it, this is where all our problems lie. There's only one data set still to generate, a comparison of the peptides encoded by uptake sequences with their homologs in species that don't have uptake sequences. But there's still lots of writing to be done. I'm hoping she'll be able to come work with us for a couple of weeks early in the summer, so we can work on the writing together.
Biological relevance of USS scoring systems
In response to a recent post about how our USS-evolution model will score USS-like sequences, a commenter (Neil) says "I see ROC curves and cross-validation in your future." Google tells me that ROC (receiver operating characteristic) curves are graphs representing the relationship between a signal receiver's sensitivity and its specificity. They thus represent the receiver's ability to detect true positive signals and its tendency to falsely report events that aren't true signals.
Does this apply to USS? That's actually a scientific question about the nature of USS - are they really signals? USS stands for 'uptake signal sequence', a name chosen because they were assumed to have evolved so competent bacteria could distinguish closely related 'self' DNA fragments from 'foreign' DNA. Under this view the uptake machinery could be viewed as a signal receiver that needs to distinguish true USS (those on self DNA) from false-positive USS (chance occurrences of USS-like sequences in foreign DNA). (Note for new readers: the conventional 'core' USS of Haemophilus influenzae is the sequence AAGTGCGGT.)
But we don't think that USS are signals, at least not in this sense. Rather, our working hypothesis (null hypothesis) is that USS-like sequences accumulate in genomes as an accidental consequence of biases in the DNA-uptake machinery and recombination of 'uptaken' DNA with the genome. (I put 'uptaken' in quotes because it's not a real word; I'm using it because it's clearer than 'incoming', the word we usually use.) So rather than wanting a perfect scoring system to distinguish 'true' USS from 'false' ones, we would want it to reflect the real properties of the receiver (the DNA uptake machinery of real cells).
Unfortunately we don't know nearly enough about the uptake machinery to accurately describe it in a model. We know that some positions of the USS are more important than others, and that sequences outside the core matter. We have detailed analyses of the USS-like sequences that have accumulated in real genomes (see all my old posts about the Gibbs motif sampler), but we don't know how these sequences reflect the properties of the uptake machinery that caused them to accumulate. That's one of the things we hope to use the model to clarify.
For now, we don't really want to use a 'perfect' scoring system in our model. Instead, we can treat different scoring systems as different hypotheses about how differences in DNA sequences affect uptake (different hypotheses about how the machinery works). So we will start with very simple systems, such as those described in the previous post. Once we know how those affect simulated uptake and the long-term accumulation of high-scoring sequences in the genome, we can then clarify how what we know about the real uptake machinery constrains our ideas of how these sequences evolve.
Testing the new scoring system
The undergrad is creating a new version of the USS model program with the scoring done multiplicatively rather than additively (see previous post). I was originally thinking we'd start by just playing around with it, to see what happens. But after a conversation with the post-doc I realize that we should start out being more systematic.
What we need to do first is just find out what scores are produced by different sequences using this system:
- What score does a random sequence of a specified length (and base composition) produce? We'll test 100, 1000 and 10000bp.
- What score does a sequence produce that differs only in containing one USS perfectly matched to the matrix consensus?
We'll ask these questions for several scoring schemes:
- Additively scored matrix with all consensus bases worth 1 and all non-consensus bases worth 0 (like the yellow one in the previous post).
- Additively scored matrix with consensus bases at different positions weighted differently, according to our measures of their contribution to uptake. For example, some consensus bases might be worth 1 and some 3 or 5 or 10.
- Multiplicatively scored matrix with all consensus bases worth the same value (say 2 or 5), and all non-consensus bases worth 1.
- Multiplicatively scored matrix with consensus bases at different positions weighted differently, but all non-consensus bases still weighted 1.
- Multiplicatively scored matrix with the different non-consensus bases also weighted differently, perhaps with some values smaller than 1.
Scoring USS-like sequences in our model (so blind!)
Months ago (last fall?) a post-doc and I spent what seemed like a lot of time at the whiteboard in the hall, considering different ways our planned USS model might score the sequences it was considering for their similarity to a USS motif.
We eventually settled on the crude system shown on the left (yellow table). It evaluates how well the DNA sequence in a 10-base window matches the USS core consensus. Each match to the consensus earns a point, with the total score for the sequence being the sum of the points it's earned. At the time, we realized that this way of scoring had two (or three?) big problems, but we needed something simple to get the model working so we settled for this.
The first problem is that the score is not very sensitive to how good the match is. The yellow numbers beside the table show the scores earned by specific sequences. A sequence matching at all 10 positions is only 11% better than a sequence matching at 9 positions, even though we know from real uptake experiments that some single base changes can reduce uptake by more than 95%. The second problem is that this method treats all 10 positions in the motif equally. But again our uptake experiments have shown that some positions in the motif affect uptake much more strongly than others.
The third problem is that random sequences have very high scores, and adding a single perfect-match USS to such a sequence increases this baseline score only slightly.
This morning the post-doc and I reconsidered the scoring system. We expected that finding a solution to these problems would be very difficult, but we quickly came up with a much better way, illustrated by the blue table on the right of the figure. The new method is to multiply the scores of the individual positions rather than summing them. This causes the scores of well-matched sequences to be dramatically higher than those of poorer matches. And we expect (though we haven't tested this yet), that the baseline score of a random sequence will be much smaller. For now we've given all but the consensus base scores of 1, but these could be larger or smaller; for example some bases at some positions could be worth only 0.1 of a point.
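Here's a minimal sketch of the difference (the consensus and the weights are made up for illustration, not the actual values in the figure): with additive 0/1 scoring, a window matching 9 of 10 consensus bases scores 9 versus 10 for a perfect match, while with multiplicative scoring where each consensus base is worth 10 and each mismatch is worth 1, the same windows score 10^9 versus 10^10 - a ten-fold penalty per mismatched position.
use strict;
use warnings;
# Toy comparison: score a 10-base window against a consensus, additively and multiplicatively.
my @consensus = split //, 'AAGTGCGGTC'; # hypothetical 10-position motif
sub score_window {
my ($window) = @_;
my ($additive, $multiplicative) = (0, 1);
for my $t (0 .. $#consensus) {
my $hit = substr($window, $t, 1) eq $consensus[$t];
$additive += $hit ? 1 : 0; # additive: 1 point per match
$multiplicative *= $hit ? 10 : 1; # multiplicative: 10 per match, 1 per mismatch
}
return ($additive, $multiplicative);
}
printf "perfect match: additive %2d, multiplicative %g\n", score_window('AAGTGCGGTC');
printf "one mismatch: additive %2d, multiplicative %g\n", score_window('AAGTGCGGTA');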
Now that the program is working, implementing a multiplicative scoring system should be simple. I'm tempted to try it right now, but I have lots of other things I should be doing, and I'd probably just get bogged down in technical problems anyway.
The Perl code had a bug, but I found it!
The bells-and-whistles version of the Perl model of USS evolution still had a bug, which became apparent once I fiddled with the fragment scoring system to strongly favour good matches and turned off mutation of the genome sequence (so only the fragments mutated). The bug manifested itself as the program's cycles stopping at fairly random points in a run (never stopping twice at the same cycle number or genome score, as far as I could tell).
After a LOT of careful detective work on my part, entirely unencumbered by knowledge of any Perl debugging tools, I found that a 'while' counter was being incremented at the wrong place (inside an 'if' instruction that was inside its 'while' loop, instead of just inside its 'while' loop). I still don't understand why this would cause the runs to stick at random points, but maybe the undergrad can explain it to me tomorrow.
(Confession added later: Solving the problem was not just the result of my careful detective work. The final discovery was helped by luck. I had added an 'else' statement to print a report that the next step had happened, but had incorrectly inserted one too many } brackets. In solving this I accidentally removed a different bracket than the one I had incorrectly inserted, which moved the while counter outside of the 'if' loop and, I discovered, eliminated the stopping problem.)
As usual, the Perl problem was the line feeds
Somehow, in being emailed to me, the USSv4a.pl and settings.txt files both acquired nasty Mac carriage returns instead of nice well-behaved Unix line feeds. This kind of problem has arisen often enough in the past that I knew to suspect it, but I had to figure out how Komodo deals with line feed issues before I could confirm that this was the problem and correct it.
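For next time, here's a one-liner that fixes this (it converts old-Mac carriage returns, and Windows CRLF for good measure, to Unix line feeds, leaving .bak backups of the originals; the filenames are just the two that were mangled this time):
perl -pi.bak -e 's/\r\n?/\n/g' USSv4a.pl settings.txt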
Now the program runs fine, but it prints some reporting lines probably created by the undergrad while he was debugging the new bells and whistles....
...Three hours later.... I've found, understood and removed the unwanted reporting lines, and found and fixed a big mistake in how the program decided whether it was time to print out a report. And found and fixed several minor problems....
Not as simple a task as it should have been
Before he left last night the undergrad sent me what should be a fully functional version of our USS model, with the desired new features all working (they're described here). But I just tried to run it on my home computer and absolutely nothing happens. I enter "perl USSv4a.pl" at the Terminal prompt, and I just get another terminal prompt.
The problem is likely with the computer, not the program. This is the first program I've tried to run on it since I replaced the hard drive a few months ago - as everything on the old drive was lost I was effectively starting from scratch. I don't think I need to install perl - I've never had to do that before. The USSv4a.pl file is in the Rosie directory, as is the settings.txt file it calls. I could create a 'Hello World' perl file to test that perl is working but my perl book is at my office; maybe I can find one online ('Hello World' file, not perl book).
Yes I can (it's just one line), but now I discover that TextEdit will only save files as rtf, html, word and wordxml! Wtf? Guess I'd better install Komodo Edit. OK, I remember that finding a Komodo Edit download site took a bit of looking (Komodo IDE was easy to find but costs a bundle), but Google found a site for downloading Komodo Edit. Unfortunately what this has given me has a '.msi' suffix, and my Mac doesn't know what to do with it. On looking more carefully at the source site, I see that it says something about bittorrent, which I don't know how to use. Back to Google - OK, a site offering a Mac version, which is coming in (very slowly) as '.dmg'.
OK, got Komodo Edit, in Applications folder, opening for first time involves 'pre-loading' various bells and whistles, Mac gets VERY HOT (Fan Control maxed out).
OK, I created the Hello World file, and it ran fine. Hmm, maybe there IS something wrong with the USSv4a.pl program. Well, now I have Komodo Edit so I can take a look at it. It looks fine, so I created a 'test' version with Hello World print lines at various places. None of them gets printed. I see that the undergrad has inserted a 'die' reporting step if the program can't open its settings file, so I think that's not the problem. Instead Terminal acts as if the USSv4a.pl file and the new USSv4atest.pl file don't exist.
Back to the original USSv4a.pl file; now Komodo is claiming that every line fails to end with the expected "LF" end-of-line marker, but when I have it show the markers they all look fine. The USSv4atest.pl file markers look exactly the same, and Komodo has no problem with them.
Time to quit working on this for now. Maybe I'll send an email to the undergrad, or maybe I'll just wait till I get to my office (maybe not till tomorrow) and try to run it there.
Time to start preparing some talks
I have three talks to give in the next month and a half, so I need to start preparing them now.
First a 20- or 25-minute one at the annual workshop of the new CIfAR Program in Integrated Microbial Diversity, held somewhere not far from here, sometime close to the end of May. The guy in the next office invited me but he's out of town so I can't recheck the details. This talk will describe what we know about how natural selection has acted on the genes that cause bacterial genetic exchange. I think I can probably do this with slides I already have prepared.
Next, a 20-minute talk at a conference titled "Sex and Recombination: In Theory and In Practice", at the University of Iowa in mid-June. This talk will begin by introducing everything that the above talk will take 20 minutes to cover, and will then go on to explain how we are using computer simulations to understand how uptake sequences can accumulate in the genomes of competent bacteria. I hope to discuss results from our Perl model; the undergraduate is adding necessary embellishments (an oxymoron). (He thought he would have them all in place by late today but he left without passing the improved model on to me, so I suspect they're not quite debugged yet.) I don't yet have any model-specific slides for the talk, nor even a good idea of what the results will be.
And a few days after that, a 15-minute talk at the big Evolution meeting at the University of Minnesota. This will be on my work with the bioinformatician, on how the accumulation of uptake sequences in bacterial genomes has affected the ability of their genes to code for well-adapted proteins. Almost all the work is done here, and we have nice figures prepared for our manuscript. Unfortunately my collaborator has just redone some of the analysis and sent me new figures I don't completely understand. And her email response time has gotten very slow (though I guess this is only fair payback for my long silence while I was swamped with teaching).
My very old 'analytical' USS evolution model
I found my old USS evolution model in the files with my 1999 NIH grant proposal. It's not a computer simulation of evolution but analysis of equations describing an equilibrium. It uses only very simple algebra, so calling it an 'analytical' model is probably giving it more credit than it deserves. The introductory text gives an excellent description of the background, so here it is:
Again quoting from what I wrote ten years ago:
This model starts with the following assumptions:
- H. influenzae cells have a preexisting DNA uptake system that preferentially transports fragments containing a USS0 (the 9bp core: 5'AAGTGCGGT). Fragments with imperfect (USS1) sites are not favoured.
- Fragments of H. influenzae DNA are frequently brought into cells by this system.
- Once inside the cell these fragments recombine with and replace homologous regions of the chromosome.
- The DNA in the cells' environment comes from lysed cells ('donors') having the same distribution of USS0 and USS1 sites as the cells that are taking up DNA (the 'recipients').
- Random mutation acts on both USS0 and USS1 sites to destroy and create new USS.
The model then defines its terms ('variables'? 'parameters'?). It assumes that the frequencies of USS0 and USS1 observed in the real H. influenzae genome represent an equilibrium between forces that create USS0s and forces that convert them into USS1s. It then derives an equation relating the bias of the DNA uptake machinery (enrichment of USS0 over USS1) to the transformation frequency (the probability that a USS site will be replaced by a fragment brought in by the uptake machinery in each generation). Conveniently, the mutation rate drops out of the equation.
How these assumptions cause USS0 to accumulate:
Because the DNA uptake mechanism is biased in favour of fragments containing USS0 sites, any donor DNA fragment containing a new mutation that has created a USS0 will be taken up more efficiently than the wildtype version of the fragment. Similarly, donor fragments containing new mutations that have eliminated a USS0 will be taken up less efficiently than their wildtype counterparts. Consequently the DNA fragments within cells will be enriched for USSs relative to the external DNA, and recombination between these fragments and the resident chromosome will increase the number of genomic USS0s more frequently than it will decrease it. The bias will similarly affect the fate of new mutations arising within the recipient cell; mutations removing a USS0 will often be replaced by donor fragments carrying the wildtype USS0, whereas mutations creating new USS0s in the recipient will less frequently be replaced. This biased gene conversion can thus both compensate for mutational loss and amplify mutational gain of USS0s, and will cause them to be maintained at a higher equilibrium number than would be expected for a random sequence of the same base composition.
How the model works:
This model addresses processes acting within a single genome. The model uses the observed equilibrium numbers of perfect and imperfect USS sites to derive an equation relating transformation frequency to the bias of the DNA uptake system. This equation tells us the values that the transformation frequency and uptake bias parameters would have to take in order to be responsible for the maintenance of perfect and imperfect USS at the high equilibrium ratio that we observe.
Again quoting from what I wrote ten years ago:
What the final equation means: This equation tells us how frequent transformation must be (T), given a specified bias B of the DNA uptake system in favor of the USS, in order to fully account for the observed numbers of USS0 and USS1 sites in the Rd genome. The range of values is given below.
I'm not very sure that this work is sound. The mutation rate dropping out is a bit suspicious, and I'm not sure how to interpret the transformation frequency when there are no other terms depending on generations (such as mutation rate).
This is a pleasingly tidy result. It makes good biological sense, and the values are not unreasonable. Estimates of the actual bias vary, no doubt partly because they have been determined using USSs with different flanking sequences, but are usually between 10 and 100. We have no good estimate of actual transformation frequencies for H. influenzae in its natural environment, but if cells grow slowly, and frequently take up DNA as food then an average of 1% transformation per generation seems plausible, and even 10% not impossible.
Gene Transfer Agent evolution
My GTA colleague suggests that GTA may persist because of species-level selection. 'Species' is a tricky concept because these are bacteria, but we can simplify this to consider selection of lineages.
The basic idea is like that proposed to explain the surprisingly high frequency of bacterial lineages with defective mismatch-repair genes. Like most mutations, most GTA-mediated recombinational events will probably be deleterious. But some will be beneficial. Each time a beneficial recombination event occurs the lineage of cells descending from it will all contain GTA as well as the new combination. Provided the short-term and long-term costs of GTA don't cause the new lineage to go extinct or lose its GTA genes before the GTA-mediated beneficial change, lineages with GTA could take over.
Evolution-of-sex theorists have produced theoretical evaluations of such 'hitchhiking' processes, treating the allele in question as a 'recombination-modifier' (in this case the 'allele' would be the block of GTA genes). I haven't looked at the literature recently, but I think the general conclusion is that hitchhiking is common but weak; it is very unlikely to push an otherwise harmful allele to fixation. But these analyses weren't done for bacteria but for sexually reproducing eukaryotes, with the modifier controlling the frequency of meiotic crossovers between two other loci. I don't know how they would apply to bacteria.
Nevertheless, we know the ability of relatively rare beneficial events to preserve the GTA gene cluster must depend on how frequent the beneficial events are, how beneficial they are, how often GTA genes themselves undergo mutations that block their function, and how much (if any) harm the GTA genes cause. For example, if beneficial events are very rare, functional GTA genes may be lost by random mutation in the interval between beneficial events. Subsequent selection might cause the lineage to then go extinct, but it wouldn't bring the GTA genes back.
The important question is, how could we tell if this sort of thing is responsible for the persistence of the GTA genes? Short answer: I don't know.
USS don't accumulate because the bias is much too weak
In our very preliminary version of the USS evolution model, we've been using a very simple scheme to score the similarity of DNA sequences to the USS motif: we just count the number of matches to the 10 bp core USS sequence. Right now I'm keeping everything simple by running the model with DNA fragments that are only 13 bp long (and a genome that's 200-1000 bp long).
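For concreteness, here's a minimal Perl sketch of that scoring step. It's not our actual script, and the core below is just a placeholder 10 bp motif; the idea is simply to slide the core along the fragment and take the best per-position match count.

    #!/usr/bin/perl
    # Minimal sketch of the simple scoring scheme (not our actual script):
    # slide the core along the fragment and report the best number of
    # matching positions.
    use strict;
    use warnings;

    my $core = 'AAAGTGCGGT';    # placeholder 10 bp core -- substitute the real motif

    sub score_fragment {
        my ($frag) = @_;
        my $best = 0;
        for my $start (0 .. length($frag) - length($core)) {
            my $window  = substr($frag, $start, length($core));
            my $matches = scalar grep {
                substr($window, $_, 1) eq substr($core, $_, 1)
            } 0 .. length($core) - 1;
            $best = $matches if $matches > $best;
        }
        return $best;
    }

    # a 13 bp fragment containing a perfect core scores 10
    print score_fragment('TTAAAGTGCGGTC'), "\n";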
So a fragment with a perfect 10 bp match to the motif is only twice as likely to recombine back into the genome as a fragment with only 5 bp matching. We know from our earlier model, and from a calculation I did years ago, that the bias favouring USS needs to be much stronger than this if it is to overcome the randomizing effects of mutation. For example (ignoring the effects of base composition), a sequence that matches the motif at 9 positions has 27 different ways to mutate into a fragment that matches at only 8 positions (each of the 9 matching bases can change to any of 3 other bases), but only one way to mutate into a fragment that matches at all 10 positions (the single mismatched base has to change to the one correct base). To overcome this disproportion, the bias favouring 9 over 8 (or is it 10 over 9?) has to be proportionally strong (i.e. 27-fold).
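Here's a quick throwaway script that enumerates the single-base changes and confirms that 27-to-1 arithmetic (again, the core is just a placeholder):

    #!/usr/bin/perl
    # Check of the 27-to-1 disproportion: start from a sequence matching the
    # core at 9 of 10 positions, enumerate every possible single-base change,
    # and count how many land on 8 matches versus 10 matches.
    use strict;
    use warnings;

    my $core = 'AAAGTGCGGT';        # placeholder 10 bp core
    my $nine = $core;
    substr($nine, 0, 1) = 'C';      # one mismatch (assumes the core doesn't start with C)

    sub matches {
        my ($seq) = @_;
        return scalar grep { substr($seq, $_, 1) eq substr($core, $_, 1) }
                      0 .. length($core) - 1;
    }

    my %ways;
    for my $pos (0 .. length($nine) - 1) {
        for my $base (qw(A C G T)) {
            next if $base eq substr($nine, $pos, 1);    # not a mutation
            my $mut = $nine;
            substr($mut, $pos, 1) = $base;
            $ways{ matches($mut) }++;
        }
    }
    print "single changes giving 8 matches:  $ways{8}\n";     # 27
    print "single changes giving 10 matches: $ways{10}\n";    # 1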
Now I need to find that old calculation - I think it might be with my 1999 NIH proposal.
Why don't USS accumulate in our model?
The undergraduate is working at home, making improvements to the usability of our Perl model of USS evolution, while I'm here doing some test runs with the slightly unwieldy version we have now.
This is the version that incorporates the first major improvement. Instead of a single "best-score" fragment recombining with the genome in each cycle, each of the scored fragments can recombine with the genome, with a probability proportional to its score. In principle this should allow the recombination to introduce new USS-like sequences faster than they are lost from the genome by mutation. But in practice the USS-score of the genome drifts up and down but doesn't consistently increase.
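In case it helps to see the idea, here's a rough sketch of that recombination step. It's not the real model: each fragment's probability of recombining is just its score divided by 10, so a perfect 10/10 fragment always recombines and a 5/10 fragment recombines half the time. The positions, fragments and scores in the usage lines are made up.

    #!/usr/bin/perl
    # Rough sketch of probability-proportional-to-score recombination
    # (not the real model).
    use strict;
    use warnings;

    sub recombine_fragments {
        my ($genome, @scored) = @_;    # each entry: [ position, fragment, score ]
        for my $entry (@scored) {
            my ($pos, $frag, $score) = @$entry;
            next unless rand() < $score / 10;               # biased, probabilistic uptake
            substr($genome, $pos, length($frag)) = $frag;   # paste the fragment in
        }
        return $genome;
    }

    # toy usage with made-up positions, fragments and scores
    my $genome = 'A' x 200;
    $genome = recombine_fragments(
        $genome,
        [  50, 'TTAAAGTGCGGTC', 10 ],    # perfect match: always recombines
        [ 120, 'TTACAGTGAGGTC',  5 ],    # weaker match: recombines about half the time
    );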
"When in doubt, run more controls."
So I've made a modified version of this program that has the same feature as our PositiveControl.pl program (its name is PositiveControlv2.pl). Instead of mutating random fragments and recombining them back into the genome, it replaces their sequences with perfect USS cores. Of course the fragments need to be the same length as the core.
OK, this positive control tells me that the program is doing what it should. 10 fragments are recombining into the genome each cycle, which is what should happen to the 10 fragments the program considers each generation. Now I'll test PositiveControlv3, which puts in an imperfect USS instead of a perfect one. If the program is doing what I want, on average 8 of the 10 fragments will recombine each generation, and the genome score will not get as high as with perfect USSs.
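Just to reassure myself about that 8-of-10 expectation, here's a throwaway check of the probability rule a v3-style control should obey (this isn't PositiveControlv3 itself, just the arithmetic behind it):

    #!/usr/bin/perl
    # If 10 fragments each score 8/10 and recombine with probability
    # proportional to score, the average per cycle should be about 8.
    use strict;
    use warnings;

    my $cycles = 10_000;
    my $total  = 0;
    for (1 .. $cycles) {
        $total += scalar grep { rand() < 8 / 10 } 1 .. 10;    # 10 fragments, score 8
    }
    printf "average fragments recombining per cycle: %.2f\n", $total / $cycles;    # ~8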
OK, that worked too. So I think the problem may not be with the implementation of the code, but with the design of this version of the model. Back to the drawing board... (literally, back to the big whiteboard in the hall outside my office).