Field of Science

I decided to do a USS-model run with the values in the DNA-uptake matrix increased five-fold, thinking this meant that drive favouring uptake sequences would be much stronger.  This should have meant that, when beginning with a random sequence, there would initially be very few fragments that had sufficiently high scores to pass the probability filter and recombine with the genome.  So I was surprised when the frequency of recombination at the beginning of the run looked to be about the same as with the normal matrix.

But then I remembered that one nice feature of the latest version of the model is that it uses whatever values are in the DNA-uptake matrix to calculate the recombination probability filter, setting it equal to the score earned by a single site that perfectly matches the matrix-specified preferences.  This is good under some circumstances, but in the present case it entirely defeats what I was trying to do.

So I guess it's time to tweak the model one more time, introducing a coefficient by which the probability filter can be multiplied.  Ideally we'd like to be able to sometimes have this coefficient change during the run, as this would allow us to simulate evolution of the uptake specificity itself.  I'm confident that I can introduce a constant coefficient (read in with the other run settings); I wonder if I can easily make it a variable...
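
Something like this is what I have in mind (just a sketch: the variable names are made up, the matrix is a toy, and I'm assuming a site's score is the sum of the matrix values for its bases, which may not be how the real script combines them):

    use strict;
    use warnings;

    # A sketch only: a coefficient, read in with the other run settings,
    # scales the automatically calculated recombination probability filter.
    my %matrix = (
        A => [0.90, 0.10, 0.10, 0.80],
        C => [0.03, 0.10, 0.70, 0.10],
        G => [0.04, 0.70, 0.10, 0.05],
        T => [0.03, 0.10, 0.10, 0.05],
    );
    my $filter_coefficient = 2;    # would be read in with the other run settings

    # Score of a site that perfectly matches the matrix-specified preferences:
    my $perfect_score = 0;
    for my $pos (0 .. 3) {
        my ($best) = sort { $b <=> $a } map { $matrix{$_}[$pos] } qw(A C G T);
        $perfect_score += $best;
    }

    # The probability filter, now scaled by the coefficient:
    my $probability_filter = $filter_coefficient * $perfect_score;
    print "probability filter = $probability_filter\n";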

Test post from iPhone

The culture media problems we were having last fall are back. This time the postdocs suspect the medium, so they've ordered a new stock of the Bacto stuff.

I think I won't try to post any more from my phone until my iPhone typing speed improves.

Sent from my iPhone (by email to Blogger, because the iPhone doesn't support MMS messaging).

How arbitrary is a position weight matrix?

Recently I've run some USS-evolution simulations that started with a 50 kb segment of the H. influenzae genome rather than with a random sequence of base pairs. I used the position weight matrix derived by Gibbs analysis of the whole genome, thinking that this would be a non-arbitrary measure of the over-representation of the uptake sequence pattern. I was hoping that this bias would be strong enough to maintain the uptake sequences already in the genome, but the genome score fell to about half (or less) of its original value after 50,000 or 100,000 cycles.

That started me wondering whether the position weight matrix should be treated as a fixed set of values, or just as telling us the relative value that should be applied to each base at each position. Said another way, could different settings of the Gibbs analysis have given a very different matrix? The answer is No, but only because the Gibbs analysis reports the weight matrix as the frequency of each base at each position, so the weights of the four bases at each position must sum to 1. So if we want to consider stronger or weaker matrices, there's no reason not to multiply all the values by any factor we want to test.
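
Scaling the whole matrix is a one-line operation; here's a sketch with an invented two-position matrix just to show what I mean:

    use strict;
    use warnings;

    # Sketch: multiply every entry of the base-frequency matrix by a chosen factor.
    # This two-position matrix is invented for illustration; each column sums to 1.
    my %matrix = (
        A => [0.85, 0.05],
        C => [0.05, 0.05],
        G => [0.05, 0.85],
        T => [0.05, 0.05],
    );
    my $factor = 5;    # the matrix strength we want to test
    for my $base (keys %matrix) {
        $_ *= $factor for @{ $matrix{$base} };
    }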

So I think the first goal is to see what strength of matrix is needed to maintain the uptake sequences in the genome at their natural frequencies. Then we can see whether the uptake sequences are still in the places they started at, or whether those have decayed and new ones have arisen in new places.

Some calculations and implications

In playing around with our Perl model of uptake sequence evolution, I've been surprised by how weak the effect of molecular drive seemed to be.  But I just did some back-of-the-envelope approximations and realized that our settings may have been unreasonable.

We can use the relationship between the genomic mutation rate the model is assuming and known real mutation rates for a rough calibration.   Real mutation rates are on the order of 10^-9 per base pair per generation.  Using such a low rate per simulation cycle would make our simulations take forever, so we've been using much higher rates (usually 10^-4 or higher) and treating each simulation cycle as collapsing the effects of many generations.  But we didn't take the implications of this seriously.  

If we do, we have each simulation cycle representing 10^5 generations.  How much DNA uptake should have happened over 10^5 generations?  We could assume that real bacteria take up about one fragment of about 1 kb every generation.  The length is consistent with a recent estimate of the length of gene-conversion tracts in Neisseria, but the frequency is just a guess.  I don't know whether it's a conservative guess, but if bacteria are taking up DNA as food it's probably on the low side of reality.  

How much of this DNA recombines with the chromosome?  For now let's assume it all does.  This means that, in each cycle of our simulation, a simulated full-size genome would replace 10^5 kb of itself by recombination with incoming DNA.  Because a real genome is only about 2 x 10^3 kb, each part of it would be replaced about 50 times in each cycle.  We could cause this much recombination to happen in our model, but it wouldn't simulate reality, because there wouldn't be any reproduction or DNA uptake between the multiple replacements of any one position.  

We can make things more realistic by (1) assuming that only about 10% of the fragments taken up recombine with the genome, and (2) decreasing the genomic mutation rate by a factor of 10, so that each cycle represents only 10^4 generations.  Now, on average, about half of the genome gets replaced in each cycle.
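
Here's that arithmetic, with the two adjustments, spelled out in a few lines of Perl (the numbers are just the rough guesses above, not measurements):

    use strict;
    use warnings;

    # Back-of-the-envelope calibration: one 1 kb fragment taken up per generation,
    # ~10% of fragments recombining, and a ~2 x 10^3 kb genome (all rough guesses).
    my $real_rate      = 1e-9;                        # mutations per bp per generation
    my $cycle_rate     = 1e-5;                        # genomic mutation rate per simulation cycle
    my $gens_per_cycle = $cycle_rate / $real_rate;    # 10^4 generations per cycle
    my $kb_taken_up    = $gens_per_cycle * 1;         # 1 kb taken up per generation
    my $kb_recombined  = $kb_taken_up * 0.1;          # only ~10% recombines
    my $genome_kb      = 2e3;
    printf "%g generations/cycle; %g kb recombined = %.1f genome equivalents\n",
        $gens_per_cycle, $kb_recombined, $kb_recombined / $genome_kb;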

What about the fragment mutation rate?  We might assume that, on average, the fragments that a cell takes up come from cells that are separated from it by about 5 generations.  That is, the cell taking up the DNA and the cell the DNA came from had a common ancestor 5 generations back.  This means that 10 generations of mutation separate the genome from the DNA it is taking up, so the fragment-specific mutation rate should be 10 times higher than the genomic mutation rate.

So I have a simulation running that uses a genome mutation rate of 10^-5 and a fragment mutation rate of 10^-4.  The fragments are 1 kb long, and the cell considers 100 of these each cycle.  

One other modification of the program matters here.  We've now tweaked the program so that it can start a run either with a random-sequence 'genome' it has just created, or with a DNA sequence we give it, which can be taken from a real genome with real uptake sequences.  

So the run I'm trying now starts not with a random sequence but with a 50 kb segment of the H. influenzae genome.  This sequence already has lots of uptake sequences in it, so about half of the 1 kb fragments the model considers in each cycle pass the test and are recombined into the genome.  I'm hoping the new conditions will enable the genome to maintain these uptake sequences over the 50,000 cycle run I have going overnight.
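
For the record, the settings for this run amount to something like the following (the parameter names are invented for illustration; the real settings file uses its own names):

    use strict;
    use warnings;

    # Hypothetical settings for the run described above (names invented for illustration).
    my %settings = (
        genome_source          => 'HI_50kb_segment.fasta',  # real sequence instead of random
        genome_mutation_rate   => 1e-5,
        fragment_mutation_rate => 1e-4,
        fragment_length_kb     => 1,
        fragments_per_cycle    => 100,
        cycles                 => 50_000,
    );
    print "$_ = $settings{$_}\n" for sort keys %settings;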

The USS-Perl project becomes the USS-drive paper

I think I'm going to start using the label "USS-drive" for the manuscript that describes our "USS-Perl" computer simulation model.  That's because the focus of the manuscript will be on understanding the nature of the molecular drive process that we think is responsible for the accumulation of uptake sequences in genomes.
The plan is to combine the modeling analysis with our unpublished results on the bias of the uptake machinery and on the nature of the motif that has accumulated in the genome.
The broad outline will be as follows:
We want to understand how the bias of the uptake machinery can affect evolution of sequences in the genome, assuming that cells sometimes take up and recombine homologous sequences from closely related cells.  And we want to examine this in the context of what is seen in real cells - the real biases of the DNA uptake machineries and the real patterns of uptake-related sequences in genomes.
So we will begin by properly characterizing the genome sequences using the Gibbs Motif Sampler.  I've already done this analysis for all of the genomes that have uptake sequences.  And we've done Gibbs analysis on different subsets of the H. influenzae genome (coding, non-coding, potential terminators, different reading frames), and looked for evidence of covariation between different positions.
We will also collate the published uptake data for H. influenzae and N. meningitidis and N. gonorrhoeae, adding our unpublished H. influenzae data.  
And then we will present the model as a way to investigate the relationship between uptake bias and genome accumulation.  A key feature of the model is that it models uptake bias using a position-weight matrix that treats uptake sequences as motifs rather than as elements.  That is, it specifies the value of each base at each position of the motif.  This means that we can evaluate both uptake-bias data and the genome-frequency data as inputs into the model.  The uptake-bias data isn't really good enough for this, and I anticipate that the main focus will be using the genome frequency data to specify uptake bias in the model.
Because the model allows the matrix to be of any length, we can use it with the full-length H. influenzae motif (30 bp), not just the core.  And because the model lets us specify base composition, we can also use it for the Neisseria DUS. 
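
To make the 'motif rather than element' point concrete, here's a sketch of how a position-weight matrix can score the sites in a fragment. The matrix values and the sum-of-weights scoring rule are simplified stand-ins, not necessarily what the model actually does:

    use strict;
    use warnings;

    # Sketch: score every window of a fragment against a position-weight matrix.
    # The matrix values and the sum-of-weights scoring rule are illustrative only.
    my %matrix = (
        A => [0.90, 0.10, 0.10, 0.80],
        C => [0.03, 0.10, 0.70, 0.10],
        G => [0.04, 0.70, 0.10, 0.05],
        T => [0.03, 0.10, 0.10, 0.05],
    );
    my $motif_length = 4;
    my $fragment     = 'TTAGCAGGTT';    # a made-up fragment

    my $best_score = 0;
    for my $start (0 .. length($fragment) - $motif_length) {
        my @site  = split //, substr($fragment, $start, $motif_length);
        my $score = 0;
        $score += $matrix{ $site[$_] }[$_] for 0 .. $motif_length - 1;
        $best_score = $score if $score > $best_score;
    }
    print "best site score in this fragment: $best_score\n";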

On to other manuscripts!

My bioinformatics colleague is headed back home, and our uptake sequence manuscript is waiting for the last bits of data, and for me to write a coherent and thoughtful Discussion. It's not nearly as far along as I'd hoped, but it's a much better piece of science than it was shaping up to be a week ago. One key step was switching the axes of a couple of the figures. Seems like a minor thing, but putting "% sequence identity" on the X-axis rather than the Y-axis transformed the data from meaningless to crystal clear.
The only big piece of analysis we still need is an examination of which of the H. influenzae genes that don't have BLAST hits in our three standard gamma-proteobacterial genomes also don't have hits in the more closely related A. pleuropneumoniae genome. Those that don't are stronger candidates for having entered the H. influenzae genome by horizontal gene transfer, and we predict they will also have relatively few uptake sequences.
So what's the problem with the Discussion? I don't seem to be able to see the big picture for the trees, and I'm hoping that setting the whole thing aside for a couple of weeks will let the fog dissipate.
Lord knows I have lots of other stuff to work on. I wish I was going to do some benchwork, but I'm afraid for now it's two other manuscripts.
One of these is the educational research I've been referring to as "the homework project". It's been on the back burner while the graders we hired scored the quality of students' writing on various bits of coursework and the final exam. I think this is all finished now, though I don't even know where the data is. The teaching-research post-doc who was working closely with me on this project has started her new job back east, but she'll continue to be involved in the data analysis and writing. I'm also blessed with involvement of a new teaching-research post-doc, who'll be working with me here. (Both post-docs are part of our CWSEI program, not part of my own group.) The first step to getting the analysis and manuscript-writing under way is to email both of them and get the work organized. I'll do that this morning.
The other manuscript is the project I've been referring to as "USS-Perl". The basic computer simulation model now works well, our programming assistant has gone on to other things, and the post-doc involved with this project has done a lot of runs of the simulation to pin down some basic relationships between parameter settings and result. (I forget what these are, so I need to talk to her today about this.) I have a plan for the first manuscript, which I'll spell out in a separate post later today.

Divergence of genes in different COG functional categories?

The bioinformatics manuscript is coming along nicely (though of course a lot more slowly than I had hoped).

One of the things it does is show that the densities of uptake sequences in genes do not correlate well with the 18 'COG functional categories' that genes have been assigned to.  This is a significant result, because a previous paper claimed a strong correlation between high uptake sequence density and assignment to a modified COG functional category containing 'genome maintenance genes'.  That result was considered to provide strong support for the hypothesis that uptake sequences exist to help bacteria get useful new genes, a hypothesis I think is dead wrong.

Our hypothesis is that the distribution of uptake sequences among different types of genes (with different functions) should only reflect how strongly these genes' sequences are constrained by their functions.  Our analysis would be quite a bit more impressive if we showed a positive result - that uptake sequence density in different COG functional categories correlates well with the degree of conservation of the genes in these groups.  My first idea was to ask my bioinformatics collaborator to do this analysis.  But I suspect it might be a lot of work, because she has only done COG analysis with the A. pleuropneumoniae genome, whereas we would want the analysis done with the H. influenzae and/or N. meningitidis genes assigned to COG functional categories, looking at the % identity of the three 'control' homologs we've used for our other analysis.

So I'm wondering whether someone might have already done a version of this analysis.  Not with H. influenzae or N. meningitidis, and not with the control homologs we've used, but any general estimate of levels of conservation of genes in the 18 COG functional categories.  I searched Google Scholar for papers about COG functional group divergence and found a good review of analyses one can do in the COG framework.  This got me hoping that maybe there was a web server that would let me do the analysis myself, but the paper didn't describe anything that would do the job.

But I looked deeper in Google Scholar's hits and found something that looks very promising.  It examines rates of sequence evolution across genes in the gamma-proteobacteria.  H. influenzae and the control genomes we used with it are all in the gamma-proteobacteria, and I think the paper has looked specifically at the relative rates of evolution of genes in different COG functional categories, so this might be exactly what I'm looking for.  The only problem is, the paper appeared in PLoS Genetics, and their site is down right now!  I'm trying to read the paper in Google's cached version, but the page is all greyed out and it can't show me the figures.  Guess I'll just have to be patient and hope the site is back up soon.

Base composition analysis supports gene transfer

Both our results and those recently published by another group suggest that uptake sequences are less common in genes that don't have homologs in related genomes.  I'm using 'related' loosely here, as our analysis searched for homologs in genes from other families of bacteria whereas the other group searched in members of the same genus.  But both results are best explained by the hypothesis that genes lacking uptake sequences are often those that a genome has recently acquired by lateral transfer from genomes that don't share that species' uptake sequence. 

To test this, we looked at gene properties that might give evidence of lateral transfer.  The simplest such property is base composition (%G+C), a property that differs widely between different bacterial groups and changes only slowly after a sequence has been transferred to a different genome.  My bioinformatics collaborator had already tabulated the %G+C of all the genes in the H. influenzae and N. meningitidis genomes, noting whether each gene had (1) uptake sequences and (2) homologs in the test genomes (see previous two posts for more explanation).  She'd then calculated the means and standard deviations of the base compositions of genes in each of the four classes.  

This analysis showed that, on average, the genes with no uptake sequences and no homologs also had anomalously low base compositions and larger standard deviations.  But I thought the effect might be clearer with a graphical presentation, so I plotted the actual %G+C for each gene as a function of which class it is in (presence/absence of uptake sequence and homologs).  I also noted sample sizes, as some classes have a lot more genes than others.  This gave a nice visualization of the effects, showing, not surprisingly, that genes with homologs in the control genomes have more homogeneous base compositions than those without homologs, and that the effect is quite a bit stronger for those genes that don't have uptake sequences.
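
The per-gene calculation behind the plot is simple; here's a sketch (the sequences and class labels are placeholders, not real data):

    use strict;
    use warnings;

    # Sketch: %G+C for each gene, grouped by its uptake-sequence / homolog class.
    # The sequences and class labels are placeholders, not real data.
    my %genes = (
        geneA => { seq => 'ATGGCGCATCCGTGA', class => 'USS+ homologs+' },
        geneB => { seq => 'ATGAATAAATTATAA', class => 'USS- homologs-' },
    );
    my %gc_by_class;
    for my $gene (keys %genes) {
        my $seq = uc $genes{$gene}{seq};
        my $gc  = ($seq =~ tr/GC//) / length($seq) * 100;
        push @{ $gc_by_class{ $genes{$gene}{class} } }, $gc;
    }
    for my $class (sort keys %gc_by_class) {
        my @vals = @{ $gc_by_class{$class} };
        my $mean = 0;
        $mean += $_ / @vals for @vals;
        printf "%-15s n=%d  mean %%G+C = %.1f\n", $class, scalar @vals, $mean;
    }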

Order of Results

We're still working out the order in which we present the different results in our draft manuscript.  The first parts of the Results are now the analysis of how uptake sequence accumulation has changed the frequencies of different tripeptides in the proteome, and of why some reading frames are preferred.  

The next decision is how we present the results that use alignments of uptake-sequence-encoded proteins with homologs from three 'control'  genomes that do not have uptake sequences.  We had tentatively planned to first describe the finding that genes that don't have uptake sequences are less likely to have homologs in the control genomes (call this analysis A), then move to the genes that do have homologs in all three control genomes, first considering the overall degree of similarity (call this analysis B) and then considering the differences between the places where the uptake sequences are and the rest of the proteins (call this analysis C). 

But I was having a hard time seeing how to motivate the first analysis - it didn't easily connect with what came before. Now I think a better plan is to first describe analysis B, which shows that genes with more uptake sequences have lower similarity scores (actually identity scores - my collaborator and I disagree about the relative value of using similarity and identity scores). 

By describing analysis B first, we can more logically introduce the strategy of using alignments to homologs from a standard set of control genomes, before discussing the genes that lack homologs.  And the results of analysis B then motivate analysis A, looking at the genes that had to be omitted from analysis B because they didn't have all three homologs.

Organization of the Results

My bioinformatics collaborator and I are making good progress both on the writing of our manuscript and on improving the data that's going into it.  One improvement came when I redid an analysis I first did about a year ago, this time being more meticulous about the input data and more thoughtful about how I used it.

The question the data address is why uptake sequences in coding regions are preferentially located in particular reading frames, specifically the effect of the codons they use on how efficiently the mRNA can be translated.  I had used the overall frequencies of different codons in the proteome (all the proteins the genome encodes) to estimate codon 'quality', by summing the frequencies of the three codons specified by the uptake sequence in a specific reading frame, and dividing this by the sum of the frequencies of the best (most-used) codons for the same amino acids.  This analysis showed no relationship between codon quality and the reading frame bias.  

In retrospect I should have multiplied the frequencies rather than adding them.  This both gives a more realistic approximation of how the codons' effects interact, and eliminates the differences caused by the differing frequencies of different amino acids in the proteome.  I should also have excluded those uptake sequence positions that allowed more than one codon. 
And I probably should have done the analysis for different species rather than only for H. influenzae (though, to be fair to myself, at the time I was working on an H. influenzae USS analysis). 
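
Here's a sketch of the corrected calculation, using the H. influenzae core USS read in one frame (Lys-Cys-Gly) as the example; the codon frequencies are invented placeholders, since the real analysis uses genome-wide proteome counts:

    use strict;
    use warnings;

    # Sketch: codon 'quality' as a product of per-codon ratios rather than a sum.
    # %codon_freq holds proteome-wide codon frequencies (values invented here);
    # %best_freq holds the frequency of the most-used codon for each amino acid.
    my %codon_freq = ( AAG => 0.020, TGC => 0.004, GGT => 0.025 );
    my %best_freq  = ( Lys => 0.045, Cys => 0.005, Gly => 0.030 );

    my @uss_codons = qw(AAG TGC GGT);   # core USS read in one frame: Lys-Cys-Gly
    my @amino_acids = qw(Lys Cys Gly);

    my $quality = 1;
    for my $i (0 .. $#uss_codons) {
        $quality *= $codon_freq{ $uss_codons[$i] } / $best_freq{ $amino_acids[$i] };
    }
    printf "codon quality in this frame: %.2f\n", $quality;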

So now I've redone the analysis for the three species that exemplify the different uptake sequence types (H. influenzae, A. pleuropneumoniae and N. meningitidis), and find a substantial effect of codon quality.  That is, in each species, uptake sequences are mostly found in reading frames that allow use of the most common codons for those amino acids.  As codon frequency is thought to reflect the abundance of the corresponding tRNAs, this means that differences in the efficiency of translation explain some of the differences in reading frame use.

Collaborative writing

My bioinformatics collaborator is in town till next Friday, so we can work together to finish up our uptake sequence manuscript.  We have done (she has done) a lot of different analyses addressing various questions about how uptake sequence accumulation has affected protein evolution and vice versa, and I'm having a hard time keeping them all straight (partly because I haven't been thinking about this project for a while).  

The manuscript is mostly written already, but some parts of the Results are still up in the air because they are waiting for some final data.  It also needs polishing and some revising to incorporate results from a Neisseria uptake sequence paper that came out a few months ago.

To cope with all the data, we've occasionally created 'Flow of Results' pages that summarize our results and conclusions, in the order we think the paper should present them. Yesterday we began our final push by going through the most recent version of the 'Flow of Results'.  I'm not sure we have everything in the perfect order, but we have an order that's good enough, with each result triggering a question that's addressed by the next result.