Field of Science

Now that's a time course!

I tested the effect of allowing the cells time to express their new kanamycin-resistance gene, and found that it didn't matter at all. But everything seemed to be behaving well, so I went ahead and repeated the big time course experiment anyway. And it worked very nicely this time.

The top graph shows how the cultures grew under the different treatments. BHI is the rich medium, and they grew nicely in it. Adding 1mM cAMP slowed growth down a little bit, which is not surprising, as cAMP is a powerful metabolic signal molecule. Transferring the cells to the starvation medium MIV stopped their growth, and even caused quite a drop in cfu/ml, but after a few hours they began to grow again. This could be because A. pleuropneumoniae differs from H. influenzae in being able to synthesize its own pyrimidines - we would need to check its genome.

The lower graph shows the transformation frequencies of the cultures at the same times the cfu were measured. Cells in BHI did become quite competent when the culture density got high, just as in H. influenzae. Transfer to MIV rapidly stimulated competence, but only to the same level that develops 'spontaneously' when the culture gets dense in BHI. (In H. influenzae MIV competence is about 10-100-fold higher.) Adding cAMP to the BHI didn't appear to affect competence at all; the slightly lower competence is likely an indirect effect of the slightly slower growth rate.

This is a prettier time course than the one I was trying to replicate, so it will probably be the figure that goes into the manuscript.

Praise those who post Excel Tips!

Here's a graph of some data:
How do I tell if the blue slope is significantly different from either of the grey slopes? I know just enough about statistics to know that there will be a way to test this, but not enough to do it. And the post-doc I've relied on for statistics help just moved on to a second post-doc position on the other side of the country. What to do?

Google the problem, of course. I think I searched with 'calculate confidence interval for slope', and that led me to this page, one of a collection of Excel Tips for scientists and engineers posted by one Bernard Liengme, a retired professor of chemistry and lecturer in information systems at St. Francis Xavier University in Nova Scotia.

At first the page looked quite daunting. The post-doc confirmed by email that these were instructions for doing what I wanted, but didn't offer to do it for me. I then tried clicking on the link to the sample workbook, which gave me an actual Excel file with the example calculation all set up. So I just did to my data what the example did to its data, and presto, I have the confidence intervals for my lines! It's a bit embarrassing to admit that I don't know what the "INDEX(LINEST" command does, but then I don't know what's in the secret buffers and columns of the kits we use either.
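For anyone without Excel handy, the calculation the workbook does -- the slope's standard error from the regression residuals, times a t critical value -- can be sketched in a few lines of Python. This is purely illustrative: the data points and the hard-coded t value (2.306, from a t-table for 8 degrees of freedom, 95% two-tailed) are made up, not my real measurements.

```python
import math

def slope_with_ci(xs, ys, t_crit):
    """Least-squares slope and its confidence interval. t_crit is
    the two-tailed critical t value for n - 2 degrees of freedom,
    looked up in a t-table (e.g. 2.306 for df = 8, 95%)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # residual sum of squares -> standard error of the slope
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope - t_crit * se_slope, slope + t_crit * se_slope

# made-up data, 10 points, so df = 8
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0.1, 1.3, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1, 9.0]
slope, low, high = slope_with_ci(xs, ys, t_crit=2.306)
print(round(slope, 3), round(low, 3), round(high, 3))
```

Note that non-overlapping intervals indicate a significant difference at roughly this level; overlapping intervals are suggestive but not a formal test of whether two slopes differ.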

So thank you Dr. Liengme! Note - a new edition of his book A Guide to Microsoft Excel 2007 for Scientists and Engineers is available in paperback from Amazon.

p.s. The 95% confidence interval for the blue line overlaps slightly the intervals for the grey lines.

Another time course to do

We convinced our A. pleuropneumoniae collaborators that the manuscript should be submitted in its current form after I do another replicate of the time course (and with some genome-analysis data presented as a table rather than just described in a sentence in the text). So now I first need to do a test of expression-time requirements for starved and growing cells, which should be relatively simple.
Quick plan: start with frozen starved cells, thaw cells, resuspend half in fresh starvation medium (to get rid of the glycerol they're frozen in) and half in a larger volume of rich medium.  Add DNA to the first cells, incubate 15 minutes, add DNase I, incubate 5 minutes, add an equal volume of rich medium, continue incubating.  At intervals (0, 10, 30, 60 minutes) take samples, dilute and plate on plain and kanamycin plates.  Incubate the rich-medium cells until they're moderately dense (OD about 1.0), then add DNA and DNase I and sample as above.  Total of 8 samples, needing about 32 plain plates and about 50 kanamycin plates.  This time include a no-DNA control.
Then, do another time course like the one I did last week.

Manuscript genuinely nearing completion?

One of the bioinformatics manuscripts on my desk (on the shelf, on my office floor) has been 'nearing completion' for so many years now that I'd genuinely come to see this as its permanent state.  

The work began about 10 years ago as a collaboration with a colleague in Taiwan, produced interesting results about the effects of uptake sequences on proteomes, and was partly written up and then set aside when the student who had done the bioinformatics graduated and went on to unrelated work.  At that stage I had a manuscript that was fine in some parts but flawed in others. 

About four or five years ago we began a new collaboration with bioinformaticians in Ottawa, initially on another project, but later on a new version of the old proteome project.  That produced lots of data and a new draft manuscript, but we kept finding little problems with the data and getting new ideas for analysis.  And I kept setting it aside to work on other things, as did my Ottawa colleague.  

But I can see the light at the end of the tunnel now, and am seriously hoping to get the damned thing submitted by Christmas.  So yesterday I sat down to try to remember where it stood, and to look at the latest (hopefully final) data.  It looks good enough, so I just need to lower my standards and get it done.  (Maybe drinking some of the beer in the lab food fridge will help.)

Time course results and prospects

Well, my execution of the time course experiment was near-flawless, but the results leave a bit to be desired.  Once again the colony counts were erratic, and some of the results are inconsistent with results of previous 'replicates'.  (Is replicate the right word here?).

I do have a new hypothesis to explain the colony-count problems: too short an 'expression time' for the antibiotic resistance allele.  When cells recombine an allele coding for an antibiotic-resistant version of a protein, they don't instantly become resistant -- some time is needed for the new allele to be transcribed and translated into protein, and full resistance may take an hour or more.  However this matters more for some antibiotics and some forms of resistance than for others.  
With kanamycin and H. influenzae, experiments I did when I first set up my lab showed that cells could be spread on kanamycin agar plates right away (15 minutes for DNA uptake, 5 minutes for DNaseI, maybe 5 minutes for dilutions and plating), and every cell that had the new allele could form a colony.  When I started working with A. pleuropneumoniae I was told to use a very high concentration of kanamycin to prevent sensitive cells growing on the kanamycin plates, and found that cells did need an hour for expression of the allele before being able to form colonies on plates.  But I tested lower concentrations and found that sensitive cells still couldn't grow even at a much lower concentration.  In writing up that test I speculated that maybe cells wouldn't need expression time to grow on this lower concentration.  But I seem to have then just gone on to do subsequent experiments without expression time, without ever really testing whether it was needed.  So maybe my erratic results and low colony counts on kanamycin plates are because many cells that had acquired the resistance allele didn't have time to become phenotypically resistant before they encountered the kanamycin.
To resolve this, I can thaw out and transform some frozen competent A. pleuropneumoniae cells and test their need for expression time.  But I now wonder if these cells, having been starved to induce competence, might actually need less expression time than the cells growing in rich medium, because the antibiotic causes most of its killing when cells are actively replicating their DNA.  This would be consistent with my results, because the plating problems did mainly happen with growing cells. That's an interesting issue in its own right, but would entail doing a more complicated test.  If the tests showed that expression time was the problem (at least a big part of the problem), I'd redo the time course.
The issue may be moot, because we may now leave the time course out of the manuscript.  One of the authors from the other research group (the leader of the group) strongly feels that we need to do more experiments before submitting this for publication, even as a 'Note' to a fairly minor journal.  But of course he wouldn't be the one doing the experiments, and the rest of us think we should just send it in and see what the reviewers think.  I and the postdoc can also think of desirable experiments, but our lists don't overlap with his.  So we'll try to persuade him by sending him our list and arguing that we can't possibly do all of these experiments, so let's wait and see which ones the reviewers might want us to do, rather than trying to read their notoriously unpredictable minds. 

Good and bad news

The good news is that my test transformation worked quite well - the transformation frequency was a bit lower than I would like, but the numbers of colonies produced from the different dilutions were just what they should have been, and the colonies themselves were both vigorous and uniform.

The bad news is that our work-study student has the flu, so I won't have any help with the big experiment. But of course it isn't really that big of an experiment, just a full day of keeping track and paying attention and neglecting other responsibilities.

It's been soooo looong....

...but I'm about to do a real experiment at the bench, because one of my post-docs has called me on my bragging that "I can do a time course in my sleep".
Several years ago we did some collaborative work on DNA uptake by Actinobacillus pleuropneumoniae, a H. influenzae relative that infects pigs rather than people (the PDF is here). The collaborators also generated data on the regulation of competence by the A. pleuropneumoniae homolog of our favourite protein, Sxy. They have now moved on to other things, and this post-doc is writing up a short manuscript (just a 'Note') combining their data with some of ours.
One issue her manuscript examines is how competence changes in response to different culture conditions. I had investigated this during the period of collaboration, by following changes in transformation frequency over time, as cells were grown in rich medium and transferred to the starvation medium that induces competence in H. influenzae. I did versions of this 'time course' several times, but each was plagued by inconsistencies in the numbers of colonies that grew up on the agar plates. I couldn't figure out what was causing the inconsistencies -- the cells were no more sensitive to minor variations in plating conditions than H. influenzae is -- so I compensated by sampling the culture more often. This generated data that was marginally acceptable, and I went on to other things.
Now the post-doc wants to include this data in her manuscript, but it needs to be replicated first. Over to me. I checked yesterday and was reassured to find that I have lots of the kanamycin-resistance DNA I need to give the cells (I sensibly did a large high-quality prep in 2005), and a good stock of frozen competent A. pleuropneumoniae cells that I can use as controls. So today I'm going to thaw out some of these cells and transform them with the DNA, just to confirm that everything is working. Sunday I'll count the colonies from that little experiment, and if all looks good I'll plan the big time course, using my best 2005 time course as a guide.
Monday I'll have help from our work-study student. She and I will make the agar for all the plates we'll need, and pour about 250 plates. She'll first make up a high-concentration stock of the NAD supplement this agar needs. We'll prepare lots of dilution solution and put 5 ml aliquots into hundreds of tubes. We'll start an overnight culture of the cells, so they'll be ready to go on Tuesday morning.
Tuesday I'll start by diluting the cells into fresh medium and following their growth by measuring the turbidity of the culture. If I'm lucky the work-study student will be free to help me. We'll put a drop (10 ul = 1 ug) of DNA in each of about 30 tubes. At about 30-minute intervals we'll take 1 ml of cells from the culture into a tube, incubate this for 15 min, add 10 ul DNaseI to destroy the remaining DNA, incubate for 5 minutes, and then dilute the cells and spread them on plates with and without kanamycin. Once the culture has reached a specific density we'll transfer part of it to starvation medium, and also test aliquots from this at 30 min intervals.
If we're ambitious we may also add the potential inducer cyclic AMP (cAMP) to another part of the culture. In H. influenzae cAMP induces competence, but my 2005 results with A. pleuropneumoniae were very unreproducible.
Then, on Wednesday or Thursday, maybe I can convince the work-study student that counting colonies is a thrilling part of scientific research (like Tom Sawyer white-washing the fence). Otherwise I'll have to count them myself.

Nothing interesting here, folks

OK, so I got to my office this morning all enthusiastic to do the additional runs that would clarify why N. meningitidis has twice as many forward-orientation DUSs in the strand synthesized discontinuously. I did three different things, all of which confirmed that the two-fold difference was just an aberration in the Gibbs analysis.

First, I plotted the distribution of forward-DUSs along both strands of the genome (yesterday I only had time to do it for the reverse-complement strand). This clearly showed that the two strands are the same-- the blue ticks in the figure below (just a close-up of part of the figure) are DUSs on the forward strand, and the red ones are DUSs on the reverse-complement strand.

Second, I completed the control analysis I had to interrupt yesterday. This analyzed the reverse-complements of the 'leading' and 'lagging' sequences I had assembled yesterday. It was a way of repeating the analysis on different sequences that had the same information content. Result: very similar numbers of DUSs in both.

Third, I assembled new 'leading' and 'lagging' sequences, using our script to efficiently find the midpoints I'm using as surrogate termini, then reran the Gibbs analysis on these. Result: very similar numbers of DUSs in both, and these DUSs gave effectively identical logos.

So I went back and examined the Gibbs output that had had twice as many DUSs as the others. For unknown reasons, both replicate runs had settled on less-highly specified motifs, and thus included a lot more poorly matched sites in their output. Well, at least I can now very confidently report that there is no direction-of-replication bias in N. meningitidis DUSs.

Two manuscripts submitted!

Within 24 hours of each other! One on the phenotypic and genotypic variation between H. influenzae strains in competence and transformation, and the other on what Sxy does in E. coli.

leading and lagging DUSs

We're adding data about the uptake sequences in the Neisseria meningitidis genome (called DUSs) to our previous analysis of Haemophilus influenzae uptake sequences (USS).  But my initial  analysis of how these DUSs are distributed turned up something unexpected.

For H. influenzae, we've known for a long time that USSs are distributed quite randomly around the genome.  Well, the spacing is a bit more even than perfectly random would be, and USSs are somewhat less common in coding sequences, but these qualifications are not important for this unexpected result.  In particular, the two possible orientations of the USSs occur quite randomly around the chromosome.  (I've described this analysis in a previous post, though not the results.)

So I did the same analysis for Neisseria DUSs.  I separated the forward and reverse-complement genome sequences into 'replicated as leading strand' and 'replicated as lagging strand' halves.  This was relatively simple because the origin of replication had been assigned to position #1.  Nobody has mapped the terminus of replication, so I just picked the midpoint (bidirectional replication from the origin is expected).  I wrestled with Word to find this midpoint, but I'll redo this using our '' perl script to divide each sequence exactly in two. Then I used the Gibbs motif sampler (now running happily on our desktop Macs) to find all the forward-orientation DUSs in the 'leading' and 'lagging' sequences.  
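The strand bookkeeping here is easy to get backwards, so here's a sketch of the logic in Python. The midpoint-as-terminus assumption matches the post; the 10-mer DUS core GCCGTCTGAA stands in for whatever motif the Gibbs sampler actually models, and the demo sequence is a toy. If I have the replichore convention backwards, the two output sequences simply swap, which doesn't affect a leading-vs-lagging count comparison.

```python
# Split a circular genome (origin at position 0, terminus at the
# midpoint, bidirectional replication) into the sequence replicated
# as leading strand and the sequence replicated as lagging strand.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]

def split_leading_lagging(genome):
    """Return (leading, lagging) concatenated sequences."""
    mid = len(genome) // 2
    left, right = genome[:mid], genome[mid:]
    # Rightward fork (left half): forward strand is copied
    # continuously, so it reads as 'leading'.  Leftward fork (right
    # half): the forward strand is the lagging one, so its
    # leading-strand copy is the reverse complement.
    leading = left + revcomp(right)
    lagging = revcomp(left) + right
    return leading, lagging

def count_motif(seq, motif):
    """Count (possibly overlapping) exact occurrences of motif."""
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

# Toy demo, not a real genome:
demo = "ATGGCCGTCTGAAATTTCAGACGGCAT"
leading, lagging = split_leading_lagging(demo)
print(count_motif(leading, "GCCGTCTGAA"), count_motif(lagging, "GCCGTCTGAA"))
```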

The surprise was that it found twice as many DUSs in the lagging strand as in the leading strand.  After mistakenly considering that this could be because there were more genes on the leading strand (irrelevant, because genes, and DUSs, occupy both strands), I decided that this must be because DUSs were oriented differently around the genome, mostly pointing the same way with respect to the direction the replication fork passes through them.  So I did a quick analysis of the locations of forward-pointing DUSs around the chromosome, expecting to find that they were more frequent near one end than the other.  But they appeared to be evenly distributed, which would mean the only other explanation is that I've screwed up. 

Later I'll do some more analyses to sort this out.

Neisseria repeats

I've spent parts of the last couple of days discovering that a dataset of uptake sequences (DUS) from the Neisseria meningitidis intergenic regions contained a large number of occurrences of a different motif.  I could easily see that it was longer than the 12 bp DUS but couldn't figure out what the motif was.

I spent a long time suspecting that these repeats were 'correia' elements, a very short but complex transposable element common in Neisseria genomes.  But I couldn't find a clear illustration of the correia consensus, and I couldn't find a good match between the correia sequences I could find and the sequences of the stray motif in my DUS dataset.

Finally I realized that I could try using the Gibbs motif sampler to characterize the motif.  So I took my set of intergenic sequences, used Word to delete all the perfect DUS (both orientations), and asked Gibbs to find a long motif.  I didn't know how long the stray motif actually was, so I tried guessing 20 bp, then 30, then 40.  But this didn't seem to be working - instead of finding a couple of hundred long correia-like motifs it would find a couple of thousand occurrences of something with what looked like a very poor consensus.  So I seeded the sequence set with about 20 occurrences of the motif taken from the dataset where I'd first noticed it.  

Gibbs again returned about 1500 of what looked like poor-consensus occurrences, but this time I had a bit more confidence that this might be what I was looking for, so I trimmed away all the notation and pasted them into WebLogo.  This gave me a palindromic repeat that I'll paste below later, and a bit of Google Scholar searching showed me that this isn't correia at all, but a short repeat called RS3, known to be especially common in intergenic sequences of the N. meningitidis strain I'm using.

So now I can write a sensible manuscript sentence explaining what these repeats are and why I'm justified in removing them from the dataset.

Another issue for the new uptake plans

Here's another technical problem for the new plans. We want to reisolate and sequence DNA that competent cells have taken up into their cytoplasm. We'd start with a large number of competent cells carrying a rec-1 mutation that prevents them from recombining DNA with their chromosome, and incubate them with genetically marked DNA fragments. We expect the DNA to be taken up (it would contain one or more USSs) across the outer membrane and then translocated into the cytoplasm. The problems arise because, although the process of translocation starts at a free end of double-stranded DNA, one of the strands is degraded as the DNA is translocated. This means that the DNA in the cytoplasm will be single-stranded.

We can treat the cells with DNaseI to destroy DNA that's left outside the cells or on their surfaces, and then wash them thoroughly to remove the DNase I so it doesn't act on the DNA inside the cells. We can then lyse the cells with SDS in the presence of EDTA, being very gentle so as not to break the long chromosome.

Getting rid of chromosomal DNA is very important, as there will be at least 10 times as much of it as the incoming DNA we want to reisolate. If we start with input DNA that is of a uniform and relatively short length, we will be able to use size fractionation to get rid of most of the chromosomal DNA. And we probably can further enrich for the input DNA by fractionating on a column that selects for single-stranded DNA.

One solution would be to affix a sequence tag to the ends of the input fragments, and then use a primer for this tag to PCR-amplify the input DNA. Unfortunately, the leading (3') end of the surviving incoming strand is also thought to be degraded, so the tag would probably be lost. As this is the end the PCR would start from, the PCR then wouldn't work.

We don't want to tamper with the structure of the incoming DNA, as this is likely to interfere with normal uptake in some way we can't control for. And we don't want to use recipients with nuclease knockout mutations, partly because we don't even know which nucleases are responsible and partly because we don't want to pervert the normal uptake/processing events.

One possibility is to use a combination of tagging and random priming, with the lack of tags on the 3' ends compensated by the random primers. Maybe we could test this, using radioactively labelled input DNA with tags. If the method is working, most of the input radioactivity in the reisolated DNA would be converted to double-stranded. Or we could test it using DNA from another species, and sequencing a small fraction of the PCR products to see if they were indeed from the input genome rather than from the recipient.

Because we're really only interested in the relative proportions of different input sequences in our reisolated DNA, we can tolerate a modest fraction being from the recipient genome. But we don't want to waste our expensive sequencing dollars on DNA that's mostly from the recipient.

Planning an uptake/sequencing experiment

A future post-doc and I are re-planning an experiment I discussed a couple of years ago and included in an unsuccessful grant proposal, designed to finally find out the real sequence specificity of the DNA uptake machinery. At that time I only proposed a little bit of sequencing, but sequencing keeps getting cheaper so this time we're going to do it right. This blog post is to help me get the numbers clear.

In my original plan we'd start with a pool of mutagenized (or degenerately synthesized) USS, either as short fragments (say 50bp) or inserts in a standard plasmid. The USS in the pool would have, at each position of the 30bp USS, a 91% probability of carrying the consensus base and a 3% probability of carrying any one of the 3 other bases. The flanking sequences might be mutagenized too, or the mutagenized fragment might have been ligated to a non-mutagenized tag or plasmid sequence. We'd give this DNA to competent cells, reisolate the DNA that had bound to the cells, and sequence it. We'd also sequence the 'input' pool. The differences between the 'input' and 'bound' pool sequences would define the specificity.

Only about 20 positions of the USS show a significant consensus, so in the analysis below I'm going to assume that the USS is 20bp long. On average each USS would have about 2 changes away from the consensus in these 20bp, but the range would be broad. Figuring out how broad is part of the reason for this post.

For example, what fraction of the fragments would have no differences from the consensus in the relevant 20bp? That's (0.91)^20, which Google says is about 0.15. What about fragments with 1 difference? I think that's about 20 * 0.09 * (0.91)^19, because there are 20 different possible locations for the difference. That's about 0.3. I fear the calculation will be more complicated for the larger numbers of differences. A similar calculation to the second one above gives 0.56 as the frequency of USSs with 2 mismatches, but that's unlikely to be correct because the sum 0.15 + 0.3 + 0.56 = 1.01 leaves no possibility of USSs with more than two differences from the consensus. So for the analysis below I'll just use some fake numbers that I think are plausible: 0.1 with no difference, 0.25 with one difference, 0.3 with two differences, 0.2 with three differences, and 0.15 with four or more differences.
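That 0.56 probably comes from counting ordered pairs of positions (20*19) rather than unordered pairs (20*19/2 = 190); the binomial calculation with unordered positions gives about 0.28 for two mismatches, and the whole distribution then sums to 1. A quick sketch redoing the arithmetic:

```python
from math import comb

def mismatch_fraction(k, n=20, p_consensus=0.91):
    """Binomial probability that exactly k of the n USS positions
    carry a non-consensus base (each position is non-consensus
    with probability 1 - p_consensus = 0.09)."""
    q = 1 - p_consensus
    return comb(n, k) * q**k * p_consensus**(n - k)

dist = [mismatch_fraction(k) for k in range(21)]
print([round(x, 3) for x in dist[:4]])  # fractions with 0, 1, 2, 3 mismatches
print(round(1 - sum(dist[:4]), 3))      # fraction with 4 or more
```

This gives roughly 0.15, 0.30, 0.28, and 0.17 for zero to three mismatches, with about 0.10 of fragments carrying four or more -- not far from the fake numbers above.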

How is a sequence-specific uptake system likely to act on this variation? Let's assume that fragments with 'perfect' consensus USSs are taken up with probability 1. For the simplest version, let's assume all one-position differences have the same effect, reducing binding/uptake to 0.1, and all two-position differences reduce it to 0.01, etc. The 'bound' pool would then be expected to contain 0.78 perfects, 0.195 one-offs, 0.023 two-offs, 0.0015 three-offs and 0.0001 USSs with four or more differences.
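Those 'bound' pool fractions are just the input fractions weighted by the uptake penalty and renormalized; a sketch, using the fake input numbers and the 10-fold-per-mismatch assumption from above:

```python
# Fake input fractions (0, 1, 2, 3, 4+ mismatches) from the text,
# weighted by a 10-fold uptake penalty per mismatch, renormalized.
input_frac = [0.1, 0.25, 0.3, 0.2, 0.15]
uptake = [1, 0.1, 0.01, 0.001, 0.0001]

bound = [f * u for f, u in zip(input_frac, uptake)]
total = sum(bound)
bound = [b / total for b in bound]
print([round(b, 4) for b in bound])
```

which reproduces the split quoted above (0.78, 0.195, 0.023, ...).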

How much sequencing of the 'bound' pool would we have to do to have included all of the three-off sequences (i.e. sequenced each version at least once)? Only one sequence in about 1000 will be a three-off, and there are about 7000 different ways to combine the positions of the changes (20*19*18). But there are three ways to be different at each position that's different... Yikes, I'm in over my head.

OK, let's do it for the one-offs first. There are 20 different locations for the difference, and three possible differences at each location, so that's 60 different versions. About 0.2 of all sequences will be in this class, so we will need to sequence a few thousand USSs to get reasonable coverage. What about the two-offs? There are 20*19=380 possible pairs of locations for the differences, and 9 possible differences for each pair of locations, so that's 3420 different sequences. Only about .023 of the fragments in the pool would have two differences, so we'd need to sequence about 150,000 USSs to get one-fold coverage, say about a million sequences to get good coverage. For the three-offs, the numbers are .0015, 6840, and 27, giving almost 200,000 different sequences, with about 1.5x10^8 sequences needed for one-fold coverage (say 10^9 for reasonable coverage).
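One caveat on this counting: 20*19 counts each pair of locations twice (positions {3,17} and {17,3} are the same pair), so the number of distinct two-off sequences is half of 3420, and likewise the three-off count shrinks six-fold. Here's the combinatorics done with binomial coefficients (a sketch; the coverage conclusions stay in the same ballpark):

```python
from math import comb

def n_variants(k, n=20):
    """Distinct sequences with exactly k mismatches: choose the k
    positions (unordered), then one of 3 alternative bases at each."""
    return comb(n, k) * 3**k

def reads_for_onefold(k, frac_of_pool):
    """Reads needed to expect each k-off variant about once, if a
    fraction frac_of_pool of the bound pool is in the k-off class."""
    return n_variants(k) / frac_of_pool

print(n_variants(1))                       # 60 one-off variants
print(n_variants(2))                       # 1710 two-off variants
print(n_variants(3))                       # 30780 three-off variants
print(round(reads_for_onefold(2, 0.023)))  # ~75,000 reads for the two-offs
```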

If I instead assume that each mismatch reduces binding/uptake by only 0.5, then the 'bound' pool would have 0.3 perfects, 0.37 one-offs, 0.22 two-offs, 0.07 three-offs and 0.03 USS with four or more differences from the consensus. The corresponding numbers of fragments sequenced for reasonable coverage of one-offs, two-offs and three-offs would be about 1000, 100,000, and 10^8.

A more realistic model should have some positions more important than others, because that's the main thing we want to find out. What if differences at one of the positions reduce uptake to 0.1, and differences at the other 19 reduce it only to 0.5? We'd of course recover more of the differences in the less-important positions than of the differences in the important position.

How does thinking about this change the amount of sequencing we'd need to do? If all we're interested in is the different degrees of importance of positions considered singly, then we'd easily see this by doing, say, enough to get 100-fold coverage of the important single mismatches. Even 100-fold coverage of the unimportant positions would be quite informative, as we only need enough to be confident that differences at the one 'important' position are underrepresented in our sequences. So 10,000 USS sequences from the 'bound' pool would be plenty to detect various degrees of underrepresentation of differences at the important positions.

But what if we also want to detect how differences at different positions interact? For example, what if having two differences beside each other is much worse for uptake than having two widely separated differences? Or if having differences at positions 3 and 17 is much worse than having differences at other combinations of positions? Or having A and C at positions 3 and 17 is much worse than having other non-consensus bases at those positions?

We would certainly need many more sequences to detect the scarcity of particular combinations of two or more differences. The big question is, how many? Consider just the two-offs. 100,000 sequences would let us get almost all of the non-important two-off variants at least once, and most of them about 5-10 times. But that wouldn't be enough to confidently conclude that the missing ones were not just due to chance -- we'd need at least 10 times that much sequence.

How much sequencing is that? If the fragments are 50bp, and we want, say, 10^6 of them, that's 5x10^7 bp of sequence. Future post-doc, that's a modest amount, right?

Given a sufficiently large data set, we do have access to software that would detect correlations between sequences at different positions (we used it for the correlation analysis in the USS manuscript).

Once we had identified candidates for significantly under-represented sequences, I wonder if there is a technique we could use to go back to the pool and confirm that these sequences were genuinely under-represented. Something analogous to validating microarray results with Q-PCR? Maybe the reverse? Ideally, we would have an oligo array with every possible two-off USS, and hybridize our bound pool to it. Probably not worth the trouble.

The other reason I'm writing this post is to figure out how much DNA we'd need to start with, in order to end up with enough sequences to define the specificity. If the 'bound' pool contained 1 microgram of 50bp fragments -- that would be 10^13 fragments. This should be enough to encompass way more diversity than we would ever be able to sequence. To get 1 microgram we'd need to start with an awful lot of cells, but even if we cut this down to 1 ng we'd still be fine.
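A sanity check on that 10^13 figure, assuming the usual ~650 g/mol per base pair of double-stranded DNA:

```python
# Molecules in 1 microgram of 50 bp double-stranded fragments,
# assuming ~650 g/mol per base pair of dsDNA.
AVOGADRO = 6.022e23
mass_g = 1e-6                  # 1 microgram
length_bp = 50
mol_weight = length_bp * 650   # g/mol for one whole fragment
molecules = mass_g / mol_weight * AVOGADRO
print(f"{molecules:.1e}")      # roughly 2 x 10^13 fragments
```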

Outline for Why do bacteria take up DNA

As I said in the previous post, this will be a review article emphasizing how we can decide why bacteria take up DNA, rather than claiming we have enough evidence to decide.

It will begin with an overview about why this question is so important. To make the writing easier I can probably modify text from my grant proposals and from Do bacteria have sex, but this new introduction should be a lot shorter than the coverage of this issue in that article. The basic things to say are:
  1. Recombination between bacteria has been enormously important in their evolution, so we really should try to understand why it happens.
  2. It's usually been assumed that the processes that cause recombination exist because the recombination they cause is selectively advantageous. (This might be viewed as meta-selection - selection acting on how genetic variation arises.) But studies of the evolution of sex in eukaryotes have shown that the costs and benefits of recombination are very complex.
  3. What are these processes? Consider gene transfer separately from physical recombination. The major gene transfer processes are conjugation, transduction and transformation. Some 'minor' processes also contribute to bacterial recombination: gene transfer agent is the only one that comes to mind right now. In other cases, genetic parasites (transposable elements, and probably integrons) hitchhike on processes that transfer genes. I won't discuss these, though I may say that I won't discuss them. Physical recombination acts on DNA that has entered the cytoplasm ("foreign DNA").
  4. The proximate causes of most bacterial recombination are reasonably well understood, at least at the molecular level (what happens inside cells, and when cells encounter other cells, phages or DNA). We know much less about the processes that determine when such encounters happen, but that's probably not a topic for this review.
  5. Some of the ultimate (evolutionary) causes are also understood. We understand the 'primary' evolutionary forces acting on cells, phages and conjugative elements. By 'primary' I guess I mean forces that directly impact survival and reproduction.
  6. These forces appear to provide sufficient explanation for the existence of many of the processes that contribute to bacterial recombination. Specifically, strong selection for replication by infectious transfer to new hosts explains why phages and conjugative elements exist, and strong selection for DNA maintenance and repair explains why the cytoplasmic proteins that cause physical breaking, pairing and joining of DNA exist. This logic needs to be spelled out very clearly.
  7. The need for nucleotides and other 'nutrients' obtained from DNA breakdown could also explain why cells take up DNA, but this question is more complicated than the others.
  8. Describe the other potential advantages of DNA uptake: templates for DNA repair, and genetic variation by recombination.
To be continued...

Review articles to be written

Two, in fact. I really really need to publish a review about the evolution of competence (=DNA uptake). Something like my Do bacteria have sex review, but updated and focusing much more on the competence and uptake issues. And I've also promised to write a chapter on competence and transformation for a book celebrating a wonderful colleague. The model for this book is Phage and the Origins of Molecular Biology, written as a festschrift for Max Delbruck. Ideally I'd love to produce something as charming as Andre Lwoff's The Prophage and I in that book, but I think I'd better lower my standards and get the job done.

Starting now.

OK, the first things I need are outlines.

For the book chapter, I was thinking about writing it as several levels of history, maybe interleaved (?): my personal history of working on competence and transformation, the history of research into competence and transformation, and the evolutionary history. But I don't know how I would make my personal history interesting - maybe emphasizing how I came to realize that the standard view is probably wrong?

What about an outline for the evolution of competence review? It should be driven by a question, probably "Why do bacteria take up DNA?", and should emphasize the unanswered questions and the research that's needed, rather than claiming that the answer is now known. Maybe I'll start with the big picture issue of the evolutionary and proximate causes and consequences of genetic exchange in bacteria, summarizing the arguments in Do bacteria have sex?. This introduction would conclude that the only phenomenon requiring more research is competence. Then I'll consider the components of competence (regulation, DNA uptake mechanisms, and the associated cytoplasmic 'DNA processing' events), putting what we know in the context of the evolutionary questions and explaining how additional information would clarify things.

Why do bacteria take up only one DNA strand?

People who disagree with us about the nutritional importance of DNA for bacteria very often cite the degradation of one DNA strand at the cell surface. Like almost all (maybe all) other competent bacteria, H. influenzae initially binds double-stranded DNA, but brings only one of the two strands across the inner membrane into the cytoplasm. The other strand is degraded to nucleotides in the periplasm. In Gram-positive bacteria the degradation also occurs at the surface, before the remaining DNA strand is transported into the cytoplasm, and the released nucleotides are found in the culture medium.

(Hmm, aside, I must go back and check the old literature, to see if it's the nucleotides or only the phosphates that are found in the medium. This is significant because we know that the phosphate must be removed from each nucleotide before it can be taken up. No, they used tritiated thymidine, so the label was in the bases.)

If bacteria were taking up DNA for its nucleotides, they argue, the bacteria would certainly not discard all of one strand, thus getting only half the nucleotides they would get if they took up both strands. Therefore DNA must be taken up for its genetic information. My usual response to this (usually ignored) is to point out that the bacteria have efficient uptake systems for nucleotides (actually nucleosides = nucleotides with their phosphates removed), and any nucleotides produced at the cell surface are likely to be taken up by these systems rather than 'discarded'. The same people happily accept that bacteria secrete nucleases to obtain nucleotides from environmental DNA, a process much more strongly limited by diffusion, so they should find this convincing.

But today I've thought of another response, which is to bounce the 'discard = waste' argument right back. If bacteria are taking DNA up for its genetic information, then surely it is wasteful to take up only one strand. Although only a single DNA strand usually participates in a single recombination event, many strands are degraded by cytoplasmic nucleases before they can recombine at all. Others recombine, but the new genetic information they contain is lost through mismatch correction. Surely it would be better to take up both strands and give each a chance to recombine, effectively doubling the probability that the cell will get the new genetic information.

Super-ultra-high-throughput sequencing? done cheap?

A potential collaborator/coworker has suggested what I think would be a great experiment, if we can find a way to do it economically.  But it would require sequencing to very high coverage, something I know almost nothing about the current economics of.

Basically, we would want to sequence a pool of Haemophilus influenzae DNA from a transformation experiment between a donor and a recipient whose (fully sequenced) genomes differ at about 3% of positions, as well as by larger insertions, deletions and rearrangements. The genomes are about 2 Mb in length. The DNA fragments could have any desired length distribution, and the amount of DNA is not limiting.

Ideally we would want to determine the frequencies of all the donor-specific sequences in this DNA.  For now I'll limit the problem to detecting the single-nucleotide differences (SNPs).  And although we would eventually want to have information about length/structure of recombination tracts, for now we can consider the frequency of each SNP in the DNA pool as an independent piece of information.

At any position, about 1-5% of the fragments in the pool would have a base derived from the donor strain. This means that simply sequencing to, say, 100-fold coverage of the 2 Mb genome (2 x 10^8 bases of sequence) would be sufficient to detect most of the SNPs but could not establish their relative frequencies. Increasing the coverage 10-fold would give us lots of useful information, but even higher coverage would be much better.
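To put rough numbers on why 100-fold coverage isn't enough, treat each position as binomial sampling of reads; the 2% donor frequency below is just illustrative:

```python
import math

def relative_error(p, coverage):
    """Relative standard error of an allele-frequency estimate when `coverage`
    reads each carry the donor base independently with probability p."""
    return math.sqrt(p * (1 - p) / coverage) / p

# Illustrative: a SNP present in 2% of the pool, at increasing coverage.
for cov in (100, 1000, 10000):
    print(f"{cov}x coverage: +/-{relative_error(0.02, cov):.0%}")
```

At 100x a 2% SNP is seen in only ~2 reads, so its frequency estimate carries ~70% relative error; pushing coverage toward 10,000x brings that down to a few percent, which is why "even higher coverage would be much better".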

The collaborator and the metagenomics guy in the next office both think Solexa sequencing is probably the best choice among the currently available technologies.  Readers, do you agree?  Are there web pages that compare and contrast the different technologies?  What should I be reading?

One possibility I need to check out is skipping sequencing entirely and instead using hybridization of the DNA pool to arrays to directly measure the relative frequencies of the SNPs.  Because the pool will contain mainly recipient sequences, we'd want to use an array optimized to detect donor sequences against this background.  How short do oligos need to be to detect SNPs?  How specific can they be for a rare donor base against a high background of recipient sequences?  Is this the kind of thing Nimblegen does?

More from Elsevier (names removed...)

Email from an Elsevier Publisher:

Dear Rosie,

I'm so sorry this process has been difficult. As you say below, you can pay by credit card (personal or institutional). We don't have an agreement with CIHR, but they may indeed refund the cost of the sponsored access. That's something CIHR will have to advise you on.

The journal manager for JMB; (name removed) is working with her colleagues to make your article available should you wish. If you haven't completed the paperwork for payment, the link is here:

I see on your blog, however, that you've given up due to frustration with the process and the inability of the service group to respond to your questions. I know they did the best they could. Please do let the journal manager know if you do indeed want to pursue the sponsored access option.

While I recognize your frustration, and I agree, I would feel the same, it's frustrating to those of us working on the journal that your blog has been wholly negative about JMB and has never acknowledged our correspondence or those of us working to resolve the difficulties you've had with your proofs.

Please let the journal manager and I know if you are indeed resolved not to move forward with the sponsored access.

Best wishes,

Friendly Elsevier Publisher (name removed)

Dear Elsevier Publisher,

I have indeed abandoned my attempt to pay for open access to our article, entirely because of my frustration with the inability of the service person handling the transaction (name removed) to give a straight answer to any of my questions.

As I now think I understand the situation, most of the things she told me were not true. First, she said my purchase order could not be processed because my granting agency was not on the list. Then she said that my granting agency would not refund me the open access charge. I took care to make my questions very clear and simple, and each time she'd respond with boiler-plate statements that did not answer the questions and raised new confusion, and with referrals to largely irrelevant documents.

I don't think the fault can be all hers. The Elsevier sponsored-access system is confusing, the policy is not clearly explained, and the necessary information is hard to find.

The Journal of Molecular Biology is an excellent journal, and we're proud to have our article appear there. The submission and review process went very smoothly, the copy editing was very professionally done, and the 50 free offprints are a nice treat. But I feel strongly that taxpayer-supported research should be published where the taxpayers can see it, so I won't be submitting to any Elsevier journals in the future.



p.s. CIHR has never quibbled in the past about paying open access charges, and no concerns were raised when I included them as a line item in my grant budget.

p.p.s. I'll post this correspondence on my blog, with the names removed.

Open access frustration more powerful than principles

I give up. No, Elsevier, I am not going to give you $3000. Principles be damned.

Dear Dr Redfield,

Thank you for your reply.

A purchase order is for your benefit only, Elsevier do not require one, please see below link:

Elsevier accept 3 methods of payment, please see below link:

As previously advised your article can be made available via open access, however you must cover the cost of USD$3,000.

Please confirm if you wish to proceed, I will update our system and send the forms for invoicing.

Elsevier Support Person

The saga continues...

Reply from the Elsevier Support Person:

Dear Dr Redfield,

Thank you for your reply and apologies for any confusion caused.

Elsevier has established agreements and developed policies to allow Authors who publish in Elsevier journals to comply with the manuscript archiving requirements of funding bodies, as specified as conditions of researcher grant awards, the below is a list of these:

Arthritis Research Campaign (UK)
British Heart Foundation (UK)
Cancer Research (UK)
Chief Scientist Office
Department of Health (UK)
Howard Hughes Medical Institute (US)
Medical Research Council (UK)
National Institute of Health (US)
Wellcome Trust (UK)

As mentioned on the below link:

More than 40 journals published by Elsevier offer authors the option to sponsor non-subscriber access to individual articles. The charge for article sponsorship is $3,000. This charge is necessary to offset publishing costs.

The funding body you mention must agree to cover this cost with you, as the author you are liable for this payment, as Elsevier do not have an agreement with them.

I hope this now makes sense. The funding body you mention can cover the cost however the payment must come from you as the author of the article.


Elsevier Support Person


Dear Elsevier Support Person,

I still don't understand. Are you saying that, because Elsevier does not have an explicit agreement with CIHR, you will not accept a UBC purchase order based on a CIHR-funded grant?

My grant budget explicitly included a line item for open-access publication charges; this was not questioned when the application was approved. If I obtain written confirmation from CIHR that the charge is allowed, will you then accept the purchase order?

If not, what forms of payment will you accept? A bank draft drawn on my personal chequing account? A charge to my personal MasterCard? A charge to my Institutional MasterCard?

Thank you,


More Elsevier hassles about open access

Recent correspondence, beginning with the last of a series of emails about a form that had gone astray:

Hi Elsevier Support Person (I won't use this person's name),

I've attached pdfs of the signed sponsorship form and the purchase
order to this email. I'm also cc'ing the email to the Sponsored
articles address,




Dear Dr Redfield,

Thank you for your reply.

The funding body listed on your sponsorship article form is not a
body Elsevier currently has a policy with, therefore we cannot
process this.

For more information and a full list of our Funding Bodies, please
see below link:

Yours sincerely,

Elsevier Support Person


Dear Elsevier Support Person,

Are you saying that, because Elsevier does not have an explicit
agreement with the Canadian Institutes of Health Research, we do not
have the option of making our article available to non-subscribers?
That was not the impression I got from reading the posted information
on sponsored access (pasted below).
"Worldwide approximately 10 million scientists, faculty members and
graduate students can access Journal of Molecular Biology through
institutional subscriptions. In addition, Elsevier's ScienceDirect
licenses permit all public users who are permitted by the library to
walk in and use its resources to access all journals to which the
institution subscribes. In a few instances, authors have requested to
make their articles freely available online to all non-subscribers.
Journal of Molecular Biology offers authors the option to sponsor an
article and make it available online to non-subscribers via Elsevier's
electronic publishing platforms.

Authors can choose this option for all articles accepted after May
2006. Authors can only select this option after receiving
notification that their article has been accepted for publication.
This prevents a potential conflict of interest where Journal of
Molecular Biologywould have a financial incentive to accept an article.

The author charge for article sponsorship is $3,000. This charge is
necessary to offset publishing costs - from managing article
submission and peer review, to typesetting, tagging and indexing of
articles, hosting articles on dedicated servers, supporting sales and
marketing costs to ensure global dissemination via ScienceDirect, and
permanently preserving the published journal article. The fee
excludes taxes and other potential author fees such as color charges
which are additional.

Authors who have had their article accepted and who wish to sponsor
their article to make it available to non-subscribers should complete
and submit the order form.

When calculating subscription prices we plan to only take into account
content published under the subscription model. We do not plan to
charge subscribers for author sponsored content."



Dear Dr Redfield,

Thank you for your reply.

No you have not read this incorrectly, you can have your article available as "Open Access" as the "Journal of Molecular Biology" is one of the journals covered under this policy by Elsevier.

Please note however, that as Elsevier do not have a policy with the "Canadian Institutes of Health Research, you must agree to cover the cost of US$3000, this will not be refunded to you by the funding body

More information on the Open Access policy can be found below:

Before we arrange for this to be done, we need to make you aware of this procedure, therefore if you agree to cover the cost the article will be made available. Please confirm.

Elsevier Support Person


Dear Elsevier Support Person,

The web page you direct me to says nothing about the sponsored-access option being limited to specific funding agencies. Let us try again to clarify this policy. Would the following be a correct explanation of the policy?
Elsevier has explicit agreements with some funding agencies, stating that grant funds provided by these agencies may be used to cover the $3000 cost of sponsoring open-access publication in the Elsevier journals offering this option. These are the agencies listed on the Elsevier page that your previous email pointed me to.

Funding agencies with which Elsevier has no explicit agreement could refuse to authorize use of their funds in this way. If this should happen, Elsevier would hold the author of the article personally responsible for the $3000.
Your latest email says that, because Elsevier does not have an agreement with the Canadian Institutes of Health Research (CIHR), CIHR will not cover the cost of sponsored access publishing in Elsevier journals. Do you know that this is indeed the case, or is it only a possibility that I need to check for myself with CIHR? If you know that CIHR will refuse this expense, can you direct me to the source of this information?

Thank you,


Other nucleases that might act on internalized DNA

I just discovered a 2008 paper in Mutation Research, about the phenotypes of H. influenzae exonuclease mutants, and I've emailed the senior author asking if they would be willing to send us chromosomal DNA of these mutants, so we could test a pet hypothesis of mine.

The hypothesis concerns the functions of the competence-induced genes comM and dprA. Phenotypes of mutants in other bacteria suggest that the products of these genes protect incoming DNA from nuclease degradation. But what nuclease? By testing the competence phenotypes of double mutants we've ruled out ExoV (recBCD). Using the new mutants would let us test the involvement of other nucleases. (The hypothesis is described more thoroughly in this blog post.)

A 2002 paper about Synechocystis shows that knocking out recJ increases transformation frequencies 100-fold. Synechocystis does have homologs of both comM and dprA, so I don't know what the increased transformation implies about their roles.

Open Access and other sliminess at Elsevier

Our latest paper on CRP sites has been accepted by the Journal of Molecular Biology. This is great, but now I'm dealing with the messy post-acceptance issues.

First, our page proofs have gone astray. The Elsevier manuscript-tracking page (yes, JMB is part of the evil empire of scientific publishing) says that page proofs were sent out on Aug. 27. They should have been sent by email, but I've seen no sign of them so far. Usually page proofs are supposed to be corrected and returned within 48 hrs; I've just emailed a person at JMB about them.

While looking for the proof email I discovered that I'd ignored other bureaucratic requirements. I needed to complete an on-line document assigning all copyright to Elsevier. This document made no mention of an open-access option, but I accepted it anyway. It did say that I am allowed to post the Elsevier-created pdf of the manuscript version on my own web page or on a public server, and that I can use the final Journal-quality pdf for teaching (but I can't post to any publicly available sites).

Then I dug around looking for the "authors-pay" open access option. I had been assuming that this would give me a creative-commons-type license to do anything I want with the final pdf and data it contains, but no. All that I get for paying Elsevier $3000 US is access to the paper on the JMB web site by people who don't have subscription access (who don't pay the ~$1000 for a personal JMB subscription or belong to institutions paying ~$8000 for a subscription).

So, should I give Elsevier the $3000? That way nobody will need to search around to find out if I've posted a free (unformatted) copy on my home page (or linked to this blog). But Elsevier will still hold the copyright.

Searching for motifs

Well, I don't know why my first attempt at running a Gibbs motif search with the Gallibacterium genome returned errors. The errors it described were ones I hadn't made (as far as I could tell), so I resubmitted the runs and they were fine.

But the grid-computer system was slow to get around to my runs (probably those physicists and meteorologists hogging the system), so I poked around and rediscovered that the Gibbs motif searcher program now also runs on Macs. Luckily one of the post-docs had just come back from a two-week course that required intensive use of Unix, so she was able to dive in and sort out the permissions etc. for me. So now I can run the Gibbs searches on my newish MacBook Pro and on our other fast Mac. And I can also still run them on the grid system too.

Results: I can't find anything (i.e. Gibbs can't find anything) in the Gallibacterium genome that looks anything like a typical Pasteurellacean uptake sequence. I've done simple searches where I just asked it to look for an 8-mer motif, and searches where I gave it 'segmentation-mask' prior files telling it the spacing typical of a H. influenzae or A. pleuropneumoniae USS, but it still didn't find anything. Most of the time it can't find anything it even considers to be a motif at all. This may be because the USS motifs are too sparse for it to pick them up, or because Gallibacterium really doesn't have a USS at all.

But I'm trying one more thing - giving it a prior file that specifies not just the spacing but the actual position weight matrix to expect. If this doesn't find anything we may need to do some uptake-specificity experiments.

Can I remember how to run a Gibbs motif sampler analysis?

Our visiting grad student is working with Gallibacterium, a Pasteurellacean relative of Haemophilus. To help her optimize transformation we would like to find out about its uptake bias. As a first step, we'd like to find out whether it has repeats in its genome that resemble the known Pasteurellacean uptake signal sequences (USS) - fortunately a Gallibacterium genome sequence is available. I've done this analysis for all the other sequenced Pasteurellacean genomes, so I said I'd do this one too. Should be easy...

My first approach was to give the genome sequence to our Perl program that simulates USS, not because I want to do that, but because the program's first step is to count the numbers of full and partial USS matches in the starting sequence. The program was set up to do that for the H. influenzae USS (AAGTGCGGT), but when it didn't find many of these in the genome I modified it to find the other type of Pasteurellacean USS (ACAAGCGGT). It didn't find many of those either.
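The counting step itself is simple to reproduce; here's a minimal Python version (the toy sequence and function names are mine for illustration, not our actual Perl program):

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_core(genome, core):
    """Count perfect matches to a USS core on both strands, overlaps allowed."""
    total = 0
    for motif in {core, revcomp(core)}:
        start = genome.find(motif)
        while start != -1:
            total += 1
            start = genome.find(motif, start + 1)
    return total

# Toy sequence containing one forward and one reverse-strand H. influenzae core,
# plus one forward copy of the other Pasteurellacean core.
toy = "ACAAGCGGTTTAAGTGCGGTAAACCGCACTT"
print(count_core(toy, "AAGTGCGGT"))   # 2
print(count_core(toy, "ACAAGCGGT"))   # 1
```

On a real ~2 Mb genome the same two calls, with the H. influenzae (AAGTGCGGT) and alternative (ACAAGCGGT) cores, give the match counts that came up short for Gallibacterium.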

So, perhaps Gallibacterium has a previously unknown version of the USS. Or perhaps it has an unrelated USS. Or perhaps it doesn't have a USS at all, which would suggest that it has weak or no uptake bias. What was needed was analysis with the Gibbs motif sampler, which would look for any common repeat in the genome. OK, I did lots of those last summer, so I can do it again.

I remembered how to submit a sequence for analysis, but I didn't bother to carefully check what the different settings do before submitting the run. That was stupid, because 36 hours later I've received two emails from the system, telling me that my requested run failed. One says "ERROR:: Mismatched width ranges" and the other "ERROR:: Palandrome (sic) subscript overflow". Guess I'd better buckle down and sort it out.

CRP manuscript revisions submitted, on to Gallibacterium...

As usual, it took me about 3 tries to get the manuscript resubmitted with all the files in their correct forms. But it's done.

I'm going to try to get back to the bench next week, doing some competence-induction experiments with Gallibacterium, brought to the lab by our visiting grad student. Oh boy!

Manuscript almost ready to go back

Our manuscript about how the CRP proteins of E. coli and H. influenzae differ in their sequence specificity has been provisionally accepted, and the revised version is almost ready to send back to the journal. We weren't able to do the one experiment requested by the editor, but we make what we think is a pretty good argument about why it isn't needed.

The only remaining problem is that some of the figures look a bit weird, I think due to being shuffled between different formats. The dark grey shading in some of the bar-graph bars has turned into a dark grey check pattern. My former-postdoc-coauthor converted the figures into high-resolution PDFs so he could email them to me, but maybe he should instead post them to one of those file-sharing sites where I can download them. I know Google Groups works for this, but I think there are also sites dedicated to this.

The Response to Reviewers letter has been written and revised and polished, so once the figures are sorted out I think I can sit down and do the on-line submission.

Where are they now? Part 2

Two more lines of research that we're no longer working on:

3. When did eukaryote sexual reproduction begin? During the first 10 years that I was working on competence, I fully intended to switch to studying the origins of meiosis in eukaryotes. The plan, and the reasons I set it aside, are explained in this post from last summer. (Fortunately John Logsdon has taken up the torch.)

As the first steps in this project Joel Dacks, then a M.Sc. student in my lab, and I published two papers on the phylogeny of early-diverging eukaryotes. These results have since been confirmed by more detailed analyses, although the deep phylogeny of eukaryotes is still rather obscure.

4. Quorum sensing and/or diffusion sensing:
Most bacteria secrete small more-or-less inert molecules into their micro-environments and monitor the external concentrations of these molecules. When this autoinducer-secretion was first discovered it was proposed to be a means of cell-cell communication, evolved to enable bacteria to monitor the cell density of the population they are living in and to respond with appropriate changes in gene expression. This "quorum sensing" explanation quickly became dogma, despite having serious theoretical/evolutionary problems. In retrospect, this acceptance was partly because there were no alternative explanations for the evolution of autoinducer secretion and sensing, and partly because the idea that bacteria are secretly talking to each other is very appealing.

In 2002 I published an opinion piece (in Trends in Microbiology) proposing a much simpler explanation, that the secreted molecules serve as inexpensive sensors of the diffusional properties of each cell's microenvironment, and thus allow cells to secrete expensive effector molecules (such as degradative enzymes) only when they and their products will not be lost by diffusion. This ‘diffusion sensing’ hypothesis was welcomed by evolutionary biologists but largely ignored by the many researchers actively investigating quorum sensing. My lab initially tried to develop experimental systems to demonstrate that isolated cells use secreted autoinducers for gene regulation, but gave up because of the technical problems of monitoring gene expression at the scale of single isolated cells.

However the paper now gets regular citations in reviews of quorum sensing, and several other research groups have produced evidence validating the importance of diffusion in autoinducer regulation. The latest is a study of Pseudomonas cells on leaves (Dulla and Lindow PNAS 2007), which found that diffusion and other physical factors in cells' microenvironments are major determinants of this regulation. They pointed out that my proposal "has received little attention despite the extensive study of QS in many species", and even quoted approvingly my sentences about what research is needed.

Where are they now?

In the course of updating my CV I've been checking what's become of hypotheses and projects we initiated but are no longer working on. The good news is that all of them are still active areas of research, and the ones I consider most important are getting increasing attention. Here's a quick overview of two of them.

1. Mutation rates in males vs females: In response to a paper reporting that point mutation rates are much higher in males than females (because sequences on X chromosomes evolve slower than sequences on Y chromosomes), I used a computer simulation model to show that the excess mutations in male lineages usually canceled out the benefits of sexual recombination for females (Redfield Nature 1994). This paper made a big media splash when it came out; Natalie Angier wrote it up for the New York Times, Jay Leno made a joke about it, and it even got a paragraph in Cosmopolitan! This was partly because the title was full of buzzwords ('sex', 'male', 'female', 'mutation'), and partly because I wrote up a very clear, useful press release.

It didn't make much of a scientific splash, and it hasn't had much impact on subsequent work on the evolution of sex, but the number of citations continues to increase. Many citations are from a European group of theoretical physicists who publish mainly in physics journals, but others are from evolutionary biologists. One 2007 review discusses the implications of my work, referring to it as 'a seminal study' (which I choose to interpret as not just a bad pun).

The hotspot paradox: Most meiotic crossing-over happens at chromosomal sites called recombination hotspots; the largest influence on the activity of these sites is the DNA sequence at the site. While I was still a grad student I realized that, over evolutionary time, active hotspot sequences should disappear from genomes, being replaced first by less-active and then by inactive sequences. This is because the mechanism by which hotspots cause recombination also causes more-active hotspot sequences to be physically replaced by less-active sequences. This creates a paradox, because hotspots have not disappeared (each chromosome has many of them). At that time the genetic evidence was strong but little was known about the molecular details.

About 10 years later I returned to this problem, using detailed computer simulations to model the evolution of hotspots. We first created a deterministic model of a single hotspot, and showed that none of the forces opposing hotspot elimination (evolutionary benefits of recombination, benefits of correct chromosome segregation, direct fitness benefits of hotspots that also act as promoters, singly or in combination) were strong enough to maintain hotspots against their self-destructive activity. Several years later we created a better, stochastic, model that followed multiple hotspots on a chromosome - this confirmed and strengthened the previous conclusions.
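For anyone curious about the logic, here's a toy deterministic sketch of the drive (this is not the Boulton et al. model; the function, the update rules and all the parameter values are simplifications of mine):

```python
def hotspot_trajectory(p0, conversion, benefit, generations):
    """Toy single-locus model of the hotspot paradox.
    Each generation, biased gene conversion in heterozygotes removes the active
    allele (frequency change ~ -c*p*(1-p)), while a small fitness benefit b to
    hotspot carriers pushes back. Arbitrary rates, for illustration only."""
    p = p0
    for _ in range(generations):
        p -= conversion * p * (1 - p)              # self-destructive drive
        p = p * (1 + benefit) / (1 + benefit * p)  # weak selection for the hotspot
    return p

# Drive (1% per generation) overwhelms a much smaller benefit (0.1%):
print(round(hotspot_trajectory(0.9, 0.01, 0.001, 2000), 3))  # 0.0, hotspot lost
```

Unless the compensating benefit is at least as strong as the conversion drive (which none of the forces we examined was), the active allele is driven out, which is the paradox in miniature.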

The first paper (Boulton et al, PNAS 1997) was ignored by just about everyone, particularly the molecular biologists whose work might be expected to resolve the paradox. By the time the second paper was published (Pineda-Krch and Redfield 2005), evidence from human genetics had confirmed that the hotspot destruction originally studied in fungi also occurs in humans. Now, the increasing ability to examine individual crossover events at base-pair resolution has focused attention on the paradox, and most papers about hotspots in natural populations (including humans) mention it as a sign that the evolutionary history of recombination hotspots remains perplexing.

I'll write up a couple more of these projects tomorrow.
I decided to do a USS-model run with the values in the DNA-uptake matrix increased five-fold, thinking this meant that drive favouring uptake sequences would be much stronger.  This should have meant that, when beginning with a random sequence, there would initially be very few fragments that had sufficiently high scores to pass the probability filter and recombine with the genome.  So I was surprised when the frequency of recombination at the beginning of the run looked to be about the same as with the normal matrix.

But then I remembered that one nice feature of the latest version of the model is that it uses whatever values are in the DNA-uptake matrix to calculate an appropriate value for the recombination probability filter, always equal to the score earned by a single site that perfectly matches the matrix-specified preferences.  This is good under some circumstances, but in the present case it entirely defeats what I was trying to do.

So I guess it's time to tweak the model one more time, introducing a coefficient by which the probability filter can be multiplied.  Ideally we'd like to be able to sometimes have this coefficient change during the run, as this would allow us to simulate evolution of the uptake specificity itself.  I'm confident that I can introduce a constant coefficient (read in with the other run settings); I wonder if I can easily make it a variable...
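A minimal sketch of what I have in mind (in Python rather than the model's Perl, and all the names here are invented for illustration):

```python
# Sketch only -- not the actual model code. The recombination filter
# threshold is the score of a single site that perfectly matches the
# matrix, multiplied by a coefficient read in with the run settings.

def perfect_site_score(matrix):
    # matrix: one {base: value} dict per motif position
    return sum(max(column.values()) for column in matrix)

def recombination_threshold(matrix, coefficient=1.0):
    # coefficient < 1 relaxes the filter; coefficient > 1 tightens it
    return coefficient * perfect_site_score(matrix)

# toy two-position matrix
matrix = [{'A': 0.9, 'C': 0.03, 'G': 0.03, 'T': 0.04},
          {'A': 0.05, 'C': 0.05, 'G': 0.85, 'T': 0.05}]
threshold = recombination_threshold(matrix, coefficient=0.5)
```

Making the coefficient a variable would then just mean updating its value between cycles instead of reading it in once at the start.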

Test post from iPhone

The culture media problems we were having last fall are back. This time the postdocs suspect the medium, so they've ordered a new stock of the Bacto stuff.

I think I won't try to post any more from my phone until my iPhone typing speed improves.

Sent from my iPhone (by email to Blogger, because the iPhone doesn't support MMS messaging).

How arbitrary is a position weight matrix?

Recently I've run some USS-evolution simulations that started with a 50 kb segment of the H. influenzae genome rather than with a random sequence of base pairs. I used the position weight matrix derived by Gibbs analysis of the whole genome, thinking that this would be a non-arbitrary measure of the over-representation of the uptake sequence pattern. I was hoping that this bias would be strong enough to maintain the uptake sequences already in the genome, but the genome score fell to about half (or less) of its original value after 50,000 or 100,000 cycles.

That started me wondering whether the position weight matrix should be treated as a fixed set of values, or just as telling us the relative value that should be applied to each base at each position. Said another way, could different settings of the Gibbs analysis have given a very different matrix? The answer is No, but only because the Gibbs analysis reports the weight matrix as the frequency of each base at each position, so the sum of the weights of the four bases at each position must add up to 1. So if we want to consider stronger or weaker matrices, there's no reason not to multiply all the values by any factor we want to test.

So I think the first goal is to see what strength of matrix is needed to maintain the uptake sequences in the genome at their natural frequencies. Then we can see whether the uptake sequences are still in the places they started at, or whether these have decayed and new ones arisen in new places.
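To make the scaling concrete, here's a toy sketch (Python; the real model is in Perl) of multiplying every matrix value by a test factor - the relative preferences are untouched, only the absolute scores change:

```python
# Illustration only: scale a Gibbs-style position weight matrix,
# where each position's base frequencies sum to 1.

def scale_matrix(matrix, factor):
    # matrix: one {base: frequency} dict per motif position
    return [{base: value * factor for base, value in column.items()}
            for column in matrix]

column = {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1}
stronger = scale_matrix([column], 5.0)
# each position still prefers the same base, but a matching site now
# contributes five times as much to the genome score
```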

Some calculations and implications

In playing around with our Perl model of uptake sequence evolution, I've been surprised by how weak the effect of molecular drive seemed to be.  But I just did some back-of-the-envelope approximations and realized that our settings may have been unreasonable.

We can use the relationship between the genomic mutation rate the model is assuming and known real mutation rates for a rough calibration.   Real mutation rates are on the order of 10^-9 per base pair per generation.  Using such a low rate per simulation cycle would make our simulations take forever, so we've been using much higher rates (usually 10^-4 or higher) and treating each simulation cycle as collapsing the effects of many generations.  But we didn't take the implications of this seriously.  

If we do, we have each simulation cycle representing 10^5 generations.  How much DNA uptake should have happened over 10^5 generations?  We could assume that real bacteria take up about one fragment of about 1 kb every generation.  The length is consistent with a recent estimate of the length of gene-conversion tracts in Neisseria, but the frequency is just a guess.  I don't know whether it's a conservative guess, but if bacteria are taking up DNA as food it's probably on the low side of reality.  

How much of this DNA recombines with the chromosome?  For now let's assume it all does.  This means that, in each cycle of our simulation, a simulated full-size genome would replace 10^5 kb of itself by recombination with incoming DNA.  Because a real genome is only about 2 x 10^3 kb, each part of it would be replaced about 100 times in each cycle.  We could cause this much recombination to happen in our model, but it wouldn't simulate reality because there wouldn't be any reproduction or DNA uptake between the multiple replacements of any one position.  

We can make things more realistic by (1) assuming that only about 10% of fragments taken up recombine with the genome, and (2) decreasing the genomic mutation rate by a factor of 10, so each cycle only represents 10^4 generations.  Now most of the genome gets replaced once in each cycle.

What about the fragment mutation rate?  We might assume that, on average, the fragments that a cell takes up come from cells that are separated from it by about 5 generations.  That is, the cell taking up the DNA and the cell the DNA came from had a common ancestor 5 generations back.  This means that 10 generations of mutation separate the genome from the DNA it is taking up, so the fragment-specific mutation rate should be 10 times higher than the genomic mutation rate.

So I have a simulation running that uses a genome mutation rate of 10^-5 and a fragment mutation rate of 10^-4.  The fragments are 1 kb long, and the cell considers 100 of these each cycle.  
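The back-of-the-envelope numbers above can be laid out explicitly (all of these are the rough assumptions from the text, not measured values):

```python
# Rough calibration of simulation cycles against real generations.
real_rate = 1e-9            # real mutation rate, per bp per generation
sim_rate = 1e-5             # per-cycle genomic rate used in the run
gens_per_cycle = sim_rate / real_rate        # ~10^4 generations per cycle

uptake_kb_per_gen = 1.0     # guess: one ~1 kb fragment per generation
recombining = 0.1           # assume only ~10% of fragments recombine
genome_kb = 2e3             # a real genome is ~2,000 kb

replaced_kb = gens_per_cycle * uptake_kb_per_gen * recombining
coverage = replaced_kb / genome_kb   # ~0.5: roughly half the genome
                                     # replaced per cycle

# donor and recipient separated by ~5 generations each from their
# common ancestor, so fragments carry ~10 generations of mutations
fragment_rate = 10 * sim_rate        # 10^-4, as in the run described above
```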

One other modification of the program matters here.  We've now tweaked the program so that it can either start a run with a random-sequence 'genome' it's just created, or with a DNA sequence we give it, that can be taken from a real genome with real uptake sequences.  

So the run I'm trying now starts not with a random sequence but with a 50 kb segment of the H. influenzae genome.  This sequence already has lots of uptake sequences in it, so about half of the 1 kb fragments the model considers in each cycle pass the test and are recombined into the genome.  I'm hoping the new conditions will enable the genome to maintain these uptake sequences over the 50,000 cycle run I have going overnight.

The USS-Perl project becomes the USS-drive paper

I think I'm going to start using the label "USS-drive" for the manuscript that describes our "USS-Perl" computer simulation model.  That's because the focus of the manuscript will be on understanding the nature of the molecular drive process that we think is responsible for the accumulation of uptake sequences in genomes.
The plan is to combine the modeling analysis with our unpublished results on the bias of the uptake machinery and on the nature of the motif that has accumulated in the genome.
The broad outline will be as follows:
We want to understand how the bias of the uptake machinery can affect evolution of sequences in the genome, assuming that cells sometimes take up and recombine homologous sequences from closely related cells.  And we want to examine this in the context of what is seen in real cells - the real biases of the DNA uptake machineries and the real patterns of uptake-related sequences in genomes.
So we will begin by properly characterizing the genome sequences using the Gibbs Motif Sampler.  I've already done this analysis for all of the genomes that have uptake sequences.  And we've done Gibbs analysis on different subsets of the H. influenzae genome (coding, non-coding, potential terminators, different reading frames), and looked for evidence of covariation between different positions.
We will also collate the published uptake data for H. influenzae and N. meningitidis and N. gonorrhoeae, adding our unpublished H. influenzae data.  
And then we will present the model as a way to investigate the relationship between uptake bias and genome accumulation.  A key feature of the model is that it models uptake bias using a position-weight matrix that treats uptake sequences as motifs rather than as elements.  That is, it specifies the value of each base at each position of the motif.  This means that we can evaluate both uptake-bias data and the genome-frequency data as inputs into the model.  The uptake-bias data isn't really good enough for this, and I anticipate that the main focus will be using the genome frequency data to specify uptake bias in the model.
Because the model allows the matrix to be of any length, we can use it with the full-length H. influenzae motif (30 bp), not just the core.  And because the model lets us specify base composition, we can also use it for the Neisseria DUS. 

On to other manuscripts!

My bioinformatics colleague is headed back home, and our uptake sequence manuscript is waiting for the last bits of data, and for me to write a coherent and thoughtful Discussion. It's not nearly as far along as I'd hoped, but it's a much better piece of science than it was shaping up to be a week ago. One key step was switching the axes of a couple of the figures. Seems like a minor thing, but putting "% sequence identity" on the X-axis rather than the Y-axis transformed the data from meaningless to crystal clear.
The only big piece of analysis we still need is an examination of which of the H. influenzae genes that don't have BLAST hits in our three standard gamma-proteobacterial genomes also don't have hits in the more closely related A. pleuropneumoniae genome. Those that don't are stronger candidates for having entered the H. influenzae genome by horizontal gene transfer, and we predict they will also have relatively few uptake sequences.
So what's the problem with the Discussion? I don't seem to be able to see the big picture for the trees, and I'm hoping that setting the whole thing aside for a couple of weeks will let the fog dissipate.
Lord knows I have lots of other stuff to work on. I wish I was going to do some benchwork, but I'm afraid for now it's two other manuscripts.
One of these is the educational research I've been referring to as "the homework project". It's been on the back burner while the graders we hired scored the quality of students' writing on various bits of coursework and the final exam. I think this is all finished now, though I don't even know where the data is. The teaching-research post-doc who was working closely with me on this project has started her new job back east, but she'll continue to be involved in the data analysis and writing. I'm also blessed with involvement of a new teaching-research post-doc, who'll be working with me here. (Both post-docs are part of our CWSEI program, not part of my own group.) The first step to getting the analysis and manuscript-writing under way is to email both of them and get the work organized. I'll do that this morning.
The other manuscript is the project I've been referring to as "USS-Perl". The basic computer simulation model now works well, our programming assistant has gone on to other things, and the post-doc involved with this project has done a lot of runs of the simulation to pin down some basic relationships between parameter settings and result. (I forget what these are, so I need to talk to her today about this.) I have a plan for the first manuscript, which I'll spell out in a separate post later today.

Divergence of genes in different COG functional categories?

The bioinformatics manuscript is coming along nicely (though of course a lot more slowly than I had hoped).

One of the things it does is show that the densities of uptake sequences in genes do not correlate well with the 18 'COG functional categories' that genes have been assigned to.  This is a significant result because a previous paper claimed a strong correlation between high uptake sequence density and assignment to a modified COG functional category containing 'genome maintenance genes'.  That result was considered to provide strong support for the hypothesis that uptake sequences exist to help bacteria get useful new genes, a hypothesis I think is dead wrong.

Our hypothesis is that the distribution of uptake sequences among different types of genes (with different functions) should only reflect how strongly these genes' sequences are constrained by their functions.  Our analysis would be quite a bit more impressive if we showed a positive result - that uptake sequence density in different COG functional categories correlates well with the degree of conservation of the genes in these groups.  My first idea was to ask my bioinformatics collaborator to do this analysis.  But I suspect it might be a lot of work, because she's only done COG analysis with the A. pleuropneumoniae genome, whereas we would want the analysis done with the H. influenzae and/or N. meningitidis genes in each COG functional category, looking at the % identity of the three 'control' homologs we've used for our other analysis.

So I'm wondering whether someone might have already done a version of this analysis.  Not with H. influenzae or N. meningitidis, and not with the control homologs we've used, but any general estimate of levels of conservation of genes in the 18 COG functional categories.  I searched Google Scholar for papers about COG functional group divergence and found a good review of analyses one can do in the COG framework.  This got me hoping that maybe there was a web server that would let me do the analysis myself, but the paper didn't describe anything that would do the job.

But I looked deeper in Google Scholar's hits and found something that looks very promising.  It examines rates of sequence evolution across genes in the gamma-proteobacteria.  H. influenzae and the control genomes we used with it are all in the gamma-proteobacteria, and I think the paper has looked specifically at the relative rates of evolution of genes in different COG functional categories, so this might be exactly what I'm looking for.  The only problem is, the paper appeared in PLoS Genetics, and their site is down right now!  I'm trying to read the paper in Google's cached version, but the page is all greyed out and it can't show me the figures.  Guess I'll just have to be patient and hope the site is back up soon.

Base composition analysis supports gene transfer

Both our results and those recently published by another group suggest that uptake sequences are less common in genes that don't have homologs in related genomes.  I'm using 'related' loosely here, as our analysis searched for homologs in genes from other families of bacteria whereas the other group searched in members of the same genus.  But both results are best explained by the hypothesis that genes lacking uptake sequences are often those that a genome has recently acquired by lateral transfer from genomes that don't share that species' uptake sequence. 

To test this, we looked at gene properties that might give evidence of lateral transfer.  The simplest such property is base composition (%G+C), a property that differs widely between different bacterial groups and changes only slowly after a sequence has been transferred to a different genome.  My bioinformatics collaborator had already tabulated the %G+C of all the genes in the H. influenzae and N. meningitidis genomes, noting whether each gene had (1) uptake sequences and (2) homologs in the test genomes (see previous two posts for more explanation).  She'd then calculated the means and standard deviations of the base compositions of genes in each of the four classes.  

This analysis showed that, on average, the genes with no uptake sequences and no homologs also had anomalously low base compositions and larger standard deviations.  But I thought the effect might be clearer with a graphical presentation, so I plotted the actual %G+C for each gene as a function of which class it is in (presence/absence of uptake sequence and homologs).  I also noted sample sizes, as some classes have a lot more genes than others.  This gave a nice visualization of the effects, showing, not surprisingly, that genes with homologs in the control genomes have more homogeneous base compositions than those without homologs, and that the effect is quite a bit stronger for those genes that don't have uptake sequences.
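The class-by-class comparison boils down to something like this sketch (Python, with made-up toy genes; the real analysis used my collaborator's tabulated data for every gene in the genome):

```python
# Illustration only: compute %G+C per gene, then summarize by class
# (presence/absence of uptake sequences and of homologs).
from statistics import mean, pstdev

def percent_gc(seq):
    s = seq.upper()
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

# toy genes: (name, sequence, has_uptake_seq, has_homologs) -- invented
genes = [
    ('geneA', 'ATGGCGCATGGC', True,  True),
    ('geneB', 'ATGGCTCAGGCC', True,  True),
    ('geneC', 'ATGATTAATAAA', False, False),  # low %G+C: transfer candidate
]

by_class = {}
for name, seq, uss, hom in genes:
    by_class.setdefault((uss, hom), []).append(percent_gc(seq))

for cls, values in sorted(by_class.items()):
    print(cls, round(mean(values), 1), round(pstdev(values), 1))
```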

Order of Results

We're still working out the order in which we present the different results in our draft manuscript.  The first parts of the Results are now the analysis of how uptake sequence accumulation has changed the frequencies of different tripeptides in the proteome, and of why some reading frames are preferred.  

The next decision is how we present the results that use alignments of uptake-sequence-encoded proteins with homologs from three 'control'  genomes that do not have uptake sequences.  We had tentatively planned to first describe the finding that genes that don't have uptake sequences are less likely to have homologs in the control genomes (call this analysis A), then move to the genes that do have homologs in all three control genomes, first considering the overall degree of similarity (call this analysis B) and then considering the differences between the places where the uptake sequences are and the rest of the proteins (call this analysis C). 

But I was having a hard time seeing how to motivate the first analysis - it didn't easily connect with what came before. Now I think a better plan is to first describe analysis B, which shows that genes with more uptake sequences have lower similarity scores (actually identity scores - my collaborator and I disagree about the relative value of using similarity and identity scores). 

By describing analysis B first, we can more logically introduce the strategy of using alignments to homologs from a standard set of control genomes, before discussing the genes that lack homologs.  And the results of analysis B then motivate analysis A, looking at the genes that had to be omitted from analysis B because they didn't have all three homologs.

Organization of the Results

My bioinformatics collaborator and I are making good progress on both the writing of our manuscript and on improving the data that's going into it.  One of the latter came when I redid an analysis I first did about a year ago, this time being more meticulous about the input data and more thoughtful about how I used it.

The question the data address is why uptake sequences in coding regions are preferentially located in particular reading frames, specifically the effect of the codons they use on how efficiently the mRNA can be translated.  I had used the overall frequencies of different codons in the proteome (all the proteins the genome encodes) to estimate codon 'quality', by summing the frequencies of the three codons specified by the uptake sequence in a specific reading frame, and dividing this by the sum of the frequencies of the best (most-used) codons for the same amino acids.  This analysis showed no relationship between codon quality and the reading frame bias.  

In retrospect I should have multiplied the frequencies rather than adding them.  This both gives a more realistic approximation of how the codons' effects interact, and eliminates the differences caused by the differing frequencies of different amino acids in the proteome.  I should also have excluded those uptake sequence positions that allowed more than one codon. 
And I probably should have done the analysis for different species rather than only for H. influenzae (though, to be fair on myself, at the time I was working on a H. influenzae USS analysis). 
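In code, the difference between the old and new calculations looks something like this (Python, with invented codon frequencies - the real analysis used frequencies tabulated from each species' proteome):

```python
# Illustration only. usage: frequency of each codon among the synonymous
# codons for its amino acid; best: frequency of the most-used synonymous
# codon for that amino acid. All numbers here are invented.
usage = {'AAA': 0.7, 'AAG': 0.3,        # Lys
         'GTT': 0.4, 'GTG': 0.35,       # Val (other codons omitted)
         'GGT': 0.5, 'GGC': 0.3}        # Gly (other codons omitted)
best  = {'AAA': 0.7, 'AAG': 0.7,
         'GTT': 0.4, 'GTG': 0.4,
         'GGT': 0.5, 'GGC': 0.5}

def quality_sum(codons):
    # the old (additive) version: ratio of summed frequencies
    return sum(usage[c] for c in codons) / sum(best[c] for c in codons)

def quality_product(codons):
    # the new (multiplicative) version: product of relative frequencies
    q = 1.0
    for c in codons:
        q *= usage[c] / best[c]
    return q

frame = ['AAG', 'GTG', 'GGT']   # codons fixed by a hypothetical frame
```

The product also removes the amino-acid frequencies from the comparison, because each codon enters only as a fraction of its own synonymous family.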

So now I've redone the analysis for the three species that exemplify the different uptake sequence types (H. influenzae, A. pleuropneumoniae and N. meningitidis), and find a substantial effect of codon quality.  That is, in each species, uptake sequences are mostly found in reading frames that allow use of the most common codons for those amino acids.  As codon frequency is thought to reflect the abundance of the corresponding tRNAs, this means that differences in the efficiency of translation explain some of the differences in reading frame use.

Collaborative writing

My bioinformatics collaborator is in town till next Friday, so we can work together to finish up our uptake sequence manuscript.  We have done (she has done) a lot of different analyses addressing various questions about how uptake sequence accumulation has affected protein evolution and vice versa, and I'm having a hard time keeping them all straight (partly because I haven't been thinking about this project for a while).  

The manuscript is mostly written already; but some parts of the Results are still up in the air because they were waiting for some final data.  It also needs polishing and some revising to incorporate results from a Neisseria uptake sequence paper that came out a few months ago.

To cope with all the data we've been occasionally creating 'Flow of Results' pages that summarize our results and conclusions, in the order we think the paper should present them. Yesterday we began our final push by going through the most recent version of the 'Flow of Results'.  I'm not sure we have everything in the perfect order, but we have an order that's good enough, with each result triggering a question that's addressed by the next result.  

Speedy simulations

Our programming assistant has finished his work for us.  At least, he's finished the period when he's employed by us, but now he's gotten the research bug and wants to stay involved to see how the program he's developed for us works out.

At present it works great - MUCH faster than before.  One of the reasons is that it no longer scores the genome in every cycle.  I'm doing a run now with a big genome (100 kb) and I can see it pause at every 100th cycle, as it takes the time to score the genome.  This is a sensible improvement, as the genome score isn't needed for the progress of the simulation - it just lets the user monitor what's happening, and provides the information used to decide whether the run has reached equilibrium.

I'm using this run to test whether the final accumulated USS motif precisely matches the biases specified in the recombination decisions (i.e. in the matrix used to score each fragment), by using a matrix where one position has substantially weaker bias than the others.  If the final motif matches the matrix bias, this position should show a weaker consensus in the accumulated uptake sequences.  

Of course to test this I may have to recall how to run the Gibbs Motif Sampler on the Westgrid server.  The alternative is to chop my 100 kb output genome into 10 pieces and run each separately on the Gibbs web server, which has a sequence limit of 10 kb.
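Chopping the output genome into web-server-sized pieces is at least trivial; a quick sketch (Python, with a stand-in sequence):

```python
# Split a long output genome into pieces no longer than the Gibbs
# web server's 10 kb limit.
def chop(sequence, piece_size=10_000):
    return [sequence[i:i + piece_size]
            for i in range(0, len(sequence), piece_size)]

genome = 'ACGT' * 25_000          # stand-in for the 100 kb output genome
pieces = chop(genome)             # ten 10 kb pieces
```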

Sex and Recombination in Minneapolis?

The meeting on Sex and Recombination that was to be held in Iowa City next week has been canceled because of severe flooding! (And just when I was getting my talk pulled into shape too...)

I suspect I'm not the only one who planned to combine that meeting with the big Evolution meetings in Minneapolis, June 20-24. I'm flying into Minneapolis on June 15. If you'd like to get together with me or other sex-refugees, post a comment below and we'll see what we can organize.

Preparing talks

I've been struggling to pull together ideas and data for the two talks I'm giving (next week and the week after) at evolution meetings.  Yesterday was my turn to give lab meeting, and I used it to get help from the post-docs.  My draft talks were a mess, but the post-docs had lots of excellent suggestions.  Both talks will use the same introduction to the evolutionary significance of bacterial DNA uptake, and will then diverge. 

On the USS-evolution simulation front, the model is running nicely and I'm using it to quickly collect data for my first talk (20 minutes, for the Sex and Recombination meeting).  But I have to trade statistical significance off against run time, as the large genomes needed to get lots of sequence data take a long time to simulate.  

On the proteome-evolution front, my bioinformatics collaborator just sent me the last of the data, including a control set needed for comparison with the analysis of how divergent USS-encoded peptides are from their homologs.