It's been soooo looong....

...but I'm about to do a real experiment at the bench, because one of my post-docs has called me on my bragging that "I can do a time course in my sleep".
Several years ago we did some collaborative work on DNA uptake by Actinobacillus pleuropneumoniae, an H. influenzae relative that infects pigs rather than people (the PDF is here). The collaborators also generated data on the regulation of competence by the A. pleuropneumoniae homolog of our favourite protein, Sxy. They have now moved on to other things, and this post-doc is writing up a short manuscript (just a 'Note') combining their data with some of ours.
One issue her manuscript examines is how competence changes in response to different culture conditions. I had investigated this during the period of collaboration, by following changes in transformation frequency over time, as cells were grown in rich medium and transferred to the starvation medium that induces competence in H. influenzae. I did versions of this 'time course' several times, but each was plagued by inconsistencies in the numbers of colonies that grew up on the agar plates. I couldn't figure out what was causing the inconsistencies -- the cells were no more sensitive to minor variations in plating conditions than H. influenzae is -- so I compensated by sampling the culture more often. This generated data that was marginally acceptable, and I went on to other things.
Now the post-doc wants to include this data in her manuscript, but it needs to be replicated first. Over to me. I checked yesterday and was reassured to find that I have lots of the kanamycin-resistance DNA I need to give the cells (I sensibly did a large high-quality prep in 2005), and a good stock of frozen competent A. pleuropneumoniae cells that I can use as controls. So today I'm going to thaw out some of these cells and transform them with the DNA, just to confirm that everything is working. Sunday I'll count the colonies from that little experiment, and if all looks good I'll plan the big time course, using my best 2005 time course as a guide.
Monday I'll have help from our work-study student. She and I will make the agar for all the plates we'll need, and pour about 250 plates. She'll first make up a high-concentration stock of the NAD supplement this agar needs. We'll prepare lots of dilution solution and put 5 ml aliquots into hundreds of tubes. We'll start an overnight culture of the cells, so they'll be ready to go on Tuesday morning.
Tuesday I'll start by diluting the cells into fresh medium and following their growth by measuring the turbidity of the culture. If I'm lucky the work-study student will be free to help me. We'll put a drop (10 ul = 1 ug) of DNA in each of about 30 tubes. At about 30-minute intervals we'll transfer 1 ml of cells from the culture into one of these tubes, incubate for 15 min, add 10 ul of DNase I to destroy the remaining DNA, incubate for 5 more minutes, and then dilute the cells and spread them on plates with and without kanamycin. Once the culture has reached a specific density we'll transfer part of it to starvation medium, and also test aliquots from this at 30 min intervals.
If we're ambitious we may also add the potential inducer cyclic AMP (cAMP) to another part of the culture. In H. influenzae cAMP induces competence, but my 2005 results with A. pleuropneumoniae were very unreproducible.
Then, on Wednesday or Thursday, maybe I can convince the work-study student that counting colonies is a thrilling part of scientific research (like Tom Sawyer white-washing the fence). Otherwise I'll have to count them myself.
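
For the record, the arithmetic behind the colony counting is just the standard transformation-frequency calculation. Here's a minimal sketch in Python (the colony counts, dilutions and plated volumes are made up purely for illustration):

    # Sketch: transformation frequency at one time point (illustrative numbers only).
    # frequency = (Kan-resistant cfu/ml) / (total cfu/ml), with each count corrected
    # for the dilution plated and the volume spread on the plate.

    def cfu_per_ml(colonies, dilution_factor, volume_plated_ml):
        """Convert a plate count back to cfu per ml of the original culture."""
        return colonies * dilution_factor / volume_plated_ml

    # Hypothetical counts for one 30-minute sample:
    kan_colonies = 85       # colonies on kanamycin plates, 10^-2 dilution, 0.1 ml spread
    total_colonies = 230    # colonies on plain plates, 10^-6 dilution, 0.1 ml spread

    kanR_per_ml = cfu_per_ml(kan_colonies, 1e2, 0.1)
    total_per_ml = cfu_per_ml(total_colonies, 1e6, 0.1)

    print("Kan-R cfu/ml: %.2e" % kanR_per_ml)
    print("Total cfu/ml: %.2e" % total_per_ml)
    print("Transformation frequency: %.2e" % (kanR_per_ml / total_per_ml))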

Nothing interesting here, folks

OK, so I got to my office this morning all enthusiastic to do the additional runs that would clarify why N. meningitidis has twice as many forward-orientation DUSs in the strand synthesized discontinuously. I did three different things, all of which confirmed that the two-fold difference was just an aberration in the Gibbs analysis.

First, I plotted the distribution of forward-DUSs along both strands of the genome (yesterday I only had time to do it for the reverse-complement strand). This clearly showed that the two strands are the same -- the blue ticks in the figure below (just a close-up of part of the figure) are DUSs on the forward strand, and the red ones are DUSs on the reverse-complement strand.
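
For anyone who wants to redo this kind of tick plot, here's roughly what it amounts to, sketched in Python rather than the scripts we actually use. The file name genome.fa is hypothetical, and I'm writing the 12-mer DUS as ATGCCGTCTGAA; substitute whatever motif and input file you're actually working with:

    # Sketch: positions of forward- and reverse-orientation DUSs along the genome.
    # Assumptions: genome in a one-record FASTA file 'genome.fa'; the DUS written
    # here as ATGCCGTCTGAA (substitute the exact motif used in the real analysis).
    import re
    import matplotlib.pyplot as plt

    DUS = "ATGCCGTCTGAA"
    COMP = str.maketrans("ACGT", "TGCA")
    DUS_RC = DUS.translate(COMP)[::-1]          # reverse-complement orientation

    with open("genome.fa") as f:
        genome = "".join(line.strip() for line in f if not line.startswith(">")).upper()

    fwd_pos = [m.start() for m in re.finditer(DUS, genome)]
    rev_pos = [m.start() for m in re.finditer(DUS_RC, genome)]

    print(len(fwd_pos), "forward DUSs;", len(rev_pos), "reverse-complement DUSs")

    # Blue ticks for forward-strand DUSs, red for reverse-complement ones.
    plt.eventplot([fwd_pos, rev_pos], colors=["blue", "red"], lineoffsets=[1, 0])
    plt.yticks([1, 0], ["forward", "reverse-complement"])
    plt.xlabel("genome position (bp)")
    plt.savefig("dus_ticks.png")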

Second, I completed the control analysis I had to interrupt yesterday. This analyzed the reverse-complements of the 'leading' and 'lagging' sequences I had assembled yesterday. It was a way of repeating the analysis on different sequences that had the same information content. Result: very similar numbers of DUSs in both.

Third, I assembled new 'leading' and 'lagging' sequences, using our SplitSequence.pl script to efficiently find the midpoints I'm using as surrogate termini, then reran the Gibbs analysis on these. Result: very similar numbers of DUSs in both, and these DUSs gave effectively identical logos.

So I went back and examined the Gibbs output that had had twice as many DUSs as the others. For unknown reasons, both replicate runs had settled on less-highly specified motifs, and thus included a lot more poorly matched sites in their output. Well, at least I can now very confidently report that there is no direction-of-replication bias in N. meningitidis DUSs.

Two manuscripts submitted!

Within 24 hours of each other! One on the phenotypic and genotypic variation between H. influenzae strains in competence and transformation, and the other on what Sxy does in E. coli.

leading and lagging DUSs

We're adding data about the uptake sequences in the Neisseria meningitidis genome (called DUSs) to our previous analysis of Haemophilus influenzae uptake sequences (USS).  But my initial  analysis of how these DUSs are distributed turned up something unexpected.

For H. influenzae, we've known for a long time that USSs are distributed quite randomly around the genome. Well, the spacing is a bit more even than perfectly random would be, and USSs are somewhat less common in coding sequences, but these qualifications are not important for this unexpected result. In particular, the two possible orientations of the USSs occur quite randomly around the chromosome. (I've described this analysis in a previous post, though not the results.)

So I did the same analysis for Neisseria DUSs. I separated the forward and reverse-complement genome sequences into 'replicated as leading strand' and 'replicated as lagging strand' halves. This was relatively simple because the origin of replication had been assigned to position #1. Nobody has mapped the terminus of replication, so I just picked the midpoint (bidirectional replication from the origin is expected). I wrestled with Word to find this midpoint, but I'll redo this using our 'splitsequence.pl' perl script to divide each sequence exactly in two. Then I used the Gibbs motif sampler (now running happily on our desktop Macs) to find all the forward-orientation DUSs in the 'leading' and 'lagging' sequences.
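
For what it's worth, the splitting-and-counting step boils down to something like the sketch below (Python rather than perl, with the same hypothetical genome.fa file and 12-mer as in the earlier sketch; the leading/lagging assignment assumes bidirectional replication from position 1 toward the midpoint):

    # Sketch: split the genome at the origin (position 1) and its midpoint into
    # 'leading' and 'lagging' sequence sets, then count forward-orientation DUSs
    # in each. Assumptions: one-record 'genome.fa', DUS written as ATGCCGTCTGAA.
    import re

    DUS = "ATGCCGTCTGAA"
    COMP = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        return seq.translate(COMP)[::-1]

    with open("genome.fa") as f:
        genome = "".join(line.strip() for line in f if not line.startswith(">")).upper()

    mid = len(genome) // 2                      # surrogate terminus: the midpoint
    first_half, second_half = genome[:mid], genome[mid:]

    # With the origin at position 1 and forks moving both ways toward the midpoint,
    # the top strand of the first half and the bottom strand of the second half are
    # replicated as leading strand (and the complementary halves as lagging strand);
    # if the orientation convention is flipped, only the labels change.
    leading = first_half + revcomp(second_half)
    lagging = revcomp(first_half) + second_half

    for name, seq in [("leading", leading), ("lagging", lagging)]:
        print(name, "forward-orientation DUSs:", len(re.findall(DUS, seq)))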

The surprise was that it found twice as many DUSs in the lagging strand as in the leading strand. After mistakenly considering that this could be because there were more genes on the leading strand (irrelevant, because genes, and DUSs, occupy both strands), I decided that it must be because DUSs were oriented non-randomly around the genome, mostly pointing the same way with respect to the direction the replication fork passes through them. So I did a quick analysis of the locations of forward-pointing DUSs around the chromosome, expecting to find that they were more frequent near one end than the other. But they appeared to be evenly distributed, which would mean the only other explanation is that I've screwed up.

Later I'll do some more analyses to sort this out.

Neisseria repeats

I've spent parts of the last couple of days discovering that a dataset of uptake sequences (DUS) from the Neisseria meningitidis intergenic regions contained a large number of occurrences of a different motif. I could easily see that this motif was longer than the 12 bp DUS, but I couldn't figure out what it was.

I spent a long time suspecting that these repeats were 'correia' elements, a very short but complex transposable element common in Neisseria genomes.  But I couldn't find a clear illustration of the correia consensus, and I couldn't find a good match between the correia sequences I could find and the sequences of the stray motif in my DUS dataset.

Finally I realized that I could try using the Gibbs motif sampler to characterize the motif.  So I took my set of intergenic sequences, used Word to delete all the perfect DUS (both orientations), and asked Gibbs to find a long motif.  I didn't know how long the stray motif actually was, so I tried guessing 20 bp, then 30, then 40.  But this didn't seem to be working - instead of finding a couple of hundred long correia-like motifs it would find a couple of thousand occurrences of something with what looked like a very poor consensus.  So I seeded the sequence set with about 20 occurrences of the motif taken from the dataset where I'd first noticed it.  
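
The Word step (deleting every perfect DUS before giving the sequences to Gibbs) is the only part that's easily scripted; something like this sketch, assuming a hypothetical FASTA file of intergenic sequences and again writing the 12-mer as ATGCCGTCTGAA:

    # Sketch: remove every perfect DUS (both orientations) from a set of intergenic
    # sequences before running the motif search. Assumes a FASTA file 'intergenic.fa'
    # and the 12-mer written as ATGCCGTCTGAA (swap in the motif actually used).
    DUS = "ATGCCGTCTGAA"
    DUS_RC = DUS.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def read_fasta(path):
        """Return a list of (header, sequence) pairs."""
        records, header, chunks = [], None, []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        records.append((header, "".join(chunks)))
                    header, chunks = line, []
                else:
                    chunks.append(line.upper())
        if header is not None:
            records.append((header, "".join(chunks)))
        return records

    with open("intergenic_noDUS.fa", "w") as out:
        for header, seq in read_fasta("intergenic.fa"):
            stripped = seq.replace(DUS, "").replace(DUS_RC, "")
            out.write(header + "\n" + stripped + "\n")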

Gibbs again returned about 1500 of what looked like poor-consensus occurrences, but this time I had a bit more confidence that this might be what I was looking for, so I trimmed away all the notation and posted them into WebLogo.  This gave me a palindromic repeat that I'll paste below later, and a bit of Google Scholar searching showed me that this isn't correia at all, but a short repeat called RS3, known to be especially common in intergenic sequences of the N. meningitidis strain I'm using.

So now I can write a sensible manuscript sentence explaining what these repeats are and why I'm justified in removing them from the dataset.

Another issue for the new uptake plans

Here's another technical problem for the new plans. We want to reisolate and sequence DNA that competent cells have taken up into their cytoplasm. We'd start with a large number of competent cells carrying a rec-1 mutation that prevents them from recombining DNA with their chromosome, and incubate them with genetically marked DNA fragments. We expect the DNA to be taken up (it would contain one or more USSs) across the outer membrane and then translocated into the cytoplasm. The problems arise because, although the process of translocation starts at a free end of double-stranded DNA, one of the strands is degraded as the DNA is translocated. This means that the DNA in the cytoplasm will be single-stranded.

We can treat the cells with DNase I to destroy DNA that's left outside the cells or on their surfaces, and then wash them thoroughly to remove the DNase I so it doesn't act on the DNA inside the cells. We can then lyse the cells with SDS in the presence of EDTA, being very gentle so as not to break the long chromosome.

Getting rid of chromosomal DNA is very important, as there will be at least 10 times as much of it as the incoming DNA we want to reisolate. If we start with input DNA that is of a uniform and relatively short length, we will be able to use size fractionation to get rid of most of the chromosomal DNA. And we probably can further enrich for the input DNA by fractionating on a column that selects for single-stranded DNA.

One solution would be to affix a sequence tag to the ends of the input fragments, and then use a primer for this tag to PCR-amplify the input DNA. Unfortunately, the leading (3') end of the surviving incoming strand is also thought to be degraded, so the tag would probably be lost. As this is the end the PCR would start from, the PCR then wouldn't work.

We don't want to tamper with the structure of the incoming DNA, as this is likely to interfere with normal uptake in some way we can't control for. And we don't want to use recipients with nuclease knockout mutations, partly because we don't even know which nucleases are responsible and partly because we don't want to pervert the normal uptake/processing events.

One possibility is to use a combination of tagging and random priming, with the lack of tags on the 3' ends compensated for by the random primers. Maybe we could test this using radioactively labelled input DNA with tags: if the method is working, most of the input radioactivity in the reisolated DNA would be converted to double-stranded form. Or we could test it using DNA from another species and sequence a small fraction of the PCR products to see whether they were indeed from the input genome rather than from the recipient.
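
The bookkeeping for that other-species test would be simple; here's the kind of thing I have in mind, as a sketch only (hypothetical file names, exact substring matching standing in for a real aligner like BLAST, and reverse-complement matches ignored for brevity):

    # Sketch: what fraction of sequenced PCR products come from the input (donor)
    # genome versus the recipient genome? File names are hypothetical; exact
    # substring matching stands in for a proper aligner, and reverse-complement
    # matches are ignored for brevity.

    def read_fasta_seqs(path):
        seqs, chunks = [], []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith(">"):
                    if chunks:
                        seqs.append("".join(chunks))
                    chunks = []
                else:
                    chunks.append(line.upper())
        if chunks:
            seqs.append("".join(chunks))
        return seqs

    donor = read_fasta_seqs("donor_genome.fa")[0]
    recipient = read_fasta_seqs("recipient_genome.fa")[0]
    reads = read_fasta_seqs("pcr_product_reads.fa")

    counts = {"donor": 0, "recipient": 0, "neither/both": 0}
    for read in reads:
        in_donor = read in donor
        in_recipient = read in recipient
        if in_donor and not in_recipient:
            counts["donor"] += 1
        elif in_recipient and not in_donor:
            counts["recipient"] += 1
        else:
            counts["neither/both"] += 1

    total = len(reads)
    for k, v in counts.items():
        print("%s: %d (%.1f%%)" % (k, v, 100.0 * v / total if total else 0))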

Because we're really only interested in the relative proportions of different input sequences in our reisolated DNA, we can tolerate a modest fraction being from the recipient genome. But we don't want to waste our expensive sequencing dollars on DNA that's mostly from the recipient.

Planning an uptake/sequencing experiment

A future post-doc and I are re-planning an experiment I discussed a couple of years ago and included in an unsuccessful grant proposal, designed to finally find out the real sequence specificity of the DNA uptake machinery. At that time I only proposed a little bit of sequencing, but sequencing keeps getting cheaper so this time we're going to do it right. This blog post is to help me get the numbers clear.

In my original plan we'd start with a pool of mutagenized (or degenerately synthesized) USS, either as short fragments (say 50bp) or inserts in a standard plasmid. The USS in the pool would have, at each position of the 30bp USS, a 91% probability of carrying the consensus base and a 3% probability of carrying any one of the 3 other bases. The flanking sequences might be mutagenized too, or the mutagenized fragment might have been ligated to a non-mutagenized tag or plasmid sequence. We'd give this DNA to competent cells, reisolate the DNA that had bound to the cells, and sequence it. We'd also sequence the 'input' pool. The differences between the 'input' and 'bound' pool sequences would define the specificity.

Only about 20 positions of the USS show a significant consensus, so in the analysis below I'm going to assume that the USS is 20bp long. On average each USS would have about 2 changes away from the consensus in these 20bp, but the range would be broad. Figuring out how broad is part of the reason for this post.

For example, what fraction of the fragments would have no differences from the consensus in the relevant 20bp? That's (0.91)^20, which Google says is about 0.15. What about fragments with 1 difference? I think that's about 20 * 0.09 * (0.91)^19, because there are 20 different possible locations for the difference. That's about 0.3. I fear the calculation will be more complicated for the larger numbers of differences. A calculation similar to the second one above gives 0.56 as the frequency of USSs with 2 mismatches, but that's unlikely to be correct, because the sum 0.15 + 0.3 + 0.56 = 1.01 would leave no possibility of USSs with more than two differences from the consensus. So for the analysis below I'll just use some fake numbers that I think are plausible: 0.1 with no difference, 0.25 with one difference, 0.3 with two differences, 0.2 with three differences, and 0.15 with four or more differences.
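
In hindsight, the 0.56 comes from counting each pair of positions twice (20*19 ordered pairs instead of the 190 unordered ones); the binomial formula gives about 0.28 for two differences, and the full distribution works out to roughly 0.15, 0.30, 0.28, 0.17 and 0.10 for zero through four-or-more differences -- not far from the fake numbers I just picked, so I'll stick with them. A quick sketch that tabulates it:

    # Sketch: binomial distribution of the number of non-consensus positions in a
    # 20 bp USS when each position matches the consensus with probability 0.91.
    from math import comb

    p_mismatch = 0.09
    n_positions = 20

    dist = [comb(n_positions, k) * p_mismatch**k * (1 - p_mismatch)**(n_positions - k)
            for k in range(n_positions + 1)]

    for k in range(4):
        print("exactly %d differences: %.3f" % (k, dist[k]))
    print("four or more differences: %.3f" % sum(dist[4:]))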

How is a sequence-specific uptake system likely to act on this variation? Let's assume that fragments with 'perfect' consensus USSs are taken up with probability 1. For the simplest version, let's assume all one-position differences have the same effect, reducing binding/uptake to 0.1, and all two-position differences reduce it to 0.01, etc. The 'bound' pool would then be expected to contain 0.78 perfects, 0.195 one-offs, 0.023 two-offs, 0.0015 three-offs and 0.0001 USSs with four or more differences.

How much sequencing of the 'bound' pool would we have to do to have included all of the three-off sequences (i.e. sequenced each version at least once)? Only one sequence in about 1000 will be a three-off, and there are 1140 different ways to choose which three positions differ (20*19*18/6, because the order of the positions doesn't matter). But there are also three ways to be different at each position that's different... Yikes, I'm in over my head.

OK, let's do it for the one-offs first. There are 20 different locations for the difference, and three possible differences at each location, so that's 60 different versions. About 0.2 of all sequences will be in this class, so we'd need to sequence a few thousand USSs to get reasonable coverage. What about the two-offs? There are 20*19/2 = 190 possible pairs of locations for the differences, and 9 possible combinations of differences for each pair of locations, so that's 1710 different sequences. Only about 0.023 of the fragments in the pool would have two differences, so we'd need to sequence about 75,000 USSs to get one-fold coverage, say close to a million sequences to get good coverage. For the three-offs, the numbers are 0.0015, 1140 and 27, giving about 31,000 different sequences, with about 2x10^7 sequences needed for one-fold coverage (say 2x10^8 for reasonable coverage).
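
Since this counting is error-prone by hand, here's a sketch that does it all in one place: the number of distinct k-off variants, their expected share of the 'bound' pool for a given per-mismatch penalty, and the number of sequences needed to see each variant at least once. The input-pool fractions are my round numbers from above, not the exact binomial ones:

    # Sketch: how many distinct k-mismatch USS variants exist, and how many 'bound'
    # pool sequences we'd need to read each one at least once, for a given
    # per-mismatch reduction in uptake. Input-pool fractions are the round numbers
    # assumed in the text; the four-or-more class is left out for simplicity.
    from math import comb

    input_fraction = {0: 0.10, 1: 0.25, 2: 0.30, 3: 0.20}
    penalty_per_mismatch = 0.1        # try 0.5 for the gentler model

    # Relative weight of each class in the bound pool, then normalize.
    weights = {k: f * penalty_per_mismatch**k for k, f in input_fraction.items()}
    total = sum(weights.values())
    bound_fraction = {k: w / total for k, w in weights.items()}

    for k in range(1, 4):
        n_variants = comb(20, k) * 3**k       # choices of positions x alternative bases
        one_fold = n_variants / bound_fraction[k]
        print("%d-off: %5d variants, bound fraction %.4f, ~%.1e sequences for one-fold coverage"
              % (k, n_variants, bound_fraction[k], one_fold))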

If I instead assume that each mismatch reduces binding/uptake by only 0.5, then the 'bound' pool would have 0.3 perfects, 0.37 one-offs, 0.22 two-offs, 0.07 three-offs and 0.03 USSs with four or more differences from the consensus. The corresponding numbers of fragments sequenced for reasonable coverage of one-offs, two-offs and three-offs would be about 1000, 100,000, and 5x10^6.

A more realistic model should have some positions more important than others, because that's the main thing we want to find out. What if differences at one of the positions reduce uptake to 0.1, and differences at the other 19 reduce it only to 0.5? We'd of course recover more of the differences at the less-important positions than of the differences at the important position.

How does thinking about this change the amount of sequencing we'd need to do? If all we're interested in is the different degrees of importance of positions considered singly, then we'd see this easily by doing, say, enough sequencing to get 100-fold coverage of the important single mismatches. Even 100-fold coverage of the mismatches at unimportant positions would be quite informative, as we only need enough to be confident that differences at the one 'important' position are underrepresented in our sequences. So 10,000 USS sequences from the 'bound' pool would be plenty to detect various degrees of underrepresentation of differences at the important positions.

But what if we also want to detect how differences at different positions interact? For example, what if having two differences beside each other is much worse for uptake than having two widely separated differences. Or if having differences at positions 3 and 17 is much worse than having differences at other combinations of positions? Or having A and C at positions 3 and 17 is much worse than having other non-consensus bases at those positions?

We would certainly need many more sequences to detect the scarcity of particular combinations of two or more differences. The big question is, how many? Consider just the two-offs. 100,000 sequences would let us get almost all of the non-important two-off variants at least once, and most of them about 5-10 times. But that wouldn't be enough to confidently conclude that the missing ones were not just due to chance -- we'd need at least 10 times that much sequence.
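
A rough way to put numbers on 'not just due to chance': treat each variant's count as Poisson with mean N times its expected bound-pool frequency, and ask whether a depleted count would stand out once we allow for testing all ~1700 two-off variants at once. A sketch, using the 0.5-per-mismatch model's two-off fraction (an assumption, like everything else here):

    # Sketch: with ~1700 two-off variants examined at once, how confidently can we
    # call a variant under-represented from N 'bound' pool sequences? Poisson
    # model: expected count lambda = N * p; we ask how improbable a depleted count
    # is, with a crude Bonferroni correction for testing all variants together.
    from math import exp, factorial

    def poisson_cdf(x, lam):
        """P(X <= x) for X ~ Poisson(lam)."""
        return sum(exp(-lam) * lam**k / factorial(k) for k in range(x + 1))

    n_variants = 1710                 # 190 position pairs x 9 base combinations
    two_off_fraction = 0.22           # two-off share of the bound pool (0.5-penalty model)
    alpha = 0.05 / n_variants         # Bonferroni-corrected significance threshold

    for n_sequences in (1e5, 1e6):
        lam = n_sequences * two_off_fraction / n_variants   # expected count, no depletion
        for depletion in (2, 5, 10):
            observed = int(lam / depletion)                  # typical count if depleted
            p = poisson_cdf(observed, lam)
            verdict = "significant" if p < alpha else "not significant"
            print("N=%.0e, %2dx depleted: expect %.0f, see ~%d, p=%.1e (%s)"
                  % (n_sequences, depletion, lam, observed, p, verdict))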

How much sequencing is that? If the fragments are 50bp, and we want, say, 10^6 of them, that's 5x10^7 bp of sequence. Future post-doc, that's a modest amount, right?

Given a sufficiently large data set, we do have access to software that would detect correlations between sequences at different positions (we used it for the correlation analysis in the USS manuscript).

Once we had identified candidates for significantly under-represented sequences, I wonder if there is a technique we could use to go back to the pool and confirm that these sequences were genuinely under-represented? Something analogous to validating microarray results with Q-PCR? Maybe the reverse? Ideally, we would have an oligo array with every possible two-off USS, and hybridize our bound pool to it. Probably not worth the trouble.

The other reason I'm writing this post is to figure out how much DNA we'd need to start with, in order to end up with enough sequences to define the specificity. If the 'bound' pool contained 1 microgram of 50bp fragments -- that would be 10^13 fragments. This should be enough to encompass way more diversity than we would ever be able to sequence. To get 1 microgram we'd need to start with an awful lot of cells, but even if we cut this down to 1 ng we'd still be fine.
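
Checking that arithmetic (assuming roughly 650 g/mol per base pair of double-stranded DNA), 1 microgram of 50bp fragments is a bit under 2x10^13 molecules, so 'about 10^13' is right:

    # Sketch: how many 50 bp fragments are in 1 microgram of double-stranded DNA?
    # Assumes ~650 g/mol per base pair.
    AVOGADRO = 6.022e23
    bp_per_fragment = 50
    grams = 1e-6                          # 1 microgram

    molar_mass = bp_per_fragment * 650.0  # g/mol per fragment
    fragments = grams / molar_mass * AVOGADRO
    print("fragments in 1 ug: %.1e" % fragments)   # about 2e13

    # And the sequencing total mentioned above: 10^6 fragments x 50 bp each.
    print("bp to sequence for 1e6 fragments: %.1e" % (1e6 * bp_per_fragment))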

Outline for Why do bacteria take up DNA

As I said in the previous post, this will be a review article emphasizing how we can decide why bacteria take up DNA, rather than claiming we have enough evidence to decide.

It will begin with an overview about why this question is so important. To make the writing easier I can probably modify text from my grant proposals and from Do bacteria have sex, but this new introduction should be a lot shorter than the coverage of this issue in that article. The basic things to say are:
  1. Recombination between bacteria has been enormously important in their evolution, so we really should try to understand why it happens.
  2. It's usually been assumed that the processes that cause recombination exist because the recombination they cause is selectively advantageous. (This might be viewed as meta-selection - selection acting on how genetic variation arises.) But studies of the evolution of sex in eukaryotes have shown that the costs and benefits of recombination are very complex.
  3. What are these processes? Consider gene transfer separately from physical recombination. The major gene transfer processes are conjugation, transduction and transformation. Some 'minor' processes also contribute to bacterial recombination: gene transfer agent is the only one that comes to mind right now. In other cases, genetic parasites (transposable elements, and probably integrons) hitchhike on processes that transfer genes. I won't discuss these, though I may say that I won't discuss them. Physical recombination acts on DNA that has entered the cytoplasm ("foreign DNA").
  4. The proximate causes of most bacterial recombination are reasonably well understood, at least at the molecular level (what happens inside cells, and when cells encounter other cells, phages or DNA). We know much less about the processes that determine when such encounters happen, but that's probably not a topic for this review.
  5. Some of the ultimate (evolutionary) causes are also understood. We understand the 'primary' evolutionary forces acting on cells, phages and conjugative elements. By 'primary' I guess I mean forces that directly impact survival and reproduction.
  6. These forces appear to provide sufficient explanation for the existence of many of the processes that contribute to bacterial recombination. Specifically, strong selection for replication by infectious transfer to new hosts explains why phages and conjugative elements exist, and strong selection for DNA maintenance and repair explains why the cytoplasmic proteins that cause physical breaking, pairing and joining of DNA exist. This logic needs to be spelled out very clearly.
  7. The need for nucleotides and other 'nutrients' obtained from DNA breakdown could also explain why cells take up DNA, but this question is more complicated than the others.
  8. Describe the other potential advantages of DNA uptake: templates for DNA repair, and genetic variation by recombination.
To be continued...

Review articles to be written

Two, in fact. I really really need to publish a review about the evolution of competence (=DNA uptake). Something like my Do bacteria have sex review, but updated and focusing much more on the competence and uptake issues. And I've also promised to write a chapter on competence and transformation for a book celebrating a wonderful colleague. The model for this book is Phage and the Origins of Molecular Biology, written as a festschrift for Max Delbruck. Ideally I'd love to produce something as charming as Andre Lwoff's The Prophage and I in that book, but I think I'd better lower my standards and get the job done.

Starting now.

OK, the first things I need are outlines.

For the book chapter, I was thinking about writing it as several levels of history, maybe interleaved (?): my personal history of working on competence and transformation, the history of research into competence and transformation, and the evolutionary history. But I don't know how I would make my personal history interesting - maybe emphasizing how I came to realize that the standard view is probably wrong?

What about an outline for the evolution of competence review? It should be driven by a question, probably "Why do bacteria take up DNA?", and should emphasize the unanswered questions and the research that's needed, rather than claiming that the answer is now known. Maybe I'll start with the big picture issue of the evolutionary and proximate causes and consequences of genetic exchange in bacteria, summarizing the arguments in Do bacteria have sex?. This introduction would conclude that the only phenomenon requiring more research is competence. Then I'll consider the components of competence (regulation, DNA uptake mechanisms, and the associated cytoplasmic 'DNA processing' events), putting what we know in the context of the evolutionary questions and explaining how additional information would clarify things.