Field of Science

Growing cells and making RNA (...tomorrow...)

Today I'm actually doing something in the lab!

As my latest contribution to the sub-inhibitory antibiotic effect collaboration, I'm growing wild type cells with and without a sub-inhibitory concentration of the antibiotic rifampicin (aka rifampin), collecting the cells in log phase, and making RNA from them.

The plan (discussed previously) is to get the cells growing in their normal medium and then transfer them to parallel flasks with and without the antibiotic. The cells will be allowed to grow and divide for at least two hours (four cell doublings), diluting the cultures with fresh medium as needed to keep the cell density low enough that the cells continue to double exponentially. This will make sure that the cells have fully adjusted their metabolism to the availability of abundant medium with and without antibiotic.

Then I'll collect samples of cells at OD = 0.1 and OD = 0.2 (OD is optical density, a measure of the turbidity of the culture and thus of cell density). I'll pellet the cells in the microcentrifuge and freeze the pellets, so I can do the RNA preps later (tomorrow?). Two 2 ml tubes of each OD = 0.1 culture and one of each OD = 0.2 culture should give me plenty of RNA.
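The timing behind this plan is simple exponential-growth arithmetic. Here is a back-of-envelope sketch in Python, assuming clean exponential growth with no lag (real cultures won't be this tidy); the function names are made up for illustration, and the only inputs are the two hours, four doublings and OD = 0.1 target mentioned above.

```python
def doubling_time_min(total_minutes, n_doublings):
    """Average doubling time implied by n doublings in a given time."""
    return total_minutes / n_doublings

def starting_od(target_od, n_doublings):
    """OD a culture must start at to reach target_od after n doublings,
    assuming ideal exponential growth with no lag phase."""
    return target_od / 2 ** n_doublings

td = doubling_time_min(120, 4)   # two hours, four doublings
od0 = starting_od(0.1, 4)        # where to start to hit OD = 0.1

print(f"doubling time: {td:.0f} min")
print(f"starting OD for the first sample: {od0:.5f}")
print(f"the OD = 0.2 sample comes one doubling (about {td:.0f} min) later")
```

In practice the starting OD would be set by diluting the culture as needed, which is what the "diluting with fresh medium" step is for.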

Fine plan, but one immediate hurdle. I meant to inoculate a starter culture of the cells last night, from a frozen stock, but forgot. So this morning I used some frozen cells prepared for just such an eventuality. But after more than two hours the OD of this culture wasn't increasing. First thought - maybe my culture medium was no good. H. influenzae grows in brain-heart infusion medium supplemented with hemin and NAD, but I used the old stocks of hemin and NAD I had in my fridge. Maybe they were too old. So I got fresh stocks, supplemented some more medium, and inoculated it with more of the same frozen cells. After an hour their OD hadn't increased either. Then I remembered the great freezer meltdown of 2005. I checked the date on the frozen cells - yes, January 2005. These stocks probably went through the meltdown, so most of the cells in those tubes would have been dead. Probably my old medium was fine, but the very small number of viable cells in the frozen stocks meant that most of the OD I was reading was turbidity due to dead cells.

Finally the ODs are starting to increase. But there's a seminar I really want to go to at 4pm (Peter and Rosemary Grant - very famous evolutionary biologists), so I think I'll have to just let these cells grow overnight and use them as the starter for proper cultures tomorrow.

A new paper on DprA in S. pneumoniae

I'm going to use this post to work through a new paper on the function of DprA (Mortier-Barriere et al. 2007. A key presynaptic role in transformation for a widespread bacterial protein: DprA conveys incoming ssDNA to RecA. Cell 130:824-836). See also my previous post on DprA and ComM. Here's the abstract of the new paper:
Natural transformation is a mechanism for genetic exchange in many bacterial genera. It proceeds through the uptake of exogenous DNA and subsequent homology-dependent integration into the genome. In Streptococcus pneumoniae, this integration requires the ubiquitous recombinase, RecA, and DprA, a protein of unknown function widely conserved in bacteria. To unravel the role of DprA, we have studied the properties of the purified S. pneumoniae protein and its Bacillus subtilis ortholog (Smf). We report that DprA and Smf bind cooperatively to single-stranded DNA (ssDNA) and that these proteins both self-interact and interact with RecA. We demonstrate that DprA-RecA-ssDNA filaments are produced and that these filaments catalyze the homology-dependent formation of joint molecules. Finally, we show that while the Escherichia coli ssDNA-binding protein SSB limits access of RecA to ssDNA, DprA lowers this barrier. We propose that DprA is a new member of the recombination-mediator protein family, dedicated to natural bacterial transformation.
The topic is important because DprA is a ubiquitous and highly conserved bacterial protein. In naturally competent bacteria it is induced when cells become competent to take up DNA, and without it the incoming DNA gets rapidly degraded. But most bacteria with DprA aren't known to ever take up DNA, so we suspect that protecting incoming DNA isn't its usual function (maybe just a side effect of its usual function). Because almost all bacteria have this protein, its function must be important.

Mortier-Barriere et al. mainly investigated the activities of purified DprA (from S. pneumoniae and sometimes also from B. subtilis) in various combinations with DNA and with two other proteins known to interact with DNA, SSB and RecA. SSB is also often competence-induced, but it's known to have a primary function in DNA replication and repair. RecA is competence-induced in some bacteria but not others (not in H. influenzae) and has a primary function in DNA replication and repair.

What the authors' experiments showed:

DprA binds single stranded DNA: They use band-shift assays to show that purified DprA binds single stranded DNA. The lanes with 5-10-fold ratios of DprA to the 90 nt test DNA fragment have about half of the fragment shifted (actually stuck in the wells). The authors never discuss how many nt of DNA a DprA molecule might be expected to interact with. It's a medium-sized protein (about 30 kD) so probably about 10 nt? So 10 DprAs might be enough to coat a 90 nt fragment along its length. Once a complex had formed it was difficult to disrupt; even a 1000-fold excess of cold fragment added to a previously formed DprA-DNA complex didn't displace all the original fragment from the DprA.

They also examined DprA binding to the circular ssDNA plasmid phiX174. These assays used 4 micromolar DprA and a DNA concentration described as 8 micromolar (nucleotides), so there was one molecule of DprA for every two nucleotides of the plasmid. As phiX174 is 5386 nt long, its molecular concentration was 0.0015 micromolar, giving about 2700 DprAs per plasmid. But considering the ratio of protein to nucleotides probably makes more sense.
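For anyone who wants to check the stoichiometry, the arithmetic can be sketched in a few lines of Python. The concentrations and the phiX174 length are the ones quoted above; the function and variable names are just for illustration.

```python
# Values quoted in the text (micromolar, and nucleotides).
dpra_uM = 4.0             # DprA protein
nucleotides_uM = 8.0      # phiX174 DNA, expressed per nucleotide
plasmid_length_nt = 5386  # length of phiX174

def stoichiometry(protein_uM, nucleotide_uM, length_nt):
    """Return (nucleotides per protein, plasmid concentration in uM,
    proteins per plasmid molecule)."""
    nt_per_protein = nucleotide_uM / protein_uM
    plasmid_uM = nucleotide_uM / length_nt
    proteins_per_plasmid = protein_uM / plasmid_uM
    return nt_per_protein, plasmid_uM, proteins_per_plasmid

nt_per_protein, plasmid_uM, per_plasmid = stoichiometry(
    dpra_uM, nucleotides_uM, plasmid_length_nt)

print(f"{nt_per_protein:.0f} nt per DprA")
print(f"{plasmid_uM:.4f} uM plasmid molecules")
print(f"about {per_plasmid:.0f} DprA per plasmid")
```

The unrounded figure is about 2690 DprAs per plasmid, which the text rounds to 2700.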

DprA (and Smf, its B. subtilis homolog) would not bind linear or relaxed-circle double stranded DNAs, but they did bind supercoiled circular DNAs. The authors conclude that this binding is to regions that are locally single stranded (because of the supercoiling?), but the evidence for this isn't very strong (what are "sleeve-like complexes"?). However the evidence is good that, given a DNA molecule that is partly ssDNA and partly dsDNA, DprA binds only to the ssDNA part.

DprA molecules stick to each other: DprA molecules interact with each other in a yeast 2-hybrid assay. They also appear to bind DNA cooperatively - at a ratio of one DprA to 20 nucleotides of phiX174, some DNA molecules formed complexes with DprA while others remained bare. In electron micrographs, distinct DNA molecules were often seen to be attached to each other by globs of DprA. The DprA-coated DNA is lumpy when viewed by TEM, unlike the thick smooth filaments formed by RecA. The authors describe the lumps as "secondary structures/intramolecular bridges".

Abundant DprA can protect DNA from nucleases: DNA that was pre-incubated with a 1000-fold excess of DprA could no longer be digested by the nucleases ExoT and RecJ (opposite polarities) or by the mung bean endonuclease. But this is a LOT of DprA. Probably because DprA molecules stick to each other as well as to DNA, they form what the authors call 'tightly packed discrete complexes' (I'd describe them as 'globs'); these spread out a bit when the salt concentration is increased.

DprA helps complementary single strands anneal: (Fig. 4C, S3CDE) The authors mixed DprA with labeled ssDNA and then added unlabeled complementary DNA strand. The DprA-bound strand left the DprA and instead base-paired with its complement, faster than it would have base paired in the absence of DprA. The ratio of DprA to the 80 nt DNA fragment was 250, which seems very high. The SSB control used a ratio of 100. This experiment tells us that ssDNA would rather associate with SSB than its complement, and rather associate with its complement than DprA, but it also tells us that associating with DprA helps ssDNA find its complement. This may be just because DprA molecules bound to DNA like to stick to each other, or it might be due to something more sophisticated.

DprA interacts with RecA: When cells expressing His-tagged S. pneumoniae RecA were passed over a metal-chelate column, both RecA and DprA were bound. A yeast 2-hybrid screen for proteins interacting with DprA also identified RecA. I don't know whether this interaction depends on both proteins binding to DNA; the cell lysate was not treated with DNase before being passed over the column.

E. coli RecA has an ATPase activity that reflects its ability to unwind dsDNA when annealing a complementary strand. Addition of a low concentration of DprA to RecA bound to a ssDNA filament slightly increased RecA's ATPase activity (by about 20%), but higher concentrations reduced it (to about 50% when there was twice as much DprA as RecA). E. coli SSB consistently reduced it.

DprA also made it easier for RecA to form filaments on ssDNA. At least some of the DprA remained on the DNA with the RecA. If the DNA was initially coated with SSB, the presence of DprA helped E. coli RecA to form a filament on the DNA. In this case the filaments didn't contain any DprA.

RecA that has formed filaments on ssDNA can help the DNA invade a double-stranded form of the same sequence. But if the ssDNA is first coated with SSB, RecA can't form a filament on it or promote strand invasion. Presence of DprA helps RecA get around the SSB barrier, and allows RecA to promote strand invasion even if the DNA has been previously coated with SSB.

Quibbles and complaints:

The authors cite their 2003 paper as evidence for the competence decrease in a ssb mutant, but this paper only briefly mentions a result and says the data will be in a manuscript in preparation. This means I can't check on how sick the ssb mutant cells are. The introduction also doesn't mention that both S. pneumoniae and B. subtilis have two ssb genes, and that only the one not induced in competence is orthologous to the well-studied E. coli ssb. H. influenzae only has one ssb homolog.

The authors never tell us where the SSB they used came from. I think it is likely to have been E. coli SSB, which can be purchased, rather than either SSB-A (its S. pneumoniae homolog) or SSB-B (a competence-induced S. pneumoniae paralog).

What the authors concluded:

The authors firmly believe that recombination is the reason for DNA uptake (that transformation is the function of competence). They also omit any mention of the consensus that DNA replication and repair are the true function (the evolutionary cause) of RecA and other proteins that contribute to recombination, leaving the naive reader with the impression that these proteins too must exist to create new genetic combinations. Thus I'm not surprised by their conclusion that "DprA is the prototype for a new recombination-mediator protein dedicated to bacterial transformation".

They go on to consider why DprA is so ubiquitous, present and highly conserved in almost all bacteria except those that live as intracellular parasites. Most of the bacteria with DprA homologs are not known to ever be naturally competent. I and others have interpreted this distribution as meaning that DprA has a primary function that is independent of DNA uptake (see previous post). But the authors go the other way. In their last paragraph they use the "it is tempting to speculate that..." qualifier to put forward the idea that all bacteria with DprA homologs are naturally competent.

What I conclude:

In competent cells DprA is likely to slow degradation of incoming DNA by binding to it, rather than by directly inhibiting a nuclease. This is consistent with our unpublished evidence that neither DprA nor ComM acts by inhibiting the RecBCD nuclease.

Does anything about these experiments support or contradict the idea that DprA has an important function in non-competent cells? I don't think so. Induction of DprA in competent cells is unlikely to interfere with RecA's recombinational repair activity, but these new experiments don't provide strong evidence that DprA will enhance it.

One critical future goal should be to find out what goes wrong in dprA-knockout cells when they are not competent. As I described in an earlier post, an attempt to do this in E. coli found that the knockout had no detectable effect under a wide range of conditions. The authors of this paper summarize evidence that this is also the case for S. pneumoniae and B. subtilis. Perhaps our 'alternative' perspective on competence will enable us to think of tests the others have overlooked.

As the dprA gene is strongly induced in competent H. influenzae, maybe we should test whether cells with a dprA knockout survive competence (with or without DNA uptake) as well as do dprA+ cells. I may even have some old data that addresses this.

and a book chapter

For weeks now the post-docs have been moaning about their writing work on a book chapter they convinced me we should write. But the work has paid off beautifully.

Today we'll send our chapter in to the editor. It's an excellent overview of everything that's known about competence and transformation in the Pasteurellaceae - something that's never been reviewed before. While the iron is hot we should probably use it as the foundation for one or more shorter reviews to be submitted to journals.


The sxy manuscript has been quickly reviewed and provisionally accepted by the fine journal we sent it to. The reviewers' suggestions are simple to deal with and don't require new experiments, just a bit of new analysis and some minor improvements to the text and figures. One reviewer did suggest an RNA bandshift experiment, to test whether an unknown protein binds to the sxy mRNA secondary structure, and the editor liked this idea. But they both overlooked our compensatory mutation analysis, which unambiguously shows that two of the sxy mutations increase expression because they disrupt base pairing, not because they change the recognition sequence for a protein. We'll point this out in our response and beef up the relevant part of the Discussion so readers don't make the same mistake.

The USS manuscript is hung up (the editor tells us) because the two reviewers disagreed on whether our revisions were satisfactory and the editor doesn't know what to do about it. Although this is an on-line journal promising a fast publication schedule, the review process has been very slow - 9 weeks for the less-favourable review in the first round and 7 weeks for one of the reviews of the revised manuscript (the same reviewer?). We had similar problems with a previous submission to another journal in this family (it's BioMed Central), so I think we'll avoid them in future.

@#!^%$ microarray data!

You may recall (if you have had nothing better to do than read previous posts) that a few posts ago I raised a concern about how microarray data from independent slides should be combined (as ratio of means or as mean of ratios). The issue arose because the undergraduate student who hybridized the arrays and did the original analysis didn't report over-expression of the genes we have found over-expressed in our reanalysis of the data. I wanted to go back and compare her calculations to ours to see where the discrepancy arose.

So today I hoped to do that. I only had a rough draft of the student's undergraduate thesis. This didn't explain how she did her calculations, so I asked for the final version from her supervisor. He provided a copy, but it turns out to be identical to the version I have, although with his comments and suggestions rather than mine. So we have no information at all about how she did her calculations. (More annoyingly, I now suspect that she completely ignored all my carefully thought out suggestions for improving her thesis.)

But I also got a CD containing her data files from the other lab's computer. So I spent much of today going through her lists of genes that were up-regulated or down-regulated at least 2-fold, and comparing them to our lists from our reanalysis of the array data. The lists disagree completely. For the antibiotic we're not very interested in (ery), the genes scored as 'down' in this file are reported as 'up' in her thesis. (Well, the thesis only considered the subset of genes she thought interesting - 'virulence genes'.) These same genes are 'up' in our analysis. So maybe she just switched 'up' and 'down' in the file, and discovered and corrected this error while writing her thesis.

But it's worse for the antibiotic we are very interested in (rif). For this, genes listed as 'up' in her file are also 'up' in her thesis. But these are genes that our analysis says are 'down'. And vice versa - the genes we find to be 'up', she lists as 'down'.

So now there are two discrepancies, and thus different places where an error could be. First, maybe she switched up and down in both the ery and rif files, and corrected only the ery error in her thesis. Second, maybe the dye assignments we used for both our analyses (ery and rif) were reversed but the up and down assignments in her lists are correct. In this case she must have mistakenly switched the up and down assignments for the ery analysis in her thesis.

I'm still digging into this accursed data set because we found a surprising pattern of gene induction in the rif-treated cells. But if we've switched the dye assignments then these genes are down, not up. Is this less surprising? I have no idea.
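Why a dye-assignment mix-up turns 'up' into 'down' is easy to show with a toy example. The sketch below is not anyone's actual analysis pipeline: the signals are invented, and the function name and the symmetric two-fold cutoff are assumptions made for illustration.

```python
def expression_call(test_signal, control_signal, cutoff=2.0):
    """Classify a gene as 'up', 'down' or 'unchanged' from its
    test/control ratio, using a symmetric fold-change cutoff."""
    ratio = test_signal / control_signal
    if ratio >= cutoff:
        return "up"
    if ratio <= 1.0 / cutoff:
        return "down"
    return "unchanged"

# Hypothetical signals for one gene in the two dye channels of one slide.
channel_a, channel_b = 900.0, 300.0

# If channel A really is the rif-treated ('test') sample, the gene looks induced...
call_as_assigned = expression_call(channel_a, channel_b)
# ...but if the dye assignments were reversed, the same numbers say repressed.
call_if_swapped = expression_call(channel_b, channel_a)

print(call_as_assigned, call_if_swapped)
```

Swapping the channels inverts every ratio, so every gene beyond the cutoff flips direction at once, which is exactly the pattern of complete disagreement between the two sets of lists.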

I'd like to throw this whole project out the window. But I've promised to grow some cells and do some RNA preps so the apparent gene induction effect can be tested by my collaborator's technician and student, using real-time PCR on cDNAs generated from independent RNA samples.

How long will this take? - not too long I think. I'll need to start a cell culture the night before. The next morning dilute the cells into medium with and without rif. Let the cells grow for at least 5 doublings (probably about 3 hours) and collect cells in microfuge tubes, and collect more after one more doubling. I won't need large amounts of culture because the real-time PCR analysis needs very little RNA, so one 2.0 ml tube of each culture at each time should be enough. Well, maybe two tubes of the first samples, because I'll need enough RNA to see in a gel. I'll do RNA preps of these cells, using the Qiagen RNeasy kit; we have the kit and the preps take less than an hour as I recall. I said I'd grow cells and do preps twice, on different days, to get independent replicate RNAs. I'll need to run samples of each prep in a gel to check that the ribosomal RNAs are largely intact. Once I know the RNA concentration, treat 5 micrograms with 'DNA-free' to get rid of chromosomal DNA that would confound the PCR analysis.

I expect that the new RNA analysis will not confirm the surprising result we see in our present analysis, partly because the result is unexpected and thus more likely to be due to an error than to a previously unknown biological process, and partly because we now know that the data is indeed full of errors. But at least this will get me back at the bench, if only for a couple of days.

Metabolic reconstruction of B. subtilis

The last post wandered away from its original topic (metabolic reconstruction of H. influenzae) and into B. subtilis nucleotide catabolism. I then realized that a similar reconstruction of B. subtilis metabolism was likely to be available, and so did a Google Scholar search. Not only did I find a paper describing this, but it's a very recent paper and so should reflect the latest advances in metabolic reconstruction and all the latest experimental data on B. subtilis metabolism (Oh et al. 2007. Genome-scale reconstruction of metabolic network in Bacillus subtilis based on high-throughput phenotyping and gene essentiality data. J. Biol. Chem. June 17 epub).

The metabolic reconstruction looks to be much too complex to understand in its entirety. They didn't test DNA as a nutrient, but did test nucleotides and nucleosides and deoxyribose (all supported growth as carbon sources both in silico and in vivo, and the nucleotides and nucleosides also served as nitrogen sources). I'm going to email the author asking for his perspective on whether B. subtilis should be able to use DNA as a carbon and nitrogen source.

Metabolic reconstruction of H. influenzae

A commenter on Wednesday's post suggested that I consider using the available 'metabolic reconstruction' of H. influenzae to assess whether the nucleotides from DNA uptake would make a significant contribution to growth. "Metabolic reconstruction?", I said to myself. Sounds like something I ought to know about.

So I found the papers. The first one appeared way back in 1999, but I've been reading a slightly more recent one (Schilling and Palsson, 2000. Assessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome-scale pathway analysis. J. Theor. Biol. 203:249-283). Metabolic reconstruction uses computer simulations of the catalytic activities of the proteins encoded in a genome to infer as much as possible about the organism's metabolism.

H. influenzae was done first only because its genome sequence was the first to be available. I suspect that more recent analyses of better-known organisms are more sophisticated, but I don't expect anyone to go back and redo the H. influenzae analysis (though maybe that's too pessimistic).

What Schilling and Palsson did: They included only proteins that are metabolic enzymes or membrane transporters - this is less than 25% of the proteins encoded by the genome (~400 genes, 461 reactions). This includes some reactions for which they had biochemical evidence but not identified genes. Their model metabolism did not include any effects of regulation of gene expression or (I think) of regulation of protein activity. No transcription or translation or DNA replication. The model included 51 metabolites required for biomass generation and maintenance (amino acids for protein synthesis, NTPs and dNTPs for RNA and DNA synthesis, phospholipids for membranes, etc).

They simplified the model metabolic structure by subdividing it into six subsystems: amino acids (A), nucleotides (N), vitamins and cofactors (V), lipids (L), central metabolism (C) and transport/energy/redox (T). For each of these they then identified the 'extreme' pathways that (I think) set the metabolic limits. Finally, they used their model to investigate H. influenzae's metabolic capabilities.

Their conclusions are interesting, though some of them are clearly wrong (i.e. their model predicts X, but we know X is not true for real H. influenzae). But the reason I'm reading this paper is to find out whether this model can be used to make predictions about the effect of DNA uptake on metabolism and on growth. I was thinking that the answer is No, mainly because of the simplifying assumptions the model had to make.

The fundamental problem is that I would like a quantitative answer, such as "Taking up 200 kb of DNA per generation would increase the growth yield by 0.5%". But I think that the model will only give qualitative answers about capabilities, such as "Yes, taking up DNA could increase the growth rate". I suppose that kind of answer might be useful, enabling us to make the important distinction between it and "No, taking up DNA can't increase the growth rate, because reaction A is missing".

How would getting one or the other answer change our thinking? If we got a "No, because..." answer, the first step would be to track down the reason(s) and find out whether we'd expect them to hold true in real cells. Is the needed gene really missing? Do we have other evidence that this reason will not apply? (For example, the model predicts that H. influenzae would not be able to use citrulline as its pyrimidine source because a gene is missing, but real cells can do this and we have identified the supposedly-missing gene.) If the reasons for the "No" answer did hold up under scrutiny, we'd need to seriously rethink our hypothesis that cells take up DNA as a nutrient source.

What if we got a "Yes" answer? I don't think we'd do anything differently than we're doing already, but we might be slightly more confident that our hypothesis is reasonable.

I think a genuine "No" answer is unlikely in principle, because, to use the DNA they have taken up for new DNA synthesis, all cells need to do is rephosphorylate the dNMPs that result from nuclease degradation of this DNA. Does H. influenzae have the pathway to do this? It does have enzymes that can catalyze these reactions, but I don't know enough biochemistry to be absolutely certain that these reactions will proceed efficiently.
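The difference between a "Yes" and a "No, because reaction A is missing" answer can be illustrated with a toy reachability check. To be clear, this is not the extreme-pathway analysis Schilling and Palsson used, and it ignores stoichiometry, energetics and regulation entirely. The three reactions and their names are simplified placeholders for the degrade-then-rephosphorylate route described above, not entries from their model.

```python
def producible(seed_metabolites, reactions):
    """Crude network expansion: repeatedly 'fire' any reaction whose
    substrates are all available, until nothing new can be made."""
    available = set(seed_metabolites)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available

# Hypothetical, heavily simplified reaction set: name -> (substrates, products).
reactions = {
    "nuclease":    ({"DNA"}, {"dNMP"}),
    "dNMP_kinase": ({"dNMP", "ATP"}, {"dNDP"}),
    "dNDP_kinase": ({"dNDP", "ATP"}, {"dNTP"}),
}
seeds = {"DNA", "ATP"}

# With every step present, the answer is "Yes".
print("dNTP" in producible(seeds, reactions))

# Knock out one step and the answer becomes "No, because ... is missing".
without_kinase = {k: v for k, v in reactions.items() if k != "dNMP_kinase"}
print("dNTP" in producible(seeds, without_kinase))
```

A check like this only says whether a route exists, which is why it can give the qualitative capability answers discussed above but not a number like "0.5% more growth yield".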

Could cells use the dNMPs as sources of NTPs for RNA synthesis? This is more complex, as the deoxyribose sugar needs to be removed from each base and replaced with a ribose sugar (cells have no enzyme that could simply create ribose from deoxyribose by adding back the missing oxygen). What cells actually do is remove the deoxyribose and phosphate from the dNMP, and then add ribose plus phosphate to create an NMP which can then be phosphorylated into an NTP. These reactions are included in the metabolic reconstruction; they consume an ATP, but much more will be saved by not having to synthesize the NMP from scratch. The cells also get some energy by metabolizing the deoxyribose-1-phosphate this releases, although this isn't included in the reconstruction. I just checked the pathway: the deoC gene product converts deoxyribose-1-phosphate into glyceraldehyde-3-phosphate plus acetaldehyde - the former feeds into glycolysis but I don't know what happens to the acetaldehyde (maybe ADH converts it to alcohol?).

What about using the dNMPs as sources of carbon (= energy), nitrogen and/or phosphate? Presumably this would only happen when the cell had satisfied its need for dNTPs and NTPs for DNA and RNA synthesis. Our visitor told us that at least some bacteria lack the enzymes needed to break down purine and/or pyrimidine bases; such a cell couldn't use nucleotides as nitrogen sources, and using the sugar for energy might leave cells clogged up with the leftover bases.

This probably isn't a big issue for H. influenzae, as I expect it to only use dNMPs for DNA and maybe RNA synthesis. But if I'm hoping to show that adding DNA to a B. subtilis culture improves growth yield, knowing whether or not it can use the bases could make a big difference to my expectations. I've found a paper (Schultz et al. J. Bact. 183:3293) showing that B. subtilis can metabolize purine bases to NH3 and CO2 (with no energy gained but none expended). And an earlier paper from the same group (Per Nygaard's) gives an overview of nucleoside catabolism by B. subtilis; the basic conclusion is yes they can.

So it looks like feeding B. subtilis DNA is a reasonable idea.

Yet another issue in interpreting microarray data

This afternoon I met with my collaborators on the sub-inhibitory antibiotic effects project. We considered where the project could lead (or end): Because the results of the (low-resolution) microarrays suggest some genes are turned up about two-fold by the antibiotic treatment, we plan to do real-time PCR on a few of these genes, looking for the same effect; call this analysis A. (I'm giving fewer details than usual to protect the delicate sensibilities of my collaborator.)

If the results of A show at least a two-fold increase, we do experiment B. If the results of B also show at least a two-fold increase, we could write a small paper. If the results of B don't show at least a two-fold increase, we could repeat the microarray analysis (doing it better this time) and maybe examine protein levels. Maybe, if A gives the two-fold increase, we also consider writing a grant to get money to do this properly. If A doesn't show at least a two-fold increase, we could either end the project, or repeat the microarray analysis anyway in the hope of discovering something the first analysis missed.

But at the end of the meeting the technician raised an issue we hadn't thought about. The student who did the original microarray analysis didn't see the two-fold increase we see in our reanalysis of her data - I've been assuming that this is because she did something wrong in her analysis of the data. However the technician described what the student had done differently than us, and I think it may not be wrong at all - it might be what we should have done.

Here's the issue: In an earlier post I raised the concern of how much we should trust individual data points in a microarray analysis, and pointed out that strongly expressed genes are likely to give more trustworthy signals than weakly expressed genes. But the issue of signal strength and trust may also apply to whole microarray slides, not just individual genes on them. Some RNA preps give better probes than others, due to differences in RNA quality and/or in the various processing steps involved in creating dye-tagged cDNA from the RNA. And some hybridizations give better signals than others, due to variations in the quality of the slide and in the hybridization and washing conditions. An inexperienced researcher will also generate more variation between slides. The net result is that some slide hybridizations (some 'samples', in GeneSpring's terminology) may have substantially stronger signals than others, averaged across all the genes on the slide.

For each gene on each slide, an expression 'ratio' is calculated by dividing the fluorescence signals from the 'test' RNA by the fluorescence signals from the 'control' RNA. Should we put equal trust in slides that gave generally weak signals and slides that gave generally strong signals? Or should we put more confidence in the slides with generally strong signals, because the ratios calculated from those with weak signals will have suffered more from random effects?

What we have done is first calculate the ratio for each slide, and then calculate a mean ratio over the three slides whose signals looked most consistent. But what the student did was to first sum all the 'test' signals for the three (?) slides and calculate their mean, similarly sum all the 'control' signals and calculate their mean, and then use the ratio of these means to decide about significance. Provided there is little slide-to-slide variation in overall signal intensity or in the ratio of 'test' to 'control' signals for the gene in question, both methods (mean of ratios or ratio of means) will give similar answers. But if there is substantial variation the answers will differ.
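A small numerical sketch makes the difference between the two methods concrete. The signals below are invented for a single gene on three slides, with the third slide deliberately much 'stronger' than the other two; nothing here comes from the real data set, and the function names are just labels for the two calculations described above.

```python
from statistics import mean

# Hypothetical (test, control) fluorescence signals for ONE gene on three slides.
# Slide 3 has much stronger signals overall but a smaller test/control ratio.
slides = [(300.0, 100.0), (250.0, 100.0), (4000.0, 3200.0)]

def mean_of_ratios(pairs):
    """Our method: one ratio per slide, then average the ratios."""
    return mean(t / c for t, c in pairs)

def ratio_of_means(pairs):
    """The student's method: average the raw signals, then take one ratio."""
    return mean(t for t, _ in pairs) / mean(c for _, c in pairs)

mor = mean_of_ratios(slides)
rom = ratio_of_means(slides)

print(f"mean of ratios: {mor:.2f}")   # each slide counts equally
print(f"ratio of means: {rom:.2f}")   # the strong slide dominates
```

With these made-up numbers the mean of ratios is 2.25 (above a two-fold cutoff) while the ratio of means is about 1.34 (below it), because the ratio of means effectively weights each slide by its signal strength.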

Because the effect for most genes was not much higher than the two-fold cutoff the student used, the differences in analysis methods could have determined whether these genes were just-above or just-below the cutoff. For genes that were below the cutoff in weak signal experiments, but above it in high-signal experiments, she would have seen significance where we didn't. On the other hand, for genes that were above the cutoff in weak experiments but below it in strong ones, we would see significance where she didn't.

Should we worry about this? I've just looked back at the ratios for the different slides - no one slide has conspicuously more significant ratios than the others, as we would expect if one strong slide was biasing the averages. But if it isn't a lot more trouble, it might be wise to go back and redo the calculations the way the student did them, so we can be certain we understand why her conclusions differed from ours.

Should I do this experiment?

A visiting colleague participated in Monday's lab meeting, and his ideas got me thinking seriously about doing an experiment that I've long claimed wouldn't really prove anything.

The question is whether DNA uptake can provide bacteria with enough nutrients to make a difference to their growth rates. I've always argued that getting some nutrients from the DNA is inevitable, so demonstrating that this increases growth in a lab culture would not be good evidence that selection for these nutrients is why cells take up DNA. But maybe I was wrong.

In the past we have tried to show that H. influenzae can use nucleotides from DNA for growth, but the experiments have had lots of problems, I think in part because H. influenzae is quite fastidious in its nutrient requirements, and in part because we didn't commit our full time and brain-power to it.

Our colleague Steve Finkel has shown that E. coli can use nutrients probably acquired by DNA uptake for growth, but the effects are modest. This is probably because the cells are only taking up very small amounts of DNA because their competence genes are not induced under normal culture conditions.

Rather than now trying again with H. influenzae, I'm considering trying to demonstrate nutritional benefits of DNA uptake using Bacillus subtilis. B. subtilis is a soil bacterium, very easy to grow and not at all fastidious. I've worked with it before and one of the post-docs did her PhD work on it. B. subtilis takes up lots of DNA under lab conditions, though not during exponential growth. Competence is induced under conditions referred to as 'post-exponential growth', meaning after cell growth has stalled because nutrients are depleted.

Keeping cells competent over a long period might be complicated, because B. subtilis competence and sporulation are induced simultaneously (though apparently in different sub-populations of the culture). I'd first have to read up on the latest work on the regulation of DNA uptake, and try to find the best conditions for observing a growth effect. But if I can show that B. subtilis cultures grow better when they can take up DNA, I think many other biologists would find the result much more compelling than I do. This would be good, because they are not at all convinced by the results that I think are compelling.

Dizzy data

I spent much of yesterday wrestling with a USS result generated by our bioinformatics colleague. It was counterintuitive, in that it seemed to disagree with both our prediction from basic principles and the results of a related analysis. She had a plausible hypothesis that would explain this result, and my wrestling was with how to present the result in our paper, and how to test her hypothesis.

We're quite confident in our prediction that USSs should be more common in less important parts of the genome. This was originally proposed by Ham Smith's group many years ago, and the various components of the paper we're preparing support this. In particular, the colleague has sent me data showing that proteins with no USSs in their genes have sequences that are more strongly conserved across different bacterial families than do proteins with 1, 2 or ≥3 USSs in their genes.

I'd better give 'conserved' an explicit definition here. When we find that the sequences of two proteins are similar, we need to decide whether this is just a coincidence. If the similarity is too strong to be coincidence we conclude that the genes coding for these proteins must have had a common ancestor, and we say that the proteins are homologous (similar due to shared ancestry). (I'm taking the liberty of ignoring convergence as an explanation.) Homologous sequences that have remained very similar despite very long periods of independent evolution since their common ancestor lived are said to be highly conserved; this conservation is usually due to natural selection for maintenance of an important function. Sequences that have become very different (but still too similar to be a coincidence) are said to be only weakly conserved, and these sequences usually have less important functions.

Conservation scores in our analysis were measured as the % identical amino acids in BLAST alignments. This test only used those proteins that had good homologs in each of the three other bacterial genomes used for the test, with 'good' being defined as having a BLAST "E-value" of less than 10e-9. The surprising result comes from examination of proteins that didn't meet this criterion.

A lot of proteins did not have good homologs in any of the three test genomes. I expected this to be because the sequences were only very poorly conserved, and so I expected that these proteins would be even more likely to have USSs than the worst of those meeting the E<10e-9 criterion. But instead, proteins with no USSs were more likely to have no homologs than proteins with 1, 2, 3 or more USSs. It took me much of yesterday to get to the point where I could write the previous sentence. It seems clear now, but the nature of the data kept making my head spin. It wasn't just that I was particularly thick - it made the post-doc dizzy too.

The colleague's hypothesis is that many of the genes in the no-homologs class lack good homologs not because the homologous genes are so poorly conserved that they've diverged past the E<10e-9 cutoff. Rather she suggests it's because these genes were acquired by lateral gene transfer from distantly related bacteria, and thus have no homologs at all in the test genomes. I'm still wrestling with the best way to test this ('best' determined by the optimal combination of good data and not too much work).


A researcher from another lab stopped by my office this afternoon. I was telling him about our Sxy-CRP work - how we think that Sxy modifies how CRP binds and bends DNA at CRP-S sites. I'd forgotten that he's an expert in purification and analysis of DNA-binding proteins (they do Drosophila chromatin stuff), and when I remembered I dragged him down to the post-docs' office and introduced him to them. He had lots of good ideas.

Then he brought the conversation around to the reason he'd come to my office. He wasn't just socializing, but was hoping that we might know something about the E. coli technique called 'recombineering'. Well, yes indeed, one of the post-docs learned about it last year, got all the strains and papers, and discovered that nothing is ever as easy as it seems.

So now we're doubly blessed. First, this colleague will give us lots of help with the Sxy-CRP work, and second, he's taking on the task of getting recombineering working, and will even be grateful to us for giving him the tools to do so!

Done analyzing USSs and DUSs?

I think I've done enough analysis of the USS and DUS motifs, at least for now.

First, what was I hoping to find out? (What were the questions this might answer?) I had already done a thorough Gibbs-based analysis of the USSs in the H. influenzae genome. But the other genomes had only been searched for 'perfect' and 'singly-mismatched' USS cores, and only for perfect DUSs. The patterns found were enough to confirm that the genomes have the same general type of USS or DUS as their relatives, but not enough to detect more subtle differences. So I wanted to know how different related genomes' motifs are. I also wanted to get a better idea of the distribution of variation within each genome. Do different genomes have the same proportions of strongly-matched and weakly-matched sites, or do some genomes have a lot more poorly-matched sites than others?

I've now used the Gibbs motif sampler to search both strands of all the genome sequences known to have uptake signal sequences (USS or DUS). I used a 'fragmentation mask' for each search - one complex mask for all of the USS searches and a simple 12-position mask for the DUS searches. From the combined forward- and reverse-orientation sites each pair of searches found I combined enough of the top-scoring sites to give me 1.5 times the known number of 'perfect' 9bp USS cores, or of the original 10bp DUSs. These were used to make the logos.

The first conclusion is illustrated by the two Neisseria logos at the left. The Neisseria motifs are very similar not only in the sequence but in the relative strengths of the different positions. I found only minor differences in DUS site density, and they all had about the same balance of strong and weak matches. Several explanations are possible. First, these species may be very closely related, so their uptake machinery and genomic DUSs haven't had any time to diverge. Second, divergence may be selected against because all mutations that change the uptake machinery's specificity reduce its efficiency. Third, maybe selection favours the ability to take up DNA from different members of the genus. All three 'species' live in the human nasopharynx (though N. gonorrhoeae is more often found elsewhere).

The next three logos are for different species in the Hin clade of the Pasteurellaceae. They show the range of minor variation I found in this group. The overall sequence consensus is very similar, but the relative strengths of different positions are variable, and some of the weaker positions have different bases preferred. I'm fairly confident that these differences reflect significant differences in the accumulated USSs, because each is based on more than 2000 sites. I think these differences probably do reflect differences in the biases of the respective uptake machineries, but demonstrating such subtle differences experimentally would be a lot of work, and probably not worth the trouble.

The final logo is for a genome from the Apl clade of the Pasteurellaceae. As expected from the motif previously derived from perfect+one-off USSs, it has a different core sequence (ACAAGCGGT rather than AAGTGCGGT) and a longer right-hand flanking motif. The surprising thing to me is the weak consensus of the first two motif positions, much weaker than most of the non-core positions. This probably means that it was a mistake to treat the 9bp sequence as the core for this clade. Really the core is only 7bp. This means that my use of 1.5 times the number of 9bp cores was probably too stringent a criterion for the Apl clade USSs. I did notice that this cutoff removed many high-scoring sites from the analysis. I should probably go back and do the cutoff analysis again, after scoring the number of perfect 7bp cores in these genomes. Maybe I'll do that tonight.
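Re-scoring the number of perfect cores is simple enough to sketch in Python. This is only an illustration, not the script I'll actually use. It assumes the 7bp core is the 9bp Apl core with its first two positions dropped (AAGCGGT), which is my reading of the logo rather than an established result, and the function names and the toy sequence are made up.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_core(genome, core):
    """Count perfect matches to `core` on both strands of `genome`.

    Overlapping matches are counted. Using a set means a palindromic
    core would not be counted twice at the same position.
    """
    genome = genome.upper()
    total = 0
    for pattern in {core, reverse_complement(core)}:
        start = genome.find(pattern)
        while start != -1:
            total += 1
            start = genome.find(pattern, start + 1)
    return total

APL_CORE_9 = "ACAAGCGGT"       # core sequence given in the text
APL_CORE_7 = APL_CORE_9[2:]    # assumed 7bp core: AAGCGGT

# Toy sequence (hypothetical): one 9bp core, one 7bp-only core,
# and one 9bp core on the other strand.
toy_genome = ("TT" + APL_CORE_9 + "TTGG" + APL_CORE_7 + "CC"
              + reverse_complement(APL_CORE_9) + "AA")

print("perfect 9bp cores:", count_core(toy_genome, APL_CORE_9))
print("perfect 7bp cores:", count_core(toy_genome, APL_CORE_7))
```

Every perfect 9bp core also contains a perfect 7bp core, so the 7bp count can only be equal or higher, and 1.5 times that count would be a more generous cutoff.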

Three steps forward, two steps back (Gibbs, of course)

My tests to find out why the Neisseria Gibbs searches found so many (poorly matched) motif sites indicate that searches using a short expected motif (e.g. the Neisseria 12bp motif) give a lot more sites than searches using a long expected motif (e.g. the Haemophilus 22bp one). This means that my original strategy of specifying an expected number of 1.5 X the observed number of 'perfect' cores has given misleading results.

At first I feared that I would have to do all the searches over, and analyze their new output in some much-more-laborious way. But I think I've come up with a simple new method that's also good science.

Each site the Gibbs search finds is assigned a score, reflecting how well it matches the motif pattern the search has found. Until now I've just used all the sites, regardless of their scores. But now I'm going to take the output from each run and sort the sites by their scores. Then I'll keep enough of the high-scoring sites to give me 1.5 X the number of perfect cores, and discard all the lower scoring sites. Most of the runs I've done gave at least this many sites, so I shouldn't need to redo many searches.

This strategy will produce a much more comparable set of analyses, so I'll be able to fairly compare the results for different genomes.

Before I do the sorting, I'll combine the results of a forward and a reverse search, so the results will reflect both strands of the genome. But I'll need to first make sure I'm using a pair of searches that both settled on the same 'forward' orientation and exactly the same position of the motif. This is mainly an issue for the Pasteurellaceae searches, where some searches have the core starting at position 1 of the motif and some have it starting at position 2 or 3. If I can't find suitable pairs I'll need to do more searches.
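The combine, sort and trim steps described above can be sketched in a few lines of Python. This assumes each site has already been parsed into a (position, score) pair; that tuple format, the function name and the example numbers are all hypothetical, and this is not how the Gibbs sampler itself reports its output.

```python
def keep_top_sites(forward_sites, reverse_sites, n_perfect_cores, factor=1.5):
    """Pool sites from a forward and a reverse search, sort them by score,
    and keep the top `factor` x `n_perfect_cores` of them.

    Each site is a (position, score) tuple. The number kept is truncated
    to a whole number of sites.
    """
    n_keep = int(factor * n_perfect_cores)
    pooled = list(forward_sites) + list(reverse_sites)
    ranked = sorted(pooled, key=lambda site: site[1], reverse=True)
    return ranked[:n_keep]

# Made-up example: 2 perfect cores, so keep the 3 best-scoring sites.
forward = [(10, 0.9), (50, 0.2), (90, 0.7)]
reverse = [(20, 0.8), (70, 0.1)]
kept = keep_top_sites(forward, reverse, n_perfect_cores=2)
print(kept)
```

If a pair of searches returns fewer sites than the cutoff asks for, this simply returns all of them, which is the signal that the search needs to be redone with a larger expected number.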

Different logos for different species?

I've finished analyzing the Gibbs motif results and converting them to logos. Now I have to decide what they mean.

For now I'll just consider the analyses of the Neisseria genomes. Almost all the searches were run the same way; I'll leave the exceptions to later.

1. Searches gave a lot more sites than I had expected, and the consensus was weaker than I had expected. These two effects are related, as the additional sites the search found were quite weak matches, and these are responsible for the weak consensus. You can see the two effects in the figure at the left, which shows two searches on the same genome (the top one was started with a prior file suggesting base frequencies, so they're not perfectly comparable).

So the scientific issue is: do the Gibbs searches have built-in biases or other effects that cause them to pick up more poor matches in these short motifs than they do with the longer Pasteurellacean motifs? Or are they finding more sites because the motif distribution is indeed different? For example, maybe the specificity of the uptake machinery is broader or maybe biased uptake has been acting for a shorter time. One test is to run some searches that 'expect' fewer sites, and see if they too find this many. Those are running (or at least queued). Another test is to run searches for a similarly short motif on one or more Pasteurellaceae genomes, and see if they find a lot more sites. I'll queue those now.