Field of Science

Way back when this blog was young I wrote a post about the 'fraction competent' problem. Put simply (so you won't have to go back and read that post), when we induce competence in a H. influenzae culture we often want to know whether all or only some of the cells are competent, and whether all of the competent cells are equally competent (able to take up similar amounts of DNA).

There's a simple test, based on measuring the frequency of double transformants when cells are given DNA containing two unlinked selectable markers. This test always indicates that some of the cells are taking up multiple fragments of DNA and the others aren't taking up any DNA at all. The differences in competence of cells grown under different conditions, or carrying different mutations, appear to result from differences in the proportion of cells that are competent, not from differences in how much DNA the competent cells take up.
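The arithmetic behind this test can be sketched in a few lines. This is only an illustration under stated assumptions: a fraction f of the cells is competent, each competent cell acquires marker A with probability a and marker B with probability b independently (the markers are unlinked), and non-competent cells take up nothing. The function name and the numbers are made up for the example.

```python
def fraction_competent(freq_a, freq_b, freq_ab):
    """Estimate the fraction of competent cells from transformation frequencies.

    Assumed model (not measured data): a fraction f of cells is competent and
    each competent cell acquires the two unlinked markers independently, so
    freq_a = f*a, freq_b = f*b and freq_ab = f*a*b.
    Solving for f gives f = freq_a * freq_b / freq_ab.
    """
    if freq_ab <= 0:
        raise ValueError("double-transformant frequency must be positive")
    return freq_a * freq_b / freq_ab


# Made-up numbers, for illustration only: 10% of cells competent, and each
# competent cell acquiring each marker with probability 0.05.
f, a, b = 0.10, 0.05, 0.05
estimate = fraction_competent(f * a, f * b, f * a * b)
print(round(estimate, 6))
```

If every cell were equally competent, the double-transformant frequency would equal the product of the two single frequencies and the estimate would be 1. An excess of double transformants over that product pushes the estimate below 1.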

This is an odd result, and I've always been puzzled by it. I've also always mistrusted it because it's quite indirect, though I haven't been able to see how it could be wrong. The post-doc and I spent the day working on our Genome BC proposal, and towards the end we were both grappling with the problem of how to maximize the fraction of the cells we sequence that would be competent. This was difficult because we don't really know what to expect. But then I realized that we could use deep sequencing to settle the question once and for all.

Our plans are ambitious, but we're limiting the sequencing budget to an amount equivalent to the value of the fellowship the post-doc hopes to get from NIH (he'll hear soon) because this is our proposed 'matching funds'. But now, in addition to planning to do one lane of sequencing (= about 300-fold coverage) of two colonies that grew from cells we know were competent because they acquired a selectable marker from the donor DNA, we're going to sequence three random (unselected) colonies. If some cells are fully competent and the rest not competent at all, we predict that at least one of these three will not have acquired any donor DNA at all, and those that do have donor sequences will have replaced about 2% of their genomes. If the cells are all at least a bit competent, then the three unselected colonies will all have some donor DNA, but perhaps quite different amounts.
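How likely the 'at least one colony with no donor DNA' outcome is depends on the fraction of competent cells. Here is a rough sketch, assuming the all-or-none model, that every competent cell ends up with detectable donor DNA, and that the three colonies are independent draws. The function name and the example fractions are hypothetical.

```python
def prob_at_least_one_untransformed(fraction_competent, n_colonies=3):
    """Probability that at least one of n unselected colonies has no donor DNA.

    Assumes an all-or-none model in which a colony carries donor DNA only if
    its founding cell was competent, with colonies sampled independently.
    """
    return 1.0 - fraction_competent ** n_colonies


# Example competent fractions, chosen only for illustration.
for f in (0.1, 0.5, 0.9):
    print(f, round(prob_at_least_one_untransformed(f), 3))
```

Under these assumptions the prediction is nearly certain when only a minority of cells are competent, but becomes much weaker as the competent fraction approaches 1.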

The library construction and sequencing will cost about $2000 per lane, and the 300X coverage is major overkill, but I think it's worth $6000 to put this question to rest, or to at least get enough preliminary information that we can ask NIH for more.

Background for our Genome BC proposal

I think we need to shift the focus of our recombinome proposals from transformation to recombination (I'm going to stop putting 'recombinome' in quotes, unless a reader has a better suggestion). Maybe not for NIH, where we also talk about uptake biases, but definitely for Genome BC, where we're only proposing to analyze the recombinome.

In this analysis we'll be directly detecting a combination of (1) the transformation-specific biases of DNA uptake and translocation and (2) the non-specific biases of cytoplasmic nucleases, the RecA (Rec-1) strand annealing machinery, and mismatch repair. We'll be able to subtract the former from the latter, using information we'll get from the uptake and translocation parts of the NIH proposal (these are not in the Genome BC proposal).

RecA and its homologs provide the primary mechanism of homologous recombination in all cellular organisms. This recombination generates most of the new variation that's the raw material for natural selection (acting on mutations - the ultimate source of variation). Recombination is also a very important form of DNA repair; bacterial mutants lacking RecA are about 1000-fold more sensitive to UV radiation and other DNA-damaging agents.

We know almost nothing about the sequence biases of RecA-mediated recombination. Such biases are likely to exist (all proteins that interact with DNA have sequence biases), and they are likely to have very significant long term consequences for genome evolution. The better-characterized biases of DNA repair processes are known to have had big effects on genomes; for example, the dramatic differences in base composition between different species are now thought to be almost entirely due to the cumulative effects of minor mutational biases compounded over billions of years of evolution.

RecA promotes recombination by first binding to single-stranded DNA, coating the strand with protein. I don't know whether the DNA is entirely buried within the protein-DNA filament, or whether the RecA molecules leave the DNA partly exposed. (Better find out!) The filament is then able to 'invade' double stranded DNA, separating the strands (?) and moving until it finds a position where the DNA in the filament can form base pairs with one of the strands.

(Break while I quickly read a book chapter about this by Chantal Prevost.)

OK, the ssDNA has its backbone coated with RecA, but its bases are exposed to the aqueous environment and free to interact with external ('incoming') dsDNA. The dsDNA is stretched by its interactions with the RecA-ssDNA filament (by about 40% of its B-form length); this may also open its base pairs for interaction with the bases of the ssDNA. But the pairs might not open, and the exposed bases of the ssDNA would instead interact with the sides of the base pairs via the major or minor groove in the double helix. A hybrid model (favoured by Prevost) has the exposed bases pairing with separated As and Ts of the dsDNA, but recognizing the still-intact G-C base pairs from one side. Prevost favours a model in which base pairing interactions between the stretched dsDNA and the ssDNA then form a triple helix (the stretching opens up the double helix, making room for the third strand), which is then converted to a conventional triple helix before the products separate.

With respect to our goal of characterizing the sequence bias over the whole genome, she says:
"Taken together, the available information on sequence recognition does not allow one to ratio the sequence effects on strand exchange or to extract predictive rules for recognition."
So there's need for our analysis.

Contact but not yet two-way communication

So I called Genome BC yesterday, leaving a message asking that they call me back with advice about our eligibility for a Strategic Opportunities Fund grant (given our lack of officially eligible matching funds). They checked out our blogs yesterday afternoon (Hi guys!) but haven't called back yet. I hope they call this morning, because if we're eligible we need to spend the next 7 days immersed in proposal-writing.

Later: They called back. Our lack of confirmed co-funding will be a big problem if we still don't have anything by the panel meeting (Jan. 10). We won't have heard from CIHR by then but should have heard about the NIH postdoc. This Genome BC competition is going to be tight, but we're going ahead, partly because most of the work we do for it will also apply to our NIH proposal, which we need to get to work on anyway.

What's my 'dream research question'?

I've agreed to participate in the local Ecology-Evolution retreat, held at a primitive 'Outdoor School' camp north of here. It's the kind of place where you sleep in bunk rooms and need to bring your own sleeping bag; we'll meet Friday evening to Sunday morning (Nov. 6-8).

The plan is to have a new kind of brainstorming session on Sunday morning, with faculty or senior post-docs taking 3-5 minutes to describe a 'dream research question' that they would love to answer if there were no constraints on time, money or personnel. The rest of the attendees will be assigned to teams for each question (randomly with respect to expertise), and will have an hour to come up with a research program. Then each group will present their program (in a few minutes) and we'll all vote on who gets the prize.

So what should my dream question be? I'm leaning towards wanting to understand the organism-level ecology of a single bacterium. What features matter to it, minute to minute and day to day? What is available, and what is limiting? The difficulty here is scale - we are so big that we can't easily imagine what life is like at this scale. See for example Life at Low Reynolds Number by E. M. Purcell (this link is to a PDF). One problem in using this question for this session is that it isn't a single question but a whole world of ecology. Another is that I suspect what's needed is miniaturization skills, and none of us are likely to know anything about that.

Maybe we could do a "How has selection acted?" question. I would want to take the participants away from the more common "How might selection have acted?" and "How can selection act?" questions to focus on identifying what the real selective forces were.

Or maybe "What was the genome of the last universal ancestor?" The problem with this question is that it is probably not answerable at all, regardless of how much time or money or people are thrown at it.

I think I'd better keep thinking...
I also signed up to do a poster about uptake sequence evolution. I'm a glutton for work.

Later: Maybe I could reduce my first idea to a more manageable question, asking about the components of H. influenzae's natural microenvironment that are relevant to competence. How much DNA do cells encounter? What are the sources of this DNA? How competent are the cells (how much DNA are they able to take up)? What do they do with the DNA they take up?

Are we strategic yet?

We're going to apply for a one-year grant from Genome BC's Strategic Opportunities Fund (application due Nov. 3, which, I fear, is next Tuesday). The basic research proposal isn't that daunting (5 pages, plus five more of references, figures, appendices etc.), but it's accompanied by many pages of stuff we don't have much experience with, such as:
  • A Swot analysis (should it be SWOT?): This is a matrix whose four boxes describe the Strengths, Weaknesses, Opportunities and Threats (S W O T, get it?) associated with our proposal. Strengths and weaknesses are 'internal' factors, and Opportunities and Threats are external. I Googled this, and Wikipedia says that these are to be defined in the light of the goal of the work.
  • A GANTT chart (pause while I Google this too): This appears to be a bar chart showing how the different stages of the proposed work will overlap. Unlike SWOT analyses, this should be Gantt, not GANTT, as this is the name of the person who first popularized such charts, about 100 years ago. Here's one from Wikipedia.
  • Up to three pages of 'Co-Funding Strategy' (plus an Appendix with no page limit): These Genome BC grants require matching funds (see this post). The proposal forms provide a table to list details of the source or sources of this funding, with space under each to explain how the matching funds will directly support the objectives of the project. Oh dear, we're supposed to have a letter from the agency agreeing to our use of their funds to match Genome BC's grant... I think I'd better call Genome BC in the morning.
  • A page of 'strategic outcomes', also explaining why these are of 'strategic importance' to British Columbia: I only recently realized that people distinguish between strategy and tactics, so I asked my colleagues what strategic might mean in this context (Google wasn't much help). Several didn't know any more than me, one recommended looking up the goals of the funding agency (sensible in any case), and one said he thought this just meant explaining the outcomes in the larger context of our long-term research goals.
p.s. Via Wikipedia, here's a LONG thread on the Tufte discussion boards about Gantt charts.

A recombination idea from the Asilomar meeting

One of the talks at the Analytical Genetics conference was about advances in 'recombineering'. This is a way of dramatically increasing the homologous recombination rate in E. coli by expressing recombination-promoting genes of phage lambda. This enhanced recombination is combined with introduction (by electroporation) of DNA fragments containing site-directed mutations (made with PCR), to give a very efficient method of changing chromosomal alleles.

The new work applies this to single-stranded DNA fragments synthesized as oligonucleotides. When the electroporated DNA is single stranded it's possible to eliminate two of the usual three lambda proteins (the two that are toxic to the cell), instead using only the non-toxic Beta protein. Now that the toxic proteins aren't needed the Beta protein can be constitutively expressed from a plasmid without reducing cell viability.

There were also some very interesting discoveries about how mismatch repair acts, and how it can be circumvented by incorporating adjacent mutations to the desired one, but I'll have to post about these later because I don't have my notes handy. For some of the ssDNAs used the transformation frequencies were greater than 10%, so transformants could be identified by simple screening.

Anyway, it occurred to me that this might be useful for enhancing transformational recombination in E. coli and maybe in H. influenzae too. So I'll be getting the beta-expressing plasmid from its creator and introducing it into various E. coli strains. We'll then test whether these strains can be transformed with chromosomal DNA as H. influenzae can. We might also test whether using a synthetic oligo works, as it does by electroporation.

The other thing we could do is transfer the Beta insert to a plasmid that can replicate in H. influenzae (if the one it's in can't). Then we could see how Beta affects transformational recombination. If transformation dramatically increases, as it might, we could offer the plasmid to people working on other Pasteurellaceae species who are trying to get transformation working (or working better) in their species. We could even add this transformation to our 'recombinome' project, asking how Beta-catalyzed recombination differs from Rec-1-catalyzed recombination.

Analytical Genetics at Asilomar

I was hoping to put a photo of the gorgeous Asilomar setting and of clever scientists presenting their work on paper flip-chart pads, but my cursed iphone refuses to send any email while in the USA. The meeting is excellent - a small group of people, all using genetic approaches in microorganisms, giving short talks free of extraneous details.

At lunch today I sat beside someone who's been using Illumina sequencing to characterize the genetic and arrangement diversity that arises in a bacterial culture during growth from a single cell. He had lots of practical advice about our recombinome project. Most importantly, because diversity will have arisen in our recipient cells during the growth they do before we transform them, we need to make a DNA prep of a no-DNA-added control part of the culture at the same time as we make a prep of the DNA-added transformant pool, and sequence this control DNA along with our transformed-cell DNA. And when we sequence our donor DNA, we should use the same DNA prep that we use for the transformation.

Other issues: Yeast extract contains yeast DNA that persists in culture media and can come through the DNA prep and contaminate the Illumina library. We grow our H. influenzae cells in 'brain heart infusion'. I think this contains yeast extract, but it may also contain DNA from the hearts and minds of cows. Illumina may have a protocol for removing DNA from culture media, or we could just DNase-treat it before we autoclave it (I wonder how much DNase, for how long...).

Concentration of DNA in the input sample is critical. There's a new way to use PCR to very accurately measure very low DNA concentrations, called 'Digital PCR'; it uses a version of limiting dilution where only some wells contain a DNA molecule.
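The limiting-dilution idea can be illustrated with the usual Poisson correction: if template molecules land in wells at random, the fraction of wells with no product tells you the mean number of molecules per well. This is a minimal sketch of that calculation, not a description of any particular instrument's software. The function name and the well counts are invented for the example.

```python
import math


def molecules_per_well(positive_wells, total_wells):
    """Poisson estimate of the mean number of template molecules per well.

    Assumes molecules are distributed randomly among wells, so the fraction
    of negative wells is exp(-mean) and mean = -ln(fraction negative).
    """
    if not 0 <= positive_wells < total_wells:
        raise ValueError("need at least one negative well to estimate the mean")
    fraction_negative = 1 - positive_wells / total_wells
    return -math.log(fraction_negative)


# Hypothetical plate: 48 of 96 wells give a PCR product.
lam = molecules_per_well(48, 96)
print(round(lam, 3))
```

Multiplying the per-well mean by the dilution factor and dividing by the volume per well would then give the concentration of the undiluted sample.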

We had a wonderful reception at the Monterey Bay Aquarium on Wednesday night. But we only got to see part of the aquarium, so now I'm going back to see the rest.

Damn, reordering the US-variation manuscript yet again...

Sudden realization that I've been stupidly persisting with a stupid topic order. Here's the new order of the Results:
  1. We want to develop rigorous prediction of how molecular drive could cause preferred sequences to accumulate, so we've developed a computer simulation model of it.
  2. How the model works. Fig. 2 = schematic diagram.
  3. Why we focus on equilibrium properties, and how we identified equilibria. Fig. 3 = approach to equilibrium from above and below.
  4. Effect of the mutation rates of the genome and the recombining fragments. Fig. 4A = Mutation rates don't affect equilibrium scores.
  5. Effects of the amount of recombination per cycle. Fig. 4B = Amount of recombination limits uptake sequence accumulation.
  6. Effect of minimum cutoff for recombination. No figure.
  7. Effect of matrix strength (100-fold and 10-fold bias) and scoring algorithm (additive and multiplicative scoring of fragments). Fig. 5A = matrix effects; Fig. 5B = additive and multiplicative scoring effects.
  8. Using the Gibbs Motif Sampler to identify and characterize uptake sequences in the N. meningitidis and H. influenzae genomes. Fig. 6? (and other genomes - Fig. S1)
  9. Proportions of perfect and singly-mismatched uptake sequences in simulated genomes and in real genomes characterized by the Gibbs sampler. No figure, but maybe a table of frequencies.
  10. Spacing of uptake sequences in simulated and real genomes. Fig. 7A&B = simulated genomes; 7C&D = N. meningitidis and H. influenzae genomes.
  11. Simulation of DUS and USS evolution using Gibbs-derived matrices. Fig. 8A&B = uptake sequences accumulate to high frequencies; logos are like those of the corresponding real genomes.
  12. We thus expect the genome motifs to correspond to the uptake biases, but we see serious discrepancies with published H. influenzae uptake measurements (comparable data for N. meningitidis is not available). We redid these experiments, and see the same discrepancies. Fig. 9 = Lindsay's uptake assays.
  13. Might the problem be that recombination is affected by other biases (in addition to the single-base effects studied so far)? Test for covariation between DUS/USS positions reveals very little. Fig. 10A&B = covariation.
  14. Measures of USS variation in H. influenzae populations also reveal effects of all biases that ultimately influence what recombined into the genome, but these are also discordant with the observed genomic bias. Fig. 11A&B = BLAST analysis of simulated and real genomes.

Why uptake sequences matter (up from the data mines)

I've been slaving in the uptake sequence data mines for the past couple of weeks, and although I'm filling the manuscript with new data I'm having a hard time relating this to the big picture. I want to get a draft with all the data to my coauthor on Saturday, because on Sunday I'm off to a meeting.

This is going to be a great meeting because we aren't allowed to use PowerPoint or overheads. In the past we've had 20 minutes and a whiteboard, but this time it's going to be 15 minutes and paper pads so I really need to focus what I'll say. I'm going to talk about uptake sequences, and I'll need to start by explaining why I think they're so important. Here are some steps I could approach it by:

1. Evolvability: One big theme in the study of evolution is whether the processes that generate variation have themselves been subject to natural selection. Have mutation rates been selected to give the best balance between harmful and beneficial changes? Has the genetic code been optimized to minimize the effect of non-silent mutations on fitness? Have developmental processes been selected to channel changes into beneficial directions? Have the genes and sequences that control genetic recombination been selected to optimize the benefits of recombination?

2. Do bacteria have sex? The evolution of sex is a big unsolved problem in eukaryotes. Researchers reasonably assume that sexual reproduction (diploid meiosis + haploid fusion) exists because shuffling alleles into new combinations increases fitness, but they don't have a satisfactory explanation of why this would be true. In bacteria, natural competence (active DNA uptake) is widely assumed to be an analog of meiotic sex, selected because the genetic changes it causes are often beneficial. (At least some researchers accept (if prompted) that other processes that lead to gene transfer do so by accident.)

3. What's the 'null hypothesis'? To study evolutionary function, we need to first think about null hypotheses. If a process has more than one consequence (more than one candidate 'function') we should start by evaluating their likely relative impact. Direct effects on survival are probably more important than evolvability, and inevitable consequences are probably more important than occasional ones. And we need to consider the harmful consequences as well as the beneficial ones. For DNA uptake, the nutrients in DNA are an inevitable benefit to survival, whereas the genetic changes are occasional, indirect, and sometimes good but more often bad.

4. The regulation of competence is consistent with selection for nutrients. Are there aspects of the regulation of competence that are consistent with selection for genetic consequences and not for nutrients? Have other consequences for survival been overlooked? Perhaps stalling of replication forks when nucleotide pools are depleted?

5. What about uptake sequences? Two sentence introduction: Two families of bacteria have DNA uptake machineries that prefer short sequences very abundant in their own genomes. This combination of bias and abundance has been assumed to have evolved to optimize the genetic consequences of DNA uptake (to be an adaptation for evolvability).

6. Are uptake sequences evidence of selection for genetic consequences? I've been claiming that, if the DNA uptake machinery has a sequence bias, and if the incoming DNA sometimes recombines with the chromosome, the preferred sequences are expected to accumulate in the chromosome by molecular drive. This is a suitable null hypothesis, as it follows from both the nutrient and evolvability explanations for DNA uptake. It's actually a stronger prediction of the evolvability model, because only that model requires chromosomal recombination. We've now developed a computer-simulation model that lets us test the predictions. If all the properties of real uptake sequences are consistent with accumulation by molecular drive, then there's no need to invoke selection for evolvability.

Oops, I think I've just used up all of my 15 minutes!

Maybe the spacing is quite random after all

I think I've finally finished the analysis of spacing of real uptake sequences. Here's the figure. (Hmm, Blogger has squished it, but it's still legible.)

The main histograms are the numbers of uptake-sequence spacings in each 100 bp bin, for N. meningitidis DUSs (blue bars) and H. influenzae USSs (red bars). The blip at the far end of each histogram is the number of spacings greater than 5000 bp.

The first bar of each histogram is a lot higher than its neighbours (especially the DUS one, which would be twice as high if I'd drawn it to scale). This tells us that each genome has a lot more close-together uptake sequences than expected from the spacings of the others. We already knew this from published work, and as expected almost all of these close-pair uptake sequences are in inverted orientation, allowing them to act as transcriptional terminators.

The inset in each graph shows the spacings of the uptake sequences that are within 50 bp of each other (center-to-center distance). The N. meningitidis DUSs are mostly 15-25 bp apart, allowing the 12 bp uptake sequences to base pair as a stem with a 3-13 nucleotide loop. But the H. influenzae USS 'pairs' are much closer together, effectively sitting one on top of the other. We think these sequences probably shouldn't be considered as pairs of uptake sequences (as in the published analysis), but as single palindromic uptake sequences that can act in both directions. Here's an example, with the centers of the 'two' uptake sequences separated by 1 bp, and another with the centers separated by 11 bp.

Once I'd done this analysis I realized that I should take these pairs into account in deciding whether the spacing of the rest of the uptake sequences is random. The black line on each graph shows the average spacings of the same number of random positions as there are non-paired uptake sequences (averaged over 10 genomes worth of random positions). Overall the lines fit the histograms quite well. The darker part of the first bar on each graph is the number of unpaired uptake sequences in the 100-bp bin; there are a bit fewer of these than the random positions predict, but that's the only anomaly.
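For anyone wanting to reproduce this kind of random-position control, here is a rough sketch of one way to do it. It is not the program I actually used. It assumes a linear sequence (no wrap-around for a circular chromosome), 100 bp bins with a final catch-all bin for spacings of 5000 bp or more, and averaging over 10 random sets. The site count and genome length in the example are placeholders.

```python
import random


def spacings(positions):
    """Distances between adjacent positions on a linear sequence."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def binned_counts(gaps, bin_size=100, max_gap=5000):
    """Count gaps per bin; the final entry collects gaps of max_gap or more."""
    n_bins = max_gap // bin_size
    counts = [0] * (n_bins + 1)
    for gap in gaps:
        counts[min(gap // bin_size, n_bins)] += 1
    return counts


def random_control(n_sites, genome_length, n_sets=10, seed=0):
    """Average binned spacing counts over n_sets sets of random positions."""
    rng = random.Random(seed)
    totals = None
    for _ in range(n_sets):
        positions = rng.sample(range(genome_length), n_sites)
        counts = binned_counts(spacings(positions))
        if totals is None:
            totals = counts
        else:
            totals = [t + c for t, c in zip(totals, counts)]
    return [t / n_sets for t in totals]


# Placeholder values, not the real site count or genome length.
control = random_control(n_sites=1500, genome_length=1_800_000)
print(control[:5], control[-1])
```

The averaged counts play the role of the black line on the graphs, to be compared bin by bin with the histogram of the real spacings.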

This isn't the result I was expecting, so I'll have to do more thinking about how we present the spacing analysis of the simulated genomes.

Spacing analysis back on track

My wizard post-doc is teaching himself R and other useful tools for bioinformatics, and so was able to quickly generate a new DUS position dataset using my Gibbs-derived matrix, and use this to check the spacing. This showed the expected peak at very close spacings, but I couldn't use his data for my figures because it wasn't derived the same way.

So I redid my analysis of my dataset - no peak. Then I replicated the analysis on a different dataset, which gave the expected peak (good - this means the analysis method is fine), but I couldn't use this dataset for my figure because it was a bit too small - it had only 2872 DUS sites, but my analysis needed at least 2902. So I concluded that something must be wrong with the position labels in the original dataset, but couldn't figure out what this might be. I searched and searched my old analysis files (buried in various cryptic folders on my computer and on the Westgrid server), and finally found another one big enough to use. Analyzing it gave the expected peak, and now I have a nice figure showing the not-very-exciting result that the spacings of DUSs and USSs found by Gibbs analysis are like those originally reported for their perfect-consensus counterparts. But presenting this figure will set the framework for the analysis of the spacings of simulated uptake sequences, which I need to redo today (I think my runs have given me the data I need).

And today I need to print out the manuscript and work through it carefully; I've been focusing a lot on the data and not enough on what we should say about it.


My analysis of the spacings of uptake sequences in the Neisseria meningitidis genome, which I thought was quite meticulous, is failing to find the many closely spaced pairs that I know are there! It found many closely spaced pairs in the H. influenzae genome, so I can't figure out what I'm doing wrong.

Spacing of uptake sequences, take 2

The US-variation manuscript is coming together nicely at last, not so much the writing as the data and the overall organization. I've even figured out how to combine all the odds and ends of cool results that I was afraid would have to be left out (the BLAST analysis of USS variation, the covariation analysis, the DNA uptake data). One important part still to do is the analysis of spacing of uptake sequences in real and simulated genomes.

I wrote about this in a couple of posts last June (Spacing of uptake sequences, Real uptake sequence spacings are not random). I had first checked the spacings of the generic uptake sequences selected by our Perl model of uptake sequence evolution, and found that these were far from random, with uptake sequences very rarely found within one fragment length of each other. This was true for a wide range of fragment lengths. Then I compared the spacings of real uptake sequences in the H. influenzae genome with those expected for random locations, and found that both closely-spaced and far-spaced USSs were underrepresented. This had been previously reported for perfect-consensus USS cores, but I did it with the positions found by the Gibbs motif sampler.

What I need to do now: 1. I need to redo the analysis of the generic uptake sequences from the Perl model, because the model has changed slightly. I expect these results to be just about identical to the previous ones. 2. I need to analyze the N. meningitidis genome just as I did the H. influenzae one. Again, this should be done with the positions found by the Gibbs search, not just by counting perfects and one-offs. The only problem is that I need to find the original Gibbs output for the run whose sequences I've been using for various analyses in the manuscript. (I do have the file with all the sequences, but their positions have been deleted.)

The plan is to present the spacing analysis of the real genomes early in the manuscript, with the other Gibbs analyses, and then come back to this with the analysis of the simulated sequences.

Update: I found the missing N. meningitidis Gibbs data and extracted the spacings for both the uptake sequences in the same orientations and the uptake sequences in both orientations. The post-doc taught me how to make histograms in Excel. I found the program that creates the control data (the same-as-US number of random positions in a same-length sequence), and created a random set of both-orientation spacings and of same-orientation spacings. But I forgot that I need to create at least 10 random sets to smooth out the noise. Now I'd better start writing the first part of the text about this.
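Here is a small sketch of what extracting the two kinds of spacings might look like. It assumes one particular reading of 'same-orientation' spacings, namely the distances between neighbouring sites on the same strand, computed separately for each strand; the post itself doesn't pin this down. The coordinates and the function names are made up.

```python
def adjacent_spacings(positions):
    """Distances between neighbouring positions after sorting."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def spacing_sets(sites):
    """Return (both-orientation, same-orientation) spacings.

    sites is a list of (position, strand) tuples with strand '+' or '-'.
    Same-orientation spacings are computed within each strand separately,
    which is an assumed interpretation rather than a confirmed definition.
    """
    both = adjacent_spacings([pos for pos, _ in sites])
    same = []
    for strand in ("+", "-"):
        same.extend(adjacent_spacings([pos for pos, s in sites if s == strand]))
    return both, same


# Invented coordinates, for illustration only.
sites = [(100, "+"), (118, "-"), (2500, "+"), (4100, "-"), (4900, "-")]
both, same = spacing_sets(sites)
print(both)
print(same)
```

With real data, a closely spaced inverted pair shows up as a short spacing in the both-orientation list but not in the same-orientation list.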

H. influenzae recombination and the human microbiome project

Hmm, this blog post has no content. It was going to be about how I am hoping to tie our NIH proposal into NIH's human microbiome project, but I got distracted. More later.

Grants and grantspersonship

This week I've been to two sessions about grants we'd like to apply for. The first was a half-day session on NIH grants, and the other was a lunchtime information session on seed money available from our local genome centre. I'll write a little bit here about the genome centre grants, and do a longer post on what I learned from the NIH workshop later.

The genome centre grants range from $25,000 to $200,000, for a single year. They're not renewable, and only about a dozen are awarded. They're not to support ongoing projects, but to provide preliminary data that can then go to other granting agencies or to industry partners.

The applications are due Nov. 3. I haven't read the proposal forms myself yet, but the post-doc says the proposal itself is only 5 pages. And it would be similar enough to parts of our CIHR and NIH proposals (and to the post-doc's fellowship applications) that it shouldn't be onerous to write.

The big complication, at least for us and, judging from the questions, for many others, is the need for matching funds. These funds must have been applied for since May 2008, so we can't use our current CIHR grant. Using it would be tricky anyway, because the sequencing we want to do is not directly related to the goals of that grant. They want 75% of the matching funding to be in place at the time the application is submitted, which appears to rule out the CIHR grant we just submitted, as we won't learn the results until January (though I gather there is some flexibility about this). If one of the post-doc's fellowship applications succeeds we could use that as matching funds - it wouldn't be a lot of money but we don't plan to ask for a lot of money anyway.

So I think we'd better start work on this right away. Step 1 - read the application form.