We're funded!

Our proposal to study the regulation of CRP-S promoters in H. influenzae and E. coli was successful; we have $127,567 per year for 5 years.

SMBE slides

I've just uploaded the slides from my SMBE talk to SlideShow, as suggested by a commenter. The slide show's name there is "R. Redfield's SMBE talk slides". I gave it tags for 'evolution', 'bioinformatics' and 'science'.

I think the talk went well. It was reasonably well attended, given that it was the last long talk of the meeting. I had streamlined it, and given it a very simple straightforward focus on how the issue of uptake sequences relates to the big question of whether bacteria have any processes that evolved by natural selection for producing recombinational variation.

Why our work is important

On Monday I ran my ideas for my SMBE talk past the postdocs, who politely trashed them. I was making all the errors I know not to make. The worst of these was that I wasn't telling the audience why they should care about what I was telling them. One of the reasons I don't worry much about competition is that nobody else is working from the perspective I am, but this means I have to spell out the issues at the start of every talk, starting from the very basics.

The big issue is the evolution of 'sex'. The word sex has lots of meanings; here I mean any biological process that evolved because of the benefits of creating new combinations of genes (new genes or new alleles of genes). (If you've stumbled onto this blog by searching for 'sex' with another meaning in mind, you might want to cut your losses now.) I use this definition because it captures the big unsolved question of why so many eukaryotes engage in 'meiotic sex', that is, they produce a diploid genome by merging two haploid genomes and later produce from this four new haploid genomes with new combinations of genes.

Evolution of sex in eukaryotes is a big issue because we biologists don't know why it's worth the trouble. That sounds feeble, but generations of the best minds have rigorously analyzed the genetic consequences without producing any compelling explanation of why the recombined genomes would be sufficiently better than the original ones to compensate for all the biological costs of sex. The costs aren't such a big deal for facultatively sexual organisms like yeast or paramecium (who can reproduce just fine without meiotic sex), but they're enormous for obligately sexual organisms like ourselves and most other plants and animals, which must use meiotic sex to reproduce.

My approach to this problem is to ask whether bacteria have sex; that is, whether they have any processes that evolved because of benefits of creating new combinations of genes. The key word here is 'because', by which I of course mean 'by natural selection for'. We know that bacteria and archaea have processes that cause recombination, and that these processes have been important in the long-term evolution of their genetic capabilities. But I want to find out whether this happens by accident (as side effects of processes evolved by natural selection for other effects), or by natural selection for the new combinations. If the answer is yes (bacteria do have sex), we'll have shown that this selection is ubiquitous, and we'll have an independent (non-meiotic) system in which to investigate it. If the answer is no, we'll have shown that the reasons for meiotic sex are specific to eukaryotes, and that bacteria get all the genetic recombination they need by accidental effects of other processes.

The reason I have almost no competitors is that researchers have traditionally assumed that the processes that cause genetic recombination in bacteria exist because of selection for such recombination, and very few are willing to seriously consider that this assumption should be rigorously tested. This is a good place for one of my favourite quotations:
"I know that most men, including those at ease with problems of the greatest complexity, can seldom accept even the simplest and most obvious truth if it be such as would oblige them to admit the falsity of conclusions which they have delighted in explaining to colleagues, which they have proudly taught to others, and which they have woven, thread by thread, into the fabric of their lives."
Leo Tolstoy

How are we testing this assumption? By looking for evidence (at the molecular level) of how selection has shaped the processes that cause recombination. There are three such processes, but two of them, conjugation and transduction, can be easily shown to cause recombination only as side effects of genetic parasitism by plasmids and phages respectively. That leaves natural competence (DNA uptake) and its genetic consequence (transformation), which is what we work on. Transformation itself arises when the so-called recombination machinery in the cell acts on DNA the cell has taken up. But this machinery exists not because of selection for making new combinations of genes using DNA brought in from outside, but because of selection for the ability to repair and replicate the cell's own DNA. Because transformation itself is an unselected side effect of the replication and repair machinery, we concentrate on understanding how natural selection has acted on natural competence (the DNA uptake process).

I'll explain how we do this in a later post.

The proposal reviews are in...

Yesterday I got the emails telling me that the scientific reviews of our grant proposals were available. The actual funding decisions won't be made for a few weeks.

Regulation of Sxy and CRP-S genes: Score 4.25/5, ranking 9/37. So it might be funded.

Mechanism of DNA uptake: Score 3.8/5, ranking 21/40. This one won't be funded.

Pushing for more open science

I'm going to take advantage of my SMBE talk to proselytize for open science and research blogs.

The idea of open-access scientific publication is spreading. It's relatively easy for scientists to see the benefits of making the results of their research accessible to more people. But hardly anybody seems to consider the potential benefits of opening up the process of doing the science (as I'm trying to do with this blog). The reason is competition; most researchers are afraid that they'll be scooped if their competitors find out what they're doing.

Although competition can be a positive force in science, stimulating us to try to do better work than our competitors, I think that the secrecy that arises from fear of competitors is a very negative force. Competitors are all potential collaborators, and a focus on competition shuts out the potential for collaboration. I was impressed by an NIH grant-writing workshop that told me that the first thing NIH does when people call to ask about funding opportunities for a new line of investigation is to connect them with people doing related work, encouraging them to contact these as possible collaborators.

Of course it's easy for us, because we're working on a problem whose importance isn't widely recognized....

Stringency of Gibbs analysis

In revising our USS manuscript I've been thinking again about the stringency of the Gibbs motif searches. The stringency is set by telling the program how many sites it should expect to find. For most analyses we set this at 2000 (and it would find about 1350) but for the analysis of covariation between positions we set it at 3000 so we would have many poorer matches (it would find about 1650) and thus have more variation to analyze.

The Gibbs analysis assigns a score to each site it finds. With 'expect 2000' most sites have scores close to 1.0, but there are always some (~40 in the set I counted) with scores of zero and a few with scores between 0.5 and 0.9.

I only yesterday discovered how to get Excel to make a histogram; here's the histogram showing the distribution of scores for one of the 'expect 3000' searches. As with the 'expect 2000' analyses, most of the USS sites have scores close to 1.0, and few have scores in the middle range. But there are lots more sites with scores=0, and today I've been checking out how bad these sites are.
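For anyone who would rather not wrestle with Excel, the same kind of histogram can be sketched in a few lines of Python. This is only an illustration: the scores below are made-up placeholders (not our Gibbs output), and the function name and the choice of ten bins are assumptions.

```python
def score_histogram(scores, n_bins=10):
    """Count scores (each between 0 and 1) in n_bins equal-width bins."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 goes into the top bin instead of overflowing.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# PLACEHOLDER scores only: a pile at 0, a few in the middle, most near 1.
example_scores = [0.0, 0.0, 0.0, 0.55, 0.72, 0.98, 0.99, 1.0, 1.0, 1.0]
example_counts = score_histogram(example_scores)

# Print a crude text histogram, one row per bin.
for i, c in enumerate(example_counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * c}")
```

With real output, the list of per-site scores from the Gibbs run would simply replace the placeholder list.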

I had feared that they were garbage included by mistake, just increasing the background noise but not really resembling the USS consensus. So I was pleasantly surprised to see that these sites have a strong USS signal when displayed as a sequence logo (the top logo in the figure). For comparison I made another logo using only the sites that scored 1.0; that's the bottom logo in the figure.

Of course this analysis doesn't tell us whether the low-scoring sites found by the Gibbs analysis actually function as USS in DNA uptake. Most of them miss the consensus at one or more core positions, but the ~25% with perfect cores also have stronger flanking consensus (I'm not sure how to interpret this).

Not that kind of framing

I'm working on the revisions for the USS paper, writing up the new analysis of reading frames and constraints. I put the data in the table on the left; the "Relative to best codons" values estimate how easily the USS-encoded tripeptide would be translated. I'm puzzling over why only 49 USSs are in reading frame A.

The graph below shows USS number plotted as a function of the postulated cause, proteome number (this is plotted the opposite way to that in my previous post on this analysis). This plot more clearly shows that one point (frame A) is an outlier; it has a lot fewer USSs than we would expect given how often the tripeptide it specifies appears in the proteome.

Could this be because of codon constraints, i.e. because USSs in frame A require the tripeptide KVR to be encoded by inconvenient codons? No, the codon score is quite high (0.77) meaning that this frame's USS-specified codons are commonly used in the proteome.

If anything there should be more USSs in this frame than predicted by the proteome abundance of its tripeptide, because the consensus is weak for the first and second positions of the first codon and the second position of the second codon.

Motifs and elements

The reviewers of our USS manuscript didn't feel that our Gibbs motif analysis of USSs was much of an advance on the previous analyses. It's true that the motif identified by the Gibbs analysis is very similar to that found by searching for perfect USS cores. But the results could have been otherwise, and it's important to have found this out. So in our revisions we need to do a better job of explaining why the motif analysis was needed.

First I should clarify that the USS should be viewed not as a replicative 'element' but as a 'motif'. Both terms can refer to sequences or sequence patterns that are present at multiple sites in the genome but, at least for the purposes of this blog, a replicative element is a DNA sequence whose repeats have arisen by copying and insertion. Transposons and insertion sequences are examples of replicative elements that code for their own replication; Alu sequences are elements that are passively replicated. USSs could have also been elements produced by some sort of copying and insertion process, but we now know they are not.

The term 'sequence motif' can be used for any detectable sequence pattern that occurs in multiple locations or genomes. It is commonly applied to DNA sequence patterns that have been selected for binding by specific DNA-binding proteins such as polymerases, transcription factors, and repressors, and to amino acid sequence patterns that perform specific functions in proteins. These motifs arise by point mutations in preexisting sequences, not by copying and insertion. Typically they are short (5-25bp), and have much weaker consensuses than do replicative elements, with most or all instances differing at one or more positions from the consensus. (Different copies of replicative elements are often identical over hundreds of bp.)

I won't go into the compelling evidence here, but we now know that individual USSs clearly arise by normal point mutations, not by copying. Although previous analyses of genomic USSs did not explicitly consider the distinction between replicative elements and motifs, they were limited by the need to search for specific sequences. The results therefore reflected only the properties of those specific sequences, with no allowance for the true diversity of functional USSs.

Given our new knowledge that USSs are motifs, any analysis of USS evolution had to be built on a solid understanding of their true diversity. The availability of the Gibbs motif sampler program let us search the whole H. influenzae genome for any patterns, without having to specify any sequence. Once the program found a pattern, it created a list of all the sites in the genome fitting the pattern, with a measure of the strength of each match. Thus it provided an unbiased analysis of the full diversity of USS-related sequences in the genome.

Introducing my USS talk

I'll be giving a talk at the upcoming meeting of the Society for Molecular Biology and Evolution (SMBE) in Halifax. The focus of the session is on the evolutionary consequences of the mechanisms that cause recombination, and I'll be talking about how the sequence specificity of the H. influenzae DNA uptake machinery has affected the evolution of its genome and proteome.

I'm going to start the talk by describing what usually happened at the end of the talks I used to give on the regulation and evolutionary function of competence. I'd conclude that the regulatory evidence supported the hypothesis that bacteria take up DNA for food, not for its genetic information. Most people find this a very unwelcome idea, and one of the first questions would always be why I thought the uptake specificity wasn't compelling evidence that bacteria take up DNA to get new genetic information. I'd answer by saying that uptake specificity needn't have evolved to promote recombination, and that we were beginning to investigate alternative explanations.

I'm hoping that this introduction will capture people's attention - "There's a controversy! Everyone thinks she's wrong!"

We now have tons of analyses to report: the Perl model of USS evolution, the uptake assays with mutated USSs, the Gibbs motif analyses, the reading frame analysis, the evidence that USSs are not insertion elements but motifs that arise by mutation, the correlation of uptake specificity with Pasteurellacean phylogeny, the lack of coding constraints due to the USS in the H. influenzae genome, the presence of competence genes and genomic USSs across the Pasteurellaceae (including species that can't be transformed in the lab)... Unfortunately for me (fortunately for my audience?) I only have 30 minutes including the question period. I'm going to try to pull a draft talk into shape for Monday, when it will be my turn to do lab meeting.

Reading frames vs USSs

Sorry for the dead air; I was away unexpectedly with no internet access.

We got the reviews back for our USS manuscript. Not too bad. Both reviewers asked for more analysis of the role of reading frames in USS locations. (See previous posts here and here.) This turned out to be both easy and fun to do.

Many USSs are in the protein-coding parts of the genome, and these can be sorted by which of the 6 possible reading frames the respective proteins are encoded in. The first two figures show the relationships of the USSs to the reading frames.

The frames aren't equally used. Frames A, B and C have 49, 125 and 425 USSs respectively, and frames D, E and F have 474, 125 and 157. The differences are too large to be explained by chance alone, and we think they arise because USSs in the less-used reading frames impose more severe constraints on protein function, so that many USSs arising in these frames are eliminated by natural selection.
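One rough way to put a number on "too large to be explained by chance" is a chi-square goodness-of-fit test on the six counts given above. This is a minimal sketch, not necessarily the test we would use in the paper, and it assumes the simplest null hypothesis: that a USS is equally likely to land in any of the six frames. The function name is made up for illustration.

```python
# USS counts per reading frame, as given in the text above.
uss_counts = {"A": 49, "B": 125, "C": 425, "D": 474, "E": 125, "F": 157}

def chi_square_uniform(counts):
    """Chi-square statistic against the ASSUMED null of equal use of all categories."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

total_uss = sum(uss_counts.values())
chi2 = chi_square_uniform(list(uss_counts.values()))

# Standard 5% critical value for a chi-square distribution with 5 degrees of freedom.
CRITICAL_5DF_05 = 11.07

print("total USSs in coding frames:", total_uss)
print("chi-square statistic:", round(chi2, 1))
print("exceeds 5% critical value:", chi2 > CRITICAL_5DF_05)
```

The equal-use null is crude, which is why the analysis that follows compares USS counts against how often each tripeptide appears in the proteome.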

The new analysis considers the two factors likely to contribute to this disparity. (Because the flanking segments exert only modest constraints on amino acid sequence, I've limited the new analysis to only the most frequent tripeptides encoded by the USS core in each frame.)

The first factor likely to affect USS reading frame usage is the differing biochemical properties of the tripeptides that USS cores in these reading frames will encode. Some combinations are intrinsically more versatile than others, useful at many different locations in a wide variety of proteins, whereas other tripeptide combinations will only be useful in particular contexts.

The second factor is 'codon bias'. Most amino acids are specifiable by at least 2 (often 4 and sometimes 6) different codons, some of which are more efficiently translated than others, and cells preferentially use the easiest-to-translate codons for proteins they need to make a lot of.

The new analysis evaluates the first factor by comparing the number of USSs in each frame to the total usage of the six tripeptides in all the proteins of the genome (the 'proteome'). If the differing versatilities of the tripeptides are responsible for their differing use in USSs, we should see a correlation between total number and USS-encoded number. The results are shown by the red symbols and line in the graph below. The predicted correlation exists. (The line fitted to the points has a confidence score of 0.86. I know enough statistics to know that this is a reasonably strong correlation.)
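For readers who want to try this kind of comparison themselves, here is a minimal sketch of one common measure of correlation, the Pearson coefficient. It may not be the same statistic as the fitted-line value quoted above. The USS counts are the ones given earlier in this post, but the proteome counts are PLACEHOLDERS invented purely so the code runs; they are not our data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# USS counts for frames A-F, from the text.
uss_per_frame = [49, 125, 425, 474, 125, 157]

# PLACEHOLDER proteome counts of the six tripeptides (hypothetical, not real data).
placeholder_proteome_counts = [1000, 1500, 4000, 4500, 1200, 1800]

r_example = pearson_r(placeholder_proteome_counts, uss_per_frame)
print("Pearson r with placeholder data:", round(r_example, 2))
```

Swapping the real tripeptide counts in for the placeholder list is the only change needed.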

The blue symbols in the graph show the results of the other analysis. To look for a correlation between codon bias and USS reading frame usage, I needed a crude score indicating how easily each USS-encoded tripeptide would be translated. I was able to get a table of codon usage for the H. influenzae proteome from the TIGR website; this gave the percent usage of each codon. So for each tripeptide I calculated a 'USS-codons' score as the sum of the percentages of the codons specified by USSs in that reading frame, and a 'best-codons' score as the sum of the percentages of the most commonly used codons for its three amino acids. Then I calculated a 'codon cost' as the ratio of these scores, and multiplied it by 100 so it would fit neatly on the graph.
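The arithmetic is simple enough to sketch in a few lines. I'm assuming here that the ratio is the USS-codons score divided by the best-codons score (so a value near 100 means the USS uses the most common codons); the function name and the percentages in the example are hypothetical, not values from the TIGR table.

```python
def codon_cost(uss_codon_pcts, best_codon_pcts):
    """Scaled ratio of the 'USS-codons' score to the 'best-codons' score.

    uss_codon_pcts:  percent usage of the three codons the USS specifies.
    best_codon_pcts: percent usage of the most common codon for each of
                     the same three amino acids.
    ASSUMPTION: the ratio is taken as USS-codons / best-codons.
    """
    uss_score = sum(uss_codon_pcts)
    best_score = sum(best_codon_pcts)
    # Multiply by 100 so the value sits on the same axis as the USS counts.
    return 100 * uss_score / best_score

# HYPOTHETICAL percentages, chosen only to make the arithmetic easy to follow:
# USS-codons score = 4.0, best-codons score = 5.0.
example_cost = codon_cost([2.0, 1.0, 1.0], [2.0, 2.0, 1.0])
print(example_cost)
```

A tripeptide whose USS-specified codons are all the most common ones would get the maximum value of 100.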

If codon cost contributes to the disparity in reading frame usage by USS, we expect the blue points to show an inverse relationship; the highest codon costs should be for the least-used reading frames, so the blue line should slope down to the right. Instead we see no correlation at all. This tells us that codon bias makes little or no contribution to persistence of USSs in different reading frames.

We're not surprised by this second result. (In fact one of the postdocs was so sure of it that she didn't think the analysis was worth doing.) Other analyses we've done have indicated that USSs are rarely found in categories of genes known to be subject to strong codon bias. But that's for a different paper, and this new analysis will nicely address concerns raised by both reviewers of our present paper.

Next?

OK, so the CRP-S gene induction experiments established clearly that the ppdA gene's CRP-S promoter isn't being induced by the growth conditions I tried, and that changes in beta-gal activity don't necessarily result from changes in promoter activity. What to do now?

To make progress in these experiments, I need better tools for manipulating genes. I want to put the ppdD::lacZ fusion into the chromosome (it's on a plasmid now), test the effect of knocking out various genes, and try a 'wild' E. coli strain. All of these require recombining desired genes into the chromosome, so I need to get the recombineering technique working for me. One of the post-docs was working on this, so I just need to take up where she left off.

I also need to get back to the tests I was setting up of components of the laser-tweezers project. I was just about ready to test attaching DNA to beads (have the beads with the bound streptavidin, have the biotinylated nucleotides to put on the ends of the DNA, have the DNA prep). Separately, I need to work with the new post-doc on an antibody-based method to attach H. influenzae cells to other beads, so they can easily be pushed around under the microscope.

Not the result I was hoping for....

This is beta-galactosidase activity in cultures that have been transferred from LB (= rich medium) to M9 salts plus a tiny bit of casamino acids (starvation medium). Transfer was at time=0.

All cells carried the ppdA::lacZ fusion on the same plasmid. The black line is cells that are sxy+ and crp+. The blue line cells are sxy-, and the red line cells are crp-.

It's clear that beta-gal activity increased similarly in all three cultures. As seen for the cultures in rich medium (previous post), the crp- cells had slightly lower activity, while the sxy+ and sxy- cells had almost identical activity.

So I conclude that the treatments I've tried so far have not changed the Sxy-dependent activity of the ppdA promoter.

Disappointing result

The detailed time course of beta-galactosidase activity in the E. coli strain with the ppdA::lacZ fusion confirmed the results I posted a few days ago. But I'm not going to post these results, because they are superseded (made uninteresting?) by the results of the next experiment.

Having shown that beta-gal activity decreases as cells enter exponential growth, and increases once growth stops, I needed to find out whether these changes depended on the presence of the transcription activators Sxy and CRP. If the answer is yes, then they are due to changes in the activity of the ppdA gene's CRP-S promoter. If the answer is no, then the beta-gal changes are due to something much less interesting (from my present perspective of wanting to find out how E. coli CRP-S promoters are activated). Unfortunately the answer seems to be NO.

Here's the data. The top graph shows culture density as a function of time for the three strains I tested. The first points (t=0) are before the cells were diluted 300-fold. You can see the 'lag' for the first ~30 minutes, then the cells begin growing exponentially. (You can tell that they're doubling at a constant rate because the points fall on a straight line on this log scale.) After about 200 minutes growth slows. The lines aren't joined to the last points because there's a 700 minute gap separating them (this part of the graph isn't to scale). You can also see that one strain grows slower than the others (red line and points); this is the strain whose crp gene is knocked out.
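The straight line on a log scale has a handy consequence: any two density readings from the exponential part of the curve are enough to estimate the doubling time. Here is a minimal sketch of that calculation; the function name and the readings in the example are hypothetical, not values from these cultures.

```python
import math

def doubling_time(t1, density1, t2, density2):
    """Doubling time from two density readings taken during exponential growth.

    If density grows as N(t) = N0 * 2**(t / Td), then
    Td = (t2 - t1) * ln(2) / ln(density2 / density1).
    """
    return (t2 - t1) * math.log(2) / math.log(density2 / density1)

# HYPOTHETICAL readings: density rises 8-fold (three doublings) in 90 minutes.
example_td = doubling_time(30, 0.01, 120, 0.08)
print(round(example_td, 1), "minutes per doubling")
```

Readings taken during the lag or after growth slows would not fall on the straight line, so they shouldn't be used for this estimate.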

The second graph shows the amount of beta-gal activity in the cultures. Some of the values for the crp- strain (again the red line and points) are a bit low, perhaps because of its slow growth. However the sxy+ and sxy- strains show almost identical patterns (black and blue lines respectively). This means that the changes in beta-gal activity do not depend on Sxy, and thus almost certainly do not reflect changing activity of the ppdA gene's CRP-S promoter. Again the last points are for samples taken after 1200 minutes, when the cultures had been in stationary phase for quite a while.

What could cause the changes in beta-gal activity? One possibility is that the number of copies of the ppdA::lacZ plasmid per cell might continue to increase after cell growth slows. Another is that the lacZ mRNA might be more stable in stationary phase than other mRNAs. There are probably other possibilities too. The only important possibility is that we're wrong about Sxy and CRP being transcriptional activators of this promoter.

What next? I'm right now testing whether the stronger induction seen when the crp+ sxy+ cells were transferred to starvation conditions depends on CRP and Sxy. I hope it does.