Field of Science

Sxy manuscript revised!

On Friday I sent our revised sxy manuscript back to the journal editor. Even though we (i.e. the post-doc whose work this is) didn't do the experiment the editor and reviewers had originally asked us to do, we're quite confident that the editor will now accept the manuscript. That's because the new experiments the post-doc designed were more appropriate than the requested one, and they produced results that beautifully confirmed our model for how sxy expression is regulated.

Thinking statistically about array results

A few days ago I posted about a problem we're having interpreting some microarray data. That problem could be simply resolved by finding the missing notebook, or activating the relevant memory synapses in the brain of the student who did the work. But there's a bigger and more intrinsic problem with this type of experiment, and that's deciding whether observed differences are due to antibiotic effects or to chance.

Consider data from a single array slide, comparing the amounts of mRNA from each of ~1700 genes from antibiotic-treated and untreated cells. If the antibiotic has no effect on a gene we expect the amounts from treated and untreated cells to be 'similar'. We don't expect them to be exactly the same (a ratio of exactly 1.00) because all sorts of factors generate random variation in signal intensity. So we need to spell out criteria that let us distinguish differences due to these chance factors from differences caused by effects of the antibiotic treatment. Here are some (better and worse) ways to do this:

1. The simplest way is to specify a threshold for significance, for example declaring that only ratios larger than 2.0 (or smaller than 0.5) will be considered as significant (i.e. not due to chance). But this isn't a good way to decide.

2. We could instead use the standard statistical cutoff, declaring that only the 5% highest and 5% lowest values would be considered significant. One problem here is that this criterion would tell us that some differences were significant (5% of 1700 x 2 =170) even if in fact the antibiotic treatment had absolutely no effect.

3. We could improve this by doing a control experiment to see how big we expect chance effects to be. One simple control is to use a microarray where both RNAs are from the same treatment. We can then use a scatter plot to compare the scores for each point. The diagonal line then represents the 1.00 ratio expected in the absence of random factors, and the degree of scatter of the ratios away from 1.00 tells us how big our chance effects typically are.

We could then decide to consider as significant only effects that are bigger than anything seen in the control. Sometimes this could be easy. For example, if expression of one gene was 20-fold higher after antibiotic treatment, whereas control effects never exceeded 5-fold, we'd be pretty sure the 20-fold increase was caused by the antibiotic.

But what if the control effects were all below 2-fold, and two of the genes in our treated RNA sample were up 2.5-fold? Does this mean the antibiotic caused this increase? How to decide?

We need to do enough replicates that the probability of any gene's expression being above our cutoff just by chance is very small. For example, we could require that a gene be expressed above the 2-fold cutoff in each of four independent replicates. Even if our arrays had a lot of random variation, comparing replicates can find the significant effects. Say that our controls tell us that 10% of the genes are likely to score above the 2-fold cutoff just by chance, and we do see about 170 scoring that high in our first real experimental array (comparing treated and non-treated cells). If we do another experimental array, we again expect 10% to score above the cutoff, but if only chance determines score, only about 17 of these are expected to have also scored this high in the first replicate. If we do a third replicate, we only expect 1 or 2 of the genes scoring above 2-fold to have scored this high in both of the first two replicates. You see the pattern. So, if our real four replicates include 6 genes that scored above 2 in all four, we can be quite confident that this is not due to chance. Instead the antibiotic treatment must have caused their expression to be increased.

The real analysis is more complicated, in part because different replicates have different amounts of error, and in part because we might want to take into account the actual expression levels rather than using the simplistic cutoff of 2-fold. Our preliminary look at the real replicate data last week suggested that a few genes were consistently (well, semi-consistently) increased. Tomorrow we're going to take another look at the real data and see if we think this is a result we have confidence in.

The USS manuscript revisions are done

Thanks to heroic work by the post-doc co-author. She's returned the revised manuscript to the journal editor along with a carefully composed response to the concerns raised by the reviewers. If we're lucky it will be now found acceptable. If we're modestly unlucky the editor will send it back to the reviewer who didn't like it, and we may have to wait a while for their evaluation (we had to wait ages for the reviews and re-reviews of our last paper with this journal group). If we're really unlucky they'll decide to reject it, in which case we'll immediately submit it somewhere else. We think it's now quite a good paper.

Making sense of old microarray data

Last night I went over the microarray data from the collaborative experiments that are testing how low concentrations of antibiotics affect H. influenzae gene expression. Unfortunately we've been unable to locate the notebook in which the student who did the work recorded the experimental details, and unless we can sort out some key points the array results will be unpublishable.

We have data from 4 or 5 arrays for each of three treatments. I think that all of the arrays have produced quite noisy data, so none is at all convincing in isolation. The different arrays for each treatment used several independent preparations of RNA, and the hope is that, by pooling the replicate array data for each treatment we can see significant differences.

The first image shows part of the data from one treatment, with the lines joining points indicating the amount of RNA produced from one gene relative to the RNA from untreated cells, as measured by the different arrays. This representation makes it easy to compare results from the replicate arrays.

Here we only see points for two replicates, coloured by the strength of their expression in the first replicate. We see that some lines (genes) have increased expression (red) in both replicates, and some genes consistently have reduced expression (blue). So far so good - these may be genes that are affected by the antibiotics we're testing.

The problem created by the lost notebook is that we don't know, for each sample, which RNA came from the untreated cells and which came from the cells treated with antibiotic. The problem this creates is illustrated by the second graph. This shows the data for a replicate where we may have reversed the information from the treated and untreated RNAs. Notice that the blue lines have moved to the top at this point, and the red lines to the bottom. This tells us that genes that appeared to be induced by the antibiotic (red) in the other replicates are here apparently repressed (at the bottom of the graph), and vice versa.

We could just switch the two RNAs to make the data look more reproducible. But in the absence of notebook evidence about which RNA is which this would scientifically suspect, like manipulating data points to better fit expectations. Worse, we thought we did have some evidence (from an image stored in a lab computer) that helped us decide which RNA was which, but using that evidence is what gave us this apparently-switched replicate.

I'm hoping that a response from the student who did the work will reveal that we've misinterpreted what she told us about these RNAs. If this can't be sorted out we'll need to remove this replicate from our data set.

But what does the reduced variation mean???

The only bit of the USS manuscript that still needs work is my interpretation of what we learn about USS evolution from this new pattern of reduced genetic variation. Usually before I do an experiment I have a hypothesis that predicts a certain outcome if true. But I went into this analysis with no expectations (i.e. I didn't take the time to do the thinking before I did the BLASTing).

In circumstances not involving recombination, reduced genetic variation is usually seen as a consequence of stabilizing selection. Thus reduced variation is expected in genes coding for essential proteins. I think genetic recombination can also affect local variation, by changing the effective population size. But at the USS sites I think this is the result of biased uptake plus (unbiased) recombination. The post-doc coauthor of the paper has a nice explanation:
"The simplest interpretation of this reduced variation is that chromosomal sites containing USSs tend to recombine with a pool of internalized fragments containing relatively few mutations within the USS, presumably because of selection for strong USS by the DNA uptake machinery at the cell surface."
We don't have any direct evidence of how often cells take up DNA in their natural environment (on respiratory tract epithelium if they're H. influenzae), nor how much of this DNA comes from other H. influenzae cells rather than from unrelated bacteria or the human host, nor how biased each uptake event is, given the pool of available DNAs, nor whether unidentified sequence factors affect the probability that a homologous fragment will replace the chromosomal copy. But the reduced variation may be telling us that the net effect of these unknown factors dramatically reduces the rate at which existing USSs diverge.

BLAST analysis done, and redone....

I had almost finished redoing all the BLAST analysis of USS variation using the new settings when I realized that my changes to the analysis had created a new problem.

Originally I had set the probability cutoff for the BLAST searches to 10e-10, because I wanted to exclude alignments to USS sequences that weren't true homologs (i.e. to sequences from unrelated positions in the genome that were similar because they were also USSs, not because they both descended from the same ancestral USS). This constraint meant that I got sequences with only 1 (or no) internal mismatches, and I was quite confident that these would all be true homologs. (In retrospect I might have been overconfident.)

After I did the first control analysis (of randomly selected non-USS sequences) and discovered that my BLAST searches were producing patterns unrelated to USS, I lowered the probability cutoff to 10e-4. This meant that I often got several different genomic sequences aligned to the same 39 bp USS sequence. I know that only one of these could be the true homolog, and I assumed that the best match would indeed be the true homolog (I had worked out an easy way to eliminate all but the best match). This would have been true if (1) all of the USS sequences in the H. influenzae Rd genome did indeed have true homologs in each of the other H. influenzae strain genomes, and if (2) I hadn't been searching the two strands of the genome separately. But (1) a small fraction almost certainly don't, and (2) I was, so a significant fraction of the alignments I was scoring were to the wrong sequences.

This wouldn't have been a big deal if including wrong sequences just introduced noise into the analysis, lowering its ability to detect differences caused by the USSs. But a long and heated discussion with one of the postdocs had helped me realize that including non-homologous USSs would have precisely the effect I was attributing to homologous USSs - reducing the amount of variation at USS positions relative to non-USS positions.

I had been thinking that the low probability cutoff (really 'high', as 10e-4 is higher than 10e-10) was good because it produced a set of alignments with more mismatched positions and thus more variation to feed into the analysis. But some of this variation was due to mistaken alignments, so I decided to go back to the original sets of alignments and redo the analysis, using only those that had no more than three mismatched positions. This eliminated almost all of the non-homologous alignments.

The graphs above show the difference between the two analyses for one of the three non-Rd genomes. I've cut off the 5 non-USS positions at each side, so these are only 29 bp wide. The 'better analysis' has lower genetic variation (Y axis goes to 3, not 4), because the alignments I had eliminated were the ones with the most variation (some of it due to non-homologous alignments), but the pattern is the same.

This is reassuring for two reasons. First, the pattern I saw in the earlier analyses (low variation at USS-motif positions) is still there, and still very strong, so I know it wasn't due to including non-homologous alignments. Second, because eliminating most of the problem alignments hasn't caused the pattern to become much weaker, I can be confident that any non-homologous alignments that might have slipped past the new more stringent cutoff aren't responsible for the pattern.

I redid the controls too with the new cutoff, and they give baseline variation of about 1.3%, with standard deviation of about 0.15%. So the least variable USS positions have much less variation than the rest of the genome, and the most variable have about twice as much. This high variation may be simply because the set of USS sequences has overall more variation than the rest of the genome. This is expected to be the case because USS are preferentially located in poorly constrained genomic positions such as non-coding regions and non-essential genes.

Now we just have to get the revised manuscript back to the journal editors before they lose patience with us.

BLAST problem solved



The reason BLAST was finding no variation at the central positions of the 39 nt sequences was that I had set its 'word' length to 20. This told it to begin searching by looking for perfect matches to an initial sequence that was 20 nt long. This meant that it could never find a mismatch at the central position because such a mismatch would have been flanked by matched segments that were, at best, 19 nt long.

So I tested word lengths of 10 and of 8. Both eliminated the central mismatch problem, at the trivial expense of increasing search time to at worst 10 seconds for the whole genome.

I'm gradually getting a better understanding of how BLAST searches work (it takes me a while to absorb the complexities), so I've also improved the searches in other ways - allowing higher "E-values" so I get sequences with more than one mismatch, and setting the maximum number or results to a value appropriate for my database size. I've also improvved my Word/Excel shuffle methods, so I get a cleaner dataset. And I now carefully note the numbers of sequences at the various steps.

The graph above is the control analysis for only one orientation of only one of the geneome sequences. So now I'm ready to search all three genomes against the control dataset and, if this looks good, against the USS dataset.

Aaarrrgghhh!!!

This is why we must ALWAYS do the controls.

The graph to the left (dark red bars) is the first part of my control analysis. It shows the distribution of mismatches across 1677 random 39 bp sequences from the H. influenzae Rd genome, aligned with their best matches in the genome of another H. influenzae strain called PittEE.

It has some disturbing similarities to the distribution of mismatches in the graph I posted yesterday (shown here in a different version, with the blue bars). Both have more mismatches at the edges, and both have none at the central position.

My interpretation is that the BLAST searches I'm using to do the alignments have a bias favouring mismatches at the edges and precluding any mismatches at the central position (the latter might be because I set the E-value cutoff too low). So I either need to do the searches differently, to eliminate these biases, or find a way to correct my USS-sequence searches for this bias.

I've just sent a detailed email off to the BLAST help desk, attaching both of the above figures, and asking if they can suggest changes to my search parameters or a reference document I should read.

More about the reduced variation at USSs

The analysis I showed in my last post deserves more discussion. First, it could be improved in a couple of ways. Second, is this an expected result for sequences subject to a molecular drive resulting form biased DNA uptake and unbiased homologous recombination?

Improvement one: better controls. As it stands, the comparisons of variation at USS positions is compared to adjacent positions that are not part of the USS motif. This is a bit weak, because some of these positions may contribute to DNA uptake but not show up as 'motif'. A better control comparison would be with random segments of the genome. This is easy to do, as I already have a set of 3500 segments of 39bp (like the USS-centered segments) that I used as a control for the analysis of covariation. Well, 'easy' to see how to do (convert this set to a BLAST database, query it with the three genomes, and extract the alignment positions), but it's still a pain doing the Word-Excel shuffle to extract the position information from the six BLAST searches (each strand of each genomes). But this control would let me draw a band across the graph, probably at about 2% variation, indicating how much variation non-USS sequences have.

Improvement two: more data. Some of the error bars are uncomfortably large, because the datasets don't contain a lot of variation (rarely more than 20 mismatches at any one position even over the ~5000 alignments). Using more datasets would help. Only three complete genomes are available through GenBank, but incomplete versions of 9 more genomes are available. I should concatenate the few largest contigs of at least some of these and use them as queries to the USS database.

Improvement 1.1? Controls with more data? If I do the analysis with more genome sequences, should I also query the control database with them?

What does it all mean? I think we need to start by considering the implications of the molecular drive model for sequence variation. (This is the simplest explanation, invoking forces we already know are acting.) Let's start by imagining that the population of H. influenzae genome sequences are at an evolutionary equilibrium. This means pretending that the environment is unchanging and that all genes have evolved to their optimal sequences at mutation-selection equilibrium (that is, mutations that arise are subject only to purifying selection). Pretend too that the USSs in the genome have also evolved to equilibrium, with the molecular drive balanced by both random mutation and any purifying selection the sequences may be subject to due to their cellular functions.

Consider first some specific location of the USS motif, at a place in the genome that has no cellular function at all. It undergoes random mutation, but its recombination is biased by the specificity of the DNA uptake system. Is this sequence expected to have less 'standing genetic variation' than non-USS positions? Will biased DNA uptake act like stabilizing selection, purging genetic variation that reduces the uptake of the segment?

We need to consider the diversity of this specific sequence in a diverse population of H. influenzae cells. To start with the simplest case, assume that this sequence has the optimal uptake sequence, perfectly matching the bias of the uptake machinery. Also assume that most other cells in the population have the same sequence at this USS site. If it mutates at any USS-motif position, it will be further away from the preferred USS and it will preferentially get converted back. I think this means that biased uptake/recombination acts like stabilizing selection, reducing genetic variation.

What if the sequence isn't a perfect-consensus USS? This requires thinking about a more complex situation. Maybe it isn't 'perfect' because it's already been hit by several mutations but not by replacement. Because both processes are stochastic, some sequences will 'get lucky' and some won't, especially if the molecular drive is only a bit stronger than random mutation (so our Perl model tells us). But assume that this is still the usual version of this site in the population, i.e. the available DNA will usually have this sequence. If this sequence in our cell mutates to a worse matches, it will still be preferentially restored by uptake/recombination, purging the variation. If it mutates to a better match, and then dies, its DNA has a better than average chance of being taken up and recombined into another cell, which I think would increase variation.

I think this means that, if we were to examine many individuals for their sequence at the same particular USS site, we expect to see reduced variation at positions that match the USS consensus, and increased variation at positions that don't match it. But that's not what the analysis I've just done looked at. Instead it summarized the variation at each position across ALL of the USS sites in the Rd genome. Most of these positions did match the consensus, so it's not surprising that, on average, they showed reduced variation.

OK, I seem to have convinced myself that reduced variation is what we should expect at USS sites, if they are being maintained by biased uptake and recombination. But my evolutionarily savvy post-docs are suggesting other causes, so we'll need to put our heads together to see if we can reach agreement.

And the result is....

Between-strain variation in nucleotide sequence is greatly reduced at positions that are part of the USS motif. This is clearly seen in the figure to the left, where the blue bars representing the amount of variation for each position are small at the positions where the motif bases are tall (strong consensus). The error bars are the standard deviations of the six datasets (forward and reverse strands of the three readily available genomes) This was about 5000 alignments.

For the flanking AT-rich segments the magnitude of the reduction is proportional to the strength of the motif consensus, but variation in the USS core is most strongly reduced in the initial As.

This result was predicted last year by one of the post-docs, but at that time I didn't see an easy way to test it. How should we interpret it? Perhaps it means that most of the USSs in the genome have evolved to the optimum compromise between the preference of the uptake machinery and the needs of within-cell functions. Why don't the differences in % variation map precisely onto the logo?

BLAST success, analysis success

Everything is working.

I still can't get BLAST to attend to matches close to the ends of the 39 nt fragments, but I'm treating these as mismatches at the innermost position and 'no information' at positions closer to the end. For example, if a sequence matches at positions 4-39, I assume there's a mismatch at position 3 and that I have no information about positions 1 and 2.

I'm searching for the two USS orientations separately (searching the forward and reverse strands of the query genome separately). I'm analyzing the data separately. So far I've analyzed only the forward searches, but I'll need to flip the results I'll get from analyzing the reverse searches.

I'm collecting the output as pairwise comparisons between query and USSs, because this makes it easiest to pull out the positions of the mismatches without the information about what kind of a mismatch each is.

I'm doing the analysis by bouncing the file back and forth a couple of times between Word and Excel, using Word to first insert tabs between the output lines, and then Excel to delete the columns (formerly lines) I don't want (including the actual sequences) and to sort the results by both the match score and the locations of the ends of the matches. Then I use Word to insert tabs between each position of each alignment, and Excel to count the numbers of mismatches at each position and to graph the results.

Next post I'll describe the results.

I can run BLAST locally!

A few suggestions from the post-doc ("Try keeping all the BLAST files in the same folder.") got my Unix problems solved. Commenters on my last post have provided lots of suggestions, some of which are useful and some of which address problems I've already solved. Neil has even set up a NodalPoint page about this problem, but I haven't yet figured out how to edit in my comments.

I've now used standalone (local) BLAST to search the database of 2136 different 39 nt segments of the H. influenzae Rd genome with the source genome (Rd) sequence. This is the 'positive control' for the searches I want to do with the non-Rd genome sequences. This worked, and let me do some trials to work out which options I should specify. Then I did the same searches with the three other genomes I have. This showed me other problems, some of which I haven't worked out yet.

I have two big remaining problems.

First, I need to understand BLAST searches well enough to optimize the alignments. I understand that mismatches close to the ends of the sequences will be under-represented, because of how BLAST works. I think this appeared in the search results - instead of alignments with single mismatches near the ends I think I got alignments that had been shortened by 4 nt. I may be able to minimize this by setting the options appropriately, but I probably can't eliminate it. Luckily the first and last 5 nt are the least important for this analysis.

Second, the information I need to get from the non-Rd searches is the locations of mismatches within the 39 nt segments. For this I found that representing the output as pairwise alignments made easiest to extract the information that specifies the positions of the mismatches within the alignments. This is relatively straightforward (yes Neil, with Word and Excel) provided the mismatches are internal to 39 nt alignments, but will need some more sophisticated tricks for alignments that are truncated at one or the other end. Another problem for extracting the position info is that half of the 39 nt sequences are in the opposite orientation to the others, and so they are reversed in the output.

The guy in the next office strongly recommended BBEdit, so I bought that. He said it does a lot of the editing chores that I'd otherwise need to write Perl scripts to do. Sounded great. But BBEdit has wandered far from its "Bare Bones Software" (="BB") roots, and learning how to use it will take some time...

Preparing to run BLAST locally

I asked the guy in the next office for advice about using BLAST to characterize the pattern of variation in and around USS sites in the sequenced H. influenzae strain genomes. His advice was to set up BLAST on one of our own computers (not a big deal, he said), to create a BLAST database containing all of the ~2000 USS sequences (39 nt each) from the H. influenzae Rd genome, and then to 'query' this database with the genome sequences of each of the other strain genomes in turn. This has the big benefit of creating only one output file for each genome, rather than the one file for each SS site I would have if I used the 39 nt USS sequences as queries.

Finding the downloadable files at NCBI was quite easy, and I now have a BLAST folder in the applications folder of our fast Mac (Stingray). I also had no trouble converting my file of ~2000 USS sequences into FASTA format, with each entry given an identifying number (I don't need that but BLAST does). And I downloaded the genome sequence files - only 3 other H. influenzae genome sequences turned out to be directly available from GenBank; the others apparently still must be obtained from the various sequencing centers. One of the post-docs may already have these, but if not I think I don't really need data from more than three anyway.

I also found the instructions for formatting commands for BLAST searches, and (must do first) how to use the 'formatdb' program to convert my FASTA file into a recognized BLAST database file. BUT, as usual, I'm stuck at the very basics of using Unix commands in the Mac Terminal interface. I'm hoping one or other of the post-docs will be able to set me straight today.

Molecular drive on USSs drives tripeptide frequencies!

Here's a nifty new result. I was working on the Results section of the USS manuscript revisions, to include a result I posted about last month.

I had found that the distribution of coding-region USSs between the six possible reading frames paralleled the abundance of the tripeptides these USSs would encode in the H. influenzae proteome. This suggested that the reason some frames had more USSs than others was because USSs in some frames encoded more versatile tripeptides than others.

I now decided that this analysis needed some controls, so I scored the frequencies of the same USS-encodable tripeptides in the proteomes of several other bacteria: E. coli, B. subtilis, and (because it has the same base composition as H. influenzae) Listeria monocytogenes. The relative proportions of these six tripeptides seemed comparable to those of H. influenzae, but I noticed that the raw numbers were smaller than I would expect given the sizes of the respective proteomes (all the control bacteria have larger genomes than H. influenzae). So I normalized the tripeptide counts by dividing by the total number of amino acids in each proteome, to get what I could call tripeptide density.

The results are in the top graph. The blue bars are the H. influenzae tripeptide densities, the other colours are the tripeptide densities of the controls. Note that the H. influenzae proteome has disproportionately high densities of the USS-specifiable tripeptides.

So the 'control' needs another control - the densities of tripeptides that aren't encodable by USSs. I chose these by taking the same amino acids in backwards order (e.g. VAS instead of SAV). This is good because it doesn't change the abundances of the single amino acids making up the tripeptide. Here's that graph. Again the blue bars are H. influenzae and the other colours are the control proteomes. And note that now the blue bars are nothing special - H. influenzae has the same densities of these tripeptides as do the control proteomes.

So what does this mean? It means that the biased uptake and recombination of USS-containing DNA fragments has not been neutral for the proteome. USS not only accumulate at protein-coding positions where they don't change the amino acids, they accumulate at positions that wouldn't otherwise have used these amino acids.

My collaborator in Ottawa has already shown that the corresponding 'preferred ' USS-encodable tripeptides are overrepresented in proteomes containing USS. What I've done here is extended that result to show that even the tripeptides we thought were least versatile have become overrepresented in the H. influenzae genome.

Clarification of the analysis I want to do

Commenters are offering help and advice! Damn, I love open science! But I need to clarify what I'm trying to do.

I have a FASTA file with about 2000 segments of the H. influenzae strain Rd genome, each exactly 39bp (a few are shown on the left (yes, I know this is not FASTA format)).





Genbank has genome sequences of 12 other H. influenzae strains (shown on the left). I want to take each of the 2000 Rd sequences and find the sequences of its homologs in the other strains' genomes. If all the genomes have homologs of all the sequences, that will give me, in principle, 24,000 39bp sequences.






I've done this manually for a few of the 39bp sequences. Often BLAST says it found no matches - I don't really understand why.

When BLAST agrees to do what I want, it produces alignments like those on the left. The upper one shows an Rd sequence to which all the genomes have perfect matches; the lower one shows an Rd sequence several of whose homologs have a single mismatch to (yikes, is that syntax correct?).

What I need to do is automate this process, so BLAST searches the 12 genome sequences with each of the 2000 queries. Ideally the output will be in a form that I can easily use for the next step.

Because each of my ~2000 query sequences is centered on one of the ~2000 USS motifs in the Rd genome, they can all be aligned with each other (look back at the top figure to see this). What I want to do is create a meta-alignment of all the 24,000 homologs to these 2000 sequences. Then I will use this meta-alignment to score, for each of the 39 aligned positions how often one of the homologous sequences has a mismatch to its query sequence. That is, I want to align all 24,000 sequences (or all 24,000 patterns of dots), and create a histogram showing how often each position differs from its Rd homolog.

Only some of the 39 positions make significant contributions to the USS motif; seven internal positions and the five at each end are not part of the motif. I hope this analysis will show me whether positions that contribute to the USS motif are more or less variable than nearby USS-independent positions.

Maybe we CAN do this analysis...

One of the reviewers of our USS manuscript asked that we compare the sequences of the USSs we'd found in the Haemophilus influenzae Rd genome with the homologous sequences in the genomes of other sequenced H. influenzae isolates. We've been planning to dismiss this as both too difficult ("beyond the scope of the present work") and unlikely to yield any relevant information. I still expect that such a comparison will probably show that the USS sequences are just like the rest of the genome, with a few differences here and there. But I've just realized that the analysis might be much easier than I had thought.

The Gibbs motif search provided us with a file containing about 2000 sequences, each about 40bp centered on a USS. In principle it should be straightforward to use BLAST to search the other available H. influenzae genome sequences for matches to each of these USS sequences. The USS sequences are long enough that truly homologous sequences will be found; I expect most of these to be perfectly matched, with the occasional single-base mismatch.

I think I can easily use my highly developed 'find and replace' MS Word skills to convert the list of USS sequences to the FASTA format needed by BLAST. I don't know how to set up a search that uses multiple input sequences to search a few genomes, but I'm pretty sure that the guy in the next office will be able to help me. I'm not sure what form the results will take, and this may be the complicating factor. The standard BLAST display isn't designed for this kind of analysis. But output will probably be provided as a text file..

OK, I'm doing a test run, searching for one fake USS sequence against the 13 (wow!) H. influenzae genome sequences in the BLAST microbial genomes database. Result: "No significant similarity found. For reasons why click here." OK, I should be using the 'Search for short nearly exact matches' rather than regular BLAST. Hmmm..., various complications. I think I need to try this using a real USS sequence rather than a made up one. I'll play around more later.

USS manuscript angst

I'm not making much progress on revising the USS manuscript, I think partly because, deep down, I'm worried that our logic is a bit weaker and more circular than I'm happy with. So I'll try to lay it out below.

Our goal for our USS research is to find out whether these sequence motifs are abundant simply because of the molecular drive caused by the biased DNA uptake machinery, or because natural selection favours cells that take up DNAs from closely related cells/species. This latter explanation arises from the common assumption that cells take up DNA because recombination generates benefits, and is inconsistent with our proposal that cells take up DNA simply as food.

We seem to be trying to discriminate between these explanations by finding out whether the bias of the uptake machinery matches the motif accumulated in the genome. But I think this is not a good test, because such a correspondence is consistent with both explanations. Furthermore, the DNA uptake data in the manuscript aren't compelling enough to push this very weak argument.

But why am I thinking of doing a big rewrite at this stage? Why don't we just make the minimum improvements necessary to satisfy the reviewers and send the damn thing back in? Now that I've reread the whole manuscript and reminded myself of what we said, I guess I'd better reread the reviews, and the responses suggested by the post-doc author, and see if I think we can do this.

Revising the USS manuscript

Our manuscript about the within-genome diversity of uptake signal sequences (USSs) and their effect on DNA uptake was provisionally accepted, but we need to do substantial revisions to satisfy the reviewers, especially as we really don't want to have to do more experiments for this manuscript. On rereading the Introduction, I realized that changing the perspective might be good.

In the present version of the Introduction we do the following (each point is a paragraph):
  1. Introduce natural competence.
  2. Describe the general process of DNA uptake.
  3. Explain that H. influenzae and Neisseria preferentially take up DNA fragments with specific sequences.
  4. Describe how the H. influenzae uptake specificity was determined.
  5. Describe how the USSs in the H. influenzae genome were characterized once the genome sequence became available in 1995.
  6. Describe the USSs in the genome and possible non-uptake functions for these USSs.
  7. Describe the distribution of USSs in the genomes of related bacteria.
  8. Say that USSs arise by point mutation (not insertion) and that molecular drive can explain their accumulation in the genome.
  9. Provide a mechanistic explanation of why the DNA uptake machinery might be biased towards a specific sequence.
  10. Explain why we need a better understanding of the USS motif and the specificity of the uptake machinery.
  11. Summarize the results that the manuscript presents.
My alternative perspective would frame the results in the context of repeated sequence evolution rather than of competence. It would go as follows:
  1. All genomes contain repeats, which usually fall into one of three kinds: dispersed multi-copy elements that spread by copying and insertion (selfish DNA); motifs that arise by point mutation and interact with specific cellular DNA-binding proteins (e.g. binding sites for transcription factors); and simple-sequence repeats that in bacteria arise by polymerase slipping and function to promote gene activation/inactivation by frameshifting. This paper is about an abundant repeat that doesn't fit any of these.
  2. These repeats are called uptake signal sequences. Describe what's known about length, copy number, consensus, distribution in genome and between species. How we know they are motifs and not insertions.
  3. Possible within-cell functions and relevant information from bioinformatics.
  4. Occur only in a few naturally competent bacteria. Discovered because are preferred by uptake machinery. Similarity between preferred sequence and genome abundance can't be coincidental.
  5. Describe competence. Usual explanation for USS is selection as mate-choice marker. Why this is problematic.
  6. Describe alternative explanation = molecular drive: USSs accumulate as a side effect of uptake bias and recombination.
  7. Provide a mechanistic explanation of why the DNA uptake machinery might be biased towards a specific sequence.
  8. Explain why we need a better understanding of the USS motif and the specificity of the uptake machinery. Focus on molecular drive as an important phenomenon in its own right, as well as on why cells take up DNA.
  9. Summarize the results that the manuscript presents.
I think this might be a better outline - appealing more broadly to people interested in genome evolution as well as to those interested in competence. But I wonder if it's worth making this big a change to the manuscript at this stage.

Should we write a proposal to NIH?

Yesterday one of the post-docs and I discussed whether we should submit a proposal to NIH. Yes, we did just get one proposal funded, to work on the regulation of competence genes in H. influenzae and E. coli. But her project is completely different, and some parts of it are going to be expensive.

She's studying when, how and why different lineages of H. influenzae lose (or maybe gain) the ability to take up DNA and recombine it into the chromosome. We already knew that this occurs in various bacteria, and she's now completing a thorough analysis of the variation in DNA uptake and recombination ability in a broad selection of H. influenzae strains.

Some of these strains were chosen because their genomes have been or are in the process of being completely sequenced (one of the benefits of working on a sometime pathogen), and her next goal is to analyze these sequences for differences that could explain their different phenotypes. This bioinformatics work won't cost much except her time; we've already bought a nice fast computer for it. And her time doesn't cost the lab anything, because she's supported by a lovely post-doctoral fellowship from NIH.

But the next steps will be expensive. She wants to use the bioinformatics information to design investigations into the genetic differences of strains that haven't been sequenced. Her original plan was to develop a microarray chip containing all the genes and alleles that the bioinformatics and other work suggested might be involved. This still seems like a good approach, but the field is changing so fast that better ways to survey genomes are becoming available faster than we can keep up. One thing they have in common is that they'll all cost a bundle.

Subsequent work will also be pricey. We'll probably want to follow up the H. influenzae findings with investigation into related bacteria. This will be beyond the scope of the present post-doc, so we'll need new post-docs or grad students or technicians, as well as money for the tools and techniques.

The other reason to write a proposal to NIH is that proposal-writing is the best framework I know of for clear thinking about research plans.

Why I don't work on sex in eukaryotes

I just got back from the SMBE meeting and the final meeting of the CIfAR (formerly CIAR) Evolutionary Biology program. I made a point of going to most of the sessions on deep phylogeny, and they reminded me of why I decided not to work on the evolutionary origins of sexual reproduction in Eukaryotes.

John Logsdon's talk reminded me of why and how I originally was planning to work on this. "The Evolution of Sex" is still one of the biggest unsolved problems in evolutionary biology; we don't know how or why eukaryotic sexual reproduction (by meiosis and gamete fusion) originated and we don't know why it's been so evolutionarily successful. Most research into this is theoretical, or involves comparing populations that reproduce with and without sex.

My plan (10 or 15 years ago) was to instead investigate the origin of sexual reproduction by first identifying the deepest-branching evolutionary lineage that has sexual reproduction or homologs of meiosis-specific genes, and then determining the functions of these homologs and/or of sexual reproduction in the present-day members of this lineage.

When I first made this plan, the field of deep phylogeny was full of optimism that we would soon have a reliable phylogenetic tree showing the true relationships of all eukaryotes. The figure to the left shows the state of the tree then. The groups called Diplomonads, Trichomonads and Microsporidia were thought to be the first to have branched from the line that eventually led to plants and animals. Microsporidians were known to have (various weird forms of) sexual reproduction, while sex was unknown in Trichomonads and Diplomonads, consistent with an origin between the times of these divergences. The precise branching order wasn't yet firm, but most researchers felt that all we needed was a few more sequences of a few key taxa, and maybe a few refinements to the analytical methods used to infer relationships from sequences.

But the adjacent figure shows the present state of this phylogeny. It's from a recent review by a collaboration of CIfAR researchers, all but one of whom were at last week's meetings (Keeling et al. Trends in Ecology & Evolution 20:670-676). There's a pdf here; I think it's open access.

The new tree organizes the eukaryotes into five main lineages, with no real information about the order in which these groups originated (the 'root' (the center of the figure) is shown as what's called a 'star' phylogeny). The division into five groups is itself very speculative, as there are very deep divergences within each group. Bottom line? we don't know which eukaryote lineages arose first, and few researchers are now confident that we will ever know.

And where is sexual reproduction in this tree? Just about everywhere; the apparently asexual groups are dispersed with the sexual groups. So I don't regret my decision 7 years ago to abandon my goal of placing the origin of sex on the eukaryote tree.

And what about the goal of identifying the genes specific to meiosis, and characterizing their distribution in eukaryotes? A number of genes have been described as 'meiosis-specific', and in his SMBE talk John reported on their distribution and relationships to other genes, especially in those unicellular eukaryotes that might be representatives of early branches. Most of these genes are present in most of the species he's looked at, and in most cases they appear to have arisen by gene duplication.

One of the biggest challenges with this work is being sure that a gene is indeed meiosis-specific in all of the species it's found in - a gene that is meiosis-specific in yeast could have a related but not meiosis-specific role in a distant relative. This is very difficult to determine, especially because sex has never been observed in some of these species (if they are sexual, they're also very secretive). I'm glad that John Logsdon has been working on this, rather than me.