RRResearch: More about the reduced variation at USSs

The analysis I showed in my last post deserves more discussion. First, it could be improved in a couple of ways. Second, is this an expected result for sequences subject to a molecular drive resulting form biased DNA uptake and unbiased homologous recombination?

Improvement one: better controls. As it stands, the comparisons of variation at USS positions is compared to adjacent positions that are not part of the USS motif. This is a bit weak, because some of these positions may contribute to DNA uptake but not show up as 'motif'. A better control comparison would be with random segments of the genome. This is easy to do, as I already have a set of 3500 segments of 39bp (like the USS-centered segments) that I used as a control for the analysis of covariation. Well, 'easy' to see how to do (convert this set to a BLAST database, query it with the three genomes, and extract the alignment positions), but it's still a pain doing the Word-Excel shuffle to extract the position information from the six BLAST searches (each strand of each genomes). But this control would let me draw a band across the graph, probably at about 2% variation, indicating how much variation non-USS sequences have.

Improvement two: more data. Some of the error bars are uncomfortably large, because the datasets don't contain a lot of variation (rarely more than 20 mismatches at any one position even over the ~5000 alignments). Using more datasets would help. Only three complete genomes are available through GenBank, but incomplete versions of 9 more genomes are available. I should concatenate the few largest contigs of at least some of these and use them as queries to the USS database.

Improvement 1.1? Controls with more data? If I do the analysis with more genome sequences, should I also query the control database with them?

What does it all mean? I think we need to start by considering the implications of the molecular drive model for sequence variation. (This is the simplest explanation, invoking forces we already know are acting.) Let's start by imagining that the population of H. influenzae genome sequences are at an evolutionary equilibrium. This means pretending that the environment is unchanging and that all genes have evolved to their optimal sequences at mutation-selection equilibrium (that is, mutations that arise are subject only to purifying selection). Pretend too that the USSs in the genome have also evolved to equilibrium, with the molecular drive balanced by both random mutation and any purifying selection the sequences may be subject to due to their cellular functions.

Consider first some specific location of the USS motif, at a place in the genome that has no cellular function at all. It undergoes random mutation, but its recombination is biased by the specificity of the DNA uptake system. Is this sequence expected to have less 'standing genetic variation' than non-USS positions? Will biased DNA uptake act like stabilizing selection, purging genetic variation that reduces the uptake of the segment?

We need to consider the diversity of this specific sequence in a diverse population of H. influenzae cells. To start with the simplest case, assume that this sequence has the optimal uptake sequence, perfectly matching the bias of the uptake machinery. Also assume that most other cells in the population have the same sequence at this USS site. If it mutates at any USS-motif position, it will be further away from the preferred USS and it will preferentially get converted back. I think this means that biased uptake/recombination acts like stabilizing selection, reducing genetic variation.

What if the sequence isn't a perfect-consensus USS? This requires thinking about a more complex situation. Maybe it isn't 'perfect' because it's already been hit by several mutations but not by replacement. Because both processes are stochastic, some sequences will 'get lucky' and some won't, especially if the molecular drive is only a bit stronger than random mutation (so our Perl model tells us). But assume that this is still the usual version of this site in the population, i.e. the available DNA will usually have this sequence. If this sequence in our cell mutates to a worse matches, it will still be preferentially restored by uptake/recombination, purging the variation. If it mutates to a better match, and then dies, its DNA has a better than average chance of being taken up and recombined into another cell, which I think would increase variation.

I think this means that, if we were to examine many individuals for their sequence at the same particular USS site, we expect to see reduced variation at positions that match the USS consensus, and increased variation at positions that don't match it. But that's not what the analysis I've just done looked at. Instead it summarized the variation at each position across ALL of the USS sites in the Rd genome. Most of these positions did match the consensus, so it's not surprising that, on average, they showed reduced variation.

OK, I seem to have convinced myself that reduced variation is what we should expect at USS sites, if they are being maintained by biased uptake and recombination. But my evolutionarily savvy post-docs are suggesting other causes, so we'll need to put our heads together to see if we can reach agreement.

3 comments:

Brian HaugenJuly 17, 2007 at 8:30 PM
How to download a concatenated file of all H. influenzae genomes from NCBI Entrez Nucleotide

Note: I don't know if this is the best way, just a way.

1. Go to NCBI Entrez Nucleotide.

2. Search for "Haemophilus influenzae [Organism] WGS" and get 576 results , 288 each from GenBank and Refseq. ("WGS" stands for whole genome sequencing, I think, but does not include completed genomes, for some reason.)

3. On the "Limits" tab, select either the "GenBank" or "RefSeq" results from the pop up menu titled "Only from:" to get the 288 GenBank records or the 288 RefSeq records. I am not sure which you should use, but I lean towards GenBank over RefSeq, as it is the submitted form of the record, so my example will proceed using the GenBank results.

4. From the "Display" pop-up, choose "FASTA", and it will give you the FASTA of the first five results (Link may not work).

5. Next, choose to display a maximum 500 of the results, which gives us the FASTA of all 288 results in a web page format (Sorry, no static link).

6. Finally, from the "Send to" pop-up, choose "file", and your web browser should start downloading a text file of the results. .zip Archive of the resulting download (5 MB).

From here, you could probably use this file to make a BLAST database, however, there are 9 nearly empty FASTA records that should be deleted. You can do this many ways, but Iike to do so graphically using the free TextWrangler, by the makers of BBEdit. Once you delete these, you end up with a multi-FASTA file containing genomic sequences from an additional 9 Hin strains:

22.4-21 (44 contigs), 22.1-21 (18 contigs), 3655 (23 contigs), PittAA (40 contigs), PittHH (59 contigs), PittII (25 contigs), R2846 (20 contigs), R2866 (4 contigs), and R3021 (46 contigs). I certainly can't speak to the quality or completeness of these sequences, but you can download my results (.zip, 5 MB), if they would be useful.

I am sure there is some better way to do this, but I haven't been able to find an FTP server where I can locate these files. A particular problem with this method is that it tends to slow your web browser to a crawl. I accomplished steps 4,5, and 6 with a liberal use of the Safari "stop loading page" button. Basically, I let the intermediate step pages begin to load, then stop them before they complete, so I can choose the next setting. With Firefox, I was unable to complete this tutorial, because the browser would become completely non-responsive for at least 10 minutes. Be careful!
Brian HaugenJuly 18, 2007 at 7:42 AM
I made a 5 minute movie screencast showing the above procedure at on my website.
UnknownJuly 18, 2007 at 10:36 AM
Hi

You can use Geneious with the same search pattern you used:

Open Geneious
Got to NCBI-> Nucleotide
On the search tab click more options
On the drop down list select "Organism"
Enter Haemophilus influenzae
Click on the plus sign and enter "WGS" in the new search box
Click search and you getthe same 576 results.

All the resulting sequences will be displayed on the tab below the search. The donwload takes a couple of minutes.
Go to Edit->Select all and then on Tools->Concatenate ...

That's it. The free version has the same features.

Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS