RRResearch: genetic variation

The analysis I showed in my last post deserves more discussion. First, it could be improved in a couple of ways. Second, is this an expected result for sequences subject to a molecular drive resulting from biased DNA uptake and unbiased homologous recombination?

Improvement one: better controls. As it stands, variation at USS positions is compared with variation at adjacent positions that are not part of the USS motif. This is a bit weak, because some of these positions may contribute to DNA uptake but not show up as 'motif'. A better control comparison would be with random segments of the genome. This is easy to do, as I already have a set of 3500 segments of 39bp (like the USS-centered segments) that I used as a control for the analysis of covariation. Well, it's 'easy' to see how to do (convert this set to a BLAST database, query it with the three genomes, and extract the alignment positions), but it's still a pain doing the Word-Excel shuffle to extract the position information from the six BLAST searches (each strand of each genome). But this control would let me draw a band across the graph, probably at about 2% variation, indicating how much variation non-USS sequences have.
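Once per-position mismatch counts have been extracted from the control searches, the band itself is simple arithmetic. The following is only a sketch of one way to do it: the function name `control_band`, the choice of "mean plus or minus one standard deviation of the per-position rates" as the band, and the example numbers are all assumptions made for illustration, not something taken from the actual analysis.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def control_band(mismatches: List[int], alignments: List[int]) -> Tuple[float, float, float]:
    """Return (pooled_rate, band_low, band_high) for control segments.

    mismatches[i] is the number of mismatches seen at position i,
    alignments[i] is the number of alignments informative at position i.
    The band is mean +/- one standard deviation of the per-position rates
    (an assumed definition, chosen only for this sketch).
    """
    if not mismatches or len(mismatches) != len(alignments):
        raise ValueError("need two non-empty lists of equal length")
    if any(n <= 0 for n in alignments):
        raise ValueError("every position needs at least one alignment")

    rates = [m / n for m, n in zip(mismatches, alignments)]
    pooled = sum(mismatches) / sum(alignments)
    centre = mean(rates)
    spread = pstdev(rates)
    return pooled, centre - spread, centre + spread


# Hypothetical counts for three control positions, for illustration only.
pooled, low, high = control_band([2, 4, 6], [100, 100, 200])
print(f"pooled rate {pooled:.3f}, band {low:.4f} to {high:.4f}")
```

The pooled rate gives the height of the line across the graph, and the low and high values give the edges of the band.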

Improvement two: more data. Some of the error bars are uncomfortably large, because the datasets don't contain a lot of variation (rarely more than 20 mismatches at any one position even over the ~5000 alignments). Using more datasets would help. Only three complete genomes are available through GenBank, but incomplete versions of 9 more genomes are available. I should concatenate the few largest contigs of at least some of these and use them as queries to the USS database.
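To see why so few mismatches give wide error bars, it helps to treat each position as a binomial sample. This is a sketch under that assumption; I'm not claiming it is how the error bars on the graph were calculated, and the function name `mismatch_rate_with_se` is invented for the example. The numbers plugged in (20 mismatches, 5000 alignments) are the rough figures quoted above.

```python
import math
from typing import Tuple


def mismatch_rate_with_se(mismatches: int, alignments: int) -> Tuple[float, float]:
    """Return (rate, standard_error) assuming binomial sampling."""
    if alignments <= 0 or not 0 <= mismatches <= alignments:
        raise ValueError("need 0 <= mismatches <= alignments and alignments > 0")
    rate = mismatches / alignments
    standard_error = math.sqrt(rate * (1 - rate) / alignments)
    return rate, standard_error


rate, se = mismatch_rate_with_se(20, 5000)
print(f"rate = {rate:.4f}, standard error = {se:.5f}")
```

With these numbers the standard error is a bit over a fifth of the estimate itself. Under the same binomial assumption, the standard error shrinks with the square root of the number of alignments, so roughly four times as many alignments would be needed to halve it.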

Improvement 1.1? Controls with more data? If I do the analysis with more genome sequences, should I also query the control database with them?

What does it all mean? I think we need to start by considering the implications of the molecular drive model for sequence variation. (This is the simplest explanation, invoking forces we already know are acting.) Let's start by imagining that the population of H. influenzae genome sequences is at an evolutionary equilibrium. This means pretending that the environment is unchanging and that all genes have evolved to their optimal sequences at mutation-selection equilibrium (that is, mutations that arise are subject only to purifying selection). Pretend too that the USSs in the genome have also evolved to equilibrium, with the molecular drive balanced by both random mutation and any purifying selection the sequences may be subject to due to their cellular functions.

Consider first some specific location of the USS motif, at a place in the genome that has no cellular function at all. It undergoes random mutation, but its recombination is biased by the specificity of the DNA uptake system. Is this sequence expected to have less 'standing genetic variation' than non-USS positions? Will biased DNA uptake act like stabilizing selection, purging genetic variation that reduces the uptake of the segment?

We need to consider the diversity of this specific sequence in a diverse population of H. influenzae cells. To start with the simplest case, assume that this sequence has the optimal uptake sequence, perfectly matching the bias of the uptake machinery. Also assume that most other cells in the population have the same sequence at this USS site. If it mutates at any USS-motif position, it will be further away from the preferred USS and it will preferentially get converted back. I think this means that biased uptake/recombination acts like stabilizing selection, reducing genetic variation.
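This simplest case can be put into a toy calculation. The sketch below is not our Perl model; it is a deliberately crude deterministic recursion for a single position with two states (matches the preferred base, or doesn't), and every parameter value is made up. It assumes symmetric mutation between the two states, that a fixed fraction of cells replace the position by recombination each generation, and that mismatched donor DNA is taken up `uptake_bias` times less readily than matched DNA.

```python
def mismatch_frequency(mutation_rate: float, recomb_rate: float,
                       uptake_bias: float, generations: int,
                       start: float = 0.0) -> float:
    """Frequency of the mismatched state after the given number of generations.

    uptake_bias = 1.0 means unbiased uptake; larger values favour donor DNA
    that matches the preferred sequence. All parameters are hypothetical.
    """
    q = start  # population frequency of the mismatched state
    for _ in range(generations):
        # Symmetric mutation between matched and mismatched states.
        q = q * (1 - mutation_rate) + (1 - q) * mutation_rate
        # Chance that the DNA taken up carries the mismatch, given the bias.
        donor_q = q / (q + uptake_bias * (1 - q))
        # A fraction recomb_rate of cells replace the position with donor DNA.
        q = (1 - recomb_rate) * q + recomb_rate * donor_q
    return q


unbiased = mismatch_frequency(0.01, 0.5, 1.0, 2000)
biased = mismatch_frequency(0.01, 0.5, 5.0, 2000)
print(f"unbiased uptake: {unbiased:.3f}   biased uptake: {biased:.3f}")
```

With unbiased uptake, recombination changes nothing on average and mutation alone sets the frequency. With biased uptake the mismatched state is held at a much lower frequency, which is the stabilizing-selection-like behaviour argued for above. The model ignores drift, the four-base alphabet, and the rest of the motif, so it only illustrates the direction of the effect.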

What if the sequence isn't a perfect-consensus USS? This requires thinking about a more complex situation. Maybe it isn't 'perfect' because it's already been hit by several mutations but not by replacement. Because both processes are stochastic, some sequences will 'get lucky' and some won't, especially if the molecular drive is only a bit stronger than random mutation (so our Perl model tells us). But assume that this is still the usual version of this site in the population, i.e. the available DNA will usually have this sequence. If this sequence in our cell mutates to a worse match, it will still be preferentially restored by uptake/recombination, purging the variation. If it mutates to a better match, and the cell then dies, its DNA has a better than average chance of being taken up and recombined into another cell, which I think would increase variation.

I think this means that, if we were to examine many individuals for their sequence at the same particular USS site, we expect to see reduced variation at positions that match the USS consensus, and increased variation at positions that don't match it. But that's not what the analysis I've just done looked at. Instead it summarized the variation at each position across ALL of the USS sites in the Rd genome. Most of these positions did match the consensus, so it's not surprising that, on average, they showed reduced variation.

OK, I seem to have convinced myself that reduced variation is what we should expect at USS sites, if they are being maintained by biased uptake and recombination. But my evolutionarily savvy post-docs are suggesting other causes, so we'll need to put our heads together to see if we can reach agreement.

Everything is working.

I still can't get BLAST to attend to matches close to the ends of the 39 nt fragments, but I'm treating these as mismatches at the innermost position and 'no information' at positions closer to the end. For example, if a sequence matches at positions 4-39, I assume there's a mismatch at position 3 and that I have no information about positions 1 and 2.
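The end-of-fragment rule is easy to state in code. This is a sketch of the rule as described above, not the procedure actually used (which was done by hand in Word and Excel); the function name `position_calls` and the True/False/None encoding are choices made for the example.

```python
from typing import List, Optional, Set

FRAGMENT_LENGTH = 39  # length of the USS-centred fragments


def position_calls(aln_start: int, aln_end: int,
                   internal_mismatches: Optional[Set[int]] = None,
                   length: int = FRAGMENT_LENGTH) -> List[Optional[bool]]:
    """One call per fragment position (list index 0 is position 1).

    True = mismatch, False = match, None = no information.
    aln_start and aln_end are the 1-based ends of the BLAST match.
    """
    if not 1 <= aln_start <= aln_end <= length:
        raise ValueError("alignment must lie within the fragment")
    internal = internal_mismatches or set()

    calls: List[Optional[bool]] = [None] * length
    # Positions inside the match: mismatch only where BLAST reported one.
    for pos in range(aln_start, aln_end + 1):
        calls[pos - 1] = pos in internal
    # The position just outside each truncated end is assumed to be a mismatch.
    if aln_start > 1:
        calls[aln_start - 2] = True
    if aln_end < length:
        calls[aln_end] = True
    # Anything further out stays None (no information).
    return calls


calls = position_calls(4, 39)
print(calls[:4])  # positions 1-4: [None, None, True, False]
```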

I'm searching for the two USS orientations separately (searching the forward and reverse strands of the query genome separately), and I'm analyzing the two sets of results separately. So far I've analyzed only the forward searches; when I analyze the reverse searches I'll need to flip the results.
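Flipping just means reversing the position order, so that position 1 of a reverse-orientation result lines up with position 39 of a forward one. The sketch below assumes the reverse-search results are per-position counts numbered along the reverse strand; the function name `flip_to_forward` and the toy counts are hypothetical.

```python
from typing import List


def flip_to_forward(per_position_counts: List[int]) -> List[int]:
    """Reverse the position order: position i becomes position (length + 1 - i)."""
    return list(reversed(per_position_counts))


# Toy 5-position example (real fragments have 39 positions).
forward_counts = [0, 1, 4, 2, 0]
reverse_counts = [3, 0, 1, 0, 5]
flipped = flip_to_forward(reverse_counts)
combined = [f + r for f, r in zip(forward_counts, flipped)]
print(flipped, combined)
```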

I'm collecting the output as pairwise comparisons between query and USSs, because this makes it easiest to pull out the positions of the mismatches without the information about what kind of a mismatch each is.

I'm doing the analysis by bouncing the file back and forth a couple of times between Word and Excel, using Word to first insert tabs between the output lines, and then Excel to delete the columns (formerly lines) I don't want (including the actual sequences) and to sort the results by both the match score and the locations of the ends of the matches. Then I use Word to insert tabs between each position of each alignment, and Excel to count the numbers of mismatches at each position and to graph the results.
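For comparison, the final counting step could also be scripted. This sketch is an alternative to the Word-Excel steps, not a description of them. It assumes the aligned query and fragment strings and the fragment start position have already been pulled out of the BLAST output (no parsing is attempted here), and that the alignments contain no gaps. The name `tally_mismatches` and the toy sequences are invented for the example.

```python
from typing import Iterable, List, Tuple

FRAGMENT_LENGTH = 39  # length of the USS-centred fragments

# Each alignment: (1-based start position in the fragment, query string, fragment string).
Alignment = Tuple[int, str, str]


def tally_mismatches(alignments: Iterable[Alignment],
                     length: int = FRAGMENT_LENGTH) -> Tuple[List[int], List[int]]:
    """Return (mismatches, covered): per-position mismatch and coverage counts."""
    mismatches = [0] * length
    covered = [0] * length
    for start, query, fragment in alignments:
        if len(query) != len(fragment):
            raise ValueError("aligned strings must have equal length (no gaps assumed)")
        if start < 1 or start - 1 + len(query) > length:
            raise ValueError("alignment runs outside the fragment")
        for offset, (q, f) in enumerate(zip(query.upper(), fragment.upper())):
            pos = start - 1 + offset
            covered[pos] += 1
            if q != f:
                mismatches[pos] += 1
    return mismatches, covered


# Toy 5-position example, for illustration only.
toy = [(1, "ACGT", "ACGA"), (2, "CGT", "CGT")]
mismatches, covered = tally_mismatches(toy, length=5)
print(mismatches, covered)
```

Dividing each mismatch count by the corresponding coverage count gives the per-position variation that goes into the graph.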

Next post I'll describe the results.


More about the reduced variation at USSs

BLAST success, analysis success