Does variation in sequencing coverage help explain apparent variation in recombination?

Preamble:  The grad student and I have been looking at the Illumina sequencing coverage of the DNAs used for our DNA-uptake project, and considering how best to exclude from analysis the bits of the genome that consistently have very low coverage.  We're probably just going to exclude all positions that have fewer than 10 reads in the control 'input' DNA sample.  But thinking about this got me thinking about the extent to which position-specific differences in coverage could influence the estimates of recombination frequency in the sister project I'm doing with the former post-doc.

Here's a figure showing how dramatic the variation in coverage is (two 50 kb segments of the genome):


Here's a quick overview of the recombination project.  We transformed the Rd strain with DNA from the divergent NP strain, pooled 10,000 novR colonies, and genome-sequenced the pooled DNA at 20,000-fold coverage to measure the frequency of NP alleles at the ~35,000 SNP positions where the two strains differ.  This gave a genome-wide map of recombination frequency that showed surprisingly high variation.  The graph below shows the reproducibility of this variation across independent samples (one with colonies selected for novR and the other for nalR).



We need to check that some of the apparent differences in recombination frequency at different positions aren't actually due to differences between the sequencing efficiencies of the Rd and NP SNP alleles at these positions.  The former post-doc and I had a Skype conversation about this this morning.  Here's our plan.

He has the control sequencing data:  coverage for each position in the control NP genome sample and in the control Rd sample, each aligned to its own reference genome.  To simplify the comparisons he'll first normalize each data set to its mean coverage.  If we were to plot the SNP-position coverages against each other we'd expect to see something like this:



For each of the ~35,000 SNPs, the ratio of its NP allele and Rd allele coverages (call it the 'bias ratio') tells us how much sequencing biases could influence its apparent recombination rate.  

Now all the former postdoc needs to do is calculate the correlation between the bias ratios and the estimated recombination rates across the genome.  If he sees little or no correlation, then sequencing biases are not contributing to the measured frequencies of NP alleles in the recombinant-genome DNA pool and we can continue to search for the factors that do contribute.  But if he sees a solid correlation then we need to investigate further.

Steps for the former post-doc:  
  1. Normalize control-genome sequencing coverages to their means.
  2. Calculate bias ratios for each SNP.
  3. How much of the recombination variation is explained by the bias ratios? 

Making sense of RNA-seq comparisons

***Hey, it's RRResearch's 10th blogiversary!***

Back to work:

I'm working on the toxin-antitoxin manuscript, and trying to use the RNA-seq data to decide which genes have changed expression in which mutants.  This information should help us understand how the toxin acts to prevent DNA uptake, and what else it might affect.

The comparisons should be straightforward because the former undergrad/summer student left me with a superb set of analyses and R scripts, including EdgeR and DESeq2 analyses comparing expression of different strains at the various time points.  But I'm having a hard time making sense of the results, because some comparisons that I expect to give few significant differences give many, and others give very few.

There are also big inconsistencies between the EdgeR and DESeq2 results.  For example, in one comparison of two mutant strains (taxx vs antx, at time M1), EdgeR finds no genes that are significantly different but DESeq2 finds about 70 genes, with a cutoff for 'significant' that requires both of the following:
  1. must have padj or FDR score less than 0.05<0 .05="" li="">
  2. must have at least a twofold change in expression
In fact, all but two of the genes in this EdgeR comparison have FDR values greater than 0.5, and almost all have FDR values of 1.00.


I had a Skype conversation with the former post-doc this morning, and he suggested an analysis that might clear things up for me.  But I'll need to ask the former student to do it for me, or to modify the R scripts so I can do it.

Step 1:  Identify all the genes whose expression is significantly changed at each of the MIV timepoints (M1, M2 and M3), relative to their expression in log phase in sBHI (M0 timepoint).

Step 2:  Using the same cutoff, examine the genes that differ significantly between different strains at a single time point.  How many of these are also among the 'MIV-induced' set for this timepoint?

If we find that a particular genetic difference (wildtype vs a mutant, or two mutants vs each other) causes changes in the same genes that are changed by MIV in wildtype cells, we could conclude that the genetic difference affects the cellular response to MIV.  If there's no more overlap than we'd expect by chance, then no.