A new plan for contamination correction

The grad student and I had an intense Skype discussion with the former postdoc this morning.  We realized that there's a much easier way to estimate the total Rd contamination in the uptake samples, so we're going to do this.
Preamble:  The former postdoc isn't convinced that this will be better than what we were doing. He likes the simplicity of using the dual-genome alignments directly to calculate uptake ratios, and doesn't think that all the missing data from multiply-mapped positions is a concern.  I don't like discarding what is otherwise good data, and hope that using all the data we can will reduce the noise due to low coverage positions in our uptake ratios.  (The grad student prudently kept his opinions to himself.)
Easy contamination estimate:  We actually already have this data for the eight NP samples, and can easily get it for the eight GG samples.  Way back, the postdoc determined how many of the reads in each sample mapped to the Rd part of the concatenated Rd-NP genome, and how many mapped to the NP part (table below).  Reads from repeats and no-SNP segments were excluded.  The fraction mapping to Rd (Rd/total in the table below) is the fraction of Rd DNA contaminating the sample.  (This data is from a file named meanreads per genome.xlsx.  UP1, 2 and 3 are the replicate long-fragment uptake samples, UP4, 5, and 6 are the replicate short fragment uptake samples, and UP13 and UP15 are the corresponding long and short input samples.)


It's possible that different genome regions are contaminated to different extents (e.g. if some Rd and/or NP sequences are poor substrates for library formation or sequencing), but we'll proceed on the assumption that contamination level is fairly even across the genome, perhaps checking later**.

How to correct the uptake coverage levels for Rd contamination:  We need to only correct the parts of the genome where Rd reads are aligning.  The obvious way to identify these regions is to align reads from pure Rd DNA to the NP reference genome.  At each position the relative coverage is a measure of the extent of contamination we expect due to the Rd contamination in the uptake samples.  Nicely, this coverage measure will intrinsically include the effects of any differences in the sequence-ability of different parts of the Rd genome.

So:
  1. (blue figure )  Align the reads from a pure Rd sample to the NP genome.  (I don't think an Rd control was included in the sequencing done specifically for this project, but we have sequenced pure Rd DNA in other experiments.)  The reads from Rd-unique parts of the Rd genome will be discarded as unmappable.
  2. (blue figure ) Record the Rd coverage at each position.
  3. (yellow figure ) For each NP uptake sample to be corrected for contamination, align its reads to the NP reference genome and record the coverage at each position.  Calculate the average coverage (or normalized coverage) across the whole genome.  
  4. (yellow figure ) Use the NP uptake sample's contamination level (from the table above) to calculate the average amount of this coverage that is due to Rd contamination.  (E.g., if sample UP3 has average coverage of 350 when aligned to NP alone, then 350* 0.0884 = 30.9 of this comes from contamination.) 
  5. (blue figure ) Normalize the position-specific Rd contamination levels (from point 2 above) to the Rd read coverage predicted to be due this contamination level (from point 4 above, 30.9 in the example).
  6. (yellow figure ) Subtract the normalized Rd contamination from the total coverage at each genome position to get its contamination-corrected coverage.


If the uptake coverages weren't normalized at step 3, normalize them before combining them to get the mean coverages over the three uptake samples.

Then compare the resulting uptake-ratio maps to those we previously obtained.

Now, let's see what the former postdoc and grad student think of this plan.  It might not be exactly what we discussed.


** How to check if contamination is evenly distributed across the genome:  The former postdoc has a list of ~34,000 well-behaved SNPs that distinguish Rd and NP.  After mapping an NP uptake sample onto the NP reference genome, at each position we can score the frequency of the Rd and NP alleles.  Since the background error rate is low, we can just take the fraction of Rd alleles as the contamination level at each position.  Now map these contamination levels across the NP genome.  We expect to see a fairly constant level.  At positions where Rd has no homolog to the NP sequence, there will be no data points because there will be no SNPs to score.  There will certainly be noise in these levels.  But maybe we'll see smooth variation that suggests that some parts of the NP genome have higher Rd contamination than other parts.

No comments:

Post a Comment

Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS