Field of Science

How will contaminating Rd reads map onto NP?

Based on the analysis in the previous post, we can calculate lower-bound and upper-bound estimates of the Rd contamination level for each of our 12 'uptake' DNA samples (3 replicates each of 4 treatments). We can also correct the lower-bound estimate (and maybe the upper-bound one too) to get an estimate of the true contamination level for each sample. But what do we do with this information?

To think about this, first consider how we expect the contamination to affect the apparent coverage for each position in the NP (or GG) genome.  

We can map each sample just to its own NP genome. The NP-derived reads (bright red and blue) will map to their correct positions (at internal repeats we'll see the average coverage). The Rd-derived reads from positions that have strong similarity to NP locations (dark red in the figure) should map to their NP homologs; this includes all the Rd reads from repeats and from segments with no SNPs. I think that the Rd-derived reads with no NP homolog (yellow in the figure) will be unmappable and given Q=0 scores. These will be about 10-15% of the Rd reads (a known value). So most NP positions will have their NP read coverage plus (say) 20% extra coverage from the Rd reads. But some NP positions (again 10-15%) don't have Rd homologs (blue), and these will have only their NP coverage.
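Here is a minimal Python sketch of that expectation. It is only an illustration: the function name `expected_coverage`, the toy coverage values, and the homolog flags are all made up, and it assumes the Rd reads add the same proportional extra (the "say 20%" above) at every NP position that has an Rd homolog.

```python
def expected_coverage(np_coverage, has_rd_homolog, rd_extra=0.20):
    """Apparent coverage at each NP position when Rd reads are mixed in.

    np_coverage    -- true NP-derived coverage at each position (hypothetical values)
    has_rd_homolog -- True where the NP position has an Rd homolog
    rd_extra       -- assumed extra coverage from Rd reads, as a fraction of NP coverage
    """
    if len(np_coverage) != len(has_rd_homolog):
        raise ValueError("coverage and homolog lists must be the same length")
    apparent = []
    for cov, homolog in zip(np_coverage, has_rd_homolog):
        if homolog:
            # Rd reads from the homologous segment pile on top of the NP reads.
            apparent.append(cov * (1.0 + rd_extra))
        else:
            # No Rd homolog: only NP reads can map here.
            apparent.append(cov)
    return apparent


# Toy example: four NP positions, the third has no Rd homolog.
np_coverage = [100.0, 100.0, 50.0, 80.0]
has_rd_homolog = [True, True, False, True]
apparent = expected_coverage(np_coverage, has_rd_homolog)
print(apparent)
```

In this toy case only the third position keeps its true coverage; the others are inflated by the assumed Rd contribution.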

Grad student and former postdoc, how should we handle this?  If we have a table listing which NP positions have no Rd homolog, we could just do the contamination correction on the positions that do have homologs.  But how well this works depends on how well we understand the limits of the algorithm that maps the reads onto the reference sequence.
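One way the homolog-table idea might look in practice is sketched below. This is an assumption-laden sketch, not a worked-out method: `correct_coverage` and its inputs are hypothetical, and it treats the Rd contribution as a fixed fraction of the NP coverage at each homologous position. If the Rd contamination turns out to behave more like a constant read depth, subtracting that depth would be the analogous step.

```python
def correct_coverage(observed, has_rd_homolog, rd_extra):
    """Remove the assumed Rd contribution from observed NP coverage.

    observed       -- observed coverage at each NP position (hypothetical values)
    has_rd_homolog -- True where the NP position has an Rd homolog
    rd_extra       -- estimated extra coverage from Rd reads, as a fraction of NP coverage
    """
    if rd_extra < 0:
        raise ValueError("rd_extra must be non-negative")
    if len(observed) != len(has_rd_homolog):
        raise ValueError("coverage and homolog lists must be the same length")
    corrected = []
    for cov, homolog in zip(observed, has_rd_homolog):
        if homolog:
            # Scale back down only where Rd reads could have mapped.
            corrected.append(cov / (1.0 + rd_extra))
        else:
            # Positions with no Rd homolog are left untouched.
            corrected.append(cov)
    return corrected


# Toy example: the third position has no Rd homolog, so it is not corrected.
observed = [120.0, 120.0, 50.0, 96.0]
has_rd_homolog = [True, True, False, True]
corrected = correct_coverage(observed, has_rd_homolog, rd_extra=0.20)
print(corrected)
```

The correction is only as good as the homolog table: any position where the mapper placed Rd reads but the table says "no homolog" (or the reverse) would be mis-corrected.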
