To think about this, first consider how we expect the contamination to affect the apparent coverage for each position in the NP (or GG) genome.
We can map each sample just to its own NP genome. The NP-derived reads (bright red and blue) will map to their correct positions (at internal repeats we'll see the average coverage). The Rd-derived reads from positions that have strong similarity to NP locations (dark red in the figure) should map to their NP homologs; this includes all the Rd reads from repeats and no-SNP segments. I think that the Rd-derived reads that don't have NP homologs (yellow in the figure), and thus can't be mapped onto NP, will be reported as unmappable and given Q=0 scores. These will be about 10-15% of the Rd reads (a known value). So most NP positions will have their NP read coverage plus (say) 20% extra coverage from the Rd reads. But some NP positions (again 10-15%) don't have Rd homologs (blue), and these will have only their NP coverage.
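To make the arithmetic concrete, here is a rough Python sketch of the expected apparent coverage under this reasoning. The function names are made up for illustration, and the numbers (100x NP coverage, 20% extra from Rd, 12.5% of NP positions lacking an Rd homolog) are placeholder assumptions, not measured values. It also assumes coverage is uniform along the genome, which real data won't be.

```python
def expected_coverage(np_coverage, rd_extra_fraction, has_rd_homolog):
    """Expected apparent coverage at a single NP position.

    np_coverage: coverage from genuine NP reads (assumed uniform).
    rd_extra_fraction: extra coverage from contaminating Rd reads,
        as a fraction of NP coverage (0.2 is the 'say 20%' guess).
    has_rd_homolog: True if Rd has a homolog of this NP position.
    """
    if has_rd_homolog:
        # Rd reads pile onto the homologous NP position.
        return np_coverage * (1 + rd_extra_fraction)
    # NP-specific positions (blue) get only their NP reads.
    return np_coverage


def mean_apparent_coverage(np_coverage, rd_extra_fraction, frac_np_specific):
    """Genome-wide mean, weighting the two classes of NP positions."""
    shared = expected_coverage(np_coverage, rd_extra_fraction, True)
    specific = expected_coverage(np_coverage, rd_extra_fraction, False)
    return (1 - frac_np_specific) * shared + frac_np_specific * specific


if __name__ == "__main__":
    # Illustrative values only.
    print(expected_coverage(100, 0.2, True))
    print(expected_coverage(100, 0.2, False))
    print(mean_apparent_coverage(100, 0.2, 0.125))
```

If this picture is right, the NP-specific positions should stand out as having coverage lower than the rest of the genome by roughly the contamination fraction.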