Field of Science

New contamination analysis suggests alternate strategy

Yesterday's post considered the factors affecting our current strategy for removing the contaminating Rd-derived reads from our 'uptake' samples.  Since then I've looked at the read-mapping results using our newly perfected reference sequences.  The problem of reads that have no SNPs and thus map ambiguously to both Rd and NP/GG is worse than we thought.  Below I'll describe these data and the new strategy I want to consider.

The former post-doc sent us a spreadsheet with all the summary data from mapping each of our samples ('input', 'uptake' and controls) to various combinations of genomes.  I'll start by considering what we see when we map reads from the 'input' genomes (NP and GG) and the recipient genome (Rd) onto artificial reference genomes consisting of an Rd-genome 'chromosome' and an input-genome 'chromosome' (NP or GG).  These samples have no contamination, so any reads that map to the wrong 'chromosome' will do so because of these factors (also described in the previous post; a small tallying sketch follows the list):

  1. Sequencing errors:  Some NP-derived (or GG-derived) reads will have sequencing errors that create the Rd alleles of SNPs, causing them to align better to the Rd homolog. 
  2. Shared segments: Although the Rd and NP (or GG) genome sequences differ on average by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical, including each genome's six copies of the ribosomal RNA genes.  Reads from within these segments cannot be attributed to one source and are randomly assigned locations by the mapping program.  Half of them will be assigned to Rd and half to NP (or GG), regardless of the actual level of contamination, and all will be given MAPQ scores = 0.
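
As a concrete illustration of how these two factors show up in the mapping output, here's a minimal sketch of the kind of tally involved.  It assumes pysam, assumes the concatenated reference names its two 'chromosomes' with recognizable Rd/NP prefixes, and uses a hypothetical file name; it is not our actual pipeline.

```python
# Hypothetical sketch: tally where reads land when an uncontaminated NP 'input'
# sample has been mapped to a concatenated Rd+NP reference.
import pysam

def tally_assignments(bam_path, wrong_contig_prefix="Rd"):
    counts = {"correct": 0, "wrong": 0, "mapq0": 0}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):   # until_eof avoids needing an index
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            if read.mapping_quality == 0:
                counts["mapq0"] += 1             # ambiguous: placed arbitrarily
            if read.reference_name.startswith(wrong_contig_prefix):
                counts["wrong"] += 1             # assigned to the Rd 'chromosome'
            else:
                counts["correct"] += 1
    total = counts["correct"] + counts["wrong"]
    print(f"% mis-assigned: {100 * counts['wrong'] / total:.2f}")
    print(f"% MAPQ = 0:     {100 * counts['mapq0'] / total:.2f}")
    return counts

# tally_assignments("NP_input_vs_RdNP.bam")   # hypothetical file name
```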
Here are the results.  I've deleted the columns with the raw coverage data and just show the % of reads in each sample that were incorrectly assigned.



Depending on the combination of reference genomes used, between 4.75% and 7.61% of the reads are assigned to the incorrect genome, either due to sequencing errors or because the segment they cover is identical in both genomes (does not include any SNP positions).  

The likelihood that a read will be mis-mapped depends on its length - short reads are more likely to not include any SNP positions.  This probably explains the lower mis-assignment rates of the 'long' NP and GG samples.
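
As a rough illustration of the length effect (an idealization that ignores the clustering of identical segments discussed below, and simply assumes SNPs scattered uniformly at a density d of 2-3%), the chance that a read of length L contains no SNP at all falls off quickly with read length:

```python
# Idealized back-of-envelope: probability a read of length L contains no SNP,
# if SNPs were scattered uniformly at density d (they aren't, but the trend holds).
for d in (0.02, 0.03):
    for L in (50, 100, 150, 300):
        print(f"d = {d:.0%}, L = {L:3d} bp: P(no SNP) = {(1 - d) ** L:.1%}")
```

For example, at d = 2% a 150 bp read has roughly a 5% chance of containing no SNP, while a 300 bp read has well under 1%.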

Very little of this mis-assignment is due to sequencing errors.  The table below shows how the % mis-assignments change when the reads with MAPQ scores of 0 are removed (XXX rows).


Excluding the ambiguously mapping reads dramatically reduces the mis-mapping, to between 0.05% and 0.21%.  This residual mis-mapping reflects reads that did not match the correct genome, but matched a SNP allele in the incorrect genome.  This will be due to sequencing errors that change the SNP allele.

With this information we can calculate the actual fraction of reads that map ambiguously.  Since only half of the ambiguously mapping reads end up assigned to the wrong genome, this fraction is

2 * (% mis-mapped reads - % mis-mapped due to sequencing errors)
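
Plugging in the extremes quoted above (pairing the lowest mis-mapping rate with the lowest error-only rate and the highest with the highest, just for illustration):

```python
# Illustrative arithmetic using the ranges quoted above, not per-sample values.
for mismapped, seq_err in [(0.0475, 0.0005), (0.0761, 0.0021)]:
    ambiguous = 2 * (mismapped - seq_err)
    print(f"mis-mapped {mismapped:.2%} -> ~{ambiguous:.1%} of reads map ambiguously")
```

That works out to roughly 9-15% of the reads.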

This is a lot more than I expected!  Only a small part of the mis-mapping is due to the 6 ribosomal RNA operons.  Each operon has about 4.5 kb of identical sequence, which will cause about 25 kb (1.4%) of the genome to map ambiguously.  Most of the ambiguous mapping must thus be due to identical sequences in other parts of the genome.  These are likely to be mainly in coding sequences that are either highly conserved or that are identical by descent due to previous recombination between recent ancestors of these strains.

New plan for contamination correction:  Our original contamination-correction plan was to remove the contaminating reads from the uptake samples.  We planned to use mapping to concatenated 'genomes' to identify and discard reads that mapped unambiguously to Rd, as well as reads that mapped ambiguously to both genomes.  But the above analysis shows that we would be discarding a large fraction of the non-contaminating (NP or GG) reads from many parts of the genome (probably the most conserved parts).

So here's an alternative.  Instead of trying to remove the contamination from the 'uptake' samples (the numerator in our uptake-ratio calculation), let's add fake contamination to the 'input' samples (the denominator in our calculation).  This idea was suggested by some fake-contamination samples the former post-doc created to do some previous tests.

Here's how it would work (a rough code sketch follows the list):
  1. First we estimate the actual contamination level of each 'uptake' sample, by mapping it onto the concatenated pair of genomes and using the above analysis to correct for the mis-mapping.   (This calculation is shown below.)
  2. Then we create, for each uptake sample, a fake 'input+contamination' sample that matches the Rd contamination of the corresponding uptake sample, by adding the appropriate number of (randomly-selected?) Rd reads to its reads. 
  3. Then we map each sample's reads (uptake and input+contamination) to the genome it's supposed to represent, not paying any attention to possible contamination.
  4. Then we normalize the coverages of all these mapped samples to the same size (reads per million bp or ???).
  5. Then we calculate the uptake ratios for each uptake sample, by dividing the normalized 'uptake' coverage at each genome position by the corresponding normalized 'input+contamination' coverage. 
  6. Finally, we calculate mean uptake ratio at each position for the sets of three replicate uptake samples.  
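
Here's the rough sketch promised above, covering steps 2, 4 and 5.  It's a minimal illustration only: the function names are placeholders, the toy lists and arrays stand in for what would really be BAM files and per-position coverage vectors, and numpy is used just to show the shape of the calculation.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_in_contamination(input_reads, rd_reads, contamination):
    """Step 2: add randomly chosen Rd reads so the fake 'input+contamination'
    sample has the same Rd fraction as the corresponding uptake sample."""
    n_rd = int(len(input_reads) * contamination / (1 - contamination))
    picked = rng.choice(len(rd_reads), size=n_rd, replace=False)
    return input_reads + [rd_reads[i] for i in picked]

def normalize(coverage):
    """Step 4: scale per-position coverage so every sample sums to the same total."""
    coverage = np.asarray(coverage, dtype=float)
    return coverage * 1e6 / coverage.sum()

def uptake_ratio(uptake_cov, input_plus_contam_cov):
    """Step 5: position-by-position ratio of the normalized coverages."""
    return normalize(uptake_cov) / normalize(input_plus_contam_cov)

# Step 6 would then just be an element-wise mean over the three replicate
# uptake-ratio arrays, e.g. np.mean([r1, r2, r3], axis=0).
```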


Actual contamination:  The calculations to estimate the actual contamination level of each 'uptake' sample took a bit of algebra to set up (see *** below), but I checked them a couple of ways and I'm pretty sure they're correct.


The final column is the calculated contamination level for each sample.  This is the fraction of the DNA in the sample that came from the chromosomes of the recipient Rd cells instead of from donor NP or GG DNA taken up by these cells.  These values are surprisingly close to the values in yesterday's post, which were calculated in a completely different way, using only the reads that did not map ambiguously.  That's fine - it tells us that there's nothing special about the ambiguously mapping reads.

I'm not absolutely sure that this strategy will eliminate the effects of contamination (we can test it with a fake-contaminated uptake sample).  If it does, we should see that the uptake ratio peaks are better correlated with the quality-scores of the uptake sequences (the large scatter we saw previously should be reduced).  And it will certainly reduce the problems associated with the many regions with low or zero coverage created by discarding reads with MAPQ=0.

What next: The first step is for the grad student and former post-doc to read and comment on this post.  If they agree that what I'm proposing is a reasonable approach, we'll test it.

What about using spanning coverage instead of single reads? The grad student is now generating the spanning-coverage data, but using it to correctly position the ambiguous reads is challenging.  In principle we won't need to tag individual reads as ambiguous if we use this new contamination-correction strategy.

***  The contamination calculation:

The input and control analysis above tells us what fraction of the reads mapping to Rd in each uptake sample results from mis-assignment of ambiguously mapping reads rather than from contamination.

I worked out the calculation by instead considering an imaginary sample whose contamination level C was known:
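
The derivation itself isn't reproduced here, but here's one way the algebra can be set up (my reconstruction, not necessarily the exact form used; it assumes the mis-assignment rate m measured from the uncontaminated controls applies symmetrically to Rd-derived and donor-derived reads).  For an imaginary sample whose true contamination is C, the observed fraction of reads assigned to Rd would be R = C*(1 - m) + (1 - C)*m, which rearranges to C = (R - m) / (1 - 2m).

```python
def estimate_contamination(R, m):
    """Reconstruction of the contamination estimate (assumptions as stated above):
        observed Rd fraction  R = C*(1 - m) + (1 - C)*m
        rearranges to         C = (R - m) / (1 - 2*m)
    """
    return (R - m) / (1 - 2 * m)

# e.g. if 12% of an uptake sample's reads are assigned to Rd and the control
# mis-assignment rate is 6%, the estimated true contamination is about 6.8%:
# estimate_contamination(0.12, 0.06) -> 0.068...
```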



1 comment:


  1. I think I mostly follow the logic above, though I have to try and work through a couple parts of it, and I wanted to look at some of the other numbers in the big giant summary table.

    One thing: Even though it's a 14% loss of reads, this is not a loss of 14% of the genome. I'd still expect the number of low/zero-coverage positions to be much lower than this.

    Using spanning coverage helps us in a few ways, but one of the biggest is that it should mitigate this problem (though not eliminate it, so it's still worth working through in any case). Doing it is just a technical filtering issue at this point. When BWA aligns the two reads of a pair discordantly (e.g. far apart) and one of them has an ambiguous mapping, it looks for a concordant alternate mapping for that read using a full Smith-Waterman alignment (which is more sensitive). If it finds one, it keeps that mapping instead.

    That would mean we'd map a whole bunch of the multiply mapping reads and exploit 300 bp instead of 150 bp. We'll still have the same issue, but I would expect a lot less of it.


