OK, here's another try at understanding the interactions of repeated sequences, reads with no SNPs, and RD contamination in our DNA uptake sequencing data.
A reminder of what the problem is that I'm trying to solve: We have three 'uptake' samples for each treatment in our big analysis of DNA uptake specificity, but they can't be directly compared because, in the different samples, the recovered 'donor' DNA (from strain NP) probably has different levels of contamination by the recipient strain Rd. We have no direct way to measure this contamination, but I think we can infer it from other features of the samples. (The next post considers how to correct for the % contamination.)
This time I've simplified the diagrams by considering only a short segment containing one rDNA repeated sequence (purple) and one segment with no SNPs (green). When the reads from the contaminated DNA are aligned to the concatenated Rd-NP genomes, reads from both repeats and no-SNP locations will be 'multiply-mapping', and the alignment program will randomly assign each of these reads to one of the places they align with.
Diagram A. The first diagram is alignment of reads of the NP input sample to the concatenated Rd-NP genomes.
At the rDNA repeat and the no-SNP segment the coverage mapped to NP is reduced to 50%, and the coverage mapped to Rd is increased to 50%. These 50% coverages will be constant over all repeats and no-SNP segments.
The multipy-mapping reads (repeats and no-SNP) will be given quality scores Q= 0. Some reads that are poor quality for other reasons may also be given this score - the grad student is going to check how big a contribution this is. We can then calculate the fraction of reads that are multiply-mapping and use this later to improve our estimates of the Rd contamination in the uptake samples.
The next two diagrams separately show the expected mapping of each of the components of an NP uptake sample that is contaminated with 20% Rd DNA.
Diagram B. Here's the expected NP contribution:
All the unique NP reads map at the unique NP positions; this is only 80% of the coverage because 20% is Rd contamination. Only 40% of the multiply-mapping reads are assigned to the multiply-mapping positions in the NP genome. The other 40% are assigned to their Rd homologs.
Diagram C. Here's the expected Rd contribution:
The other 20% of the reads map to Rd. At unique Rd positions this gives 20% coverage, but at the multiply-mapping positions this is reduced to 10% because the other 10% are assigned to their NP homologs.
Diagram D. Now here's the total mapping of the NP uptake sample (including the contamination): (This is what we would actually see when we aligned an NP uptake sample with 20% Rd contamination to the concatenated Rd-NP genomes.)
The NP-aligned sequences stay at 80% coverage for the unique NP sequences, but coverage at the multiply-mapping segments increases to 50% due to the 10% from the Rd sequences. Similarly the coverage at the Rd unique sequences stays at 20%, but increases to 50% at the multiply-mapping segments.
OK, how can we use this analysis to work backwards? We need to start with Diagram D and infer that there is the 20% Rd contamination assumed in Diagrams B and C .
If the data were as clean as I've drawn it we could just look at the Rd coverage in Diagram D and fit a horizontal line to the 'typical' 20% coverage. But the real data may be much noisier than this, so it will probably not be possible to decide where to put such a line.
What if we recalculated Diagram D, this time excluding all the reads with Q=0 scores?
Diagram E.
Perhaps now we can draw the desired line through the high-coverage positions. But this is still a bit subjective, so here's another approach:
First, we calculate mean coverage of the uptake sample over the Rd
genome (Q=0 reads excluded). This gives us a
lower bound for the true contamination.
Then, from Diagram D, we calculate mean coverage of the uptake sample over the Rd genome including the Q=0 reads. This gives us an upper bound for the true contamination.
Now, can we refine these estimates to get the actual contamination? To do so we need to know what fraction of Rd positions in Diagrams D and E were excluded by excluding multiply-mapping reads, and what fraction were over-covered by including them.
Back at Diagram A, we calculated the fraction of reads that had Q=0 scores due to multiple-mapping. If we assume that
true coverage is independent of SNPs and repeats, this fraction should also approximate the
fraction of the genome whose coverage was lost due to multiply-mapping reads (call it '%Q'). We can use this %Q to correct the contamination
estimates.
Correction for lower-bound estimate: Divide by (1 - %Q).
Can we also correct the upper bound estimate? I don't think so (at least not simply), because the degree of over-coverage at the multiply-mapping positions is not fixed - the higher the true contamination is, the smaller the over-coverage effect. That is, we know that the multiply-mapping positions will have 50% coverage, but unless we already know the amount of contamination we can't correct this for the amount of contamination.
However we do have a refined contamination estimate from the lower-bound estimate (call it 'Cu'), and I think we can use this to correct our upper-bound estimate. If we've done this right we should get the same Cu value back.
Correction for upper-bound estimate: Subtract (%Q(50%-Cu)).
Will the real value be midway
between the estimates? No, it will be
closer to the lower bound. The smaller the true contamination is, the more it will be overestimated by including the multiply-mapping reads.
Then what? Hmm, I see that I haven't thought this part through. I think I'll start a new post for that.
We need to do the lower-bound and upper-bound calculations and corrections for each of our 12 'uptake' samples (3 replicates each of 4 treatments). This will give us good estimates of the % contamination for each sample.
e can map each sample only to its own Np genome (or GG). Now the Rd-derived reads from positions that have strong similarity to NP locations should map there, and the Rd-derived sequences that cannot be mapped onto NP will be given Q=0 scores.
and reduce the coverage levels in proportion to the contamination.