In principle, at each position in the genome, we just divide the sequencing coverage for the recovered DNA samples (taken up by cells) by the coverage for the input DNA.
(We are using DNAs from two donor strains 'NP' and 'GG'). Below I'll just describe things for NP, but exactly the same issues apply to GG.)
Complications (some solved, some not):
- Reads from repeated sequences such as ribosomal RNA genes (5 copies) can't be mapped to a single position in the reference genome. Solution: the mapping software randomly assigns ambiguous reads to one of the possible sources, and flags the read with a quality score = 0.
- We have 3 replicates for each uptake treatment, but only one for each input DNA prep. Solution: take the mean coverage over the 3 uptake samples
- Different samples have different total numbers of reads. Solution: Normalize each coverage to coverage per million reads. Do this before calculating the means.
- The recovered donor DNA (strain NP) in the input samples is contaminated with a small fraction (5-17%) of DNA from the recipient (strain Rd). Solution: identify the Rd-derived reads by aligning the sample reads to a concatenated Rd-NP double-genome. Ignore the reads that align to Rd.
- This dual-alignment creates a new problem: Any NP reads that don't include at least one SNP position distinguishing NP from Rd now align ambiguously. Half of these NP reads will be mistakenly mapped to the Rd genome and ignored, and they all will have quality scores = 0. A quick check suggests that this mis-mapping affects a substantial fraction of the genome.
- Because we need to divide the uptake coverages by the input coverage, we have been doing the dual-alignment on the input DNA reads even though we know that they have no Rd contamination.
Currently we have simply discarded all reads with Q=0 before dividing mean NP-specific uptake coverage by NP-specific input coverage to calculate the uptake ratios. Unfortunately discarding Q=0 reads means that our input data has many segments with very low or zero NP-specific coverage (everywhere that that Rd-NP sequence identity is high), which makes our uptake ratio calculations unreliable. We suspect that the actual coverage of the input DNA is much more even.
Now that we better understand what's going on (see previous post), I'm hoping we can come up with a better way to do the analysis.
We need to:
- Identify where NP repeat sequences are being ambiguously aligned. If half of them are being mis-assigned to Rd we need to correct for this, and we need to note that the coverage values for these segments are the repeat averages.
- Identify where unique NP reads are being ambiguously aligned because they are identical to Rd sequences, and correct for the 50% mis-mapped reads.
- Identify how much Rd contamination is in our uptake samples, and correct for this contamination.
If we do this properly, I think we can calculate our uptake ratios without discarding the Q=0 reads.
I. Normalize the coverage: Align the reads in all NP samples to the NP reference genome. This genome will have been tweaked to be identical to the input genome. For each sample, normalize the coverage at each position to coverage per million reads (or whatever is customary or convenient).
I. Normalize the coverage: Align the reads in all NP samples to the NP reference genome. This genome will have been tweaked to be identical to the input genome. For each sample, normalize the coverage at each position to coverage per million reads (or whatever is customary or convenient).
II. Examine coverage for the input sample (blue boxes below): Only reads from repeated sequences (internal-repeat reads) will give ambiguous alignments and Q=0 scores (grey boxes in the maps below). Examine genome coverage of only the Q=0 reads. Are these just at the rDNA operons?
III. Align input reads again, this time to the concatenated Rd-NP dual genome (grey box below): Normalize the coverage as before. Now all the reads that don't contain SNP positions will align ambiguously and have Q=0 scores (as will the internal-repeat reads). Record the normalized coverage by these Q=0 reads on the Rd and NP genomes (yellow and orange boxes below).
IV. Realign uptake reads to the concatenated Rd-NP dual genome: Separately align each uptake sample's reads to the concatenated Rd-NP dual-genome (big green box below). Normalize the coverage as before. The rDNA segments will have higher Rd coverage than the rest.
V. Estimate % contamination in each uptake sample: From each samples' normalized Rd coverage values (green line in the green box), subtract the normalized Rd Q=0 coverage (grey line in the green box, values from Step III, little yellow box) to get the levels of Rd coverage due to contamination. Use these values to estimate a mean contamination level for each uptake sample (dashed green line in little green box).
VI. Correct coverage of uptake samples for contamination: For each uptake sample, multiply each position's normalized coverage (from step I) by (1 - contamination) to get corrected coverage values (pink box).
VII. Calculate mean uptake coverage: At each genome position, take the average of the three corrected normalized coverages.
VIII. Calculate uptake ratios: Divide the mean uptake coverage (from step VII) by the normalized input coverage (from step I).
Now I just need the grad student and former postdoc to read this and check my logic.