How to analyze the DNA uptake data

I'm working out a clear plan for how we will infer the uptake bias at each genome position (genome-wide uptake ratios) from our DNA uptake sequencing data.

In principle, at each position in the genome, we just divide the sequencing coverage for the recovered DNA samples (taken up by cells) by the coverage for the input DNA.

(We are using DNAs from two donor strains, 'NP' and 'GG'.  Below I'll just describe things for NP, but exactly the same issues apply to GG.)

Complications (some solved, some not):
  1. Reads from repeated sequences such as ribosomal RNA genes (5 copies) can't be mapped to a single position in the reference genome.  Solution: the mapping software randomly assigns ambiguous reads to one of the possible sources, and flags the read with a quality score = 0.
  2. We have 3 replicates for each uptake treatment, but only one for each input DNA prep.  Solution: take the mean coverage over the 3 uptake samples.
  3. Different samples have different total numbers of reads.  Solution:  Normalize each coverage to coverage per million reads.  Do this before calculating the means.
  4. The recovered donor DNA (strain NP) in the uptake samples is contaminated with a small fraction (5-17%) of DNA from the recipient (strain Rd).  Solution: identify the Rd-derived reads by aligning the sample reads to a concatenated Rd-NP dual genome.  Ignore the reads that align to Rd.
  5. This dual-alignment creates a new problem:  Any NP reads that don't include at least one SNP position distinguishing NP from Rd now align ambiguously.  Half of these NP reads will be mistakenly mapped to the Rd genome and ignored, and they all will have quality scores = 0.  A quick check suggests that this mis-mapping affects a substantial fraction of the genome.
  6. Because we need to divide the uptake coverages by the input coverage, we have been doing the dual-alignment on the input DNA reads even though we know that they have no Rd contamination.
Currently we have simply discarded all reads with Q=0 before dividing the mean NP-specific uptake coverage by the NP-specific input coverage to calculate the uptake ratios.  Unfortunately, discarding the Q=0 reads leaves our input data with many segments of very low or zero NP-specific coverage (everywhere that Rd-NP sequence identity is high), which makes our uptake ratio calculations unreliable.  We suspect that the actual coverage of the input DNA is much more even.

Now that we better understand what's going on (see previous post), I'm hoping we can come up with a better way to do the analysis. 

We need to:
  • Identify where NP repeat sequences are being ambiguously aligned.  If half of them are being mis-assigned to Rd we need to correct for this, and we need to note that the coverage values for these segments are the repeat averages.
  • Identify where unique NP reads are being ambiguously aligned because they are identical to Rd sequences, and correct for the 50% mis-mapped reads.
  • Identify how much Rd contamination is in our uptake samples, and correct for this contamination.
If we do this properly, I think we can calculate our uptake ratios without discarding the Q=0 reads. 

I.  Normalize the coverage:  Align the reads in all NP samples to the NP reference genome.  This genome will have been tweaked to be identical to the input genome.  For each sample, normalize the coverage at each position to coverage per million reads (or whatever is customary or convenient).
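Here's a minimal sketch of the normalization step, assuming the per-position depths have already been pulled out of the alignments into a numpy array (all file and variable names below are placeholders, not our actual pipeline):

```python
import numpy as np

def normalize_per_million(coverage, total_mapped_reads):
    """Scale raw per-position depth to 'coverage per million mapped reads'."""
    return coverage * (1_000_000 / total_mapped_reads)

# Made-up example: a sample with 2.4 million mapped reads
raw_depth = np.array([310, 295, 0, 412], dtype=float)   # depth at 4 positions
norm_depth = normalize_per_million(raw_depth, total_mapped_reads=2_400_000)
```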

II.  Examine coverage for the input sample (blue boxes below):  Only reads from repeated sequences (internal-repeat reads) will give ambiguous alignments and Q=0 scores (grey boxes in the maps below).  Examine genome coverage of only the Q=0 reads.  Are these just at the rDNA operons?
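One way to do this check (a sketch only; I'm assuming a sorted and indexed BAM file and using pysam, but any equivalent tool would do, and the file and contig names are placeholders) is to build a depth track that counts only the MAPQ = 0 reads, then compare it with the rDNA operon coordinates:

```python
import numpy as np
import pysam

def q0_coverage(bam_path, contig):
    """Per-position depth counting only reads with mapping quality 0."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        depth = np.zeros(bam.get_reference_length(contig), dtype=np.int32)
        for read in bam.fetch(contig):
            if read.is_unmapped or read.mapping_quality != 0:
                continue
            depth[read.reference_start:read.reference_end] += 1   # add 1 over the read's span
    return depth

q0 = q0_coverage("NP_input.bam", "NP_genome")        # placeholder names
print("positions with any Q=0 coverage:", np.count_nonzero(q0))
```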

III. Align input reads again, this time to the concatenated Rd-NP dual genome (grey box below):  Normalize the coverage as before.  Now all the reads that don't contain SNP positions will align ambiguously and have Q=0 scores (as will the internal-repeat reads).  Record the normalized coverage by these Q=0 reads on the Rd and NP genomes (yellow and orange boxes below).  
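The concatenated reference can simply be a two-record multi-FASTA, so every aligned read gets labelled with the genome it hit; the same q0_coverage() helper sketched above can then record the Q=0 coverage separately on the Rd and NP halves.  (File and contig names below are placeholders.)

```python
# Build the concatenated Rd-NP reference as a two-record multi-FASTA.
with open("Rd_NP_dual.fasta", "w") as out:
    for fasta in ("Rd_genome.fasta", "NP_genome.fasta"):
        with open(fasta) as f:
            text = f.read()
        out.write(text if text.endswith("\n") else text + "\n")

# After aligning the input reads to Rd_NP_dual.fasta (and sorting/indexing the BAM),
# record normalized Q=0 coverage separately for each genome:
#   q0_rd = normalize_per_million(q0_coverage("NP_input_dual.bam", "Rd_genome"), total_mapped_reads)
#   q0_np = normalize_per_million(q0_coverage("NP_input_dual.bam", "NP_genome"), total_mapped_reads)
```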


IV.  Realign uptake reads to the concatenated Rd-NP dual genome:  Separately align each uptake sample's reads to the concatenated Rd-NP dual-genome (big green box below).  Normalize the coverage as before.  The rDNA segments will have higher Rd coverage than the rest.  

V.  Estimate % contamination in each uptake sample:  From each sample's normalized Rd coverage values (green line in the green box), subtract the normalized Rd Q=0 coverage (grey line in the green box; values from Step III, little yellow box) to get the level of Rd coverage due to contamination.  Use these values to estimate a mean contamination level for each uptake sample (dashed green line in the little green box).

VI. Correct coverage of uptake samples for contamination:  For each uptake sample, multiply each position's normalized coverage (from step I) by (1 - contamination) to get corrected coverage values (pink box). 
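The arithmetic for steps V and VI in numpy (a sketch; the one thing the plan above doesn't pin down is how to turn the mean contamination level into a fraction, so dividing the excess Rd coverage by the sample's overall normalized coverage is my assumption):

```python
import numpy as np

def contamination_fraction(rd_uptake_cov, rd_q0_cov, sample_total_cov):
    """Step V: Rd coverage in an uptake sample not explained by ambiguous
    Q=0 mapping (values from step III) is attributed to contamination.
    Expressing it as a fraction of the sample's total normalized coverage
    is an assumption, not something specified in the plan above."""
    excess_rd = np.clip(rd_uptake_cov - rd_q0_cov, 0.0, None)
    return excess_rd.mean() / sample_total_cov.mean()

def correct_coverage(np_cov, contamination):
    """Step VI: multiply each position's normalized coverage by (1 - contamination)."""
    return np_cov * (1.0 - contamination)
```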


VII.  Calculate mean uptake coverage:  At each genome position, take the average of the three corrected normalized coverages.

VIII. Calculate uptake ratios:  Divide the mean uptake coverage (from step VII) by the normalized input coverage (from step I).
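Steps VII and VIII are simple element-wise operations; here's a sketch, treating the three corrected replicate tracks and the input track as equal-length numpy arrays.  Returning NaN where the input coverage is zero (rather than dividing by zero) is my choice, not something decided above:

```python
import numpy as np

def uptake_ratios(corrected_reps, input_cov):
    """Step VII: mean corrected uptake coverage over the 3 replicates.
    Step VIII: divide by the normalized input coverage at each position."""
    mean_uptake = np.mean(corrected_reps, axis=0)                       # step VII
    ratios = np.full_like(mean_uptake, np.nan, dtype=float)
    np.divide(mean_uptake, input_cov, out=ratios, where=input_cov > 0)  # step VIII
    return ratios

# corrected_reps has shape (3, genome_length); input_cov has shape (genome_length,)
# ratios = uptake_ratios(np.vstack([rep1, rep2, rep3]), input_cov)
```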

Now I just need the grad student and former postdoc to read this and check my logic. 






A new kind of problem

The PhD student and I are analyzing the data from his mapping of H. influenzae's uptake of genomic DNA.

The data was generated by Illumina sequencing of genomic DNA samples before and after they had been taken up by competent cells.  Using a rec2 mutant as the competent cells lets us recover the DNA intact after uptake.

He has done quite a bit of analysis of the resulting uptake ratios (ratio of 'recovered' coverage to 'input' coverage) but now we're going back and sorting out various anomalies that we initially ignored or overlooked.

One big issue is the surprisingly uneven sequencing coverage of the genome that we see even when just sequencing sheared genomic DNA (the 'input' samples).  The graph below shows the sequencing coverage of the first 10 kb of the genome of strain 'NP'.  The orange dots are from a short-fragment DNA prep (sheared, most fragments between 50 and 500 bp) and the blue dots are from a long-fragment DNA prep (sheared, most fragments between 1.5 and 9 kb).


Over most of this segment (and over most of the ~1900 kb genome), coverage of the 'short' sample is about 200-400 reads, about twice as high as the ~150-250 read coverage of the 'long' sample.  But in three places coverage of both samples falls to near-zero or zero.  Similar losses of coverage occur at many places throughout the genome, though this segment has more than usual.

What could cause this?  In principle it could be something about the DNA preparations, about the library construction, about the actual sequencing, or about the data processing and alignments to the reference genomes.  
  1. DNA prep artefacts: Long and short DNAs come from the same initial genomic DNA prep.  They were sheared differently: a Covaris g-TUBE for the long-fragment prep, sonication for the short-fragment prep.  Perhaps there are places in the genome where breaks always occur?
  2. Library construction artefacts: This used Illumina Nextera XT kits (a transposon method).  Might there be strong sequence biases to this process?  The distribution of the spanning coverage of the reads indicates that the short library inserts were mostly 100-350 nt (mean about 200 nt) and the long library inserts were mostly 200-500 nt (mean about 325 nt).
  3. Sequencing artefacts:  The former post-doc used an Illumina NextSeq500 to collect 2-3 million paired-end reads of 2x150nt from each sample.  
  4. Analysis artefacts: We've been tracking down an artefact caused by indel errors in another genome used in this experiment, but the NP strain I'm analyzing now doesn't have these errors.  However it's possible that something about this reference sequence is causing reads that align to the dip positions to misalign or be discarded.

Very few fragments in the short DNA prep were longer than about 500 bp, and few library molecules had inserts longer than 350 bp, but one of the dips in coverage extends over more than 1 kb.  This means that the sequence feature responsible for this dip must either extend over more than about 400 bp (a shorter feature could only remove coverage over roughly its own length plus one insert length on either side), or must cause loss of sequence over this distance.

One thing I noticed is that the short-fragment coverage is more severely affected than the long-fragment coverage: it falls lower and the dips extend more broadly.  This is the opposite of what I would expect if reads containing a particular sequence were unsequenceable or misaligned.  The pattern shows very clearly in the graph below, which plots the ratio of short-fragment to long-fragment coverage.  The short coverage is generally about 2-fold higher than the long coverage, but the ratio flips and falls to zero or near zero at the three low-coverage segments.

 

One thing we need to do is examine the locations of reads that were discarded because of low quality scores, either because of sequencing problems or because of ambiguous mapping (e.g. rDNA repeats).  Finding that reads at the dip positions were low quality would be a big clue.
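A sketch of that check (again assuming pysam and a sorted, indexed BAM; the window size and file/contig names are placeholders): tally what fraction of reads in each fixed window have mapping quality 0, and see whether the high-fraction windows coincide with the dips.

```python
import pysam
from collections import Counter

def q0_fraction_by_window(bam_path, contig, window=1000):
    """Fraction of mapped reads with MAPQ == 0 in each fixed-size window."""
    q0, total = Counter(), Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped:
                continue
            w = read.reference_start // window
            total[w] += 1
            if read.mapping_quality == 0:
                q0[w] += 1
    return {w: q0[w] / total[w] for w in sorted(total)}

# e.g. the first 10 kb of the NP genome (placeholder names):
# for w, frac in q0_fraction_by_window("NP_short_input.bam", "NP_genome").items():
#     if w < 10:
#         print(w * 1000, round(frac, 3))
```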

Next day:  After a Skype conversation with the former post-doc I think I understand what's going on.  The dips aren't positions of low coverage, but positions where reads have been excluded from the dataset by an analysis step designed to handle samples contaminated with DNA of the recipient strain Rd.

To remove the contaminating DNA, the reads have been mapped not just to the donor-strain genome but to a concatenated fake genome consisting of both the donor and recipient genomes.  In the 'uptake' samples, the contaminating Rd reads align to the recipient (Rd) genome, not the donor genome.  But any reads spanning segments where the donor and recipient sequences are identical map 'ambiguously' to both genomes, and these are flagged with mapping quality = 0 and excluded from the analysis.  (The same Q=0 flag is given to reads from repeated sequences (mainly the rDNA genes), because they also map to multiple positions, in this case within a single genome.)

So the dips aren't places where there is no sequencing coverage, just places where the donor and recipient genomes are identical over the length of the pair of reads.

Now the difference between the long-fragment and short-fragment samples (orange and blue dots) makes sense!  The short fragment sample gave a sequencing library with shorter inserts (shorter spanning coverage), and shorter sequence intervals are more likely to be identical between the donor and recipient genomes.
