Coverage and contamination in our DNA-uptake dataset

We finally have the corrected reference genome sequences for our DNA uptake deep-sequencing data.  The genome sequences from NCBI had errors and polymorphisms relative to the specific strains we used in the experiments, so the former post-doc used the sequence data from this experiment to create reference sequences that perfectly match the DNAs we used.

The bright-coloured table below shows 8 of our 16 samples.  These are the samples that examined uptake of DNA from strain 'NP' (86-028NP) into strain 'Rd'.  Not included are the equivalent 8 samples that used DNA from strain 'GG' (PittGG).  These reads have been mapped onto a reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome, and all reads that map equally well to both genomes have been removed (the reasons for this are explained below).  (The numbers in the table are from our earlier preliminary data analysis, so they will probably change a little with the new analyses.)

Issues to consider:

Reads or spanning coverage:  Should our analysis use coverage data derived from considering each sequencing read independently (what we have been doing), or from the 'spanning-coverage' segments created by considering each pair of paired-end reads plus the intervening sequence between them?


We agree that spanning coverage will be better.  The former post-doc was going to create the BED files with the spanning coverage data but he got bogged down in other work, so the grad student is now working on this.  For now we're using the read-based BED files.
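For anyone curious about what's involved, here is a minimal sketch of how spanning coverage could be computed with pysam (the filenames are placeholders, and this assumes a BAM file of properly paired reads):

    import pysam

    # Placeholder filenames.  Build per-position 'spanning coverage' by
    # counting each properly paired fragment once, from the start of its
    # leftmost read to the end of its rightmost read.
    bam = pysam.AlignmentFile("uptake_sample.bam", "rb")
    lengths = dict(zip(bam.references, bam.lengths))
    cov = {ref: [0] * n for ref, n in lengths.items()}

    for read in bam:
        # template_length > 0 picks out the leftmost read of each pair,
        # so every fragment is counted exactly once.
        if (not read.is_proper_pair or read.is_secondary
                or read.is_supplementary or read.template_length <= 0):
            continue
        start = read.reference_start
        end = min(start + read.template_length, lengths[read.reference_name])
        for pos in range(start, end):
            cov[read.reference_name][pos] += 1

    with open("spanning_coverage.bed", "w") as out:
        for ref, counts in cov.items():
            for pos, n in enumerate(counts):
                out.write(f"{ref}\t{pos}\t{pos + 1}\t{n}\n")

(bedtools genomecov with its -pc option computes essentially the same fragment coverage from the command line.)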

Mean coverage:  The numbers in the 'NP mean' column give the average number of reads from each sample that aligned to each genome position.  Mean coverage ranges from 166 reads (sample UP2) to 482 reads (sample UP9).  This is plenty for our purposes.

Variation in coverage across the genome:  Our preliminary analyses did not give us a good view of this, for complicated reasons.  In the next day or so the grad student should be generating these graphs for the 'input' samples (the UP13 and UP15 samples mapped onto NP, and the UP14 and UP16 samples mapped onto GG).  This will let us see how even the coverage is across each genome, and how reproducible the variation is between each genome's two input samples (short and long).
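These graphs are straightforward to make from a per-position BED file like the one produced by the sketch above; a minimal matplotlib version (filename hypothetical):

    import matplotlib.pyplot as plt

    positions, coverage = [], []
    with open("UP13_input_coverage.bed") as f:   # hypothetical filename
        for line in f:
            chrom, start, end, depth = line.split("\t")
            positions.append(int(start))
            coverage.append(int(depth))

    plt.plot(positions, coverage, linewidth=0.3)
    plt.xlabel("Position in NP genome (bp)")
    plt.ylabel("Coverage")
    plt.title("Input sample UP13: coverage across the NP genome")
    plt.savefig("UP13_coverage.png", dpi=200)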

Our concern is that there may be positions or segments with very low coverage (fewer than 10 reads).  Such low coverage would reduce our ability to detect the effects of DNA uptake biases, which requires dividing the mean 'uptake' coverage at each position by the 'input' coverage.  If the low coverage is due to random noise in the analysis, it will be evened out for the 'uptake' data, because we have three replicates for each DNA length.  But we have only a single input sample for each DNA length, so the divisor would still retain all the noise.
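In code, the uptake-ratio step would look something like this sketch (the coverage numbers are invented, and the 10-read cutoff is the one mentioned above):

    import numpy as np

    # Invented toy data: three uptake replicates and one input sample,
    # one coverage value per genome position.
    uptake_reps = np.array([[120, 45, 3, 210],
                            [131, 50, 5, 198],
                            [118, 41, 2, 225]], dtype=float)
    input_cov = np.array([160, 150, 4, 170], dtype=float)

    mean_uptake = uptake_reps.mean(axis=0)

    # Only compute the ratio where the input coverage is high enough to
    # trust; positions below the cutoff get NaN rather than a noisy ratio.
    MIN_INPUT = 10
    ratio = np.full(mean_uptake.shape, np.nan)
    ok = input_cov >= MIN_INPUT
    ratio[ok] = mean_uptake[ok] / input_cov[ok]
    print(ratio)   # the third position is NaN: its input coverage is only 4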

The relative roles of noise and biases will be checked by comparing the long and short input samples to each other, and the replicate uptake samples to each other.  Given our overall high coverage, the big problem may not be random events but sequence biases in the procedures used for library creation and sequencing.  Unfortunately these would not be averaged out over the three uptake replicates.
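One quick way to make these comparisons is to correlate the per-position coverages of the two samples: dips that are reproduced in both samples would point to library or sequencing bias, while dips in only one would point to noise.  A sketch with invented numbers:

    import numpy as np

    # Invented per-position coverage for the long and short input samples.
    long_input = np.array([160, 150, 30, 170, 90], dtype=float)
    short_input = np.array([155, 160, 25, 180, 85], dtype=float)

    # A high correlation (shared dips) suggests reproducible sequence bias;
    # uncorrelated scatter suggests random sampling noise.
    r = np.corrcoef(long_input, short_input)[0, 1]
    print(f"Pearson r between input samples: {r:.3f}")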

One complication is the presence in the genomes of repeated sequences, especially the six copies of the 16S and 23S ribosomal RNA genes.  The mapping software can't tell which copy a read from within these segments is derived from, so for each such read it randomly chooses one repeat copy as the mapped location, and assigns the read a mapping quality score (MAPQ) of 0.  This random assignment is not a problem when the sample consists of reads from a single genome with no contamination, but it contributes to the difficulty of eliminating reads from the Rd DNA that contaminates the uptake samples (described below).


Presence of contaminating Rd DNA in the uptake samples:  The 'uptake' samples consist of NP (or GG) DNA recovered after being taken up by Rd cells.  The recovery process is not perfect, and each sample contains a significant amount of contaminating Rd DNA, whose reads need to be removed before the effects of uptake biases can be determined.

The former post-doc found that the Rd reads could be removed by aligning each uptake sample to a fake reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome.  In this 'competitive' alignment, almost all the contaminating Rd-derived reads are mapped to the Rd 'chromosome' (where they can be ignored) and almost all the desired NP-derived reads are mapped to the NP 'chromosome' (where they can be analyzed for evidence of uptake biases).
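Building this two-'chromosome' reference is just a matter of concatenating the two genomes' FASTA files; a minimal sketch (filenames hypothetical):

    # Write both genomes into one FASTA file, so the mapper treats them as
    # two chromosomes of a single reference and each read is aligned
    # competitively to whichever genome matches it better.
    with open("Rd_NP_concatenated.fasta", "w") as out:
        for path in ("Rd_KW20.fasta", "NP_86-028NP.fasta"):
            with open(path) as f:
                text = f.read()
            out.write(text if text.endswith("\n") else text + "\n")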

Why 'almost all' rather than 'all':  Several factors will cause reads to be mapped to the wrong 'chromosome' in this contamination-removal step.
  1. Sequencing errors:  First and least important, some NP-derived reads will have sequencing errors that create the Rd alleles of Rd-NP SNPs, causing them to align better to the Rd homolog.  This will slightly raise the apparent baseline level of Rd contamination but isn't a serious problem.  The reverse will also happen, but those reads can likewise be ignored, especially because Rd-derived reads are in the minority.
  2. Repeated segments:  Second, the rRNA genes of Rd and NP have identical sequences, so when these reads are randomly assigned locations (see above), half of them will be assigned to Rd and half to NP, regardless of the actual level of contamination, and all will be given MAPQ scores = 0.  This is potentially a serious problem, since it reduces the apparent NP-derived coverage of these regions by 50%.  But because there are no uptake sequences in the rRNA genes, provided we treat the input and uptake samples identically, this will only affect our ability to see effects of uptake biases in sequences within a few hundred bp of the repeated segments.
  3. Shared segments:  Third, although NP and Rd genome sequences differ, on average, by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical. The mapping software will treat reads from these segments just like it treats reads from the rRNA repeats, randomly placing them at one of the two possible positions (in Rd or NP) and assigning them MAPQ scores = 0.
(Factors 2 and 3 are not significant right at the edges of repeated and shared segments, but become more important as the distance from the nearest SNP approaches the average read length.  Use of spanning coverage rather than single reads should improve our ability to identify and remove the contamination.)
The former post-doc handled factors 2 and 3 by removing from the analysis all reads that had MAPQ=0.  The data in the table above show the read coverages AFTER the reads with MAPQ=0 were removed.  The UP13 and UP15 input samples still show a small fraction (about 0.1%) of Rd contamination rather than 0% contamination.  This may be due to factor 1 above.
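This filter is a few lines of pysam in a script (filenames hypothetical), or 'samtools view -b -q 1' on the command line:

    import pysam

    # Hypothetical filenames.  Drop every read with MAPQ == 0, i.e. every
    # read that the mapper placed at a randomly chosen one of two or more
    # equally good locations.
    with pysam.AlignmentFile("uptake_vs_RdNP.bam", "rb") as bam, \
         pysam.AlignmentFile("uptake_vs_RdNP.filtered.bam", "wb",
                             template=bam) as out:
        for read in bam:
            if read.mapping_quality > 0:
                out.write(read)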

For the ribosomal repeats (and other repeats) removing all the MAPQ=0 reads may be valid, because we genuinely do not know where they came from.  But it's more complicated for the shared segments.  If uptake coverage were even over the genome, we'd know that most of the reads came from NP.  But uptake coverage is not even, especially for the 'short' DNA samples (that's the whole point of the experiment).  Positions close to uptake sequences will have high NP coverage, and only a small contribution from Rd contamination.  Positions far from uptake sequences will have very low NP coverage, so the even contribution from the Rd contamination will have a big effect.
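A toy calculation (all numbers invented) shows the size of this effect.  Suppose the even Rd contamination contributes 5 reads' worth of coverage at every position in a shared segment, on top of the true NP uptake coverage:

    # Invented numbers: even Rd contamination adds 5 reads of coverage at
    # every shared position, on top of the true NP uptake coverage.
    contamination = 5

    for true_np in (500, 50, 5):
        observed = true_np + contamination
        inflation = 100 * (observed - true_np) / true_np
        print(f"true NP coverage {true_np:4d} -> observed {observed:4d} "
              f"({inflation:.0f}% inflation)")

Near an uptake sequence (coverage 500) the distortion is only 1%, but far from one (coverage 5) the observed coverage is double the true value.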



