The data was generated by Illumina sequencing of genomic DNA samples before and after they had been taken up by competent cells. Using a rec2 mutant as the competent cells lets us recover the DNA intact after uptake.
He has done quite a bit of analysis of the resulting uptake ratios (ratio of 'recovered' coverage to 'input' coverage) but now we're going back and sorting out various anomalies that we initially ignored or overlooked.
One big issue is the surprisingly uneven sequencing coverage of the genome that we see even when just sequencing sheared genomic DNA (the 'input' samples). The graph below shows the sequencing coverage of the first 10 kb of the genome of strain 'NP'. The orange dots are from a short-fragment DNA prep (sheared, most fragments between 50 and 500 bp) and the blue dots are from a large-fragment DNA prep (sheared, most fragments between 1-5 and 9 kb).
Over most of this segment (and over most of the ~1900 kb genome), coverage of the 'short' sample is about 200-400 reads, about twice as high as the ~150-250 read coverage of the 'long' sample. But in three places coverage of both samples falls to near-zero or zero. Similar losses of coverage occur at many places throughout the genome, though this segment has more than usual.
What could cause this? In principle it could be something about the DNA preparations, about the library construction, about the actual sequencing, or about the data processing and alignments to the reference genomes.
- DNA prep artefacts: Long and short DNAs come from the same initial genomic DNA prep. They were sheared differently: a Covaris G tube for the long-fragment prep, sonication for the short-fragment prep. Perhaps there are places in the genome where breaks always occur?
- Library construction artefacts: This used Illumina Nextera XT kits (a transposon method). Might there be strong sequence biases to this process? The distribution of the spanning coverage of the reads indicates that the short library inserts were mostly 100-350 nt (mean about 200 nt) and the long library inserts were mostly 200-500 nt (mean about 325 nt).
- Sequencing artefacts: The former post-doc used an Illumina NextSeq500 to collect 2-3 million paired-end reads of 2x150nt from each sample.
- Analysis artefacts: We've been tracking down an artefact caused by indel errors in another genome used in this experiment, but the NP strain I'm analyzing now doesn't have these errors. However it's possible that something about this reference sequence is causing reads that align to the dip positions to misalign or be discarded.
Very few fragments in the short DNA prep were longer than about 500 bp, and few library molecules had inserts longer than 350 bp, but one of the dips in coverage extends over more than 1 kb. This means that the sequence feature responsible for this dip must either extend over more than 400 bp, or must cause loss of sequence over this distance.
One thing I noticed is that the short coverage is more severely affected than the long coverage. It falls lower and extends broader. This is the opposite of what I would expect if reads containing a particular sequence were unsequenceable or misaligned. This pattern shows very clearly in the graph below, which shows the ratio of coverage by the long-fragment and short-fragment samples. We see that the long coverage is generally about 2-fold higher than the short coverage, but that the ratio flips and falls to zero or near zero at the three low-coverage segments.
One thing we need to do is examine the locations of reads that were discarded because of low quality scores, either because of sequencing problems or because of ambiguous mapping (e.g. rDNA repeats). Finding that reads at the dip positions were low quality would be a big clue.
Next day: After a Skype conversation with the former post-doc I think I understand what's going on. The dips aren't positions of low coverage, but positions were reads have been excluded from the dataset because of planned analyses using samples contaminated with DNA of the recipient strain Rd.
To remove the contaminating DNA, the reads have been mapped not just to the donor strain genome but to a concatenated fake genome consisting of both the donor and recipient genomes. In the 'uptake' samples, the contaminating Rd reads align to the recipient (Rd) genome, not the donor genome. But any reads spanning segments where the donor and recipient sequences are identical map 'ambiguously' to both genomes, and these are flagged as 'data quality = 0' and excluded from the analysis. (The same 0 flag is given to repeated sequences (mainly the rDNA genes) because they also map to multiple positions, this time within a single genome.)
So the dips aren't places where there is no sequencing coverage, just places where the donor and recipient genomes are identical over the length of the pair of reads.
Now the difference between the long-fragment and short-fragment samples (orange and blue dots) makes sense! The short fragment sample gave a sequencing library with shorter inserts (shorter spanning coverage), and shorter sequence intervals are more likely to be identical between the donor and recipient genomes.