RRResearch: limits set by sequencing errors

The post-doc and I have also been grappling with how the high error rate of the sequencing technology we'll use affects our ability to detect donor alleles in our big pool of recombinant chromosomes. The error rate is usually about 0.5-1% per position, and a misread base misleads us only when it happens to be the donor's base, so at positions where donor and recipient sequences differ, about 0.2% of the time sequencing will report a donor allele where the real molecule had a normal recipient allele. If we have 500 reads covering a position, on average one of them will appear to carry a donor allele even if there has been much less than 0.2% recombination at that position. Because errors are random, some positions with little or no recombination will have no reads reporting the donor allele, but others will have 2 or 3 or more.
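To put rough numbers on that scatter, here is a minimal sketch that treats each of the 500 reads as an independent trial with a 0.2% chance of falsely reporting the donor base. The function names are only illustrative, and the independence of reads is an assumption.

```python
from math import comb

def false_donor_pmf(k, depth=500, false_donor_rate=0.002):
    """Probability that exactly k of `depth` reads report the donor base by error alone."""
    p = false_donor_rate
    return comb(depth, k) * p**k * (1 - p)**(depth - k)

def prob_at_least(k, depth=500, false_donor_rate=0.002):
    """Probability that k or more reads report the donor base by error alone."""
    return 1.0 - sum(false_donor_pmf(i, depth, false_donor_rate) for i in range(k))

if __name__ == "__main__":
    print("expected false donor reads:", 500 * 0.002)
    for k in range(5):
        print(k, "false donor reads:", round(false_donor_pmf(k), 3))
    print("2 or more:", round(prob_at_least(2), 3))
```

Under these assumptions a bit more than a third of unrecombined positions show no donor reads at all, and about a quarter show two or more.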

At face value this sets a lower limit to the rates of recombination we'll be able to detect. If a SNP position undergoes 0.1% recombination, the number of donor alleles reported will not be significantly different from the background due to error. This limit is largely independent of the amount of sequencing we do. In principle one could do enough sequencing to even out the random fluctuations in the numbers of errors seen, so that a 0.1% recombination rate at a SNP position would be significantly above background, but 'enough' sequencing would be absurdly expensive.
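For a feel of what 'enough' might mean, here is a back-of-envelope sketch using a normal approximation to the binomial. Everything in it is an assumption for illustration: the background rate is taken as known exactly (0.2%), only a single SNP position is tested, and the significance level and power (5% one-sided, 80%) are arbitrary choices, not values from our planning.

```python
from math import ceil, sqrt
from statistics import NormalDist

def reads_needed(background=0.002, recombination=0.001, alpha=0.05, power=0.8):
    """Approximate read depth at one position needed to tell
    background + recombination apart from background alone."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided significance threshold
    z_power = NormalDist().inv_cdf(power)      # quantile for the desired power
    observed = background + recombination     # approximate donor-call rate with recombination
    sd_null = sqrt(background * (1 - background))
    sd_alt = sqrt(observed * (1 - observed))
    return ceil(((z_alpha * sd_null + z_power * sd_alt) / recombination) ** 2)

if __name__ == "__main__":
    print("reads needed at one position:", reads_needed())
```

With these defaults the answer comes out at roughly 14,000 reads covering the position, compared with the 500 used above. Tightening the significance level to allow for the many SNP positions tested, or allowing for uncertainty in the background rate itself, pushes the required depth higher still.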

Luckily for us, another post-doc (the clever partner of mine) has just pointed out that we can use the co-occurrence of donor alleles at adjacent SNP positions in a single sequencing read as evidence of recombination. That's because, once the software has excluded those reads whose sequences appear unreliable, we expect most single-nucleotide errors to occur independently of each other - that is, finding an error at one position does not change the probability of finding an error at another position in the same read. We can of course check this assumption using our control sequencing of unrecombined genomes and positions where donor and recipient are identical.

But we expect most recombination tracts to be much longer than the ~75 bp covered by a single read (we're going for these rather than the cheaper 50 bp reads). So if we see donor alleles at two SNP positions in a single read, we can be pretty sure* that they arose by one recombination event, not two coincidental sequencing errors. Of course this logic can only be applied where SNPs are close enough to be sequenced in the same read. The post-doc has now checked this, and tells me that 77% of the SNPs in the two sequences we'll use are within 50 bp of another SNP (the median separation is only 14 bp).

(*The background for such double calls becomes the square of the 0.2% false-donor rate, about 4 x 10^-6.)
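The arithmetic behind that footnote, as a small sketch. It assumes the 0.2% per-position false-donor rate from above and fully independent errors at the two SNP positions. The function names are made up for illustration.

```python
def double_error_rate(false_donor_rate=0.002):
    """Chance that two SNP positions in one read both falsely report the donor base,
    assuming the two errors are independent."""
    return false_donor_rate ** 2

def expected_false_reads(depth=500, false_donor_rate=0.002):
    """Expected number of reads (out of `depth`) with one false donor call at a given SNP,
    and with false donor calls at both SNPs of a given pair."""
    single = depth * false_donor_rate
    double = depth * double_error_rate(false_donor_rate)
    return single, double

if __name__ == "__main__":
    single, double = expected_false_reads()
    print("expected single false calls:", single)
    print("expected double false calls:", double)
```

So where 500 reads give about one false single call per SNP, they give only about 0.002 false double calls per SNP pair.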

It gets better. Because we're going to do 'paired-end' sequencing, we'll actually have two 75 bp sequences for each DNA fragment in our big recombinant pool, separated by about 400 bp of unsequenced DNA. Provided most recombination tracts are longer than 500 bp (we expect this, and will know for sure from the clones we'll sequence), seeing donor alleles at SNP positions in both end reads will also be evidence of recombination rather than coincidental random errors.
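A minimal sketch of how the sorting of fragments might look. This is not the post-doc's actual pipeline: the data layout (each read reduced to a list of (SNP position, observed base) calls, and a table mapping each SNP position to its recipient and donor bases) and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Calls = List[Tuple[int, str]]            # (SNP position, base seen in the read)
SnpTable = Dict[int, Tuple[str, str]]    # position -> (recipient base, donor base)

def donor_positions(read_calls: Calls, snps: SnpTable) -> Set[int]:
    """SNP positions at which this read reports the donor base."""
    return {pos for pos, base in read_calls if pos in snps and base == snps[pos][1]}

def classify_fragment(mate1: Calls, mate2: Calls, snps: SnpTable) -> str:
    """Label a paired-end fragment by how its donor calls are distributed."""
    d1 = donor_positions(mate1, snps)
    d2 = donor_positions(mate2, snps)
    if d1 and d2:
        return "both_mates"    # donor calls in both end reads
    if len(d1) >= 2 or len(d2) >= 2:
        return "same_read"     # two or more donor calls within one read
    if d1 or d2:
        return "single_call"   # cannot be told apart from a sequencing error
    return "none"

if __name__ == "__main__":
    # Made-up SNP table and fragment, for illustration only.
    snps = {100: ("A", "G"), 120: ("C", "T"), 600: ("G", "A")}
    print(classify_fragment([(100, "G"), (120, "T")], [(600, "G")], snps))
```

Only the "both_mates" and "same_read" fragments would count as evidence of recombination. The "single_call" ones stay in the error-dominated background.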

I expect that this analysis will be a pain to do (for the post-doc, as he's the only one with the skills to do it), but it greatly improves our ability to detect low recombination frequencies.

Field of Science
