The postdoc derailed our consideration of contamination in his 'uptake' pool of DNA fragments by raising the issue of errors in the Illumina sequencing. We had discussed this issue long ago, before we had any data, and then forgot about it in the rush to analyze the results. How embarrassing!
The expected level of sequencing errors is somewhere between 0.1% and 1%. We have two sets of estimates from our data, but they're very discordant.
One set of estimates comes from the frequency of sequences in the uptake pool that differ from the 225158 perfect consensus sequences at only one of the 31 degenerate positions. At the positions that are most important for uptake, positions 7 and 8, there are only 215 and 156 such fragments. If we make the extreme assumption that they all arose by sequencing errors of perfect-consensus fragments, the error rate must be less than 0.1%. (If we allow some contamination and/or some uptake of the mismatched fragments, the error rate would be even lower.)
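The arithmetic behind that upper bound can be sketched in a few lines. This is only a rough check using the counts quoted above; the function name `max_error_rate` is made up for illustration, and dividing by the perfect-consensus count alone (rather than by perfect plus mismatched reads) is a simplifying assumption that barely changes the answer at these numbers.

```python
# Counts quoted in the post.
PERFECT_CONSENSUS = 225158           # reads matching the consensus at all 31 positions
SINGLE_MISMATCH = {7: 215, 8: 156}   # reads differing only at position 7 or only at 8


def max_error_rate(mismatch_count, perfect_count=PERFECT_CONSENSUS):
    """Upper bound on the per-position error rate.

    Assumes (as the extreme case) that every single-mismatch read is a
    perfect-consensus fragment that was misread at that one position.
    """
    return mismatch_count / perfect_count


for position, count in SINGLE_MISMATCH.items():
    rate = max_error_rate(count)
    print(f"position {position}: error rate at most {rate:.4%}")
```

Both bounds come out a little under 0.1%, with position 8 lower than position 7.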
The other set of estimates comes from the control non-degenerate bases that precede (4) and follow (5) the degenerate sequence. We know what the base should be at these positions, so we can just count the differences. These are shockingly high; for different positions they range from 0.5% to 9.1%. Because of a weird pattern in the identity of the error bases, we suspect that these values have been confounded by misalignment problems, arising because the oligo synthesis or the sequencing erroneously skipped one or more positions. We'll try to sort this out this morning by looking directly at the non-degenerate positions in a few of these reads. If the differences at the 10 control positions are really due to base-identification errors we should see them in almost half of the reads.
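To see what the control-position rates would imply per read, here is a minimal sketch. It assumes errors at different positions are independent, and because the individual per-position values aren't listed here, it only brackets the answer by setting all 10 positions to the lowest (0.5%) or the highest (9.1%) observed rate. The function name `fraction_with_error` is hypothetical.

```python
def fraction_with_error(per_position_rates):
    """Expected fraction of reads with at least one error at the given positions.

    Assumes errors at different positions occur independently, so the chance
    that a read is clean everywhere is the product of (1 - rate).
    """
    p_clean = 1.0
    for rate in per_position_rates:
        p_clean *= 1.0 - rate
    return 1.0 - p_clean


N_CONTROL = 10  # number of control positions, as stated in the post

# Bracketing cases only: every position at the lowest or highest observed rate.
low = fraction_with_error([0.005] * N_CONTROL)
high = fraction_with_error([0.091] * N_CONTROL)

print(f"all positions at 0.5%: {low:.1%} of reads affected")
print(f"all positions at 9.1%: {high:.1%} of reads affected")
```

Feeding in the actual list of per-position rates instead of the two bracketing cases would give the expected fraction directly, which can then be compared with what we see in the reads.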