RRResearch: Error and contamination in our uptake data

Here's a figure showing our basic uptake data; it shows how changes at individual bases affected DNA uptake from our giant pool of degenerate uptake sequence fragments. At some positions (19-23, 30, 31) changes had no effect; these positions contributed nothing to uptake specificity. At other positions there were small (1, 11-14, 18, 24, 29) or moderate (2-5, 10, 15-17, 25-28) effects. And at four positions (6-9) there were severe effects.

The evidence of what was not taken up isn't compromised by concerns about sequencing error or about contamination of the recovered (taken up) DNA with DNA that remained outside the cells or on their surfaces.

But we also want to analyze the fragments that were taken up, because these can reveal the effects of interactions between different positions, and for that we need to be sure that what we see is genuine uptake. Sequencing error and contamination will have only small impacts at the positions with small and moderate uptake effects, but they become very important when we consider what was taken up at the positions with severe uptake effects. Thus we want to know how much of the apparent residual uptake at positions 6, 7, 8 and 9 is genuine and how much is only apparent, due to either sequencing errors or contamination.

It's easiest to analyze the contributions of error and contamination when we consider the subsets of fragments that have only one difference from the perfect consensus uptake sequence (we call these 'one-off' fragments). The second version of the figure (again, waiting for Blogger to fix this new problem) shows the one-off uptake data. When only position 6, 7, 8 or 9 had a non-consensus base, uptake was reduced to 0.4%, 0.1%, 0.1% and 0.4% respectively. In principle, this apparent uptake could be entirely genuine, entirely due to sequencing errors, or entirely due to contamination, and considering each of these extreme possibilities lets us set limits to their contributions.
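As a rough illustration of what 'one-off' means operationally, here is a minimal Python sketch that lists the positions where a read differs from a consensus and tests whether a read is one-off at a given position. The function names and the 10-base `TOY_CONSENSUS` are made up for this example; the toy string is not the real uptake sequence, and this is not necessarily how our own pipeline does the classification.

```python
def mismatch_positions(read, consensus):
    """Return the 1-based positions where `read` differs from `consensus`."""
    if len(read) != len(consensus):
        raise ValueError("read and consensus must have the same length")
    return [i + 1 for i, (r, c) in enumerate(zip(read, consensus)) if r != c]


def is_one_off_at(read, consensus, position):
    """True if `read` differs from `consensus` at `position` and nowhere else."""
    return mismatch_positions(read, consensus) == [position]


# Toy consensus for illustration only (hypothetical, not the real uptake sequence).
TOY_CONSENSUS = "ACGTACGTAC"

print(mismatch_positions("ACGTACTTAC", TOY_CONSENSUS))  # differs only at position 7
print(is_one_off_at("ACGTACTTAC", TOY_CONSENSUS, 7))
print(is_one_off_at("TCGTACTTAC", TOY_CONSENSUS, 7))    # also differs at position 1
```

Counting the reads for which `is_one_off_at` is true at each position, in both the input and recovered pools, is the kind of tally the one-off uptake figures rest on.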

Consider position 7: The set of 10,000,000 sequence reads from the recovered pool contains about 225,000 perfect consensus sequences but only 215 sequences mismatched only at position 7. At one extreme, these 215 could all have arisen by errors in sequencing the perfect-consensus fragments; this would imply an error rate of about 0.1% (215 out of 225,000). At another extreme, these 215 could all be due to contamination; this would imply that about 3.6% of the DNA in the recovered pool actually came directly from the input pool. At the third extreme, these 215 could all have been taken up by the competent cells.
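The error-rate extreme is just a ratio of the two read counts given above. A minimal sketch, with variable and function names chosen for this example:

```python
PERFECT_READS = 225_000      # perfect-consensus reads in the recovered pool (approximate)
ONE_OFF_AT_7_READS = 215     # reads mismatched only at position 7


def max_error_rate(one_off_reads, perfect_reads):
    """Error rate implied if every one-off read were a misread perfect-consensus fragment."""
    return one_off_reads / perfect_reads


rate = max_error_rate(ONE_OFF_AT_7_READS, PERFECT_READS)
print(f"Upper-limit error rate at position 7: {rate:.2%}")  # roughly 0.1%
```

The contamination extreme needs one more number, the frequency of one-off-at-7 fragments in the input pool, which isn't quoted in this post, so it is left out of the sketch.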

What other evidence can constrain these extremes? The above 'upper limit' error rate of 0.1% is already quite small, at the low end of estimates of typical Illumina error rates. Our preliminary analysis of the 10 control positions in the sequenced fragments indicates a much higher (not lower) rate, but this analysis is confounded by frameshift errors that we hope to disentangle today. But I don't expect the control positions to give us a rate lower than 0.1%, so we won't be able to formally exclude the possibility that all 215 one-off-at-7 sequences arose by sequencing errors. Late-breaking data: Our collaborator, who did the sequencing, has just provided the error-frequency data from the control lane: 0.44%.

We can constrain the upper-limit analysis of contamination rates using data from uptake experiments using radiolabeled fragments. When cells were given a DNA fragment carrying a random sequence rather than a USS, uptake was reduced more than 100-fold. So contamination is expected to be less than 1%, but we don't have any direct way to test for it in the experiment that generated our sequence data.

We do have direct evidence of how well fragments mismatched only at position 7 are taken up. But this estimate is about 5%, a lot higher than the 0.1% upper limit set by the one-off sequence data.

One other analysis of our data is important here. The postdoc made logos of all of the sequences in the uptake dataset that were mismatched at any particular position, to compare to the logo for the complete uptake dataset (some are shown below). For most positions we still see the basic preferences. But in the ~75,000 sequences with position 7 mismatched, there is no information in other positions. He originally interpreted this as meaning that fragments mismatched at position 7 were taken up in a different way, but it's also entirely consistent with most of the sequences arising from contamination. We've just reexamined the raw data (without the logo analysis) and found weak signals, suggesting that some of the sequences are not contamination.

What does this error and contamination analysis tell us? Basically, for the four positions where uptake is severely affected by sequence changes, I don't think we can use the sequences in the uptake dataset to make inferences about uptake of mismatched fragments.

But we haven't yet analyzed the predicted effects of sequencing errors on the full set of sequences, only the one-offs, so maybe this will give more useful constraints. There were almost 10,000,000 sequences in the recovered dataset with the consensus base at position 7, and about 75,000 with one of the three non-consensus bases. For these 75,000 to have all arisen by sequencing errors, the error rate would have to be 0.75%. If the error rate was indeed 0.44%, then errors would account for about 44,000 of the 75,000, and uptake (or contamination) could be responsible for only about 40% of the not-consensus-at-7 sequences. However, this is a very simplistic analysis - we need to lay out all the sources and sinks to do the error analysis properly. (Later: No, this is the complete analysis - it's not overly simplistic at all.) Sequences that result from errors in sequencing of fragments with the consensus C at position 7 are expected to show the same logo as that of the full recovered dataset. The only ways to get sequences that don't show this logo are (1) by contamination or (2) by uptake if the specificity for other positions is eliminated when position 7 is not a C.
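The arithmetic behind the 0.75% and 'about 40%' figures can be laid out in a few lines of Python. This is only a sketch using the rounded counts quoted above; the names are invented for the example, and it assumes the control-lane error rate of 0.44% applies at position 7.

```python
CONSENSUS_AT_7_READS = 10_000_000   # "almost 10,000,000", rounded
MISMATCHED_AT_7_READS = 75_000      # reads with a non-consensus base at position 7
CONTROL_LANE_ERROR_RATE = 0.0044    # 0.44%, from the control lane


def split_mismatches(consensus_reads, mismatched_reads, error_rate):
    """Split mismatched reads into those expected from error and the remainder."""
    expected_from_error = error_rate * consensus_reads
    remainder = mismatched_reads - expected_from_error
    return expected_from_error, remainder


# Error rate needed for sequencing errors alone to explain every mismatched read.
required_rate = MISMATCHED_AT_7_READS / CONSENSUS_AT_7_READS

from_error, remainder = split_mismatches(
    CONSENSUS_AT_7_READS, MISMATCHED_AT_7_READS, CONTROL_LANE_ERROR_RATE
)
remainder_fraction = remainder / MISMATCHED_AT_7_READS

print(f"Error rate needed to explain everything: {required_rate:.2%}")
print(f"Reads expected from sequencing error: {from_error:,.0f}")
print(f"Left for uptake or contamination: {remainder:,.0f} ({remainder_fraction:.0%})")
```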
