The postdoc gave me the actual numbers for the fragments with single mismatches at position 7: The input set contained 5940 of these, and the recovered (uptake) set had only 215. If we hypothesize that all of these 215 arise from contamination, then 3.6% of the fragments in the recovered pool come directly from the input pool. Because we know the exact sequence distribution of the uptake pool fragments (we sequenced 10^7 of them) we can correct the distributions in various subsets of the recovered pool for this possible contamination.
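As a quick check on where the 3.6% comes from, the arithmetic can be sketched in a few lines of Python. This assumes that the two counts are directly comparable (taken from equally sized samples of the input and recovered pools) and that none of the position-7 mismatch fragments were genuinely taken up. The variable names are only for illustration.

```python
# Counts of fragments with a single mismatch at position 7 (from the post).
input_count = 5940      # in the input pool
recovered_count = 215   # in the recovered (uptake) pool

# Assumption: if every one of the 215 recovered fragments is contamination,
# this ratio is the largest fraction of the recovered pool that could have
# come straight from the input pool.
fraction = recovered_count / input_count

print(f"Upper limit on contamination: {fraction:.1%}")
```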
The plan is to do the main analyses with and without the correction. We don't actually know how much contamination there is, but 3.6% is the upper limit. Any results that don't change when the correction is applied are robust.
The analysis I'm most concerned about is the test for interactions between bases at different positions in the uptake sequence. The measure of interactions between positions that don't have big effects on uptake is likely to be robust, as these samples are large and removing 3.6% is unlikely to make much difference. For positions with very strong effects (6, 7, 8 and 9), the contamination correction will definitely reduce the ability to detect any interactions (because the sample size will get much smaller)...
What we see when we ignore possible contamination: When all the sequences with a mismatch at a weak position (e.g. 5) are aligned, we see an increase in the importances of some other positions, and we think this means that the effects of the positions are interdependent. But when all the sequences with a mismatch at a very strong position (e.g. 8) are aligned, we see that the importances of the other positions all shrink dramatically. This could mean that when base 8 is incorrect the DNA is taken up by some sequence-independent process, or that the fragments with incorrect base 8 contain out-of-alignment uptake sequences that our analysis overlooked (we know this occurs). But it could also mean that the fragments with incorrect base 8 were not taken up at all, but entered the recovered pool as contamination. So we need to correct for the maximum possible contamination (3.6%) and see how the importances change.
How should the corrections be done? We have position-weight matrices for the recovered and input pools, and for each subset of these data (e.g. for all fragments with mismatches at position 5, or 8, or 14). We think that, to correct a recovered-pool matrix for contamination, we just need to subtract from each of its values 3.6% of the corresponding value in the corresponding input-pool matrix. This is easy, but when the postdoc tried it he sometimes got negative numbers (whenever 3.6% of an input value was larger than the recovered value). He thinks this means we need to use a more complicated calculation, but I wonder if it just means that, at this position of the matrix, the corrected value is indistinguishable from zero. We both think that it might be wise to consult a mathematician at this point.
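The simple subtraction described above can be sketched as follows. Everything here is illustrative: the matrices are made-up numbers rather than our data, the function name `correct_matrix` is hypothetical, and the sketch assumes both matrices hold counts on the same scale (rows are positions; columns are A, C, G, T). Setting negative results to zero is only one possible treatment, corresponding to the "indistinguishable from zero" reading; it is not a vetted statistical method.

```python
CONTAMINATION = 0.036  # upper limit on contamination, from the position-7 counts


def correct_matrix(recovered, input_pool, fraction=CONTAMINATION):
    """Subtract `fraction` of each input-pool value from the recovered value.

    Returns the corrected matrix and the (row, column) indices where the
    raw subtraction went negative and was set to zero (an assumption).
    """
    corrected = []
    clipped = []
    for r, (rec_row, in_row) in enumerate(zip(recovered, input_pool)):
        new_row = []
        for c, (rec, inp) in enumerate(zip(rec_row, in_row)):
            value = rec - fraction * inp
            if value < 0:
                clipped.append((r, c))  # record where the correction overshoots
                value = 0.0
            new_row.append(value)
        corrected.append(new_row)
    return corrected, clipped


# Hypothetical counts for two positions (rows) and four bases (columns).
recovered = [[900, 50, 30, 20],
             [20, 5, 10, 2]]
input_pool = [[250, 250, 250, 250],
              [400, 300, 200, 100]]

corrected, clipped = correct_matrix(recovered, input_pool)
print(corrected)
print("Entries set to zero:", clipped)
```

Keeping the list of clipped entries makes it easy to see how often the correction overshoots, which is the information a mathematician would probably want before suggesting anything more elaborate.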