The postdoc's manuscript on uptake bias is inching towards completion. He's added most of the references and updated the figures, and we've only discovered one new analysis that needed to be done. But including this analysis at the right place in the Results makes writing the rest of the Results a lot more straightforward, so we're ahead of the game.
What is this analysis? It removes, from our dataset of 10^7 sequence reads of DNA fragments that the competent cells took up, some sequences that may have been interpreted incorrectly. The incorrect interpretation happens because the sequence responsible for a fragment's uptake isn't correctly aligned in our analysis. Here's a figure explaining the problem:
Below these are two reads that were misaligned because they contained either an insertion or a deletion of a single base. We think these insertions and deletions arose during synthesis of the pool of degenerate fragments. Although these fragments still contain good uptake sequences (red arrows), the incorrect alignment doesn't recognize this. Instead, the fragments appear to have been taken up despite having very poor agreement with the consensus.
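To make the indel problem concrete, here's a minimal sketch of how a one-base shift could be detected. It is an illustration, not the script we actually used: the core sequence `CORE`, the `expected_start` position and the function names are all placeholders I've made up for the example. The idea is simply to score the core at its expected position and at one base to either side, and see which offset agrees best.

```python
# Sketch only: CORE, expected_start and the function names are hypothetical
# placeholders, not values or code from the real analysis.
CORE = "AAGTGCGGT"


def count_mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_shift(read, core, expected_start, max_shift=1):
    """Return (shift, mismatches) for the offset near expected_start
    whose window agrees best with the core; ties go to the smaller shift."""
    best = None
    for shift in range(-max_shift, max_shift + 1):
        start = expected_start + shift
        window = read[start:start + len(core)]
        if start < 0 or len(window) < len(core):
            continue  # window would run off the end of the read
        score = count_mismatches(window, core)
        if (best is None or score < best[1]
                or (score == best[1] and abs(shift) < abs(best[0]))):
            best = (shift, score)
    return best


aligned = "CCCCC" + CORE + "TTTTT"
with_insertion = "CCCCC" + "G" + CORE + "TTTTT"  # one extra base upstream
with_deletion = "CCCC" + CORE + "TTTTT"          # one base missing upstream

for name, read in [("aligned", aligned),
                   ("insertion", with_insertion),
                   ("deletion", with_deletion)]:
    print(name, best_shift(read, CORE, expected_start=5))
```

A read whose best shift is nonzero, and whose score at that shift is much better than at shift 0, is a candidate for having a good uptake sequence that the fixed alignment missed.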
Below these misaligned reads is a sequence that is correctly aligned but that contains a second match to the core consensus, indicated by the green arrow. This second match was created by several changes downstream of the consensus uptake sequence, but it isn't recognized by the analysis because it is out of alignment and, in this case, in the other orientation. The presence of two uptake sequences means that we can't attribute the fragment's uptake to the one sequence that's correctly aligned.
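Finding such second matches means scanning each read at every position and in both orientations, rather than only at the aligned position. Here's a rough sketch of that idea. Again, the core sequence, the mismatch threshold and the function names are assumptions for illustration, not the scoring against the uptake motif that we actually did.

```python
# Sketch only: CORE, max_mismatches and the function names are hypothetical.
CORE = "AAGTGCGGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_core_matches(read, core, max_mismatches=1):
    """List (start, strand, mismatches) for every window of the read that
    matches the core, or its reverse complement, within max_mismatches."""
    targets = (("+", core), ("-", reverse_complement(core)))
    hits = []
    for start in range(len(read) - len(core) + 1):
        window = read[start:start + len(core)]
        for strand, target in targets:
            mm = sum(a != b for a, b in zip(window, target))
            if mm <= max_mismatches:
                hits.append((start, strand, mm))
    return hits


# A made-up read with the core at the aligned position and a second copy
# downstream in the other orientation.
read = "CC" + CORE + "GGAT" + reverse_complement(CORE) + "AA"
hits = find_core_matches(read, CORE, max_mismatches=0)
print(hits)
has_second_match = len(hits) > 1
```

Any read with more than one hit would be flagged, since its uptake can't be credited to the aligned sequence alone.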
Sequences with these artefacts couldn't be removed from the dataset before the original analysis, because they couldn't be identified until we were able to score each fragment for matches to the 'uptake motif' that the initial analysis produced. Now that we've identified them, we can consider whether they would have confounded any of the analyses.
The main concern is the reads with insertions or deletions. Because the initial filtering required that the 10 control bases all be perfectly matched, most of these were removed, and the 10^7 recovered reads we analyzed included only about 1500 with insertions or deletions that misaligned the core. That's too few to have misled the initial analysis, but they could still matter for the analyses of possible contamination and sequencing errors, and for the analysis of interaction effects. The postdoc has now finished checking for effects on the interaction analysis (none) and still needs to check for contamination and error effects.
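Here's a toy sketch of why the control-base filter already caught most of the indel reads. The layout is invented for the example: I've assumed five control bases on each side of a 10-base variable region, and the positions and bases in `CONTROL_BASES` are placeholders, not the real fragment design. An insertion or deletion upstream of any control base shifts that base out of position, so the read fails the perfect-match requirement.

```python
# Sketch only: the positions and bases below are a hypothetical layout,
# not the actual design of the degenerate fragments.
CONTROL_BASES = {0: "A", 1: "C", 2: "G", 3: "T", 4: "A",
                 15: "T", 16: "G", 17: "C", 18: "A", 19: "G"}


def passes_control_filter(read, control_bases=CONTROL_BASES):
    """True only if every control position holds exactly the expected base."""
    return all(pos < len(read) and read[pos] == base
               for pos, base in control_bases.items())


good = "ACGTA" + "AAGTGCGGTT" + "TGCAG"        # 10-base variable region
inserted = "ACGTA" + "AAGTGGCGGTT" + "TGCAG"   # one extra base in the region
deleted = "ACGTA" + "AAGTGCGGT" + "TGCAG"      # one base missing

for name, read in [("good", good), ("inserted", inserted), ("deleted", deleted)]:
    print(name, passes_control_filter(read))
```

For scale, 1500 surviving reads out of 10^7 is 0.015% of the dataset.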