Field of Science

Two steps forward, one step back (the postdoc's uptake bias paper)

The postdoc's manuscript on uptake bias is inching towards completion.  He's added most of the references and updated the figures, and we've discovered only one new analysis that needed to be done.  Including this analysis at the right place in the Results makes writing the rest of the Results a lot more straightforward, so we're ahead of the game.

What is this analysis?  Removing, from our dataset of 10^7 sequence reads of DNA fragments that the competent cells took up, some sequences that may have been interpreted incorrectly.  The incorrect interpretation happens because the sequence responsible for their uptake isn't correctly aligned in our analysis.  Here's a figure explaining the problem:

The top sequence is the consensus of the fragment we used.  The lower-case bases at each end were not degenerate and function as controls.  The first step in the analysis was to align each sequence read to this consensus at its left end, and below the consensus we see three correctly aligned reads, with their core uptake sequence indicated by the yellow arrows.

Below these are two reads that were misaligned because they contained either an insertion or a deletion of a single base.  We think these insertions and deletions arose during synthesis of the pool of degenerate fragments.  Although these fragments still contain good uptake sequences (red arrows), the incorrect alignment doesn't recognize this.  Instead, the fragments appear to have been taken up despite having very poor agreement with the consensus. 
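One way to picture how such reads could be flagged is to score the core not only at its expected position but also one base to either side. The sketch below is only an illustration, not the postdoc's actual pipeline: the core sequence `ACGGTCAAT`, the expected start position, and the function names are all made-up placeholders.

```python
# Minimal sketch: find the offset (-1, 0, +1) at which a read best matches the core.
# CORE and EXPECTED_START are hypothetical placeholders, not the real fragment design.

CORE = "ACGGTCAAT"      # toy core motif (assumption)
EXPECTED_START = 4      # toy position of the core in a correctly aligned read (assumption)


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_core_offset(read: str, core: str = CORE,
                     expected_start: int = EXPECTED_START,
                     max_shift: int = 1):
    """Return (shift, mismatch_count) for the best-scoring core position.

    Shifts are tried in order of increasing distance from 0, and a shifted
    position only wins if it has strictly fewer mismatches, so ties keep
    the original alignment.
    """
    best = None
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        start = expected_start + shift
        window = read[start:start + len(core)]
        if start < 0 or len(window) < len(core):
            continue  # window falls off the end of the read
        score = mismatches(window, core)
        if best is None or score < best[1]:
            best = (shift, score)
    return best


aligned = "GGCC" + CORE + "TTAA"      # core where we expect it
inserted = "GGCTC" + CORE + "TTAA"    # one extra base upstream of the core
deleted = "GGC" + CORE + "TTAA"       # one base missing upstream of the core

for name, read in [("aligned", aligned), ("inserted", inserted), ("deleted", deleted)]:
    print(name, best_core_offset(read))
```

A read whose best offset is nonzero has a good uptake sequence that the left-anchored alignment would have scored as a poor one.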

Below these misaligned reads is a sequence that is correctly aligned but that contains a second match to the core consensus, indicated by the green arrow.  This second match was created by several changes downstream of the consensus uptake sequences, but it isn't recognized by the analysis because it is out of alignment and, in this case, in the other orientation.  When a fragment carries two uptake sequences, we can't attribute its uptake solely to the one that's correctly aligned.
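Finding such second matches means scanning the whole read, on both strands, rather than looking only at the aligned position. Here's a rough sketch of that idea, again using a made-up core (`ACGGTCAAT`) and hypothetical function names rather than anything from our real analysis, which scored fragments against the uptake motif instead of counting mismatches.

```python
# Minimal sketch: list every position, in either orientation, where a read
# matches the core with at most `max_mismatches` differences.
# CORE is a hypothetical placeholder motif.

CORE = "ACGGTCAAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def core_hits(read: str, core: str = CORE, max_mismatches: int = 0):
    """Return a list of (start, strand, mismatch_count) for each match."""
    hits = []
    for strand, motif in (("+", core), ("-", revcomp(core))):
        for start in range(len(read) - len(motif) + 1):
            window = read[start:start + len(motif)]
            score = sum(a != b for a, b in zip(window, motif))
            if score <= max_mismatches:
                hits.append((start, strand, score))
    return hits


single = "GGCC" + CORE + "TTAA"
double = "GGCC" + CORE + "TTAA" + revcomp(CORE) + "GG"

print(core_hits(single))   # one hit at the aligned position
print(core_hits(double))   # a second hit downstream, on the other strand
```

Any read with more than one hit would be set aside, since its uptake can't be assigned to a single sequence.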

Sequences with these artefacts couldn't be removed from the dataset before the original analysis, because they couldn't be identified until we were able to score each fragment for matches to the 'uptake motif' that the initial analysis produced.  Now that we've identified them, we can consider whether they would have confounded any of the analyses.  

The main concern is the reads with insertions or deletions.  Because the initial filtering required that the 10 control bases all be perfectly matched, most of these were removed, and the 10^7 recovered reads we analyzed included only about 1500 (roughly 0.015%) with insertions or deletions that misaligned the core.  That's too few to have misled the initial analysis, but it is a concern for the analyses of possible contamination and sequencing errors, and for the analysis of interaction effects.  The postdoc has now finished checking for effects on the interaction analysis (none) and still needs to check for contamination and error effects.
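To see why the control-base filter catches most indel reads, here's a toy version of it. The control sequences, their five-and-five split between the two ends, and the fragment length are all assumptions made for illustration; only the requirement that every control base match perfectly comes from the real analysis.

```python
# Minimal sketch of a perfect-match control-base filter.
# LEFT_CTRL, RIGHT_CTRL and the toy fragment layout are hypothetical.

LEFT_CTRL = "GGCCA"     # toy fixed bases at the left end (assumption)
RIGHT_CTRL = "TTAAG"    # toy fixed bases at the right end (assumption)
CORE = "ACGGTCAAT"      # toy stand-in for the degenerate region (assumption)
FRAGMENT_LEN = len(LEFT_CTRL) + len(CORE) + len(RIGHT_CTRL)


def passes_control_filter(read: str) -> bool:
    """True only if both control blocks match exactly at their expected positions.

    Reads are anchored at the left end, so the right-hand controls are
    checked at fixed coordinates counted from the start of the read.
    """
    if len(read) < FRAGMENT_LEN:
        return False
    right_start = FRAGMENT_LEN - len(RIGHT_CTRL)
    return (read.startswith(LEFT_CTRL)
            and read[right_start:FRAGMENT_LEN] == RIGHT_CTRL)


clean = LEFT_CTRL + CORE + RIGHT_CTRL
with_insertion = LEFT_CTRL + "T" + CORE + RIGHT_CTRL   # one extra base before the core

print(passes_control_filter(clean))            # passes
print(passes_control_filter(with_insertion))   # right-hand controls are shifted, so it fails
```

In this toy layout an insertion or deletion between the two control blocks shifts the right-hand controls out of place, so the read normally fails the filter.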

2 comments:

  1. Truly fascinating Rosie! Where do you hope to submit this beautiful work?
