This is why we must ALWAYS do the controls.
The graph to the left (dark red bars) is the first part of my control analysis. It shows the distribution of mismatches across 1677 random 39 bp sequences from the H. influenzae Rd genome, aligned with their best matches in the genome of another H. influenzae strain called PittEE.
It has some disturbing similarities to the distribution of mismatches in the graph I posted yesterday (shown here in a different version, with the blue bars). Both have more mismatches at the edges, and both have none at the central position.
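For anyone who wants to make the same kind of per-position tally, here is a minimal sketch of how it could be done. It is not the script I used; the function name `mismatch_profile` and the toy sequences are made up for illustration, and it assumes each alignment is already available as a pair of gap-free strings of equal length (39 bp in my case).

```python
def mismatch_profile(pairs, length=39):
    """Count mismatches at each position across gap-free aligned pairs.

    `pairs` is an iterable of (query, hit) strings. This is a hypothetical
    helper for illustration, not part of BLAST or any library.
    """
    counts = [0] * length
    for query, hit in pairs:
        # Skip anything that is not a full-length alignment.
        if len(query) != length or len(hit) != length:
            continue
        # Skip gapped alignments, since gaps shift the position numbering.
        if "-" in query or "-" in hit:
            continue
        for i, (a, b) in enumerate(zip(query.upper(), hit.upper())):
            if a != b:
                counts[i] += 1
    return counts


# Toy 5 bp "alignments" (invented data, not real H. influenzae sequence).
toy_pairs = [
    ("ACGTA", "ACGTT"),  # mismatch at the last position
    ("ACGTA", "TCGTA"),  # mismatch at the first position
    ("ACGTA", "ACGTA"),  # perfect match
]
profile = mismatch_profile(toy_pairs, length=5)
print(profile)
```

Running the same function on the control alignments and on the USS alignments gives two lists that can be plotted side by side, position by position.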
My interpretation is that the BLAST searches I'm using to do the alignments have a bias favouring mismatches at the edges and precluding any mismatches at the central position (the latter might be because I set the E-value cutoff too low). So I need either to do the searches differently, in a way that eliminates these biases, or to find a way to correct my USS-sequence results for them.
I've just sent a detailed email off to the BLAST help desk, attaching both of the above figures, and asking if they can suggest changes to my search parameters or a reference document I should read.