While there I sat down with my former post-doc to discuss our manuscript on uptake sequence variation. We agreed that it needed major reorganization more urgently than it needed simple editing and polishing, so we worked out a new structure that we think will hold the ideas together much better. I made some new figures (cartoons of our explanation and of how our simulation model works) and rearranged the text into its new order, but I needed a paper copy to do any serious editing. Now that I'm home I've printed out the rearranged draft and am hoping to do one quick pass through it and then set it aside till our grant proposals are done.
But of course I immediately got distracted by the data. I want to be able to say something about how our new analysis of uptake sequences as motifs gives us insight into their evolution. The most promising angle is the distribution of good and poor matches to the motif, which I can analyze because the Gibbs Motif Sampler assigns a score to each occurrence it finds.
I dug out a set of 4646 DUSs Gibbs found in the N. meningitidis genome, and sorted them by score. And now I've spent a lot of time trying to force Excel to draw a proper histogram. I get the histogram values using a math teachers' web site called Illuminations, and paste them into Excel, but Excel refuses to use the data ranges as the X-axis (instead using its line numbers). I've found a work-around; the graph is very ugly but here it is. The red bars are the numbers of DUSs in each score range (0-0.02, 0.02-0.04, etc.) and the lilac bars are the cumulative numbers with increasing scores. So about 0.5% of the DUSs have zero scores, very few have non-zero scores lower than 0.5, about 2% have scores in each category from 0.5 to 0.98, and more than 40% have scores greater than 0.98.
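For anyone who would rather skip the Excel wrestling, the binning itself is easy to do in a few lines of Python. This is only a minimal sketch, not what I actually did: it assumes the scores all lie between 0 and 1, and the function name score_histogram and the example scores are made up for illustration.

```python
def score_histogram(scores, bin_width=0.02):
    """Count scores in equal-width bins over [0, 1] and keep a running total.

    Assumes every score lies between 0 and 1 (an assumption, not something
    checked against real Gibbs output). Each bin includes its lower edge;
    a score of exactly 1.0 is placed in the top bin.
    Returns a list of (low, high, count, cumulative_count) tuples.
    """
    n_bins = round(1.0 / bin_width)
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # clamp 1.0 into the last bin
        counts[idx] += 1

    rows = []
    running = 0
    for i, c in enumerate(counts):
        running += c
        rows.append((i / n_bins, (i + 1) / n_bins, c, running))
    return rows


if __name__ == "__main__":
    # Made-up scores, purely to show the output format.
    example_scores = [0.0, 0.01, 0.03, 0.55, 0.99, 1.0]
    for low, high, count, cumulative in score_histogram(example_scores):
        if count:
            print(f"{low:.2f}-{high:.2f}\t{count}\t{cumulative}")
```

The tab-separated output already carries the bin ranges as text labels, so it can be pasted into a spreadsheet or handed to any plotting tool without the X-axis problem.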
I made weblogos for the top-scoring 50% and bottom-scoring 50% of the DUS occurrences (my previous analysis had only looked at the high-scoring ones). Here they are. The bottom 50% logo isn't evenly weak at all positions; instead it's quite strong at some positions and very much weaker at others. I don't know what I'm going to do with this analysis... I guess I could make a range of logos, maybe for the 10 deciles (is that the word I want?), to see how the consensus decays. And I should probably do the same thing for the H. influenzae USSs.
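The decile idea could be sketched roughly like this in Python. It is an illustration under assumptions, not a finished analysis: it assumes the sites are available as (score, sequence) pairs of equal-length DNA sequences, the function names are hypothetical, and the information content is the plain 2 minus entropy value in bits, without the small-sample correction that logo tools may apply.

```python
import math
from collections import Counter


def split_into_groups(scored_sites, n_groups=10):
    """Sort (score, sequence) pairs by score and cut them into n_groups
    nearly equal-sized groups, lowest scores first."""
    ordered = sorted(scored_sites, key=lambda pair: pair[0])
    n = len(ordered)
    groups = []
    for g in range(n_groups):
        start = g * n // n_groups
        end = (g + 1) * n // n_groups
        groups.append(ordered[start:end])
    return groups


def information_per_position(sequences):
    """Information content (bits) at each position of aligned DNA sequences,
    computed as 2 - Shannon entropy, with no small-sample correction."""
    length = len(sequences[0])
    info = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        info.append(2.0 - entropy)
    return info


if __name__ == "__main__":
    # Made-up toy data: 20 sites with invented scores and sequences.
    toy_sites = [(i / 20, "ACGT" if i % 3 else "ACGA") for i in range(20)]
    for number, group in enumerate(split_into_groups(toy_sites), start=1):
        seqs = [seq for _, seq in group]
        bits = information_per_position(seqs)
        print(number, " ".join(f"{b:.2f}" for b in bits))
```

Printing the per-position bits for each group side by side would show which positions lose their consensus first as the scores drop, which is what the stack of logos would show graphically.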