Here are the logos for the N. meningitidis and H. influenzae uptake sequences after sorting the occurrences by the scores that the Gibbs motif Sampler assigned them. (I'm pretty sure that each score is a measure of how well that occurrence's sequence matches the position weight matrix that Gibbs determined for this data set, but I don't know how the calculation is done.)
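For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of the binning step: group the occurrences by score, then compute the per-position information content (the total stack height in a logo) for each group. The function names `bin_by_score` and `information_content`, the bin edges, and the toy sequences are all made up for illustration. This is not how the Gibbs Motif Sampler computes its scores, and it omits the small-sample correction that logo programs usually apply.

```python
import math
from collections import Counter


def information_content(seqs, alphabet="ACGT"):
    """Per-position information in bits for a list of equal-length sequences.

    Uses log2(alphabet size) minus the Shannon entropy of the observed base
    frequencies, with no small-sample correction (an assumption of this sketch).
    """
    if not seqs:
        return []
    max_bits = math.log2(len(alphabet))
    bits = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            # No recognised bases at this position (e.g. all N): report 0 bits.
            bits.append(0.0)
            continue
        entropy = 0.0
        for b in alphabet:
            if counts[b]:
                p = counts[b] / total
                entropy -= p * math.log2(p)
        bits.append(max_bits - entropy)
    return bits


def bin_by_score(occurrences, edges):
    """Group (sequence, score) pairs into score bins.

    Bin i holds scores with edges[i] <= score < edges[i + 1]; the last bin
    also includes its upper edge, so a perfect score of 1.0 is not dropped.
    """
    n_bins = len(edges) - 1
    bins = [[] for _ in range(n_bins)]
    for seq, score in occurrences:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= score < edges[i + 1] or (is_last and score == edges[-1]):
                bins[i].append(seq)
                break
    return bins


# Toy data only: hypothetical 4-bp "sites" with made-up scores.
toy = [("AAGT", 1.0), ("AAGT", 1.0), ("ACGT", 0.6), ("AGGT", 0.6), ("TTGT", 0.0)]
toy_bins = bin_by_score(toy, edges=[0.0, 0.5, 0.95, 1.0])
toy_bits = [information_content(b) for b in toy_bins]
```

Each entry of `toy_bits` is one logo's worth of column heights, so comparing the lists across bins shows which positions lose consensus as the scores drop.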
The top set shows the logos for 5381 N. meningitidis DUSs. The numbers differ from those in yesterday's post because I realized I had been analyzing a N. gonorrhoeae data set. The overall picture is the same for N. meningitidis and N. gonorrhoeae - low-scoring DUSs retain a strong consensus at most of the central positions but have only a very weak consensus at the other positions. The drop-off is quite steep. The shapes of the logos are about the same for all the occurrences with scores lower than about 0.95.
The H. influenzae dataset is even more skewed; almost 60% of the USSs have perfect scores, and about 8% have zero scores. But the consensus decays fairly evenly across the positions, and even the zero-score occurrences have the full motif. As with the N. meningitidis DUSs, the shapes of the USS logos are about the same for all occurrences with scores below 0.95.
I think the question in my mind was whether there is an obvious place to draw a line between 'real uptake sequence' and 'degenerate sequence that doesn't deserve to be treated as an uptake sequence'. Unfortunately the analysis is complicated by the different sizes of the datasets - the N. meningitidis set has almost twice as many sites as the H. influenzae set.
OK, I've dug out another set of H. influenzae runs, done with a high 'expected' setting to maximize the number of sites found. This has 3466 USSs, with a lot more having zero scores than in the previous set. Now the first and last Gs in the core are seen to be weaker in USSs with low scores, though not in the larger set of USSs with zero scores. Overall the consensus sequence stays the same as the scores and consensus strengths decrease. Notably, the flanking AT-rich segments remain as important in poorly matched USSs as the core does.