In revising our USS manuscript I've been thinking again about the stringency of the Gibbs motif searches. The stringency is set by telling the program how many sites it should expect to find. For most analyses we set this at 2000 (and it would find about 1350) but for the analysis of covariation between positions we set it at 3000 so we would have many poorer matches (it would find about 1650) and thus have more variation to analyze.
The Gibbs analysis assigns a score to each site it finds. With 'expect 2000' most sites have scores close to 1.0, but there are always some (~40 in the set I counted) with scores of zero and a few with scores between 0.5 and 0.9.
I only yesterday discovered how to get Excel to make a histogram; here's the histogram showing the distribution of scores for one of the 'expect 3000' searches. As with the 'expect 2000' analyses, most of the USS sites have scores close to 1.0, and few have scores in the middle range. But there are lots more sites with scores of zero, and today I've been checking out how bad these sites are.
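For anyone who would rather not fight with Excel, the same binning can be sketched in a few lines of Python. This is only an illustration: the function name `score_histogram` and the toy scores are invented here, and the real input would be the per-site scores from the Gibbs output.

```python
def score_histogram(scores, n_bins=10):
    """Count scores in n_bins equal-width bins covering the range 0 to 1."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 is placed in the top bin rather than overflowing.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# Toy scores, invented for illustration only (not real Gibbs output).
toy_scores = [0.0, 0.0, 0.0, 0.55, 0.8, 0.97, 0.99, 1.0, 1.0, 1.0]
hist = score_histogram(toy_scores)

# Print a crude text histogram, one row per bin.
for i, count in enumerate(hist):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```

With real scores, the pattern described above would show up as tall bars in the lowest and highest bins and short ones in between.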
I had feared that they were garbage included by mistake, just increasing the background noise but not really resembling the USS consensus. So I was pleasantly surprised to see that these sites have a strong USS signal when displayed as a sequence logo (the top logo in the figure). For comparison I made another logo using only the sites that scored 1.0; that's the bottom logo in the figure.
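The comparison between the two logos can also be put in numbers. A logo's stack height at each position is the information content, which for DNA is 2 bits minus the Shannon entropy of the base frequencies at that position. Below is a minimal sketch under two simplifying assumptions that logo programs don't always make: equal background frequencies for the four bases, and no small-sample correction. The function name and the toy sites are invented for illustration.

```python
import math

def information_content(sites):
    """Bits of information at each position of equal-length aligned DNA sites.

    Assumes a uniform background (0.25 per base) and applies no
    small-sample correction.
    """
    length = len(sites[0])
    bits = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(2.0 - entropy)
    return bits

# Toy aligned sites, invented for illustration (not real USS sequences).
high_scoring = ["ACGT", "ACGT", "ACGT", "ACGT"]
low_scoring = ["ACGT", "ACGA", "ACGT", "TCGT"]

high_bits = information_content(high_scoring)
low_bits = information_content(low_scoring)
print("score = 1.0 sites:", [round(b, 2) for b in high_bits])
print("score = 0 sites:  ", [round(b, 2) for b in low_bits])
```

Running this separately on the score-zero sites and the score-1.0 sites gives a position-by-position measure of how much weaker the low-scoring motif is.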
Of course this analysis doesn't tell us whether the low-scoring sites found by the Gibbs analysis actually function as USS in DNA uptake. Most of them mismatch the consensus at one or more core positions, but the ~25% with perfect cores also have stronger flanking consensus (I'm not sure how to interpret this).
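Splitting the low-scoring sites into perfect-core and imperfect-core groups is simple to do in code. The sketch below is hypothetical throughout: the consensus string, the positions treated as 'core', and the toy sites are all placeholders, not the real USS values.

```python
def has_perfect_core(site, consensus, core_positions):
    """True if the site matches the consensus at every core position."""
    return all(site[i] == consensus[i] for i in core_positions)

def split_by_core(sites, consensus, core_positions):
    """Return (perfect, imperfect) lists of sites."""
    perfect = [s for s in sites if has_perfect_core(s, consensus, core_positions)]
    imperfect = [s for s in sites if not has_perfect_core(s, consensus, core_positions)]
    return perfect, imperfect

# Placeholder consensus and core positions, invented for illustration.
toy_consensus = "TTACGTAA"
toy_core = range(2, 6)  # positions 2-5 (0-based) stand in for the core

# Toy low-scoring sites, invented for illustration.
toy_sites = ["TTACGTAA", "GTACCTAA", "TAAGGTCA", "CTACGAAT"]

perfect, imperfect = split_by_core(toy_sites, toy_consensus, toy_core)
fraction_perfect = len(perfect) / len(toy_sites)
print(f"{fraction_perfect:.0%} of sites have perfect cores")
```

The two groups could then be displayed as separate logos to compare their flanking positions.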