RRResearch: More about Gibbs scores

Yesterday the post-doc and I were again trying to understand the scoring system used by the Gibbs Motif Sampler. It seems peculiar to us because, in a search where most occurrences of the motif get scores close to or equal to 1.00, it also produces a lot of sites that score zero, as illustrated in the histograms above. In general, sites that score higher are also better matches to the position-weight matrix that describes the motif, but the scaling seems weird. All of these sites, even the ones scoring 0.00, are pretty good matches to the motif (see below, copied from an earlier post), and we don't understand why there are a lot of sites with 0.00 scores but few or (usually) none with scores between 0.01 and 0.50.

The Gibbs documentation isn't very helpful, describing the score as "the probability of the motif belonging to the alignment". And the legend provided with the scores at the end of the run only says "Column 6 : Probability of Element". I decided to email my helpful Gibbs expert to ask how the score is calculated, but while searching for his email address I found that I'd already asked him in early 2007 (!). Here's what he said:

Well, it’s a little confusing. For the near optimal solution (the one labeled NEAR OPTIMAL RESULTS), the probability given is the probability of the particular site according to the motif model for that motif type. It is calculated by computing the posterior probability of each of the sites according to the motif probability model. In your first example, it’s calculated as P(A in position 1)*P(C in position 2)…. The probability is normalized so that the sum of the probability of the particular site belonging to each possible motif type and background sums to 1. The first site below is pretty definitely a good fit for the model, the second is not so great and might be better described as a background site. Both of the statements you list below are true in the sense that the probability correlates with how well the site fits the model.

Why are there low probability sites in the model? When the MAP is calculated, the program tries to maximize the MAP. Sometimes, there is a marginal improvement to the MAP by adding these low probability sites. The frequency solution will usually not include these sites, so in some ways it gives a better picture of the motif model.

First about the 'normalization': I don't understand what he means, but I think I don't need to worry about it. I've drawn a sketch of what I think might be going on: the program first calculates a score for every position in the genome, and then puts only those above some criterion into its alignment. But before it reports those aligned sequences it recalculates their scores, maybe using some normalization procedure that stretches out the middle of the range, cramming most of the scores that would have been in the lower half down to zero.
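Here is one possible reading of the normalization, written as a minimal Python sketch. It is my guess at what "normalized so that motif and background sum to 1" could mean, not the Gibbs sampler's actual code. The 10-base consensus, the 0.85 match probability, the uniform background and the 0.001 prior are all invented for illustration.

```python
import math

BASES = "ACGT"


def make_motif(consensus, p_match=0.85):
    """Build a hypothetical position-weight matrix from a consensus string.

    Each column gives p_match to the consensus base and splits the rest
    evenly among the other three bases (an assumption for illustration).
    """
    p_other = (1.0 - p_match) / 3.0
    return [{b: (p_match if b == c else p_other) for b in BASES} for c in consensus]


def site_posterior(site, motif, background=None, prior_motif=0.001):
    """Probability that `site` belongs to the motif rather than to background.

    P(site | motif) is the product P(base 1 in position 1) * P(base 2 in
    position 2) * ..., and likewise for the background model. The two are
    weighted by their priors and normalized so that they sum to 1.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform base composition
    p_motif = math.prod(col[base] for col, base in zip(motif, site))
    p_background = math.prod(background[base] for base in site)
    weighted_motif = prior_motif * p_motif
    weighted_background = (1.0 - prior_motif) * p_background
    return weighted_motif / (weighted_motif + weighted_background)


if __name__ == "__main__":
    motif = make_motif("ACGTACGTAC")  # invented consensus, not a real motif
    for site in ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGTGG", "ACGTACGAGG", "ACGTACAAGG"]:
        print(site, round(site_posterior(site, motif), 3))
```

With these made-up numbers each extra mismatch divides the motif-versus-background odds by 17 (0.85/0.05), so the score swings from near 1 to near 0 over only a few mismatches. A site with four mismatches out of ten still fits the motif model better than the background model, yet its normalized score rounds to 0.00. If the real calculation works anything like this, it would produce the kind of pile-up at zero that we see.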

Now about the 'background': In one sense his statement that the low-scoring sites in the alignment should be treated as background seems to be wrong, because even the sites with scores of 0.00 have significant matches to the matrix (see figure below). But maybe sites with this strength of match to the matrix are expected to occur by chance in a genome sequence of this length and base composition, so we can't assume that they result from selection for the process/effect of interest. In that sense they might indeed be considered background. I would think that the decision about background status would also have to take into account how often such sites are expected by chance and how often they're seen. For example, if we found 100 occurrences where theory predicts 10, we would conclude that most are due to non-background forces (though we wouldn't know which ones).
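The 100-versus-10 comparison can be put into numbers with a short Python sketch. It assumes that chance occurrences follow a Poisson distribution, which is my simplification and not something the Gibbs sampler reports, and the function names are made up for this example.

```python
import math


def excess_fraction(observed, expected):
    """Rough fraction of observed sites not accounted for by chance."""
    if observed <= 0:
        return 0.0
    return max(0.0, (observed - expected) / observed)


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    The upper tail is summed directly, term by term in log space, so that
    very small probabilities are not lost to rounding in 1 - CDF.
    """
    total = 0.0
    k = observed
    while True:
        log_term = -expected + k * math.log(expected) - math.lgamma(k + 1)
        term = math.exp(log_term)
        total += term
        # Stop once we are past the mean and further terms no longer matter.
        if k > expected and term <= total * 1e-16:
            break
        k += 1
    return total


if __name__ == "__main__":
    observed, expected = 100, 10  # the example counts from the text
    print("fraction beyond chance:", excess_fraction(observed, expected))
    print("P(at least this many by chance):", poisson_tail(observed, expected))
```

For 100 sites where 10 are expected, about 90% of the sites would be in excess of chance, and the probability of seeing that many by chance alone is vanishingly small. The sketch still can't say which individual sites are the chance ones.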
