Comments on RRResearch: Scoring USS-like sequences in our model (so blind!)

2008-05-14T16:05:00.000-07:00

Right - I do remember your PWM stuff and the nice sequence logos.

The point about PWMs is that they are more informative than frequency matrices. Instead of asking "how common is base b in column i", you're asking "how common is base b in column i compared with base b anywhere else?" The fact that any base has a roughly 1/4 chance of occurring at any position is taken care of by the formulae (such as the one I gave) that calculate the weights.

Seems like what you need to know is how well the scores distinguish true USS from false ones. I see ROC curves and cross-validation in your future :)
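To make the ROC suggestion concrete, here is a minimal sketch in Python (not the Perl of the model itself). The scores are made-up numbers purely for illustration, and `roc_auc` is a hypothetical helper name. It computes the area under the ROC curve as the fraction of (true, false) pairs in which the true USS gets the higher score, counting ties as half.

```python
def roc_auc(true_scores, false_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen true site
    outscores a randomly chosen false site (ties count 0.5).
    """
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))


# Made-up scores, for illustration only.
true_uss = [9.1, 8.4, 7.7, 6.0]
false_uss = [6.5, 5.2, 4.8, 3.9, 3.1]

auc = roc_auc(true_uss, false_uss)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 means the scores separate true from false sites no better than chance; 1.0 means perfect separation.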

2008-05-14T12:26:00.000-07:00

Hi Neil,

We already have position-weight matrices up the wazoo - they're what I was generating from genome sequences using the Gibbs motif sampler analysis (see all those posts from last year), and what underlies the sequence logos we have. The tricky part is deciding how to use this information to generate scores on new sequences.

In the past (with this Perl model) we have added the scores for the different positions, as you suggest in your point 5. This is fine when we want to compare the scores of individual windows, or of several window positions in very short sequences.

But we need to score long fragments and whole genomes for the motifs, and an additive system has such a high background that the motif signals are overwhelmed. The background is inevitable, because in any random sequence 1/4 of all positions will match the best base for that position. (Numbers will vary a bit depending on the base composition...)
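The size of that background can be put in rough numbers. The sketch below (Python, for illustration) assumes equal base frequencies, a 10-base window, and a score that simply counts positions matching the best base. The figure of 1,000,000 window positions is hypothetical and only there for scale.

```python
from math import comb

WINDOW = 10      # length of the motif core, as in this thread
P_MATCH = 0.25   # assumed chance that a random base matches the best base


def prob_at_least(k, n=WINDOW, p=P_MATCH):
    """Binomial probability of k or more matching positions out of n."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


expected_matches = WINDOW * P_MATCH   # average matches in a random window
n_windows = 1_000_000                 # hypothetical number of window positions

print(f"expected matches per random window: {expected_matches}")
for k in (5, 7, 10):
    hits = prob_at_least(k) * n_windows
    print(f">= {k} matches: about {hits:.0f} random windows")
```

Under these assumptions roughly 8% of random windows match at 5 or more of the 10 positions, which is the kind of background that swamps an additive score.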

That's why we're now thinking of instead MULTIPLYING the scores of the different positions in the window to give the score. We think this will cause well-matched sequences to give much higher scores than poorly matched ones (orders of magnitude higher), which is what we need.
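A small sketch of the difference, in Python for illustration. The per-position scores (0.9 for a match, 0.05 for a mismatch) are made-up numbers, not values from the model. One thing worth noting is that multiplying scores is the same as adding their logarithms, which is what log-based weight matrices do.

```python
import math

# Hypothetical per-position scores: frequency of the observed base
# at that position among known sites (made-up numbers).
MATCH, MISMATCH = 0.9, 0.05

good_window = [MATCH] * 10                  # matches at all 10 positions
poor_window = [MATCH] * 5 + [MISMATCH] * 5  # matches at only 5 positions


def additive(scores):
    """Score a window by summing its per-position scores."""
    return sum(scores)


def multiplicative(scores):
    """Score a window by multiplying its per-position scores."""
    return math.prod(scores)


add_ratio = additive(good_window) / additive(poor_window)
mult_ratio = multiplicative(good_window) / multiplicative(poor_window)
print(f"additive ratio:       {add_ratio:.1f}")
print(f"multiplicative ratio: {mult_ratio:.3g}")

# Multiplying scores is the same as adding their logarithms.
log_sum = sum(math.log(s) for s in good_window)
assert math.isclose(math.exp(log_sum), multiplicative(good_window))
```

With these made-up numbers the two windows differ by less than a factor of two under the additive score, but by about six orders of magnitude under the multiplicative one.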

I haven't yet thought much about the biological implications of using a multiplicative rather than an additive method. Perhaps it's equivalent to assuming that bases at different positions in the USS interact in their effects on uptake.

2008-05-13T22:39:00.000-07:00

I think a much better scoring method would be to use a position weight matrix (PWM). There's a short (not well-written) introduction to PWMs on Wikipedia.

The basic procedure goes like this:

1. Get a high-confidence (experimentally-validated) set of USS 10-base core sequences.
2. Line them up one below another; for instance, the 4 sequences from your figure would look like this in FASTA format:

>seq1
AAAGTGCGGT
>seq2
AACGTGCGTA
>seq3
TACGCAGGTA
>seq4
TGCACAGCTA

3. Calculate a frequency matrix from the alignment; this looks like the first table in your figure, but it's better if the rows are bases ACGT and the columns are positions 1-10 in the motif.

4. Now you want to convert the frequencies to weights. One formula to do this is:

W(b,i) = log( ( F(b,i) + sqrt(N)/4 ) / ( N + sqrt(N) ) / p(b) ) / log(2)

where F(b,i) is the count of base b in column i (the raw frequency from the matrix in step 3); N is the number of sequences (= column sum); p(b) is the background frequency of the base (which you might estimate as 0.25 or the frequency in the genome); sqrt(N)/4 and sqrt(N) are "pseudocounts" (small statistical corrections that stop a zero count from giving log(0)). Dividing by log(2) just expresses the weight in bits (log base 2).

Sorry, Blogger comments are not designed for equations!

Since you know that some bases are more important than others with respect to uptake, you might want to devise your own weighting scheme too. For instance, if column 9 had 10 times the effect on uptake of any other column, you might multiply its weight by 10; but you would need a relative scheme that weighted all columns accordingly.

5. Now it's just a case of scanning a query sequence 10 bases at a time and summing the weights across each 10-base window to obtain a score. You may want to convert to a relative score (such as 0-100) using (score - minimum score)/(maximum score - minimum score).
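Steps 3 to 5 can be sketched roughly as follows. This is Python for illustration, not a definitive implementation. It reads the step 4 formula as a log2-odds ratio (an assumption), uses a uniform 0.25 background, and rescales against the best and worst scores the matrix can possibly give (one reading of "minimum" and "maximum" score). The function names and the query sequence are made up.

```python
import math

# The four example sequences from step 2.
SEQS = ["AAAGTGCGGT", "AACGTGCGTA", "TACGCAGGTA", "TGCACAGCTA"]
BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}  # assumed uniform background


def count_matrix(seqs):
    """Step 3: counts[b][i] = number of sequences with base b at column i."""
    width = len(seqs[0])
    counts = {b: [0] * width for b in BASES}
    for seq in seqs:
        for i, base in enumerate(seq):
            counts[base][i] += 1
    return counts


def weight_matrix(counts, n_seqs, background=BACKGROUND):
    """Step 4: log2-odds weights with sqrt(N) pseudocounts."""
    root_n = math.sqrt(n_seqs)
    weights = {}
    for b in BASES:
        weights[b] = [
            math.log2((c + root_n / 4) / (n_seqs + root_n) / background[b])
            for c in counts[b]
        ]
    return weights


def scan(query, weights):
    """Step 5: sum the weights over every window of the motif width."""
    width = len(weights["A"])
    return [
        sum(weights[base][i] for i, base in enumerate(query[start:start + width]))
        for start in range(len(query) - width + 1)
    ]


def relative_scores(scores, weights):
    """Rescale to 0-100 using the best and worst possible window scores."""
    width = len(weights["A"])
    best = sum(max(weights[b][i] for b in BASES) for i in range(width))
    worst = sum(min(weights[b][i] for b in BASES) for i in range(width))
    return [100 * (s - worst) / (best - worst) for s in scores]


counts = count_matrix(SEQS)
weights = weight_matrix(counts, len(SEQS))
query = "CCAAAGTGCGGTCC"  # made-up query that contains seq1 at offset 2
scores = scan(query, weights)
rel = relative_scores(scores, weights)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(best_start, round(rel[best_start], 1))
```

The sketch assumes the query contains only A, C, G and T. A real genome scan would also need to handle ambiguity codes and the reverse strand.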

There is an excellent tutorial on PWMs from the Davuluri group available as a PDF. The relevant slides are 14-17.

This article, "The statistical significance of nucleotide position-weight matrix matches", and the citations at the bottom of that page are also a good introduction to the topic.

There are also quite a few software packages that can calculate matrices from alignments and scan sequences with them, such as prophecy (which builds the matrix) and the prophet + profit tools (which scan with it) in EMBOSS.

Hope this helps. If you could get a good set of 10-base USS sequences together, I'd be happy to code this up in Perl in the next week or two.