Field of Science

Scoring USS-like sequences in our model (so blind!)

Months ago (last fall?) a post-doc and I spent what seemed like a lot of time at the whiteboard in the hall, considering different ways our planned USS model might score the sequences it was considering for their similarity to a USS motif.

We eventually settled on the crude system shown on the left (yellow table). It evaluates how well the DNA sequence in a 10-base window matches the USS core consensus. Each match to the consensus earns a point, with the total score for the sequence being the sum of the points it's earned. At the time, we realized that this way of scoring had two (or three?) big problems, but we needed something simple to get the model working so we settled for this.

The first problem is that the score is not very sensitive to how good the match is. The yellow numbers beside the table show the scores earned by specific sequences. A sequence matching at all 10 positions is only 11% better than a sequence matching at 9 positions, even though we know from real uptake experiments that some single base changes can reduce uptake by more than 95%. The second problem is that this method treats all 10 positions in the motif equally. But again our uptake experiments have shown that some positions in the motif affect uptake much more strongly than others.
The third problem is that random sequences have very high scores, and adding a single perfect-match USS to it increases this baseline score only slightly.

This morning the post-doc and I reconsidered the scoring system. We expected that finding a solution to these problems would be very difficult, but we quickly came up with a much better way, illustrated by the blue table on the right of the figure. The new method is to multiply the scores of the individual positions rather than summing them. This causes the scores of well-matched sequences to be dramatically higher than those of poorer matches. And we expect (though we haven't tested this yet), that the baseline score of a random sequence will be much smaller. For now we've given all but the consensus base scores of 1, but these could be larger or smaller; for example some bases at some positions could be worth only 0.1 of a point.

Now that the program is working, implementing a multiplicative scoring system should be simple. I'm tempted to try it right now, but I have lots of other things I should be doing, and I'd probably just get bogged down in technical problems anyway.

3 comments:

  1. I think a much better scoring method would be to use a position weight matrix (PWM). There's a short (not well-written) introduction to PWMs on Wikipedia.

    The basic procedure goes like this:

    1. Get a high-confidence (experimentally-validated) set of USS 10-base core sequences.
    2. Line them up below one another; for instance the 4 sequences from your figure in fasta format would look like:

    >seq1
    AAAGTGCGGT
    >seq2
    AACGTGCGTA
    >seq3
    TACGCAGGTA
    >seq4
    TGCACAGCTA

    3. Calculate a frequency matrix from the alignment; this looks like the first table in your figure, but it's better if the rows are bases ACGT and the columns are positions 1-10 in the motif.

    4. Now you want to convert the frequencies to weights. One formula to do this is:

    W(b,i) = ( F(b,i) + sqrt(N)/4 ) / ( N + sqrt(N) ) / p(b) / log(2)

    where F(b,i) is the frequency of base b in column i; N is the number of sequences (= column sum); p(b) is the background frequency of the base (which you might estimate as 0.25 or the frequency in the genome); sqrt(N)/4 and sqrt(N) are "pseudocounts" (just statistical corrections of frequency).

    Sorry, Blogger comments are not designed for equations!

    Since you know that some bases are more important than others with respect to uptake, you might want to devise your own weighting scheme too. For instance if column 9 had 10 times the effect on uptake of any other column, you might multiply its weight x10; but you would need a relative scheme that weighted all columns accordingly.

    5. Now it's just a case of scanning a query sequence 10 bases at a time and summing the weights across each 10-base window to obtain a score. You may want to convert to a relative score (such as 0-100) using (score - minimum score)/(maximum score - minimum score).

    There is an excellent tutorial on PWMs from the Davuluri group available as a PDF. The relevant slides are 14-17.

    This article: The statistical significance of nucleotide position-weight matrix matches and the citations on that page (at the bottom) are also a good introduction to the topic.

    There are also quite a few software packages that can calculate matrices given alignments, such as the prophet + profit tools in EMBOSS.

    Hope this helps. If you could get a good set of 10-base USS sequences together, I'd be happy to code this up in Perl in the next week or two.

    ReplyDelete
  2. Hi Neil,

    We already have position-weight matrices up the wazoo - they're what I was generating from genome sequences using the Gibbs motif sampler analysis (see all those posts from last year), and what underlies the sequence logos we have. The tricky part is deciding how to use this information to generate scores on new sequences.

    In the past (with this Perl model) we have added the scores for the different positions, as you suggest in your point 5. This is fine when we want to compare the scores of individual windows, or of several window positions in very short sequences.

    But we need to score long fragments and whole genomes for the motifs, and an additive system has such a high background that the motif signals are overwhelmed. The background is inevitable, because in any random sequence 1/4 of all positions will match the best base for that position. (Numbers will vary a bit depending on the base composition...)

    That's why we're now thinking of instead MULTIPLYING the scores of the different positions in the window to give the score. We think this will cause well-matched sequences to give much higher scores than poorly matched ones (orders of magnitude higher), which is what we need.

    I haven't yet thought much about the biological implications of using a multiplicative rather than an additive method. Perhaps it's equivalent to assuming that bases at different positions in the USS interact in their effects on uptake.

    ReplyDelete
  3. Right - I do remember your PWM stuff and the nice sequence logos.

    The point about PWMs is that they are more informative than frequency matrices. Instead of asking "how common is base b in column i", you're asking "how common is base b in column i as opposed to base b anywhere else?" The observation that any base has a 1/4 (-ish) chance of occurring at any position is taken care of in the formulae (such as the one I gave) that calculates the weights.

    Seems like what you need to know is, how well do scores distinguish true USS from false. I see ROC curves and cross-validation in your future :)

    ReplyDelete

Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS