Months ago (last fall?) a postdoc and I spent what seemed like a lot of time at the whiteboard in the hall, considering different ways our planned USS model might score the sequences it was considering for their similarity to a USS motif.
We eventually settled on the crude system shown on the left (yellow table). It evaluates how well the DNA sequence in a 10base window matches the USS core consensus. Each match to the consensus earns a point, with the total score for the sequence being the sum of the points it's earned. At the time, we realized that this way of scoring had two (or three?) big problems, but we needed something simple to get the model working so we settled for this.
The first problem is that the score is not very sensitive to how good the match is. The yellow numbers beside the table show the scores earned by specific sequences. A sequence matching at all 10 positions is only 11% better than a sequence matching at 9 positions, even though we know from real uptake experiments that some single base changes can reduce uptake by more than 95%. The second problem is that this method treats all 10 positions in the motif equally. But again our uptake experiments have shown that some positions in the motif affect uptake much more strongly than others.
The third problem is that random sequences have very high scores, and adding a single perfectmatch USS to it increases this baseline score only slightly.
This morning the postdoc and I reconsidered the scoring system. We expected that finding a solution to these problems would be very difficult, but we quickly came up with a much better way, illustrated by the blue table on the right of the figure. The new method is to multiply the scores of the individual positions rather than summing them. This causes the scores of wellmatched sequences to be dramatically higher than those of poorer matches. And we expect (though we haven't tested this yet), that the baseline score of a random sequence will be much smaller. For now we've given all but the consensus base scores of 1, but these could be larger or smaller; for example some bases at some positions could be worth only 0.1 of a point.
Now that the program is working, implementing a multiplicative scoring system should be simple. I'm tempted to try it right now, but I have lots of other things I should be doing, and I'd probably just get bogged down in technical problems anyway.
 Home
 Angry by Choice
 Catalogue of Organisms
 Chinleana
 Doc Madhattan
 Games with Words
 Genomics, Medicine, and Pseudoscience
 History of Geology
 Moss Plants and More
 Pleiotropy
 Plektix
 RRResearch
 Skeptic Wonder
 The Culture of Chemistry
 The Curious Wavefunction
 The Phytophactor
 The View from a Microbiologist
 Variety of Life
Field of Science


Leave the leaves? Not good gardening advice3 days ago in The Phytophactor



The problem with our democracy isn't gerrymandering. It's integers4 days ago in Genomics, Medicine, and Pseudoscience

What areas of chemistry could AI impact the most? An opinion poll5 days ago in The Curious Wavefunction




Posted: July 22, 2018 at 03:03PM3 months ago in Field Notes



Bryophyte Herbarium Survey1 year ago in Moss Plants and More


Harnessing innate immunity to cure HIV2 years ago in Rule of 6ix

WE MOVED!2 years ago in Games with Words





post doc job opportunity on ribosome biochemistry!3 years ago in Protein Evolution and Other Musings

Growing the kidney: reblogged from Science Bitez3 years ago in The View from a Microbiologist

Blogging Microbes Communicating Microbiology to Netizens4 years ago in Memoirs of a Defective Brain



The Lure of the Obscure? Guest Post by Frank Stahl6 years ago in Sex, Genes & Evolution


Lab Rat Moving House7 years ago in Life of a Lab Rat

Goodbye FoS, thanks for all the laughs7 years ago in Disease Prone


Slideshow of NASA's StardustNExT Mission Comet Tempel 1 Flyby7 years ago in The Large Picture Blog

in The Biology Files
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
3 comments:
Markup Key:
 <b>bold</b> = bold
 <i>italic</i> = italic
 <a href="http://www.fieldofscience.com/">FoS</a> = FoS
Subscribe to:
Post Comments (Atom)
I think a much better scoring method would be to use a position weight matrix (PWM). There's a short (not wellwritten) introduction to PWMs on Wikipedia.
ReplyDeleteThe basic procedure goes like this:
1. Get a highconfidence (experimentallyvalidated) set of USS 10base core sequences.
2. Line them up below one another; for instance the 4 sequences from your figure in fasta format would look like:
>seq1
AAAGTGCGGT
>seq2
AACGTGCGTA
>seq3
TACGCAGGTA
>seq4
TGCACAGCTA
3. Calculate a frequency matrix from the alignment; this looks like the first table in your figure, but it's better if the rows are bases ACGT and the columns are positions 110 in the motif.
4. Now you want to convert the frequencies to weights. One formula to do this is:
W(b,i) = ( F(b,i) + sqrt(N)/4 ) / ( N + sqrt(N) ) / p(b) / log(2)
where F(b,i) is the frequency of base b in column i; N is the number of sequences (= column sum); p(b) is the background frequency of the base (which you might estimate as 0.25 or the frequency in the genome); sqrt(N)/4 and sqrt(N) are "pseudocounts" (just statistical corrections of frequency).
Sorry, Blogger comments are not designed for equations!
Since you know that some bases are more important than others with respect to uptake, you might want to devise your own weighting scheme too. For instance if column 9 had 10 times the effect on uptake of any other column, you might multiply its weight x10; but you would need a relative scheme that weighted all columns accordingly.
5. Now it's just a case of scanning a query sequence 10 bases at a time and summing the weights across each 10base window to obtain a score. You may want to convert to a relative score (such as 0100) using (score  minimum score)/(maximum score  minimum score).
There is an excellent tutorial on PWMs from the Davuluri group available as a PDF. The relevant slides are 1417.
This article: The statistical significance of nucleotide positionweight matrix matches and the citations on that page (at the bottom) are also a good introduction to the topic.
There are also quite a few software packages that can calculate matrices given alignments, such as the prophet + profit tools in EMBOSS.
Hope this helps. If you could get a good set of 10base USS sequences together, I'd be happy to code this up in Perl in the next week or two.
Hi Neil,
ReplyDeleteWe already have positionweight matrices up the wazoo  they're what I was generating from genome sequences using the Gibbs motif sampler analysis (see all those posts from last year), and what underlies the sequence logos we have. The tricky part is deciding how to use this information to generate scores on new sequences.
In the past (with this Perl model) we have added the scores for the different positions, as you suggest in your point 5. This is fine when we want to compare the scores of individual windows, or of several window positions in very short sequences.
But we need to score long fragments and whole genomes for the motifs, and an additive system has such a high background that the motif signals are overwhelmed. The background is inevitable, because in any random sequence 1/4 of all positions will match the best base for that position. (Numbers will vary a bit depending on the base composition...)
That's why we're now thinking of instead MULTIPLYING the scores of the different positions in the window to give the score. We think this will cause wellmatched sequences to give much higher scores than poorly matched ones (orders of magnitude higher), which is what we need.
I haven't yet thought much about the biological implications of using a multiplicative rather than an additive method. Perhaps it's equivalent to assuming that bases at different positions in the USS interact in their effects on uptake.
Right  I do remember your PWM stuff and the nice sequence logos.
ReplyDeleteThe point about PWMs is that they are more informative than frequency matrices. Instead of asking "how common is base b in column i", you're asking "how common is base b in column i as opposed to base b anywhere else?" The observation that any base has a 1/4 (ish) chance of occurring at any position is taken care of in the formulae (such as the one I gave) that calculates the weights.
Seems like what you need to know is, how well do scores distinguish true USS from false. I see ROC curves and crossvalidation in your future :)