RRResearch: How should our USS model score USS sequences?

Our computer simulation model of USS evolution is coming together, but we've hit a snag. I'm hoping that trying to explain it here will give me a clearer understanding of how to get around it.

A critical step in the model is scoring each simulated DNA fragment for USS-like sequences; the score then determines the probability that this fragment will replace its homologous sequence in the evolving genome. Our basic plan is to use a 'sliding window' the width of the USS motif we're using. The window would begin at position 1 of the fragment sequence, score the sequence in the window, and then move over one position, score again, and continue until it reached the end of the fragment. The final score would be the sum of the scores of the sequences at each window position. I expected this to be time consuming ('computationally intensive') but straightforward, but I was wrong.

The simplest scoring scheme would be to just check, at each window position, whether the sequence in the window exactly matches the USS consensus. (For simplicity here I'll consider only the standard 9bp USS core AAGTGCGGT, but I want the model to consider the full 29bp USS motif.) If the sequence in the window exactly matches the core we add 1 to the running score, otherwise we don't. The final score would then tell us the number of perfect-match USSs in the fragment. How the model would use this number to decide the probability of uptake is still to be decided, but I'll defer this problem to another post.

A slightly more subtle scheme would give partial scores for window positions containing sequences that nearly match the perfect USS core consensus. For example, any sequence that was mismatched at only 1 of the 9 positions could score 0.3, and any that were mismatched at 2 positions would score 0.1. Worse mismatches would score 0.

Both of the above simple schemes would work, but they do so because they ignore differences in the importances of different components of the USS, and in the tolerance for specific alternative vases at different positions. I had been planning to use a scoring matrix giving a value to each base at each position. Here's a very simple version of such a matrix, with the preferred base at each position scoring 1 and the other bases scoring 0. With this matrix a prefect USS core in the window would score 9, a singly mismatched USS would score 8, a doubly mismatched USS would score 7, etc.

Even a sequence that matched the USS consensus at only 1 position would get a non-zero score - it would be 1/9 of the score of a perfect USS. This creates two kinds of problems. First, we want only reasonably USS-like sequences to get significant scores, but under this scheme a random 9bp sequence will, on average, get a score of 2.5. Second, because the window evaluates 1000 positions in a 1kb sequence, the average score of a random sequence will be about 2500, and the average score of a fragment containing a perfect USS will be about 2506.5. The scoring scheme thus is far too weak in its ability to discriminate between fragments with and without good matches to the USS. A bit of thinking shows that it doesn't matter how big or small we make the individual numbers in the matrix.

But maybe I see the solution. What if only the best base at each position has a positive value in the scoring matrix, and the other bases have negative values? We'd want to adjust the negative values so that the average random sequence would get a score of zero. I still see lots of potential problems here, but maybe this is the way to go.

Added a bit later: There's a much simpler solution. Use any matrix we like, and just calculate the average score expected for random sequences, and subtract this from the actual score for each window position. The calculation is simple - just add up all the fractional scores for the different bases, remembering to correct for base composition. And rather than doing this correction at each window position, do it after the window has completed scanning all the positions in the fragment (subtract the product of the expected average score for each window position times the number of window positions scored).

For random sequences this should give fragment scores centered on zero. I don't know how broad the distribution would be, so I don't know how strongly a single USS would shift the distribution. We would like it to move the score out beyond almost all of the random fragments.

One other problem is what to do with negative scores. For random sequences these should be as common as positive scores, but if we simply treat them as zero scores then the effective average score will be half of the mean of the positive scores. Maybe we should subtract a number bigger than the expected average score, and treat all negative scores as zero.

Field of Science

RRResearch

How should our USS model score USS sequences?

No comments:

Post a Comment