RRResearch: Biological relevance of USS scoring systems

In response to a recent post about how our USS-evolution model will score USS-like sequences, a commenter (Neil) says "I see ROC curves and cross-validation in your future." Google tells me that ROC (receiver operating characteristic) curves are graphs representing the relationship between a signal receiver's sensitivity and its specificity. They thus represent the trade-off between the receiver's ability to detect true positive signals and its tendency to falsely report events that aren't true signals.
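To make the idea concrete, here is a minimal sketch of how the points on a ROC curve could be computed. The scores and the true/false labels are made up for illustration, and `roc_points` is a hypothetical helper name, not something from our model. For each score threshold it reports the false-positive rate (1 - specificity) and the true-positive rate (sensitivity).

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    scores: numeric score given to each candidate by the receiver.
    labels: True if the candidate is a real signal, False otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Sweep the threshold from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Invented example data: six candidates, three of them "true" signals.
scores = [9, 8, 8, 6, 5, 3]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
```

Plotting the second value of each pair against the first gives the curve. A receiver that is no better than chance gives points along the diagonal.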

Does this apply to USS? That's actually a scientific question about the nature of USS - are they really signals? USS stands for 'uptake signal sequence', a name chosen because they were assumed to have evolved so competent bacteria could distinguish closely related 'self' DNA fragments from 'foreign' DNA. Under this view the uptake machinery could be seen as a signal receiver that needs to distinguish true USS (those on self DNA) from false-positive USS (chance occurrences of USS-like sequences in foreign DNA). (Note for new readers: the conventional 'core' USS of Haemophilus influenzae is the sequence AAGTGCGGT.)

But we don't think that USS are signals, at least not in this sense. Rather, our working hypothesis (null hypothesis) is that USS-like sequences accumulate in genomes as an accidental consequence of biases in the DNA-uptake machinery and recombination of 'uptaken' DNA with the genome. (I put 'uptaken' in quotes because it's not a real word; I'm using it because it's clearer than 'incoming', the word we usually use.) So rather than wanting a perfect scoring system to distinguish 'true' USS from 'false' ones, we would want it to reflect the real properties of the receiver (the DNA uptake machinery of real cells).

Unfortunately we don't know nearly enough about the uptake machinery to accurately describe it in a model. We know that some positions of the USS are more important than others, and that sequences outside the core matter. We have detailed analyses of the USS-like sequences that have accumulated in real genomes (see all my old posts about the Gibbs motif sampler), but we don't know how these sequences reflect the properties of the uptake machinery that caused them to accumulate. That's one of the things we hope to use the model to clarify.

For now, we don't really want to use a 'perfect' scoring system in our model. Instead, we can treat different scoring systems as different hypotheses about how differences in DNA sequences affect uptake (different hypotheses about how the machinery works). So we will start with very simple systems, such as those described in the previous post. Once we know how those affect simulated uptake and the long-term accumulation of high-scoring sequences in the genome, we can then clarify how what we know about the real uptake machinery constrains our ideas of how these sequences evolve.
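As an illustration of treating scoring systems as competing hypotheses, here is a minimal sketch of two very simple ones. These are assumptions made up for this example and are not necessarily the systems described in the previous post: one counts how many positions match the core USS, the other gives credit only for a perfect match. The function names are hypothetical, only one strand is scored, and fragments are assumed to be at least as long as the core.

```python
CORE_USS = "AAGTGCGGT"  # conventional H. influenzae core USS


def match_count(window, motif=CORE_USS):
    """Hypothesis 1: every matching position contributes equally."""
    return sum(1 for a, b in zip(window, motif) if a == b)


def perfect_match(window, motif=CORE_USS):
    """Hypothesis 2: only a perfect core counts at all."""
    return 1 if window == motif else 0


def best_score(fragment, score, motif=CORE_USS):
    """Score every window of the fragment and keep the best one."""
    width = len(motif)
    return max(
        score(fragment[i:i + width], motif)
        for i in range(len(fragment) - width + 1)
    )


perfect = "TTAAGTGCGGTCA"   # contains a perfect core USS
one_off = "TTAAGTGCGCTCA"   # same fragment with one mismatch in the core

counts = (best_score(perfect, match_count), best_score(one_off, match_count))
strict = (best_score(perfect, perfect_match), best_score(one_off, perfect_match))
```

Under the first hypothesis the one-mismatch fragment still scores nearly as well as the perfect one; under the second it scores no better than random DNA. Running the simulation with each system in turn would show how such differences affect which sequences accumulate.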
