I've only just realized how strongly the terms we use have been affecting our thinking about uptake sequences. I'll try to write about the problem here, but the nature of terminology problems makes it hard to write about them...
First some history of uptake sequences: It's been known since 1974 (Scocca, Poland and Zoon) that competent H. influenzae cells take up H. influenzae DNA fragments in preference to fragments from unrelated species, and since 1979 that this is a sequence preference (i.e. not a modification preference). In 1980 Danner et al. showed that H. influenzae cells preferentially take up fragments containing the 11 bp sequence AAGTGCGGTCA, and that ethylation of positions in and close to this sequence interferes with uptake. They correctly concluded that the species preference is seen because H. influenzae DNA is greatly enriched for this sequence (or for similarly-acting sequences). [They presciently also proposed that the uptake bias might be at least partly responsible for this enrichment, a hypothesis we've since formalized in our molecular drive model.] Later Goodgal showed that the 9 bp sequence AAGTGCGGT was probably sufficient to give strong uptake. In 1995 Smith et al. analyzed the newly available H. influenzae genome sequence for the 9 bp sequence and found 1465 (later 1471) perfect matches. Alignment of these also showed a strong consensus for two 6 bp AT-rich segments on one side of the 9 bp sequence. And in the last five or ten years we have shown that these sequences are the consensus of a diverse motif that accumulates in the genome by point mutation and homologous recombination, and have attempted to measure the preference of the uptake machinery for different variants of this consensus.
The terminology problem: So we need terminology for several different phenomena, all of which have at least sometimes been called uptake sequences or uptake signal sequences (USSs). There's the short sequence initially identified as the uptake sequence; the 9 bp version of this is commonly referred to as the uptake signal sequence or USS and is often the only sequence considered in analyses of uptake. Then there's the consensus derived from analysis of the genome sequence, with the 9 bp sequence called the core USS and the flanking AT-rich segments called segments 2 and 3. Then there's the position-weight matrix description of the genomic enrichment, which we refer to only in convoluted ways. And finally there's the true bias of the DNA uptake machinery, which we have only very poor estimates of and no term for.
Until recently nobody knew enough about DNA uptake and the genome to clearly distinguish between these, and this confusion has led to a lot of muddled thinking. We've been among the main perpetrators of this confusion, using 'USS' interchangeably for genomic sequences and uptake biases. But now we're proposing to obtain a very detailed description of the 'true' uptake bias, and to compare it to the genomic motif, so I need to carefully choose the terms I use, and to explicitly draw the reader's attention to them.
I could call the genomic motif the genomic USS motif, and its consensus the genomic USS consensus. We think it really has evolved as a motif, just like those associated with the preferences of known DNA-binding proteins, but we haven't yet identified the proteins and we aren't defining it by the proteins' preferences (which we don't know) but by its abundance in the genome. Although this consensus is useful in some experiments and simulations, it isn't much use when considering what's actually in the genome. If the genomic USS motif does approximate the real bias of the uptake machinery, then counting the number of occurrences of the core consensus (1471) is quite misleading about the actual uptake of different parts of the genome. (If it doesn't, counting is even more misleading.)
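To make concrete what "counting the number of occurrences of the core consensus" involves, here is a minimal sketch that counts perfect matches to the 9 bp core on both strands. The function names and the toy sequence are made up for illustration; this is not the analysis Smith et al. ran. Note that the toy sequence contains a site with a single mismatch that the count ignores entirely, which is exactly why the count can mislead.

```python
CORE = "AAGTGCGGT"  # the 9 bp core sequence discussed in the post

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_core_matches(genome, core=CORE):
    """Count perfect matches to `core` on either strand, allowing overlaps."""
    genome = genome.upper()
    targets = {core, reverse_complement(core)}
    k = len(core)
    return sum(genome[i:i + k] in targets for i in range(len(genome) - k + 1))


# Hypothetical toy sequence: one forward match, one reverse-strand match,
# and one near-miss (AAGTGCGGA) that a perfect-match count does not see.
toy = "TTAAGTGCGGTCAAAACCGCACTTGGAAGTGCGGAC"
n_matches = count_core_matches(toy)
```

A perfect-match count treats the near-miss the same as completely unrelated sequence, even though the motif view suggests it may still promote some uptake.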
Should I call the real uptake bias just that (real uptake bias)? Or the real uptake motif? I could leave off the real, and just use uptake motif. Like the genomic USS motif, it would be described by a position-weight matrix. What would I call the consensus of this, which I am proposing to use in additional experiments in place of the genomic USS consensus? Should I avoid using the USS abbreviation when referring to the matrix and consensus of the real uptake bias? This would help the reader keep things straight, but would mean getting rid of the customary description of this bias as the USS.
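For readers who haven't worked with position-weight matrices, here is a rough sketch of how a matrix and its consensus relate. The aligned sites are invented for illustration (they are not real genomic or uptake data), and the function names are my own. The same construction would apply whether the sites come from the genome or from an uptake experiment, which is part of why the two need distinct names.

```python
from collections import Counter

BASES = "ACGT"


def frequency_matrix(sites):
    """Build a per-position base-frequency matrix from equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in BASES})
    return matrix


def consensus(matrix):
    """Return the most frequent base at each position (ties go to A<C<G<T)."""
    return "".join(max(BASES, key=lambda base: column[base]) for column in matrix)


# Hypothetical aligned sites, each differing from the core at most once.
toy_sites = ["AAGTGCGGT", "AAGTGCGGT", "AAGTGCGGC", "ATGTGCGGT", "AAGTGCAGT"]
toy_matrix = frequency_matrix(toy_sites)
toy_consensus = consensus(toy_matrix)
```

The consensus throws away the per-position frequencies, so two quite different matrices can share one consensus.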
I could keep it simple, using genomic motif and genomic consensus, and uptake motif and uptake consensus. Or I could insert USS into each of these terms (genomic USS motif etc.). I think I'll try the latter for now - it's a bit wordy but has the virtue of being unambiguous, and if I later come up with something better I can easily make the needed changes.
Later: I overlooked one more terminology problem. How do I refer to individual sequences in the genome that might (or might not) promote uptake, but that haven't ever been tested? We used to distinguish between fragments that contain 'USS' and those that don't, but the motif/molecular drive perspective predicts a smooth continuum between sequences, with those that strongly promote uptake at one end and those that resist uptake at the other. It gives no justification for distinguishing between 'real' USS and sequences that look a bit like USS. Should I call them all USS-like sequences? Candidate USS sequences? Fragments with different degrees of correspondence to the genomic USS motif?
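The continuum is easier to see with numbers. The sketch below scores a few candidate sequences against a made-up matrix using a standard log-odds score. The matrix values (0.85 for the consensus base, 0.05 for each alternative) and the uniform background are assumptions chosen only for illustration; they are not estimates of the genomic USS motif or of the real uptake bias.

```python
import math

BASES = "ACGT"
CONSENSUS = "AAGTGCGGT"

# Hypothetical matrix: every position weighted equally, which the real
# motif almost certainly is not.
TOY_MATRIX = [
    {base: (0.85 if base == cons else 0.05) for base in BASES}
    for cons in CONSENSUS
]


def log_odds_score(seq, matrix=TOY_MATRIX, background=0.25):
    """Sum of log2(matrix frequency / background frequency) over positions."""
    if len(seq) != len(matrix):
        raise ValueError("sequence and matrix lengths differ")
    return sum(math.log2(col[base] / background) for base, col in zip(seq, matrix))


# Candidates with 0, 1, 2 and 9 mismatches to the consensus.
candidates = ["AAGTGCGGT", "AAGTGCGGA", "AAGTACGGA", "CCTAATTAC"]
ranked = sorted(candidates, key=log_odds_score, reverse=True)
scores = [log_odds_score(seq) for seq in ranked]
```

The scores fall off gradually rather than splitting into "USS" and "not USS", so any cutoff for calling a sequence a USS would be an arbitrary choice.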