While I've been doing other things, a collaborator has been working hard on a comparative genomics project that will tell us how much impact uptake signal sequences (USS) have on gene function.
Reminder: USS are short sequence motifs (the longest are ~30 bp) present in many copies in the genomes of naturally transformable bacteria, probably because the cells preferentially take up DNA fragments containing the motif. Most of the USS in the Haemophilus influenzae genome are in coding sequences, and we want to find out whether their presence forces genes to specify sub-optimal amino acids at the positions encoded by USS.
This analysis tests the effect of USS by comparing the amino acid sequences of proteins with and without USS. For each H. influenzae gene with one or more USSs, we first find homologous protein sequences from at least three genomes with no USS. We compare three of these protein sequences with each other (that's three no-USS comparison scores) to get a measure of how strongly selection acts on the protein, especially on the segment that in H. influenzae is specified by a USS. Then we compare each of the three with the H. influenzae sequence (that's three +USS comparison scores).
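Here is a minimal Python sketch of how the six comparison scores for one gene could be collected. The scoring method isn't specified above, so fractional identity over pre-aligned sequences is used purely as a stand-in, and the function names `identity` and `comparison_scores` are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of gap-free aligned positions where the residues match.

    Assumption: a and b are already aligned to the same length, with '-'
    marking gaps. This is a stand-in score, not the project's real one.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def comparison_scores(hi_seq, no_uss_seqs):
    """Return (no-USS scores, +USS scores) for one gene.

    hi_seq      -- the H. influenzae protein (its gene contains a USS)
    no_uss_seqs -- three homologs from genomes that have no USS
    """
    # Three homologs give three pairwise no-USS comparisons.
    no_uss = [identity(a, b) for a, b in combinations(no_uss_seqs, 2)]
    # Each homolog against H. influenzae gives three +USS comparisons.
    plus_uss = [identity(hi_seq, s) for s in no_uss_seqs]
    return no_uss, plus_uss


# Toy, made-up sequences purely for illustration.
no_uss, plus_uss = comparison_scores("MKTA", ["MKTA", "MKSA", "MRTA"])
```

In practice the scores would presumably be restricted to, or weighted toward, the segment that the USS specifies, which this sketch doesn't attempt.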
Then we compare the mean no-USS score with the mean +USS score; if the scores are similar then we conclude that the USS doesn't significantly constrain the protein's function. There's a lot of random variation, so we do this for every USS-containing gene in the genome and then plot each pair of scores as a point on a scatter-plot. Points that fall on the diagonal represent genes whose USSs don't constrain them, and points that fall below the diagonal represent genes whose USSs may be causing problems.
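As a rough sketch of that last step, the snippet below turns one gene's scores into a scatter-plot point and checks whether it sits below the diagonal. It assumes the no-USS mean goes on the x-axis and the +USS mean on the y-axis, which the description implies but doesn't state; the `tolerance` value and the function names are hypothetical.

```python
from statistics import mean


def scatter_point(no_uss_scores, plus_uss_scores):
    """One gene's point: (mean no-USS score, mean +USS score)."""
    return mean(no_uss_scores), mean(plus_uss_scores)


def below_diagonal(point, tolerance=0.05):
    """True if the +USS mean is lower than the no-USS mean by more than
    `tolerance`. The tolerance is an arbitrary illustrative cutoff that
    allows for random variation; it is not a value from the analysis.
    """
    x, y = point
    return y < x - tolerance


# Made-up scores for two imaginary genes.
constrained = scatter_point([0.9, 0.9, 0.9], [0.6, 0.7, 0.8])
unconstrained = scatter_point([0.8, 0.8, 0.8], [0.8, 0.8, 0.8])

print(below_diagonal(constrained))    # +USS comparisons score lower
print(below_diagonal(unconstrained))  # point lies on the diagonal
```

Since the question is about the average effect rather than individual genes, the fraction of points below the diagonal is more informative than any single point.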
We're not interested in specific genes but in the general picture: we want to know whether, on average, USSs cause problems or not. A preliminary analysis done years ago suggested they don't, but the answer from this new, improved analysis will be interesting in any case.