Sorry for the dead air; I was away unexpectedly with no internet access.
We got the reviews back for our USS manuscript. Not too bad. Both reviewers asked for more analysis of the role of reading frames in USS locations. (See previous posts here and here.) This turned out to be both easy and fun to do.
Many USSs are in the protein-coding parts of the genome, and these can be sorted by which of the 6 possible reading frames the respective proteins are encoded in. The first two figures show the relationships of the USSs to the reading frames.
The frames aren't equally used. Frames A, B and C have 49, 125 and 425 USSs respectively, and frames D, E and F have 474, 125 and 157. The differences are too large to be explained by chance alone, and we think they arise because USSs in the less-used reading frames impose more severe constraints on protein function, so that many USSs arising in these frames are eliminated by natural selection.
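As a rough check on the "too large to be explained by chance" claim, here is a minimal sketch of a chi-square goodness-of-fit calculation on the six counts above. The null hypothesis of equal usage across all six frames is my simplifying assumption for illustration, not necessarily the test we would report, and the function name is made up.

```python
# Observed USS counts per reading frame, as quoted in the post.
observed = {"A": 49, "B": 125, "C": 425, "D": 474, "E": 125, "F": 157}


def chi_square_uniform(counts):
    """Chi-square statistic against an (assumed) equal-usage null hypothesis."""
    total = sum(counts.values())
    expected = total / len(counts)  # same expected count for every frame
    return sum((n - expected) ** 2 / expected for n in counts.values())


chi2 = chi_square_uniform(observed)

# Standard critical value for 5 degrees of freedom at p = 0.05.
CRITICAL_5DF_P05 = 11.07

print(f"total USSs = {sum(observed.values())}, chi-square = {chi2:.1f}")
print("exceeds critical value:", chi2 > CRITICAL_5DF_P05)
```

With six categories there are 5 degrees of freedom, so any statistic above about 11.07 would reject equal usage at the 5% level.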
The new analysis considers the two factors likely to contribute to this disparity. (Because the flanking segments exert only modest constraints on amino acid sequence, I've limited the new analysis to only the most frequent tripeptides encoded by the USS core in each frame.)
The first factor likely to affect USS reading frame usage is the differing biochemical properties of the tripeptides that USS cores in these reading frames will encode. Some combinations are intrinsically more versatile than others, useful at many different locations in a wide variety of proteins, whereas other tripeptide combinations will only be useful in particular contexts.
The second factor is 'codon bias'. Most amino acids can be specified by at least 2 (often 4 and sometimes 6) different codons, some of which are more efficiently translated than others, and cells preferentially use the easiest-to-translate codons for proteins they need to make a lot of.
The new analysis evaluates the first factor by comparing the number of USSs in each frame to the total usage of the six tripeptides in all the proteins of the genome (the 'proteome'). If the differing versatilities of the tripeptides are responsible for their differing use in USSs, we should see a correlation between each tripeptide's total number in the proteome and the number encoded by USSs. The results are shown by the red symbols and line in the graph below. The predicted correlation exists. (The line fitted to the points has a confidence score of 0.86. I know enough statistics to know that this is a reasonably strong correlation.)
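For anyone who wants to try this kind of comparison, here is a minimal sketch of the two steps: counting how often a tripeptide occurs in a set of protein sequences, and computing a correlation between two lists of counts. The sequences, tripeptides and function names are toy placeholders, not the real H. influenzae data, and I've used a plain Pearson coefficient, which may or may not be the same statistic as the 0.86 quoted above.

```python
from math import sqrt


def count_tripeptide(proteins, tripeptide):
    """Count occurrences (overlaps included) of a tripeptide in protein sequences."""
    total = 0
    for seq in proteins:
        for i in range(len(seq) - 2):
            if seq[i:i + 3] == tripeptide:
                total += 1
    return total


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy placeholders only: invented sequences and tripeptides.
toy_proteome = ["MKVRLSAVRLKVR", "MSAVKVRAA"]
toy_tripeptides = ["KVR", "SAV", "VRL"]

proteome_counts = [count_tripeptide(toy_proteome, t) for t in toy_tripeptides]
print(dict(zip(toy_tripeptides, proteome_counts)))
```

In the real analysis the second list passed to `pearson_r` would be the per-frame USS counts, paired with the proteome-wide count of the tripeptide each frame encodes.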
The blue symbols in the graph show the results of the other analysis. To look for a correlation between codon bias and USS reading frame usage, I needed a crude score indicating how easily each USS-encoded tripeptide would be translated. I was able to get a table of codon usage for the H. influenzae proteome from the TIGR website; this gave the percent usage of each codon. So for each tripeptide I calculated a 'USS-codons' score as the sum of the percentages of the codons specified by USSs in that reading frame, and a 'best-codons' score as the sum of the percentages of the most commonly used codons for its three amino acids. Then I calculated a 'codon cost' as the ratio of these scores, and multiplied it by 100 so it would fit neatly on the graph.
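The scoring can be sketched in a few lines. Everything here is illustrative: the usage percentages are invented, only three amino acids are included, and I've assumed the ratio is taken as best-codons over USS-codons (so that rarer USS codons give a higher cost); the function and variable names are hypothetical.

```python
# Codon-to-amino-acid assignments (standard genetic code, small subset).
codon_to_aa = {
    "AAA": "K", "AAG": "K",
    "TGT": "C", "TGC": "C",
    "GGT": "G", "GGC": "G",
}

# Invented percent-usage values, NOT the real H. influenzae table.
usage_pct = {
    "AAA": 6.0, "AAG": 1.5,
    "TGT": 0.7, "TGC": 0.3,
    "GGT": 2.5, "GGC": 1.5,
}


def codon_cost(uss_codons, codon_to_aa, usage_pct):
    """Return 100 * best-codons score / USS-codons score (ratio direction assumed)."""
    # 'USS-codons' score: sum of usage of the codons the USS actually specifies.
    uss_score = sum(usage_pct[c] for c in uss_codons)
    # 'Best-codons' score: sum of usage of the most-used codon for each amino acid.
    best_score = 0.0
    for c in uss_codons:
        aa = codon_to_aa[c]
        best_score += max(
            pct for codon, pct in usage_pct.items() if codon_to_aa[codon] == aa
        )
    return 100 * best_score / uss_score


# Placeholder codons standing in for the three codons of one reading frame.
cost = codon_cost(("AAG", "TGC", "GGT"), codon_to_aa, usage_pct)
print(f"codon cost = {cost:.1f}")
```

Under this assumed definition, a tripeptide already encoded by the most-used codons scores exactly 100, and anything larger indicates less-favoured codons.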
If codon cost contributes to the disparity in reading frame usage by USSs, we expect the blue points to show an inverse relationship; the highest codon costs should be for the least-used reading frames, so the blue line should slope down to the right. Instead we see no correlation at all. This tells us that codon bias makes little or no contribution to the persistence of USSs in different reading frames.
We're not surprised by this second result. (In fact one of the postdocs was so sure of it that she didn't think the analysis was worth doing.) Other analyses we've done have indicated that USSs are rarely found in categories of genes known to be subject to strong codon bias. But that's for a different paper, and this new analysis will nicely address concerns raised by both reviewers of our present paper.