One question about the accumulation of USS sequences in genomes is the extent to which they interfere with coding for proteins and other 'real' functions of the DNA. I've calculated that USS constrain 2-3% of the H. influenzae genome, taking into account the two flanking segments and also the strength of the consensus at these places. That was done years ago, and I probably should redo the calculation, especially as I'm not sure I can even find the original notes.
Seven or eight years ago we started working on this in collaboration with a theoretical physicist (turned bioinformatician) in Taiwan. One of his grad students did extensive analysis but has since gone on to other things, and his supervisor says we're free to finish the analysis and publish it without including them. So I've arranged with our current bioinformatics collaborators to redo the analysis, incorporating various improvements made possible both by our improved understanding of the issues and by the availability of more sequences to analyze.
This is a nice change from most of our work, in that we are starting out with a very good idea of what the results will look like. Not the details, but the general shape of things. I took advantage of this to write much of the paper in advance of getting the results the paper will describe. I made fake figures showing what the data will probably look like, and considered different ways we might present it. Then I sent the whole draft paper off to the collaborators, so they could see where their work is going. Now I'm sitting back waiting for them to do the heavy lifting of generating the data.
What are the main findings? We already know that in H. influenzae and Neisseria meningitidis the USSs are preferentially found in the non-coding regions (these are only about 10% of the genome). In H. influenzae about 35% of USSs are in non-coding regions, and in N. meningitidis about 60%. We'll check the ratios for other genomes too. We assume (hypothesize?) that this is because USSs constrain the ability of genes to code for the best amino acids.
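To make "preferentially found" concrete, here is a rough sketch of the arithmetic, using only the approximate percentages quoted above and assuming the 10% non-coding figure applies to both genomes. The function names are made up for illustration; this is not the analysis our collaborators will run.

```python
def fold_enrichment(frac_uss_in_region, frac_genome_in_region):
    """How over-represented USSs are in a region, relative to a uniform spread."""
    return frac_uss_in_region / frac_genome_in_region


def density_ratio(frac_uss_noncoding, frac_genome_noncoding):
    """Per-base-pair USS density in non-coding DNA divided by that in coding DNA."""
    noncoding = frac_uss_noncoding / frac_genome_noncoding
    coding = (1 - frac_uss_noncoding) / (1 - frac_genome_noncoding)
    return noncoding / coding


# Approximate fractions from the text; 0.10 non-coding is assumed for both genomes.
genomes = {
    "H. influenzae": 0.35,
    "N. meningitidis": 0.60,
}

for name, frac_uss in genomes.items():
    enrich = fold_enrichment(frac_uss, 0.10)
    ratio = density_ratio(frac_uss, 0.10)
    print(f"{name}: {enrich:.1f}-fold enriched in non-coding, "
          f"density ratio non-coding/coding = {ratio:.1f}")
```

With these rounded inputs the non-coding enrichment comes out at roughly 3.5-fold for H. influenzae and 6-fold for N. meningitidis. The density ratio is the more telling number, because it also accounts for the USSs that are missing from the coding 90%.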
The big analysis is done on the USSs that ARE in the coding regions, because here we can determine true sequence homology with other bacteria. We can thus use sequence alignments to find out the degree to which USSs avoid the most highly conserved (= most functionally constrained) parts of proteins. The result is that USSs are preferentially found in two kinds of places. The first is parts of proteins that show little evidence of functional constraint - for example the flexible hinges and linkers between domains. The second is parts of proteins where USSs don't change the amino acids; i.e. where the USS specifies the same amino acids that are optimal anyway. We can also analyze these USSs by the kind of proteins (or RNAs) the different genes produce - USSs are preferentially found in the less important proteins. And we can check whether the protein-coding part of the genome has spare places where USSs could be put without changing the amino acid sequence of the protein. H. influenzae has quite a few of these (I forget the numbers).
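The last check, looking for spare places where a USS could sit without changing the protein, can be sketched in a few lines. This is only a toy version under stated assumptions: it uses the 9 bp H. influenzae USS core (AAGTGCGGT) as the motif, ignores the flanking segments and the strength of the consensus, uses the standard genetic code, and expects an in-frame coding sequence in upper-case A/C/G/T. The names `translate` and `silent_uss_sites` are hypothetical, not from any existing pipeline.

```python
# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))

# Assumption: the 9 bp H. influenzae USS core, without flanking segments.
USS_CORE = "AAGTGCGGT"


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate an in-frame DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def silent_uss_sites(cds, motif=USS_CORE):
    """List (position, strand) pairs where writing the motif over the coding
    sequence would leave the encoded protein unchanged.

    Positions that already carry the motif are skipped, so the result counts
    only 'spare' sites, not existing USSs.
    """
    protein = translate(cds)
    sites = []
    for strand, m in (("+", motif), ("-", reverse_complement(motif))):
        for i in range(len(cds) - len(m) + 1):
            if cds[i:i + len(m)] == m:
                continue  # a USS is already here
            candidate = cds[:i] + m + cds[i + len(m):]
            if translate(candidate) == protein:
                sites.append((i, strand))
    return sites


# Toy gene: Met-Lys-Cys-Gly-stop, using codons that differ from the USS core.
toy_gene = "ATGAAATGTGGCTAA"
print(translate(toy_gene))         # -> MKCG*
print(silent_uss_sites(toy_gene))  # -> [(3, '+')]
```

In the toy gene the core fits at position 3 because AAG, TGC and GGT are synonymous with the AAA, TGT and GGC already there. A real count would have to run over every gene in the genome and should probably score partial matches to the full motif, not demand the perfect core.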
Hmm, writing this overview is giving me better ideas of how the paper should be organized.