What if we just plotted base composition as a function of USS or DUS number (really the number of perfect matches to the USS or DUS core)? If some genes lack uptake sequences because they've only recently entered the genome (coming from a source genome with no uptake sequences), we predict that genes with aberrant base compositions will be preferentially found in the class with no uptake sequences.
I imagine a graph looking something like this. Each dot represents a gene. Almost all the genes have base compositions close to 38%; that is far too many points to resolve, so the number of points in this group is indicated by the pale blue circles. I could clarify this by writing the actual numbers in blue beside these circles.
The numbers below the 0, 1, etc. on the bottom axis would be the fraction of the genes in that group that had aberrant base compositions. If our hypothesis that newly acquired genes tend to lack uptake sequences is correct, I anticipate these fractions would be highest for genes with no uptake sequences.
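Here is a rough sketch of how the numbers under the axis might be computed. Several things in it are my assumptions, not settled choices: the core is taken to be the 9 bp sequence AAGTGCGGT (swap in the DUS core for the other genome), matches are counted on both strands, "base composition" is read as G+C fraction, and a gene is called aberrant when its G+C fraction differs from the typical value by more than an arbitrary placeholder tolerance. The function names are hypothetical.

```python
from collections import defaultdict

# Assumed uptake-sequence core; replace with the DUS core as needed.
USS_CORE = "AAGTGCGGT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_core_matches(seq, core=USS_CORE):
    """Count perfect matches to the core on either strand of seq."""
    seq = seq.upper()
    rc = reverse_complement(core)
    n = len(core)
    count = 0
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window == core or window == rc:
            count += 1
    return count


def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def aberrant_fraction_by_count(genes, typical_gc=0.38, tolerance=0.05,
                               core=USS_CORE):
    """Group genes by number of core matches and return, for each group,
    the fraction whose G+C content is more than `tolerance` away from
    `typical_gc`. Both thresholds are placeholders, not fitted values.

    `genes` maps a gene name to its DNA sequence.
    """
    totals = defaultdict(int)
    aberrant = defaultdict(int)
    for seq in genes.values():
        n = count_core_matches(seq, core)
        totals[n] += 1
        if abs(gc_fraction(seq) - typical_gc) > tolerance:
            aberrant[n] += 1
    return {n: aberrant[n] / totals[n] for n in sorted(totals)}
```

Given a dictionary of gene sequences, `aberrant_fraction_by_count(genes)` returns one fraction per uptake-sequence count, which is what would be written below the 0, 1, etc. on the bottom axis. A better definition of "aberrant" (say, relative to the observed spread of G+C values) could replace the fixed tolerance without changing the rest.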
Doing this analysis would be much simpler than one that incorporated BLAST search results. On the other hand, the BLAST searches are already done, so maybe we can do both. But probably we should do this first, just to see if there is support for our hypothesis. If there isn't, we'll know not to waste time doing the fancier analyses. If there is, we should also do this using codon adaptation index instead of base composition. I found a web page that calculates this index, but only for one gene at a time, and I think my collaborators could probably automate it quite easily. (Maybe codon adaptation index has been replaced by a better measure - I'd better do some searching.)
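To give a sense of what automating the index might involve, here is a minimal sketch of the codon adaptation index as it is usually defined: each codon gets a weight equal to its count in a reference gene set divided by the count of the most-used synonymous codon, and a gene's index is the geometric mean of the weights of its codons. This is not the method the web page uses (I don't know its details). The choice of reference set, the standard genetic code, the exclusion of Met, Trp and stop codons, and the 0.5 pseudocount for codons never seen in the reference set are all assumptions of this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split an in-frame coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def relative_adaptiveness(reference_genes):
    """Weight for each codon: its count in the reference genes divided by
    the count of the most-used synonymous codon. Unseen codons get an
    assumed pseudocount of 0.5; if an amino acid never appears in the
    reference set, all of its codons end up with weight 1."""
    counts = Counter(c for gene in reference_genes
                     for c in codons(gene) if c in CODON_TABLE)
    weights = {}
    # Skip stop codons and the single-codon amino acids Met and Trp.
    for aa in set(AMINO) - set("*MW"):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        family_counts = {c: counts[c] or 0.5 for c in family}
        top = max(family_counts.values())
        for c in family:
            weights[c] = family_counts[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the codon weights over the scored codons of seq."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no scorable codons")
    return math.exp(sum(logs) / len(logs))
```

Looping `cai(seq, weights)` over every gene in the genome would give one value per gene, which could then be plotted against uptake-sequence number in place of base composition.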