In the paper I'm working on, we'll be comparing the USS motifs of various species. But of course there is no one true Gibbs motif, as the results depend on both random factors and ones I control. The randomness arises from both the random events in the history of the sequenced genome and the random-number seed that each Gibbs run starts with. I don't see it as a concern: the searches are effective and the genomes contain so many USSs that random variation has little effect on the motif found.
But factors I control can have a big effect on the results of a search. Probably the most important is the specification of an 'expected' number of occurrences of the motif. If I set this low, the search will be very stringent, reporting only occurrences that are very well matched to the motif it has found. If I set it high, many poorer matches will be reported. There's no 'right' setting, because there's no 'right' USS.
To compare USS motifs between genomes, I need to have done the searches with comparable stringencies. The simple method I'll try is to use 'expected numbers' that are 1.5 times the number of perfect matches to the standard 'core' consensus. The identification of the 'core' is somewhat arbitrary and historically contingent, but using it lets me treat all the genomes thought to have the same consensus in the same way. So for H. influenzae, H. somnus, Pasteurella multocida, Actinobacillus actinomycetemcomitans and Mannheimia succiniciproducens I'll use 1.5 × the number of occurrences of AAGTGCGGT; for H. ducreyi, A. pleuropneumoniae and M. haemolytica I'll use 1.5 × the number of occurrences of ACAAGCGGT; and for the Neisserias I'll use 1.5 × the number of occurrences of ATGCCGTCTGAA.
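The '1.5 × perfect matches' rule is easy to sketch in code. What follows is only an illustration, not the script actually used for these searches: the function names are made up, the genome is a toy string, and two choices are my assumptions rather than anything stated above, namely that matches are counted on both strands and that the result is rounded up to a whole number.

```python
import math

# Translation table for complementing uppercase DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_occurrences(genome, motif):
    """Count exact matches of motif in genome, allowing overlaps."""
    count = 0
    start = genome.find(motif)
    while start != -1:
        count += 1
        start = genome.find(motif, start + 1)
    return count


def expected_occurrences(genome, core, factor=1.5):
    """Return (perfect_matches, expected_number) for one core consensus.

    Assumption: perfect matches are counted on both strands, by searching
    the forward sequence for the core and for its reverse complement.
    Assumption: the expected number is rounded up to an integer.
    """
    genome = genome.upper()
    core = core.upper()
    rc = reverse_complement(core)
    matches = count_occurrences(genome, core)
    if rc != core:  # avoid double counting a palindromic core
        matches += count_occurrences(genome, rc)
    return matches, math.ceil(factor * matches)


# Toy sequence: one forward and one reverse-complement copy of ACAAGCGGT.
toy_genome = "TTACAAGCGGTCCCACCGCTTGTGG"
matches, expected = expected_occurrences(toy_genome, "ACAAGCGGT")
```

With a real genome, the same call would be made once per species with the core consensus assigned to its group, and the second value passed to the search as the expected number.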
The Gibbs searches I queued two days ago were terminated that night because I'd forgotten to set the memory allocation high enough. I requeued them yesterday with more memory requested. The A. pleuropneumoniae run was terminated again last night, I think because the long genome and long motif put too big a demand on the program. So I've separated the 'forward' and 'reverse complement' sequences and requeued them as two separate jobs. The Neisseria meningitidis run is still going; I hope it doesn't run out of allocated time before finishing.
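Separating the strands amounts to writing the genome out twice, once as is and once reverse-complemented. A minimal sketch of that step, assuming the genome is in FASTA format; the function names and header suffixes are hypothetical, and each job would presumably then be told to search only the strand it is given:

```python
# Translation table for complementing DNA, keeping case and N bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapped at width."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


def split_strands(fasta_text):
    """Return (forward_fasta, reverse_complement_fasta) for separate jobs."""
    records = parse_fasta(fasta_text)
    forward = [(h + " forward", s) for h, s in records]
    reverse = [(h + " reverse_complement", reverse_complement(s))
               for h, s in records]
    return format_fasta(forward), format_fasta(reverse)


# Toy input: a single record split across two sequence lines.
fwd, rev = split_strands(">toy\nACAAGC\nGGT\n")
```

Each returned string can be written to its own file and queued as its own job, which halves the sequence each run has to hold.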