I successfully worked out how to command the Gibbs Motif Sampler to analyze the new genome sequences. I've only done it for two of them, because a better option has appeared.
A new version of the Gibbs motif software is available. It gives the option of using a 'centroid' sampling method that combines the best sites found in different runs (runs initiated with different random-number seeds), rather than simply taking all the sites identified in the run that had the best score. This has the big advantage of eliminating most of the weakly-matched 'false positive' sites.
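To make the difference concrete, here is a toy sketch of the two ways of choosing sites. It is not the Gibbs software's actual centroid calculation; a simple majority vote across runs stands in for it. The function names, the `min_fraction` threshold, and the site positions and scores are all hypothetical, made up for illustration.

```python
from collections import Counter


def consensus_sites(runs, min_fraction=0.5):
    """Keep sites reported in more than min_fraction of the runs.

    Assumption: a majority vote is used here as a stand-in for the
    centroid method; the real software's calculation may differ.
    """
    if not runs:
        return []
    counts = Counter()
    for sites in runs:
        counts.update(set(sites))  # count each site at most once per run
    threshold = min_fraction * len(runs)
    return sorted(site for site, n in counts.items() if n > threshold)


def best_run_sites(runs, scores):
    """Take every site from the single best-scoring run."""
    best = max(range(len(runs)), key=lambda i: scores[i])
    return sorted(set(runs[best]))


# Hypothetical site positions from three runs with different random seeds.
runs = [
    [120, 455, 980, 1500],
    [120, 455, 2210],
    [120, 455, 980, 3100],
]
scores = [10.2, 9.8, 11.0]  # hypothetical per-run scores

print(best_run_sites(runs, scores))  # includes the one-off site 3100
print(consensus_sites(runs))         # only sites seen in most runs
```

In this toy example the best-scoring run still carries a site that no other run found, which is the kind of weakly supported 'false positive' that combining runs is meant to drop.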
It took me a few days to work out how to get it running on the computer cluster (the helpful administrator reset some permissions for me). The new release includes a version that runs in the Mac terminal, and I now have that working too. But it didn't take long to discover that the centroid version runs about 100-fold slower (no, I'm not exaggerating) than the usual (non-centroid) version. This means that a good run analyzing a whole genome would take several weeks (or more?); getting rid of the false positives isn't worth that big an investment.
But the very helpful Gibbs expert has again offered to help; he says the centroid version shouldn't be slower at all. So I've sent him the test file I've been using (2% of the genome) plus examples of the output I get. He's going to see if he can find the problem and fix it.