Deciding how much (computer) data is enough

I need to decide how many replicates of the motif sampler analysis I should do for our 'defining the USS' paper. And at what level of stringency to do each replicate. For real experiments, such decisions are made by factoring in how much work each experiment will be, and how expensive the needed resources will be. But the computer cluster we're using is free to us, and starting a run takes very little of my time. (It's deciding what to do that takes the time.)

I'm starting with the first analysis: searching the whole genome sequence for sites fitting the USS pattern whose spacing is specified by my 'fragmentation mask' prior (a 10bp motif followed by two 6bp motifs, separated by 1 and 6 bp). The prior file uses this line:
++++++++++x++++++-xxxx-++++++
where each '+' marks a position whose consensus is to be included in the motif, 'x' marks a position that is not to be included, and '-' marks a position that is optional.
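
As a sanity check (this is just my own little Python sketch, nothing to do with the sampler's own code), counting the runs of '+' and non-'+' characters confirms the 10-1-6-6-6 layout:

from itertools import groupby

mask = "++++++++++x++++++-xxxx-++++++"

# Group positions into runs of "in the motif" ('+') vs "not in the motif"
# ('x' or '-'), and record each run's length.
blocks = [("motif" if in_motif else "gap", len(list(run)))
          for in_motif, run in groupby(mask, key=lambda c: c == "+")]

print(blocks)
# [('motif', 10), ('gap', 1), ('motif', 6), ('gap', 6), ('motif', 6)]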

The program uses a random-number 'seed' to start searching the genome sequence for sequences whose consensus motif fits this pattern. Initially it finds only sequences whose consensus motifs very weakly match this pattern, but it goes through many cycles of trying to improve the motif it has found, to get a better match.

It scores the quality of the matches with a MAP (maximum a posteriori probability) score. The first cycles always give MAP scores much less than zero (-4000 is typical); scores improve in later cycles. Once the score has been stable for a 'plateau' number of cycles, the search ends and the program stores the final set of sequences that gave this pattern, then goes on to try another random-number seed.
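
My (possibly oversimplified) mental model of the plateau rule, sketched in Python with a made-up improve_motif() stand-in for one sampling cycle:

# Rough sketch of the plateau rule as I understand it -- not the sampler's
# real code. improve_motif() is a hypothetical stand-in for one sampling
# cycle that returns the current MAP score.
def search_one_seed(improve_motif, plateau=20):
    best_score = float("-inf")
    stable_cycles = 0
    while stable_cycles < plateau:
        score = improve_motif()
        if score > best_score:    # the motif got better: reset the clock
            best_score = score
            stable_cycles = 0
        else:                     # no improvement this cycle
            stable_cycles += 1
    return best_score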

With genome-sized sequences, most of the time the score never gets much better than -4000 because the search hasn't found the real USS motif. But sometimes it gets lucky and finds the motif, and then the scores on successive cycles rapidly increase to values around +4000. Once the program has completed trying the specified number of seeds, it checks the results from all the seeds to find the one that gave the best MAP score. It then polishes this set of sequences to see if it can get an even better score, and then reports this as the final set of sequences and consensus.
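
And the outer loop over seeds, in the same made-up style (search_from_seed() and polish() are hypothetical stand-ins, each returning a MAP score and a set of sites):

import random

# Same caveat: a sketch of my understanding, not the program's real code.
def best_over_seeds(search_from_seed, polish, n_seeds=10):
    best_score, best_sites = float("-inf"), None
    for _ in range(n_seeds):
        seed = random.randrange(2**31)         # a fresh random-number seed
        score, sites = search_from_seed(seed)
        if score > best_score:                 # keep the best result so far
            best_score, best_sites = score, sites
    return polish(best_sites)                  # polish only the winner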

So I control stringency by specifying two things. First, I specify how long the 'plateau' period should be (how long the program keeps trying to see whether it can find a better motif). Second, I specify how many different random-number seeds the program tries before it picks the best result for polishing.

The default is a plateau of 20 cycles and 10 seeds, but I've been advised to use a plateau of 100 or 200, and lots more seeds. One complication with running analyses on the computer cluster is that you have to specify in advance how much time your run will need (the cluster uses this info to set your place in the queue). I don't have a good idea how long my runs will take on this cluster, so I submitted a test run using plateau = 200 and seeds = 100. I set the time limit to 12hr because a much shorter run had taken only 2 minutes, but it's been going for 7.5 hours now.... If it runs out of time before it's done, I suspect I won't get any interim results, just a blank file. So a few hours later I queued up an identical run with a 48hr time limit. I hope that's long enough.
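
For what it's worth, a naive back-of-the-envelope calculation (assuming, probably wrongly, that run time just scales with plateau x seeds, and assuming the 2-minute test run used the defaults) would have predicted only a few hours:

# Purely illustrative arithmetic -- it assumes run time scales with
# plateau * seeds and that the 2-minute run used the defaults
# (plateau = 20, seeds = 10); neither assumption has been checked.
test_minutes = 2
scale = (200 * 100) / (20 * 10)   # big run vs. default-sized run
print(f"{test_minutes * scale:.0f} minutes, ~{test_minutes * scale / 60:.1f} hours")
# -> 200 minutes, ~3.3 hours

The real run has already blown well past that, so clearly the scaling isn't that simple.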

On Wednesday one of the post-docs and I are meeting with the local experts for this computer cluster. We sent them an email confessing that we don't really know what we're doing, and I suspect they feel they'd better straighten us out before we seriously mess up their system.
