Field of Science

Showing posts with label Perl. Show all posts

Not as random as it should be

On Saturday I described a key step in the motif-search computer program thusly ("thusly"? Pomposity alert!):
The program uses a random-number 'seed' to start searching the genome sequence for sequences whose consensus motif fits this pattern. ... Once the score is stable for a 'plateau' number of cycles the search ends and the program stores the final set of sequences that gave this pattern and goes on to try another random number seed.
Well, the seeds are supposed to be random numbers. The program uses them to generate approximately random starting patterns, from some of which it converges on very similar final motif patterns. In any one run only a few of every 100 seeds lead to any approximation of the USS pattern, with the rest stuck muddling around in poorly-matched pseudo-patterns. But of course starting with identical seeds will give identical outcomes.

This morning I noticed that two replicate runs that had been put into the computer queue within a few seconds of each other had given exactly the same results for all 100 seeds they claimed to have used. The program doesn't report each of the seeds used, but it does give the initial seed it started the whole run with. Surprise surprise, the identical runs had started with identical 10-digit seeds (1162169861). The next run queued had the number 1162169821, differing only in its second-last digit. And the next two runs also had identical seeds (1162169911). Not only are these numbers much too similar to be considered 'random', but the identity of all the results implies that each of the other 99 seeds is completely determined by the first seed.

I had assumed that any respectable computer program would draw whatever random numbers it needed from the random-number generator that's built into most programming-language compilers. (That's what our Perl programs do.) These actually generate pseudo-random numbers, and a great deal of analysis has gone into designing and evaluating the algorithms they use (they're also used in computer-security applications).
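A minimal Python sketch of the underlying issue (Python rather than Perl, purely for illustration): a pseudo-random generator is completely determined by its seed, so two runs started from the same seed replay exactly the same 'random' draws. And 10-digit seeds of that size are consistent with the common trick of seeding from the clock, which would explain why jobs queued seconds apart got identical or near-identical seeds (a hunch, not something the program's notes confirm).

```python
import random
import time

# Two generators seeded identically replay exactly the same draws;
# two runs started with the same seed therefore give identical results.
a = random.Random(1162169861)
b = random.Random(1162169861)
print([a.randint(0, 9) for _ in range(5)] ==
      [b.randint(0, 9) for _ in range(5)])   # prints True

# Seeding from the clock (seconds since 1970) is only as random as the
# submission times: jobs queued in the same second get the same seed.
clock_seed = int(time.time())
c = random.Random(clock_seed)
```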

But I was wrong. I just went back to the original program notes, and found that one of the many settings users can specify is a 'random number generator seed'. This feature isn't discussed anywhere in the notes, and none of the examples provided in the notes use it. But maybe it prevents exactly the problem I'm having.

I've sent an email to my helpful advisor, and in the meantime I've queued up substitute runs, spaced far enough apart in time that they should have distinct seeds.

Deciding how much (computer) data is enough

I need to decide how many replicates of the motif sampler analysis I should do for our 'defining the USS' paper. And at what level of stringency to do each replicate. For real experiments, such decisions are made by factoring in how much work each experiment will be, and how expensive the needed resources will be. But the computer cluster we're using is free to us, and starting a run takes very little of my time. (It's deciding what to do that takes the time.)

I'm starting with the first analysis: searching the whole genome sequence for sites fitting the USS pattern whose spacing is specified by my 'fragmentation mask' prior (a 10 bp motif followed by two 6 bp motifs, separated by 1 bp and 6 bp). The prior file uses this line:
++++++++++x++++++-xxxx-++++++
where each ‘+’ marks a position whose consensus is to be included in the motif, each ‘x’ a position that is not to be included, and each ‘-’ an optional position.
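To make the mask concrete, here is a small Python sketch (hypothetical; the sampler's own parser isn't shown in its notes) that classifies each position of the prior line by its symbol:

```python
# The fragmentation-mask line from the prior file.
MASK = "++++++++++x++++++-xxxx-++++++"

roles = {"+": "include", "x": "exclude", "-": "optional"}
positions = {role: [] for role in roles.values()}
for i, ch in enumerate(MASK):
    positions[roles[ch]].append(i)

# 22 included positions: the 10 bp motif plus the two 6 bp motifs.
print(len(positions["include"]),
      len(positions["exclude"]),
      len(positions["optional"]))   # prints: 22 5 2
```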

The program uses a random-number 'seed' to start searching the genome sequence for sequences whose consensus motif fits this pattern. Initially it finds only sequences whose consensus motifs very weakly match this pattern, but it goes through many cycles trying to improve the motif it's found to get a better match.

It scores the quality of the matches with a MAP number. The first cycles always give MAP scores much less than zero (-4000 is typical); scores improve in later cycles. Once the score is stable for a 'plateau' number of cycles the search ends, and the program stores the final set of sequences that gave this pattern and goes on to try another random-number seed.

With genome-sized sequences, most of the time the score never gets much better than -4000 because the search hasn't found the real USS motif. But sometimes it gets lucky and finds the motif, and then the scores on successive cycles rapidly increase to values around +4000. Once the program has completed trying the specified number of seeds, it checks the results from all the seeds to find the one that gave the best MAP score. It then polishes this set of sequences to see if it can get an even better score, and then reports this as the final set of sequences and consensus.
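The control flow described above (many seeds, each searched until a 'plateau' of no improvement, then the best seed's result kept) can be sketched schematically in Python. This is not the sampler's actual code; the function names are mine and the scoring step is a stand-in for the real, much more involved motif scoring:

```python
import random

def sampler_run(seed, plateau, score_fn):
    """One seed's search: keep cycling until the best score has not
    improved for `plateau` consecutive cycles, then return that score.
    (score_fn is a stand-in for the real motif-scoring step.)"""
    rng = random.Random(seed)
    best, stable = float("-inf"), 0
    while stable < plateau:
        score = score_fn(rng)
        if score > best:
            best, stable = score, 0   # improvement: reset the plateau counter
        else:
            stable += 1
    return best

def motif_search(n_seeds, plateau, score_fn, master_seed):
    """Try n_seeds independent searches and keep the best MAP score.
    Note that every per-seed seed is drawn from one master generator,
    so the initial seed determines all of them."""
    master = random.Random(master_seed)
    seeds = [master.randrange(2**31) for _ in range(n_seeds)]
    return max(sampler_run(s, plateau, score_fn) for s in seeds)

# Toy usage with a noisy stand-in score:
noisy = lambda rng: rng.gauss(0, 1000)
print(motif_search(n_seeds=10, plateau=50, score_fn=noisy, master_seed=7))
```

The last function also shows why identical initial seeds reproduce identical runs: all the per-seed seeds come from the one master generator, so the other 99 really are completely determined by the first.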

So I control stringency by specifying two things. First, I specify how long the 'plateau' period should be (how long the program keeps trying to see if it can find a better motif). Second, I specify how many different random-number seeds the program tries before it picks the best results for polishing.

The default is a plateau of 20 cycles and 10 seeds. But I've been advised to use a plateau of 100 or 200, and lots more seeds. One complication with running analyses on the computer cluster is that you have to specify in advance how much time your run will need (the cluster uses this info to set your place in the queue). I don't have a good idea how long my runs will take on this cluster, so I submitted a test run using plateau = 200 and seeds = 100. I set the time limit to 12hr because a much shorter run had taken only 2 minutes, but it's been going for 7.5 hours now.... If it runs out of time before it's done, I suspect I won't get any interim results, just a blank file. So a few hours later I queued up an identical run with a 48hr time limit. I hope that's long enough.

On Wednesday one of the post-docs and I are meeting with the local experts for this computer cluster. We sent them an email confessing that we don't really know what we're doing, and I suspect they feel they'd better straighten us out before we seriously mess up their system.

Turning the Perl program into a story

I started writing the Perl computer program nearly two years ago. I worked on it intermittently for most of a year, got it doing most of what I wanted, and then set it aside in the press of other things that seemed more urgent. This past summer one of the post-docs took it on - she wanted to learn Perl and bringing this model to fruition (= we publish a paper about the results) seemed like a great way to learn.

So far so good. She's embellished the program and debugged it, and now it seems to be doing just what we want. The most important words there are 'seems to', because right now neither of us is confident that we really understand how the program does things, which means we can't be sure it's doing them right. I wrote the basic modules (procedures? subroutines?) of the program, but I forget how they work. And the post-doc has been able to use them without having to dig into how they work.

So now we're creating a narrative of the program, in which we try to explain in ordinary English all the things the program is doing, and how it decides what to do. A computer simulation is like a game of "let's pretend": let's pretend that these are the rules you must act by, and these are the conditions you start with. At certain points decisions need to be made, and the program spells out the rules for making each decision. At many points in our program the decision involves getting a random number (the computer equivalent of rolling the dice), but even here there are precise rules.
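As a toy example of what such a dice-rolling rule looks like (hypothetical, and in Python for illustration; this is not our actual Perl code, and the mutation rate `mu` is an invented parameter): mutate each base with probability mu, and if it mutates, choose uniformly among the three other bases. The dice roll is random, but the rules governing it are precise.

```python
import random

def mutate(sequence, mu, rng):
    """Return a copy of `sequence` in which each base has independently
    mutated with probability mu, to one of the three other bases."""
    bases = "ACGT"
    out = []
    for base in sequence:
        if rng.random() < mu:                             # the dice roll
            out.append(rng.choice(bases.replace(base, "")))  # pick among the other three
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(42)
print(mutate("AAGTGCGGT", 0.5, rng))
```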

What we're trying to do now is write out a clear accurate description of what happens at each step. Of course such a description already exists in the program itself, but by recreating it in plain English we're forcing ourselves to make sure we really understand it. In effect we're making ourselves explain it to each other - in my experience that's the best way to be sure you understand anything you want to get right.

We hope this will only take another day or so, and then we can confidently put the program to work generating DATA!