Not as random as it should be

On Saturday I described a key step in the motif-search computer program thusly ("thusly"? Pomposity alert!):
The program uses a random-number 'seed' to start searching the genome sequence for sequences whose consensus motif fits this pattern. ... Once the score is stable for a 'plateau' number of cycles the search ends and the program stores the final set of sequences that gave this pattern and goes on to try another random number seed.
Well. the seeds are supposed to be random numbers. The program uses them to get approximately random starting patterns, from some of which it converge on very similar final motif patterns. In any one run only a few of any 100 seeds lead to any approximation of the USS pattern, with the rest stuck muddling around in poorly-matched pseudo-patterns. But of course starting with identical seeds will give identical outcomes

This morning I noticed that two replicate runs that had been put into the computer queue within a few seconds of each other had given exactly the same results for all 100 seeds they claimed to have used. The program doesn't report each of the seeds used, but it does give you the initial seed it started the whole run with. Surprise surprise, the identical runs had started with identical 10-digit seeds (1162169861). And the next run queue's had the number 1162169821, differing only in its second-last digit. And the next two runs also had identical seeds (1162169911) Not only are these numbers much too similar to be considered 'random' the identity of all the results implies that each of the other 99 seeds is completely determined by the first seed.

I had assumed that any respectable computer program would draw whatever random numbers it needed from the random-number generator that's built in to most programming language compilers. (That's what our Perl programs do.) These actually generate pseudo-random numbers, and a great deal of analysis has gone into designing and evaluating the algorithms they use (they're also used in computer-security applications).

But I was wrong. I just went back to the original program notes, and found that one of the many settings users can specify is a 'random number generator seed'. This feature isn't discussed anywhere in the notes, and none of the examples provided in the notes use it. But maybe it prevents exactly the problem I'm having.

I've sent an email to my helpful advisor, and in the meantime I've queue'd up substitute runs, spaced far enough apart in time that they should have distinct seeds.

3 comments:

  1. I've never really got my head around pseudo-random number generation. The rand() function in Perl chooses a seed based on time the program started - I guess something similar has happened here, so spacing in the queue should help.

    Here's some discussion of rand() and the alternative, srand(), with respect to Perl.

    ReplyDelete
  2. The seeds it's getting are very similar to each other, so they're not pseudorandom. I only have a compiled version of this program, not the code, so I'll have to wait for a reply from the expert.

    I think the dependence of later seeds on the initial seed may be a troubleshooting 'feature', not a bug. Combined with the option of a user-specified seed, the results of what should be identical runs can be compared.

    ReplyDelete
  3. If you fix the seed yourself, yes you should always get the same result (once the seed is fixed, you will always get the same sequence of random numbers, hence you'll get reproducibility). Some langages/functions start the first seed based on the clock (milliseconds before midnight), but others don't. You may have to look at the documentation of your compiler.

    ReplyDelete

Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS