On Tuesday night I successfully modified our Perl simulation model so it printed out lists of sequences of simulated fragments that it had found acceptable, based on their scores using a position-weight matrix. But I couldn't get it to choose the best fragments scored with the Gibbs-derived matrix; instead it just gave me a list of randomly chosen fragment sequences. Scoring with a fake matrix worked OK.

So tonight I was reading through the code, trying to remind myself of how the model uses each fragment's score and Perl's random number generator to decide whether or not to accept the fragment for recombination (or, in this case, for printing). The logic is simple, but its implementation surprised me.

Here's the logic: First choose a random number from the range between zero and the maximum possible score. Then compare it to the fragment's score. If the score is lower, discard the fragment. If the score is higher, accept the fragment. So a fragment with the best possible score will always be accepted, and a fragment whose score is 50% of the best will get accepted half the time.

But the code the simulation actually used added a step. When it got each random number, it converted it to its integer value (instead of rand($max) it did int(rand($max). This didn't create a problem with our fake matrix, because the preferred bases at each of its 10 positions all had value 10, giving the total matrix a maximum score of 10^10, and taking the integer values just stripped off irrelevant decimal points. But the Gibbs matrix has true probabilities in its elements, so its maximum score is only 1.9 x 10^-7. This means that all the random numbers will be ≤ 1.9 x 10^-7, so their integer values will all be zero. And this means that all the scores will be larger than the random number, so all the fragments will be accepted.

I don't know why the int function was included in the original program, but we tried deleting it and the program still ran fine. And now the program works as it should with the Gibbs matrix, preferentially accepting fragments with higher scores. So I used it to collect the positions of 200 preferred fragments from each of the forward and reverse-complement strands of the first 50 kb of the H. influenzae genome, and an equal number of control fragments (with the program set to accept every fragment it scored.

As a bonus, solving this problem may also have solved the problem I was having with the real version of the simulation - the one we are using to model the evolution of uptake sequences. In the past I couldn't see any evolution of uptake sequences when I used a Gibbs-derived matrix, only with the fake matrix. At the time I though this must be because the Gibbs matrix didn't incorporate strong enough selectivity, and that was worrisome for my understanding of how uptake sequences evolve. But now I suspect that the problem may just have been that int function!

But I still would feel better if I knew why the student who did the coding for us included the int step. Did we discuss it and decide that this was the right thing to do, or did he just do it on his own initiative? I've checked the old versions of the program and it's there from the beginning. Maybe it's some sort of Perl good-practice thing.

Botany 2015 - Over and done

18 hours ago in The Phytophactor

I can't think of a good reason to do int(rand()) in your case. It could be based on the programmer's assumption that the maximum score against which the result is compared would consistently be some large power of 10, as in your test matrix. But even then, I can't think of a good reason to remove the decimal part because the comparison is still (I assume) between floating-point numbers, for which presence/absence of digits right of the decimal point doesn't matter. So use of int() here just wastes a little runtime in addition to (as you found) creating a range-dependent bias in your accept-reject criterion.

ReplyDelete