On Tuesday night I successfully modified our Perl simulation model so it printed out lists of sequences of simulated fragments that it had found acceptable, based on their scores using a position-weight matrix. But I couldn't get it to choose the best fragments scored with the Gibbs-derived matrix; instead it just gave me a list of randomly chosen fragment sequences. Scoring with a fake matrix worked OK.
So tonight I was reading through the code, trying to remind myself of how the model uses each fragment's score and Perl's random number generator to decide whether or not to accept the fragment for recombination (or, in this case, for printing). The logic is simple, but its implementation surprised me.
Here's the logic: First choose a random number from the range between zero and the maximum possible score. Then compare it to the fragment's score. If the score is lower, discard the fragment. If the score is higher, accept the fragment. So a fragment with the best possible score will always be accepted, and a fragment whose score is 50% of the best will get accepted half the time.
But the code the simulation actually used added a step. When it got each random number, it converted it to its integer value (instead of rand($max) it did int(rand($max). This didn't create a problem with our fake matrix, because the preferred bases at each of its 10 positions all had value 10, giving the total matrix a maximum score of 10^10, and taking the integer values just stripped off irrelevant decimal points. But the Gibbs matrix has true probabilities in its elements, so its maximum score is only 1.9 x 10^-7. This means that all the random numbers will be ≤ 1.9 x 10^-7, so their integer values will all be zero. And this means that all the scores will be larger than the random number, so all the fragments will be accepted.
I don't know why the int function was included in the original program, but we tried deleting it and the program still ran fine. And now the program works as it should with the Gibbs matrix, preferentially accepting fragments with higher scores. So I used it to collect the positions of 200 preferred fragments from each of the forward and reverse-complement strands of the first 50 kb of the H. influenzae genome, and an equal number of control fragments (with the program set to accept every fragment it scored.
As a bonus, solving this problem may also have solved the problem I was having with the real version of the simulation - the one we are using to model the evolution of uptake sequences. In the past I couldn't see any evolution of uptake sequences when I used a Gibbs-derived matrix, only with the fake matrix. At the time I though this must be because the Gibbs matrix didn't incorporate strong enough selectivity, and that was worrisome for my understanding of how uptake sequences evolve. But now I suspect that the problem may just have been that int function!
But I still would feel better if I knew why the student who did the coding for us included the int step. Did we discuss it and decide that this was the right thing to do, or did he just do it on his own initiative? I've checked the old versions of the program and it's there from the beginning. Maybe it's some sort of Perl good-practice thing.
- Home
- Angry by Choice
- Catalogue of Organisms
- Chinleana
- Doc Madhattan
- Games with Words
- Genomics, Medicine, and Pseudoscience
- History of Geology
- Moss Plants and More
- Pleiotropy
- Plektix
- RRResearch
- Skeptic Wonder
- The Culture of Chemistry
- The Curious Wavefunction
- The Phytophactor
- The View from a Microbiologist
- Variety of Life
Field of Science
-
-
-
Political pollsters are pretending they know what's happening. They don't.5 weeks ago in Genomics, Medicine, and Pseudoscience
-
-
Course Corrections6 months ago in Angry by Choice
-
-
The Site is Dead, Long Live the Site2 years ago in Catalogue of Organisms
-
The Site is Dead, Long Live the Site2 years ago in Variety of Life
-
Does mathematics carry human biases?4 years ago in PLEKTIX
-
-
-
-
A New Placodont from the Late Triassic of China5 years ago in Chinleana
-
Posted: July 22, 2018 at 03:03PM6 years ago in Field Notes
-
Bryophyte Herbarium Survey7 years ago in Moss Plants and More
-
Harnessing innate immunity to cure HIV8 years ago in Rule of 6ix
-
WE MOVED!8 years ago in Games with Words
-
-
-
-
post doc job opportunity on ribosome biochemistry!9 years ago in Protein Evolution and Other Musings
-
Growing the kidney: re-blogged from Science Bitez9 years ago in The View from a Microbiologist
-
Blogging Microbes- Communicating Microbiology to Netizens10 years ago in Memoirs of a Defective Brain
-
-
-
The Lure of the Obscure? Guest Post by Frank Stahl12 years ago in Sex, Genes & Evolution
-
-
Lab Rat Moving House13 years ago in Life of a Lab Rat
-
Goodbye FoS, thanks for all the laughs13 years ago in Disease Prone
-
-
Slideshow of NASA's Stardust-NExT Mission Comet Tempel 1 Flyby13 years ago in The Large Picture Blog
-
in The Biology Files
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
1 comment:
Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="http://www.fieldofscience.com/">FoS</a> = FoS
Subscribe to:
Post Comments (Atom)
I can't think of a good reason to do int(rand()) in your case. It could be based on the programmer's assumption that the maximum score against which the result is compared would consistently be some large power of 10, as in your test matrix. But even then, I can't think of a good reason to remove the decimal part because the comparison is still (I assume) between floating-point numbers, for which presence/absence of digits right of the decimal point doesn't matter. So use of int() here just wastes a little runtime in addition to (as you found) creating a range-dependent bias in your accept-reject criterion.
ReplyDelete