I need to decide how many replicates of the motif sampler analysis I should do for our 'defining the USS' paper, and at what level of stringency to do each replicate. For real experiments, such decisions are made by factoring in how much work each experiment will be, and how expensive the needed resources will be. But the computer cluster we're using is free to us, and starting a run takes very little of my time. (It's deciding what to do that takes the time.)
I'm starting with the first analysis: searching the whole genome sequence for sites fitting the USS pattern whose spacing is specified by my 'fragmentation mask' prior (a 10 bp motif followed by two 6 bp motifs, with the three motifs separated by gaps of 1 and 6 bp). The prior file uses this line:
++++++++++x++++++-xxxx-++++++
where each '+' marks a position whose consensus is to be included in the motif, each 'x' marks a position that is not to be included, and each '-' marks a position that is optional.
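A quick way to check that a mask line really encodes the intended spacing is to tally its runs of motif and non-motif positions. The snippet below is only a sanity-check sketch, not part of the motif sampler: the function name `mask_segments` is made up, and it simply lumps 'x' and '-' together as "gap" positions.

```python
from itertools import groupby

# The mask line from the prior file, copied from the post.
MASK = "++++++++++x++++++-xxxx-++++++"

def mask_segments(mask):
    """Return (kind, length) for each run of motif ('+') or gap ('x'/'-') positions."""
    segments = []
    for is_motif, run in groupby(mask, key=lambda ch: ch == "+"):
        kind = "motif" if is_motif else "gap"
        segments.append((kind, len(list(run))))
    return segments

if __name__ == "__main__":
    for kind, length in mask_segments(MASK):
        print(f"{kind:5s} {length} bp")
```

For the mask above this should list three motif runs with two gaps between them, which can be compared against the 10/6/6 bp layout described in the text.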
The program uses a random-number 'seed' to start searching the genome sequence for sequences whose consensus motif fits this pattern. Initially it finds only sequences whose consensus motifs very weakly match this pattern, but it goes through many cycles trying to improve the motif it's found to get a better match.
It scores the quality of the matches with a MAP number. The first cycles always give MAP scores much less than zero (-4000 is typical); scores improve in later cycles. Once the score has been stable for a 'plateau' number of cycles, the search ends; the program stores the final set of sequences that gave this pattern and goes on to try another random-number seed.
With genome-sized sequences, most of the time the score never gets much better than -4000 because the search hasn't found the real USS motif. But sometimes it gets lucky and finds the motif, and then the scores on successive cycles rapidly increase to values around +4000. Once the program has completed trying the specified number of seeds, it checks the results from all the seeds to find the one that gave the best MAP score. It then polishes this set of sequences to see if it can get an even better score, and then reports this as the final set of sequences and consensus.
So I control stringency by specifying two things. First, I specify how long the 'plateau' period should be (how long the program keeps trying to see if it can find a better motif). Second, I specify how many different random-number seeds the program tries before it picks the best results for polishing.
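How the two settings interact can be sketched with a toy search. This is emphatically not the motif sampler's algorithm: it is a simple hill-climb on a made-up score (how close an integer is to a hidden target), and every name in it (`run_seed`, `best_of_seeds`, `TARGET`) is hypothetical. It only illustrates the control flow described above: each seed runs until the score has failed to improve for `plateau` cycles in a row, and the best-scoring seed is kept.

```python
import random

TARGET = 42  # hypothetical "true motif" for the toy score

def toy_score(state):
    """Higher is better; 0 is a perfect match (stand-in for a MAP score)."""
    return -abs(state - TARGET)

def run_seed(seed, plateau):
    """Search from one random seed until the score is stable for `plateau` cycles."""
    rng = random.Random(seed)
    best_state = rng.randint(0, 100)
    best = toy_score(best_state)
    stable = 0   # cycles since the last improvement
    cycles = 0   # total cycles used by this seed
    while stable < plateau:
        candidate = best_state + rng.choice([-1, 1])
        cycles += 1
        score = toy_score(candidate)
        if score > best:
            best_state, best, stable = candidate, score, 0
        else:
            stable += 1
    return best, best_state, cycles

def best_of_seeds(n_seeds, plateau):
    """Run every seed and keep the result with the highest score."""
    results = [run_seed(seed, plateau) for seed in range(n_seeds)]
    return max(results, key=lambda r: r[0])

if __name__ == "__main__":
    score, state, cycles = best_of_seeds(n_seeds=10, plateau=20)
    print(score, state, cycles)
```

In this toy, every seed costs at least `plateau` cycles, so the total work grows with both the number of seeds and the plateau length. Whether the real program's run time scales the same way is something a test run has to show.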
The default is a plateau of 20 cycles and 10 seeds. But I've been advised to use a plateau of 100 or 200, and lots more seeds. One complication with running analyses on the computer cluster is that you have to specify in advance how much time your run will need (the cluster uses this info to set your place in the queue). I don't have a good idea how long my runs will take on this cluster, so I submitted a test run using plateau = 200 and seeds = 100. I set the time limit to 12 hr because a much shorter run had taken only 2 minutes, but it's been going for 7.5 hours now... If it runs out of time before it's done, I suspect that I won't get any interim results, just a blank file. So a few hours later I queued up an identical run with a 48 hr time limit. I hope that's long enough.
On Wednesday one of the post-docs and I are meeting with the local experts for this computer cluster. We sent them an email confessing that we don't really know what we're doing, and I suspect they feel they'd better straighten us out before we seriously mess up their system.