Field of Science

How different are the replicates?

I haven't solved the random-number problem (runs submitted together get the identical seed) because it's not my code (I only have the compiled file). But I'm easily circumventing it by not submitting replicate runs at the same time. This afternoon the post-doc and I meet with the local computer-cluster experts, who will no doubt be stupefied by our ignorance of things Unix. If we're lucky they'll speak English as well as Unix; if not I'll just be brazen in insisting that we understand their explanations. (It's in their own interest; otherwise they risk that we'll unintentionally do something that compromises their system.)

I'm making lots of progress with the motif search results this system has given me so far. I have at least four replicate runs of each of four kinds of searches: forward DNA strand, reverse-complement DNA strand, leading strand, lagging strand (leading and lagging refer to the direction DNA polymerase moves on the strand during DNA replication). I've analyzed only the five forward strand files so far, because I want to be clear about what analyses are worth doing before going on to the other strands' files.

I've analyzed them for:
  • final log MAP score (all between 5096 and 5109).
  • number of sites found (all between 1053 and 1058).
  • how many of these sites contain perfect, singly-mismatched (one-off) or doubly-mismatched (two-off) USS cores (usually 719, 205 and 105 respectively).
  • the mean score of the ~1055 sites each search found (between 0.92 and 0.93). Most sites have scores very close to 1.0, but some are much lower and drive the mean down. (Hmm, I should check the median as well as the mean.)
  • the number of distinct sites between each pair of searches (1 vs 2, 1 vs 3 etc.). This ranged between 5 and 17, meaning that more than 98% of the sites each search found were also found by the replicate searches. The sites found in some searches and not others were usually ones with very low scores, usually about 0.5. A few were closely spaced sites with strong scores - I suspect the program can't handle overlapping sites.
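
The pairwise comparison and the mean-versus-median check in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: each site is identified by its genome position, "distinct" is taken to mean found by exactly one search of the pair, and the function names and toy numbers are invented for the example rather than taken from the search program's output.

```python
from itertools import combinations
from statistics import mean, median


def distinct_sites(sites_a, sites_b):
    """Sites found by exactly one of two searches (symmetric difference)."""
    return set(sites_a) ^ set(sites_b)


def pairwise_distinct_counts(runs):
    """Map each pair of run labels to the number of sites they disagree on."""
    return {
        (i, j): len(distinct_sites(runs[i], runs[j]))
        for i, j in combinations(sorted(runs), 2)
    }


def score_summary(scores):
    """Return (mean, median); a few low scores pull the mean down, not the median."""
    return mean(scores), median(scores)


# Toy data, invented for illustration: site positions found by three replicate runs.
toy_runs = {
    1: {100, 250, 400},
    2: {100, 250, 900},
    3: {100, 250, 400},
}
toy_counts = pairwise_distinct_counts(toy_runs)

# Toy scores: mostly near 1.0, with one weak site at 0.5.
toy_mean, toy_median = score_summary([1.0, 1.0, 0.99, 0.5])
```

With scores like these the median stays near 1.0 while the mean drops well below it, which is the pattern worth checking in the real files.
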
I've also queued replicates of two more searches - still the forward strand, but now varying the number of sites I tell the search program to expect. I think because this is a Bayesian method (see my Bayes posts early in August) it needs to start with a 'prior' expectation. My understanding is that this value sets the stringency of the search; if it expects only 500 sites it will be fussier about accepting possible sites than if it expects 2000 sites. I've been telling it to expect 1000 sites, and the five replicate runs have found between 1053 and 1058. Now I'm doing two runs that expect only 500, and two that expect 2000.
If I'm right, the expect-2000 runs will give us datasets with lots more sites, and those sites will be mainly poorer matches to the pattern than the sites we have now. This will be useful because it may give us more power when we look for correlations between the bases at different positions. (I may have posted about this; I know the post-doc has.)
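
To make the stringency idea concrete, here is a small sketch of how a prior expectation could translate into a threshold. It is not the search program's actual algorithm (that is only available as a compiled file); it assumes a simple Bayes' rule picture in which the prior probability of a site at any position is the expected number of sites divided by the number of candidate positions, and a position is accepted when its posterior probability reaches a cutoff. The function name, the 0.5 cutoff and the 1,800,000 candidate positions are all hypothetical.

```python
def min_likelihood_ratio(expected_sites, n_positions, posterior_cutoff=0.5):
    """Smallest likelihood ratio (site model vs background) that lifts a
    position's posterior probability to the cutoff, given the prior.

    Uses posterior odds = prior odds * likelihood ratio.
    """
    prior = expected_sites / n_positions
    prior_odds = prior / (1 - prior)
    cutoff_odds = posterior_cutoff / (1 - posterior_cutoff)
    return cutoff_odds / prior_odds


# Hypothetical number of candidate positions, chosen only for illustration.
N_POSITIONS = 1_800_000

thresholds = {
    expected: min_likelihood_ratio(expected, N_POSITIONS)
    for expected in (500, 1000, 2000)
}
```

In this toy model, halving the expected number of sites roughly doubles the evidence a candidate needs, so an expect-500 run is fussier and an expect-2000 run admits weaker matches.
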
