I'm making lots of progress with the motif search results this system has given me so far. I have at least four replicate runs of each of four kinds of searches: forward DNA strand, reverse-complement DNA strand, leading strand, lagging strand (leading and lagging refer to the direction DNA polymerase moves on the strand during DNA replication). I've analyzed only the five forward strand files so far, because I want to be clear about what analyses are worth doing before going on to the other strands' files.
I've analyzed them for:
- final log MAP score (all between 5096 and 5109).
- number of sites found (all between 1053 and 1058).
- how many of these sites contain perfect, singly-mismatched (one-off) or doubly-mismatched (two-off) USS cores (usually 719, 205 and 105 respectively).
- the mean score of the ~1055 sites each search found (between 0.92 and 0.93). Most sites have scores very close to 1.0, but some are much lower and drive the mean down. (Hmm, I should check the median as well as the mean.)
- the number of sites that differ between each pair of searches (1 vs 2, 1 vs 3 etc.). This ranged between 5 and 17, meaning that more than 98% of the sites each search found were also found by the replicate searches. The sites found in some searches and not others were usually ones with very low scores, usually about 0.5. A few were closely spaced sites with strong scores; I suspect the program can't handle overlapping sites.
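For anyone wanting to repeat the last two checks, here is a rough Python sketch of how they could be scripted. It is not the code I actually used. It assumes each search's output has already been parsed into a dictionary mapping a site's genome position to its score, and that two sites count as "the same" when their positions match exactly. It also treats the sites that differ between two searches as those found by one search but not the other. The run names, positions and scores are made up for illustration.

```python
from itertools import combinations
from statistics import mean, median

# Hypothetical toy data: one dict per replicate search, position -> site score.
# Real searches would each have ~1055 entries.
searches = {
    "run1": {100: 0.99, 250: 0.98, 400: 0.97, 900: 0.52},
    "run2": {100: 0.99, 250: 0.98, 400: 0.97, 1300: 0.48},
    "run3": {100: 0.99, 250: 0.98, 400: 0.97, 900: 0.52, 1300: 0.48},
}

def summarize(sites):
    """Return (mean, median) of the site scores for one search."""
    scores = list(sites.values())
    return mean(scores), median(scores)

def pairwise_differences(searches):
    """For each pair of searches, return the positions found by only one of them."""
    diffs = {}
    for (name_a, sites_a), (name_b, sites_b) in combinations(searches.items(), 2):
        # Symmetric difference: sites in one search but not the other.
        diffs[(name_a, name_b)] = set(sites_a) ^ set(sites_b)
    return diffs

for name, sites in searches.items():
    m, med = summarize(sites)
    print(f"{name}: mean={m:.3f} median={med:.3f}")

for pair, unshared in pairwise_differences(searches).items():
    print(pair, len(unshared), sorted(unshared))
```

In the toy data the few low-scoring sites pull the mean well below the median, which is the pattern I expect in the real files. Matching on exact position would miss sites that are shifted by a base or two between runs, so a real comparison might need a small tolerance.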
If I'm right, the expect-2000 runs will give us datasets with lots more sites, and those sites will be mainly poorer matches to the pattern than the sites we have now. This will be useful because it may give us more power when we look for correlations between the bases at different positions. (I may have posted about this; I know the post-doc has.)