Quite a bit of progress on the Gibbs motif sampler analyses yesterday. I figured out what I'd done to remove the RS3 repeats from the N. meningitidis genome sequence: I used Word to delete all occurrences of ATTCCCnCnnnnGnGGGAAT. I then ran some small-scale Gibbs searches on my laptop and on the fastest of the lab computers, to see whether removing the RS3 repeats changed the motif the sampler found. But none of the searches found any DUS-like motif at all, even when I used a prior file that specified the DUS motif base frequencies. So now I'm rerunning these on a much larger scale (2x100 replicates) on the Westgrid computers, with a prior that specifies the DUS size but not its sequence. The Westgrid computers are slow, but I can have multiple searches running simultaneously, which frees up my own computers to run some Perl simulations (see below).
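For the record, the same RS3 deletion could be scripted instead of done in Word. This is only a sketch, not what I actually ran: it assumes "n" in the pattern means any one of A, C, G or T, and that the sequence has already been read into a single string with no line breaks (a match split across two lines of a FASTA file would otherwise be missed). The function name is made up for illustration.

```python
import re

# ATTCCCnCnnnnGnGGGAAT, with each n assumed to be any single base.
# The pattern is its own reverse complement, so one strand is enough.
RS3 = re.compile(r"ATTCCC[ACGT]C[ACGT]{4}G[ACGT]GGGAAT", re.IGNORECASE)

def remove_rs3(seq):
    """Return (sequence with RS3 matches deleted, number of matches removed)."""
    return RS3.subn("", seq)

# Toy sequence: one RS3-like 20-mer flanked by filler bases.
toy = "GGGG" + "ATTCCCACAAAAGTGGGAAT" + "CCCC"
cleaned, n_removed = remove_rs3(toy)
print(cleaned, n_removed)
```

Getting the count of removed repeats back alongside the cleaned sequence makes it easy to check that the deletion did what was intended.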
I discovered that I don't need to repeat the leading/lagging strand analyses after all. I had forgotten that I'd already redone the N. meningitidis ones (showing that the original surprising result was a fluke), and I decided that the H. influenzae ones I've done don't need to be repeated.
I started analysis of the DUS in the N. meningitidis coding sequences. I ran 2x100 replicates overnight on Westgrid, but they didn't find the DUS even though they used the prior that specified its sequence. Instead they found about 7000 instances of sequences that resemble it only in containing GCCG. I think the problem may be the low density of DUS in coding sequences: ~650 perfect 10-mers in ~1.74 megabases (0.37/kb), whereas the whole genome has ~1900 in 2.2 megabases (0.89/kb). So I've set up a couple more runs, this time telling the program to expect only about 100 occurrences (yesterday I told it to expect 1500).
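A density like the ones above (perfect 10-mers per kb) can be checked with a few lines of code. This is a rough sketch under stated assumptions: the default motif is the commonly cited 10-bp Neisseria DUS, GCCGTCTGAA, which I'm supplying here rather than taking from the runs above; occurrences are counted on both strands; and the function names are hypothetical.

```python
def revcomp(s):
    """Reverse complement of an uppercase DNA string."""
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def motif_density_per_kb(seq, motif="GCCGTCTGAA"):
    """Return (count on both strands, occurrences per kb).

    The default motif is an assumption (the commonly cited DUS 10-mer).
    """
    seq = seq.upper()
    # Count the motif and its reverse complement; the DUS is not
    # palindromic, so the two searches never hit the same site.
    n = seq.count(motif) + seq.count(revcomp(motif))
    return n, 1000.0 * n / len(seq)

# Toy 1-kb sequence with one forward and one reverse-strand copy.
toy = "A" * 490 + "GCCGTCTGAA" + "TTCAGACGGC" + "A" * 490
print(motif_density_per_kb(toy))
```

Run on the coding sequences and on the whole genome separately, this would give the two densities being compared.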
Now I'm going to try to get some Perl simulations running, after at least skimming the copious notes and data the former post-doc left me.