I have to chop a 200 kb file into 20 kb pieces, because the USS position weight matrix I'm using (derived from Gibbs analysis of the H. influenzae genome) is so fastidious (???) that runs take forever. Specifically, a 200 kb simulation that's using a pre-evolved sequence with quite a few uptake sequences already in it has taken 28 days to complete about 3300 cycles. It's about to exceed its pre-specified time limit (800 hours, about 33 days) and be terminated before it finishes. Terminating prematurely means that it won't report the sequence it has so painstakingly evolved. And I had even given it a tenfold higher mutation rate to help it run faster!
Anyway, my clumsy solution was to chop the 200 kb input sequence into ten 20 kb segments, and evolve them all in parallel. Because Word is good with word counts, I opened the sequence file (as a text file) in Word and marked off every 20 kb with a couple of line breaks. Then I opened the file in TextEdit and deleted everything except the last 20 kb to get a test file (no line breaks at all, that I could see). But it generated an 'unrecognized base' error message when I tried to use it, so my first suspicion was that Word had somehow generated a non-Unix line break.
Sure enough, opening the file in Komodo showed that it had. But surprisingly, the problem wasn't a Mac-style line break, but a DOS/Windows line break! Maybe Word 2008 thinks all .txt files are for Windows?
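For anyone who would rather script this than trust Word, here is a rough Python sketch of the same two jobs: checking which kind of line break a file contains, and cutting a sequence into 20 kb pieces with every line break stripped out. The function names are made up for illustration, and I'm assuming that the input is a plain text file containing only bases (no FASTA header) and that "20 kb" means 20,000 characters.

```python
def count_line_endings(data: bytes) -> dict:
    """Count DOS/Windows (CRLF), old Mac-style (lone CR), and Unix (lone LF) breaks."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the CRs and LFs that stand alone.
    return {
        "crlf": crlf,
        "cr": data.count(b"\r") - crlf,
        "lf": data.count(b"\n") - crlf,
    }


def split_sequence(data: bytes, segment_length: int = 20_000) -> list:
    """Strip all whitespace (including any line breaks), then cut into pieces.

    The last piece is shorter if the length is not an exact multiple.
    """
    cleaned = b"".join(data.split())  # bytes.split() splits on ASCII whitespace
    return [
        cleaned[i:i + segment_length]
        for i in range(0, len(cleaned), segment_length)
    ]


# Tiny in-memory demonstration with Windows-style breaks in the middle.
sample = b"ACGT\r\n\r\nACGT"
print(count_line_endings(sample))   # {'crlf': 2, 'cr': 0, 'lf': 0}
print(split_sequence(sample, 3))    # [b'ACG', b'TAC', b'GT']
```

To use it on a real file, read the contents in binary mode (for example `open(path, "rb").read()`) so that Python doesn't quietly translate the line breaks before they can be counted, then write each piece out to its own file.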