About eight years ago I started collaborating with people with bioinformatics skills on an analysis of how USSs constrain (or don't constrain) the ability of the genome to code for proteins. The project is still unfinished, but it's made a lot of progress. I think I should make one last push to get it done before I dive into doing experiments.
Originally I was working with a biophysicist in Taiwan and his excellent graduate student. The student did a lot of nice analysis and was coauthor on one paper with us, but we never got the second part of the work finished (or published). He's since moved on to other things, and the analysis is being redone (new data is available, and we now realize the flaws in the original analysis) by a bioinformatician (bioinformaticist?) working at the National Research Council labs in Ottawa. I'm the main idea person, and the main manuscript-writer, and she's the person who can write Perl scripts and deal efficiently with databases.
Last summer, when we finished our first joint manuscript on USS evolution (pdf here), I did the sensible thing of writing a rough draft of this second manuscript before we'd done most of the analysis. I even put in mock-up figures of the results I expected (based on the previous work by the grad student). I'm told that the best scientists always have a pretty good idea what the paper will say before they do (or get their students and post-docs to do) the experiments, but I'm rarely that far ahead.
Since then the bioinformatician has done quite a bit of the work, but I've mostly let my contribution slide while I did more urgent things. Today I read through my rough draft of the manuscript (a very nice aid to my lousy memory) and realized that I'm far from clear about what analysis has been done and what still needs to be done, by her and by me. I'm pretty sure that I just need to read back through our emails and associated attachments to get this clear.
One new bit of analysis will be Gibbs motif searches on the various genomes. I've already done this to death for H. influenzae but now want to do it with other genomes. Unfortunately I've forgotten such important basic information as how to connect to the computer cluster where I run the searches (username? password?), and how to format the search instructions. Not a big deal - I just wish I had been more organized in storing this useful information.
20 minutes later: I was unduly pessimistic. I started up Fugu (the program I'd used to interface with the computer cluster) and found that it had not only remembered the name and location of the cluster server, it knew my password. And I found a folder called "run pbs scripts" in the Gibbs folder on my computer, and this turns out to contain the instruction files I used when queueing my searches on the cluster server! (I had even forgotten that they needed to be queued.) Now I just need to get the genome sequences, get them into the right format (FASTA, in big fragments?), and reread the pages explaining what my previous instructions meant. Then I can set up the new searches and put the files onto the server. I'll still need to log on to the server using the Mac's Terminal, to put the searches into the queue, but I know those instructions are around somewhere....
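The reformatting step might look something like the sketch below. It is only an illustration under assumptions: the post doesn't say what fragment size or naming scheme the Gibbs searches need, so the 100,000-base default, the `name_start_end` fragment names, and the function names (`read_fasta`, `split_into_fragments`, `format_fasta`) are all hypothetical choices, not part of any existing tool.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_fragments(records, fragment_len=100_000):
    """Cut each sequence into consecutive, non-overlapping fragments.

    fragment_len is an assumed value; the size the searches actually
    need is not stated in the post.
    """
    fragments = []
    for header, seq in records:
        words = header.split()
        name = words[0] if words else "seq"
        for start in range(0, len(seq), fragment_len):
            piece = seq[start:start + fragment_len]
            # Name each fragment with 1-based, inclusive coordinates.
            fragments.append((f"{name}_{start + 1}_{start + len(piece)}", piece))
    return fragments


def format_fasta(fragments, line_width=60):
    """Render (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in fragments:
        lines.append(">" + name)
        for i in range(0, len(seq), line_width):
            lines.append(seq[i:i + line_width])
    return "".join(line + "\n" for line in lines)
```

Reading a genome file, passing its text through `read_fasta` and `split_into_fragments`, and writing out the result of `format_fasta` would give one file of large fragments ready to copy to the server, assuming that is the layout the search program expects.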