Now that we have all this data, we need to decide how to analyze it. It's not just a big dataset but a very rich one, since the samples differ along what can be considered three independent directions, all of which are underlain by rich sets of biological information about phenotypes and molecular events.
- Each sample was in one of two different culture media, either rich growth medium (supplemented brain-heart infusion, sBHI) or the competence-inducing starvation medium M-IV.
- Each sample is part of a time course, either three different cell densities in sBHI or four time points of a culture transferred to M-IV.
- Each sample has a specific genotype: wildtype, knockout mutations in well-characterized competence-regulating genes (sxy, crp or cya), mutations that cause hypercompetence (sxy-1, murE747 or rpoD753), and other mutations that affect competence by unknown mechanisms (knockouts of hfq and of one or both members of the mysterious toxin/antitoxin system).
- Most samples have one or two replicates, from independent cultures usually on different days.
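To keep this design straight once we start slicing the data, it may help to write each sample down as one row of a sample sheet. Here's a minimal sketch in Python of what that could look like. Everything in it is an assumption for illustration: the stage labels (`density_1`, `time_1`, ...) are placeholders because only the number of stages per medium is fixed above, the `culture` field and the `samples_per_condition` helper are hypothetical names, and the example rows are invented, not our real samples.

```python
from collections import Counter
from dataclasses import dataclass

# Placeholder stage labels: only the counts (three cell densities in
# sBHI, four time points in M-IV) come from the description above.
STAGES = {
    "sBHI": ("density_1", "density_2", "density_3"),
    "M-IV": ("time_1", "time_2", "time_3", "time_4"),
}


@dataclass(frozen=True)
class Sample:
    medium: str    # "sBHI" or "M-IV"
    stage: str     # position in that medium's time course
    genotype: str  # e.g. "wildtype", "sxy knockout", "sxy-1"
    culture: str   # hypothetical label for the independent culture

    def __post_init__(self):
        # Reject rows whose stage does not belong to their medium.
        if self.medium not in STAGES:
            raise ValueError(f"unknown medium: {self.medium!r}")
        if self.stage not in STAGES[self.medium]:
            raise ValueError(f"{self.stage!r} is not a stage of {self.medium}")

    @property
    def condition(self):
        # Samples sharing this tuple are replicates of one another.
        return (self.medium, self.stage, self.genotype)


def samples_per_condition(samples):
    """Count how many samples share each (medium, stage, genotype)."""
    return Counter(s.condition for s in samples)


# Invented example rows, not the real sample sheet.
samples = [
    Sample("M-IV", "time_1", "wildtype", "culture_A"),
    Sample("M-IV", "time_1", "wildtype", "culture_B"),
    Sample("M-IV", "time_1", "sxy knockout", "culture_C"),
    Sample("sBHI", "density_2", "wildtype", "culture_A"),
]

counts = samples_per_condition(samples)
```

Counting samples per condition this way would show at a glance which combinations of medium, stage and genotype have no replicate at all, which matters before anyone compares expression between them.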
Ideally we would first characterize the data quality of each sample, and decide if we need to apply any constraints to its use. Then we'd do a very meticulous analysis of the wildtype cultures to identify the genes that change when cells become competent, followed by analysis of the sxy and crp/cya knockouts to identify the genes that are specifically responding to these regulators. This would let us identify all the genes we know we should pay attention to when looking at the effects of the other mutations.
But I don't have a regimented team of minions to do exactly what I tell them. There will be several of us working on this dataset, with different skill levels and research goals. So here's my tentative plan to keep us at least informed about what each other is doing.
Group blog: I've set up a group blog on Blogger, called The Sense Strand (great name, right?). Each of us needs to post there to tell the others what we've done and what we've learned. These posts should be in plain English; this is not a place for data files or code.
Gene-info: We have a big table of information about the known competence genes, and I'm going to convert this into a Google Docs spreadsheet that we all can edit, adding new genes and new information as we develop it.
Shared data files: We've created a shared folder on Google Drive, where we all will post copies of the useful data files we generate.
Code repository: Finally, I've just learned how to create a shared code repository on GitHub. (I'm doing the short Coursera course The Data Scientist's Toolbox, in preparation for their short R Programming course.) We'll all use this to archive copies of the code we use to do our analyses.