Field of Science

Keeping records of computer work


When I do benchwork I consistently keep pretty good notes.  I write down everything I do as I do it, on numbered and dated sheets of paper that go into looseleaf binders, organized by experiment.  If I make notes on a scrap of paper, I tape them in.  I write a brief plan (a few sentences about the point and design of the experiment) and end with some sort of conclusion or summary.

But I don't seem to be able to apply these good record-keeping habits when I'm working with computers.  Instead everything I do feels 'exploratory', as if everything I do is just a preliminary check to see what effect a modification will have, before I do something worth writing down.  The settings I've used for a particular test get overwritten, or lost, or buried in some output file labelled only with the date and time.  Sometimes I print out the results and scribble a few words on the sheet to remind me of its significance, but mostly I just rush on to the next test. Occasionally I resolve to keep better records, but I just write a few sheets of notes and then go back to the exploratory rush.  The various printouts and scribbled notes eventually get shoved in a folder but are too disorganized to be much use.

I hope I'll do better with the work I'm now doing for the US variation manuscript.  I have several relatively well structured goals this time, and a pretty good sense of how to go about accomplishing them.  So yesterday I set aside a new binder, and a pad of proper science-notebook paper (not one of those ephemeral yellow pads).  The first task for this morning is to organize all the sheets of paper on my desk into either this binder or the recycling bin.  All the papers unrelated to the US variation work have already been moved to a pile on the floor, where I'm doing my best to ignore them.

Gibbs can't find the DUS

Well, my four Gibbs searches of the whole genome (with and without the RS3 repeats) hit the 36 hour wall when they were only about 1/3 of the way through their 100 seeds (= 100 replicate searches).  And, judging by the scores reported for the results with each seed, none of them found the DUS.

So I went back to old blog posts, to see whether this is a problem I had already solved.  (Yes, I know that's pathetic, I should be able to remember what I've learned, but the ability to find forgotten results in blog posts is one of the big benefits of blogging about my research.)  In this post I considered using a prior that specifies the motif, but decided to instead seed the search with a few hundred bp enriched for the DUS and later remove these occurrences from the output.  I'm not really sure that this is the best approach so maybe I'll try both now (using only 25 seeds and specifying a 48 hr walltime).

I'll queue up a lot of combinations of priors (just-length and base-frequencies), numbers of expected occurrences (200 and 2838), and seeded and unseeded sequences, but I'll do this only for the sequences with the RS3 repeats removed.  That's because I already have results from sequences with the repeats, and the purpose of these new runs is just to find out whether removing the repeats makes a difference.  If one or more of the runs succeeds in finding the DUS, I'll do the same run with the sequence with the repeats and see if the motifs differ.
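The combinations described above form a small full-factorial design, and listing them explicitly makes it harder to lose track of which settings went with which run. This is only a sketch: the label strings and dictionary keys are hypothetical names chosen for illustration, and nothing here talks to the Gibbs sampler or the Westgrid queue.

```python
from itertools import product

# Factor levels taken from the text; the label strings are illustrative only.
priors = ["just-length", "base-frequencies"]
expected_occurrences = [200, 2838]
seeding = ["seeded", "unseeded"]

def planned_runs():
    """Return one settings record per run in the full factorial design."""
    return [
        {
            "prior": prior,
            "expected": expected,
            "sequence": seeded,
            "seeds": 25,          # replicate searches per run, as stated in the post
            "walltime_hr": 48,    # requested walltime, as stated in the post
        }
        for prior, expected, seeded in product(priors, expected_occurrences, seeding)
    ]

runs = planned_runs()
for number, run in enumerate(runs, start=1):
    print(number, run)
```

Printing (or saving) this list before submitting anything gives a numbered record that each output file can be matched against later.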

I'd rather find out why Gibbs can't find the DUS but can find the less frequent and more complex USS with no trouble.  But I don't have any ideas.

Progress on multiple fronts

On the Gibbs front, my new analysis of the coding sequence DUSs did find them. Both replicates did: the trick was to reduce the expected number of occurrences (not something I would have predicted). I may try that with other difficult searches. My analyses of the whole genome are still running. I hope they don't run over the 36 hours I specified when I put them in the Westgrid queue, because they'll just be aborted and I'll have to start them over again.

On the Perl simulations front, I've got the program running and used it to do the control simulations. The first controls use random sequences the same lengths and base compositions as the concatenated H. influenzae or N. meningitidis intergenic sequences, run with matrices specifying the corresponding USS or DUS core but with no recombination. These controls tell us what the baseline USS or DUS score is for a genome that hasn't experienced any accumulation. The second controls use the real H. influenzae or N. meningitidis intergenic sequences instead of random sequences, and run for a long time to see how long the sequences take to degenerate to the predetermined baselines (i.e. to become randomized with respect to USS or DUS). The score isn't a very sensitive indicator for this degeneration, as the genome may still contain an excess of the imperfectly matched cores, but I'll be able to tell this from the final analysis done at the end of the run.
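The first kind of control can be illustrated with a few lines of Python. This is a rough sketch, not the Perl program itself: it counts perfect matches to a core rather than scoring with a matrix, and the sequence length (200,000 bp) and G+C fraction (0.38) are illustrative assumptions, not settings taken from the real runs. The core used is the Hin-type USS core AAGTGCGGT.

```python
import random

def random_sequence(length, gc_fraction, rng):
    """Random DNA of the given length with the given G+C fraction."""
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    # Weights are in the same order as the letters in "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_core(seq, core):
    """Count perfect matches to the core on both strands (overlaps allowed)."""
    total = 0
    # A set avoids double counting if the core equals its reverse complement.
    for motif in {core, reverse_complement(core)}:
        start = seq.find(motif)
        while start != -1:
            total += 1
            start = seq.find(motif, start + 1)
    return total

rng = random.Random(1)  # fixed seed so the control can be regenerated exactly
control = random_sequence(200_000, 0.38, rng)
print(count_core(control, "AAGTGCGGT"))
```

Fixing the random seed and writing it down alongside the length and base composition is what makes a control like this repeatable.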

After screwing up the settings many times this afternoon (e.g. specifying the N. meningitidis sequence and matrix but forgetting to change to the corresponding base composition), I realized that I could save myself a lot of wasted time by making two versions of the program, each with its own matrix and sequence files and with a settings file that specifies the appropriate genome size, base composition, and matrix and sequence files. So I did. All of the analyses I've planned will be simulating the evolution of either USS in H. influenzae intergenic sequences or DUS in N. meningitidis intergenic sequences, so now I just need to open the right folder.

Gibbs progress, and on to the Perl simulations

Quite a bit of progress on the Gibbs motif sampler analyses yesterday. I figured out what I'd done to remove the RS3 repeats from the N. meningitidis genome sequence (used Word to delete all occurrences of ATTCCCnCnnnnGnGGGAAT). So I then ran some small-scale Gibbs searches on my laptop and the fastest of the lab computers, to see whether removing the RS3 repeats changed the motif it found. But none of the searches found any DUS-like motif at all even when I used a prior file that specified the DUS motif base frequencies. So now I'm rerunning these on a much larger scale (2x100 replicates) on the Westgrid computers, with a prior that specifies the DUS size but not its sequence. The Westgrid computers are slow, but I can have multiple searches running simultaneously, freeing up my own computers to run some Perl simulations (see below).
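The Word search-and-delete step can be reproduced with a regular expression, which has the advantage of leaving a record of exactly what was removed. A minimal sketch, assuming each `n` in ATTCCCnCnnnnGnGGGAAT stands for any single base (A, C, G or T); the function name is made up for illustration. The pattern is its own reverse complement, so searching one strand finds the occurrences on both.

```python
import re

# Each "n" in the pattern from the post is treated as any one of A, C, G, T.
RS3 = re.compile(r"ATTCCC[ACGT]C[ACGT]{4}G[ACGT]GGGAAT", re.IGNORECASE)

def remove_rs3(sequence):
    """Return (sequence with RS3 matches deleted, number of matches removed)."""
    return RS3.subn("", sequence)

# Tiny made-up example: one RS3-like 20-mer flanked by GGG and CCC.
cleaned, n_removed = remove_rs3("GGGATTCCCACAAAAGTGGGAATCCC")
print(cleaned, n_removed)
```

Keeping the returned count with the output file documents how many repeats were deleted from the genome sequence.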

I discovered that I don't need to repeat the leading/lagging strand analyses after all. I had forgotten that I'd already redone the N. meningitidis ones (showing that the original surprising result was a fluke), and I decided that the H. influenzae ones I've done don't need to be repeated.

I started analysis of the DUS in the N. meningitidis coding sequences. I ran 2x100 replicates overnight on Westgrid, but they didn't find the DUS even though they used the prior that specified its sequence. Instead they found about 7000 instances of sequences that resemble it only in containing GCCG. I think the problem may be the low density of DUS in coding sequences (~650 perfect 10-mers in ~1.74 megabases; 0.37/kb); the whole genome has ~1900 in 2.2 megabases; 0.89/kb. So I've set up a couple more runs, this time telling the program to expect only about 100 occurrences (yesterday I told it to expect 1500).

Now I'm going to try to get some Perl simulations running, after at least skimming the copious notes and data the former post-doc left me.

Today's goal: Gibbs analyses

I've been trying to sort out the massive jumble of files that might be relevant to the US-variation manuscript, but I've pretty much decided that it's a waste of time. Instead I'm just going to focus on finding the files and sheets of paper where we have summarized what's been done and what we learned. And then I'll try to either find or create the specific files I need (depending on whether finding or re-creating looks easier).

For today, I'm not going to try to do anything about the Perl simulation work. Instead I'll just focus on the Gibbs analyses. I did find my instructions-to-myself of how to do these. I haven't heard back from the RS3 expert (apparently because his spam filter didn't like the urls in my email signature), so the first step now is to find what I've done to analyze these repeats. In particular, I have an N. meningitidis genome sequence from which the RS3 repeats have been removed; maybe I can figure out how I did this, and then test whether it makes a difference to the Gibbs results. And I also need to figure out why I seem to have been using two slightly different counts of the number of perfect 10-mer DUSs in the N. meningitidis genome.

Checking a basic technique

A conversation with the new post-doc, and then with the RA, raised questions about how each of us does the serial dilutions and platings we use to measure cell numbers. These measurements are so fundamental to our research that we haven't questioned whether we're all doing them right.
  • Do we always use a pipettor and disposable tips, or do we sometimes use glass pipettes?
  • Do we use relatively large volumes, in culture tubes, or small volumes in microfuge tubes?
  • Do we always make 1-in-10 dilutions, or sometimes make 1-in-100 or other proportions?
  • When we use a pipettor, do we use a fresh pipette tip every time, or do we only change tips when we think it matters? Every time we sample from a different tube? Only if the new tube has a higher concentration of cells? Only if the new tube has a lower concentration of cells? Only if the volume we need to measure changes? What about when we're using glass pipettes - do we use a fresh one every time?
  • Do we pipette liquid up and down in the tip or pipette before removing our sample? Do we pipette liquid up and down to rinse the tip or pipette out after putting the sample into the new tube?
  • Do we always plate the same volume of dilution onto each agar plate, or do we use different volumes to refine our measurements?
This afternoon we're going to do a test, to find out whether these differences matter. I've grown cultures of two E. coli strains overnight, one wildtype and one resistant to kanamycin. And I've poured lots of plates, with and without kanamycin. I'm going to mix a bit of the KanR culture into the wildtype culture, and then we're all going to dilute and plate the mixed culture to estimate the density of wildtype and KanR cells.

If we all get the same answers, we'll know that our differences in technique don't matter. But if we get different answers then we'll need to investigate further.
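Whatever the pipetting technique, everyone's plate counts get converted to cell densities with the same arithmetic, so it is worth writing that step down once. The sketch below assumes plain plates count all cells and kanamycin plates count only KanR cells; the colony counts, dilutions and plated volume are hypothetical numbers for illustration, not results.

```python
def cfu_per_ml(colonies, dilution_factor, volume_plated_ml):
    """Estimate viable cells per ml of the undiluted culture.

    dilution_factor is the total fold-dilution of the tube that was plated,
    e.g. 1e6 after six serial 1-in-10 steps.
    """
    if volume_plated_ml <= 0:
        raise ValueError("volume plated must be positive")
    return colonies * dilution_factor / volume_plated_ml

# Hypothetical counts: 150 colonies on a plain plate, 30 on a kanamycin plate.
total = cfu_per_ml(colonies=150, dilution_factor=1e6, volume_plated_ml=0.1)
kan_r = cfu_per_ml(colonies=30, dilution_factor=1e4, volume_plated_ml=0.1)
wildtype = total - kan_r  # plain plates count both kinds of cells

print(f"total {total:.2e}/ml, KanR {kan_r:.2e}/ml, fraction KanR {kan_r / total:.1e}")
```

Comparing each person's `total` and `kan_r` estimates from the same mixed culture is then a direct test of whether the technique differences matter.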

What the uptake sequence variation manuscript needs

I'm going to sort out what data this manuscript needs before I do another thing!

........

OK. Progress.

Genome analyses needed: I need to reanalyze the Neisseria meningitidis genome with the Gibbs motif sampler, but not until I've decided whether or not to first remove the copies of the RS3 repeat. I've emailed the person who discovered them, asking him whether he thinks they are insertions or arise in situ like uptake sequences. If the former, I'll use the genome sequence that I've already removed them from. I'll do the analysis on the whole genome, and then on the strands sorted by their direction of replication. I did this before and got weird results; if the same thing happens this time I'll investigate further.

I've already done the corresponding analyses for H. influenzae, though I should probably repeat the replication-direction analysis because that was done with a slightly different dataset.

I should also analyze both genome datasets for the numbers of one-off and two-off motifs (singly and doubly mismatched); that will be easy because we have a little Perl script (somewhere) to do that now.
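For reference, counting one-off and two-off motifs amounts to sliding a window along the sequence and recording the number of mismatches to the core. This is a Python sketch of the idea, not the lab's Perl script; the function names are invented, and each window is scored against whichever strand orientation of the core it matches best.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def mismatch_counts(seq, core, max_mismatches=2):
    """Count windows by number of mismatches to the core, on either strand.

    Returns a dict such as {0: perfect, 1: one-off, 2: two-off}.
    """
    counts = {k: 0 for k in range(max_mismatches + 1)}
    motifs = {core, reverse_complement(core)}
    width = len(core)
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        best = min(hamming(window, motif) for motif in motifs)
        if best <= max_mismatches:
            counts[best] += 1
    return counts

# One perfect Hin-type core, then a copy with its last base changed.
print(mismatch_counts("AAGTGCGGT" + "CCCCC" + "AAGTGCGGA", "AAGTGCGGT"))
```

A pure-Python loop like this is slow on a whole genome but fine for checking a faster script against small test sequences.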

I should look at the effect of coding constraints by doing Gibbs searches with the coding and intergenic subsets of both genomes. But I won't split up the coding subset by the different reading frames - this is messy and not very informative.

The analysis of covariation has been done for Neisseria. I can't remember whether the H. influenzae covariation analysis was only done with the old dataset and so should be redone. The control analysis for Neisseria showed an odd pattern of weak covariation between every third position of random sequence segments. I don't think it's due to coding effects because I see the same pattern, a bit weaker, in the noncoding dataset. Maybe it's those blasted RS3 elements, so perhaps I should redo the Neisseria analysis with the RS3-deleted dataset.

The analysis of within-species variation at uptake sequences in H. influenzae is done, and there's no N. meningitidis equivalent to do.

And finally, what needs to be done with the Perl simulation of uptake sequence evolution? The few paragraphs I've found in the manuscript (written by me last fall) say that I'm going to take 200kb of intergenic sequence (or maybe all the intergenic sequence) of H. influenzae and of N. meningitidis, and find out what combinations of mutation rates and uptake bias the simulation needs to maintain their present levels of uptake sequences. Sounds straightforward, though I bet it isn't really.

back to the uptake-sequence variation project

Last night I reread what I've written so far on the manuscript about variation of uptake sequences.  It's a bit of a mess, because the different parts were intended for different manuscripts.  I got discouraged by all the work that still needs to be done, but this morning I'm more optimistic.  

One big job is to finish up all the analyses based on Gibbs motif searches.  I need to do the basic searches on the H. parasuis genome, do several analyses of subsets of the N. meningitidis genome, and repeat analyses of the H. influenzae genome that were originally done with a different dataset than the one I'm now using.

A second big job is to do more simulations with the Perl model of uptake sequence evolution, to give a clear picture of how a few basic factors affect their accumulation.  Right now I don't even remember which factors this should address, but I think everything is clearly set out in earlier posts and in the documents the former post-doc left me.

And the final big job is to weave the different parts together to make a manuscript that tells a coherent story.

Clean negative results

I've scored all my transformation plates and found no evidence of transformation. The Tet plates have no colonies at all. This could be because I used 20 µg/ml when I should have used 10; I'm checking the resistance of RR902 to see if I should repeat this transformation. The Kan plates had lots of colonies when cells were plated undiluted, but the numbers were similar for cells with and without DNA. With DH5alpha the colonies were all tiny even after two days, but quite a few of the colonies produced by the BW25113 derivative RR3015 were reasonably large.

What next? The RA is also going to repeat these experiments, using exactly the same method that gave apparent transformants previously. And I'm going to streak some of my KanR colonies onto MacConkey maltose (the recipients are all Lac-) to see if any are crp- as true transformants should be.

The new post-doc's plans

I've been reading over the fellowship application our new post-doc submitted to NIH. NIH didn't fund it, but the comments were quite supportive so we want to fix it up and send it in for the next deadline (June sometime?). For me this is also an opportunity to think carefully about the immediate and long-term experiments we're planning. The immediate experiments will provide preliminary data that should make the proposal more compelling, and will also be a foundation for the full NIH R01 proposal we hope to submit in the fall. The medium-term is the experiments the proposals actually propose to do, and the long-term is where they'll lead, both for the post-doc's career (an issue raised in the reviews of his application) and for my lab's ongoing research.

One preliminary analysis we should do is comparison of the two genomes he'll use. Both are sequenced, and it would be good to provide a table and/or a figure giving specific numbers of SNPs (is it called a polymorphism when you're only comparing two individuals?), numbers and lengths of indels, and information about specific differences relevant to the proposed analysis. This could be an appendix if such are allowed, or a small table in the text.

Another thing the proposal needs is a more explicit description of the calculations that underlie its claims that the scope of the sequencing is sufficient for the information desired. He'll be creating pools of DNA from different stages and sequencing these. Depending on the source of the pool of DNA, this will require a lot of sequencing, a ton of sequencing, and what until recently would have been an absurd amount of sequencing. We can afford to do some sequencing on our present budget, and the R01 proposal is mainly to get funding for the massive sequencing. I don't understand the sequencing methods he's proposing as well as I need to, and I haven't seen any of the calculations yet. We need to lay them out in enough detail that the reviewers will have confidence in us. Perhaps we can also create some simple diagrams illustrating how the different pools will be analyzed.
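The simplest version of such a calculation is expected coverage: total bases sequenced divided by genome size, with a Poisson approximation for the fraction of positions hit at least once. This is a generic back-of-envelope sketch, not the proposal's actual calculation; the read number, read length and genome size below are hypothetical placeholders.

```python
import math

def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of reads covering each position."""
    return n_reads * read_length_bp / genome_size_bp

def fraction_covered(coverage):
    """Expected fraction of positions covered at least once (Poisson model)."""
    return 1.0 - math.exp(-coverage)

# Hypothetical run: one million 36-bp reads on a 1.8-Mb genome.
depth = mean_coverage(n_reads=1_000_000, read_length_bp=36, genome_size_bp=1_800_000)
print(f"mean coverage {depth:.1f}x, fraction covered {fraction_covered(depth):.6f}")
```

For pooled DNA the required depth is higher, because the reads are shared among all the different molecules in the pool; that scaling is what the proposal would need to spell out.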

Follow-up frustrations

I'm trying to do the follow-up experiment to the previous unsuccessful attempt to transform E. coli with chromosomal DNA. This 'follow-up' is really a 'fall-back', as I'm now trying to replicate the RA's previous apparently successful transformations. I'm growing the two strains she's transformed previously (derivatives of BW25113 and DH5alpha, each containing a low-copy sxy expression plasmid), and will transform them with the two DNAs she's used successfully, one from RR902, an old E. coli strain containing a Tn10 insertion in purE*, and the other from RR1314, a BW25113 derivative containing a KanR cassette in crp. Both genotypes have been confirmed; RR902 grows fine on minimal supplemented with the purine inosine but not at all without inosine, and PCR shows the size of the crp fragment in RR1314 to be that predicted by the disruption.

Because she's found plating anomalies depending on cell density, I'm planning to plate a wide range of dilutions, and I'm also going to use two kanamycin concentrations - the 10 µg/ml she's already used and also 20 µg/ml.

BUT... one of the recipient strains (RR3013, the DH5alpha derivative) refuses to grow in LB with 20 µg/ml chloramphenicol (the selection for the sxy plasmid), although the other strain, which carries the same plasmid, is growing just fine. Both were inoculated at the same time, each from a plump single colony on a LB Cm10 plate. And when I tried to look at the non-growing culture under our very expensive microscope, I discovered that the high-power lens has somehow become all crudded up, and that lens cleaning solution doesn't help.

So I guess I'll just do the transformations into the other strain (RR3015). Update: my reinoculation of RR3013 crept up to the appropriate density so I included it too.

* Our strain list says this strain is also called NK6051 (constructed by Nancy Kleckner), and the E. coli Genetic Stock Center says NK6051 has its Tn10 insertion in purK, not purE. That's fine, purK is the gene next door to purE, so the original mapping was probably an error.

Negative results (no transformants)

The results of my big transformation experiment are clear but not what I had hoped.

There's no evidence of transformation of either strain by any of the markers. As expected, the C600 strain produced Lac+ revertants (frequency 10^-5 to 10^-6) and Leu+ revertants (frequency 10^-6 to 10^-7). I know these are revertants because the control cells (no DNA or no sxy insert or no inducer) produced just as many Lac+ and Leu+ colonies as the cells given DNA, sxy and inducer. The thr-1 mutation didn't revert detectably (frequency <10^-8) ... lacZ gene.

I know the plates used for the selection were OK because the donor strain (W3110) grows fine on all of them but the recipients don't. This also confirms that the donor strain carries the expected wildtype alleles of lac, leu and thr.

So what next? First the RA and I need to discuss her latest results checking out the crp::kan marker she's been using for her (apparently successful) transformations. Then I'll probably do another set of transformations and controls, this time with that marker.

Waiting for the colonies to grow

OK, I did my assigned E. coli transformation experiment, though it took a couple of days.  I did find out that cells with the sxy plasmid grow just as well as those with the no-insert plasmid, at least in the absence of the inducer IPTG.  I still need to have someone show me how to use the nano-drop DNA-concentration spec, and before using it I need to further clean up my DNA prep to get rid of the RNA.  (I know there's RNA in it because I estimated its concentration and quality (fragment length) by running it in a gel.)

When I was in grad school we always added triphenyl tetrazolium chloride to our minimal plates because it makes the colonies red and easy to see.  I couldn't find any instructions for using it in minimal plates but I found old (Lederberg-old) instructions for using it in nutrient agar, so I used that concentration.

Now I just need to wait for the colonies to grow.  That will be slow because they're on minimal agar, so they probably won't be countable until tomorrow.  I can spend the intervening time helping grade my last exam question and entering the numbers into my spreadsheet, and then doing the Excel-wrangling to get the final grades for my ~400 freshman biology students.

Experiment plans

OK, my assignment (from the RA) is to test transformation of E. coli with nutritional markers rather than the crp::kan cassette she's used so far.

The first step is to inoculate strain W3110 into, say, 10 ml of LB broth and grow it overnight, and then tomorrow morning do a DNA prep. This will be the donor DNA for my transformations.

I'll be transforming two strains, C600 and BW25113. C600 is a work-horse strain from my grad-school days; genetically it is lacY (it can't take up lactose), thi (it needs the vitamin thiamine in its medium), and thr leu (needs the amino acids threonine and leucine in its medium). Its full genotype is F- tonA21 thi-1 thr-1 leuB6 lacY1 glnV44 rfbC1 fhuA1 λ-. F- and λ- mean it differs from the ancestral K-12 strain in lacking the conjugative plasmid F and the lambda prophage. The tonA21 mutation removes a protein used as a receptor by phage T1, the glnV44 mutation (also called supE44) changes a glutamine tRNA gene so it inserts its glutamines at what would be UAG stop codons, rfbC catalyzes a step in synthesis of the outer-membrane sugar rhamnose, and fhuA helps cells take up ferrichrome (an iron-scavenging siderophore). I've listed all these because I want to make sure I've thought about all the factors that might confound my experiments - I see none here.

The other strain, BW25113, comes from Barry Wanner by way of the fantastic Keio group in Japan who have generated many of the E. coli clones and cassette mutants we've used. Its full genotype is F- ∆(araD-araB)567, ∆lacZ4787(::rrnB-3), lambda-, rph-1, ∆(rhaD-rhaB)568, hsdR514. So it has deletions of the araBAD operon (can't use arabinose) and the rhaBAD operon (can't take up rhamnose). rph-1 helps processing of tRNAs, and hsdR is a restriction nuclease that cuts DNA lacking a specific methylation. 

I'm only interested in the lacZ mutation; this is an insertion of 4 rrnB terminators that block transcription of the lac operon. I'd better check for an associated deletion, and for the size of the insertion, as these factors may influence the efficiency of recombination. Well, after quite a lot of searching I've come up with not much. This allele has an unspecified deletion and an insertion of 3 (or 4) copies of the rrnB transcriptional terminator in the lacZ promoter. I've emailed Barry Wanner asking for more details that would let me estimate the length of the heterology between this allele and a wildtype allele. (Prompt reply! The heterology is about 1 kb.)

What controls will I want to do? No DNA, to confirm that I'm seeing transformants and not new mutants. Cells without the inducing sxy gene (with a no-insert version of the plasmid). Cells with the sxy plasmid but without IPTG induction.

What will I select for? In BW25113 I can only select for Lac+, so I'll need minimal plates with lactose as well as minimal plates with glucose. BW25113 doesn't need any supplements. In C600 I can select for Lac+ by providing lactose as the only sugar, but the plates need to be supplemented with thiamine, threonine and leucine. I can also select for Thr+ and for Leu+ by not adding these amino acids to the medium.

So I'll need stocks of glucose and lactose (20%, I think - it's been a very long time since I did simple genetics in E. coli) and of threonine and leucine (10 mg/ml is standard for amino acids, I think). Thiamine I might as well put into all the agar. And minimal salts, and autoclaved agar.

Because I'll be growing up cells with both the inducing plasmid and a no-insert plasmid, I can also follow their growth in enough detail to see whether the insert slows growth.

No real results yet

The first experiment didn't produce much in the way of results.  Mostly because of a strain-name mixup that had me trying to transform kanamycin-resistant cells with a kanamycin-resistance marker...  (certainly not the first time I've made that kind of mistake).  The other transformations gave somewhat-unexpected results, but I hadn't done the proper controls for them so it's all still inconclusive.

I'm going to switch to transformations with simple nutritional markers (sugar use, amino acid requirements), but I'll need to first make some DNA from the donor cells (W3110), and then make lots of minimal plates with glucose or other sugars.  I think I can just spread the required amino acids on the plates as needed, rather than pouring plates with particular amino acids in the agar.

But first I have to deal with the problem of too many freshman biology exams to grade and too few graders to do the grading.  Then some Excel-wrangling, then I'm done with teaching for a long time.

Today's experiment

Today I'm planning to do the first of what I hope will be a long series of natural transformation experiments in E. coli.  The research associate has already done a number of these, but I didn't trust her positive results because they happened in the supposedly recA- strain DH5alpha as well as in another strain that is recA+.  But now that I've found that our DH5alpha stocks have the UV-sensitivity typical of recA+ strains, I'm more optimistic that the transformations really are working.

What strains will I test?  The two strains that she has previously tested, plus a control strain that doesn't carry the plasmid we use to induce the chromosomal competence genes.  (I could not bother with the DH5alpha strain, but I think its unexpected phenotype may provide useful information.)

What DNA will I use?  The RA has made a stock of DNA from a strain carrying a kanamycin-resistance insertion in the crp gene.  She uses it at 2 micrograms/ml.  She's now making a better strain whose DNA we'll use for future experiments; it carries this insertion and easily-selectable wild-type alleles of other genes.

How will I prepare the cells?  I've grown the two plasmid-carrying cultures overnight from single colonies (over two nights really as I inoculated them on Friday night).  I'll dilute these 1/100 in LB and grow them to OD=0.2.  The other strain I'll need to inoculate directly from a single colony this morning.  When the cells are at OD=0.2 I'll add IPTG and then the DNA.  After 2 hours I'll plate the cells with and without kanamycin.  I'll use a wide range of dilutions on the kanamycin plates, because the RA has found 'bald-spot' problems when dense cultures are plated directly.  I'll include a negative control with no DNA, and another with DNaseI added at the same time as the DNA.

What do I need to do to prepare?  Pour lots of LB plates, especially kanamycin ones.

Results of UV-sensitivity tests

My tests of UV sensitivity have found something odd about strain DH5alpha. I know it's supposed to be recA1, but both our standard lab strain and an independent DH5alpha derivative obtained from a European lab are no more UV-sensitive than the Rec+ strains we've tested (W3110, C600, BW25113), and much more UV-resistant than our NM554 strain (recA13).

This suggests that our DH5alpha strains are not really RecA-. That would be consistent with the RA's results in her transformation assays, but it seems unlikely that both our stocks are not what they're supposed to be.

Another weird result is that when DH5alpha is carrying a low-copy sxy expression plasmid it becomes as UV-sensitive as NM554. But the UV-sensitivity of the Rec+ strain BW25113 isn't altered by the same plasmid. (sxy expression in these cells wasn't induced with IPTG, but that may not matter because DH5alpha is deleted for lacZYA and maybe also lacI.)

I guess I should check other components of the genotype of our DH5alpha strains before I start emailing the RecA experts for advice. It's supposed to be F-, φ80dlacZΔM15, Δ(lacZYA-argF) U169, deoR, recA1, endA1, hsdR17(rk-, mk+), phoA, supE44, λ-, thi-1, gyrA96, relA1. Easy to check for Lac- (the RA just made some MacConkey plates), but many E. coli strains are Lac- so that's not very diagnostic. Hmmm....

What I love about doing E. coli genetics

Sometimes you can do one simple experiment in the morning, check the plates late that afternoon, and do a follow-up experiment before you go home!

Photo documentation

This is an iPhone snap of my test plate after overnight incubation.  The first and third streaks are a recA+ control strain and the second and fourth are of a known recA- strain.  Because I was working at an odd angle, the green lines I drew to guide the exposures aren't lined up with the actual exposures.  The recA+ cells grew fine after 5 seconds of UV, and gave a few colonies even after 1 minute (probably cells that were shielded in some way from the UV).  The recA- cells grew fine when they were not exposed to any UV, but didn't grow at all even after only 5 seconds UV.

So today I've streaked all the cells I want to test, each on at least 2 different plates.  I also reduced the UV dose.  Yesterday I used a high dose range (5, 15, 30 and 60 seconds) to make sure I had a dose that would kill the recA- cells; today I used 2, 5, 10 and 15 seconds.

Doing an experiment at last!

Only a tiny experiment, but it's a start.  I need to test some E. coli strains to find out whether or not they have recA mutations.  Because RecA regulates DNA repair, the easy phenotypic test is UV sensitivity.

Here's a diagram of how it's done.  First find a piece of cardboard and a short-wavelength UV lamp (I use the handheld one we sometimes use to look at DNA in gels).  Lightly streak each strain you want to test onto an agar plate, as shown in the Before picture, being sure to include both recA+ and recA- controls.  Then expose different parts of the streaks to the UV, using the piece of cardboard to shield the rest as shown in the middle picture.  (Be sure to take the lid off the petri dish because the plastic is UV-opaque.)  Then incubate the plate overnight.  If your UV dose is appropriate, you'll see the growth shown in the After picture.

This is very quick and easy, but it's not foolproof. If the UV dose is too high none of the strains will grow in any of the exposed areas, and if it's too low all the strains will grow everywhere.  If you make the mistake of streaking the cells too thickly the cells on top will shield the cells underneath them and all the strains will grow everywhere.  Because this is the first time I've done this with E. coli since I was in grad school, I'll need to try several different combinations of lamp distances and exposure times.  

What's done, what's not

OK, now I guess I'd better go through these parts of the US-variation manuscript, figuring out what's done and what still needs to be done:
  • Analysis of the true consensus and variation in uptake sequence motifs in all the bacterial genomes that have uptake sequences (= the family Pasteurellaceae (USS) and the genus Neisseria (DUS)).
I've done whole-genome Gibbs analyses and logos for all the species with uptake sequences. (Hang on, better check that no new ones have appeared since I did this.) Good thing I did; they've finished the Haemophilus parasuis genome. A quick count of canonical USSs (MS-Word bioinformatics) finds only 99 Hin-type USS cores (AAGTGCGGT and reverse) and 450 Apl-type (ACAAGCGGT and reverse). The genome has only 12 of a predicted novel USS core (GAGTTCGGT), nicely confirming our prediction that another group's assignment of this as the H. parasuis uptake sequence was an error (Redfield et al. 2006). Now I need to remember how to do the Gibbs analysis and do it on this new genome.
  • Analysis of variation in DUS and USS motifs across different location categories (orientation wrt replication, in coding sequences, in non-coding sequences, in terminator positions).
We're only describing this for N. meningitidis and H. influenzae. The H. influenzae work is all done, but I still need to do at least some of the N. meningitidis analysis. My notes say I haven't done the direction-of-replication analysis but I think I have - maybe I didn't finish it up.
  • Analysis of covariation between the different positions of the DUS and USS uptake sequence motifs (e.g. does having a particular base at one position correlate with having a particular base at another position).
I've done this for both N. meningitidis and H. influenzae, and prepared the figure.
  • Additional experimental data on how variation in uptake sequence affects uptake by H. influenzae. (This will just be a paragraph as it only modestly enriches a previously published dataset.)
Done, figure prepared.
  • Development of a computer-simulation model of uptake sequence evolution, and use of it to investigate the roles of key factors in maintaining uptake sequences in the non-coding parts of genomes.
Now this is the biggie. The model is all developed, and we've done a lot of work with it. But I need to remind myself of what we'd found (my recollections are all muddled with the confusing interim results and changes we made to the model). Luckily, before the former post-doc left she put together a good summary of where things stood, so my first task is to use that to restore my brain to its previous understanding.
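
The quick count of canonical USS cores in the first item above (done there with MS-Word searches) could also be scripted. A minimal sketch, assuming the genome is already loaded as one plain string of A/C/G/T and that "and reverse" means the reverse complement; the function names and the toy sequence are invented for illustration and are not the H. parasuis genome.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_core(genome, core):
    """Count a core on both strands, allowing overlapping matches."""
    genome = genome.upper()
    total = 0
    # A set avoids double counting if a core equals its own reverse complement.
    for motif in {core, reverse_complement(core)}:
        start = genome.find(motif)
        while start != -1:
            total += 1
            start = genome.find(motif, start + 1)
    return total


CORES = {
    "Hin-type": "AAGTGCGGT",
    "Apl-type": "ACAAGCGGT",
    "predicted novel": "GAGTTCGGT",
}

# Toy sequence for illustration only.
toy_genome = "TT" + "AAGTGCGGT" + "CC" + "ACCGCACTT" + "GG" + "ACAAGCGGT" + "AA"

for name, core in CORES.items():
    print(name, count_core(toy_genome, core))
```

For a real genome the sequence lines of the FASTA file would first need to be joined into a single string, otherwise cores that span a line break are missed.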

The Perl-model manuscript (and the data) still need lots of work

I guess it's not surprising that I'd overestimated how close to completion the work is for the manuscript that, among other things, describes results from our computer simulation (Perl) model of uptake sequence evolution. As it's the only uptake sequence manuscript we still have under way, I think I'll start by giving it a simpler title. But first I'd better summarize what it now contains.
  1. Analysis of the true consensus and variation in uptake sequence motifs in all the bacterial genomes that have uptake sequences (= the family Pasteurellaceae (USS) and the genus Neisseria (DUS)).
  2. Analysis of variation in DUS and USS motifs across different location categories (in coding sequences, in non-coding sequences, in terminator positions).
  3. Analysis of covariation between the different positions of the DUS and USS uptake sequence motifs (e.g. does having a particular base at one position correlate with having a particular base at another position).
  4. Additional experimental data on how variation in uptake sequence affects uptake by H. influenzae. (This will just be a paragraph as it only modestly enriches a previously published dataset.)
  5. Development of a computer-simulation model of uptake sequence evolution, and use of it to investigate the roles of key factors in maintaining uptake sequences in the non-coding parts of genomes.
In its present incarnation the manuscript is titled Evolution of DNA uptake sequences under molecular drive, but that title really only refers to the simulation model part. The real title should probably include the word 'variation', because that's what's been missing from previous work. For now, in this blog, I think I'll refer to it as the US-variation manuscript. (I reserve the right to come back and change the previous sentence if I find a better name.)

Excavated documents

Well, the big pile of documents has been excavated and the results are i) a few minor contributions to the recycling bin; ii) a smaller pile of printouts of articles; and iii) a similar pile of assorted research notes and records. Not the Perl-model manuscript drafts and notes I was looking for - they turned out to have been neatly filed in the file box labeled "Manuscripts in progress", which sits on a shelf right next to the file box labeled "Manuscripts failing to progress".

I now remember that the stuff in the pile on the floor was there for a reason. It's all sources of important ideas that I keep forgetting about - either papers I've read that tell me things I really want to remember, or notes from previous research that should someday be followed up on. Filing them would have almost the same effect as just throwing them out. This way I periodically go through the pile hoping to tidy it away but instead discovering that I need to keep these things where I'll see them now and then.

So what did I find? (Maybe if I blog about them I can decide how to use some of them?) Starting from the top of the research notes pile:
  1. The table of contents of part of a former tech's lab notebook, indexing the DNA uptake experiments she had done.
  2. A table listing the H. influenzae strains in a 'tiling-path' collection that a colleague had given us, with a map of the large-insert plasmid they're in. These are cloned in E. coli, and I think I was planning to use them to test whether H. influenzae competence genes work in E. coli (maybe make E. coli competent). I should mention these to the RA when we sit down on Tuesday to discuss the E. coli transformation experiments.
  3. The abstract of a paper reporting that RadC (competence-induced in S. pneumoniae, H. influenzae and E. coli) does not contribute to transformation or DNA repair in S. pneumoniae. This belongs in the pile of articles, next to one showing that RadC contributes to replication-fork stabilization (highly relevant to our ideas about what else the 'competence' regulons control).
  4. A very old (c. 1992?) folder containing restriction maps of some plasmids we made with the H. influenzae cya gene. I think these could be filed away.
  5. A table summarizing 'Next-Generation Sequencing Informatics', printed four months ago and probably already out of date. But highly relevant to the new post-doc's research and our planned NIH proposal.
  6. A list of the research questions we hoped to answer by microarray analysis of H. influenzae gene expression (also c. 1992). We've certainly answered most of these, but perhaps not all. Certainly I can't remember the answers to some.
  7. An unpublished summary of research some colleagues did into the distribution of transformability in H. influenzae strains. I think it was presented as a poster about 4 years ago. They sent us the strains, and a former post-doc included some of them in her more detailed analysis of the distribution of competence and transformability (now in press in Evolution).
  8. Notes from my analysis of PTS genes in Pasteurellaceae, particularly the glucose and fructose transporters. Of interest because the PTS regulates cAMP which regulates competence, and the H. influenzae PTS has only the fructose transporter.
  9. The reviewers' comments on a manuscript that the former post-doc is revising.
  10. A page of notes from last summer (or the summer before last) when I was planning to test conditions that might induce E. coli sxy by assaying for expression of two lacZ fusions (to comA and ppdA). I did this and saw no induction.
That's enough for today, even though it's only about 10% of the pile.

One manuscript accepted, another resubmitted

Our manuscript on the coevolution of uptake sequences and bacterial proteomes has been accepted for the inaugural issue of Genome Biology and Evolution.  This is the one that's been ten years in the making!

And our manuscript on the CRP-S regulon of E. coli has been resubmitted.  It was rightfully trashed by the editors in its previous incarnation (mostly for claiming what it didn't deliver), but now we think it's very good.

And classes are over!  Still lots of teaching stuff to deal with - I have 24 term papers discussing whether intelligent design is a scientific alternative to natural selection, and a final exam to compose - but I've started excavating the stack of research-related paper on my office floor.  Today I expect to reach the layer of drafts and data for our Perl-model of uptake sequence evolution.  As I recall, much of the draft manuscript is already written, and only a bit more simulation data is needed.  

And maybe by Monday I can get into the lab...

NIH programs

I spoke yesterday with the Program Director for the NIH Evolution of Infectious Diseases program.  She suggested that the analysis we're planning might fit better in a different program, Prokaryotic Cell Growth, Differentiation and Adaptation.

She gave me contact information for the director, so one of today's tasks is to compose an email that introduces the research I want to propose and asks when would be a good time to call him and discuss it.

CMYK - the saga continues

I found a grad student in the lab next door who's a graphics whiz, and she converted my PowerPoint files to CMYK for me.  But I forgot about the need for 300dpi resolution and for reducing the size to approximately that of the final printed image, so the CMYK versions are both about twice as big as they should be and much too low in resolution (72 dpi, says the digital image analyzer thoughtfully recommended by the journal).  

Unfortunately the size discrepancy isn't large enough to compensate for the resolution discrepancy.  So I've sent the PowerPoint file and jpgs and tiffs derived from it to my collaborator, hoping she can do the conversion we need.  She's sent me the other information I need for the final revisions, so maybe we can still get it all submitted before Monday.
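
The arithmetic behind that first sentence, as a small sketch. Shrinking an image keeps the same pixels but packs them into fewer inches, so the effective resolution rises in proportion. The 72 dpi and 300 dpi figures are the ones quoted above; the widths (14 inches now, 7 inches in print) are assumed values chosen only to represent "about twice as big".

```python
def effective_dpi(current_dpi, current_width_in, target_width_in):
    """Resolution after rescaling to the target width with no resampling."""
    return current_dpi * current_width_in / target_width_in


REQUIRED_DPI = 300        # resolution the journal asks for
current_dpi = 72          # what the image analyzer reported
current_width_in = 14.0   # assumed: about twice the printed width
target_width_in = 7.0     # assumed printed width

dpi_after = effective_dpi(current_dpi, current_width_in, target_width_in)
pixels_have = current_dpi * current_width_in
pixels_need = REQUIRED_DPI * target_width_in

print(f"effective resolution after shrinking: {dpi_after:.0f} dpi")
print(f"pixels across: have {pixels_have:.0f}, need {pixels_need:.0f}")
```

Under these assumptions halving the size only doubles 72 dpi, which is still less than half of what is required, so the figures have to be re-exported at a higher pixel count rather than just resized.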

The Sxy in E. coli manuscript

About 2 weeks ago (maybe even longer) the RA gave me what was supposed to be the final version of our manuscript about what Sxy does in E. coli. She and the former post-doc had laboured over the revisions (we had convinced the editor to treat it as a 'revise and resubmit' rather than an absolute never-darken-our-door-again rejection), and I was just supposed to suggest a few minor improvements in the writing.

Instead I found lots of problems, so she and I have spent much of the last 10 days revising and re-polishing the sentences, and changing parts of the text, and completely replacing the Discussion, and making the figures clearer. Finally we have a version to send to the former post-doc for final approval. We think it's now quite good, and we certainly don't want to do any more writing, so he's been strictly warned that he's not allowed to make any substantive changes to the text, except maybe to the Discussion, which still needs a final paragraph.

Here's hoping we can submit this one on Monday.

CMYK!

The uptake sequences vs proteomes manuscript is almost ready for resubmission. The bioinformatics author is just checking whether she can do any statistical analysis for one simple graph, and I'm trying to prepare the figure files for submission.

Unfortunately the journal wants the files in CMYK format, but we made the figures with PowerPoint, which doesn't do CMYK. I spent much of yesterday evening looking for a way to convert them that didn't involve Photoshop. I don't want to buy Photoshop because it's far too sophisticated and complex for our needs, and it's very expensive.

So first I Googled the problem, but didn't find any easy free solutions. Then I tried my test version of the program Acorn, which costs less than 10% of what Photoshop costs but claims to do most of the same things, only easier. But Acorn can't convert files to CMYK. Then I Googled some more, and found that Macs come with a utility application called ColorSync that claims to do this. So I spent quite a while figuring out how, and doing it, only to discover that the CMYK files became their negatives every time they were saved. Everything goes black except the text, which goes white. The colours don't exactly go black, but they become very dark.

So then I emailed the former post-doc who had mastered these file conversions, and he said he'd had the same problem with ColorSync, but had done CMYK conversions using an old copy of Photoshop on one of our computers. I found our old copy of Photoshop Elements on that computer (it came free with a scanner), but it refused to open. The RA said she has Photoshop (no, wait, it's on the home computer), and that it's also on another old computer. That was again Photoshop Elements, and it did open. But when I tried to use it on my files, there was no CMYK option, and further Googling revealed that Photoshop Elements doesn't support CMYK at all.

I'd really like to get this done today, because the journal wants the resubmission by today so they can include it in their inaugural issue. So I think I'll ask around to see if someone else has a copy of Photoshop I can use for a little while.

Should I go to the SMBE meeting?

I've been debating whether to attend the Society for Molecular Biology and Evolution meeting in June.  It's in the same place (University of Iowa) as, and right after, a meeting I will attend (John Logsdon's Evolution of Sex and Recombination), so practically it would be easy and I'm paying the airfare anyway.  But I'm also going to the American Society for Microbiology meeting in May (Philadelphia), and because I'm combining this with a visit to family in Florida I'll be home for only 2 days between this and the Iowa trip.  So I'll be pretty burned out from meetings and traveling by that point.

I've never been to Iowa, so a few days there sounds interesting, but a week in a small college town surrounded by cornfields?  I suppose I could get a lot of work done.  Dates:  Fly in May 31, Sex meeting May 31-June 3, SMBE June 3-June 7, fly home Sunday June 7.

Also, registration for the SMBE meeting is $450.  But accommodations are cheap, especially if I stay at the Motel 6 ($36/night).  But staying there would require renting a car, which would be fine if I wanted to do a lot of sightseeing, but I don't think there are many sights to see (though I'd like to drive over and look at the Mississippi).  I could stay on campus for about $75/night, which is still far cheaper than what I'm paying for my ASM accommodation.

On the other hand, I just got an email announcement that the SMBE meeting will include a symposium on the impact of next-generation sequencing methods for evolution.  As the CIHR and NIH proposals I'm planning will include massive sequencing, this sounds like something I should attend.  And other symposia look excellent too.

I need to decide by April 5 to get the early registration discounted fee - otherwise I'd pay $50 more.

Starting to plan proposals

I'm planning to write two grant proposals this summer, to the Canadian Institutes for Health Research (CIHR) and the US National Institutes of Health (NIH).  They'll be on similar topics, aspects of how H. influenzae and E. coli take up DNA.  NIH has way more money to give out, so I'll be asking them to support the really expensive work involving massive amounts of DNA sequencing.

The first (practical) issue is when grant proposals are due.  CIHR proposals are due Sept. 15, and NIH proposals Oct. 5.  (Both actually have to be in the hands of the UBC research administration a few days before, because the streamlined electronic submission process takes a lot longer than the old paper submissions, which just had to be in the hands of the courier by midnight on the due date.)

These dates work out well, because I won't have to tell CIHR all the details about the megabucks we're asking NIH for (you have to include with your application a copy of the summary page of any related proposals you have submitted), but can just casually mention that we plan in the future to apply to NIH for sequencing money to address related questions.  This is good because Canadian grant panels aren't always comfortable with giving their limited funds to people who have lots from other sources.  They also may not be happy if they see that you have already applied elsewhere for funds to do overlapping work.  But NIH doesn't mind (see my previous post), so having already applied to CIHR won't faze them.

What I learned at the NIH workshop

This morning's NIH workshop was very useful - I learned quite a bit, and it sure got me ready to start preparing for applying next fall. My plan is to redo the DNA uptake proposal I submitted to CIHR two years ago, and send versions of it to both CIHR and NIH. A new post-doc arrives in a couple of weeks to begin working on aspects of DNA uptake. I'll be just about done teaching by then (classes over), so he, I and the present post-doc (about to be Research Associate) will spend much of the summer getting the preliminary results these proposals need.

What I learned this morning: (I'm collating my scattered notes here so I won't forget them)

Sign up for the weekly NIH Announcements emails. Most of the contents are of no interest to me, but I'm a fast reader so I should be skimming them anyway.

Find a Program Officer and talk to him/her repeatedly about the science. NIH is full of very helpful people, who measure their success by whether or not your grant gets funded! (Not like CIHR.) So the first step in grant preparation is finding the program officer whose interests best match yours. I have the url for a listing of these people (no, it's just the list of Institutes; I guess I should start by calling NIAID - they handle infectious diseases). Anyway, I'll do this on Wednesday. I can also ask US colleagues who their program officer is. And when I go to US meetings I should look for the NIH people and talk to them about my plans. I'm going to the big microbiology meeting in May -- NIH should have a substantial presence there -- and to two meetings on evolution (evolution of sex and molecular evolution) in June (NIH people might be there too).

Use the cover letter to direct your proposal to the right people and places: Name the names of the NIH people who are on your side.

Relationships matter as much as good science. Again, this is all about getting to know a program officer.

Brag! This is hard for Canadians, but being modest is a big mistake here.

NIH is flush right now. Obama has given them $10 billion that must be spent by Sept 2010 (on top of their annual budget of about $30 billion). Most of this can't go outside the country, but some can, as subcontracts and foreign components of domestic grants. And it will take the pressure off of the main grant stream, so hopefully getting funded will be easier for at least the next several years.

Focus on the 'R01' grants: Don't bother applying for the little 'R03' grants ($50k/year for only two years). But it might be worth applying for an 'R21' grant ($275K over two years) - you don't need as much preliminary data as you do for the regular 5-year R01 grants because they're intended to let you generate the preliminary data.

Money is always tight, so having US collaborators is a bonus. I wouldn't be a co-investigator or sub-contractor on someone else's grant (this is our own work), but I should be able to line up a couple of solid American collaborators, or at least have letters of support from Americans who will provide help and advice if needed. But the need for this varies with the NIH program, so I need to talk to 'my program officer'.

Spell out up front why this foreign grant deserves funding. There are three criteria. 1. Opportunities not available in the US. In my case, it's that nobody in the US wants to do this, and only I have the combination of evolutionary and molecular expertise to see why it's so important and to carry it through. (Of course this argument will depend on how much evolution is in the grant.) 2. Augmentation of existing US resources. Can I claim this? 3. Potential for improving the health of US taxpayers (it's their money). I can argue that understanding why and how bacteria take up DNA will give us therapeutic targets.

Commit at least 15% of your 'effort' to this project. 20% is probably better. But be prepared to back this claim up with evidence - don't claim to be putting 50% of your effort into each of several projects.

The Budget section of NIH proposals is complex: I was looking forward to the new 'modular' budgeting, where you just need to say how many $25K modules you want per year (up to $250K/year), but this doesn't apply to foreign grants.

Ask for some salary. Canadian faculty have 12-month salaries and we're not allowed to ask for any salary support from the Canadian agencies. But salary support is a standard item on NIH grants so we're not seen as very serious if we don't ask for any. The legalities of doing this are a bit uncertain - I asked my Department Head whether the money is allowed to travel from the UBC Finance account into my bank account; he's going to ask the Dean. Someone said we are allowed to get the equivalent of two months' salary - I don't know if this only applies to consulting fees and running a company on the side, or also to salary from outside grants and contracts.

Some 'indirect costs' can become direct on foreign grants. Foreign grants get only 8% as indirect costs to the institution, and these are intended to defray only the costs of administering the grant, not to cover the indirect costs of doing the research. So some expenses that in the US would be considered indirect costs to be provided by the institution (e.g. phone, office supplies, secretarial support) on foreign grants can be included in direct costs.

Ask for more money than you need: My previous NIH grant was awarded the full amount I had asked for, but now across-the-board cuts of 10% or even 20% are common. So budget in some extra. So I guess I should ask for 20% of my salary?

Don't be overambitious: Don't propose to accomplish an unreasonable amount of science. The reviewers won't think you're exceptional, just naive.

Plan for the long term: When planning your proposal, think beyond the 5-year term. What will you want to do next? How will what you are proposing now affect that? What are your long-term goals and how does each project take you closer?

Be very careful not to write anything that might turn a reviewer against you. Don't be disparaging or smart-assed. Check the membership of the study sections your proposal is likely to be sent to, and be sure to cite all their relevant work.

The font matters? One speaker recommended using Arial 11 font. I'll have to email him to ask why.

Don't leave town right after your proposal is submitted: The NIH system scans each proposal for technical errors (wrong kinds of information in wrong boxes), and gives you only two days after submission to fix these.

It's fine to apply to both NIH and CIHR for the same project. If both succeed, NIH is happy for the aims to be readjusted and the project split up into two parts, one funded by each agency.

Starting in 2010, a rejected proposal will only be allowed one resubmission. NIH is trying to clear out the deadwood of the grants that just won't die.

Once the project is funded, the budget allocation is flexible. I used to think that NIH budgets were quite rigid, but if I later decide that I need to spend the money on something other than what was originally budgeted I can do so. If I'm changing key personnel (not just a tech or student) or if it would change the 'scope of the work' I need to get NIH approval, but that's usually just an email. However there's no allowance for currency fluctuations.

Progress...

I'm spending the morning at a workshop on getting NIH grants, sponsored by some consortium of UBC research administrators.  It should be time well spent, as I plan on submitting a proposal next fall to get money for ambitious work on DNA uptake.   

Then maybe a bit of the afternoon can be spent working with the post-doc to finish up the E. coli-Sxy manuscript, so it can be sent (hopefully for the last time) to our former-post-doc co-author in Ireland.

Then a day of thesis defense (someone else's student) and an always-enjoyable meeting evaluating proposals for interdisciplinary workshops.

Then maybe, just maybe, I'll have time to take a long-overdue look at our computer simulation manuscript.

Minor manuscript submitted!

Our London colleagues agreed with our improvements and arguments, and suggested ways to improve the cover letter.  So now we'll see if the editor also buys our arguments.

Too bad I'm just one of the middle authors.  I feel like I've done a lot of work on it, including actual experiments (the time course I did in December), but the senior author is, rightly, the London investigator whose grant supported most of the experiments, and the first author is, rightly, the person in his lab who did those experiments.  My post-doc is sharing official-first-author credit with that person, largely I think because she's taken the initiative of putting the paper together and doing most of the writing.

Minor manuscript progress (major progress on a minor manuscript)

A minor manuscript we've been working on (the one I did the time course for last December) was recently provisionally accepted by an appropriately minor journal, but one of the reviewers wanted us to do some major additional experiments. However neither we nor our coauthors in London are keen to invest any more work on its topic (why some strains of Actinobacillus pleuropneumoniae are competent and others aren't).

None of the unanswered questions can be answered by simple experiments, and if we were to follow this trail we would want to work in Haemophilus influenzae, the species our research focuses on. In particular, the two analyses recommended by the reviewer would both be a waste of time. One, knocking out the sxy gene in a strain that transforms very poorly, wouldn't tell us anything about why the strain doesn't transform better. Neither would the other, sequencing all the competence genes in a strain that transforms well and comparing them to the already-sequenced genes of this strain, because we have no functional framework for interpreting any differences we might find (i.e. we don't know anything about how the encoded proteins do their jobs).

So the post-doc and I spent much of yesterday rewriting the manuscript to address the reviewers' concerns and misunderstandings, and composing a tactful cover letter to the Editor explaining why we aren't going to do the experiments. The paper is much better now - while rewriting we realized that we had been underplaying an important implication of one of the experiments - the demonstration that the A. pleuropneumoniae sxy gene works perfectly in H. influenzae. These genes are quite divergent (only 24% amino acid identity) and because the two species are in the two different Pasteurellaceae subclades, this result lets us conclude that Sxy functions identically in all of the Pasteurellaceae.

We've now sent the revisions off to our London coauthors (without the point-by-point response to the reviews, which we need to compose today). They've been a bit more cautious than us on this manuscript, arguing even before we submitted it that more experiments probably should have been done, even though they don't have the resources to do them. Hopefully our arguments will convince them that we should just send the revised manuscript back to the Editor and hope for the best. Our cover letter to the Editor does say that, if he still thinks it necessary, we will make and test the sxy knockout, but by 'we' we mean them.

ANOVA success

I found our lab stats package (GraphPad Prism), and read bits of its very detailed and user-friendly help files. Then I pasted in my data and did some two-way ANOVAs. Then I read the help files some more and decided I should have done 1-way ANOVAs with 'repeated measures'. (That tells the software to consider all the values in the same row as belonging together.)

I first analyzed each group of tripeptides separately (the blue ones as one dataset, then the pink, then the yellow). The blue set had significant differences between the columns in the ANOVA (p=0.01). It also had significant differences between the bright-blue column and all other columns by Tukey's multiple comparison test. I used this rather than the Bonferroni test but I'm not sure which would have been more appropriate - I think this is less sensitive than the experiment deserves, because I had specific comparisons in mind from the start. The pink set had not-quite significant differences (p=0.058) in the ANOVA, and not-significant differences between any pairs of columns in the Tukey's test. The yellow data had very significant differences between the columns in the ANOVA (p<0.0001), and significant differences between the bright-yellow column and all other columns by the Tukey's test.

I then rearranged the data, putting the bright-colour data all in the same column (the 'cognate-proteome' column), and the pale-colour data in the other columns. This let me analyze all three colours together. The ANOVA found very significant differences between the columns (p<0.0001) and the Tukey's test found significant differences between the cognate-proteome column and all the other columns.
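
The rearrangement itself is simple enough to sketch. This assumes each row records which proteome is its cognate one, and it names the four proteomes after the bar colours in the graph (blue, red, yellow, green); the numbers are invented placeholders, not the manuscript's data, and the order of the non-cognate columns is an arbitrary choice.

```python
PROTEOMES = ["blue", "red", "yellow", "green"]


def cognate_first(rows):
    """Move each row's cognate value into column 0, the others after it."""
    rearranged = []
    for label, cognate, values in rows:
        others = [values[p] for p in PROTEOMES if p != cognate]
        rearranged.append((label, [values[cognate]] + others))
    return rearranged


# (tripeptide label, cognate proteome, value for each proteome)
# Placeholder values for illustration only.
rows = [
    ("SAV", "blue", {"blue": 9.0, "red": 4.0, "yellow": 3.0, "green": 5.0}),
    ("QAV", "red", {"blue": 2.0, "red": 7.0, "yellow": 3.0, "green": 2.0}),
    ("PSE", "yellow", {"blue": 1.0, "red": 2.0, "yellow": 8.0, "green": 3.0}),
]

for label, values in cognate_first(rows):
    print(label, values)
```

After this step the first column holds the cognate-proteome value for every tripeptide regardless of colour group, so the three groups can go into one analysis.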

The control comparisons (using reversed tripeptides) were never significant.

So now I can add a sentence to the manuscript, reporting that the effects shown in Figure 1 are statistically significant.

I should have paid more attention in stats class




One of the reviewers of the manuscript I'm revising for Genome Biology and Evolution asked if we could do some statistical analysis of the data we present in a graph. On the left I've put the graphs and the data. The lower graph panel and lower block of data are the controls; we can ignore them for now. I think we can also safely ignore what the data represent.

I'll describe the significance questions with respect to the top-panel graph (A):

We want to know the following:
In the left group (4 blocks of four bars, labels SAV, TAL, KEG, PHF/L), are the four blue bars significantly higher than the red, yellow and green bars beside them?
In the middle group (4 blocks of 4 bars, labels QAV, TAC, TSG, PLV), are the four red bars significantly higher than the blue, yellow and green bars beside them?
In the right group (5 blocks of 4 bars, labels PSE, SDG, FRR, QTA, RLN/K), are the five yellow bars significantly higher than the blue, red and green bars beside them?

The actual numbers are in the upper part of the table, in the correspondingly coloured cells, and below I'll restate the above questions in terms of these numbers.

In the top four rows of the table (blue), are the numbers in the bright-blue cells significantly higher than the numbers in the light-blue cells in the same rows?
In the next four rows of the table (pink), are the numbers in the bright-pink cells significantly higher than the numbers in the light-pink cells in the same rows?
In the next five rows of the table (yellow), are the numbers in the bright-yellow cells significantly higher than the numbers in the light-yellow cells in the same rows?

I suspect this is an ANOVA (analysis of variance) type of problem. But I'm pretty sure it would require more complicated analysis than the simple ANOVA described in the new statistics textbook my author-colleague kindly gave me (probably to get me off his back with dumb statistics questions). Hmmm, maybe it would be possible to do a separate ANOVA on each group -- i.e. one for the blue data, one for the red data, and one for the yellow data.

UPDATE:

My basic version of Excel doesn't have the statistics add-in needed for ANOVAs, and I can't even remember the name of the statistics/graphing package the lab owns (it's not installed on my computer). But I found an on-line applet to do two-way ANOVAs here (I need two-way because I have two variables, the rows and the columns). So I pasted the data from the blue cells into the applet, with the following results.


"Conclusion on Treatments Effects: Very strong evidence against the null hypothesis." The null hypothesis is that all treatments (columns) gave the same results, so there are very significant differences between the data in the different columns (p=0.00058).

"Conclusion on Blocks Effects: Moderate evidence against the null hypothesis." The null hypothesis is that all blocks (rows) gave the same results, so there are moderately significant differences between the data in the different rows (p=0.011).
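For anyone who wants to see what's under the hood, here is a minimal Python sketch of the arithmetic, assuming the applet does a two-way ANOVA without replication (one number per cell, columns as treatments, rows as blocks). I don't know that this is exactly what the applet does, the function name is my own invention, and the numbers are made up rather than taken from the blue cells. It only computes the F statistics; turning those into p-values needs the F distribution (for example `scipy.stats.f.sf`).

```python
def two_way_anova_no_replication(table):
    """table: list of rows (blocks), each a list of column (treatment) values."""
    n_rows = len(table)
    n_cols = len(table[0])
    grand_mean = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    # Partition the total sum of squares into blocks, treatments and error.
    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_blocks = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    ss_treatments = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_blocks - ss_treatments

    df_treatments = n_cols - 1
    df_blocks = n_rows - 1
    df_error = df_treatments * df_blocks
    ms_error = ss_error / df_error
    return {
        "F_treatments": (ss_treatments / df_treatments) / ms_error,
        "F_blocks": (ss_blocks / df_blocks) / ms_error,
        "df_treatments": df_treatments,
        "df_blocks": df_blocks,
        "df_error": df_error,
    }


# Made-up 3 x 3 table, not the real data.
made_up = [[1, 2, 3], [2, 3, 5], [3, 4, 7]]
result = two_way_anova_no_replication(made_up)
print(result)
```

A large F for treatments corresponds to the applet's "evidence against the null hypothesis" for columns, and a large F for blocks to the same statement for rows.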

This is definitely the kind of information I want, so I guess I should find the lab's statistical/graphing package and find someone to show me how to use it to do ANOVAs properly.

But this analysis doesn't let me see whether it's only the bright-blue column that's significantly different from the others. I guess I could repeat the analysis, leaving out the bright-blue data, and see if the others are not significantly different, but I'm sure there's a better way to do this. After I play around with our statistical/graphing package for a bit, I might be knowledgeable enough to go ask my colleague for help without embarrassing myself too badly.

The uptake-sequence manuscript we submitted to the new journal Genome Biology and Evolution has been provisionally accepted. One of the reviewers said the following, and I'm wondering if there might be an easy way to do this suggested analysis:
"...despite claiming that one of their main goals was to determine whether uptake sequences had an effect on protein and organismal fitness, the authors did not look if these sites are under purifying/diversifying selection. It would be greatly relevant for their question of interest, which is currently only supported by indirect evidence."
The reviewer is absolutely right. We didn't think of doing this analysis, but we should have (thought of it, not necessarily done it).

I don't think our dataset is appropriate for anything more sophisticated than simply calculating dN/dS ratios, and I'm not at all sure it's even suitable for that. I had to start by pulling out my complimentary copy of Freeman and Herron's undergraduate textbook Evolutionary Analysis, which explains how dN/dS ratios and McDonald-Kreitman tests are used to examine DNA sequences for evidence of purifying or diversifying selection on the amino acids they encode. For a pair of aligned DNA sequences, dN/dS is the ratio of the number of differences that change the encoded amino acid to the number of differences that don't change the encoded amino acid. There are lots of programs and web sites that will do this analysis, given pairs of aligned sequences in the appropriate format.
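To make the counting concrete, here is a rough Python sketch of the simple version described above: walk through two aligned coding sequences codon by codon and tally the codons that differ, split by whether the amino acid changes. The function name and the example sequences are made up. Real dN/dS programs go further (they normalize by the numbers of nonsynonymous and synonymous sites and correct for multiple changes), so treat this as an illustration of the idea only.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq1, seq2):
    """Return (nonsynonymous, synonymous) counts of differing codons."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    nonsyn = syn = 0
    for i in range(0, len(seq1) - 2, 3):
        codon1 = seq1[i:i + 3].upper()
        codon2 = seq2[i:i + 3].upper()
        if codon1 not in CODON_TABLE or codon2 not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguous bases
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            syn += 1
        else:
            nonsyn += 1
    return nonsyn, syn


# Made-up example: AAA (Lys) -> AGA (Arg) changes the amino acid,
# CTG (Leu) -> CTA (Leu) does not.
print(count_codon_differences("ATGAAACTG", "ATGAGACTA"))  # (1, 1)
```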

I think that my bioinformatician coauthor has DNA sequences of hundreds of H. influenzae and N. meningitidis genes, each aligned with each of three 'standard' homologs from genomes that don't have uptake sequences. These alignments have been sorted into classes, based on how many uptake sequences the H. influenzae or N. meningitidis gene has (0, 1, 2, 3, >3). I think the appropriate analysis would be to score the dN/dS ratio for each alignment, calculate the mean score of the three standard alignments of each H. influenzae or N. meningitidis gene, and then calculate the grand mean score for all the genes in each class.
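The averaging scheme I have in mind looks something like this in Python. It's a sketch under assumptions: I'm guessing the scores could be handed over as one record per gene (gene name, uptake-sequence class, and the three per-alignment scores), and the gene names, classes and numbers below are invented.

```python
from collections import defaultdict
from statistics import mean


def class_grand_means(records):
    """records: iterable of (gene, uptake_class, [score per standard alignment])."""
    gene_means_by_class = defaultdict(list)
    for gene, uptake_class, scores in records:
        # First average the three standard alignments for this gene...
        gene_means_by_class[uptake_class].append(mean(scores))
    # ...then average the per-gene means within each class.
    return {cls: mean(vals) for cls, vals in gene_means_by_class.items()}


# Made-up scores, not real dN/dS values.
made_up_records = [
    ("geneA", "0", [0.1, 0.2, 0.3]),
    ("geneB", "0", [0.4, 0.4, 0.4]),
    ("geneC", "1", [0.2, 0.2, 0.5]),
]
print(class_grand_means(made_up_records))
```

Averaging within each gene first means every gene counts equally in its class, even if some alignments were missing for a few genes.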

This analysis isn't hard to describe, but it might be harder for my coauthor to automate, depending on the details of how the alignments are formatted and what the dN/dS programs will accept. I'm going to email my former post-doc who has a lot of sophisticated knowledge about these methods, asking for her advice.

Should Darwin be an 'ism'?

On Tuesday evening I'm leading a Cafe Scientifique discussion on the topic Should Darwin be an 'ism'? I chose this topic as something that a broad range of people would be interested in and have ideas about, but I need to do some reading and thinking first.  Luckily the discussions take place in a local pub (The Railway Club, 579 Dunsmuir, 7:30pm, in case you're interested), and the atmosphere is very informal.
What will I read?  Carl Safina had a very relevant article in the New York Times last month (Darwinism must die so that evolution may live, Feb. 9), which I need to read carefully.  When it came out I didn't take the time to read it properly because I expected to agree with everything it said.  But I also need to read a bit more history of the use of the term Darwinism, maybe in Ernst Mayr's The Growth of Biological Thought.
What do I think?  Maybe biologists started referring to evolutionary theory as Darwinism as a way to give credit to a truly exemplary scientist.  But now the creationists are turning this against us, claiming that we 'worship Darwin like a god', and that any evidence that Darwin made any error is evidence that evolutionary theory is wrong.  Using the term Darwinism also lets people put evolutionary biology in with a pile of what are now largely discredited ideologies and belief systems (Marxism, Raelism, Freudian psychology, etc.).
Unfortunately the cat is out of the bag.  Getting evolutionary biologists to forgo using Darwinism will be easy, but re-educating the general public will be much harder.  The real problem is still that the creationists are much better publicists than we are, and they are determined to keep the public believing that evolutionary biology is synonymous with Darwinism.

Open access at the American Society for Microbiology annual general meeting

In May I'll be part of a panel discussion on open-access publishing, at the big General Meeting of the American Society for Microbiology.  The other participants are 'professional experts': Jon Eisen, Academic Editor in Chief, PLoS; Sam Kaplan, Chair of the ASM Publications Board (ASM publishes about a dozen journals and many books); and Joe Deken of the California Institute for Telecommunications and Information Technology.  I guess what I'll bring to the table is the perspective of the ordinary scientist trying to do what's right.

Writing

In addition to How to Write a Lot, I've been re-reading a little book on writing by Joseph Williams, Style: The Basics of Clarity and Grace.  This wonderful book is mainly about how to write sentences that are easy to read and understand, something all scientists strive for but few of us achieve.

One reason scientific sentences are often hard to follow is called 'nominalization'.  That's when an action is described by a noun rather than a verb.  For example, instead of writing 'the cell divided' we might write 'cell division occurred'.  I'm building my ability to avoid this by going through the manuscript I'm revising, rewriting sentences that suffer excessively from nominalization.  I don't have to search for these sentences; almost every sentence has one or more nominalized actions in it.

Here's an all-too-typical example:  "In E. coli, the dramatic reduction in growth and eventual cell death caused by sxy overexpression made it impossible to test whether sxy induction produces the typical ‘natural competence’ phenotype of high-efficiency transformation with linear chromosomal DNA."  It's a perfectly OK sentence, no grammar or syntax errors, but it's still a bit of an effort to read.  Can I improve it by replacing some of the nominalizations (reduction, overexpression, induction, transformation) with verbs?  

Yes I can.  "We could not test whether inducing sxy causes E. coli cells to become naturally competent and efficiently transform with chromosomal DNA, because when cells overexpress sxy their growth rate slows and they eventually die."

Small steps

Prompted by the How to Write a Lot book, I'm trying to spend half an hour on a scholarly-writing task before I get up, each morning that I don't have to be somewhere early (i.e. not at the gym by 8:00am).  As a result of this, I've nearly finished the first draft of my short essay for the ASM evolution book.  Only about 3 paragraphs still to go, and I know what they're going to say!

Yesterday I started reading the revised manuscript about Sxy in E. coli.  It still needs a fair bit of rewriting work, but maybe the post-doc and I will have enough time for sitting-down-together-and-revising that we can still get it out by the end of the week.

Manuscript work

The end of term is approaching so I can see the light at the end of the teaching tunnel (mixed metaphor?).  Here's a list of the manuscript-related tasks on my plate:

Informal chapter for the festschrift for John Roth:  The first draft is in the hands of the editors, who I hope will soon give me feedback on how to improve it.  The post-doc and undergrad also have it - I haven't had any feedback from them either.

Short essay for the ASM popular science book on Darwin and microbial evolution:  I'm working on this.  I need 2000 words and have about 1500.  It's turning into a nice discussion of how we can study natural selection in bacteria.  I should soon have a list of all the other authors and their topics, which will help me integrate mine.  I just reread the email invitation, and now realize that I'm supposed to include personal stuff about me as a scientist - maybe I will, maybe it won't fit.

Manuscript about regulation by E. coli Sxy:  This was gently rejected by J. Bacteriology (with the possibility of resubmission).  The post-doc first-author has now rewritten it for resubmission, with input from the former post-doc other-author, and she's now passed it on to me.  If it's OK we'll submit it this week. 

Manuscript on co-evolution of uptake sequences and proteomes:  This has been provisionally accepted by Genome Biology and Evolution.  I asked the bioinformatician coauthor for feedback - she sent me a short email with some questions I haven't responded to.  So the first step is to respond to her questions (well, after I re-read the reviewers' comments).  I'm hoping we won't need to do any substantial new work.

Manuscript on student writing and learning:  This has been languishing since my teaching-fellow post-docs left town.  It's nearly finished so I should get it done.

Manuscript on the perl model of uptake sequence evolution:  As I recall, this needs a bit more computer-simulation work and quite a bit more writing.  The post-doc senior author has moved to Toronto, but we should still be able to get this done.

Manuscript on the phylogeny of H. influenzae strains and competence:  This is the work of this same post-doc.  The manuscript was provisionally accepted, but with requests for substantial additional work.  Based on her past performance I'm confident that she will get this done, but I should touch base with her about it.

Can that be all?  

progress (?) on the ligase puzzle

The NHEJ expert said that he thought the periplasmic assignment of the H. influenzae ATP-dependent DNA ligase must be an error.  I was discussing the ligases with a colleague who works on Campylobacter (which also has one of these ligases) and she suggested I try running the sequences through the program PSORT-b, which is particularly good with bacterial proteins.

PSORT-b could not assign a high-probability location to most of the ligases I tried, suggesting that the HMM method used by TIGR's database may be overconfident.  I was also surprised to find that its BLAST search pulled up some NAD-dependent ligases as matches to the ATP-dependent ligase sequences I tried.  I had been thinking that the two families had very dissimilar sequences, but maybe I'm wrong in that.

The possibility that these ATP-dependent ligases act in the cytoplasm is interesting, as the competence-induction of the H. influenzae one may mean that it contributes to the postulated replication-arrest problem rather than to DNA uptake.

That periplasmic ligase

Yesterday I talked to Tom Silhavy about the periplasmic ATP-dependent DNA ligase that's co-induced with H. influenzae DNA-uptake genes (see old blog post here). He hadn't heard of this and was adamant that there is no ATP in the periplasm. So I did some more poking around.

I found papers about bacterial ATP-dependent ligases that function in 'non-homologous end joining' (NHEJ) reactions - these serve as last-resort repair mechanisms for double-strand DNA breaks that can't find a homologous template to use for repair. I emailed the author of a review, asking if the H. influenzae ligase belonged in this category. (He turned out to also be the person who had done the biochemical characterization of the H. influenzae ligase!)

He said that H. influenzae doesn't have the other NHEJ genes Ku and LigD, so it probably can't do NHEJ. I suspect the H. influenzae protein is in a different category of ligase, because a BLAST search with the H. influenzae ligase doesn't find known NHEJ ligases.

He also asked why I think it's targeted to the periplasm. At first I thought he meant, what do I think is the reason it's targeted to the periplasm, so I explained that I don't know. But then I realized he might be asking what is the reason I think it's targeted to the periplasm. I couldn't remember so I looked at it and its homologs using TIGR's HMM (hidden Markov model) location analysis function - this says that the H. influenzae protein and the four homologs I checked (Neisseria, Campylobacter, Shewanella and Thiomicrosomethingorother) all have a high probability of being periplasmic, with a single strong transmembrane domain close to the N-terminus. Tim VanWagoner, who also worked on the H. influenzae gene, also wrote that its Vibrio homolog is predicted to be periplasmic. Tom Silhavy had wondered if the apparent signal sequence might be an annotation error (wrong start site?), but this is very unlikely to be the case for all the homologs, so the odds are very high that these really are periplasmic.

I mentioned to Tom my idea that the ligase might be exported to the periplasm with an ATP already bound (the purified protein has its ATP covalently bound, ready for action). He said that, if that were the case, the protein would have to be exported by the Tat (twin-arginine translocation) system, because that's the only export system that can handle folded proteins. Luckily there's now a TatFind server, so I pasted in the various protein sequences, none of which had a recognizable Tat site in its first 35 aas.

How peculiar... We must be overlooking something important...

outer membrane issues

Later this morning I'll be meeting with Tom Silhavy, who's visiting to give the Microbiology seminar today.  He's an expert on outer membrane biogenesis, so what might I ask him about?

In the context of the development of competence, there's the timing issue.  How long should it take H. influenzae to assemble its DNA-uptake machinery once the genes have been turned on? We traditionally allow 100 minutes from transfer to starvation medium.  The microarray analysis showed that under these conditions gene expression is higher at 30 minutes than at 10 minutes.  Addition of cAMP to non-starved cells induced competence with a peak at 45 minutes.  How much of this time is needed for assembling DNA uptake complexes in the membranes?  What other factors might contribute to a lag?  Should E. coli be faster?

Another issue Tom's interested in is energy sources for periplasmic and outer membrane processes.  He might have some insight into the periplasmic ATP-dependent ligase that's co-induced with H. influenzae competence genes.  Where might it get its energy and what might it be contributing to uptake?

He might also have ideas about the "getting stiff DNA across the outer membrane without a free end" problem.  How flexible is the outer membrane to being pushed around?  And what about the cell wall - is it an obstacle we should worry about?

Should Darwin be an 'ism'?

In a few weeks I'm leading a Cafe Scientifique discussion on the topic "Should Darwin be an 'ism'?" I promised to provide a short abstract, so here goes:

Darwin's place in modern biology is unusually personal. When The Origin of Species was first published, biologists readily accepted the publicly controversial idea that all modern life evolved from simpler organisms. But they were dubious of natural selection's role in adaptation, and 'Darwinism' competed with 'Lamarckism' and then 'Mendelism' until the genetic basis of inheritance became clear in the 1930s. Since then many biologists have invoked Darwin whenever they spoke of natural selection, perhaps to make up for our original skepticism. But creationists are now turning this against us, claiming that evolution is nothing but Darwin-worship. Is it time to push Darwin into the closet?

ECOR strains

Yesterday I was filling in the form to apply for a permit to import pathogenic bacteria, so we could get the 'ECOR' set of E. coli strains. This is a set of 72 different E. coli strains from many different human and animal sources, chosen by Howard Ochman and Bob Selander to represent the diversity of this species.

Since Ochman and Selander's original analysis (1984) the strains have been examined for many different genotypes and phenotypes. The group that maintains the strains has a long list of papers describing work on them, but it only goes to 2001. So just now I did a Google Scholar search for 'ECOR collection' and one of the top hits was a paper by a UBC colleague, Julian Davies, describing these strains' repertoire of antibiotic resistance genes carried on integrons.

So I just emailed Julian. If he already has these strains, we won't have to bother importing them!

Experiments with stalled replication forks?

We think (I think) bacteria turn on their 'competence' genes because they are running out of deoxynucleotides for DNA synthesis. Part of this adaptive response is taking up DNA (an excellent dietary source of deoxynucleotides) and part of it is other changes that help cells cope with problems that arise when DNA replication is interrupted.

If I'm right, then cells with their competence genes already on might be better able to survive interruption of DNA replication. How can we test this? Are there antibiotics that block DNA replication, that can be used to create a transient block and then washed out? What about temperature-sensitive (ts) mutations in DNA replication genes? This might best be done in E. coli, not H. influenzae, because ts mutations don't work well in the latter (it's intrinsically sensitive to minor shifts in temperature). E. coli also has a fine collection of already characterized ts mutations, and we now are able to artificially induce its CRP-S (competence) regulon by putting E. coli sxy on an inducible plasmid.