Field of Science

Can I remember how our USS-evolution model works?

Today we're having the first lab meeting in weeks. (Whenever my turn to organize one came around over the past few weeks I just canceled it, but now I'm starting to see the light at the end of the teaching tunnel.)  We're going to discuss an issue that's arisen in the USS-modeling work being done by an undergrad research assistant, but first I promised to introduce this project.  What can I remember (or rediscover by reading my old blog posts about it)?

The big goal is to simulate how uptake sequences accumulate in genomes of competent bacteria, under the combination of mutation pressure (a randomizing force) and biased uptake preferring fragments containing these sequences.  The model follows a single genome-sized sequence through repeated cycles in which 
  1. Random segments of the genome are treated as if they were fragments in an external DNA pool released by descendants of the 'index genome'.
  2. These fragments are scored for the quality of their match to the ideal uptake sequence, and the best fragment is chosen for the uptake step.
  3. In the conceptual meantime, the index genome itself undergoes random mutation, becoming the descendant index genome.
  4. The chosen fragment's sequence replaces the homologous sequence in the descendant index genome.
  5. This recombinant sequence becomes the new index sequence, and the cycle starts again at step 1.
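The cycle above can be sketched in a few dozen lines of Python. Everything here is invented for illustration - the motif, genome size, fragment length, mutation rate, and number of fragments per cycle are toy values, and the real model's scoring and mutation schemes are certainly more sophisticated. With toy parameters like these, perfect uptake sequences may well fail to accumulate, which is exactly the parameter question we'll be discussing:

```python
# Minimal sketch of the USS-evolution cycle; all parameters are invented.
import random

random.seed(1)

USS = "AAGTGCGGT"    # stand-in for the ideal uptake sequence
GENOME_LEN = 5000
FRAG_LEN = 100
N_FRAGS = 10         # fragments sampled from the DNA pool per cycle
MU = 0.001           # per-base mutation rate per cycle

def random_genome(n):
    return [random.choice("ACGT") for _ in range(n)]

def mutate(genome, mu):
    """Step 3: random mutation of the index genome."""
    return [random.choice("ACGT") if random.random() < mu else b
            for b in genome]

def score(frag):
    """Step 2: best match to the ideal uptake sequence anywhere in the fragment."""
    s = "".join(frag)
    best = 0
    for i in range(len(s) - len(USS) + 1):
        best = max(best, sum(a == b for a, b in zip(s[i:i + len(USS)], USS)))
    return best

def cycle(genome):
    # Step 1: sample random fragments from the genome
    starts = [random.randrange(len(genome) - FRAG_LEN) for _ in range(N_FRAGS)]
    # Step 2: choose the best-scoring fragment for uptake
    best_start = max(starts, key=lambda s: score(genome[s:s + FRAG_LEN]))
    frag = genome[best_start:best_start + FRAG_LEN]
    # Step 3: the index genome mutates in the meantime
    genome = mutate(genome, MU)
    # Steps 4-5: the fragment replaces its homologous segment,
    # and the recombinant becomes the new index genome
    genome[best_start:best_start + FRAG_LEN] = frag
    return genome

def count_perfect(genome):
    s = "".join(genome)
    return sum(s[i:i + len(USS)] == USS for i in range(len(s) - len(USS) + 1))

g = random_genome(GENOME_LEN)
for _ in range(200):
    g = cycle(g)
print(count_perfect(g), "perfect uptake sequences after 200 cycles")
```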
There have been lots of issues to resolve along the way (how the mutation steps maintain the base composition of the sequence, how the uptake sequences are scored), but we finally have a program that runs.  It seems to be working correctly, but the undergraduate who's done most of the work tells me that it isn't causing any uptake sequences to accumulate.  He's quite a sophisticated undergraduate - he has a Biochemistry degree under his belt and has nearly finished a second degree, in computer science - and he's done a lot of statistical analysis to look for the expected accumulation.

I suspect that the problem is that so far the model is using inappropriate parameters (mutation rates too low or too high? uptake bias settings too weak or too fussy? numbers of cycles too short?).  Today's goal is to figure out what these might be.



Research about teaching

It's been so long since I posted that Blogger had a seizure when I clicked on 'New Post'.
I'm still struggling to find a few spare synapses to think about scientific research, but the teaching demands will begin to diminish soon. In the meantime here's a post about the only research I'm currently paying much attention to.

I'm teaching freshman biology to about 380 students, in two sections. With one of our wonderful teaching fellows (supported by UBC's Carl Wieman Science Education Initiative), I'm carrying out an experiment to find out how homework might improve students' understanding of the course material and ability to explain their understanding in writing.

It's been widely assumed, but perhaps never explicitly tested, that doing homework helps students understand course material. In this experiment we're going to examine whether students whose homework required them to formulate their ideas in correctly written sentences, and who got detailed feedback on their errors, perform better on midterm and final exams than students whose homework required only that they recognize correct answers.

The course I teach (BIOL 121, Genetics, Evolution, Ecology) has no tutorials and no TAs, just graders for midterms and finals; it's never had homework. This term we've split the students randomly into two homework groups that both get weekly homework assignments with very similar content but different requirements. Each homework is built around a single theme; more of a 'case study' than a series of unrelated questions. Students are given some information, asked one or two questions, given a bit more information, asked more questions, etc.

Group B's questions are in formats that can be automatically graded by our BlackBoard course management system - mostly multiple-choice questions, with some matching and fill-in-the-blanks. Group A is asked many of the same questions, but they have to think up their own answers and write them out. If the question does not require writing (e.g. has a numerical answer) the students are asked to give a written explanation of their answer.

Group B students can check the grading of their answers through the online system but get no specific feedback about their errors. Group A students get individual feedback about both their writing errors and their content errors. Part of this feedback is a very detailed grading key that gives, for each question, a sample answer, a numbered list of points a good answer should contain and errors that should have been avoided, and often a reference to lecture notes, textbook pages, or other sources of clarifying information. For each student's submission, each answer that did not earn full marks is commented with numbers indicating which problems the answer contained. For example, Writing error A is 'grammar errors', and Content error 4a is 'misinterpreting pedigree symbols or relationships'.

At the end of term we will compare the performance of the two groups on both the midterm and the final exam. Both these assessments include questions with written answers, allowing us to evaluate both students' mastery of course material and their ability to write clearly and correctly. Students were given a pre-quiz at the first class, and some of these questions are repeated on the midterm or final, allowing direct before and after comparison.

To make sure students feel they have been treated fairly, the course grades will be normalized across the two groups before being officially submitted to the Registrar's Office. The group with the lower course mean will have its grades raised to match the mean of the other group. This seems to be having the desired effect. We had expected some students to protest the unequal treatment, complaining either that Group A had to work harder or that Group A was going to learn more, but this hasn't materialized.

Blogging on (30-year-old) peer-reviewed literature

I'm reading a 1979 paper about GTA (Yen, Hu and Marrs, Journal of Molecular Biology 131:157-168).  The authors devised a way to select for mutants that overproduced GTA, and used these to find out more about how the particles are produced and the DNA they contain.

The mutant they focused on produces enough GTA that culture supernatants transfer any one chromosomal marker to about 0.0001 to 0.001 of the cells in a recipient culture.  As each particle contains about 0.001 of the donor chromosome, and recombination of such short fragments is relatively inefficient, this means that the supernatant probably contains about as many particles as there are cells in the recipient culture.  That was probably about 10^9 per ml.
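A back-of-envelope check of that inference: the transfer frequency and the fraction of the chromosome per particle come from the paper's description, but the recombination efficiencies below (10% and 100%) are my assumptions about what "relatively inefficient" might mean.

```python
# Back-of-envelope estimate of particles per recipient cell.
# particles_per_cell = transfer_freq / (fraction_per_particle * recomb_eff)
fraction_per_particle = 0.001          # each particle carries ~0.001 of the chromosome
estimates = []
for transfer_freq in (1e-4, 1e-3):     # observed marker-transfer frequencies
    for recomb_eff in (0.1, 1.0):      # assumed chance an uptaken fragment recombines
        estimates.append(transfer_freq / (fraction_per_particle * recomb_eff))
print("particles per recipient cell:", sorted(estimates))
```

Across these assumptions the estimate ranges from about 0.1 to 10 particles per cell, i.e. on the order of one particle per cell, consistent with the paper's conclusion.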

The mutant grew poorly, and the authors interpreted this as a consequence of increased cell lysis associated with the increased GTA production.  They maintained the strain by growing it in medium they had found to inhibit GTA production (PYE medium), and transferred it to a GTA-inducing medium (RCV) when they wanted GTA.  After this transfer they observed that, after several cell divisions, about 10-20% of the cells died at the same time that high levels of GTA became detectable in the medium.

This is a very provocative result (too bad they don't show any data), because it implies that GTA production is very deleterious.  I'd never heard of either medium; I wanted to find out what's in them, but one of the disadvantages of reading old papers is that they and their references are often not available online.  But simply Googling "PYE RCV" led me to the recipes - apparently they're quite widely used.  PYE is just 0.3% peptone and 0.3% yeast extract, which makes it a slightly more dilute version of our old favourite rich medium LB.  RCV is a defined medium used for R. capsulatus photosynthetic growth; it contains 0.4% malic acid as the only carbon source, 0.1% ammonium sulfate, thiamine, and other salts specified in papers that Springer will show me for $32.

So cells make lots of GTA in rich medium but not in the very poor medium used for photosynthetic growth.  Hmmm...

But searching for the paper with the recipes for these media led me to an even older paper that I need to read first.  This is Marrs 1974, PNAS 71:971-973, and PNAS is online all the way back to the beginning.  So I'll take a break to read this paper, and post on it before continuing.

More about GTA

Yesterday I found time to sit down with my colleague who works on the 'gene transfer agent' (GTA) of Rhodobacter capsulatus. This helped me sort out a few things that are known about this entity, and a few things that aren't.

Is GTA derived from phage? Almost certainly. My colleague's lab's recent work has shown that some of the genes needed for GTA production are homologs of known phage genes. Old work from Barry Marrs' group also showed that supernatants of GTA-producing cultures contain particles that look like tiny tailed phages. However, no GTA- control cultures were examined, and these phage-like particles could be produced from a defective prophage unrelated to GTA.

GTA particles contain chromosomal DNA fragments about 4.5 kb long, but nothing is known about how the DNA comes to be packaged in these particles. This information is critical to understanding how evolutionary processes act on GTA.

Old Cot-curve and restriction analyses were consistent with the fragments being derived from random positions in the chromosome, but the resolution is very poor. The issue could be nicely resolved by isolating DNA from the particles and hybridizing it to Affymetrix chips. Unfortunately my colleague says that getting sufficient GTA particles is quite difficult, as yields are both very low and not very predictable. An attempt to find out whether the ends of the fragments are blunt or staggered was unsuccessful.

From an evolutionary perspective, the most critical missing pieces of information are probably whether GTA is always (or often) accompanied by the death of the producing cell, and whether genes allowing GTA production can be transferred by GTA. That's because, if the genes are anything more than accidents of evolutionary history, they must either enhance the fitness of the cells they are in or spread into new cells faster than they kill their present cells.

If cells can produce GTA without dying, they must have a way to pass the particles out through the cell membranes without destroying them. Some filamentous phages can be secreted by living cells, but I think the tailed phages GTA is thought to resemble escape only by lysing their hosts. The amounts of GTA produced are sufficiently small that this might entail death of only a tiny fraction of the culture.

And if GTA does kill its cells on the way out, GTA could persist over evolutionary time only if it either spread between cells like an infectious agent or greatly increased the fitness of its close kin. Neither of these seems very likely, but I'll post more about this later.



I've now got some old papers to read.

New microarray data

The post-docs have finished the first-pass analysis of how E. coli gene expression is affected by both the E. coli Sxy and the H. influenzae Sxy proteins. I suppose I shouldn't be surprised that it's more complicated than I had hoped. For example, unlike the situation in H. influenzae, in E. coli there are also groups of genes whose expression goes down when Sxy is present.

One complication is that these cells are probably seriously OVER-producing Sxy. Unlike H. influenzae, where we've only done arrays of cells expressing a single-copy sxy gene under its natural promoter, these E. coli studies used a sxy gene on a high-copy plasmid and under a highly inducible promoter. We know that prolonged expression of Sxy from this plasmid produces large quantities of denatured Sxy (in inclusion bodies) and we don't know the extent to which even the 30-minute expression used for the array studies might create a situation unlike that of natural sxy expression.

Thermodynamics of home heating

This isn't a real research post, but my friends/colleagues thought I was wrong when I explained this to them so I want to see what others think.

I live in a condo and my apartment is heated by electric baseboard units (call this heating electricity). Like everyone else I also use electricity to accomplish domestic tasks such as lighting, cooking and refrigeration (call this working electricity).

I argue that the inefficiency with which I use working-electricity processes in my home is irrelevant to my electricity consumption because all the electricity used by these processes ultimately becomes heat. This applies not only to 'wasted' energy such as the heat put out by light bulbs, but to the work I'm using the electricity for, such as the light itself. That's because work becomes heat; for example light becomes heat when it is absorbed by the surfaces it hits. Thus every watt of electricity I use for cooking or lighting sooner or later becomes heat, and as such proportionately reduces the amount of electricity I need to send directly to the heaters. In effect I'm getting my working electricity for free.
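The argument reduces to a simple energy balance: as long as the apartment needs more heat than the working electricity supplies, total consumption is the same no matter how the electricity is split. A sketch with invented numbers:

```python
# Energy balance for an electrically heated apartment.
# Every kWh of 'working' electricity ends up as heat indoors, so the
# baseboard heaters only need to supply the remainder (never less than zero).
def total_electricity(heat_demand_kwh, working_kwh):
    heater_kwh = max(heat_demand_kwh - working_kwh, 0.0)
    return heater_kwh + working_kwh

# A cold day needing 30 kWh of heat:
print(total_electricity(30.0, 0.0))    # -> 30.0 (all heat from baseboards)
print(total_electricity(30.0, 10.0))   # -> 30.0 (the 10 kWh of working use is 'free')
# A warm day needing only 5 kWh: the excess working electricity now costs extra
print(total_electricity(5.0, 10.0))    # -> 10.0
```

The third case is the caveat discussed below: once working electricity exceeds the heat demand, the surplus is no longer free.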

The argument doesn't apply to working energy that gets lost as light out the windows or sound through the walls or hot water down the drain. And maybe not to the energy equivalent of the information I'm transmitting to Blogger with this post, though I suspect that is somewhere between infinitesimal and nonexistent. But it applies to everything that happens in the apartment.

The argument also doesn't apply when the weather is warm enough that any heating needed is less than the heat produced by working electricity. And if the weather ever got hot enough that I used electricity for cooling, I'd be paying double for the energy wasted by my working electricity - e.g. once to run the computer and once to run the air conditioner to get rid of the heat.

And it wouldn't completely apply in winter if I was able to use natural gas or a similarly cheap energy source for heating. But, given that I'm stuck with expensive electrical heating, I console myself with the thought that all the rest of my electricity is free.

So, blogosphere, do you agree?

Gene transfer agent

A colleague's lab has been working on the molecular biology of the 'Gene Transfer Agent' (GTA) of the bacterium Rhodobacter capsulatus. He and I have very different ideas about the evolutionary function of GTA, and we plan to sit down together and work through our disagreements, maybe coming up with a synthesis as a review article. I haven't been paying close attention to GTA, and in this post I'm going to take the first step by summarizing what I think I remember about it (before I go back and read any papers).

The phenomenon: Cultures of R. capsulatus have been known for many years to produce small phage-like particles, each consisting of a protein coat surrounding a 3-4kb fragment of R. capsulatus DNA. These particles can be separated from the source culture and are able to introduce their DNA into other R. capsulatus cells, where it can recombine with the chromosome and change the recipient cell's genotype. The variety of genes that can be transferred suggests that the DNA fragments may be random segments of the source cell's DNA.

I read about GTA when I was in grad school in the 1980s and first becoming interested in the evolution of processes causing gene transfer. I was already coming to the heretical conclusion that bacterial gene transfer by conjugation and transduction occurs as accidental side effects of infectious processes, not because such transfer is beneficial to the cell. At that time only Barry Marrs' lab had worked on GTA. My supervisor, the phage biologist Allan Campbell, thought that GTA was probably produced by a defective prophage. I was working on a cryptic prophage at the time, and this made sense to me. GTA would then be a form of transduction, a side effect of activity of genes whose normal function is to package phage DNA so it can infect new host cells.

The genes: More recently my colleague's lab has identified the R. capsulatus genes responsible for production of GTA and has partially characterized their regulation. As I recall, these genes are in a couple of clusters that do resemble defective prophage but that also have some properties of normal genes. In particular, aspects of the regulation suggest selection for a cellular function. My colleague has also done some analysis of the distribution of the GTA-producing genes, and as I recall this was not consistent with a single acquisition of a defective prophage. He thus interprets his findings as evidence that the ability to transfer genes by GTA is beneficial to R. capsulatus, i.e. that GTA has evolved as a form of bacterial sex.

Questions that I think have not yet been answered, or whose answers I forget: Does the individual cell that produces GTA die, as phage-infected cells normally do? Do only a small fraction of cells in a R. capsulatus culture produce GTA? How many genes are specific to GTA production (have no other function in the cell)? Have phage-derived genes acquired cellular functions independent of GTA production? Does GTA production directly reduce fitness? Can the ability to produce GTA be transferred by GTA? How strong is the phylogenetic evidence?

Next steps: Perhaps we should start our collaboration by working our way through the GTA literature, starting with Barry Marrs' 1974 PNAS paper. This would have the advantage of giving us both the same foundation of facts and factoids (things that look like facts but later turn out to be wrong) to base our discussions on. At the same time we ought to read one or more papers that clarify the evolutionary issues. My "Do bacteria have sex" paper is an obvious choice but shouldn't be the only one.

I'll ask my colleague to read this post, and we can then set up a time for our first meeting and decide what we should read in preparation for it.

Sorry for the paucity of posts

The research side of my brain has been devoured by the looming need to teach introductory biology to 450 freshmen (two sections of 225). Last year at this time I was focusing on grant proposal writing, and so I let my teaching coast on the course preparation I'd done the year before (the first year I taught this course). This year I'm trying to make up for last year's neglect, and my brain is struggling to come up with concept maps and personal response system questions and stimulating homework assignments and lecture content that better matches our new learning objectives and classroom activities suitable for large lectures and ...

But I did spend much of the last couple of days working with one of the post-docs on her manuscript about the competence phenotypes of diverse H. influenzae strains. One of the issues that came up about the Discussion is why our standard lab strain is one of the most competent, rather than being more typical of average strains.

Our initial thought was that perhaps, over more than 50 years of lab culture, descendants of the original isolate had been gradually selected for higher and higher competence in lab transformation experiments. That is, each time a transformation was done, variants that had taken up or recombined more DNA would be enriched in the plate of transformed colonies. But such transformants do not replace the original lab stock; they become new lab strains with new names and new places in the freezer. The original strain has (I think) always been maintained as a frozen stock, with individuals occasionally replacing their depleted vials with a new culture grown from descendants of a previous one. Depending on the culture history in the intervals between thawing the parental stock and freezing a new one, these cells are likely to have been variably but unintentionally selected for improved growth in broth or on agar, or for longer survival after growth had stopped. We have no particular evidence that the ability to take up DNA would have played a significant role in this selection.

But there are other explanations for why the Rd strain is so competent. First, it was not a completely random isolate. The original H. influenzae transformation paper (Leidy and Alexander 1952?) reports testing strains of different serotypes, with Rd being the most competent. Second, if this most-competent isolate had transformed poorly, H. influenzae might not have become the first model organism for studies of competence in gram-negative bacteria.

We'll need to concisely explain this thinking in our Discussion, as a reviewer is likely to raise the issue.

Genespring progress and problems

We did fork out the $$$$ for GeneSpring to analyze our new microarray data, and the post-docs have been hard at work analyzing their E. coli Sxy arrays. It looks like E. coli Sxy and H. influenzae Sxy have overlapping but not identical effects.

It's not clear yet whether these differences are in degree or in kind. That is, using a cutoff of at least a 4-fold effect in at least 3 of the 4 replicate arrays, some genes are turned on (or off) by E. coli Sxy but not by H. influenzae Sxy. But in some cases it may just be that these genes are affected more strongly by E. coli Sxy than by H. influenzae Sxy.
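The cutoff rule can be made concrete. Here's a sketch of how such a filter might be coded; the function name and the example expression ratios are invented, and the real GeneSpring analysis is of course more elaborate:

```python
# Call a gene 'affected' if it shows at least a 4-fold change, in the same
# direction, in at least 3 of the 4 replicate arrays.
# 'ratios' are expression ratios (treated / control), one per replicate.
def affected(ratios, fold=4.0, min_reps=3):
    up = sum(1 for r in ratios if r >= fold)
    down = sum(1 for r in ratios if r <= 1.0 / fold)
    if up >= min_reps:
        return "up"
    if down >= min_reps:
        return "down"
    return None

print(affected([5.1, 4.8, 6.0, 3.2]))   # up >= 4-fold in 3 of 4 replicates
print(affected([0.1, 0.2, 0.15, 0.9]))  # down >= 4-fold in 3 of 4
print(affected([4.5, 0.2, 1.1, 0.9]))   # inconsistent across replicates
```

A gene that passes this filter for E. coli Sxy but not for H. influenzae Sxy could still differ only in degree - e.g. a 3.5-fold effect in the latter would be scored as no effect.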

The postdocs have also had to work hard to get GeneSpring set up. The promised 24/7 tech support appears to have been outsourced, with phone advisors who rely on a set of help files rather than personal experience. Some of its promised functionalities can't be made to work at all, even with the data that comes with the software. We've escalated our complaints, and results are promised, but of course not until after the holidays.

Data on E. coli protein abundances

My previous post complained that our new mass spec data was difficult to interpret, partly because it gives no information about the relative abundances of the proteins it identifies in our Sxy prep. But it occurred to me that useful data about this may be available online.

This Sxy prep was purified from a standard E. coli K-12 strain, probably growing in LB + an antibiotic. So I did some Google searching for "coli protein abundance" and easily found a year-old paper in Nature Biotechnology that compares the abundances of mRNAs and the corresponding proteins for E. coli and also for yeast (full text here). This paper nicely explains the reasons why mass spec alone can't really estimate protein abundance, and then describes a new method of combining data to do this. And a supplementary file provides all of their data on E. coli protein abundance ("APEX" estimates, as molecules per cell) in an Excel spreadsheet!

[Reading this paper also taught me that I was wrong to say in the previous post that peptide composition was calculated from each peptide's molecular weight. Instead each peptide is fragmented and put through a second round of mass spec that directly detects the amino acids it is composed of.]

What can we do with this information? We want to know, among other things, whether CRP is just one of many proteins that the purification procedure used for our Sxy prep fails to completely wash away, or whether our Sxy prep contains CRP because CRP specifically interacts with Sxy and thus co-purifies with it. If the latter, we expect our prep to contain more CRP than would be predicted based on its usual abundance in cells. Thus I think we should check the APEX abundances of all the proteins identified in our sample. If CRP has a much lower APEX value than the other contaminating proteins the mass spec analysis identified, we can suspect that CRP does interact with Sxy.
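The check I have in mind could look something like this. All the abundance values below are invented placeholders (the real ones would come from the paper's supplementary spreadsheet), and the contaminant protein names are just typical abundant E. coli proteins I'd expect to see:

```python
# Hypothetical sketch: compare CRP's APEX abundance (molecules per cell)
# with the other proteins identified in the Sxy prep. Values are invented.
apex = {
    "Sxy": 200,       # our tagged protein; placeholder value
    "CRP": 300,
    "EF-Tu": 100000,  # plausible abundant contaminants (invented values)
    "GroEL": 50000,
    "RplA": 30000,
}
contaminants = {k: v for k, v in apex.items() if k not in ("Sxy", "CRP")}
median_contaminant = sorted(contaminants.values())[len(contaminants) // 2]
ratio = apex["CRP"] / median_contaminant
print(f"CRP is {ratio:.3f}x the median contaminant abundance")
if ratio < 0.1:
    print("CRP is much rarer than the other contaminants, hinting that")
    print("its presence in the prep reflects a specific interaction with Sxy")
```

If the real numbers look like this - CRP far below the abundance of the proteins that merely survive the washes - that would support the co-purification hypothesis.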

Of course, lots of confounding factors are likely to affect the efficiency with which different proteins will be removed by the purification procedure. Unfortunately our 'informed' guesses about these factors are not informed by much. But I think this is still worth a try.

Progress (lack thereof?)

I keep asking myself "Why aren't I getting any science done?". The answer seems to be partly that I've been bogged down with reviewing manuscripts and reading theses, partly that preparing for next term's teaching is starting to loom large, and partly that I've been doing incremental bits of work on various fronts that don't feel as much like research as I would like.

Today I struggled to understand what we could and couldn't learn from the mass-spec analysis of a purified protein. This is a His-tagged E. coli Sxy protein, purified from E. coli cells using a nickel-affinity column. The prep has two odd features. First, in a gel it gives two bands, both close to the size of the expected Sxy-His product and both of roughly equal intensity. We want to find out what distinguishes these two proteins. Second, the prep behaves in band-shift assays as if it also contains a small amount of another protein, CRP. There's not enough CRP to detect in a gel (I forget whether we can detect it with an antibody in Western blots). We hoped the mass spec would tell us whether the prep does indeed contain enough CRP to explain the bandshift results.

Now that I have a better idea what the mass spec analysis does and doesn't do, I see that it can't give us very useful answers.

Here's what it does: All the protein in the prep is first digested into peptides by the protease trypsin. Mass spec analysis of this mixture then determines the exact (to about 7 significant figures) molecular weight of each detectable peptide. The threshold of detection is very low; I think the post-doc who best understands this told me that it's about a femtomole of peptide. Software then calculates the amino acid composition or compositions that could give this molecular weight. (Because different combinations of amino acids can sum to the same weight, several alternative compositions may be possible.)

Other software has analyzed the E. coli proteome, calculating the composition of every peptide that could be produced by trypsin digestion. This database is then compared with the observed peptides in the mass spec sample, to see which observed peptides could have come from which E. coli proteins. If a significant number of observed peptides match a particular E. coli protein, the software reports that that protein was likely present in the sample.
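The matching step can be sketched in simplified form. This is not the real software's algorithm - the protein sequences and 'observed' masses below are invented, the residue-mass table is abbreviated, and real tools use much more careful mass accounting - but it shows the idea: digest each predicted protein in silico (trypsin cuts after K or R, but not before P), compute peptide masses, and count which observed masses each protein can explain.

```python
# Simplified in-silico trypsin digest and mass matching; illustration only.
# Average residue masses (Da), abbreviated to the residues used below.
MASS = {"A": 71.08, "G": 57.05, "K": 128.17, "L": 113.16,
        "P": 97.12, "R": 156.19, "S": 87.08, "V": 99.13}
WATER = 18.02  # added once per peptide

def trypsin_digest(seq):
    """Cut after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and not (i + 1 < len(seq) and seq[i + 1] == "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def mass(peptide):
    return sum(MASS[aa] for aa in peptide) + WATER

def matches(observed, protein, tol=0.02):
    """How many observed masses this protein's tryptic peptides explain."""
    return sum(1 for p in trypsin_digest(protein)
               if any(abs(mass(p) - m) <= tol for m in observed))

proteins = {"protA": "GASKVLPRAGK", "protB": "LLLLKSSSR"}  # invented
observed = [mass("GASK"), mass("VLPR"), mass("AGK")]       # pretend spectrum
for name, seq in proteins.items():
    print(name, matches(observed, seq), "matching peptides")
```

Here protA explains all three observed masses and protB none, so the software would report protA as likely present.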

We have two main problems with the results from our sample. The first is that the mass spec analysis is so sensitive - it identified about 150 proteins in our sample! The second is that the report gives no indication of the relative amounts of the different peptides in the sample - we have no idea which proteins are abundant and which are present only in femtomole amounts. Sxy is one of the 150, which reassures us that we have purified the right protein. So is CRP. Finding that CRP was absent from the prep would have been very significant, because it would have meant that CRP could not be responsible for the bandshifts the prep causes, but finding that it is present doesn't advance things very much. This is largely because we get no information about how much CRP is present, relative to Sxy and to all the other proteins.

We also have some practical problems in interpreting the data. First, the results file is full of hyperlinks, but none of them work (we're 'not authorized'), so we can't tell what we would learn by clicking on them. Second, some of the peptides seem to not match the indicated protein - we don't know if there's a flaw in the software or if we're just misinterpreting the data. So more consultation with the person who does the mass spec analysis is needed.

We had been planning to cut out each of the Sxy-sized bands from a gel and run them separately through the mass spec analysis. But if each of these excised bands is even slightly contaminated with protein from the other, the mass spec will detect them both in both preps. Excising the bands will remove (or at least greatly decrease) most of the contaminating proteins, so the results should be much simpler, but I don't know how much we can learn about the identities of the proteins in these bands, especially if one or both of them differs in sequence from the predicted E. coli proteins.

Luckily the post-docs have lots of ideas for tests that don't rely on mass spec.

We're a winner!

Our lab's web site just won a Judge's Choice award in The Scientist's Laboratory Web Site and Video Awards contest! The judge said that our site "...gives us clues on how the lab sites of the future should look".

"No correlation" can be a result...

This afternoon was my turn to present at lab meeting, so I talked about the results of the uptake sequences-vs-proteomes manuscript. One of the analyses we've done compares the degree of conservation (measured by % identity of BLAST alignment) with the numbers of uptake sequences. I had originally thought this was going to show a strong negative correlation (higher % identity = fewer uptake sequences), consistent with the general pattern that uptake sequences preferentially accumulate in genes lacking strong functional constraint.

But when I saw the graph of the final data I was disappointed, because the sets of genes with no uptake sequences had only slightly higher mean % identities than the sets of genes with several uptake sequences. We haven't done any statistics on these means yet, but it looked like the correlation was weak at best. So I was considering just leaving this analysis out of the manuscript. But the post-doc suggested instead keeping it in, and describing the lack of correlation as an interesting result. That seems like a good idea (though first we need to do the stats - I don't have the raw data so I've emailed my collaborator).
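Once the raw data arrive, the statistic we need is a rank correlation between each gene's % identity and its number of uptake sequences. A sketch with invented numbers (the real values will come from my collaborator) - Spearman's rho is a reasonable choice since we don't expect the relationship to be linear:

```python
# Spearman rank correlation, implemented from scratch for illustration.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):            # average the ranks of tied values
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

pct_identity = [92, 88, 75, 70, 85, 60, 95, 78]   # invented data
uss_counts   = [0,  1,  2,  3,  1,  4,  0,  2]    # invented data
print(f"Spearman rho = {spearman(pct_identity, uss_counts):.2f}")
```

With these made-up numbers rho comes out strongly negative; the interesting question is whether the real data give a rho near zero, which would be the 'no correlation' result worth reporting.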

The same post-doc also reminded me of an analysis I did last summer (link to post). I don't think this result should go in this manuscript, as it has nothing to do with proteomes. But it might fit nicely in the reworked Gibbs-analysis manuscript.

A Correspondence Arising for Nature

Today I submitted our Correspondence Arising on the Diggle et al. paper I posted about a couple of weeks ago. The delay was because Nature asks authors of such submissions to first send them to the authors of the paper in question, and to include the resulting correspondence (i.e. the emails) with the submission. By requiring this step Nature makes sure that there is a genuine and serious issue being raised by the Correspondence, not just a confusion that can be quickly cleared up.

In our case the authors replied promptly, but their response didn't make the problem go away. Instead it confirmed that we had correctly interpreted their descriptions of what they had done, and that they agreed with us on the immediate causes of the results they had observed. Most importantly, it confirmed that we strongly disagree about the significance of the results.

Here's hoping that Nature thinks this issue sufficiently important to publish. If they do, they will contact the authors directly to solicit a formal response to our submission, and will then publish our submission and any response online (but not in the print version). If they don't I expect we'll hear from them within a few days.

Subscription-supported journals are like the qwerty keyboard

Tomorrow afternoon I'm participating with several other faculty in a panel on open access/scholarly communication. It's being organized by our research librarians, who hope this will help them make the best use of their meager resources. I have 10-15 minutes to talk, as do the others, and then we'll be 'participating in discussion groups about these topics with other faculty/librarians'. My theme will be "Why subscription-supported journals are like the qwerty keyboard."

As you probably know, the arrangement of letters on the 'qwerty' keyboard that all our computers come with is far from optimal for efficient typing. The original mechanical typewriters had the keys arranged alphabetically. But this caused levers to jam up if their letters were typed in rapid succession, so a key arrangement was devised that interspersed the commonly used letters with uncommon letters, and split up commonly-used sequences of letters. This was a good solution: although it slowed down the speed at which a skilled typist could hit the keys, it eliminated the time they would otherwise have to spend unjamming the levers. You can read all about this on Wikipedia.

The jammed-levers problem ceased to be an issue with the invention of type-ball ('golf ball') typewriters such as the IBM Selectric, but by then the qwerty keyboard had become standard and there was no market for a more efficient layout. Now everyone uses computers - these of course have no levers to jam, and can quite easily be switched to, for example, the Dvorak simplified keyboard.

But switching the users is a lot harder. We're used to doing our typing the hard way, and unlearning one keyboard and learning another seems so daunting that very few of us ever even try.

Using reader subscriptions to support the cost of scientific publishing is a lot like the qwerty keyboard. The first scientists disseminated their results by sending letters to their colleagues. The cost of disseminating the research (paper, ink and postage) was seen as part of the cost of doing the research.

Later the desire to reach more readers, and to reach readers not known to the author, led to the first scientific journals, published by scientific societies or for-profit publishers and supported by subscription fees paid by the readers. (The formal peer-review component was added later.) A large part of the cost of publishing a journal was physical, and required specialized facilities that only a professional publisher could afford. Because the cost of producing and mailing a paper copy for each subscriber was rightly borne by the person or institution receiving it, it made sense that they should also bear the cost of the editorial process.

As subscription costs rose, university libraries spent more and more of their budgets on journal subscriptions. If a journal's readership was large enough, some of the cost could be paid by advertisers, but the more specialized journals had to cover their full costs from subscriptions. As the publication costs got higher, some journals, especially those that wanted to remain independent of advertisers, introduced 'page charges' to the authors. As subscription fees rose higher and higher, fewer and fewer people could afford them, so publishers began charging individuals much less than the supposedly deep-pocketed institutional libraries. Publisher profits got higher and higher, because there was no competition to hold them in check.

Like the qwerty keyboard, subscription-supported scientific publishing was a solution to a technical problem that no longer exists - how to distribute research to an audience. Now that journals can be published online, the costs of producing and mailing paper copies are gone, and there is no need for massive printing presses. In principle we should be able to go back to the original state, where the dissemination costs are considered part of the cost of doing the research, rather than a price the reader pays for the privilege of access. Instead of paper, ink and postage, these costs are now those of administering peer review, copy editing, and web-site maintenance. But the principle is the same.

But we're tied down by history. Our reputations depend on rankings of the journals we can get our papers into, so we're very reluctant to shift to new ones of dubious reputation. The cost of journal subscriptions (now often electronic rather than paper) is entrenched in university budgets, and we don't want to spend our tight research funds on publication charges just so people we've never met can read our papers.

Are there solutions? One reason for optimism is that changing how we pay the costs of disseminating research is not an all-or-nothing change like switching from qwerty to Dvorak keyboards. Some new open-access journals are very prestigious. Granting agencies are giving strong 'in-principle' support to open access publishing, and my last grant proposal's budget included a hefty amount for open-access publication charges. And libraries are looking for ways to escape the burden of subscription charges.

Why take the risk of writing a research blog?

Dave Ng at The World's Fair (part of the Science Blogs group) has written a post about our research blogs, and Boing Boing has picked it up. So this is a good time to try to answer the obvious question of why we do this. Several comments on Dave's post ask why we take the risk of being scooped. To quote one:
"... isn't there a massive chance of one of her lab members getting scooped to a paper because they aired unpublished results to the world?"
This is the big fear that seems to stop researchers from even considering blogging about their work. But for most labs the risk is not very high, and there are benefits for everyone.

Benefits first. I'm a bit of an idealist about science - I think cooperation is more powerful than competition. NIH thinks so too: if you call them with a new research idea, they don't warn you to keep it under your hat because others are working on similar stuff. Rather, they try to put you in touch with these people to encourage collaboration. Blogging about our ongoing research not only actively promotes interaction with other researchers; it also helps me remember that science should be a community activity.

I also think the risks are overestimated. Although one dramatic scientific stereotype is of research groups competing for glory, in reality very few of us are engaged in fierce competition with groups trying to use the same methods to answer the same questions. If you are in such a competition, blogging about your research might not be a good idea. On the other hand, thinking about blogging might cause you to consider ways you could reduce the competition and promote collaboration instead.

Getting GeneSpring?

The post-docs have generated a lot of E. coli microarray data, so we need to reactivate our long-expired license to the GeneSpring software we use for array analysis. Unfortunately the rep won't return our calls.

GeneSpring has been bought out by Agilent. In the US a one-year license costs about $3300. But that's not the problem. In Canada it costs over $4000, even though our dollars are now at par because the US dollar has fallen against everything! The helpful GeneSpring/Agilent rep in the US tells us that we're forbidden to buy it directly from the US at the US price. But the Canadian rep won't return our calls or emails.

We could:
  1. Buy it online through the US web site, paying the outrageously inflated Canadian price
  2. Wait for the Canadian rep to reply, hoping to be able to negotiate a better price
  3. Call Agilent in the US and complain (to someone higher than the nice rep) about the Canadian rep and price

I think I'll start with 3 because it will make me feel less helpless, and then move on to 1.

Results on the Results

I spent yesterday continuing to sort out the Results section of our paper about how uptake sequences affect proteomes.

Because we've changed the order of topics several times, each time renumbering the new versions of the figures, the data files and figure files are a big mess. For example, data for one figure is in files variously named "Fig. 3", "Fig. 5", "Fig. 6", "altFig5"... you get the picture. The additional complication that my collaborator and I are on different sides of the continent has been mitigated by having a Google-Groups page where we have posted the recent files, albeit under a variety of names and figure-number attributions.

But now I have the Results in what I hope will be their final order. To keep the files straight I've created a folder for each section (Results-A, Results-B, etc) and put the associated data and figure files into it. (Previously I just had one folder for data and another for figures.) I'm hoping that this will let us keep the files together even if we do change the order of the sections.
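That layout can be sketched in a few shell commands; the section letters and file names below are made-up illustrations of the scheme, not the real files:

```shell
# One folder per Results section; each section's data and figure files
# live together inside it. All names here are hypothetical examples.
for section in A B C; do
  mkdir -p "Results-$section"
done

# e.g. a figure's data file sits next to the figure file itself:
touch "Results-A/Fig1-data.txt" "Results-A/Fig1.pdf"
ls Results-A
```

The point of the scheme is that reordering sections later means renaming folders, not hunting for scattered files.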

Today it's checking over the Methods section (written by my collaborator - so should be fine) and the as-yet almost nonexistent Discussion (needs to be written by me).

Back to the USS manuscripts

I'm finally back to working on papers about uptake sequence evolution. Right now it's the analysis of evolutionary interactions between each genome's uptake sequences and its proteome.

While I've been neglecting the manuscript, my bioinformatics collaborator has been generating the final data and, I now discover, suggesting a different and more logical way to order the results. So I'm shuffling the sections around, rewriting the text that links them together and explains why we did each analysis. Well, that's not exactly true. Any scientist will admit that their papers don't always honestly explain the actual reasons why each experiment or analysis was done. That's because scientists often do good experiments for not-very-good reasons, and only later discover the logical thread that links the results together.

And sometimes, like now, we initially don't think to do experiments or analyses, only later realizing the contribution they will make to understanding or explaining other results. The reorganizing I've just done suggested two simple correlations I might look for, which might provide context for interpreting the result I had in mind. So I entered some of my collaborator's data on the tripeptides that uptake sequences specify into a new Excel file, plotted a couple of simple graphs, and presto, new results!

These aren't very important results in themselves. The relative frequencies of tripeptides specified by uptake sequences do correlate modestly (R² = 0.54) with the total frequencies of those tripeptides in their proteomes. And the proportion of tripeptides usable by uptake sequences but not used correlates even more modestly (R² = 0.4) with the tripeptide frequencies in their proteomes. But they provide a context for other results that makes them easier to understand.
