
Sorry for the paucity of posts

The research side of my brain has been devoured by the looming need to teach introductory biology to 450 freshmen (two sections of 225). Last year at this time I was focusing on grant proposal writing, and so I let my teaching coast on the course preparation I'd done the year before (the first year I taught this course). This year I'm trying to make up for last year's neglect, and my brain is struggling to come up with concept maps and personal response system questions and stimulating homework assignments and lecture content that better matches our new learning objectives and classroom activities suitable for large lectures and ...

But I did spend much of the last couple of days working with one of the post-docs on her manuscript about the competence phenotypes of diverse H. influenzae strains. One of the issues that came up in the Discussion is why our standard lab strain is one of the most competent, rather than being closer to the average.

Our initial thought was that perhaps, over more than 50 years of lab culture, descendants of the original isolate had been gradually selected for higher and higher competence in lab transformation experiments. That is, each time a transformation was done, variants that had taken up or recombined more DNA would be enriched in the plate of transformed colonies. But such transformants do not replace the original lab stock; instead they become new lab strains with new names and new places in the freezer. The original strain has (I think) always been maintained as a frozen stock, with individuals occasionally replacing their depleted vials with a new culture grown from descendants of a previous one. Depending on the culture history in the intervals between thawing the parental stock and freezing a new one, these cells are likely to have been variably but unintentionally selected for improved growth in broth or on agar, or for longer survival after growth had stopped. We have no particular evidence that the ability to take up DNA would have played a significant role in this selection.

But there are other explanations for why the Rd strain is so competent. First, it was not a completely random isolate. The original H. influenzae transformation paper (Leidy and Alexander 1952?) reports testing strains of different serotypes, with Rd being the most competent. Second, if this most-competent isolate had transformed poorly, H. influenzae might not have become the first model organism for studies of competence in gram-negative bacteria.

We'll need to explain this thinking concisely in our Discussion, as a reviewer is likely to raise the issue.

Genespring progress and problems

We did fork out the $$$$ for GeneSpring to analyze our new microarray data, and the post-docs have been hard at work analyzing their E. coli Sxy arrays. It looks like E. coli Sxy and H. influenzae Sxy have overlapping but not identical effects.

It's not clear yet whether these differences are in degree or in kind. That is, using a cutoff of at least a 4-fold effect in at least 3 of the 4 replicate arrays, some genes are turned on (or off) by E. coli Sxy but not by H. influenzae Sxy. But in some cases it may just be that these genes are affected more strongly by E. coli Sxy than by H. influenzae Sxy.
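To make the cutoff concrete, here's a minimal sketch in Python of the kind of filter I mean (the gene names and numbers are made up, and it assumes the arrays are expressed as log2 ratios, where a 4-fold effect is |log2 ratio| >= 2; our real filtering happens inside GeneSpring):

import pandas as pd

# Hypothetical example: rows are genes, columns are 4 replicate arrays of
# log2(expression with Sxy / expression without Sxy).
ratios = pd.DataFrame({
    "rep1": [2.5, 0.1, -2.3],
    "rep2": [2.1, 0.3, -2.6],
    "rep3": [2.4, 2.2, -0.5],
    "rep4": [1.9, 0.2, -2.1],
}, index=["geneA", "geneB", "geneC"])

# "At least 4-fold in at least 3 of 4 replicates": |log2 ratio| >= 2
# in at least 3 columns, with a consistent direction (all up or all down).
up   = (ratios >=  2).sum(axis=1) >= 3
down = (ratios <= -2).sum(axis=1) >= 3
affected = ratios[up | down]
print(affected)   # geneA (up) and geneC (down) pass; geneB does not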

The postdocs have also had to work hard to get GeneSpring set up. The promised 24/7 tech support appears to have been outsourced, with phone advisors who rely on a set of help files rather than personal experience. Some of its promised functionalities can't be made to work at all, even with the data that comes with the software. We've escalated our complaints, and results are promised, but of course not until after the holidays.

Data on E. coli protein abundances

My previous post complained that our new mass spec data was difficult to interpret, partly because it gives no information about the relative abundances of the proteins it identifies in our Sxy prep. But it occurred to me that useful data about this may be available online.

This Sxy prep was purified from a standard E. coli K-12 strain, probably growing in LB plus an antibiotic. So I did some Google searching for "coli protein abundance" and easily found a year-old paper in Nature Biotechnology that compares the abundances of mRNAs and the corresponding proteins for E. coli and also for yeast (full text here). This paper nicely explains why mass spec alone can't really estimate protein abundance, and then describes a new method of combining data to do this. And a supplementary file provides all of their data on E. coli protein abundance ("APEX" estimates, as molecules per cell) in an Excel spreadsheet!

[Reading this paper also taught me that I was wrong to say in the previous post that peptide composition was calculated from each peptide's molecular weight. Instead each peptide is digested and put through a second mass spec that directly detects the amino acids it is composed of.]

What can we do with this information? We want to know, among other things, whether CRP is just one of many proteins that the purification procedure used for our Sxy prep fails to completely wash away, or whether our Sxy prep contains CRP because CRP specifically interacts with Sxy and thus co-purifies with it. If the latter, we expect our prep to contain more CRP than would be predicted based on its usual abundance in cells. Thus I think we should check the APEX abundances of all the proteins identified in our sample. If CRP has a much lower APEX value than the other contaminating proteins the mass spec analysis identified, we can suspect that CRP does interact with Sxy.
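If we export the supplementary spreadsheet and our list of mass-spec-identified proteins to simple tables, the comparison could be as simple as this sketch (the file names and column headers here are invented placeholders, not the paper's actual ones):

import pandas as pd

# Hypothetical file/column names; the real APEX spreadsheet's headers would
# need to be substituted.
apex = pd.read_csv("ecoli_apex_abundances.csv")          # protein, apex_molecules_per_cell
hits = pd.read_csv("mass_spec_identified_proteins.csv")  # protein

# APEX abundances for every protein the mass spec identified in the Sxy prep.
merged = hits.merge(apex, on="protein", how="left")
merged = merged.sort_values("apex_molecules_per_cell", ascending=False)

# Where does CRP rank among the ~150 identified (mostly contaminating) proteins?
# A rank near the bottom would hint that CRP is enriched beyond its usual abundance.
ranked = merged.reset_index(drop=True)
print(ranked.head(20))
print("CRP rank:", ranked.index[ranked["protein"] == "CRP"].tolist())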

Of course, lots of confounding factors are likely to affect the efficiency with which different proteins will be removed by the purification procedure. Unfortunately our 'informed' guesses about these factors are not informed by much. But I think this is still worth a try.

Progress (lack thereof?)

I keep asking myself "Why aren't I getting any science done?". The answer seems to be partly that I've been bogged down with reviewing manuscripts and reading theses, partly that preparing for next term's teaching is starting to loom large, and partly that I've been doing incremental bits of work on various fronts that don't feel as much like research as I would like.

Today I struggled to understand what we could and couldn't learn from the mass-spec analysis of a purified protein. This is a His-tagged E. coli Sxy protein, purified from E. coli cells using a nickel-affinity column. The prep has two odd features. First, in a gel it gives two bands, both close to the size of the expected Sxy-His product and both of roughly equal intensity. We want to find out what distinguishes these two proteins. Second, the prep behaves in band-shift assays as if it also contains a small amount of another protein, CRP. There's not enough CRP to detect in a gel (I forget whether we can detect it with an antibody in Western blots). We hoped the mass spec would tell us whether the prep does indeed contain enough CRP to explain the bandshift results.

Now that I have a better idea what the mass spec analysis does and doesn't do, I see that it can't give us very useful answers.

Here's what it does: All the protein in the prep is first digested into peptides by the protease trypsin. Mass spec analysis of this mixture then determines the exact (to about 7 significant figures) molecular weight of each detectable peptide. The threshold of detection is very low; I think the post-doc who best understands this told me that it's about a femtomole of peptide. Software then calculates the amino acid composition or compositions that could give this molecular weight. (Because different combinations of amino acids can sum to the same weight, several alternative compositions may be possible.)

Other software has analyzed the E. coli proteome, calculating the composition of every peptide that could be produced by trypsin digestion. This database is then compared with the observed peptides in the mass spec sample, to see which mass spec peptides could have come from which E. coli proteins. If a significant number of mass spec peptides match a particular E. coli protein, the software reports that that protein was likely present in the sample.
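Mostly for my own benefit, here's a toy version of that matching step (the sequences and observed masses are invented, the residue masses are approximate averages, and real search software is far more sophisticated):

import re

# Approximate average residue masses (Da); one water is added per peptide.
RESIDUE_MASS = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.10, 'C': 103.14, 'L': 113.16, 'I': 113.16, 'N': 114.10,
    'D': 115.09, 'Q': 128.13, 'K': 128.17, 'E': 129.12, 'M': 131.19,
    'H': 137.14, 'F': 147.18, 'R': 156.19, 'Y': 163.18, 'W': 186.21,
}
WATER = 18.02

def tryptic_peptides(seq):
    """Cut after K or R, except when followed by P (the classic trypsin rule)."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]

def peptide_mass(pep):
    return sum(RESIDUE_MASS[aa] for aa in pep) + WATER

def count_matches(observed_masses, proteins, tol=0.5):
    """For each protein, count observed masses matching a predicted tryptic peptide."""
    hits = {}
    for name, seq in proteins.items():
        predicted = [peptide_mass(p) for p in tryptic_peptides(seq)]
        hits[name] = sum(any(abs(m - pm) <= tol for pm in predicted)
                         for m in observed_masses)
    return hits

# Invented inputs, just to show the shape of the comparison.
proteins = {"proteinX": "MKTAYIAKQR", "proteinY": "MNPLDKWSER"}
observed = [peptide_mass("TAYIAK"), 500.0]
print(count_matches(observed, proteins))   # proteinX gets one hit, proteinY none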

We have two main problems with the results from our sample. The first is that the mass spec analysis is so sensitive - it identified about 150 proteins in our sample! The second is that the report gives no indication of the relative amounts of the different peptides in the sample - we have no idea which proteins are abundant and which are present only in femtomole amounts. Sxy is one of the 150, which reassures us that we have purified the right protein. So is CRP. Finding that CRP was absent from the prep would have been very significant, because it would have meant that CRP could not be responsible for the bandshifts the prep causes, but finding that it is present doesn't advance things very much. This is largely because we get no information about how much CRP is present relative to Sxy and to all the other proteins.

We also have some practical problems in interpreting the data. First, the results file is full of hyperlinks, but none of them work (we're 'not authorized'), so we can't tell what we would learn by clicking on them. Second, some of the peptides seem to not match the indicated protein - we don't know if there's a flaw in the software or if we're just misinterpreting the data. So more consultation with the person who does the mass spec analysis is needed.

We had been planning to cut out each of the Sxy-sized bands from a gel and run them separately through the mass spec analysis. But if each of these excised bands is even slightly contaminated with protein from the other, the mass spec will detect them both in both preps. Excising the bands will remove (or at least greatly decrease) most of the contaminating proteins, so the results should be much simpler, but I don't know how much we can learn about the identities of the proteins in these bands, especially if one or both of them differs in sequence from the predicted E. coli proteins.

Luckily the post-docs have lots of ideas for tests that don't rely on mass spec.

We're a winner!

Our lab's web site just won a Judge's Choice award in The Scientist's Laboratory Web Site and Video Awards contest! The judge said that our site "...gives us clues on how the lab sites of the future should look".

"No correlation" can be a result...

This afternoon was my turn to present at lab meeting, so I talked about the results of the uptake sequences-vs-proteomes manuscript. One of the analyses we've done compares the degree of conservation (measured by % identity of BLAST alignment) with the numbers of uptake sequences. I had originally thought this was going to show a strong negative correlation (higher % identity = fewer uptake sequences), consistent with the general pattern that uptake sequences preferentially accumulate in genes lacking strong functional constraint.

But when I saw the graph of the final data I was disappointed, because the sets of genes with no uptake sequences had only slightly higher mean % identities than the sets of genes with several uptake sequences. We haven't done any statistics on these means yet, but it looked like the correlation was weak at best. So I was considering just leaving this analysis out of the manuscript. But the post-doc suggested instead keeping it in, and describing the lack of correlation as an interesting result. That seems like a good idea (though first we need to do the stats - I don't have the raw data so I've emailed my collaborator).
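When I do get the raw data, the stats could be as simple as something like this sketch (the numbers are invented, just to show the shape of the tests we might run on per-gene % identity and uptake-sequence counts):

from scipy import stats

percent_identity = [92, 88, 95, 81, 77, 85, 90, 70, 83, 79]
uptake_sequences = [ 0,  1,  0,  2,  3,  1,  0,  4,  2,  3]

# Spearman rank correlation: does % identity fall as uptake-sequence count rises?
rho, p = stats.spearmanr(uptake_sequences, percent_identity)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

# Or simply compare genes with no uptake sequences to genes with at least one.
none = [pi for pi, n in zip(percent_identity, uptake_sequences) if n == 0]
some = [pi for pi, n in zip(percent_identity, uptake_sequences) if n > 0]
u, p = stats.mannwhitneyu(none, some, alternative="greater")
print(f"Mann-Whitney U = {u}, p = {p:.3f}")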

The same post-doc also reminded me of an analysis I did last summer (link to post). I don't think this result should go in this manuscript, as it has nothing to do with proteomes. But it might fit nicely in the reworked Gibbs-analysis manuscript.

A Correspondence Arising for Nature

Today I submitted our Correspondence Arising on the Diggle et al. paper I posted about a couple of weeks ago. The delay was because Nature asks authors of such submissions to first send them to the authors of the paper in question, and to include the resulting correspondence (i.e. the emails) with the submission. By requiring this step Nature makes sure that there is a genuine and serious issue being raised by the Correspondence, not just a confusion that can be quickly cleared up.

In our case the authors replied promptly, but their response didn't make the problem go away. Instead it confirmed that we had correctly interpreted their descriptions of what they had done, and that they agreed with us on the immediate causes of the results they had observed. Most importantly, it confirmed that we strongly disagree about the significance of the results.

Here's hoping that Nature thinks this issue sufficiently important to publish. If they do, they will contact the authors directly to solicit a formal response to our submission, and will then publish our submission and any response online (but not in the print version). If they don't I expect we'll hear from them within a few days.

Subscription-supported journals are like the qwerty keyboard

Tomorrow afternoon I'm participating with several other faculty in a panel on open access/scholarly communication. It's being organized by our research librarians, who hope this will help them make the best use of their meager resources. I have 10-15 minutes to talk, as do the others, and then we'll be 'participating in discussion groups about these topics with other faculty/librarians'. My theme will be "Why subscription-supported journals are like the qwerty keyboard."

As you probably know, the arrangement of letters on the 'qwerty' keyboard that all our computers come with is far from optimal for efficient typing. The original mechanical typewriters had the keys arranged alphabetically. But this caused levers to jam up if their letters were typed in rapid succession, so a key arrangement was devised that interspersed the commonly used letters with uncommon letters, and split up commonly-used sequences of letters. This was a good solution: although it slowed down the speed at which a skilled typist could hit the keys, it eliminated the time they would otherwise have to spend unjamming the levers. You can read all about this on Wikipedia.

The jammed-levers problem stopped being an issue with the invention of type-ball typewriters such as the IBM Selectric, but by then the qwerty keyboard had become standard and there was no market for a more optimal keyboard. Now everyone uses computers - these of course have no levers to jam, and can quite easily be switched to, for example, the Dvorak simplified keyboard.

But switching the users is a lot harder. We're used to doing our typing the hard way, and unlearning one keyboard and learning another seems so daunting that very few of us ever even try.

Using reader subscriptions to support the cost of scientific publishing is a lot like the qwerty keyboard. The first scientists disseminated their results by sending letters to their colleagues. The cost of disseminating the research (paper, ink and postage) was seen as part of the cost of doing the research.

Later the desire to reach more readers, and to reach readers not known to the author, led to the first scientific journals, published by scientific societies or for-profit publishers and supported by subscription fees paid by the readers. (The formal peer-review component was added later.) A large part of the cost of publishing a journal was physical, and required specialized facilities that only a professional publisher could afford. Because the cost of producing and mailing a paper copy for each subscriber was rightly borne by the person or institution receiving it, it made sense that they should also bear the cost of the editorial process.

As subscription costs rose, university libraries spent more and more of their budgets on journal subscriptions. If a journal's readership was large enough, some of the cost could be paid by advertisers, but the more specialized journals had to cover their full costs from subscriptions. As the publication costs got higher, some journals, especially those that wanted to remain independent of advertisers, introduced 'page charges' to the authors. As subscription fees rose higher and higher, fewer and fewer people could afford them, so publishers began charging individuals much less than the supposedly deep-pocketed institutional libraries. Publisher profits got higher and higher, because there was no competition to hold them in check.

Like the qwerty keyboard, subscription-supported scientific publishing was a solution to a technical problem that no longer exists - how to distribute research to an audience. Now that journals can be published online, the costs of producing and mailing paper copies are gone, and there is no need for massive printing presses. In principle we should be able to go back to the original state, where the dissemination costs are considered part of the cost of doing the research, rather than a price the reader pays for the privilege of access. Instead of paper, ink and postage, these costs are now those of administering peer review, copy editing, and web-site maintenance. But the principle is the same.

But we're tied down by history. Our reputations depend on rankings of the journals we can get our papers into, so we're very reluctant to shift to new ones of dubious reputation. The cost of journal subscriptions (now often electronic rather than paper) is entrenched in university budgets, and we don't want to spend our tight research funds on publication charges just so people we've never met can read our papers.

Are there solutions? One reason for optimism is that changing how we pay the costs of disseminating research is not an all-or-nothing change like switching from qwerty to Dvorak keyboards. Some new open-access journals are very prestigious. Granting agencies are giving strong 'in-principle' support to open access publishing, and my last grant proposal's budget included a hefty amount for open-access publication charges. And libraries are looking for ways to escape the burden of subscription charges.

Why take the risk of writing a research blog?

Dave Ng at The World's Fair (part of the Science Blogs group) has written a post about our research blogs, and Boing Boing has picked it up. So this is a good time to try to answer the obvious question of why we do this. Several comments on Dave's post ask why we take the risk of being scooped. To quote one:
"... isn't there a massive chance of one of her lab members getting scooped to a paper because they aired unpublished results to the world?"
This is the big fear that seems to stop researchers from even considering blogging about their work. But for most labs the risk is not very high, and there are benefits for everyone.

Benefits first. I'm a bit of an idealist about science - I think cooperation is more powerful than competition. NIH thinks so too - if you call them with a new research idea, they don't warn you to keep it under your hat because others are working on similar stuff. Rather, they try to put you in touch with these people to encourage collaboration. Blogging about our ongoing research not only actively promotes interaction with other researchers, it also helps me remember that science should be a community activity.

I also think the risks are overestimated. Although one dramatic scientific stereotype is of research groups competing for glory, in reality very few of us are engaged in fierce competition with groups trying to use the same methods to answer the same questions. If you are in such a competition, blogging about your research might not be a good idea. On the other hand, thinking about blogging might cause you to consider ways you could reduce the competition and promote collaboration instead.

Getting GeneSpring?

The post-docs have generated a lot of E. coli microarray data, so we need to reactivate our long-expired license to the GeneSpring software we use for array analysis. Unfortunately the rep won't return our calls.

GeneSpring has been bought out by Agilent. In the US a one-year license costs about $3300. But that's not the problem. In Canada it costs over $4000, even though our dollars are now at par because the US dollar has fallen against everything! The helpful GeneSpring/Agilent rep in the US tells us that we're forbidden to buy it directly from the US at the US price. But the Canadian rep won't return our calls or emails.

We could: 1. Buy it online through the US web site, paying the outrageously inflated Canadian price; 2. Wait for the Canadian rep to reply, hoping to be able to negotiate a better price; 3. Call Agilent in the US and complain (to someone higher than the nice rep) about the Canadian rep and price.

I think I'll start with 3 because it will make me feel less helpless, and then move on to 1.

Results on the Results

I spent yesterday continuing to sort out the Results section of our paper about how uptake sequences affect proteomes.

Because we've changed the order of topics several times, each time renumbering the new versions of the figures, the data files and figure files are a big mess. For example, data for one figure is in files variously named "Fig. 3", "Fig. 5", "Fig. 6", "altFig5" ... you get the picture. The additional complication that I and my collaborator are on different sides of the continent has been mitigated by having a Google-Groups page where we have posted the recent files, albeit under a variety of names and figure-number attributions.

But now I have the Results in what I hope will be their final order. To keep the files straight I've created a folder for each section (Results-A, Results-B, etc) and put the associated data and figure files into it. (Previously I just had one folder for data and another for figures.) I'm hoping that this will let us keep the files together even if we do change the order of the sections.

Today it's checking over the Methods section (written by my collaborator - so should be fine) and the as-yet almost nonexistent Discussion (needs to be written by me).

Back to the USS manuscripts

I'm finally back to working on papers about uptake sequence evolution. Right now it's the analysis of evolutionary interactions between each genome's uptake sequences and its proteome.

While I've been neglecting the manuscript, my bioinformatics collaborator has been generating the final data and, I now discover, suggesting a different and more logical way to order the results. So I'm shuffling the sections around, rewriting the text that links them together and explains why we did each analysis. Well, that's not exactly true. Any scientist will admit that their papers don't always honestly explain the actual reasons why each experiment or analysis was done. That's because scientists often do good experiments for not-very-good reasons, and only later discover the logical thread that links the results together.

And sometimes, like now, we initially don't think to do experiments or analyses, only later realizing the contribution they will make to understanding or explaining other results. The reorganizing I've just done suggested two simple correlations I might look for, which might provide context for interpreting the result I had in mind. So I entered some of my collaborator's data on the tripeptides that uptake sequences specify into a new Excel file, plotted a couple of simple graphs, and presto, new results!

These aren't very important results in themselves. The relative frequencies of tripeptides specified by uptake sequences do correlate modestly (R2 = 0.54) with the total frequencies of those tripeptides in their proteomes. And the proportion of tripeptides usable by uptake sequences but not used correlates even more modestly (R2 = 0.4) with the tripeptide frequencies in their proteomes. But they provide a context for other results that makes them easier to understand.
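For the record, R2 values like these come from an ordinary linear regression of one set of frequencies on the other; here's a toy version with invented numbers (the real values are my collaborator's tripeptide frequencies, entered into Excel):

from scipy import stats

uss_tripeptide_freq      = [0.012, 0.008, 0.031, 0.005, 0.019, 0.002]
proteome_tripeptide_freq = [0.010, 0.011, 0.024, 0.004, 0.015, 0.006]

# Regress uptake-sequence tripeptide frequencies on proteome frequencies
# and report the squared correlation coefficient.
slope, intercept, r, p, se = stats.linregress(proteome_tripeptide_freq,
                                              uss_tripeptide_freq)
print(f"R2 = {r**2:.2f}, p = {p:.3f}")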
