A while back I wrote a post interpreting the messy results of some microarray analysis. The analysis was looking for effects of treating H. influenzae cells with very low concentrations of two antibiotics (rifampicin and erythromycin). These concentrations do not detectably inhibit cell growth, but rifampicin appeared to be significantly increasing the expression of some genes.
Our criterion for 'maybe significant' was that the gene should be increased at least two-fold in all three of the arrays we examined. The collaborator's technician has now gone back into the data to see if related genes are also induced, at levels just too low to meet our cutoff. Her results confirm that the antibiotic affects a whole class of genes. Our collaborator tells me that an effect on these genes is unprecedented, so on we go.....
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
Gibbs searches almost done
Not all the Gibbs motif searches I'd queue'd ran (of course). More mysterious symbols unexpectedly hidden in sequence files, and one set of sequence files turned out to be much too long. I think this latter weirdness arose because the "splitsequence.pl" program we wrote doesn't overwrite a previous version of the specified output file (as I had assumed) but instead just adds the new sequence to whatever was in the previous version. Now I think about it, I should have realized this...
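The append-versus-overwrite difference is easy to reproduce. Here is a minimal Python sketch (not the actual splitsequence.pl, which is Perl; the function names and file names are made up for illustration) that runs the same two-step write with each file mode:

```python
import os
import tempfile


def write_piece(path, sequence, overwrite=True):
    """Write one sequence to path; mode 'w' truncates first, mode 'a' appends."""
    mode = "w" if overwrite else "a"
    with open(path, mode) as handle:
        handle.write(sequence + "\n")


def demo():
    """Write twice to the same output name in each mode and count the lines left."""
    counts = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, overwrite in (("append", False), ("overwrite", True)):
            path = os.path.join(tmp, label + ".txt")
            write_piece(path, "AAGTGCGGT", overwrite)  # first run
            write_piece(path, "ACAAGCGGT", overwrite)  # second run, same output name
            with open(path) as handle:
                counts[label] = len(handle.read().splitlines())
    return counts
```

In append mode the file ends up holding both sequences, which would explain output files that keep growing each time the same output name is reused.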
And for some reason the searches on the forward strand of the H. somnus sequence had a very hard time finding the USS pattern, even though the core motif should be obvious. But the reverse-strand searches didn't have much problem, and both forward and reverse searches found about the same number of sites. I guess this is due to some minor difference in the sites, something that is just enough to tip the Gibbs program into not finding ANY motifs in 100 searches.
But the last few are queue'd, and I've analyzed almost all of the ones that are finished.
Onward with Gibbs motif searches
I spent much of Saturday trapped in the swamp of trying to get files correctly formatted for the Gibbs searches. Files contained mysterious invisible characters that the Gibbs program refused to touch but that text editors could neither see nor delete. (And no, it wasn't that the carriage returns were of the nasty Mac type.) FASTA-formatted sequences had ">" characters hidden in what should have been plain DNA sequence. etc. etc. Sequence 'mask' instructions that had seemed foolproof didn't behave as they had a few days ago.
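One way to hunt for characters like these is to scan the file and report anything in a sequence line that isn't a plain base. This is a minimal Python sketch under my own assumptions (the function name and the allowed alphabet are illustrative; it is not part of the Gibbs software):

```python
def find_odd_characters(fasta_text, allowed="ACGTNacgtn"):
    """Return (line_number, column, repr(char)) for every unexpected character.

    Header lines (those starting with '>') are skipped, so a '>' is only
    reported when it turns up inside what should be plain sequence.
    """
    problems = []
    # Split on '\n' only, so a stray carriage return stays visible as a character.
    for line_no, line in enumerate(fasta_text.split("\n"), start=1):
        if line.startswith(">"):
            continue
        for col, char in enumerate(line, start=1):
            if char not in allowed:
                problems.append((line_no, col, repr(char)))
    return problems
```

Using `repr` makes invisible characters show up as escape codes, which is the information a text editor won't give you.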
But finally I was able to queue up all the runs of the forward sequences of the Neisseria genomes (including a newly available sequence for N. lactamica) and of the three genomes with Apl-type USSs. And to find the errors in my queue instructions and queue them again.... And queue them again so I'd have duplicate runs. And to get results which were, all but one, USS motifs (and to re-queue that one). And to clean up the sequences and get the logos (thanks, WebLogo!). AND, to see that the results are just different enough to be interesting.
Doing the exact same analysis on the reverse complements of these sequences will give me an independent dataset. I found a website that not only converts FASTA sequences into their reverse complements but can handle whole genomes in one go (thanks, BMC!). And now, I've queued (in duplicate) all the corresponding reverse complement searches.
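For anyone who would rather not rely on a website, reverse-complementing a FASTA file takes only a few lines. This is a rough Python sketch, assuming plain A/C/G/T/N sequence; the function names and the header suffix are my own choices, not the website's behaviour:

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_fasta(fasta_text):
    """Reverse-complement every record in a FASTA string, keeping the headers."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return "\n".join(
        f"{h} reverse complement\n{reverse_complement(s)}" for h, s in records
    )
```

Each record is joined into one string before reversing, so sequence that was wrapped over several lines comes out in the right order.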
Steps forward - Gibbs
No word from the helpful Gibbs expert who offered to test my file with the new 'centroid' motif-finding software to find out why it ran so much slower than the previous version. I suspect he wasn't able to find a solution.
So today I'm going to collect the genome sequences from the other Pasteurellaceae and queue up searches using the old Gibbs motif sampler. I'll specify to the queueing program (on the shared computer cluster) that they might take 2 days, and make sure to ask for enough memory, so that the runs aren't terminated by the system before the Gibbs program has completed searching and analyzing the results. Because the Gibbs program is stochastic, I'd better do the runs at least in duplicate.
What RecA has to do with it
I spent parts of yesterday happily responding to requests for information from a colleague supportive of our ideas, who was deep in an email discussion with another who thinks we're dead wrong. This had me digging through the Bacillus subtilis literature looking for information about the competence phenotypes of recA mutants.
The other colleague had expressed amazement that anyone could ever think that cells take up DNA for anything other than recombination, whereas we think they take it up for food. The idea that bacteria take up DNA for the (assumed*) evolutionary benefits of changing their genes is widely believed by microbiologists. (I say 'believed' because, like religious believers, they are very reluctant to consider alternative explanations.) I've posted about why this is an important issue, and why we think the way we do, here.
As is typical for believers in "competence is for sex", he supports his conclusion by one piece of information. In his case it was that the B. subtilis recE gene is one of the genes induced in competent cells. RecE is a homolog of E. coli's recA, and functions primarily as a DNA repair protein by both detecting DNA damage and contributing to its repair by a strand-exchange process. RecA (and RecE) can also perform strand exchange on DNA strands brought into the cell by direct DNA uptake or by plasmids or phages; in its absence competent cells cannot be transformed by the DNA they take up.
In H. influenzae we know that the RecA homolog makes no contribution to DNA uptake - in recA mutants all the DNA that cells take up gets into the cytoplasm, where it is degraded by resident nucleases. We know this because if we use radioactively labelled DNA we see the label randomly incorporated into the cell's chromosome.
I was trying to find direct evidence that the same is true for B. subtilis. It's unlikely that anyone would bother to do this experiment now that we know a lot about RecA. The experiment would either have been done many years ago, when researchers were still uncertain about what RecA does, or done as a control for some other experiment. I couldn't find any experiment where researchers had looked for chromosomal incorporation of label in a B. subtilis recE mutant, but I've emailed the person most likely to have done this to ask if he can help**.
But I did find other evidence that B. subtilis recE mutants have normal DNA uptake, from a genetic analysis that's independent of recombination. Wildtype B. subtilis can be transformed with circular plasmids, which can replicate in the cytoplasm without having recombined with the chromosome. But like H. influenzae, B. subtilis only brings linear single-stranded DNA into the cytoplasm, and linear strands are degraded if they can't recombine. B. subtilis cells can nevertheless take up plasmids that start out circular, because they cut the DNA before taking it up. If the plasmid circles are monomers, the resulting linear strands are degraded in the cytoplasm. But if the circles are not monomers but contain two or more copies of the plasmid sequence, the strands from different molecules can efficiently reassociate in the cytoplasm to produce viable circles (sorry if this explanation isn't clear enough; I don't have time to do it deeply). In B. subtilis recE mutants, transformation with plasmid preps containing such multimers is normal, confirming that DNA uptake is normal.
*“Allow people to make assumptions and they will come away absolutely convinced that assumption was correct and that it represents fact.” Randi the Magician, quoted in a recent New York Times article
**The B. subtilis expert emailed the other participants (but didn't reply to me!) to confirm that yes, in recA mutants the DNA is degraded inside the cell. He also expressed strong disagreement with our hypothesis that cells take up DNA for the nucleotides, claiming that these couldn't possibly make a significant contribution to the cell's nutrient supply.
One step forward, two steps back?
Most of yesterday's lab meeting was spent presenting and discussing the analysis of USS-imposed coding constraints. In doing this I discovered that I still don't really understand what's been done or what it means.
I think this is partly because I didn't do most of the analysis myself. The problem is compounded because I did design some of the analyses I thought should be done, but my collaborator's expertise led her to do somewhat different analyses, and I'm not sure which are good parallels to what I had originally planned, and which are improvements on my initially-flawed plans.
So getting this paper out is going to take more than just polishing the draft manuscript. The first step is to dig deeply into the analyses (with a flurry of emails to the collaborator asking for clarification) so that I do understand exactly what has been done. Only then will I be able to decide if we have what we need.
Not conservation but convergence
Yesterday was my turn to do lab meeting, and I began by talking about the Gibbs searches I'm using to characterize the variation of the USS motifs in different species.
I made a point about how USS have usually been studied 'typologically', as if there was one 'real' USS (in H. influenzae the core AAGTGCGGT) to be identified and characterized. I drew an analogy with how biologists used to think about species (pre-Darwin) and said that we now needed to make the transition to population-based thinking about USSs, treating the variation as an essential component of the phenomenon. That's what the Gibbs motif searches help us do.
But this morning I realized that population thinking isn't going far enough. The different USSs in a genome are not a population in the sense that the members of a species are. Rather, each USS site evolves independently. They come to be similar because all of them are subject to the same evolutionary force - the molecular drive created by biased DNA uptake. This means that the different USS sites in the genome are only similar because of convergence.
Biologists are used to invoking evolutionary 'conservation' (preservation of similarities inherited from a common ancestor) whenever aligned sequences have more similarity than can be explained by chance. So, for example, we say that two recA genes are more conserved than the sequences that flank them.
I've been fighting the tendency (mine and others') to refer to conserved similarities when describing USS similarities, because I know that distinct sites do not have common ancestry. Maybe now I can just substitute 'converged' for 'conserved' in all the usual statements about similarities.
Bad luck but high hopes
The sxy manuscript has been rejected! This was not what we expected, because it had been 'provisionally accepted', with the request that we do toeprinting or similar analyses to confirm our speculations that the secondary structure of sxy mRNA interferes with translation.
We didn't do the requested toeprinting, but instead used in vitro translation assays to confirm our speculations. We were a bit concerned that the editor/reviewers might not see these as an adequate substitute, but that turned out to not be a problem. But a combination of bad luck, bad judgement and bad mood led to it being rejected as not sufficiently improved for publication.
We still think this is an excellent piece of science, so we're going to instead submit it to another journal that's just as good, if not better. We're first giving the manuscript one more quick polish, taking into account all the reasonable and unreasonable comments from referees and making sure that all our explanations are as clear as possible. This should be done within the next couple of days. The new journal has a very fast turn-around time, so we'll know the outcome soon.
The collaboration that refuses to die
This morning I sat down with our collaborator on the sub-inhibitory antibiotic project. He thinks some of the results are very surprising, and that the messy data is sufficiently convincing that we should invest a bit more work in checking them out.
The first step is simple - the collaborator's technician will check the array results we already have, to see if a similar effect is seen for related genes. If it's not, we stop. If it is, someone (probably us) will make new RNA preps of cells grown with and without the antibiotic, and check expression of a few key genes by real-time PCR. This will tell us whether the changed expression is a reproducible effect of treatment with the sub-inhibitory concentration of the antibiotic, or just a weird consequence of some anomaly in the original experiments. If it's not reproducible, we stop. If it is, we decide whether or not to go on.
Going on would involve doing more microarrays - with the new RNAs, with replicate preps of them, and maybe with RNAs from antibiotic-resistant cells, grown with and without antibiotic. More real-time PCR assays would probably also be needed. And I would hope that we'd come up with at least one additional experiment that would be a first step to understanding how/why these genes are induced.
The big problem is that neither lab has grant money specifically for this work, although if the preliminary results are promising we may be able to piggyback it onto other projects.
Whence Gibbs?
I successfully worked out how to command the Gibbs Motif Sampler to analyze the new genome sequences. I've only done it for two of them, because a better option has appeared.
A new version of the Gibbs motif software is available. It gives the option of using a 'centroid' sampling method that combines the best sites found in different runs (runs initiated with different random-number seeds), rather than simply taking all the sites identified in the run that had the best score. This has the big advantage of eliminating most of the weakly-matched 'false positive' sites.
It took me a few days to work out how to get it running on the computer cluster (the helpful administrator reset some permissions for me). The new release includes a version that runs in the Mac terminal, and I now have that working too. But it didn't take long to discover that it runs about 100-fold (no, I'm not exaggerating) slower than the usual (non-centroid) version. This means that a good run analyzing a whole genome would take several weeks (or more?); getting rid of the false positives isn't worth that big an investment.
But the very helpful Gibbs expert has again offered to help - he says the centroid version shouldn't be slower at all. So I've sent him the test file I've been using (2% of the genome) plus examples of the output I get. He's going to see if he can find the problem and fix it.
In the paper I'm working on, we'll be comparing the USS motifs of various species. But of course there is no one true Gibbs motif, as the results depend on both random factors and ones I control. I don't see the randomness as a concern. It arises from both the random events in the history of the sequenced genome and the random-number seed that each Gibbs run starts with. The effectiveness of the searches, and the high numbers of USSs in the genomes, mean that the randomness isn't a big issue.
But factors I control can have a big effect on the results of a search. Probably most important is the specification of an 'expected' number of occurrences of the motif. If I set this low, the search will be very stringent, reporting only occurrences that are very well matched to the motif it's found. If I set it high many poorer matches will be reported. There's no 'right' setting, because there's no 'right' USS.
In order to compare USS motifs between genomes I need to have done the searches with comparable stringencies. The simple method I'll try is to use 'expected numbers' that are 1.5 times the number of perfect matches to the standard 'core' consensus. The identification of 'core' is somewhat arbitrary and historically contingent, but using it lets me treat all the genomes thought to have the same consensus in the same way. So for H. influenzae, H. somnus, Pasteurella multocida, Actinobacillus actinomycetemcomitans and Mannheimia succiniciproducens I'll use 1.5 X the number of occurrences of AAGTGCGGT, for H. ducreyi, A. pleuropneumoniae and M. haemolytica I'll use 1.5 X the number of occurrences of ACAAGCGGT, and for the Neisserias I'll use 1.5 X the number of occurrences of ATGCCGTCTGAA.
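The arithmetic behind the 'expected number' can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, perfect matches are counted in both orientations of whatever sequence is supplied (my choice for the sketch), and the result is left unrounded because I haven't said how it should be rounded:

```python
def count_core(genome, core):
    """Count perfect matches to core on the given strand, allowing overlaps."""
    genome, core = genome.upper(), core.upper()
    count, start = 0, genome.find(core)
    while start != -1:
        count += 1
        start = genome.find(core, start + 1)
    return count


def expected_sites(genome, core, factor=1.5):
    """Return factor x the number of perfect core matches in either orientation."""
    complement = str.maketrans("ACGT", "TGCA")
    rc_core = core.upper().translate(complement)[::-1]
    total = count_core(genome, core)
    if rc_core != core.upper():  # avoid double-counting a palindromic core
        total += count_core(genome, rc_core)
    return factor * total
```

For a toy sequence with two forward copies and one reverse-orientation copy of AAGTGCGGT, `expected_sites` gives 1.5 x 3 = 4.5.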
The Gibbs searches I queued two days ago were terminated that night because I'd forgotten to set the memory allocation high enough. I re-queue'd them yesterday with more memory requested. The A. pleuropneumoniae run was terminated again last night, I think because the long genome and long motif put too big a demand on the program, so I've separated the 'forward' and 'reverse complement' sequences and requeue'd them as two separate jobs. The Neisseria meningitidis run is still going; I hope it doesn't run out of allocated time before finishing.
Lots of Gibbs search progress
Yesterday I worked out a way to nudge the Gibbs motif sampler into finding the Neisseria meningitidis DUS (their term for their uptake signal sequence). Even though the DUS is present in Neisserial genomes even more frequently than the H. influenzae USS is in its genome, the sampler couldn't find it without prompting. This may be because it's much shorter than the USS (only 12 contiguous bp vs 22 bp spread over 29 positions), or for some other reason I don't understand.
I didn't want to give the sampler a prior file specifying the pattern to look for, so instead I added two lines of fake sequence with a very high frequency of the DUS to the start of the genome file. This 'seed' was enough to get the sampler started on the right motif. Once it's started it has no trouble finding the DUS, and I can later delete the seeded DUSs from the list it generates.
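The seeding trick can be sketched like this in Python. Everything specific here is an assumption for illustration: the spacer sequence, the number of copies, the 80-character line width and the record name are invented, and the seed is written as its own FASTA record so the seeded sites are easy to recognise and delete afterwards:

```python
def make_seed_record(motif, copies=8, spacer="TACGATCG", name="seed_FAKE", width=80):
    """Build a FASTA record of fake sequence densely packed with the motif."""
    fake = (motif + spacer) * copies
    lines = [fake[i:i + width] for i in range(0, len(fake), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def prepend_seed(genome_fasta, motif, **kwargs):
    """Return the genome FASTA text with the seed record placed at the start."""
    return make_seed_record(motif, **kwargs) + genome_fasta
```

With the 12 bp DUS and these defaults the seed comes out as two 80-base lines holding eight copies of the motif.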
This morning I obtained the A. pleuropneumoniae genome sequence my collaborators have been working with, split it into pieces, and generated reverse complements of both it and the N. meningitidis genome, and combined each genome's forward and reverse-complement sequences into single 'F+RC' files for searching. I did this because I need to have the sampler search both strands, and (I think) I have better control if I tell it to search just the sequence I've given it. The A. pleuropneumoniae USS is very similar to but even longer than the H. influenzae USS, so I did test runs with the 'prior' masking file I'd used for H. influenzae to make sure everything worked.
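Building an 'F+RC' file amounts to splitting the forward sequence into pieces, doing the same for its reverse complement, and writing both sets of pieces into one FASTA file. A minimal Python sketch, assuming fixed-length pieces and header names of my own invention (the real piece size and naming aren't specified here):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def split_into_pieces(sequence, piece_length):
    """Cut a sequence into consecutive pieces; the last one may be shorter."""
    return [sequence[i:i + piece_length] for i in range(0, len(sequence), piece_length)]


def forward_plus_rc_fasta(name, sequence, piece_length):
    """Return FASTA text holding the forward pieces followed by the RC pieces."""
    records = []
    for i, piece in enumerate(split_into_pieces(sequence, piece_length), start=1):
        records.append((f"{name}_F_{i}", piece))
    rc = sequence.translate(COMPLEMENT)[::-1]
    for i, piece in enumerate(split_into_pieces(rc, piece_length), start=1):
        records.append((f"{name}_RC_{i}", piece))
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Because both orientations are present as ordinary sequence, the sampler can then be told to search only the strand it is given.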
I did this and all my other tests using only 10% of the genome and only one orientation, because I wanted them to run very fast and because the guys who manage the computer cluster want all long runs to be entered through their 'Fair Share' queueing system. And now I've successfully queue'd requests for full-genome searches. I don't expect to get the results until tonight or tomorrow.
I also emailed my collaborators to let them know I'm finally back working on this project. The PI is on vacation, but the bioinformatician has been taking advantage of his absence to work full time on it! She's going to send me her new data and rewrite in a few days, so I'm not going to do any work on the manuscript until then. I could go ahead and do Gibbs analysis of all the genomes we might want to consider, but I think I should wait to see how the three main foci of our work (H. influenzae, A. pleuropneumoniae and N. meningitidis) fit into the manuscript.
What are DprA and ComM's real jobs?
When H. influenzae cells take up DNA into their cytoplasm, the DNA may be degraded to nucleotides or, if a chromosomal sequence has enough similarity, may recombine with it. Three H. influenzae proteins are known to be needed for the recombination but not the uptake. If H. influenzae and other bacteria take up DNA only for the nucleotides, why would they encode proteins to help it recombine? Does the existence of these proteins mean that the recombination has been selected for?
We understand why one of the proteins exists. RecA's real job is DNA repair, and its action on incoming DNA is just a side effect of this. RecA has evolved to both detect DNA damage and use recombination to repair gaps arising from replication of damaged DNA. RecA is not induced in competent cells, and everything it does that promotes recombination with foreign DNA is most parsimoniously explained as side effects of its selected action on damaged DNA.
The other two proteins, ComM and DprA, are harder to explain. Both have been shown in other competent bacteria to limit the degradation of the DNA that has been taken up by competent cells, and this has been interpreted as an adaptation to promote recombination. The H. influenzae homologs of both have good CRP-S sites in their promoters and are strongly induced in competent cells. This means that they must be beneficial under the conditions that induce DNA uptake. In a previous post I put forward the idea that these conditions induce two kinds of processes, DNA uptake and stabilization of stalled replication forks, and that DprA and ComM have evolved because of roles in this latter function.
While trying to clean up my desk I came across a table from a 2004 paper on the AAA+ class of ATP-binding proteins, showing alignments of ComM with related proteins. This reminded me that both ComM and DprA are widely distributed in bacteria, including many species not known to take up DNA. This distribution suggests that these proteins may indeed have an important cellular function that's independent of their effect on recombination.
So I just did BLAST searches with both protein sequences. Their homologs are indeed widespread, and very highly conserved, with E-values throughout the gamma-proteobacteria of better than 10^-120 for ComM homologs, and better than 10^-60 for DprA homologs.
So what do they do? Nobody knows!
ComM: AAA+ proteins whose functions have been studied have very diverse roles. They share "a structurally conserved ATP-binding module that oligomerizes into active arrays" (Erzberger and Berger 2006). I'll need to carefully read this big review to understand the 'arrays' part, but some of these proteins form hexamer rings around DNA at replication forks, and some form spiral filaments involved in the initiation of DNA replication (e.g. DnaA). ComM is in a subgroup of members with diverse or unknown functions; some use ATP hydrolysis to generate forces (e.g. dynein), and some push Mg++ into chlorophyll (!). None of the orthologs of ComM in bacteria appear to have been examined for roles outside of competence.
DprA: DprA is known to limit degradation of incoming DNA in many different competent bacteria (H. influenzae, Bacillus subtilis, Streptococcus pneumoniae, Campylobacter jejuni), but nothing is known about a more fundamental role in almost all bacteria (only some obligate intracellular pathogens lack it). Last year a Dutch group reported an attempt to find out what DprA does for E. coli. They found that knocking out the gene had no significant effect on growth rate or various repair-dependent or recombination-dependent processes, even in cells lacking other genes involved in repair and/or recombination. They found that the E. coli gene could partially restore transformability to a H. influenzae dprA knockout. It had been previously shown that the H. influenzae dprA gene can restore transformation to a C. jejuni dprA knockout, so we can be pretty sure that whatever this gene does for non-competent species allows it to also prevent DNA degradation during transformation in competent species.
Conclusions? These are two ubiquitous and very strongly conserved genes. Their effects on transformation in naturally competent cells are likely to be only side effects of more important functions in all cells, but we remain quite ignorant of what these functions might be.
It's all coming back to me...
My goal of using the Gibbs motif sampler to examine the consensus of USS repeats in other genomes is getting closer.
This morning I found the Perl script we wrote to chop genomes into sub-segments for Gibbs analysis, figured out its requirements, and used it to prepare a Neisseria genome sequence for analysis. I worked out where I needed to put the resulting genome file in my directory on the computer cluster server, and used Fugu to do it. I found the instructions on how to log on to the cluster from the Mac Terminal, found my password, and figured out how to change to the directory with the Gibbs program in it. I worked out how to modify the command line to work for this genome, and YES! it worked.
Unfortunately it didn't readily find the Neisseria USS motif, even though it's a much simpler motif than the H. influenzae one, and a bit more frequent too. I'm hoping that by tomorrow my brain will have remembered how I solved this problem for H. influenzae.
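For the record, the chopping step mentioned above amounts to something like the following. Our real script is in Perl and I haven't reproduced it here. This is a Python sketch with hypothetical function names, and the segment length is whatever the Gibbs run needs.

```python
def split_sequence(seq, segment_length):
    """Cut a sequence into consecutive, non-overlapping segments.
    The last segment may be shorter than segment_length."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [seq[i:i + segment_length]
            for i in range(0, len(seq), segment_length)]


def segments_to_fasta(name, segments):
    """Format segments as FASTA records named name_1, name_2, ..."""
    return "".join(f">{name}_{i}\n{s}\n"
                   for i, s in enumerate(segments, start=1))


pieces = split_sequence("ACGTACGTAC", 4)
fasta = segments_to_fasta("genome", pieces)
```

Note that a motif spanning a cut point will be split between two segments, so in practice the segments should be long relative to the motif.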
Do USS constrain protein-coding?
About eight years ago I started collaborating with people with bioinformatics skills on an analysis of how USSs constrain (or don't constrain) the ability of the genome to code for proteins. The project is still unfinished, but it's made a lot of progress. I think I should make one last push to get it done before I dive into doing experiments.
Originally I was working with a biophysicist in Taiwan and his excellent graduate student. The student did a lot of nice analysis and was coauthor on one paper with us, but we never got the second part of the work finished (or published). He's since moved on to other things, and the analysis is being redone (new data is available, and we now realize the flaws in the original analysis) by a bioinformatician (bioinformaticist?) working at the National Research Council labs in Ottawa. I'm the main idea person, and the main manuscript-writer, and she's the person who can write Perl scripts and deal efficiently with databases.
Last summer, when we finished our first joint manuscript on USS evolution (pdf here), I did the sensible thing of writing a rough draft of this second manuscript before we'd done most of the analysis. I even put in mock-up figures of the results I expected (based on the previous work by the grad student). I'm told that the best scientists always have a pretty good idea what the paper will say before they do (get their students and post-docs to do) the experiments, but I'm rarely that far ahead.
Since then the bioinformatician has done quite a bit of the work, but I've mostly let my contribution slide while I did more urgent things. Today I read through my rough draft of the manuscript (a very nice aid to my lousy memory) and realized that I'm far from clear about what analysis has been done and what still needs to be done, by her and by me. I'm pretty sure that I just need to read back through our emails and associated attachments to get this clear.
One new bit of analysis will be Gibbs motif searches on the various genomes. I've already done this to death for H. influenzae but now want to do it with other genomes. Unfortunately I've forgotten such important basic information as how to connect to the computer cluster where I run the searches (username? password?), and how to format the search instructions. Not a big deal - I just wish I had been more organized in storing this useful information.
20 minutes later: I was unduly pessimistic. I started up Fugu (the program I'd used to interface with the computer cluster) and found that it had not only remembered the name and location of the cluster server, it knew my password. And I found a folder called "run pbs scripts" in the Gibbs folder on my computer, and this turns out to contain the instruction files I used when queueing my searches on the cluster server! (I had even forgotten that they needed to be queued.) Now I just need to get the genome sequences, and get them into the right format (Fasta, in big fragments?), and reread the pages explaining what my previous instructions meant. Then I can set up the new searches and put the files onto the server. I'll still need to log on to the server using Mac's Terminal interface, to put the searches into the queue, but I know those instructions are around somewhere....
Should we stay (stop) or should we go (on)?
A commenter on my most recent post about our messy microarray data pointed me to a paper suggesting a Bayesian approach to deciding whether apparent expression differences are significant. In principle this sounds great, but despite my attempts last summer to understand Bayesian methods, even this 'easy' paper is over my head. If the array work was a problem close to my heart, or if the preliminary data looked a lot more interesting than it does, I'd probably be prepared to master the necessary statistics. But it's not and it doesn't (sorry Julian), so I'm taking the statistical easy way out.
The previous post outlined the analysis that the technician has now done. For each treatment (two concentrations of rifampicin and one of erythromycin) she kept the data from the three best samples. Then she combined the data for the three samples into 'experiments', and filtered these to get lists of genes whose signals were consistently increased or decreased by at least twofold in all three samples. And she noted the descriptions of the proteins encoded by most of these genes.
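The filtering rule is simple enough to write down. Here is a sketch in Python, assuming each gene's data has been reduced to one treated/control ratio per sample. The real filtering was done in GeneSpring, and the gene names below are made up.

```python
def consistent_changes(ratios, fold=2.0):
    """Split genes into 'up' and 'down' lists.

    ratios maps a gene name to a list of treated/control expression ratios,
    one per sample. A gene is 'up' only if every sample is at least
    fold-fold increased, and 'down' only if every sample is at least
    fold-fold decreased.
    """
    up = [gene for gene, r in ratios.items()
          if r and all(x >= fold for x in r)]
    down = [gene for gene, r in ratios.items()
            if r and all(x <= 1.0 / fold for x in r)]
    return up, down


# Hypothetical data for illustration only.
example = {
    "geneA": [2.5, 3.1, 2.0],   # up in all three samples
    "geneB": [2.5, 1.9, 4.0],   # misses the cutoff in one sample
    "geneC": [0.4, 0.5, 0.3],   # down in all three samples
    "geneD": [1.0, 1.1, 0.9],   # unchanged
}
up_genes, down_genes = consistent_changes(example)
```

A gene like the hypothetical geneB, strongly up in two samples but just under the cutoff in the third, is excluded by this rule.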
The previous post also raised several issues we need to be concerned about. One of these we still don't have information about - that's the level of 'trust' the software has assigned to each gene's data. It's likely that some of the genes should be removed from their lists because the results of the sample used to colour the line are not considered trustworthy, perhaps because the signals are very weak, or because the two replicate signals in the sample disagree. For now I'll set aside the issue of 'trust', and refer to all the genes on the list as being 'significantly changed' by the antibiotic treatment.
What do we learn from these lists?
The experiment with the low concentration of rifampicin gives only a few genes with significant changes (one down, seven up). This is consistent with the 'noisy' appearance of the experiment in GeneSpring's graphical display. The first sample has a lot of variation, and this has little correlation with the variation in the other two samples. The first sample for the one gene that's significantly down is obviously unreliable, so I doubt that this gene is genuinely affected by the treatment. All of the seven 'up' genes are close to the two-fold cutoff in at least one sample, and none are up more than 3.8-fold in their highest sample. Five encode ribosomal proteins; the likely significance of this is discussed below.
The experiment with the higher concentration of rifampicin looks better, and gives more significantly changed genes (15 down, 36 up). None of the 'down' genes are the same as the one seen in the low concentration analysis, increasing my confidence that that one should be ignored. None of the decreases are consistently very strong, and the described functions of these 'down' genes don't suggest any interesting patterns.
The 'up' genes include 18 ribosomal proteins. The TIGR database says that H. influenzae has 55 ribosomal protein genes out of about 1740 total genes, so finding 18 of these in the 36 'up' genes is clearly a significant pattern. This adds confidence to the finding of five ribosomal proteins induced with the low rifampicin concentration, but the confidence is tempered because only two of the five are significantly up in both experiments. In the previous post I raised the concern that the strong signals expected from ribosomal protein genes might be giving an ascertainment bias, but the absence of ribosomal protein genes from the 'down' lists (and from the erythromycin lists discussed below) suggests that this isn't a problem. None of these genes is very strongly induced (most 3-4-fold). Several other proteins in the 'up' list are quite strongly induced in one or more samples, and amino acid and dipeptide transporters seem to be overrepresented.
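To put a rough number on 'clearly significant', here is a sketch of a one-sided hypergeometric test in Python using the counts above (55 ribosomal protein genes among about 1740, 18 of them among the 36 'up' genes). It assumes the 36 genes behave like independent random draws, which they don't quite, since ribosomal protein genes are clustered in operons. Treat the result as an indication of scale, not a rigorous p-value.

```python
from math import comb


def hypergeom_tail(total, marked, drawn, observed):
    """P(X >= observed) when drawing `drawn` genes without replacement from
    `total` genes, of which `marked` belong to the class of interest."""
    denom = comb(total, drawn)
    upper = min(marked, drawn)
    numer = sum(comb(marked, k) * comb(total - marked, drawn - k)
                for k in range(observed, upper + 1))
    return numer / denom


# Counts taken from the text; 1740 is the approximate gene total.
expected = 36 * 55 / 1740          # ribosomal genes expected by chance
p_value = hypergeom_tail(1740, 55, 36, 18)
print(f"expected by chance: {expected:.2f}")
print(f"P(at least 18 ribosomal genes): {p_value:.3g}")
```

By chance we'd expect only about one ribosomal protein gene among the 36, so even a generous allowance for operon structure leaves the enrichment far beyond what chance would give.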
Analysis of the erythromycin experiment produced 19 'down' genes and 17 'up' genes. Neither list has any ribosomal proteins, increasing my confidence that their over-representation on the rifampicin 'up' lists reflects genuine induction. The 'down' effects are all quite weak, but several of the 'up' effects are strong. There are three pairs of 'up' genes that share operons, which increases the confidence that they are genuinely induced. Six genes are described as 'reductases' (including two dimethyl sulfoxide reductase subunits and biotin sulfoxide reductase), and five genes are involved in arginine/ornithine/putrescine pathways.
Although I don't have the GeneSpring program I can get a good idea of how much trust it has assigned just from the screenshots I have. Only three of the strongly 'up' genes from the high-concentration rifampicin experiment have the dark colour GeneSpring uses to indicate high trust: a dipeptide transporter and two genes of unknown function. This is probably because the first sample is very noisy. Trust is generally stronger in the erythromycin experiment.
Summary:
Treatment with rifampicin at the sub-inhibitory concentration of 0.05 microgram/ml induces expression of genes for ribosomal proteins and probably of genes for amino acid transporters. Does this make biological sense? Maybe. At the much higher inhibitory concentrations rifampicin inhibits transcription by RNA polymerase. If the main effect of a very weak inhibition is a shortage of the proteins the cell needs most of (= ribosomal proteins), it might turn up expression of the corresponding genes.
Treatment with erythromycin at the sub-inhibitory concentration of 0.1 microgram/ml probably induces genes for some reductases and proteins that break down arginine. Does this make biological sense? Not in any way I can see. Erythromycin at inhibitory concentrations blocks protein synthesis. The logic I suggest above for rifampicin would seem to apply more strongly to erythromycin, making me suspect that my application of it to rifampicin is just empty story telling.
This work's 'publishability' would be higher if we found effects on genes associated with virulence to the human host or with resistance to the antibiotic. Unfortunately, although standards for claiming 'virulence gene' status are lamentably low, none of the genes on the 'up' or 'down' lists is identified in any way with virulence.
I'm now going to email my collaborator, asking him to read this post and consider whether it's worth continuing with these experiments.
Does PurR regulate rec2, and does this matter?
So this morning I sat down with the post-docs to pick their brains about the above questions. The conversation kept moving on to the larger issue of how nucleotides, nucleosides and bases affect expression of competence genes, but for now I want to only focus on the PurR repressor and the rec2 gene. My hope is that we can do the necessary experiments to answer the above questions (and write a paper on the results), and that this will give us a clearer framework in which to address the other questions.
But even this goal looks more complicated than I expected (foolish me - I should know better by now). Here's what we think we know:
The rec2 promoter has a sequence strongly matching the PurR repressor binding site characterized for E. coli. Such sites are also found in all of H. influenzae's purine biosynthesis genes, so we're confident that H. influenzae PurR binds to the same sequence as its homolog. The location of this sequence in the promoter predicts that PurR binding will prevent transcription, as expected for its known repressor function. This in turn predicts that transcription of rec2 is repressed by PurR.
Either of the purines guanine and hypoxanthine can act efficiently as corepressors of E. coli PurR (they bind to different sites in the protein). The H. influenzae PurR protein is very similar (55% identical amino acids, shorter by only one amino acid) so we can confidently expect it to have the same corepressors. This then predicts that rec2 is repressed when guanine or hypoxanthine are present at significant concentrations.
The Rec2 protein is needed for the second stage of DNA uptake - passing the DNA across the inner membrane, from the periplasm into the cytoplasm, but not for transport of DNA across the outer membrane. So repressing rec2 is expected to reduce the amount of DNA recombining with the chromosome, and thus the transformation frequency, but not the amount taken up into the periplasm (what we measure when we give cells radioactive DNA).
H. influenzae cells are normally grown in a rich medium that should have lots of purines, so we expect the genes regulated by PurR to be off, and our microarray analysis confirms this. The PurR repressor is expected to be inactive in cells made competent by starvation with MIV medium because it contains no purines, and this is also confirmed by our microarrays; the purine biosynthesis genes turn on within 10 minutes in MIV.
The rec2 promoter also has a strong binding site for the activator protein CRP; this site is of the 'CRP-S' type typical of competence gene promoters, whose induction requires the co-activator protein Sxy. CRP and Sxy are also active in MIV, and this induces transcription of all genes with CRP-S promoters. Thus rec2 appears to have two regulatory sites, one for the CRP/Sxy activator and one for the PurR repressor, and both these effects are expected to increase its transcription in MIV. I want to find out the extent to which PurR derepression contributes to this induction.
One way to test this is to look at induction in the absence of CRP or Sxy. We did that in our original microarray analysis; the effects on rec2 transcription are a bit smaller than those on the other CRP-S genes, consistent with PurR having some effect, but the difference may not be significant.
A reciprocal approach is to look at the effect of knocking out the purR gene. This is potentially much more informative, because the purR- cells are still fully transformable so we can look for effects of purine supplementation on DNA uptake and transformation with and without PurR.
Experiments comparing a purR knockout to purR+ cells showed little or no effect on MIV-induced transformation frequencies. Adding AMP or GMP (at 0.5 mM, tenfold less than previously used) reduced transformation about 1000-fold regardless of PurR. Introducing hypercompetence mutations (sxy-1 or sxy-2) to the purR+ cells reduced the effect of AMP by about 100-fold and of GMP by 10-20-fold. Removing PurR from these mutants sometimes further decreased this effect by about 3-fold, but this may not be significant.
Other experiments, done by a former PhD student and a former undergraduate student, examined the effects of supplementing MIV with PurR corepressors on competence and on rec2 and comA transcription. The competence assays showed no effect on transformation with 5 mM guanine, a slight decrease with 5 mM adenine, and a slight but apparently significant increase with a mixture of uracil and hypoxanthine (5 mM each). The transcription assays found no significant effect of 1 mM guanine or hypoxanthine on expression of lacZ fusions to either rec2 or comA (a control with no PurR site). Purine nucleosides (inosine, adenosine, guanosine) and nucleotides (AMP, GMP), all at 5 mM, did reduce expression of the fusions by 5-10-fold.
None of this strongly supports the hypothesis that PurR has a significant effect on rec2 transcription. However, the results with sxy+ cells are likely to be complicated by the effects of nucleotides on the sxy expression needed for induction of the CRP-S genes, including rec2. The results with the hypercompetent sxy-1 and sxy-2 mutants should be independent of this effect, as we know that these cells make large amounts of Sxy protein even when nucleotides have been added. But even in these mutants we still see a strong effect of AMP and GMP that is only slightly (if at all) relieved by knocking out purR.
What additional experiments should be done? I'll have to think more.
Time to do some experiments
I did something at my bench yesterday, for the first time in more than two months. It wasn't a proper experiment - I was only growing up and checking some cells that another researcher has requested from us. But it reminded me of how long it's been since I did any real benchwork.
I realized several years ago that the pleasure of doing "real" experiments was worth the time away from my desk. This time could productively be spent working on manuscripts and advising other lab members on their experiments, maybe more productively than by doing experiments. But doing experiments is where the best fun is, and I'm not going to let an obsession with productivity deny me the pleasure of doing them.
But first I need to decide what to do. I could take up the "Can E. coli be made competent?" work I left off a couple of months ago. (I had found that the reporter genes I was testing were not suitable for the analysis I wanted to do.) Or I could go back farther and continue the preparations for laser-tweezers analysis of DNA uptake. (I had all the components in place to test the system, and stopped only because other work became more urgent.)
One of the post-docs has just blogged on the role of nucleotides in regulating competence, reminding me that this poorly understood regulation is the critical evidence that bacteria take up DNA as food (that the evolutionary function of DNA uptake is nutrient acquisition). What are the important questions?
First (simplest), does the PurR repressor regulate the rec2 gene? This gene's product plays an essential role in bringing DNA across the inner membrane, and its promoter has what looks like a strong binding site for PurR. We also have preliminary evidence that PurR does repress rec2, from one microarray of cells whose purR gene had been knocked out.
I just searched my previous posts and those of another lab member for "PurR" and found a lot of analyses and information I'd forgotten. So I think I need to get together with the two post-docs that are also thinking about this, and pool our memories and ideas.
New BLAST book!
Thanks to a suggestion from a reader, I ordered the O'Reilly Press book on BLAST. It just arrived and looks to be exactly what I and the rest of my lab need.
We all use BLAST all the time, but we've never really had any understanding of how our search query sequence became the search results. We sort-of knew that this was asking for trouble, but haven't taken the time to learn more. Probably this was partly because doing a BLAST search is so fast and easy that you want to use the results right away, not 'waste time' reading the manual.
The new book has a Glossary! (No more using Google to find hints of what the terms mean.) It has a detailed index! Chapter 2 has a section on Evolution, which opens with the wonderful statement that "BLAST works because evolution is happening."
Yesterday I used my newly gained ability to do local BLAST searches to set up a search for one of the post-docs. We blundered around a bit because I couldn't remember what the different letters controlling the parameter settings did. Now I have the book, all the information I need is at hand.
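For anyone else who forgets what the letters mean, here is a small Python sketch of how a command line for the older 'blastall' program is put together. The option meanings are those of the 2.2.x-era blastall; the database and file names are made up for illustration, and the function only builds the command (it doesn't run BLAST).

```python
import shlex

# Meanings of a few single-letter options of the legacy `blastall` program.
BLASTALL_FLAGS = {
    "-p": "program to run (blastn, blastp, blastx, ...)",
    "-d": "database to search (prepared beforehand with formatdb)",
    "-i": "file containing the query sequence(s)",
    "-o": "file to write the results to",
    "-e": "expectation-value (E) cutoff",
}

def blastall_command(program, database, query, output, evalue=10.0):
    """Build (but do not run) a blastall command as a list of arguments."""
    return ["blastall", "-p", program, "-d", database,
            "-i", query, "-o", output, "-e", str(evalue)]

# Hypothetical database and file names, for illustration only.
cmd = blastall_command("blastn", "Hinf_genome", "query.fasta", "hits.txt",
                       evalue=1e-5)
print(shlex.join(cmd))
for flag, meaning in BLASTALL_FLAGS.items():
    print(flag, meaning)
```

Keeping the meanings in one table next to the command is mostly a memory aid; the real list of options is much longer than the five shown here.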
The only problem is that the book was published in 2002, and some details have changed. Right now I only notice that the BLAST web interface has changed a lot. The available version of BLAST has also changed, from 2.2.6 (new when the book was written) to 2.2.16. I suspect I'll need to read the book before I'll have the background to let me understand the changes.
Troubleshooting a cloning experiment
One of the post-docs has been having a problem cloning the H. influenzae sxy gene into an E. coli plasmid vector. She needs the gene to be inserted in the ‘forward’ orientation but all her clones have it in the ‘reverse’ orientation. We’ve been operating on the assumption that this is because expression of sxy is toxic (we’ve lots of other evidence for this), but a bit of troubleshooting yesterday suggested that the explanation may just be a technical problem with the cloning.
The vector she’s using is one we obtained containing another insert, so she cut out the unwanted insert with the restriction enzyme SfiI, hoping to create a ‘no-insert’ version she could use to insert the sxy gene into. (I’ll explain the relevant properties of this enzyme below.) But the new SfiI ends of the no-insert version wouldn’t ligate together or couldn’t be recut after ligation (I forget which), so for her sxy cloning she instead just used the gel-purified vector fragment produced by the original SfiI digestion.
She designed SfiI restriction sites into the primers she used to amplify the sxy gene from H. influenzae chromosomal DNA, so all she needed to do was digest the PCR product with SfiI, incubate it and her vector fragment with ligase, and transform the mixture into competent E. coli (selecting for the chloramphenicol resistance gene on the vector).
She got lots of colonies, but they all had the sxy insert in the wrong orientation. We now think this is because of a peculiarity of the SfiI enzyme. Wikipedia's explanation of how normal restriction enzymes work can be found here. SfiI’s recognition site is written as GGCCNNNNNGGCC; what’s peculiar is that it doesn’t care about the sequence of the bases it cuts at (shown as NNNNN) – it only cares about the flanking GGCC bases. Typical restriction enzymes have no Ns in their recognition sites, so every cleavage site is the same, and because the sites are symmetrical the ends of the fragments have the same ‘sticky’ bases and can form base pairs that allow the ends to be ligated together. But the various SfiI sites have different bases at the N positions and, because the cut site is between the 4th and 5th N (moving 5’ to 3’) on each strand, ends generated from different cut sites usually can’t base pair and so can’t be efficiently ligated.
We now realize that the two SfiI sites flanking the original insert of the vector had different NNNNN sequences, and that this difference explains why the ends of the vector fragment couldn’t be ligated together (or why any rare plasmid resulting from ligation was no longer recognized by the enzyme). Furthermore, the SfiI sites on the PCR primers used to amplify sxy also had NNNNN sequences incompatible with the ends of the vector fragment. The details of the different NNNNN sequences are still unclear, so I’m not sure that this explains why inserts were obtained fairly efficiently but only in one direction.
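Since we still need to work through the actual NNNNN sequences, here is a rough Python sketch of the check I have in mind. It assumes the cut positions described above (between the 4th and 5th N on each strand, which leaves a 3-base 3' overhang made of the 2nd to 4th Ns), and that all four sites are written on the same strand, in the direction we call 'forward'. The function names and the example sites are invented for illustration; they are not our real vector or primer sequences.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def sfii_overhang(site):
    """3-base 3' overhang (top strand, 5'->3') left by cutting GGCCNNNN^NGGCC."""
    site = site.upper()
    if len(site) != 13 or not (site.startswith("GGCC") and site.endswith("GGCC")):
        raise ValueError(f"not an SfiI site: {site}")
    # Each strand is cut between its 4th and 5th N, so the single-stranded
    # tail is the 2nd-4th N of the top strand (0-based positions 5, 6, 7).
    return site[5:8]

def possible_orientations(vec_left, vec_right, ins_left, ins_right):
    """Which insert orientations have sticky ends matching both vector ends?"""
    v1, v2 = sfii_overhang(vec_left), sfii_overhang(vec_right)
    i1, i2 = sfii_overhang(ins_left), sfii_overhang(ins_right)
    found = []
    if i1 == v1 and i2 == v2:
        found.append("forward")
    # A flipped insert presents the reverse complements of its overhangs.
    if revcomp(i2) == v1 and revcomp(i1) == v2:
        found.append("reverse")
    return found

# Invented example sites (overhangs TTA and CCT in the vector).
vector = ("GGCCATTACGGCC", "GGCCGCCTCGGCC")
print(possible_orientations(*vector, "GGCCATTACGGCC", "GGCCGCCTCGGCC"))
print(possible_orientations(*vector, "GGCCAAGGAGGCC", "GGCCATAAAGGCC"))
```

With these invented sites the first call reports only 'forward' and the second only 'reverse'. A 3-base overhang can never be its own reverse complement, so each pair of ends fits one way round only. Whether something like the second case is what happened with the sxy primers depends on the real sequences, which we haven't checked yet.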
It wouldn’t be unreasonable to be annoyed by discovering that we’d overlooked an important detail in our experimental design. But I'm always pleased when troubleshooting has discovered an error. I guess that I find it reassuring that we’ve been able to discover the reason why an experiment has been persistently not working. Of course finding one reason doesn’t mean there is no other problem with the experiment, but we’re optimistic.
The new CRP/CRP-S manuscript
Now that we’ve resubmitted both the USS manuscript and the Sxy manuscript, we’re starting to fix up yet another manuscript, this one about the interactions between CRP and its recognition sites in H. influenzae and E. coli. This manuscript was reviewed by a high-quality journal last spring and politely rejected. The reviewers thought that our research was fine, but the results were not considered important enough to meet this journal’s standards.
I hadn’t looked at the manuscript since we submitted it in May, and had forgotten much of what it said, so I was able to reread it now with an open mind. I found lots of places where we could have done a better job of explaining the significance of what we were reporting. I then went over the manuscript with the post-doc who had done the work and most of the writing (as part of his PhD research). He has designed several nice additional experiments, and we had to decide which of these were worth doing before submitting the manuscript to another journal. In going over these and my rewriting ideas we saw that the manuscript could become much more strongly focused. So now we’re quite enthusiastic about what we (well, I) originally saw as a rather inconclusive little paper.
It’s too bad we weren’t able to make these improvements before submitting it the first time – the reviewers might then have found it acceptable. But I’ve had the same experience several times before: A manuscript we think very important is rejected by a very competitive journal. We polish it and submit it to another competitive journal. It’s again rejected so we polish it some more, and add in the new data or analyses that have accumulated since the first submission. It’s now accepted by a less prestigious journal, but is so much better that we think it would have been acceptable to the first journal we tried. If only....
Arrays and GeneSpring
Yesterday we took another look at the microarray data that we hope will show whether very low antibiotic doses change expression of H. influenzae genes. As I posted the other day, the big problem is deciding whether the differences we see in some genes are due to the antibiotic treatment or just chance.
I think that this project is at a tipping point. (My collaborator, who initiated the project, may think differently.) We need to decide whether or not to keep putting time and money and brainpower into it, and this decision will depend on the results we find in the data we have. If we find that the antibiotic treatments have had strong effects on gene expression, and especially if the genes involved have functions that tell a scientifically interesting story, we’ll want to go on to collect more data and solidify the data we have. But if the apparent effects are weak and unreliable we may decide not to proceed.
How are we going to analyze the data we have? Our plan now is to first set aside the worst (most error-prone) microarray replicates from each of our three treatments, keeping only the three replicates that appear to have the smallest random effects.
We’ll then ‘filter’ each treatment to identify genes whose expression appears to have been at least doubled by the antibiotic treatment in all of the three replicates. We had thought this was going to be troublesome to do, but I discovered that GeneSpring has a number of automatic filtering routines that can easily be used to identify exactly the sets of genes that meet these constraints. This will probably give a list of about 20 ‘genes’ from each treatment. (We’ll also do the same analysis for genes whose expression is consistently reduced by half or more; everything I say below applies to the reduced-expression sets too.)
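The filter itself is simple enough to write down. This is only a Python sketch of the criterion described above, not what GeneSpring does internally; the gene names and ratios are invented, and each ratio is assumed to be (treated signal)/(untreated signal) for one replicate.

```python
def consistently_changed(ratios, fold=2.0):
    """Return (up, down) gene lists from a dict of gene -> per-replicate ratios.

    A gene is 'up' if every replicate ratio is >= fold, and 'down' if every
    replicate ratio is <= 1/fold.
    """
    up = sorted(g for g, r in ratios.items() if all(x >= fold for x in r))
    down = sorted(g for g, r in ratios.items() if all(x <= 1.0 / fold for x in r))
    return up, down

# Invented example: three replicate ratios per 'gene'.
ratios = {
    "geneA": [2.4, 3.1, 2.0],   # at least doubled in all three
    "geneB": [2.5, 1.8, 2.2],   # misses the cutoff in one replicate
    "geneC": [0.4, 0.5, 0.3],   # reduced by half or more in all three
    "geneD": [1.1, 0.9, 1.0],   # unchanged
}
up, down = consistently_changed(ratios)
print("up:", up)
print("down:", down)
```

Requiring the cutoff in every replicate is deliberately strict: geneB is dropped even though its average ratio is above two.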
I put ‘genes’ in quotes in the previous paragraph to bring out the issue that not all of the spots on the microarrays represent genuine H. influenzae genes. Some are just control spots of fluorescent dye, and GeneSpring knows to ignore these. But others are various control DNAs including genes from other organisms. As presently set up, GeneSpring doesn’t know that these aren’t real H. influenzae genes. (I didn’t set up this GeneSpring; it belongs to another lab. I think I know how to add the H. influenzae genome data it needs to do this but haven’t had time to try it.) It won’t be a lot of work to go through each list, removing the entries that aren’t real genes.
Two of the three treatments are different concentrations of the same antibiotic. Comparing the lists produced by these two treatments will help us decide whether the genes on the lists are there because of real antibiotic effects or chance – genes that show up on both lists can be much more confidently viewed as being genuinely induced by the antibiotic treatment.
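To see why overlap between the two lists is persuasive, here is a back-of-the-envelope Python sketch. It assumes that, if nothing real were going on, each list would behave like a random draw from all the genes on the array; the figure of 1700 genes is a round number used for illustration, and the gene names are invented.

```python
def shared_genes(list_a, list_b):
    """Genes present on both lists."""
    return sorted(set(list_a) & set(list_b))

def expected_chance_overlap(n_a, n_b, n_total):
    """Expected number of shared genes if both lists were random draws."""
    return n_a * n_b / n_total

# Invented lists from the low- and high-concentration treatments.
low_dose = ["geneA", "geneB", "geneC", "geneD"]
high_dose = ["geneB", "geneD", "geneE"]
print(shared_genes(low_dose, high_dose))

# Two lists of about 20 genes each, out of roughly 1700 genes (assumed).
print(round(expected_chance_overlap(20, 20, 1700), 2))
```

With lists of about 20 genes, chance alone would be expected to put well under one gene on both lists, so several shared genes would be hard to explain away.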
Then we’ll want to look at the identities of the H. influenzae genes that remain. First, how many such genes are there? Finding that only a few genes are changed would be less interesting than finding many. How strongly are they induced (or repressed)? Weak effects are not as interesting as strong ones, and are more likely to be due to chance. How consistent is the change across the three replicates?
Another issue is the level of ‘trust’ that GeneSpring has assigned to each gene’s expression level, which tells us how reliable GeneSpring thinks the data are. This depends on several factors, though I’m not sure whether GeneSpring considers all of them. First, each array has two replicate ‘spots’ for each gene, and each reported expression level is the average of these two spots’ scores. If the two are not very similar, the average is not very trustworthy. Second, results for genes that are weakly expressed may not be as trustworthy as those for strongly expressed genes because the signal is too weak. Third, the software that reads the array images (we use Imagene) scores the background around each spot, and if the background is too high the spot score is not very trustworthy. So we’ll pay more attention to results that are assigned high ‘trust’.
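I don't know how GeneSpring actually computes its 'trust' values, but the three factors can be illustrated with a crude Python sketch. Everything here is an assumption made for illustration: the function name, the two thresholds, and the decision to treat weak signal and high background as a single signal-to-background test.

```python
def spot_summary(spot1, spot2, background,
                 max_spot_ratio=1.5, min_signal_over_background=2.0):
    """Average two replicate spots and list reasons to distrust the average."""
    mean = (spot1 + spot2) / 2
    high, low = max(spot1, spot2), min(spot1, spot2)
    flags = []
    # Factor 1: the two replicate spots should agree with each other.
    if low <= 0 or high / low > max_spot_ratio:
        flags.append("replicate spots disagree")
    # Factors 2 and 3: the signal should stand clearly above the background.
    if mean < min_signal_over_background * background:
        flags.append("signal close to background")
    return mean, flags

# Invented spot intensities (arbitrary units).
print(spot_summary(1000, 1100, 100))
print(spot_summary(300, 900, 100))
print(spot_summary(150, 160, 100))
```

A real scoring scheme would presumably give a graded value rather than yes/no flags, but the flags make it easy to see which factor is responsible when a gene's result looks shaky.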
Once all these factors have been taken into account we'll get to the most interesting one: what do the affected genes do? Some will be genes about which little or nothing is known – identified as genes because they could code for proteins, and often because the same hypothetical proteins also show up in other bacteria. But some will be genes whose functions are known, and this is the information we’ll use to decide what to do next.
One additional concern is ‘ascertainment bias’. Because microarray analysis is more accurate for genes that are strongly expressed in the untreated cells (produce lots of mRNA), it is more likely to confidently detect relatively small changes in gene expression in these genes, and to miss or not trust small changes in weakly-expressed genes. Several of the genes on the preliminary lists encode ribosomal proteins, which we know are normally highly expressed. If we find that, say, 40% of the induced genes encode ribosomal proteins, does this mean that ribosomal proteins are preferentially induced by antibiotic treatment, or only that they are the easiest to detect? One check on this would be whether ribosomal protein genes also show up in the sets of genes that appear to be repressed by the treatment; if so this is probably an ascertainment bias effect.
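The check amounts to comparing two fractions. Here is a minimal Python sketch; the 'up' and 'down' lists are invented, with a few ribosomal protein gene names standing in for the highly expressed class.

```python
def fraction_in_class(gene_list, gene_class):
    """Fraction of the genes on a list that belong to a given class."""
    if not gene_list:
        return 0.0
    return sum(g in gene_class for g in gene_list) / len(gene_list)

# Invented lists, for illustration only.
ribosomal = {"rplA", "rpsB", "rplC"}
up_list = ["rplA", "rpsB", "geneX", "geneY", "geneZ"]
down_list = ["rplC", "geneQ", "geneR", "geneS"]

print(fraction_in_class(up_list, ribosomal))
print(fraction_in_class(down_list, ribosomal))
```

If the two fractions were similar, and both well above the fraction of ribosomal protein genes on the whole array, that would point to ascertainment bias rather than a specific response to the antibiotic. With lists this short the comparison would only be suggestive.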
If each treatment has produced only a few genes with significantly changed expression, and these barely make the two-fold cutoff, we might decide not to proceed with additional experiments, especially if we don’t know what these genes do.
If some ‘interesting’ genes show significant effects, we will probably use an independent method to confirm that they really are induced. This method is real-time (quantitative) PCR. It’s time-consuming and expensive, but it can give more accurate measurements of the amounts of specific RNAs in different samples.
And if the results are promising we’ll go on to do more microarray analysis, this time using cells carrying mutations that make them resistant to the harmful effects of high doses of these antibiotics. We suspect that these cells will respond differently to the low doses we’ve tested on normal cells. Because a lot is known about how such resistance mutations act, their effects on low-dose responses will help us understand how the low doses exert their effects, and the significance of these effects for antibiotics used to treat infections.
(I'm away from the lab for a few days; I wrote this and the next two posts on the plane. They can fill the gap until I get back.)
I think that this project is at a tipping point. (My collaborator, who initiated the project, may think differently.) We need to decide whether or not to keep putting time and money and brainpower into it, and this decision will depend on the results we find in the data we have. If we find that the antibiotic treatments have had strong effects on gene expression, and especially if the genes involved have functions that tell a scientifically interesting story, we’ll want to go on to collect more data and solidify the data we have. But if the apparent effects are weak and unreliable we may decide not to proceed.
How are we going to analyze the data we have? Our plan now is to first set aside the worst (most error-prone) microarray replicates from each of our three treatments, keeping only the three replicates that appear to have the smallest random effects.
We’ll then ‘filter’ each treatment to identify genes whose expression appears to have been at least doubled by the antibiotic treatment in all of the three replicates. We had thought this was going to be troublesome to do, but I discovered that GeneSpring has a number of automatic filtering routines that can easily be used to identify exactly the sets of genes that meet these constraints. This will probably give a list of about 20 ‘genes’ from each treatment. (We’ll also do the same analysis for genes whose expression is consistently reduced by half or more; everything I say below applies to the reduced-expression sets too.)
I put ‘genes’ in quotes in the previous paragraph to bring out the issue that not all of the spots on the microarrays represent genuine H. influenzae genes. Some are just control spots of fluorescent dye, and GeneSpring knows to ignore these. But others are various control DNAs including genes from other organisms. As presently set up, GeneSpring doesn’t know that these aren’t real H. influenzae genes. (I didn’t set up this GeneSpring; it belongs to another lab. I think I know how to add the H. influenzae genome data it needs to do this but haven’t had time to try it.) It won’t be a lot of work to go through each list, removing the entries that aren’t real genes.
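For readers who want to see what this filtering amounts to, here is a minimal sketch in Python. It is not GeneSpring's own routine, and I'm assuming the data have already been reduced to treated/untreated expression ratios; the gene names, the ratios and the `filter_genes` helper are all made up for illustration.

```python
# Hypothetical treated/untreated expression ratios, one value per replicate.
ratios = {
    "geneA": [2.4, 3.1, 2.0],
    "geneB": [2.8, 1.6, 2.2],          # fails: one replicate is below two-fold
    "geneC": [0.4, 0.5, 0.3],
    "geneD": [1.1, 0.9, 1.0],
    "control_DNA_1": [2.5, 2.2, 2.9],  # a control spot, not a real gene
}

# Hypothetical list of array entries that are not real H. influenzae genes.
not_real_genes = {"control_DNA_1"}


def filter_genes(ratios, keep, exclude=frozenset()):
    """Return names whose ratio satisfies `keep` in every replicate."""
    return sorted(
        name
        for name, replicates in ratios.items()
        if name not in exclude and all(keep(r) for r in replicates)
    )


# At least doubled in all three replicates.
induced = filter_genes(ratios, lambda r: r >= 2.0, not_real_genes)
# Reduced by half or more in all three replicates.
repressed = filter_genes(ratios, lambda r: r <= 0.5, not_real_genes)

print("induced:", induced)
print("repressed:", repressed)
```

Requiring the cutoff in every replicate, rather than in the average, is what keeps a single wild replicate from putting a gene on the list.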
Two of the three treatments are different concentrations of the same antibiotic. Comparing the lists produced by these two treatments will help us decide whether the genes on the lists are there because of real antibiotic effects or by chance: genes that show up on both lists can be confidently viewed as being genuinely induced by the antibiotic treatment.
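The comparison itself is just a set intersection. A small sketch, with invented gene names standing in for the two lists:

```python
# Hypothetical induced-gene lists from the two concentrations of one antibiotic.
low_dose = {"geneA", "geneB", "geneE"}
high_dose = {"geneA", "geneC", "geneE"}

# Genes on both lists: the best candidates for a real antibiotic effect.
on_both = low_dose & high_dose

# Genes on only one list: could be dose-specific, or could be chance.
on_one_only = low_dose ^ high_dose

print("on both lists:", sorted(on_both))
print("on one list only:", sorted(on_one_only))
```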
Then we’ll want to look at the identities of the H. influenzae genes that remain. First, how many such genes are there? Finding that only a few genes are changed would be less interesting than finding many. How strongly are they induced (or repressed)? Weak effects are not as interesting as strong ones, and are more likely to be due to chance. How consistent is the change across the three replicates?
Another issue is the level of ‘trust’ that GeneSpring has assigned to each gene’s expression level, which tells us how reliable GeneSpring thinks the data are. This depends on several factors, though I’m not sure whether GeneSpring considers all of them. First, each array has two replicate ‘spots’ for each gene, and each reported expression level is the average of these two spots’ scores. If the two are not very similar, the average is not very trustworthy. Second, results for genes that are weakly expressed may not be as trustworthy as those for strongly expressed genes because the signal is too weak. Third, the software that reads the array images (we use Imagene) scores the background around each spot, and if the background is too high the spot score is not very trustworthy. So we’ll pay more attention to results that are assigned high ‘trust’.
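To make the first of these factors concrete, here is a rough sketch of averaging two duplicate spots and flagging pairs that disagree. I don't know how GeneSpring actually computes its ‘trust’ values, so this is only an illustration; the `spot_summary` function and its 25% tolerance are assumptions of mine.

```python
def spot_summary(spot1, spot2, max_rel_diff=0.25):
    """Average two duplicate spot scores and say whether they agree.

    The spots 'agree' if they differ by no more than `max_rel_diff`
    (an arbitrary, assumed tolerance) relative to their mean.
    """
    mean = (spot1 + spot2) / 2
    if mean <= 0:
        return mean, False
    rel_diff = abs(spot1 - spot2) / mean
    return mean, rel_diff <= max_rel_diff


print(spot_summary(100.0, 110.0))  # similar spots: the average is believable
print(spot_summary(100.0, 300.0))  # very different spots: treat with caution
```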
Once all these factors have been taken into account we'll get to the most interesting one: what do the affected genes do? Some will be genes about which little or nothing is known – identified as genes because they could code for proteins, and often because the same hypothetical proteins also show up in other bacteria. But some will be genes whose functions are known, and this is the information we’ll use to decide what to do next.
One additional concern is ‘ascertainment bias’. Because microarray analysis is more accurate for genes that are strongly expressed in the untreated cells (produce lots of mRNA), it is more likely to confidently detect relatively small changes in gene expression in these genes, and to miss or not trust small changes in weakly-expressed genes. Several of the genes on the preliminary lists encode ribosomal proteins, which we know are normally highly expressed. If we find that, say, 40% of the induced genes encode ribosomal proteins, does this mean that ribosomal proteins are preferentially induced by antibiotic treatment, or only that they are the easiest to detect? One check on this would be whether ribosomal protein genes also show up in the sets of genes that appear to be repressed by the treatment; if so this is probably an ascertainment bias effect.
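That check can be written down in a few lines. In this sketch the gene names are placeholders and the lists are invented; the point is only the comparison of the two fractions.

```python
def ribosomal_fraction(genes, ribosomal):
    """Fraction of `genes` that encode ribosomal proteins."""
    genes = set(genes)
    if not genes:
        return 0.0
    return len(genes & ribosomal) / len(genes)


# Hypothetical set of ribosomal protein genes and hypothetical result lists.
ribosomal = {"ribo1", "ribo2", "ribo3", "ribo4"}
induced = ["ribo1", "ribo2", "geneA", "geneB", "geneC"]
repressed = ["ribo3", "ribo4", "geneD", "geneE", "geneF"]

frac_induced = ribosomal_fraction(induced, ribosomal)
frac_repressed = ribosomal_fraction(repressed, ribosomal)

print("ribosomal fraction, induced:", frac_induced)
print("ribosomal fraction, repressed:", frac_repressed)
```

If the two fractions come out similar, as in this made-up case, the ribosomal protein genes are probably over-represented because they are easy to detect. A high fraction in the induced set only would point towards a real preferential induction.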
If each treatment has produced only a few genes with significantly changed expression, and these barely make the two-fold cutoff, we might decide not to proceed with additional experiments, especially if we don’t know what these genes do.
If some ‘interesting’ genes show significant effects, we will probably use an independent method to confirm that they really are induced. This method is real-time (quantitative) PCR. It’s time-consuming and expensive, but it can give more accurate measurements of the amounts of specific RNAs in different samples.
And if the results are promising we’ll go on to do more microarray analysis, this time using cells carrying mutations that make them resistant to the harmful effects of high doses of these antibiotics. We suspect that these cells will respond differently to the low doses we’ve tested on normal cells. Because a lot is known about how such resistance mutations act, their effects on low-dose responses will help us understand how the low doses exert their effects, and the significance of these effects for antibiotics used to treat infections.
(I'm away from the lab for a few days; I wrote this and the next two posts on the plane. They can fill the gap until I get back.)