Field of Science

Should we stay (stop) or should we go (on)?

A commenter on my most recent post about our messy microarray data pointed me to a paper suggesting a Bayesian approach to deciding whether apparent expression differences are significant. In principle this sounds great, but despite my attempts last summer to understand Bayesian methods, even this 'easy' paper is over my head. If the array work was a problem close to my heart, or if the preliminary data looked a lot more interesting than it does, I'd probably be prepared to master the necessary statistics. But it's not and it doesn't (sorry Julian), so I'm taking the statistical easy way out.

The previous post outlined the analysis that the technician has now done. For each treatment (two concentrations of rifampicin and one of erythromycin) she kept the data from the three best samples. Then she combined the data for the three samples into 'experiments', and filtered these to get lists of genes whose signals were consistently increased or decreased by at least twofold in all three samples. And she noted the descriptions of the proteins encoded by most of these genes.

The previous post also raised several issues we need to be concerned about. One of these we still don't have information about: the level of 'trust' the software has assigned to each gene's data. It's likely that some genes should be removed from their lists because the results of the sample used to colour each gene's line are not considered trustworthy, perhaps because its signals are very weak, or because its two replicate signals disagree. For now I'll set aside the issue of 'trust', and refer to all the genes on the lists as 'significantly changed' by the antibiotic treatment.

What do we learn from these lists?

The experiment with the low concentration of rifampicin gives only a few genes with significant changes (one down, seven up). This is consistent with the 'noisy' appearance of the experiment in GeneSpring's graphical display. The first sample has a lot of variation, and this variation correlates poorly with that in the other two samples. The first sample for the one gene that's significantly down is obviously unreliable, so I doubt that this gene is genuinely repressed by the treatment. All seven 'up' genes are close to the two-fold cutoff in at least one sample, and none is up more than 3.8-fold in its highest sample. Five encode ribosomal proteins; the likely significance of this is discussed below.

The experiment with the higher concentration of rifampicin looks better, and gives more significantly changed genes (15 down, 36 up). None of the 'down' genes are the same as the one seen in the low concentration analysis, increasing my confidence that that one should be ignored. None of the decreases are consistently very strong, and the described functions of these 'down' genes don't suggest any interesting patterns.

The 'up' genes include 18 ribosomal proteins. The TIGR database says that H. influenzae has 55 ribosomal protein genes out of about 1740 total genes, so finding 18 of these in the 36 'up' genes is clearly a significant pattern. This adds confidence to the finding of five ribosomal proteins induced with the low rifampicin concentration, but the confidence is tempered because only two of the five are significantly up in both experiments. In the previous post I raised the concern that the strong signals expected from ribosomal protein genes might be giving an ascertainment bias, but the absence of ribosomal protein genes from the 'down' lists (and from the erythromycin lists discussed below) suggests that this isn't a problem. None of these genes is very strongly induced (most 3-4-fold). Several other proteins in the 'up' list are quite strongly induced in one or more samples, and amino acid and dipeptide transporters seem to be overrepresented.
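The arithmetic behind calling this a significant pattern can be checked with a hypergeometric tail probability. This is just a sketch using the counts given above (55 ribosomal protein genes among ~1740 total, 18 among the 36 'up' genes), not anything GeneSpring itself computes:

```python
from math import comb

# Counts from the text: 1740 H. influenzae genes in total, 55 of them
# ribosomal protein genes; 36 'up' genes, 18 of which are ribosomal.
total_genes, ribosomal, up_genes, ribosomal_up = 1740, 55, 36, 18

# Expected number of ribosomal genes in a random 36-gene list.
expected = up_genes * ribosomal / total_genes

# Hypergeometric upper tail: probability that a random 36-gene list
# would contain 18 or more ribosomal protein genes just by chance.
p = sum(comb(ribosomal, k) * comb(total_genes - ribosomal, up_genes - k)
        for k in range(ribosomal_up, min(ribosomal, up_genes) + 1)) / comb(total_genes, up_genes)

print(f"expected by chance: {expected:.2f}; P(18 or more) = {p:.1e}")
```

A random 36-gene list would be expected to contain only about one ribosomal protein gene, so 18 is far beyond anything chance sampling could produce.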

Analysis of the erythromycin experiment produced 19 'down' genes and 17 'up' genes. Neither list has any ribosomal proteins, increasing my confidence that their over-representation on the rifampicin 'up' lists reflects genuine induction. The 'down' effects are all quite weak, but several of the 'up' effects are strong. There are three pairs of 'up' genes that share operons, which increases the confidence that they are genuinely induced. Six genes are described as 'reductases' (including two dimethyl sulfoxide reductase subunits and biotin sulfoxide reductase), and five genes are involved in arginine/ornithine/putrescine pathways.

Although I don't have the GeneSpring program, I can get a good idea of how much trust it has assigned just from the screenshots I have. Only three of the strongly 'up' genes from the high-concentration rifampicin experiment have the dark colour GeneSpring uses to indicate high trust: a dipeptide transporter and two genes of unknown function. This is probably because the first sample is very noisy. Trust is generally stronger in the erythromycin experiment.

Summary:

Treatment with rifampicin at the sub-inhibitory concentration of 0.05 microgram/ml induces expression of genes for ribosomal proteins and probably of genes for amino acid transporters. Does this make biological sense? Maybe. At the much higher inhibitory concentrations rifampicin inhibits transcription by RNA polymerase. If the main effect of a very weak inhibition is a shortage of the proteins the cell needs most of (= ribosomal proteins), it might turn up expression of the corresponding genes.

Treatment with erythromycin at the sub-inhibitory concentration of 0.1 microgram/ml probably induces genes for some reductases and proteins that break down arginine. Does this make biological sense? Not in any way I can see. Erythromycin at inhibitory concentrations blocks protein synthesis. The logic I suggest above for rifampicin would seem to apply more strongly to erythromycin, making me suspect that my application of it to rifampicin is just empty storytelling.

This work's 'publishability' would be higher if we found effects on genes associated with virulence to the human host or with resistance to the antibiotic. Unfortunately, although standards for claiming 'virulence gene' status are lamentably low, none of the genes on the 'up' or 'down' lists is identified in any way with virulence.

I'm now going to email my collaborator, asking him to read this post and consider whether it's worth continuing with these experiments.

Thinking statistically about array results

A few days ago I posted about a problem we're having interpreting some microarray data. That problem could be simply resolved by finding the missing notebook, or activating the relevant memory synapses in the brain of the student who did the work. But there's a bigger and more intrinsic problem with this type of experiment, and that's deciding whether observed differences are due to antibiotic effects or to chance.

Consider data from a single array slide, comparing the amounts of mRNA from each of ~1700 genes from antibiotic-treated and untreated cells. If the antibiotic has no effect on a gene we expect the amounts from treated and untreated cells to be 'similar'. We don't expect them to be exactly the same (a ratio of exactly 1.00) because all sorts of factors generate random variation in signal intensity. So we need to spell out criteria that let us distinguish differences due to these chance factors from differences caused by effects of the antibiotic treatment. Here are some (better and worse) ways to do this:

1. The simplest way is to specify a threshold for significance, for example declaring that only ratios larger than 2.0 (or smaller than 0.5) will be considered significant (i.e. not due to chance). But this isn't a good way to decide, because an arbitrary fold-change threshold tells us nothing about how much variation chance alone produces.

2. We could instead use the standard statistical cutoff, declaring that only the 5% highest and 5% lowest values would be considered significant. One problem here is that this criterion would tell us that some differences were significant (2 × 5% of 1700 = 170 genes) even if in fact the antibiotic treatment had absolutely no effect.

3. We could improve this by doing a control experiment to see how big we expect chance effects to be. One simple control is a microarray where both RNAs come from the same treatment. We can then use a scatter plot to compare the two signals for each gene. The diagonal line represents the 1.00 ratio expected in the absence of random factors, and the scatter of the ratios away from 1.00 tells us how big our chance effects typically are.

We could then decide to consider as significant only effects that are bigger than anything seen in the control. Sometimes this could be easy. For example, if expression of one gene was 20-fold higher after antibiotic treatment, whereas control effects never exceeded 5-fold, we'd be pretty sure the 20-fold increase was caused by the antibiotic.

But what if the control effects were all below 2-fold, and two of the genes in our treated RNA sample were up 2.5-fold? Does this mean the antibiotic caused this increase? How to decide?
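A quick simulation shows what the self-vs-self control is telling us. The numbers here are illustrative assumptions, not our actual data: 1700 genes whose true ratios are all 1.00, with measurement noise spreading the log2 ratios by about 0.35:

```python
import random

random.seed(1)
n_genes = 1700
noise_sd = 0.35  # assumed spread of log2 ratios for unchanged genes (illustrative)

# Simulate a self-vs-self array: every true ratio is 1.00, so any
# deviation of the observed log2 ratio from 0 is pure chance.
log_ratios = [random.gauss(0.0, noise_sd) for _ in range(n_genes)]

# How many genes look more than 2-fold changed (|log2 ratio| > 1) by chance?
false_hits = sum(abs(r) > 1.0 for r in log_ratios)
print(f"{false_hits} of {n_genes} genes exceed the 2-fold cutoff by chance alone")
```

With noise of this size a handful of genes clear the 2-fold cutoff in every control array, which is exactly why a single 2.5-fold observation on its own proves nothing.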

We need to do enough replicates that the probability of any gene's expression being above our cutoff just by chance is very small. For example, we could require that a gene be expressed above the 2-fold cutoff in each of four independent replicates. Even if our arrays had a lot of random variation, comparing replicates can find the significant effects.

Say that our controls tell us that 10% of the genes are likely to score above the 2-fold cutoff just by chance, and we do see about 170 scoring that high in our first real experimental array (comparing treated and non-treated cells). If we do another experimental array, we again expect 10% to score above the cutoff, but if only chance determines score, only about 17 of these are expected to have also scored this high in the first replicate. If we do a third replicate, we only expect 1 or 2 of the genes scoring above 2-fold to have scored this high in both of the first two replicates. You see the pattern. So, if our real four replicates include 6 genes that scored above 2 in all four, we can be quite confident that this is not due to chance. Instead the antibiotic treatment must have caused their expression to be increased.
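The replicate arithmetic above is easy to verify. Using the same illustrative figures (1700 genes, a 10% per-array chance of clearing the cutoff), the expected number of chance survivors shrinks ten-fold with each replicate, and a Poisson approximation puts a number on how unlikely six four-for-four genes would be:

```python
from math import exp, factorial

n_genes = 1700
p_chance = 0.10  # illustrative per-array chance of clearing the 2-fold cutoff

# Expected number of genes clearing the cutoff in all replicates, by chance.
for reps in range(1, 5):
    print(f"{reps} replicate(s): {n_genes * p_chance ** reps:.2f} genes expected by chance")

# Poisson approximation: with a mean of 1700 * 0.1^4 = 0.17 chance survivors,
# how likely is it that 6 or more genes survive all four replicates?
mean = n_genes * p_chance ** 4
p_six_or_more = 1.0 - sum(exp(-mean) * mean ** k / factorial(k) for k in range(6))
print(f"P(6 or more chance survivors) = {p_six_or_more:.1e}")
```

The expected survivors reproduce the 170, 17, 1-2 sequence in the text, and six chance survivors of four replicates would have a probability of well under one in a million.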

The real analysis is more complicated, in part because different replicates have different amounts of error, and in part because we might want to take into account the actual expression levels rather than using the simplistic cutoff of 2-fold. Our preliminary look at the real replicate data last week suggested that a few genes were consistently (well, semi-consistently) increased. Tomorrow we're going to take another look at the real data and see if we think this is a result we have confidence in.