RRResearch: Evidence for a likely sample switch in the RNA-seq dataset (or not)

I've been working on the toxin/antitoxin manuscript, trying to extract all the conclusions from the RNA-seq data trove. We have two odd results, and I now think they are both best explained by a sample switch in the first set of samples.

One odd result that the former post-doc drew my attention to is that, when the antitoxin is deleted, expression of the competence genes appears to be down at the last time point ('M3', 100 minutes incubation in MIV competence-induction medium).

The other odd result, which I just discovered a couple of days ago, is that, when the toxin gene is deleted, expression of the competence genes appears to be up at the second time point ('M1'; 10 minutes in MIV).

Each of these results is based on the mean of three biological replicates (samples pf the same strains cultured on different days). I now think that they're reciprocal consequences of the same problem - switched identities of one pair of samples prepared on the same day.

History: I was originally focusing on the apparent up-regulation in the toxin deletion, which I discovered in comparisons of the toxin ('toxx') and toxin+antitoxin ('taxx') knockouts. I was looking at this comparison because it's the only one where we would expect to see any expression differences that might be caused by action of the antitoxin on genes other than the toxin, and I wanted to know if we could rule these out.

The former summer undergrad had done pairwise comparisons of all the different mutants we'd tested, using both the Edge and DESeq2 packages, so I looked at the Excel files he'd generated comparing the toxx and taxx samples, sorting the expression ratios for each timepoint. I was quite surprised to see that, with the Edge dataset, the genes with the most extreme expression differences at the M1 timepoint were ALL the competence genes (see below). But there was no overexpression at the M2 timepoint, contrary to what I would expect if this was a competence-related effect, and inconsistent effects at M0 and M3.

So I was worried that this might be due to a problem with only one of the three replicate samples that had been averaged, so I looked at the before-averaging data. Initially I suspected that two of the taxx samples (M2_E and M3_E) had been switched. That might still be true, but it was a small effect compared to the bigger toxx anomaly I found when I plotted bar graphs of competence gene coverage for all the taxx and toxx samples. The graph below is for expression of the comABCDEF operon in the taxx mutant (∆toxT), but I found the same anomaly for the other operons: the M1_A sample has much higher expression levels than we normally see at this time (usually only slightly higher than the M0 timepoint).

Now I was suspicious that this 'A' sample might be misidentified - not at M1 timepoint at all. So I looked at all the samples that had been prepared on this day (Day 'A'). These samples were prepared by the former research associate; they were the first RNA prep she did for what turned into the big RNA-seq dataset. Here's the plot of all her Day A samples.

Consistent with my sample-switch hypothesis, the overly high competence-gene expression levels in the toxx M1 sample is balanced by overly low competence gene expression in the antx (∆toxA) M3 sample! Again I'm only showing the comABCDEF operon, but pilABCD and comNOPQ show the same pattern.

So my new hypothesis is that the antx_M3 sample and toxx_M1 sample were switched. This is a good discovery, because it probably explains both the apparent reduction of competence gene expression at M3 in the antx samples and the apparent increase in competence-gene expression at M1 in the toxx samples. But it's a big hassle, because if I'm right we'll need to redo all the bioinformatics analyses that involve these samples. Luckily the summer undergrad is still in the picture, and the R scripts left us with should make this task easy.

But I want to be as confident as possible that my switch hypothesis is correct. The best prediction is that we should see overexpression of toxT and absence of toxA in sample toxx_M1, and the reverse in sample antx_M3.

(...Pause while I create this graph...)

But we don't!

One confounding issue is that the expression scoring detects coverage of the remaining ends of the genes, because they weren't completely deleted. (I can get the former summer student to look at the actual toxT and toxA coverage for each sample to confirm whether the deletions are present.)

With this taken into account, I think we see the expression we would expect if the samples were not switched. For the antx samples, we see elevated expression of toxT and toxA at all time points, and for the toxx samples we see normal (like KW20) expression of toxA and reduced expression of toxT. Importantly, the antx_M3 sample has much higher expression of toxT than the toxx_M1 sample. So I think my hypothesis must be wrong!

OK, now I've checked the expression of toxA and toxT in the samples from other days, and they're nicely consistent with the expression in the Day A samples. So I guess the samples are not switched. DAMN!

So why are the competence gene levels so high in the toxx_M1 sample? I suppose the research associate could just have been delayed in collecting this sample, so it has expression levels closer to those usually seen at 30 minutes. (Unfortunately her notebook for this period has been lost.)

Maybe it will all seem clearer tomorrow...

Field of Science

RRResearch

Evidence for a likely sample switch in the RNA-seq dataset (or not)

No comments:

Post a Comment