How do non-competence genes respond to competence inducting treatment?

For the RNAseq part of the toxin-antitoxin paper, we should describe what we learn about how transfer to the competence-inducing starvation medium MIV affects genes not known to be involved in competence.

The former undergrad left us with a set of Edge and DEseq2 analyses of changes in gene expression.  I discussed them here last summer (http://rrresearch.fieldofscience.com/2016/08/making-sense-of-rna-seq-comparisons.html).  Unfortunately I don't know how to properly interpret them.  The former post-doc suggested some analyses, but I'm reluctant to dive into these until I have a better idea of what I'd be getting into.

These analyses are comparing the time-course expression for the wildtype strain against a knockout-mutant strain.  So I think they're useful for identifying changes that are dependent on genes whose knockouts we've tested, but not for identifying genes that change for other reasons.

Below is a figure created in a different analysis.  This 'cluster' analysis creates subgroups of genes based on how their expression changes between the three sBHI time points or the four transfer-to-MIV time points.   Cluster I includes all of the previously identified CRP-S-regulated competence genes except ssb and HI1631, and two other genes, HI1384 and HI1385 (second figure below), which encode ferritin and are reported to be induced when iron levels are high.  ssb is consistently induced but usually no more than twofold, so it's lost its 'competence gene' status in this analysis.  HI1631 (unknown function) is induced about 10-fold with the typical CRP-S timing, but its maximum expression is quite low and it's put in cluster M.



But how can I find all the genes that are induced?  A folder called Find Significant' has a script that seems to function like Edge and DEseq2, comparing induction patterns between two strains.  There's a folder called Significant Gene Comparisons, but again it's comparing wildtype treatments to mutant treatments.

What we learned from the RNAseq data: Part 1

So, we did a massive RNAseq study of competence, with 124 samples of H. influenzae cultures at different growth stages, in the rich medium sBHI and the competence-inducing medium MIV, and with wildtype or mutant genes affecting competence.  (You can use the Search box to find all the previous posts about this work...)  




Here I specifically want to think about what we learned about competence from the competence-induced cultures (cells transferred from sBHI to MIV).  We sampled cultures at the T=0 point indicated by the star in the above diagram, and the 10, 30 and 100 minute times in MIV.  I'll also consider the results for strains with knockout mutations in the competence-regulating genes sxy and crp, but not for the toxin-and antitoxin mutants or hypercompetence mutants. 

I'm working on this because it will be a section in our manuscript about the competence-induced toxin-antitoxin system we've discovered.  After describing what the RNAseq results tell us about the general features of competence regulation, the manuscript will go on to describe what they tell us about expression of the toxA and toxT genes.

About 15 years ago we used microarray analysis in a very similar study (Redfield et al 2005); the diagram above shows the sampling strategy from that work.  Here I want to focus on how the new RNAseq results affect its conclusions - what's strengthened, what's changed.  This will be the first time we've described the RNAseq work in a publication, so it's important to present the basic information about the study.  Eventually there will be a second paper describing the the effects of the hypercompetence mutations on 'spontaneous' competence development in rich medium.

Because the T=0 samples were taken just before the cells were removed from rich medium, at OD600 = ~0.25, they give us gene expression levels for log-phase growth (when the growth rate is maximum, not limited by competition with other cells).  If we want we can strengthen these conclusions by also using data from the other 'arm' of the RNAseq study, because this took samples at a lower cell density (OD600 = 0.02).  If cells at both densities are truly in log-phase growth, their patterns of gene expression should be the same.  If they're not, this becomes a topic for the hypercompetence manuscript, not here.

A former Honours undergrad did a great deal of analysis of this work, and left us with excellent graphics and the R scripts that generate them.  Below I'll illustrate the findings with his figures and some the former postdoc made, and discuss whether we need to generate different ones.

Here's a graphic showing the relative expression of every CRP-S gene in the four time point samples (M0 is T=0, M1 is T=10 min, M2 is T=30 min, and M3 is T=100 min).  (Oops, I cropped off the Y axis - it's a log scale going up to 10^3, with the X axis crossing at 1.)  So we see that all genes are induced (most by 10-100 fold) at 30 min, and mRNA levels stay that high through 100 min.



Here's a figure showing how the regulatory genes change.  (Ignore the bars for the sBHI time points B1, B2 and B3.)  Basically, sxy mRNA is strongly induced within 10 min (M1), well before the CRP-S genes in the graphic above, and cya and crp expression changes only slightly.  The sxy mRNA isn't efficiently translated to Sxy protein unless purines are depleted. The drop in cya expression on transfer to MIV is expected, since it is known to be negatively autoregulated by CRP and cAMP.  The small but significant rise in crp mRNA was also seen in the microarray study - it's cause isn't known.



Here is a mock-up of an overview figure suggested by the former postdoc, with not quite the right data.  It illustrates how competence gene expression changes after 10 min, 30 min and 100 min in MIV, and how the 30 min changes are altered if sxy or crp is knocked out.  The tips of the grey lines represent expression changes of all the other genes; they'll be replaced by grey dots in the final version. 



The big question is: How does this advance on what we already concluded from the old microarray data?  I think I had better printout and carefully reread that paper, rather than relying on my memory.




A new plan for contamination correction

The grad student and I had an intense Skype discussion with the former postdoc this morning.  We realized that there's a much easier way to estimate the total Rd contamination in the uptake samples, so we're going to do this.
Preamble:  The former postdoc isn't convinced that this will be better than what we were doing. He likes the simplicity of using the dual-genome alignments directly to calculate uptake ratios, and doesn't think that all the missing data from multiply-mapped positions is a concern.  I don't like discarding what is otherwise good data, and hope that using all the data we can will reduce the noise due to low coverage positions in our uptake ratios.  (The grad student prudently kept his opinions to himself.)
Easy contamination estimate:  We actually already have this data for the eight NP samples, and can easily get it for the eight GG samples.  Way back, the postdoc determined how many of the reads in each sample mapped to the Rd part of the concatenated Rd-NP genome, and how many mapped to the NP part (table below).  Reads from repeats and no-SNP segments were excluded.  The fraction mapping to Rd (Rd/total in the table below) is the fraction of Rd DNA contaminating the sample.  (This data is from a file named meanreads per genome.xlsx.  UP1, 2 and 3 are the replicate long-fragment uptake samples, UP4, 5, and 6 are the replicate short fragment uptake samples, and UP13 and UP15 are the corresponding long and short input samples.)


It's possible that different genome regions are contaminated to different extents (e.g. if some Rd and/or NP sequences are poor substrates for library formation or sequencing), but we'll proceed on the assumption that contamination level is fairly even across the genome, perhaps checking later**.

How to correct the uptake coverage levels for Rd contamination:  We need to only correct the parts of the genome where Rd reads are aligning.  The obvious way to identify these regions is to align reads from pure Rd DNA to the NP reference genome.  At each position the relative coverage is a measure of the extent of contamination we expect due to the Rd contamination in the uptake samples.  Nicely, this coverage measure will intrinsically include the effects of any differences in the sequence-ability of different parts of the Rd genome.

So:
  1. (blue figure )  Align the reads from a pure Rd sample to the NP genome.  (I don't think an Rd control was included in the sequencing done specifically for this project, but we have sequenced pure Rd DNA in other experiments.)  The reads from Rd-unique parts of the Rd genome will be discarded as unmappable.
  2. (blue figure ) Record the Rd coverage at each position.
  3. (yellow figure ) For each NP uptake sample to be corrected for contamination, align its reads to the NP reference genome and record the coverage at each position.  Calculate the average coverage (or normalized coverage) across the whole genome.  
  4. (yellow figure ) Use the NP uptake sample's contamination level (from the table above) to calculate the average amount of this coverage that is due to Rd contamination.  (E.g., if sample UP3 has average coverage of 350 when aligned to NP alone, then 350* 0.0884 = 30.9 of this comes from contamination.) 
  5. (blue figure ) Normalize the position-specific Rd contamination levels (from point 2 above) to the Rd read coverage predicted to be due this contamination level (from point 4 above, 30.9 in the example).
  6. (yellow figure ) Subtract the normalized Rd contamination from the total coverage at each genome position to get its contamination-corrected coverage.


If the uptake coverages weren't normalized at step 3, normalize them before combining them to get the mean coverages over the three uptake samples.

Then compare the resulting uptake-ratio maps to those we previously obtained.

Now, let's see what the former postdoc and grad student think of this plan.  It might not be exactly what we discussed.


** How to check if contamination is evenly distributed across the genome:  The former postdoc has a list of ~34,000 well-behaved SNPs that distinguish Rd and NP.  After mapping an NP uptake sample onto the NP reference genome, at each position we can score the frequency of the Rd and NP alleles.  Since the background error rate is low, we can just take the fraction of Rd alleles as the contamination level at each position.  Now map these contamination levels across the NP genome.  We expect to see a fairly constant level.  At positions where Rd has no homolog to the NP sequence, there will be no data points because there will be no SNPs to score.  There will certainly be noise in these levels.  But maybe we'll see smooth variation that suggests that some parts of the NP genome have higher Rd contamination than other parts.

How will contaminating Rd reads map onto NP?

Based on the analysis in the previous post, we can calculate lower-bound and upper-bound estimates for the Rd contamination levels  for each of our 12 'uptake' DNA samples (3 replicates each of 4 treatments).  And we can correct the lower-bound estimate (and maybe the upper-bound one too) to get an estimate of the true contamination level for each sample. But what do we do with this information?

To think about this, first consider how we expect the contamination to affect the apparent coverage for each position in the NP (or GG) genome.  

We can map each sample just to its own NP genome.  The NP-derived reads (bright red and blue) will map to their correct positions (at internal repeats we'll see the average coverage).  Now the Rd-derived reads from positions that have strong similarity to NP locations (dark red in the figure) should map to their homologs, including all the Rd reads from repeats and no-SNP segments.   I think that the Rd-derived sequences that don't have NP homologs (yellow in the figure) and thus can't be mapped onto NP will be unmappable and given Q=0 scores. This will be about 10-15% of the Rd reads (a known value).  So most NP positions will have their NP read coverage plus (say) 20% extra coverage from the Rd reads. But some NP positions (again 10-15%) don't have Rd homologs (blue), and these will have only their NP coverage.




Grad student and former postdoc, how should we handle this?  If we have a table listing which NP positions have no Rd homolog, we could just do the contamination correction on the positions that do have homologs.  But how well this works depends on how well we understand the limits of the algorithm that maps the reads onto the reference sequence.

Try #3: Analyzing the uptake data

OK, here's another try at understanding the interactions of repeated sequences, reads with no SNPs, and RD contamination in our DNA uptake sequencing data.

A reminder of what the problem is that I'm trying to solve:  We have three 'uptake' samples for each treatment in our big analysis of DNA uptake specificity, but they can't be directly compared because, in the different samples, the recovered 'donor' DNA (from strain NP) probably has different levels of contamination by the recipient strain Rd.  We have no direct way to measure this contamination, but I think we can infer it from other features of the samples.  (The next post considers how to correct for the % contamination.)

This time I've simplified the diagrams by considering only a short segment containing one rDNA repeated sequence (purple) and one segment with no SNPs (green).  When the reads from the contaminated DNA are aligned to the concatenated Rd-NP genomes, reads from both repeats and no-SNP locations will be 'multiply-mapping', and the alignment program will randomly assign each of these reads to one of the places they align with. 

Diagram A. The first diagram is alignment of reads of the NP input sample to the concatenated Rd-NP genomes.


At the rDNA repeat and the no-SNP segment the coverage mapped to NP is reduced to 50%, and the coverage mapped to Rd is increased to 50%.  These 50% coverages will be constant over all repeats and no-SNP segments.

The multipy-mapping reads (repeats and no-SNP) will be given quality scores Q= 0. Some reads that are poor quality for other reasons may also be given this score - the grad student is going to check how big a contribution this is.  We can then calculate the fraction of reads that are multiply-mapping and use this later to improve our estimates of the Rd contamination in the uptake samples.

The next two diagrams separately show the expected mapping of each of the components of an NP uptake sample that is contaminated with 20% Rd DNA.

Diagram B.  Here's the expected NP contribution:


All the unique NP reads map at the unique NP positions; this is only 80% of the coverage because 20% is Rd contamination.  Only 40% of the multiply-mapping reads are assigned to the multiply-mapping positions in the NP genome.  The other 40% are assigned to their Rd homologs.

Diagram C.  Here's the expected Rd contribution:


The other 20% of the reads map to Rd.  At unique Rd positions this gives 20% coverage, but at the multiply-mapping positions this is reduced to 10% because the other 10% are assigned to their NP homologs. 

Diagram D.  Now here's the total mapping of the NP uptake sample (including the contamination):  (This is what we would actually see when we aligned an NP uptake sample with 20% Rd contamination to the concatenated Rd-NP genomes.)


The NP-aligned sequences stay at 80% coverage for the unique NP sequences, but coverage at the multiply-mapping segments increases to 50% due to the 10% from the Rd sequences.  Similarly the coverage at the Rd unique sequences stays at 20%, but increases to 50% at the multiply-mapping segments.

OK, how can we use this analysis to work backwards?  We need to start with Diagram D and infer that there is the 20% Rd contamination assumed in Diagrams B and C .

If the data were as clean as I've drawn it we could just look at the Rd coverage in Diagram D and fit a horizontal line to the 'typical' 20% coverage.  But the real data may be much noisier than this, so it will probably not be possible to decide where to put such a line.

What if we recalculated Diagram D, this time excluding all the reads with Q=0 scores?

Diagram E. 


Perhaps now we can draw the desired line through the high-coverage positions.  But this is still a bit subjective, so here's another approach:

First, we calculate mean coverage of the uptake sample over the Rd genome (Q=0 reads excluded).  This gives us a lower bound for the true contamination.  

Then, from Diagram D, we calculate mean coverage of the uptake sample over the Rd genome including the Q=0 reads.  This gives us an upper bound for the true contamination.  

Now, can we refine these estimates to get the actual contamination?  To do so we need to know what fraction of Rd positions in Diagrams D and E were excluded by excluding multiply-mapping reads, and what fraction were over-covered by including them.

Back at Diagram A, we calculated the fraction of reads that had Q=0 scores due to multiple-mapping.  If we assume that true coverage is independent of SNPs and repeats, this fraction should also approximate the fraction of the genome whose coverage was lost due to multiply-mapping reads (call it '%Q').  We can use this %Q to correct the contamination estimates.  

Correction for lower-bound estimate:  Divide by (1 - %Q).

Can we also correct the upper bound estimate?  I don't think so (at least not simply), because the degree of over-coverage at the multiply-mapping positions is not fixed - the higher the true contamination is, the smaller the over-coverage effect.  That is, we know that the multiply-mapping positions will have 50% coverage, but unless we already know the amount of contamination we can't correct this for the amount of contamination. 

However we do have a refined contamination estimate from the lower-bound estimate (call it 'Cu'), and I think we can use this to correct our upper-bound estimate.  If we've done this right we should get the same Cu value back.

Correction for upper-bound estimate:  Subtract (%Q(50%-Cu)).

Will the real value be midway between the estimates?  No, it will be closer to the lower bound.  The smaller the true contamination is, the more it will be overestimated by including the multiply-mapping reads.

Then what?  Hmm, I see that I haven't thought this part through.  I think I'll start a new post for that.

We need to do the lower-bound and upper-bound calculations and corrections for each of our 12 'uptake' samples (3 replicates each of 4 treatments).  This will give us good estimates of the % contamination for each sample.

e can map each sample only to its own Np genome (or GG).  Now the Rd-derived reads from positions that have strong similarity to NP locations should map there, and the Rd-derived sequences that cannot be mapped onto NP will be given Q=0 scores.


 and reduce the coverage levels in proportion to the contamination.