Field of Science

New contamination analysis suggests alternate strategy

Yesterday's post considered the factors affecting our current strategy for removing the contamination Rd-derived reads from our 'uptake' samples.  Since then I've looked at the read-mapping results using our new perfected reference sequences.  The problem of reads that have no SNPs and thus map ambiguously to both Rd and NP/GG is worse than we thought.  Below I'll describe this data and the new strategy I want to consider.

The former post-doc sent us a spreadsheet with all the summary data from mapping each of our samples ('input', 'uptake' and controls) to various combinations of genomes.  I'll start by considering what we see when we map the 'input' genomes (NP and GG) and the recipient genome (Rd) onto artificial genomes consisting of an Rd-genome 'chromosome' and an input-genome 'chromosome (NP or GG).  These samples have no contamination, so any reads that map to the wrong 'chromosome' will do so because of these factors (also described in the previous post):

  1. Sequencing errors:  Some NP-derived (or GG-derived) reads will have sequencing errors that create the Rd alleles of SNPs, causing them to align better to the Rd homolog. 
  2.  Shared segments:  Although the Rd and NP (or GG) genome sequences differ on average by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical, including each genome's six copies of the ribosomal RNA genes.  Reads from within these segments cannot be attributed to one source and are randomly assigned locations by the mapping program.  Half of them will be assigned to Rd and half to NP (or GG), regardless of the actual level of contamination, and all will be given MAPQ scores = 0.
Here are the results.  I've deleted the columns with the raw coverage data and just show the % of reads in each sample that were incorrectly assigned.



Depending on the combination of reference genomes used, between 4.75% and 7.61% of the reads are assigned to the incorrect genome, either due to sequencing errors or because the segment they cover is identical in both genomes (does not include any SNP positions).  

The likelihood that a read will be mis-mapped depends on its length - short reads are more likely to not include any SNP positions.  This probably explains the lower mis-assignment rates of the 'long' NP and GG samples.

Very little of this mis-assignment is due to sequencing errors.  The table below shows how the % mis-assignments change when the reads with MAPQ scores of 0 are removed (XXX rows).


Excluding the ambiguously mapping reads dramatically reduces the mis-mapping, to between 0.05% and 0.21%.  This residual mis-mapping reflects reads that did not match the correct genome, but matched a SNP allele in the incorrect genome.  This will be due to sequencing errors that change the SNP allele.

With this information we can calculate the actual fraction of reads that are ambiguously mapped, as 
2 * (mis-mapped reads - seq. errors)

This is a lot more than I expected!  Only a small part of the mis-mapping is due to the 6 ribosomal RNA operons.  Each operon has about 4.5 kb of identical sequence, which will cause about 25 kb (1.4%) of the genome to map ambiguously.  Most of the ambiguous mapping must thus be due to identical sequences in other parts of the genome.  These are likely to be mainly in coding sequences that are either highly conserved or that are identical by descent due to previous recombination between recent ancestors of these strains.

New plan for contamination correction:  Our original contamination-correction plan was to remove the contaminating reads from the uptake samples.  We planned to use mapping to concatenated 'genomes' to identify and discard reads that mapped unambiguously to Rd and that mapped ambiguously to both genomes.  But the above analysis shows that we would be discarding a large fraction of the non-contaminating (NP or GG) reads from many parts of the genome (probablythe most conserved parts).

So here's an alternative.  Instead of trying to remove the contamination from the 'uptake' samples (the numerator in our uptake-ratio calculation), let's add fake contamination to the 'input' samples (the denominator in our calculation).  This idea was suggested by some fake-contamination samples the former post-doc created to do some previous tests.

Here's how it would work:  
  1. First we estimate the actual contamination level of each 'uptake' sample, by mapping it onto the concatenated pair of genomes and using the above analysis to correct for the mis-mapping.   (This calculation is shown below.)
  2. Then we create, for each uptake sample, a fake 'input+contamination' sample that matches the Rd contamination of the corresponding uptake sample, by adding the appropriate number of (randomly-selected?) Rd reads to its reads. 
  3. Then we map each sample's reads (uptake and input+contamination) to the genome it's supposed to represent, not paying any attention to possible contamination.
  4. Then we normalize the coverages of all these mapped samples to the same size (reads per million bp or ???).
  5. Then we calculate the uptake ratios for each uptake sample, by dividing the normalized 'uptake' coverage at each genome position by the corresponding normalized 'input+contamination' coverage. 
  6. Finally, we calculate mean uptake ratio at each position for the sets of three replicate uptake samples.  


Actual contamination:  The calculations to estimate the actual contamination level of each 'uptake' sample took a bit of algebra to set up (see *** below), but I checked them a couple of ways and I'm pretty sure they're correct.


The final column is the calculated contamination level for each sample.  This is the fraction of the DNA in the sample that came from the chromosomes of the recipient Rd cells instead of from donor NP or GG DNA taken up by these cells.  These values are surprisingly close to the values in yesterday's post, which were calculated in a completely different way, using only the reads that did not map ambiguously.  That's fine - it tells us that there's nothing special about the ambiguously mapping reads.

I'm not absolutely sure that this strategy will eliminate the effects of contamination (we can test it with a fake-contaminated uptake sample).  If it does, we should see that the uptake ratio peaks are better correlated with the quality-scores of the uptake sequences (the large scatter we saw previously should be reduced).  And it will certainly reduce the problems associated with the many regions with low or zero coverage created by discarding reads with MAPQ=0.

What next: The first step is for the grad student and former post-doc to read and comment on this post.  If they agree that what I'm proposing a reasonable approach, we'll test it.  

What about using spanning coverage instead of single reads? The grad student is now generating the spanning-coverage, but the goal of using it to correctly position the ambiguous reads is challenging.  In principle we won't need to tag individual reads as ambiguous if we use this new contamination-correction strategy.

***  The contamination calculation:

The input and control analysis above tells us what fraction of the reads mapping-to-Rd in each uptake sample results from mis-assignment of ambiguously mapping reads rather than from contamination.

Values:
M = Observed Rd reads/total reads in this uptake sample (Rd/Rd+NP or Rd/Rd+GG) 
I = Rd reads/total reads in the corresponding input sample 
R = Rd reads/total reads in the corresponding Rd control sample
C = inferred Rd contamination level

I worked out the calculation by instead considering an imaginary sample whose contamination level C was known:




Calculation:
C = (M -I)/(R-I)


Coverage and contamination in our DNA-uptake dataset

We finally have the corrected reference genome sequences for our DNA uptake deep-sequencing data.  The genome sequences from NCBI had errors and polymorphisms relative to the specific strains we used in the experiments, so the former post-doc used the sequence data from this experiment to create reference sequences that perfectly match the DNAs we used.

The bright-coloured table below shows 8 of our 16 samples.  These are the samples that examined uptake of DNA from strain 'NP' (86-028NP) into strain 'Rd'.  Not included are the equivalent 8 sample that used DNA from strain 'GG"' (PittGG).  These reads have been mapped onto a reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome, and all reads that map equally well to both genomes have been removed (reasons for this are explained below).   (The numbers in the table are from our earlier preliminary data analysis, so they will probably change a little bit with the new analyses.)

Issues to consider:

Reads or spanning coverage:  Should our analysis use coverage data derived from considering each sequencing read independently (what we have been doing) or from the 'spanning-coverage' sequences created by considering each pair of paired-end reads plus its intervening sequences?


We agree that spanning coverage will be better.  The former post-doc was going to create the BED files with the spanning coverage data but he got bogged down in other work, so the grad student is now working on this.  For now we're using the read-based BED files.

Mean coverage:  The numbers in the 'NP mean' column give the average number of reads from each sample that aligned to each genome position.  Mean coverage ranges from 166 reads (sample UP2) to 482 reads (sample UP9).  This is plenty for our purposes.

Variation in coverage across the genome:  Our preliminary analyses did not give us a good view of this, for complicated reasons.  In the next day or so the grad student should be generating these graphs for the 'input' samples (the UP13 and UP15 samples mapped onto NP, and the UP14 and UP16 samples mapped onto GG).  This will let us see how even the coverage is across each genome, and how reproducible the variation is for each genome's two input samples (short and long) .

Our concern is that there may be positions or segments with very low coverage (less than 10 reads). Such low coverage would reduce our ability to detect the effects of DNA uptake biases, which requires dividing the mean 'uptake' coverage at each position by the 'input' coverage.  If the low coverage is due to random noise in the analysis, it will be evened out for the 'uptake data because we have three replicates for each DNA length,.  But we only have a single replicate for each DNA length, so the divisor would still retain all the noise.

The relative roles of noise and biases will be checked by comparing the long and short input samples to each other, and the replicate uptake samples to each other.  Given our overall high coverage thbig problem may not be random events but sequence biases in the procedures used for library creating and sequencing.  Unfortunately these would not be averaged out over the three uptake replicates.

One complication is the presence in the genomes of repeated sequences, especially the six copies of the 16S and 23S ribosomal RNA genes.  The mapping software doesn't know which copy reads from within these segments are derived from, so for each such read it randomly choses one repeat copy as the mapped location, and assigns the read a map quality score (MAPQ) of 0.  This random assignment is not a problem when the sample consists of reads from a single genome with no contamination, but it contributes to the difficulty of eliminating reads from Rd DNA that contaminates the uptake samples (described below).


Presence of contaminating Rd DNA in the uptake samples:  The 'uptake' samples consist of NP (or GG) DNA recovered after being taken up by Rd cells.  The recovery process is not perfect, and each sample contains a significant amount of contaminating Rd DNA, whose reads need to be removed before the effects of uptake biases can be determined.

The former post-doc found that the Rd reads could be removed by aligning each uptake sample to a fake reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome.  In this 'competitive' alignment, almost all the contaminating Rd-derived reads are mapped to the Rd 'chromosome' (where they can be ignored) and almost all the desired NP-derived reads are mapped to the NP 'chromosome' (where they can be analyzed for evidence of uptake biases).

Why 'almost all' rather than 'all':  Several factors will cause reads to be mapped to the wrong 'chromosomes' in this contamination-removal step.
  1. Sequencing errors:  First and least important, some NP-derived reads will have sequencing errors that create the Rd alleles of Rd-NP SNPs, causing them to align better to the Rd homolog.  This will slightly raise the apparent baseline level of Rd contamination but isn't a serious problem.  The reverse will also happen, but we can also ignore them, especially because Rd-derived reads are the minority.  
  2. Repeated segments:  Second, the rRNA genes of Rd and NP have identical sequences, so when these are randomly assigned locations (see above), half of them will be assigned to Rd and half to NP, regardless of the actual level of contamination, and all will be given MAPQ scores = 0.  This is potentially a serious problem, since it reduces the apparent NP-derived coverage of these regions by 50%.  Because there are no uptake sequences in the rRNA genes, provided we treat the input and uptake samples identically this will only affect our ability to see effects of uptake biases in sequences within a few hundred bp of the repeated segment.   
  3. Shared segments:  Third, although NP and Rd genome sequences differ, on average, by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical. The mapping software will treat reads from these segments just like it treats reads from the rRNA repeats, randomly placing them at one of the two possible positions (in Rd or NP) and assigning them MAPQ scores = 0.
(Factors 2 and 3 are not significant right at the edges of repeated and shared segments, but become more important as the distance from the nearest SNP approaches the average read length.  Use of spanning coverage rather than single reads should improve our ability to identify and remove the contamination.)
The former postdoc handled factors 2 and 3 by removing from the analysis all reads that had MAPQ=0.  The data in the table above shows the read coverages AFTER the reads with MAPQ=0 have been removed.  The UP13 and UP15 input samples still have a small fraction (about 0.1% ) of Rd contamination rather than 0% contamination.  This may be due to Factor 1 above.

For the ribosomal repeats (and other repeats) removing all the MAPQ=0 reads may be valid, because we genuinely do not know where they came from.  But it's more complicated for the shared segments.  If uptake coverage were even over the genome, we'd know that most of the reads came from NP.  But uptake coverage is not even, especially for the 'short' DNA samples (that's the whole point of the experiment).  Positions close to uptake sequences will have high NP coverage, and only a small contribution from Rd contamination.  Positions far from uptake sequences will have very low NP coverage, so the even contribution from the Rd contamination will have a big effect.




Does expression of the toxA operon depend on ToxT as well as ToxA?

Short answer:  Yes, but not in the way we expected.

First, here's a diagram showing the toxTA operon and the mutants we're examining:


The grey bars show the extents of the deletions.  The ∆toxT and ∆toxTA mutants have a SpcR/strR cassette inserted at the deletion point, but the ∆toxA mutant has only a short 'scar' sequence at the deletion point.

A few months ago I wrote a post about evidence that ToxA prevents transcription of the toxTA operon from an unexpected internal promoter.  Here's a better version of the graph I showed there (note that transcription is going from right to left):



It looks like there are two promoters.  The CRP-S promoter is competence-induced so inactive in the M0 samples and maximally active in the M2 samples.  The unexpected internal promoter (labeled 'toxTA' in red) is not active in wildtype but highly active in all the ∆toxA samples.  Really I should consistently refer to this as the 'putative' internal promoter, but I won't bother for this post.

The line heights in the above graphs are hard to interpret because of the log scale, so here's a linear scale view.



These expression patterns appear to show normal repression of the 'toxTA" internal promoter by ToxA, and release from repression in the ∆toxA mutant.  The internal location of the ToxA-repressed promoter is unexpected, but its repression by ToxA fits the pattern observed for the well-studied homologs of this type of toxin/antitoxin system.

In these homologs, transcription of the operon is actually repressed by a toxin-antitoxin complex, not by antitoxin alone.  So we would expect to see that repression of the internal toxTA promoter is also released by knockouts of toxT or of both genes.  But that's not what we see.

We can't check the effect of a toxT knockout on transcription from the internal promoter, because the internal promoter is deleted in our toxT mutants.  But we can check the effect of deleting both toxT and ToxA, because the deletion in our double mutant starts well downstream of the internal promoter.

Surprisingly, transcription from the internal promoter is not increased in the double mutant.  The first figure shows coverage for wildtype, ∆toxA and ∆∆toxTA at time point M0.  We see that the CRP-S promoter is inactive in all three strains, and the internal promoter is very active only in ∆toxA (purple).  It's actually less active in the double knockout (green) than in KW20 (blue).


This figure shows transcription of the same strains at time point M2, when the CRP-S promoter is most active.  Now we see transcription from the CRP-S promoter in all strains, although definitely weaker in ∆∆toxTA.  We again see strong transcription from the internal promoter in ∆toxA (purple), but very little in ∆∆toxTA.


Could some sort of artefact be responsible for the apparent lack of transcription from the internal promoter in the double knockout?  One likely candidate is read-mapping artefacts caused by the presence of toxTA deletions and insertions in the mutant reads but not in the wildtype genome sequence they are being aligned to.

So we could check for these effects, one of the former honours students took a a set of mutant-specific reference sequences for the toxTA region, and separately aligned each set of mutant reads to its corresponding mutant reference sequence, and explained to me how to examine the reads and coverages using the Integrated Genome Viewer (IGV).

For all the M0 (log phase growth) and M2 samples (max CRP-S induction) I noted the number of reads covering position 400 (at the toxT start codon, ~ 35 bp downstream from the CRP-S promoter, and covering position 500 (about 75 bp downstream from the internal promoter).  (I didn't bother analyzing M1 and M3 samples.)  I normalized each coverage value by the number of reads in the sample, and calculated the mean coverage over the three replicate samples for each time point.

Here's a linear-scale graph:

And here's the corresponding log-scale graph:

And here's the conclusions:


This rules out read-alignment artefacts as an explanation for the apparently low transcription from the internal promoter in the toxTA mutant, and from the CRP-S promoter in the toxT mutant.

I'm now going to go back and generate the data for M1 and M3.  Then I'll update this post.  For now, ignore the notes below.




OK, the data for the M1 and M3 time points don't change the conclusions at all.


So, questions we still don't know the answers to:

Why is competence-induced transcription from the CRP-S promoter down modestly in all the toxTA mutants?  (Compare actual mean M2 values: wildtype: 830, ∆toxA: 513, ∆toxT: 213; ∆∆toxTA: 173.)
In wildtype cells in log phase, neither antitoxin (ToxA) or or toxin (ToxT) are likely to be present.  In wildtype cells at M2, both proteins are likely to be accumulating.  We don't know whether there will be more of one than the other - usually the toxin is more stable, and the antitoxin is unstable unless it is bound to toxin.  We don't know what the HI0660 toxin's 'toxic' activity is, and we don't know of any other expected activity for the HI0659 antitoxin except binding toxin and repressing toxTA transcription.
The lower transcript levels in the mutants suggests that both ToxA and ToxT contribute positively to transcription from the CRP-S promotor, or to the stability of the resulting transcripts.  
Could toxin in wildtype cells, and unopposed toxin in ∆toxA, be somehow stabilizing transcripts?  Doesn't make much sense to me.
Why is log-phase transcription from the putative internal promoter way up in ∆toxA but way down in ∆∆toxTA?  (Compare actual M0 values: wildtype: 134; ∆toxA: 1713; ∆∆toxTA: 170.)
Taken at face value, this would seem to mean that ToxT actively stimulates toxTA transcription, or stabilizes another transcription factor.
I hope someone else has some ideas!

Learning to use the NCBI Gene Expression Omnibus

As part of our workup for the toxin/antitoxin manuscript, I want to find expression data for the homologs of the Haemophilus influenzae toxin and antitoxin genes.  The former post-doc recommends that I use NCBI's Gene Expression Omnibus ('GEO') for this.

I'll need to learn how to search the GEO for specific accession data and data from specific taxa.

I'll also need to find out the specific identifiers for the genes I'm interested in, in the species I'm interested in.  I think I can use BLAST searches (queried with the H. influenzae sequences) to find the species and links to the DNA sequences of the homologs, and then I can look at the gene records to find the strain and gene identifiers.

Then I need to check if anyone has reported doing gene-expression studies on this strain or species (ideally the same strain, but I think/hope the gene identifiers will be consistent across strains).  These reports should contain the GEO accession numbers for the data.

Then I can ask GEO for the data for this study.  But I don't know what format the data will be in, nor how hard it will be to find the information about the genes I'm interested in.

The best situation would be if GEO has done its own analysis on the data and made this available as a curated 'GEO Dataset' or a 'GEO Profile'.  I could then search the GEO Profile to see their analysis of the particular genes I want to learn about.

Here's a phylogenetic tree showing the relationships between the concatenated toxin/antitoxin sequences in the species they're found in.  (I should also have a tree showing the relationships of the species themselves, but I haven't drawn that yet.)



So I'll start by searching BLAST with the amino acid sequence of HI0660, the H. influenzae toxin, and limiting the search to Streptococcus species.  Because I want to find the genes, not the genomes, I'll do a BLASTP search against Streptococcus protein sequences in the database.  OK, this gets me species and accession numbers for the (mostly 'hypothetical') proteins, but not usually strain identifiers or gene names.  

Will I be able to use a gene accession number to search GEO?  Apparently yes!



Now let's try searching GEO for the gene accession number of a gene I'm sure it has in at least one Profile:  The H. influenzae CRP gene:


But GEO does find it if I search for it by its gene name: 'HI0957'.  


I think the red bars are expression levels in the six samples of the study.  Yes, and clicking on them gets me the actual data from the study.  Now I can see the Y-axis and discover that, although the red bars look dramatically different, the differences are actually quite small (range 6300-6300).



So this control search tells me that I need to know gene names (or whatever identifiers are being used) for the genes whose expression I want to learn about.  Hmmm...  I wonder if GEO provides any help for this issue.

I also tried starting with a TBLASTN search, which will get me the position in the genome where the homolog is encoded.  I can then look at the genome sequence, find the position, and see if the gene has a name.  The Streptococcus pneumoniae homolog of HI0660 (the H. influenzae toxin) is SP_1143, but searching GEO for this finds nothing

How do non-competence genes respond to competence inducting treatment?

For the RNAseq part of the toxin-antitoxin paper, we should describe what we learn about how transfer to the competence-inducing starvation medium MIV affects genes not known to be involved in competence.

The former undergrad left us with a set of Edge and DEseq2 analyses of changes in gene expression.  I discussed them here last summer (http://rrresearch.fieldofscience.com/2016/08/making-sense-of-rna-seq-comparisons.html).  Unfortunately I don't know how to properly interpret them.  The former post-doc suggested some analyses, but I'm reluctant to dive into these until I have a better idea of what I'd be getting into.

These analyses are comparing the time-course expression for the wildtype strain against a knockout-mutant strain.  So I think they're useful for identifying changes that are dependent on genes whose knockouts we've tested, but not for identifying genes that change for other reasons.

Below is a figure created in a different analysis.  This 'cluster' analysis creates subgroups of genes based on how their expression changes between the three sBHI time points or the four transfer-to-MIV time points.   Cluster I includes all of the previously identified CRP-S-regulated competence genes except ssb and HI1631, and two other genes, HI1384 and HI1385 (second figure below), which encode ferritin and are reported to be induced when iron levels are high.  ssb is consistently induced but usually no more than twofold, so it's lost its 'competence gene' status in this analysis.  HI1631 (unknown function) is induced about 10-fold with the typical CRP-S timing, but its maximum expression is quite low and it's put in cluster M.



But how can I find all the genes that are induced?  A folder called Find Significant' has a script that seems to function like Edge and DEseq2, comparing induction patterns between two strains.  There's a folder called Significant Gene Comparisons, but again it's comparing wildtype treatments to mutant treatments.

What we learned from the RNAseq data: Part 1

So, we did a massive RNAseq study of competence, with 124 samples of H. influenzae cultures at different growth stages, in the rich medium sBHI and the competence-inducing medium MIV, and with wildtype or mutant genes affecting competence.  (You can use the Search box to find all the previous posts about this work...)  




Here I specifically want to think about what we learned about competence from the competence-induced cultures (cells transferred from sBHI to MIV).  We sampled cultures at the T=0 point indicated by the star in the above diagram, and the 10, 30 and 100 minute times in MIV.  I'll also consider the results for strains with knockout mutations in the competence-regulating genes sxy and crp, but not for the toxin-and antitoxin mutants or hypercompetence mutants. 

I'm working on this because it will be a section in our manuscript about the competence-induced toxin-antitoxin system we've discovered.  After describing what the RNAseq results tell us about the general features of competence regulation, the manuscript will go on to describe what they tell us about expression of the toxA and toxT genes.

About 15 years ago we used microarray analysis in a very similar study (Redfield et al 2005); the diagram above shows the sampling strategy from that work.  Here I want to focus on how the new RNAseq results affect its conclusions - what's strengthened, what's changed.  This will be the first time we've described the RNAseq work in a publication, so it's important to present the basic information about the study.  Eventually there will be a second paper describing the the effects of the hypercompetence mutations on 'spontaneous' competence development in rich medium.

Because the T=0 samples were taken just before the cells were removed from rich medium, at OD600 = ~0.25, they give us gene expression levels for log-phase growth (when the growth rate is maximum, not limited by competition with other cells).  If we want we can strengthen these conclusions by also using data from the other 'arm' of the RNAseq study, because this took samples at a lower cell density (OD600 = 0.02).  If cells at both densities are truly in log-phase growth, their patterns of gene expression should be the same.  If they're not, this becomes a topic for the hypercompetence manuscript, not here.

A former Honours undergrad did a great deal of analysis of this work, and left us with excellent graphics and the R scripts that generate them.  Below I'll illustrate the findings with his figures and some the former postdoc made, and discuss whether we need to generate different ones.

Here's a graphic showing the relative expression of every CRP-S gene in the four time point samples (M0 is T=0, M1 is T=10 min, M2 is T=30 min, and M3 is T=100 min).  (Oops, I cropped off the Y axis - it's a log scale going up to 10^3, with the X axis crossing at 1.)  So we see that all genes are induced (most by 10-100 fold) at 30 min, and mRNA levels stay that high through 100 min.



Here's a figure showing how the regulatory genes change.  (Ignore the bars for the sBHI time points B1, B2 and B3.)  Basically, sxy mRNA is strongly induced within 10 min (M1), well before the CRP-S genes in the graphic above, and cya and crp expression changes only slightly.  The sxy mRNA isn't efficiently translated to Sxy protein unless purines are depleted. The drop in cya expression on transfer to MIV is expected, since it is known to be negatively autoregulated by CRP and cAMP.  The small but significant rise in crp mRNA was also seen in the microarray study - it's cause isn't known.



Here is a mock-up of an overview figure suggested by the former postdoc, with not quite the right data.  It illustrates how competence gene expression changes after 10 min, 30 min and 100 min in MIV, and how the 30 min changes are altered if sxy or crp is knocked out.  The tips of the grey lines represent expression changes of all the other genes; they'll be replaced by grey dots in the final version. 



The big question is: How does this advance on what we already concluded from the old microarray data?  I think I had better printout and carefully reread that paper, rather than relying on my memory.




A new plan for contamination correction

The grad student and I had an intense Skype discussion with the former postdoc this morning.  We realized that there's a much easier way to estimate the total Rd contamination in the uptake samples, so we're going to do this.
Preamble:  The former postdoc isn't convinced that this will be better than what we were doing. He likes the simplicity of using the dual-genome alignments directly to calculate uptake ratios, and doesn't think that all the missing data from multiply-mapped positions is a concern.  I don't like discarding what is otherwise good data, and hope that using all the data we can will reduce the noise due to low coverage positions in our uptake ratios.  (The grad student prudently kept his opinions to himself.)
Easy contamination estimate:  We actually already have this data for the eight NP samples, and can easily get it for the eight GG samples.  Way back, the postdoc determined how many of the reads in each sample mapped to the Rd part of the concatenated Rd-NP genome, and how many mapped to the NP part (table below).  Reads from repeats and no-SNP segments were excluded.  The fraction mapping to Rd (Rd/total in the table below) is the fraction of Rd DNA contaminating the sample.  (This data is from a file named meanreads per genome.xlsx.  UP1, 2 and 3 are the replicate long-fragment uptake samples, UP4, 5, and 6 are the replicate short fragment uptake samples, and UP13 and UP15 are the corresponding long and short input samples.)


It's possible that different genome regions are contaminated to different extents (e.g. if some Rd and/or NP sequences are poor substrates for library formation or sequencing), but we'll proceed on the assumption that contamination level is fairly even across the genome, perhaps checking later**.

How to correct the uptake coverage levels for Rd contamination:  We need to only correct the parts of the genome where Rd reads are aligning.  The obvious way to identify these regions is to align reads from pure Rd DNA to the NP reference genome.  At each position the relative coverage is a measure of the extent of contamination we expect due to the Rd contamination in the uptake samples.  Nicely, this coverage measure will intrinsically include the effects of any differences in the sequence-ability of different parts of the Rd genome.

So:
  1. (blue figure )  Align the reads from a pure Rd sample to the NP genome.  (I don't think an Rd control was included in the sequencing done specifically for this project, but we have sequenced pure Rd DNA in other experiments.)  The reads from Rd-unique parts of the Rd genome will be discarded as unmappable.
  2. (blue figure ) Record the Rd coverage at each position.
  3. (yellow figure ) For each NP uptake sample to be corrected for contamination, align its reads to the NP reference genome and record the coverage at each position.  Calculate the average coverage (or normalized coverage) across the whole genome.  
  4. (yellow figure ) Use the NP uptake sample's contamination level (from the table above) to calculate the average amount of this coverage that is due to Rd contamination.  (E.g., if sample UP3 has average coverage of 350 when aligned to NP alone, then 350* 0.0884 = 30.9 of this comes from contamination.) 
  5. (blue figure ) Normalize the position-specific Rd contamination levels (from point 2 above) to the Rd read coverage predicted to be due this contamination level (from point 4 above, 30.9 in the example).
  6. (yellow figure ) Subtract the normalized Rd contamination from the total coverage at each genome position to get its contamination-corrected coverage.


If the uptake coverages weren't normalized at step 3, normalize them before combining them to get the mean coverages over the three uptake samples.

Then compare the resulting uptake-ratio maps to those we previously obtained.

Now, let's see what the former postdoc and grad student think of this plan.  It might not be exactly what we discussed.


** How to check if contamination is evenly distributed across the genome:  The former postdoc has a list of ~34,000 well-behaved SNPs that distinguish Rd and NP.  After mapping an NP uptake sample onto the NP reference genome, at each position we can score the frequency of the Rd and NP alleles.  Since the background error rate is low, we can just take the fraction of Rd alleles as the contamination level at each position.  Now map these contamination levels across the NP genome.  We expect to see a fairly constant level.  At positions where Rd has no homolog to the NP sequence, there will be no data points because there will be no SNPs to score.  There will certainly be noise in these levels.  But maybe we'll see smooth variation that suggests that some parts of the NP genome have higher Rd contamination than other parts.

How will contaminating Rd reads map onto NP?

Based on the analysis in the previous post, we can calculate lower-bound and upper-bound estimates for the Rd contamination levels  for each of our 12 'uptake' DNA samples (3 replicates each of 4 treatments).  And we can correct the lower-bound estimate (and maybe the upper-bound one too) to get an estimate of the true contamination level for each sample. But what do we do with this information?

To think about this, first consider how we expect the contamination to affect the apparent coverage for each position in the NP (or GG) genome.  

We can map each sample just to its own NP genome.  The NP-derived reads (bright red and blue) will map to their correct positions (at internal repeats we'll see the average coverage).  Now the Rd-derived reads from positions that have strong similarity to NP locations (dark red in the figure) should map to their homologs, including all the Rd reads from repeats and no-SNP segments.   I think that the Rd-derived sequences that don't have NP homologs (yellow in the figure) and thus can't be mapped onto NP will be unmappable and given Q=0 scores. This will be about 10-15% of the Rd reads (a known value).  So most NP positions will have their NP read coverage plus (say) 20% extra coverage from the Rd reads. But some NP positions (again 10-15%) don't have Rd homologs (blue), and these will have only their NP coverage.




Grad student and former postdoc, how should we handle this?  If we have a table listing which NP positions have no Rd homolog, we could just do the contamination correction on the positions that do have homologs.  But how well this works depends on how well we understand the limits of the algorithm that maps the reads onto the reference sequence.

Try #3: Analyzing the uptake data

OK, here's another try at understanding the interactions of repeated sequences, reads with no SNPs, and RD contamination in our DNA uptake sequencing data.

A reminder of what the problem is that I'm trying to solve:  We have three 'uptake' samples for each treatment in our big analysis of DNA uptake specificity, but they can't be directly compared because, in the different samples, the recovered 'donor' DNA (from strain NP) probably has different levels of contamination by the recipient strain Rd.  We have no direct way to measure this contamination, but I think we can infer it from other features of the samples.  (The next post considers how to correct for the % contamination.)

This time I've simplified the diagrams by considering only a short segment containing one rDNA repeated sequence (purple) and one segment with no SNPs (green).  When the reads from the contaminated DNA are aligned to the concatenated Rd-NP genomes, reads from both repeats and no-SNP locations will be 'multiply-mapping', and the alignment program will randomly assign each of these reads to one of the places they align with. 

Diagram A. The first diagram is alignment of reads of the NP input sample to the concatenated Rd-NP genomes.


At the rDNA repeat and the no-SNP segment the coverage mapped to NP is reduced to 50%, and the coverage mapped to Rd is increased to 50%.  These 50% coverages will be constant over all repeats and no-SNP segments.

The multipy-mapping reads (repeats and no-SNP) will be given quality scores Q= 0. Some reads that are poor quality for other reasons may also be given this score - the grad student is going to check how big a contribution this is.  We can then calculate the fraction of reads that are multiply-mapping and use this later to improve our estimates of the Rd contamination in the uptake samples.

The next two diagrams separately show the expected mapping of each of the components of an NP uptake sample that is contaminated with 20% Rd DNA.

Diagram B.  Here's the expected NP contribution:


All the unique NP reads map at the unique NP positions; this is only 80% of the coverage because 20% is Rd contamination.  Only 40% of the multiply-mapping reads are assigned to the multiply-mapping positions in the NP genome.  The other 40% are assigned to their Rd homologs.

Diagram C.  Here's the expected Rd contribution:


The other 20% of the reads map to Rd.  At unique Rd positions this gives 20% coverage, but at the multiply-mapping positions this is reduced to 10% because the other 10% are assigned to their NP homologs. 

Diagram D.  Now here's the total mapping of the NP uptake sample (including the contamination):  (This is what we would actually see when we aligned an NP uptake sample with 20% Rd contamination to the concatenated Rd-NP genomes.)


The NP-aligned sequences stay at 80% coverage for the unique NP sequences, but coverage at the multiply-mapping segments increases to 50% due to the 10% from the Rd sequences.  Similarly the coverage at the Rd unique sequences stays at 20%, but increases to 50% at the multiply-mapping segments.

OK, how can we use this analysis to work backwards?  We need to start with Diagram D and infer that there is the 20% Rd contamination assumed in Diagrams B and C .

If the data were as clean as I've drawn it we could just look at the Rd coverage in Diagram D and fit a horizontal line to the 'typical' 20% coverage.  But the real data may be much noisier than this, so it will probably not be possible to decide where to put such a line.

What if we recalculated Diagram D, this time excluding all the reads with Q=0 scores?

Diagram E. 


Perhaps now we can draw the desired line through the high-coverage positions.  But this is still a bit subjective, so here's another approach:

First, we calculate mean coverage of the uptake sample over the Rd genome (Q=0 reads excluded).  This gives us a lower bound for the true contamination.  

Then, from Diagram D, we calculate mean coverage of the uptake sample over the Rd genome including the Q=0 reads.  This gives us an upper bound for the true contamination.  

Now, can we refine these estimates to get the actual contamination?  To do so we need to know what fraction of Rd positions in Diagrams D and E were excluded by excluding multiply-mapping reads, and what fraction were over-covered by including them.

Back at Diagram A, we calculated the fraction of reads that had Q=0 scores due to multiple-mapping.  If we assume that true coverage is independent of SNPs and repeats, this fraction should also approximate the fraction of the genome whose coverage was lost due to multiply-mapping reads (call it '%Q').  We can use this %Q to correct the contamination estimates.  

Correction for lower-bound estimate:  Divide by (1 - %Q).

Can we also correct the upper bound estimate?  I don't think so (at least not simply), because the degree of over-coverage at the multiply-mapping positions is not fixed - the higher the true contamination is, the smaller the over-coverage effect.  That is, we know that the multiply-mapping positions will have 50% coverage, but unless we already know the amount of contamination we can't correct this for the amount of contamination. 

However we do have a refined contamination estimate from the lower-bound estimate (call it 'Cu'), and I think we can use this to correct our upper-bound estimate.  If we've done this right we should get the same Cu value back.

Correction for upper-bound estimate:  Subtract (%Q(50%-Cu)).

Will the real value be midway between the estimates?  No, it will be closer to the lower bound.  The smaller the true contamination is, the more it will be overestimated by including the multiply-mapping reads.

Then what?  Hmm, I see that I haven't thought this part through.  I think I'll start a new post for that.

We need to do the lower-bound and upper-bound calculations and corrections for each of our 12 'uptake' samples (3 replicates each of 4 treatments).  This will give us good estimates of the % contamination for each sample.

e can map each sample only to its own Np genome (or GG).  Now the Rd-derived reads from positions that have strong similarity to NP locations should map there, and the Rd-derived sequences that cannot be mapped onto NP will be given Q=0 scores.


 and reduce the coverage levels in proportion to the contamination.

How to analyze the DNA uptake data

I'm working on getting a clear plan for how we will infer uptake bias for each genome position (genome-wide uptake ratios) from our DNA uptake sequencing data.

In principle, at each position in the genome, we just divide the sequencing coverage for the recovered DNA samples (taken up by cells) by the coverage for the input DNA.

(We are using DNAs from two donor strains 'NP' and 'GG').  Below I'll just describe things for NP, but exactly the same issues apply to GG.)

Complications (some solved, some not):
  1. Reads from repeated sequences such as ribosomal RNA genes (5 copies) can't be mapped to a single position in the reference genome.  Solution: the mapping software randomly assigns ambiguous reads to one of the possible sources, and flags the read with a quality score = 0.
  2. We have 3 replicates for each uptake treatment, but only one for each input DNA prep.  Solution: take the mean coverage over the 3 uptake samples
  3. Different samples have different total numbers of reads.  Solution:  Normalize each coverage to coverage per million reads.  Do this before calculating the means.
  4. The recovered donor DNA (strain NP) in the input samples is contaminated with a small fraction (5-17%) of DNA from the recipient (strain Rd).  Solution: identify the Rd-derived reads by aligning the sample reads to a concatenated Rd-NP double-genome.  Ignore the reads that align to Rd.
  5. This dual-alignment creates a new problem:  Any NP reads that don't include at least one SNP position distinguishing NP from Rd now align ambiguously.  Half of these NP reads will be mistakenly mapped to the Rd genome and ignored, and they all will have quality scores = 0.  A quick check suggests that this mis-mapping affects a substantial fraction of the genome.
  6. Because we need to divide the uptake coverages by the input coverage, we have been doing the dual-alignment on the input DNA reads even though we know that they have no Rd contamination.
Currently we have simply discarded all reads with Q=0 before dividing mean NP-specific uptake coverage by NP-specific input coverage to calculate the uptake ratios. Unfortunately discarding Q=0 reads means that our input data has many segments with very low or zero NP-specific coverage (everywhere that that Rd-NP sequence identity is high), which makes our uptake ratio calculations unreliable.  We suspect that the actual coverage of the input DNA is much more even.

Now that we better understand what's going on (see previous post), I'm hoping we can come up with a better way to do the analysis. 

We need to:
  • Identify where NP repeat sequences are being ambiguously aligned.  If half of them are being mis-assigned to Rd we need to correct for this, and we need to note that the coverage values for these segments are the repeat averages.
  • Identify where unique NP reads are being ambiguously aligned because they are identical to Rd sequences, and correct for the 50% mis-mapped reads.
  • Identify how much Rd contamination is in our uptake samples, and correct for this contamination.
If we do this properly, I think we can calculate our uptake ratios without discarding the Q=0 reads. 

I.  Normalize the coverage:  Align the reads in all NP samples to the NP reference genome.  This genome will have been tweaked to be identical to the input genome.  For each sample, normalize the coverage at each position to coverage per million reads (or whatever is customary or convenient).

II.  Examine coverage for the input sample (blue boxes below):  Only reads from repeated sequences (internal-repeat reads) will give ambiguous alignments and Q=0 scores (grey boxes in the maps below).  Examine genome coverage of only the Q=0 reads.  Are these just at the rDNA operons?

III. Align input reads again, this time to the concatenated Rd-NP dual genome (grey box below):  Normalize the coverage as before.  Now all the reads that don't contain SNP positions will align ambiguously and have Q=0 scores (as will the internal-repeat reads).  Record the normalized coverage by these Q=0 reads on the Rd and NP genomes (yellow and orange boxes below).  


IV.  Realign uptake reads to the concatenated Rd-NP dual genome:  Separately align each uptake sample's reads to the concatenated Rd-NP dual-genome (big green box below).  Normalize the coverage as before.  The rDNA segments will have higher Rd coverage than the rest.  

V.  Estimate % contamination in each uptake sample:  From each samples' normalized Rd coverage values (green line in the green box), subtract the normalized Rd Q=0 coverage (grey line in the green box, values from Step III, little yellow box) to get the levels of Rd coverage due to contamination.  Use these values to estimate a mean contamination level for each uptake sample (dashed green line in little green box).

VI. Correct coverage of uptake samples for contamination:  For each uptake sample, multiply each position's normalized coverage (from step I) by (1 - contamination) to get corrected coverage values (pink box). 


VII.  Calculate mean uptake coverage:  At each genome position, take the average of the three corrected normalized coverages.

VIII. Calculate uptake ratios:  Divide the mean uptake coverage (from step VII) by the normalized input coverage (from step I).

Now I just need the grad student and former postdoc to read this and check my logic. 






A new kind of problem

The PhD student and I are analyzing the data from his mapping of H. influenzae's uptake of genomic DNA.

The data was generated by Illumina sequencing of genomic DNA samples before and after they had been taken up by competent cells.  Using a rec2 mutant as the competent cells lets us recover the DNA intact after uptake.

He has done quite a bit of analysis of the resulting uptake ratios (ratio of 'recovered' coverage to 'input' coverage) but now we're going back and sorting out various anomalies that we initially ignored or overlooked.

One big issue is the surprisingly uneven sequencing coverage of the genome that we see even when just sequencing sheared genomic DNA (the 'input' samples).  The graph below shows the sequencing coverage of the first 10 kb of the genome of strain 'NP'.  The orange dots are from a short-fragment DNA prep (sheared, most fragments between 50 and 500 bp) and the blue dots are from a large-fragment DNA prep (sheared, most fragments between 1-5 and 9 kb).


Over most of this segment (and over most of the ~1900 kb genome), coverage of the 'short' sample is about 200-400 reads, about twice as high as the ~150-250 read coverage of the 'long' sample.  But in three places coverage of both samples falls to near-zero or zero.  Similar losses of coverage occur at many places throughout the genome, though this segment has more than usual.

What could cause this?  In principle it could be something about the DNA preparations, about the library construction, about the actual sequencing, or about the data processing and alignments to the reference genomes.  
  1. DNA prep artefacts: Long and short DNAs come from the same initial genomic DNA prep.  They were sheared differently: a Covaris G tube for the long-fragment prep, sonication for the short-fragment prep.  Perhaps there are places in the genome where breaks always occur?
  2. Library construction artefacts: This used Illumina Nextera XT kits (a transposon method).  Might there be strong sequence biases to this process?  The distribution of the spanning coverage of the reads indicates that the short library inserts were mostly 100-350 nt (mean about 200 nt) and the long library inserts were mostly 200-500 nt (mean about 325 nt).
  3. Sequencing artefacts:  The former post-doc used an Illumina NextSeq500 to collect 2-3 million paired-end reads of 2x150nt from each sample.  
  4. Analysis artefacts: We've been tracking down an artefact caused by indel errors in another genome used in this experiment, but the NP strain I'm analyzing now doesn't have these errors.  However it's possible that something about this reference sequence is causing reads that align to the dip positions to misalign or be discarded.

Very few fragments in the short DNA prep were longer than about 500 bp, and few library molecules had inserts longer than 350 bp, but one of the dips in coverage extends over more than 1 kb.  This means that the sequence feature responsible for this dip must either extend over more than 400 bp, or must cause loss of sequence over this distance.  

One thing I noticed is that the short coverage is more severely affected than the long coverage.  It falls lower and extends broader.  This is the opposite of what I would expect if reads containing a particular sequence were unsequenceable or misaligned.  This pattern shows very clearly in the graph below, which shows the ratio of coverage by the long-fragment and short-fragment samples. We see that the long coverage is generally about 2-fold higher than the short coverage, but that the ratio flips and falls to zero or near zero at the three low-coverage segments.

 

One thing we need to do is examine the locations of reads that were discarded because of low quality scores, either because of sequencing problems or because of ambiguous mapping (e.g. rDNA repeats).  Finding that reads at the dip positions were low quality would be a big clue.

Next day:  After a Skype conversation with the former post-doc I think I understand what's going on.  The dips aren't positions of low coverage, but positions were reads have been excluded from the dataset because of planned analyses using samples contaminated with DNA of the recipient strain Rd.

To remove the contaminating DNA, the reads have been mapped not just to the donor strain genome but to a concatenated fake genome consisting of both the donor and recipient genomes.  In the 'uptake' samples, the contaminating Rd reads align to the recipient (Rd) genome, not the donor genome.  But any reads spanning segments where the donor and recipient sequences are identical map 'ambiguously' to both genomes, and these are flagged as 'data quality = 0' and excluded from the analysis.  (The same 0 flag is given to repeated sequences (mainly the rDNA genes) because they also map to multiple positions, this time within a single genome.)

So the dips aren't places where there is no sequencing coverage, just places where the donor and recipient genomes are identical over the length of the pair of reads.

Now the difference between the long-fragment and short-fragment samples (orange and blue dots) makes sense!  The short fragment sample gave a sequencing library with shorter inserts (shorter spanning coverage), and shorter sequence intervals are more likely to be identical between the donor and recipient genomes.

-->