Field of Science

What's noise, what's Illumina bias, and what's signal?

The PhD student and I are trying to pin down the sources of variation in our sequencing coverage. It's critical that we understand this, because position-specific differences in coverage are how we are measuring differences in DNA uptake by competent bacteria.

Tl;dr:  We see extensive and unexpected short-scale variation in coverage levels in both RNA-seq and DNA-based sequencing. Can anyone point us to resources that might explain this?

I'm going to start not with our DNA-uptake data but with some H. influenzae RNA-seq data.  Each of the two graphs below shows the RNA-seq coverage and ordinary seq coverage of a 3 or 4 kb transcriptionally active segment.

Each coloured line shows the mean RNA-seq coverage for 2 or 3 biological replicates of a particular strain.  The drab-green line is from the parental strain KW20 and the other two are from competence mutants.  Since these genes are not competence genes the three strains have very similar expression levels.  The replicates are not all from the same day, and were not all sequenced in the same batch.  The coloured shading shows the standard errors for each strain.

We were surprised by the degree of variation in coverage across each segment, and by the very strong agreement between replicates and between strains.  Since each segment is from within an operon, its RNA-seq coverage arises from transcripts that all began at the same promoter (to the right of the segment shown).  Yet the coverage varies dramatically.  This variation can't be due to chance differences in the locations and endpoints of reads, since it's mirrored between replicates and between strains.  So our initial conclusion was that it must be due to Illumina sequencing biases.  

But now consider the black-line graphs inset below the RNA-seq lines.  These are the normalized coverages produced by Illumina sequencing of genomic DNA from the same parental strain KW20. Here there's no sign of the dramatic variation seen in the RNA-seq data.  So the RNA-seq variation must not be due to biases in the Illumina sequencing.


How else could the RNA-seq variation arise? 
  • Sequence-specific biases in RNA degradation during RNA isolation?    If this were the cause I'd expect to see much more replicate-to-replicate variation, since our bulk measurements saw substantial variation in the integrity of the RNA preps.
  • Biases in reverse transcriptase?  
  • Biases at the library construction steps?  I think these should be the same in the genomic-DNA sequencing.

Now on to the control sequencing from our big DNA-uptake experiment.

In this experiment the PhD student mixed naturally competent cells with chromosomal DNA, and then recovered and sequenced the DNA that had been taken up.  He sequenced three replicates with each of four different DNA preparations: 'large-' and 'short-' fragment preps from each of two different H. influenzae strains ('NP' and 'GG').  As controls he sequenced each of the four input samples.  He then compared the mean sequencing coverage at each position in the genome to its coverage in the input DNA sample.

Here I just want to consider results of sequencing the control samples.  We only have one replicate of each sample, but the 'large' (orange) and 'short' (blue) samples effectively serve as replicates.  Here are the results for DNA from strain NP.  Each sample's coverage has been normalized as reads per million mapped reads (long: 2.7e6 reads; short: 4.7e6 reads).
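
The normalization is simple enough to sketch (function name and numbers here are just illustrative, not our actual pipeline):

```python
# Hypothetical sketch of reads-per-million normalization.
def normalize_rpm(coverage, mapped_reads):
    """Scale per-position coverage to reads per million mapped reads."""
    return [c * 1e6 / mapped_reads for c in coverage]

# e.g. the 'long' sample had ~2.7e6 mapped reads:
normalize_rpm([27, 54], 2.7e6)   # -> [10.0, 20.0]
```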

The top panel shows coverage of a 1 kb segment of the NP genome.  Coverage is fairly even over this interval, and fairly similar between the two samples.  Note how similar the small-scale variation is; at most positions the orange and blue samples go up and down roughly in unison.  I presume that this variation is due to minor biases in the Illumina sequencing.

The middle panel is a 10 kb segment.  The variation looks sharper only because the scale is compressed, but again the two traces roughly mirror each other.

The lower panel is a 100 kb segment.  Again the variation looks sharper, and the traces roughly mirror each other.  Overall the coverage is consistent, not varying more than two-fold.

Now here's the corresponding analysis of variation in the GG control samples.  In the 1 kb plot the very-small-scale position-to-position variation is similar to that of NP and is mirrored by both samples.  But the blue line also has larger-scale variation over hundreds of bp that isn't seen in the orange line.  This '500-bp-scale' variation is seen more dramatically in the 10 kb view.  We also see more variation in the orange line than was seen with NP.  In the 100 kb view we also see extensive variation in coverage over intervals of 10 kb or larger, especially in the blue sample.  It's especially disturbing that there are many regions where coverage is unexpectedly low.

The 500-bp-scale variation can't be due to the blue sample having more random noise in read locations, since it actually has four-fold higher absolute coverage than the orange sample.  Here are coverage histograms for all four samples (note the extra peak of low coverage positions in the GG short histogram):

If you've read all the way to here:  You no doubt have realized that we don't understand where most of this variation is coming from.  We don't know why the RNA-seq coverage is so much more variable than the DNA-based coverage.  We don't know how much of the variation we see between the NP samples is due to sequencing biases, or noise, or other factors.  We don't know why the GG samples have so much more variation than the NP samples and so much unexpectedly low coverage.  (The strains' sequences differ by only a few %.)

We will be grateful for any suggestions, especially for links to resources that might shed light on this. 

Later:  From the Twitterverse,  a merenlab blog post about how strongly GC content can affect coverage: Wavy coverage patterns in mapping results.  This prompted me to check the %GC for the segment shown in the second RNA-seq plot above.  Here it is, comparing regular sequencing coverage to %GC:

I don't see any correlation, particularly not the expected correlation of high GC with low coverage.  Nor is there any evident correlation with the RNA-seq coverage for the same region.
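
For anyone wanting to repeat this check, here's a minimal sketch of the windowed %GC-vs-coverage comparison (plain Python, no genomics libraries; the 100 bp window size is arbitrary):

```python
# Sketch only: windowed %GC vs mean coverage, then a Pearson correlation.
def gc_fraction(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed(values, size, step):
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# With a genome string and a same-length coverage list:
# r = pearson([gc_fraction(w) for w in windowed(genome_seq, 100, 100)],
#             [sum(w) / len(w) for w in windowed(coverage, 100, 100)])
```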

Do the rpoD hypercompetence mutations eliminate the normal diauxic shift?

I've been going over the RNA-seq data for our rpoD1 hypercompetence mutant, looking for changes in gene expression that might help us understand why the mutation causes induction of the competence genes in rich medium.

Here's a graph showing the results of DESeq2 analysis of the expression differences between the wildtype strain KW20 and rpoD1 cells at timepoints B1 and B2.  B1 is true log phase growth in rich medium; OD = 0.02.  B2 is OD = 0.6, when the cells are just starting to modify their growth in response to changes they've caused to the medium.  Both axes show how the rpoD1 cells differ from KW20.  The X-axis is differences at B1, and the Y axis is differences at B2.

Phenotypically, rpoD1 cells are not noticeably different from KW20 at timepoint B1.  They grow at almost the same rate, and they're not competent.  This similarity is also seen in the lack of horizontal spread of the points in the RNA-seq graph; very few genes are more than twofold different between rpoD1 and KW20.  

But at B2 the  differences are larger, as indicated by the greater spread along the Y axis.  sxy mRNA is up about 3 fold, and the competence genes are increased 3-4-fold (in the green circles).  This is expected, since we know that the rpoD1 cells are competent at this stage.  The gcvB gene (a small regulatory RNA) is also up, but I haven't found any consequences of this (need to look more).  

The only other substantial change (and the largest change) at B2 is a cluster of 7 genes (HI1010-HI1016, in the purple circle) which are down 4-15-fold relative to KW20.  In KW20 these genes are induced briefly at B2 and then shut off again at B3 (OD = 1.0), but in rpoD1 their expression stays low.  The graph below illustrates this for gene HI1010, the first gene in the cluster.  (Look only at the first three time points; the others are cells in the competence-inducing medium MIV.)

What's going on around this time point that could be altered in the rpoD1 mutant?  We know that when cultures of wildtype cells reach this density they have begun to change their gene expression - they're not in true exponential growth any more (not in true 'log phase').  

Bioscreen growth curves of KW20 cultures consistently show a blip around OD = 0.6 (red arrows in the graph above), where growth briefly pauses and then resumes at a slower rate.  This type of growth has been given the name 'diauxy', and the blip represents a 'diauxic shift', a brief slowing or cessation of growth while cells shift from one resource to a different resource.

The change in growth rate is more obvious in the version shown below.  It uses a log scale for the Y axis, so periods of exponential growth appear as straight lines.  It's easy to see the initial period of exponential growth (red dashed line), where cell density doubles about every 35 minutes.  After the blip, growth resumes at a slower but still roughly exponential rate (blue dashed line), and then gradually stops as conditions become less supportive.
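
The doubling-time estimate can be read straight off any two OD readings taken within the straight-line stretch; a small sketch (assuming clean exponential growth between the two readings):

```python
from math import log

# Doubling time from two OD readings (times in minutes), assuming
# exponential growth between them.
def doubling_time(t1, od1, t2, od2):
    return (t2 - t1) * log(2) / log(od2 / od1)

doubling_time(0, 0.02, 35, 0.04)   # one doubling in 35 min -> 35.0
```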

Here's a graph showing all the Bioscreen traces from a different experiment, again showing the diauxic shift.  It appears to occur at a higher OD this time only because the student who made the graph didn't correct for the baseline OD of the culture medium. 

So my hypothesis is that the transient expression of HI1010-HI1016 at the B2 time point is associated with this diauxic shift.  I predict that the lack of the transient expression in the rpoD1 mutant will abolish the diauxic shift - there will be no blip in rpoD1 cultures.

I'm doing a Bioscreen run right now to test this hypothesis, comparing KW20, rpoD1 and rpoD2.  But while I was writing this post I did some digging around and found two relevant results from previous work by undergraduates in the lab.

The first graph compares KW20 to all three types of hypercompetence mutant.  The dark red line is rpoD1; consistent with my hypothesis it's the only strain that doesn't show the diauxic shift.  The second graph compares KW20 to rpoD1 and rpoD2, just like my present run.  This graph uses a log scale  so the shift appears higher in the curve, and appears to occur for all three strains.

And here are my new results (replicates with three slightly different batches of sBHI media):

Conclusion:  I was wrong.  The rpoD mutants are just as likely to show a clear diauxic shift blip as KW20.

Just to further complicate the picture, here's a Bioscreen run using a quite different strain, a clinical strain called 86-028NP, whose DNA sequences differ by 2-3% from the KW20 sequences: there's no sign of a diauxic shift.  Maybe KW20's diauxic shift was selected for by many generations of growth in lab cultures!

Later:  I reran the Bioscreen runs using Medium batch A, this time taking readings every 4 minutes instead of every 10 minutes in order to better resolve the diauxic shift.

The diauxic shift is very evident in the linear-scale (upper) plot, and appears to be identical in all three strains.  The log-scale plot (lower) unexpectedly shows rpoD1 growing more slowly than the others before the shift.  (Unexpected since this was not seen in the first experiment.)

Analysis of NP-GG differences (I can't help myself!)

Despite my sensible conclusion to the previous post, I've rushed in with a bit of analysis of the reasons for the differences between the NP and GG uptake-ratio peaks.

I was able to do this because the PhD student just posted two new graphs, showing the uptake peaks in syntenic 20 kb segments of the NP and GG genomes.

The peaks for the two genomes are in the same places because the underlying DNA sequences are very similar.  Most of the peaks also have similar heights in the two genomes, with two obvious exceptions (labelled Discordant peak 1 and Discordant peak 2).  Here are those peaks side-by-side, to the same scale:

To look for sequence differences that could explain these uptake differences, I copied the corresponding DNA sequences for these regions from Genbank and examined them for USS.  I easily found good matches to the USS motif at (approximately) the centers of both peaks. 
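
Finding candidate USSs this way just means scanning both strands for the 9-bp core consensus (AAGTGCGGT); a minimal sketch:

```python
# Scan both strands for exact matches to the 9-bp USS core.
CORE = "AAGTGCGGT"   # H. influenzae USS core consensus
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def find_uss(seq):
    """Return (position, strand) for every exact core match."""
    hits = [(i, "+") for i in range(len(seq) - len(CORE) + 1)
            if seq[i:i + len(CORE)] == CORE]
    rc = revcomp(seq)
    hits += [(len(seq) - len(CORE) - i, "-")
             for i in range(len(rc) - len(CORE) + 1)
             if rc[i:i + len(CORE)] == CORE]
    return hits
```

(A real scan would score against the full motif logo rather than demand exact core matches, but this is enough to locate peak centers.)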

Here are the GG and NP sequences for Peak 2, which has the bigger difference in height.  I've included a logo showing the USS-uptake motif we determined earlier.

There are lots of differences over this 66 bp segment.  None are in the 9 bp USS core, but there are 4 base substitutions and a single-base deletion in the 'unimportant' parts of the motif, and 5 more substitutions nearby.  In principle any of these differences could be responsible for the uptake difference.

But here are the corresponding sequences for Discordant peak 1.  (It's in the other orientation in the genome.)

This is completely different from Peak 2.  There's only one difference between the GG and NP sequences, and it's outside of the USS motif.

Might the sequences outside of the known USS motif be important after all?  Here is a comparison between the USSs of Peak 1 and Peak 2.  (To get both USSs in the same orientation I took the reverse complements of the Peak 2 sequences.) 
The orange vertical lines indicate positions where the Peak 1 and Peak 2 sequences differ.  Outside of the USS there are more differences than identities; we expect this because these sequences are unrelated.  Peak 2 is in an acetyltransferase gene, and Peak 1 is in a helicase gene.

So, this analysis didn't find any sequence differences likely to explain the uptake differences.  We certainly need to repeat this for other syntenic segments (= most of the genomes).  And we should examine individual discordant peaks at higher resolution, to see if the peaks in both NP and GG are centered on exactly the same sequences.

What about the possibility that the genomes have methylation differences that cause the uptake differences?  That's certainly possible - I wonder if there's an easy (bioinformatics) way to check.

p.s.  The PittGG annotation in Genbank is a mess.  I spent 2 hours figuring out why the segments appeared to have different genes.

Unexpected differences in uptake of DNA from two closely related strains

The PhD student's long careful reanalysis of the DNA uptake data has finally produced uptake ratio plots.  These confirm a surprising difference between the DNAs from two closely related strains, 86-028NP ('NP') and PittGG ('GG').  We also saw this difference in our preliminary analysis, but we thought it might be an artefact of how the analysis was done.

In the experiment underlying this data, cells of a third strain, KW20, took up DNA that had been purified from NP or GG cells.  We recovered the taken-up DNA and sequenced it, comparing how well each position in the ~1,800,000 bp genome was represented in the 'uptake' DNA relative to parallel sequencing of the 'input' NP or GG DNA.

We expected to see peaks and valleys of high and low DNA uptake, because we knew:

  • that the DNA of each strain contains many occurrences of a short sequence that's strongly preferred by the DNA uptake machinery ('uptake sequences'),
  • that the DNA had been broken into fragments so small that most of them wouldn't contain this sequence.

The two strains' DNA sequences are only 2-3% different, and we've found that uptake sequences are usually less variable between strains than other sequences.  Thus we expected the overall pattern of uptake to be very similar between the two strains (approximately the same number of peaks, and approximately the same distribution of peak heights).


The graphs below show that the numbers of peaks are quite similar, but their height distributions are not.  For DNA from strain NP (upper graph), most of the peaks have quite similar heights, and almost all are between 3.5 and 4.5.  But DNA from strain GG (lower graph) has much more variation, with many peaks below 2.5 and many higher than 5 or even 10.
Below is the same data, this time plotted on log scales.  This lets you see how deep the valleys are, and how high the highest GG peaks are.

Cause of the strain differences?

We don't know what causes this difference.  We'd expect it to be differences in the sequences of the two genomes, since both DNAs were highly purified before use.  But it could be a methylation difference, since the two strains might contain different methylation systems, especially those associated with restriction-modification genes.

In principle, sequence differences in the uptake sequences could accumulate over evolutionary time if one strain had lost the ability to take up DNA.  But in lab experiments strains GG and NP both transform poorly relative to the highly transformable lab workhorse strain KW20 (NP a bit worse than GG).

How to find out the cause?

In his preliminary analysis the PhD student examined uptake sequences associated with the high and low GG peaks and didn't see any obvious differences.  We'll want to do this again with the improved datasets.

We can do this at a more detailed level, examining specific uptake sequence occurrences at positions of high and low uptake.  We should particularly focus on parts of the genomes where the NP and GG genomes are 'syntenic' - where they have homologous sequences in homologous locations.  That will let us compare pairs of NP and GG uptake sequences that we know share a recent evolutionary ancestor.

Let's not rush into this

I'm keen to find out what's going on, but I think it's important to exercise restraint.  We should proceed systematically through the analyses we've planned, rather than jumping onto this tempting problem.

How many contamination-control replicates can we do?

This is a continuation of the previous post.

For each of our 12 genuinely contaminated uptake samples we want to create multiple replicate fake-contaminated input samples, each fake-contaminated with an independent set of Rd reads at that sample's level of contamination.
For example, our UP01 sample has 5.3% Rd contamination.  Its corresponding input sample is UP13.  UP13 has about 2.7x10^6 reads, so to make a fake-contaminated sample for UP13 we need to add (2.7x10^6 * 0.053)/(1-0.053) = 1.5x10^5 Rd reads to the UP13 reads.
Since our Rd sample contains 4,088,620 reads, for UP01 we could make 27 such fake-contaminated sets.   Other samples might need more Rd reads per set, and we wouldn't be able to make so many sets.
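
That arithmetic, as a throwaway function (read counts are the ones from the post; the function name is just illustrative):

```python
# Rd reads to add so they make up fraction c of the final mix:
#   added / (input + added) = c   =>   added = input * c / (1 - c)
def rd_reads_needed(input_reads, c):
    return input_reads * c / (1 - c)

need = rd_reads_needed(2.7e6, 0.053)   # ~1.5e5 reads for UP13
sets = int(4_088_620 // need)          # -> 27 independent sets
```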

We'd like to use the same number of replicate sets for each of our 12 uptake samples, so we need to identify the uptake sample that needs the most reads per set, and thus has the lowest number of possible sets.  The table below shows that this is sample UP08, which needs 943,299 reads per set and thus allows creation of only 4 independent sets, and thus 4 independent fake-contaminated input samples.  This value is lowest because UP08 has a very high level of Rd contamination (16.6%) and its corresponding input control sample (UP15) is quite large (4.7x10^6 reads).  It's not the largest control sample (that's UP16 with 1.0x10^7 reads), but the uptake samples corresponding to UP16 have much less contamination than UP08 does.

So we should plan on creating 4 independent fake-contaminated input samples for each uptake sample, and then using the average coverage of these 4 samples as the denominator in the uptake ratio calculation.

Almost there: making the uptake ratio graphs

Yesterday the PhD student showed me the results of his contamination-correction tests.  They confirmed that our new error-correction strategy works, and suggested an improvement.

The problem and the strategy:  We want to know how efficiently different segments of NP or GG DNA are taken up by competent Rd cells.  All of our 12 'uptake' samples are contaminated, consisting of mostly reads of NP or GG DNA taken up by Rd cells plus varying amounts of contaminating Rd chromosomal DNA.  We want to calculate the 'uptake ratio' for each genome position as the ratio of sequence coverage in the uptake sample to coverage in the 'input' sample - thus correcting for the varying efficiency of sequencing at different genome positions. We originally tried to identify and remove the Rd-derived reads from the uptake samples before calculating the uptake ratio, but this introduced new problems.  Our new strategy is to instead deliberately add 'fake-contaminating' Rd reads to our input samples, at levels matching the real contamination in each uptake sample.
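
The uptake-ratio calculation itself is just a position-by-position division; a minimal sketch (list-based for clarity, though the real data are genome-length coverage arrays):

```python
# Per-position uptake ratio: uptake coverage over (fake-contaminated)
# input coverage.  Positions with zero input coverage are undefined.
def uptake_ratio(uptake_cov, input_cov):
    return [u / i if i > 0 else float("nan")
            for u, i in zip(uptake_cov, input_cov)]

uptake_ratio([10, 20], [10, 10])   # -> [1.0, 2.0]
```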

The test:  To test whether the new strategy works, the PhD student first created a set of four fake-uptake samples by adding 10% of Rd reads (set 1) to each of the four input samples (NP-long, NP-short, GG-long, GG-short).  He then created the corresponding fake-contaminated input samples by adding different Rd reads (set 2) to get the same 10% contamination level.  He then calculated and plotted the ratio of fake-uptake to fake-input for each genome position.  If the contamination correction were perfect this would give a value of 1.0 at every position, but we knew it would be imperfect because the contamination reads (set 1) and the correction reads (set 2) were not identical.

Here are the results for the NP-long analysis (the NP-short was similar):

The top graph shows a 10 kb segment; the bottom one the whole genome.  The uptake ratios are nicely centered at 1.0, varying from about 0.9 to about 1.1.  This variation is expected, due to the random differences in Rd coverage by set 1 and set 2.  The segments with no variation are places where the Rd genome has no homolog in the NP sequences.

Here are results for GG-long:

This result is noisier, with many spikes above 1.1 and dips below 0.9.  The cause was easy to find:  the spikes and dips occur at positions where sequencing inefficiencies cause very low GG coverage.  Below are coverage graphs for the 10 kb segment in the first graph above, and for a 30 kb segment around the major spike/dip at position 400,000 in the whole-genome graph above.  In each case we see that sequencing coverage is much lower at the spike-dip positions, causing the chance differences in Rd coverage to have a much bigger effect than elsewhere. 

In principle, the role of chance differences between the set 1 and set 2 Rd coverage can be checked by examining the other GG sample, GG-short, but I think the PhD student used exactly the same sets of Rd reads for this sample as for the others, which would predict that the spikes and dips should be the same.  They're not, probably because of differences in GG coverage between the long and short samples.

We should go back and test the same GG-long sample using different sets of Rd reads (set 3 rather than set 1, or set 4 rather than set 2).

Sources of variation:  The above analysis gives us a better understanding of the sources of variation in this uptake analysis.  First there's the variation across the genome in sequencing efficiency.  This is (we think) due to properties of the sequencing technology, and should be constant across different samples from the same genome (e.g. input and uptake samples). We don't have any way to reduce this variation, but we control for it by calculating the uptake ratios rather than just position-specific differences in coverage in the uptake samples.  Second, there's the variation in how much Rd contamination is present.  This arises due to variation in the experiments that purified the DNA; we can't change it at this stage, but we control for the differences between samples by introducing different amounts of compensating fake-contamination into the input sample control for each uptake sample.  Third, there's the chance variation in the distribution of contaminating Rd reads across the genome.  This will be different for each sample, and we can't change it or control for it.  Finally there's the random variation in the distribution of fake-contaminating Rd reads added to each input sample.  The next section describes how we can eliminate most of this.

Replicate corrections will reduce variation:  The above analysis also showed us a way to improve the correction in our 12 genuinely contaminated samples.  Instead of doing each correction once, we can do the correction several times, generating independent fake-contamination input samples using independent sets of Rd reads.  Then we can average the resulting uptake ratios.  In fact we can do this a simpler way, by just averaging the coverage levels of the independent fake-contamination input samples at each position, and then calculating the uptake ratios using these averages.
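
The simpler averaging approach can be sketched like this (illustrative names; each coverage list is per genome position):

```python
# Average the independent fake-contaminated input coverages position
# by position, then take a single ratio against that average.
def averaged_input(replicate_covs):
    """replicate_covs: one per-position coverage list per replicate."""
    n = len(replicate_covs)
    return [sum(vals) / n for vals in zip(*replicate_covs)]

def uptake_ratios(uptake_cov, replicate_covs):
    denom = averaged_input(replicate_covs)
    return [u / d for u, d in zip(uptake_cov, denom)]
```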

The number of Rd reads needed for each correction will depend on the coverage level and true contamination level of each uptake sample, but we should have enough Rd reads to create at least four independent sets for each sample (see next post). 

UBC's Faculty Pension Plan, for dummies

Yesterday I went to an information session about UBC's Faculty Pension Plan, designed for faculty approaching the age when we have to decide what to do with the money the Plan has accumulated on our behalf.  Even if we choose not to retire (or not yet), by age 71 contributions to the Plan will end and we must make a decision.

Until retirement or age 71, the Plan has been taking money from our salaries and from UBC's pockets, and investing it in something they call their Balanced Fund.  (Yes, if you know what you're doing you can change the investment mix.)  The info session told us that this fund does pretty well, and because it's a big pot of money ($2 billion) the management fees are very low.

At retirement or age 71, you have three choices.

  1. If you want to manage your money yourself, you can take it out of the Plan.  But if you don't use it for an annuity or retirement fund you'll have to pay taxes on the full amount. 
  2. You can leave it in the UBC Plan's Balanced Fund and treat it as a retirement fund, using regular withdrawals to support your retirement or other goals.  You pay tax on these withdrawals.  Depending on the details (RRIF vs LIF), there are upper and lower limits on the withdrawal amounts.  Your goal may be to have spent all your money by the time you die, or to leave what's left to your family or other beneficiaries.
  3. You can purchase a UBC Plan annuity ('Variable Payment Life Annuity') that will pay you a relatively constant amount every month until you die.  The amount depends on how much money you pay when you purchase it, how old you are at this time, and a bit on how well the Balanced Fund is doing (that's the 'Variable Payment' part).  IMPORTANTLY, it doesn't depend on whether you are male or female.
In a normal annuity (not purchased from the UBC Plan), a woman gets less money per month than a man who invests the same initial amount.  Here's a quote from an outraged article in the Globe and Mail.
Here’s the reality behind the numbers, based on the table published Monday, May 2: A 65-year-old woman who gives $100,000 to a major insurance company will get an annuity of about $474 a month, while a man of the same age spending the same amount will get $519. A woman who waits until the age of 71 to buy her annuity will get $548 monthly, while a man of the same age will get $603.
But the differing returns are because the woman is likely to live longer.  On average, women get about the same amounts from their annuities over their lifespan as men do.

The UBC Annuity Plan doesn't do this.  It ignores the differences in male and female survivorship, so men and women who purchase the same annuity amount at the same age get the same monthly income.  This is really good for women because we can expect to continue getting our payments for 5 or 6 years longer than a man would.  If payments are $3300/month (UBC's predictions for a $500,000 purchase in 2017), this is about $200,000!

The UBC Plan's use of a single gender-neutral survivorship table creates another benefit.  The table doesn't just average the survivorships for men and women.  Instead it weights these by the proportions of men and women in the pool of people buying UBC annuities.  Since this pool is currently 80% men, the survivorships will be lower than they would otherwise be, so the calculated monthly payouts will be higher. 

Ways to test contamination control

In the previous post I proposed an alternative method to control for contaminating recipient DNA in our donor DNA uptake samples.  Because we've never done this before (and probably nobody else has either), we need ways to check that it accomplishes what we want.

Here's one way:
  • We already have made samples of pure donor DNA reads (from strain NP or GG) that have been deliberately contaminated with reads from the recipient Rd (10% Rd, 90% NP or GG). These REMIX samples have already been mapped to the donor genomes.
  • Make a second set of these samples, using the same pure donor samples but this time contaminating them to 10% with an independent set of Rd reads - pretend this is a 'contaminated uptake' sample.  
  • Map the new 'contaminated uptake' samples onto the donor reference genome
  • Divide the coverage at each position in the contaminated uptake samples by the coverage in the corresponding contaminated input samples.
  • Examine plots to see how coverage differs across the genome in the two contaminated samples and in the pure donor samples.
If the method works as planned, the ratio of coverages should be close to 1 at most positions.
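
One way to make "close to 1 at most positions" concrete is to tally the fraction of positions falling inside some band; a sketch (the 0.9-1.1 band matches the variation we saw earlier, but the thresholds are arbitrary):

```python
# Fraction of genome positions whose uptake ratio falls within a band.
def frac_within(ratios, lo=0.9, hi=1.1):
    return sum(lo <= r <= hi for r in ratios) / len(ratios)

frac_within([1.0, 1.05, 2.0, 0.5])   # -> 0.5
```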

For comparison, we can calculate and plot the ratios of coverage when the 'contaminated uptake' coverage is divided by the corresponding pure input coverage.

Aside:  Way back we spent a lot of time wondering why the GG-short experiment had different peak uptake ratios than the NP-short experiment.  Most uptake-ratio peaks in the NP-short experiment had heights between 3 and 4 (very few above 5).  But most of the peak heights in the GG-short experiment were between 2 and 3, with many others having much higher peaks.

Now that I've been thinking about consequences of contamination, I wonder if this difference could have been caused by the much lower contamination levels in the GG samples (NP-short: 13.9%, 17.4%, 21.5%; GG-short: 2.7%, 4.8%, 7.7%). 

New contamination analysis suggests alternate strategy

Yesterday's post considered the factors affecting our current strategy for removing the contaminating Rd-derived reads from our 'uptake' samples.  Since then I've looked at the read-mapping results using our newly perfected reference sequences.  The problem of reads that have no SNPs and thus map ambiguously to both Rd and NP/GG is worse than we thought.  Below I'll describe this data and the new strategy I want to consider.

The former post-doc sent us a spreadsheet with all the summary data from mapping each of our samples ('input', 'uptake' and controls) to various combinations of genomes.  I'll start by considering what we see when we map the 'input' genomes (NP and GG) and the recipient genome (Rd) onto artificial genomes consisting of an Rd-genome 'chromosome' and an input-genome 'chromosome' (NP or GG).  These samples have no contamination, so any reads that map to the wrong 'chromosome' will do so because of these factors (also described in the previous post):

  1. Sequencing errors:  Some NP-derived (or GG-derived) reads will have sequencing errors that create the Rd alleles of SNPs, causing them to align better to the Rd homolog. 
  2. Shared segments: Although the Rd and NP (or GG) genome sequences differ on average by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical, including each genome's six copies of the ribosomal RNA genes.  Reads from within these segments cannot be attributed to one source and are randomly assigned locations by the mapping program.  Half of them will be assigned to Rd and half to NP (or GG), regardless of the actual level of contamination, and all will be given MAPQ scores = 0.
Here are the results.  I've deleted the columns with the raw coverage data and just show the % of reads in each sample that were incorrectly assigned.

Depending on the combination of reference genomes used, between 4.75% and 7.61% of the reads are assigned to the incorrect genome, either due to sequencing errors or because the segment they cover is identical in both genomes (does not include any SNP positions).  

The likelihood that a read will be mis-mapped depends on its length - short reads are more likely to not include any SNP positions.  This probably explains the lower mis-assignment rates of the 'long' NP and GG samples.

Very little of this mis-assignment is due to sequencing errors.  The table below shows how the % mis-assignments change when the reads with MAPQ scores of 0 are removed (XXX rows).

Excluding the ambiguously mapping reads dramatically reduces the mis-mapping, to between 0.05% and 0.21%.  This residual mis-mapping reflects reads that did not match the correct genome, but matched a SNP allele in the incorrect genome.  This will be due to sequencing errors that change the SNP allele.

With this information we can calculate the actual fraction of reads that are ambiguously mapped.  Because ambiguously mapping reads are split 50/50 between the two genomes, the mis-mapped ones represent only half of all the ambiguous reads, so:
2 × (mis-mapped reads − sequencing-error mis-mappings)
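As a sanity check, the arithmetic can be sketched in a couple of lines of Python (the percentages here are just the extremes quoted above, converted to fractions):

```python
def ambiguous_fraction(mismapped, seq_errors):
    """Estimate the fraction of reads that map ambiguously.

    Ambiguous reads are assigned 50/50 between the two 'chromosomes',
    so the mis-mapped ones represent only half of all ambiguous reads;
    hence the factor of 2.  Sequencing-error mis-mappings are subtracted
    first because those reads are not ambiguous, just wrong.
    """
    return 2 * (mismapped - seq_errors)

# Extremes from the tables above (fractions, not percents):
print(ambiguous_fraction(0.0475, 0.0005))  # ≈ 0.094, i.e. ~9.4% of reads
print(ambiguous_fraction(0.0761, 0.0021))  # ≈ 0.148
```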

This is a lot more than I expected!  Only a small part of the mis-mapping is due to the 6 ribosomal RNA operons.  Each operon has about 4.5 kb of identical sequence, which will cause about 25 kb (1.4%) of the genome to map ambiguously.  Most of the ambiguous mapping must thus be due to identical sequences in other parts of the genome.  These are likely to be mainly in coding sequences that are either highly conserved or that are identical by descent due to previous recombination between recent ancestors of these strains.

New plan for contamination correction:  Our original contamination-correction plan was to remove the contaminating reads from the uptake samples.  We planned to use mapping to concatenated 'genomes' to identify and discard reads that mapped unambiguously to Rd and that mapped ambiguously to both genomes.  But the above analysis shows that we would be discarding a large fraction of the non-contaminating (NP or GG) reads from many parts of the genome (probably the most conserved parts).

So here's an alternative.  Instead of trying to remove the contamination from the 'uptake' samples (the numerator in our uptake-ratio calculation), let's add fake contamination to the 'input' samples (the denominator in our calculation).  This idea was suggested by some fake-contamination samples the former post-doc created to do some previous tests.

Here's how it would work:  
  1. First we estimate the actual contamination level of each 'uptake' sample, by mapping it onto the concatenated pair of genomes and using the above analysis to correct for the mis-mapping.   (This calculation is shown below.)
  2. Then we create, for each uptake sample, a fake 'input+contamination' sample that matches the Rd contamination of the corresponding uptake sample, by adding the appropriate number of (randomly-selected?) Rd reads to its reads. 
  3. Then we map each sample's reads (uptake and input+contamination) to the genome it's supposed to represent, not paying any attention to possible contamination.
  4. Then we normalize the coverages of all these mapped samples to the same size (reads per million bp or ???).
  5. Then we calculate the uptake ratios for each uptake sample, by dividing the normalized 'uptake' coverage at each genome position by the corresponding normalized 'input+contamination' coverage. 
  6. Finally, we calculate mean uptake ratio at each position for the sets of three replicate uptake samples.  
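Steps 4-6 of the plan above can be sketched like this (a minimal sketch, not our actual pipeline; the reads-per-million scaling and the function names are just illustrative assumptions):

```python
import numpy as np

def uptake_ratios(uptake_covs, input_plus_contam_cov):
    """Normalize each coverage track, divide each 'uptake' track by the
    matching 'input + fake contamination' track, then average the
    replicate ratio tracks position by position.

    uptake_covs: list of per-position coverage arrays, one per replicate.
    input_plus_contam_cov: per-position coverage of the matching
    fake-contaminated input sample.
    """
    def normalize(cov):
        return cov / cov.sum() * 1e6   # scale to reads per million

    denom = normalize(np.asarray(input_plus_contam_cov, dtype=float))
    ratios = [normalize(np.asarray(u, dtype=float)) / denom
              for u in uptake_covs]
    return np.mean(ratios, axis=0)     # mean uptake ratio per position
```

Because both numerator and denominator carry the same Rd contamination level, the contamination should (if the logic holds) cancel out of the ratio instead of having to be removed read by read.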

Actual contamination:  The calculations to estimate the actual contamination level of each 'uptake' sample took a bit of algebra to set up (see *** below), but I checked them a couple of ways and I'm pretty sure they're correct.

The final column is the calculated contamination level for each sample.  This is the fraction of the DNA in the sample that came from the chromosomes of the recipient Rd cells instead of from donor NP or GG DNA taken up by these cells.  These values are surprisingly close to the values in yesterday's post, which were calculated in a completely different way, using only the reads that did not map ambiguously.  That's fine - it tells us that there's nothing special about the ambiguously mapping reads.

I'm not absolutely sure that this strategy will eliminate the effects of contamination (we can test it with a fake-contaminated uptake sample).  If it does, we should see that the uptake ratio peaks are better correlated with the quality-scores of the uptake sequences (the large scatter we saw previously should be reduced).  And it will certainly reduce the problems associated with the many regions with low or zero coverage created by discarding reads with MAPQ=0.

What next: The first step is for the grad student and former post-doc to read and comment on this post.  If they agree that what I'm proposing is a reasonable approach, we'll test it.  

What about using spanning coverage instead of single reads? The grad student is now generating the spanning-coverage data, but the goal of using it to correctly position the ambiguous reads is challenging.  In principle we won't need to tag individual reads as ambiguous if we use this new contamination-correction strategy.

***  The contamination calculation:

The input and control analysis above tells us what fraction of the reads mapping-to-Rd in each uptake sample results from mis-assignment of ambiguously mapping reads rather than from contamination.

I worked out the calculation by instead considering an imaginary sample whose contamination level C was known:
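Here is my reconstruction of that logic as a sketch (the symbols are my own: a is the ambiguously mapping fraction and e is the sequencing-error mis-mapping rate, both measured from the control mappings above; the forward calculation assumes ambiguous reads split 50/50 regardless of origin):

```python
def observed_rd_fraction(C, a, e):
    """Forward calculation for an imaginary sample with known
    contamination C: a fraction a of reads maps ambiguously and is
    split 50/50 between the genomes; of the rest, Rd-derived reads map
    to Rd unless a sequencing error flips them (rate e), and NP-derived
    reads map to Rd only when an error flips them."""
    return a / 2 + (1 - a) * (C * (1 - e) + (1 - C) * e)

def true_contamination(f, a, e):
    """Invert the forward calculation to recover C from the observed
    fraction f of reads assigned to Rd."""
    return ((f - a / 2) / (1 - a) - e) / (1 - 2 * e)

# Round-trip check with made-up numbers:
f = observed_rd_fraction(0.15, a=0.10, e=0.001)
print(round(true_contamination(f, a=0.10, e=0.001), 6))  # 0.15
```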

Coverage and contamination in our DNA-uptake dataset

We finally have the corrected reference genome sequences for our DNA uptake deep-sequencing data.  The genome sequences from NCBI had errors and polymorphisms relative to the specific strains we used in the experiments, so the former post-doc used the sequence data from this experiment to create reference sequences that perfectly match the DNAs we used.

The bright-coloured table below shows 8 of our 16 samples.  These are the samples that examined uptake of DNA from strain 'NP' (86-028NP) into strain 'Rd'.  Not included are the equivalent 8 samples that used DNA from strain 'GG' (PittGG).  These reads have been mapped onto a reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome, and all reads that map equally well to both genomes have been removed (reasons for this are explained below).   (The numbers in the table are from our earlier preliminary data analysis, so they will probably change a little bit with the new analyses.)

Issues to consider:

Reads or spanning coverage:  Should our analysis use coverage data derived from considering each sequencing read independently (what we have been doing) or from the 'spanning-coverage' sequences created by considering each pair of paired-end reads plus its intervening sequences?

We agree that spanning coverage will be better.  The former post-doc was going to create the BED files with the spanning coverage data but he got bogged down in other work, so the grad student is now working on this.  For now we're using the read-based BED files.
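For what it's worth, spanning coverage is simple to sketch: instead of incrementing coverage over each read separately, you increment it over the whole fragment implied by a read pair.  (This toy version uses BED-style 0-based, end-exclusive intervals; the names are illustrative only.)

```python
def spanning_coverage(pairs, genome_length):
    """Per-position coverage counting each paired-end fragment once,
    from the leftmost start to the rightmost end of the pair."""
    cov = [0] * genome_length
    for (start1, end1), (start2, end2) in pairs:
        left = min(start1, start2)
        right = max(end1, end2)
        for pos in range(left, right):
            cov[pos] += 1
    return cov

# One pair: reads at [0,3) and [7,10) span the fragment [0,10)
print(spanning_coverage([((0, 3), (7, 10))], 12))
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```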

Mean coverage:  The numbers in the 'NP mean' column give the average number of reads from each sample that aligned to each genome position.  Mean coverage ranges from 166 reads (sample UP2) to 482 reads (sample UP9).  This is plenty for our purposes.

Variation in coverage across the genome:  Our preliminary analyses did not give us a good view of this, for complicated reasons.  In the next day or so the grad student should be generating these graphs for the 'input' samples (the UP13 and UP15 samples mapped onto NP, and the UP14 and UP16 samples mapped onto GG).  This will let us see how even the coverage is across each genome, and how reproducible the variation is for each genome's two input samples (short and long).

Our concern is that there may be positions or segments with very low coverage (less than 10 reads). Such low coverage would reduce our ability to detect the effects of DNA uptake biases, which requires dividing the mean 'uptake' coverage at each position by the 'input' coverage.  If the low coverage is due to random noise in the analysis, it will be evened out for the 'uptake' data because we have three replicates for each DNA length.  But we only have a single 'input' replicate for each DNA length, so the divisor would still retain all the noise.

The relative roles of noise and biases will be checked by comparing the long and short input samples to each other, and the replicate uptake samples to each other.  Given our overall high coverage, the big problem may not be random events but sequence biases in the procedures used for library creation and sequencing.  Unfortunately these would not be averaged out over the three uptake replicates.

One complication is the presence in the genomes of repeated sequences, especially the six copies of the 16S and 23S ribosomal RNA genes.  The mapping software doesn't know which copy reads from within these segments are derived from, so for each such read it randomly chooses one repeat copy as the mapped location, and assigns the read a map quality score (MAPQ) of 0.  This random assignment is not a problem when the sample consists of reads from a single genome with no contamination, but it contributes to the difficulty of eliminating reads from Rd DNA that contaminates the uptake samples (described below).
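In practice, dropping the MAPQ=0 reads is a one-liner with samtools (`samtools view -b -q 1` skips alignments with MAPQ below 1), or, as a sketch in Python over SAM-format lines (MAPQ is column 5 of a SAM record):

```python
def drop_mapq0(sam_lines):
    """Keep header lines and alignments with MAPQ > 0.
    MAPQ is the fifth tab-separated field (index 4) of a SAM record."""
    kept = []
    for line in sam_lines:
        if line.startswith('@') or int(line.split('\t')[4]) > 0:
            kept.append(line)
    return kept

records = [
    '@HD\tVN:1.6',
    'read1\t0\tRd\t100\t60\t50M\t*\t0\t0\tACGT\t!!!!',
    'read2\t0\tNP\t200\t0\t50M\t*\t0\t0\tACGT\t!!!!',  # ambiguous, MAPQ=0
]
print(drop_mapq0(records))  # keeps the header and read1 only
```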

Presence of contaminating Rd DNA in the uptake samples:  The 'uptake' samples consist of NP (or GG) DNA recovered after being taken up by Rd cells.  The recovery process is not perfect, and each sample contains a significant amount of contaminating Rd DNA, whose reads need to be removed before the effects of uptake biases can be determined.

The former post-doc found that the Rd reads could be removed by aligning each uptake sample to a fake reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome.  In this 'competitive' alignment, almost all the contaminating Rd-derived reads are mapped to the Rd 'chromosome' (where they can be ignored) and almost all the desired NP-derived reads are mapped to the NP 'chromosome' (where they can be analyzed for evidence of uptake biases).

Why 'almost all' rather than 'all':  Several factors will cause reads to be mapped to the wrong 'chromosomes' in this contamination-removal step.
  1. Sequencing errors:  First and least important, some NP-derived reads will have sequencing errors that create the Rd alleles of Rd-NP SNPs, causing them to align better to the Rd homolog.  This will slightly raise the apparent baseline level of Rd contamination but isn't a serious problem.  The reverse will also happen, but we can ignore it too, especially because Rd-derived reads are the minority.  
  2. Repeated segments:  Second, the rRNA genes of Rd and NP have identical sequences, so when these are randomly assigned locations (see above), half of them will be assigned to Rd and half to NP, regardless of the actual level of contamination, and all will be given MAPQ scores = 0.  This is potentially a serious problem, since it reduces the apparent NP-derived coverage of these regions by 50%.  Because there are no uptake sequences in the rRNA genes, provided we treat the input and uptake samples identically this will only affect our ability to see effects of uptake biases in sequences within a few hundred bp of the repeated segment.   
  3. Shared segments:  Third, although NP and Rd genome sequences differ, on average, by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical. The mapping software will treat reads from these segments just like it treats reads from the rRNA repeats, randomly placing them at one of the two possible positions (in Rd or NP) and assigning them MAPQ scores = 0.
(Factors 2 and 3 are not significant right at the edges of repeated and shared segments, but become more important as the distance from the nearest SNP approaches the average read length.  Use of spanning coverage rather than single reads should improve our ability to identify and remove the contamination.)
The former post-doc handled factors 2 and 3 by removing from the analysis all reads that had MAPQ=0.  The data in the table above shows the read coverages AFTER the reads with MAPQ=0 have been removed.  The UP13 and UP15 input samples still have a small fraction (about 0.1%) of Rd contamination rather than 0% contamination.  This may be due to Factor 1 above.

For the ribosomal repeats (and other repeats) removing all the MAPQ=0 reads may be valid, because we genuinely do not know where they came from.  But it's more complicated for the shared segments.  If uptake coverage were even over the genome, we'd know that most of the reads came from NP.  But uptake coverage is not even, especially for the 'short' DNA samples (that's the whole point of the experiment).  Positions close to uptake sequences will have high NP coverage, and only a small contribution from Rd contamination.  Positions far from uptake sequences will have very low NP coverage, so the even contribution from the Rd contamination will have a big effect.

Does expression of the toxA operon depend on ToxT as well as ToxA?

Short answer:  Yes, but not in the way we expected.

First, here's a diagram showing the toxTA operon and the mutants we're examining:

The grey bars show the extents of the deletions.  The ∆toxT and ∆toxTA mutants have a SpcR/strR cassette inserted at the deletion point, but the ∆toxA mutant has only a short 'scar' sequence at the deletion point.

A few months ago I wrote a post about evidence that ToxA prevents transcription of the toxTA operon from an unexpected internal promoter.  Here's a better version of the graph I showed there (note that transcription is going from right to left):

It looks like there are two promoters.  The CRP-S promoter is competence-induced so inactive in the M0 samples and maximally active in the M2 samples.  The unexpected internal promoter (labeled 'toxTA' in red) is not active in wildtype but highly active in all the ∆toxA samples.  Really I should consistently refer to this as the 'putative' internal promoter, but I won't bother for this post.

The line heights in the above graphs are hard to interpret because of the log scale, so here's a linear scale view.

These expression patterns appear to show normal repression of the 'toxTA' internal promoter by ToxA, and release from repression in the ∆toxA mutant.  The internal location of the ToxA-repressed promoter is unexpected, but its repression by ToxA fits the pattern observed for the well-studied homologs of this type of toxin/antitoxin system.

In these homologs, transcription of the operon is actually repressed by a toxin-antitoxin complex, not by antitoxin alone.  So we would expect to see that repression of the internal toxTA promoter is also released by knockouts of toxT or of both genes.  But that's not what we see.

We can't check the effect of a toxT knockout on transcription from the internal promoter, because the internal promoter is deleted in our toxT mutants.  But we can check the effect of deleting both toxT and toxA, because the deletion in our double mutant starts well downstream of the internal promoter.

Surprisingly, transcription from the internal promoter is not increased in the double mutant.  The first figure shows coverage for wildtype, ∆toxA and ∆∆toxTA at time point M0.  We see that the CRP-S promoter is inactive in all three strains, and the internal promoter is very active only in ∆toxA (purple).  It's actually less active in the double knockout (green) than in KW20 (blue).

This figure shows transcription of the same strains at time point M2, when the CRP-S promoter is most active.  Now we see transcription from the CRP-S promoter in all strains, although definitely weaker in ∆∆toxTA.  We again see strong transcription from the internal promoter in ∆toxA (purple), but very little in ∆∆toxTA.

Could some sort of artefact be responsible for the apparent lack of transcription from the internal promoter in the double knockout?  One likely candidate is read-mapping artefacts caused by the presence of toxTA deletions and insertions in the mutant reads but not in the wildtype genome sequence they are being aligned to.

So that we could check for these effects, one of the former honours students made a set of mutant-specific reference sequences for the toxTA region, separately aligned each set of mutant reads to its corresponding mutant reference sequence, and explained to me how to examine the reads and coverages using the Integrative Genomics Viewer (IGV).

For all the M0 (log phase growth) and M2 samples (max CRP-S induction) I noted the number of reads covering position 400 (at the toxT start codon, ~35 bp downstream from the CRP-S promoter) and covering position 500 (about 75 bp downstream from the internal promoter).  (I didn't bother analyzing M1 and M3 samples.)  I normalized each coverage value by the number of reads in the sample, and calculated the mean coverage over the three replicate samples for each time point.
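That normalize-then-average step looks something like this (a sketch only; the reads-per-million scaling is my assumption about the exact normalization, and the numbers are made up):

```python
def normalized_mean(coverages, library_sizes, scale=1e6):
    """Normalize each replicate's coverage at one position by that
    replicate's library size (total reads), then average across
    replicates."""
    norm = [c / n * scale for c, n in zip(coverages, library_sizes)]
    return sum(norm) / len(norm)

# Three replicates at one position (toy numbers):
# each replicate normalizes to ~60 reads per million, so the mean is ~60
print(normalized_mean([120, 90, 150], [2_000_000, 1_500_000, 2_500_000]))
```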

Here's a linear-scale graph:

And here's the corresponding log-scale graph:

And here are the conclusions:

This rules out read-alignment artefacts as an explanation for the apparently low transcription from the internal promoter in the ∆∆toxTA mutant, and from the CRP-S promoter in the ∆toxT mutant.

I'm now going to go back and generate the data for M1 and M3.  Then I'll update this post.  For now, ignore the notes below.

OK, the data for the M1 and M3 time points don't change the conclusions at all.

So, questions we still don't know the answers to:

Why is competence-induced transcription from the CRP-S promoter down modestly in all the toxTA mutants?  (Compare actual mean M2 values: wildtype: 830; ∆toxA: 513; ∆toxT: 213; ∆∆toxTA: 173.)
In wildtype cells in log phase, neither the antitoxin (ToxA) nor the toxin (ToxT) is likely to be present.  In wildtype cells at M2, both proteins are likely to be accumulating.  We don't know whether there will be more of one than the other - usually the toxin is more stable, and the antitoxin is unstable unless it is bound to toxin.  We don't know what the HI0660 toxin's 'toxic' activity is, and we don't know of any other expected activity for the HI0659 antitoxin except binding toxin and repressing toxTA transcription.
The lower transcript levels in the mutants suggest that both ToxA and ToxT contribute positively to transcription from the CRP-S promoter, or to the stability of the resulting transcripts.  
Could toxin in wildtype cells, and unopposed toxin in ∆toxA, be somehow stabilizing transcripts?  Doesn't make much sense to me.
Why is log-phase transcription from the putative internal promoter way up in ∆toxA but way down in ∆∆toxTA?  (Compare actual M0 values: wildtype: 134; ∆toxA: 1713; ∆∆toxTA: 170.)
Taken at face value, this would seem to mean that ToxT actively stimulates toxTA transcription, or stabilizes another transcription factor.
I hope someone else has some ideas!

Learning to use the NCBI Gene Expression Omnibus

As part of our workup for the toxin/antitoxin manuscript, I want to find expression data for the homologs of the Haemophilus influenzae toxin and antitoxin genes.  The former post-doc recommends that I use NCBI's Gene Expression Omnibus ('GEO') for this.

I'll need to learn how to search the GEO for specific accession data and data from specific taxa.

I'll also need to find out the specific identifiers for the genes I'm interested in, in the species I'm interested in.  I think I can use BLAST searches (queried with the H. influenzae sequences) to find the species and links to the DNA sequences of the homologs, and then I can look at the gene records to find the strain and gene identifiers.

Then I need to check if anyone has reported doing gene-expression studies on this strain or species (ideally the same strain, but I think/hope the gene identifiers will be consistent across strains).  These reports should contain the GEO accession numbers for the data.

Then I can ask GEO for the data for this study.  But I don't know what format the data will be in, nor how hard it will be to find the information about the genes I'm interested in.

The best situation would be if GEO has done its own analysis on the data and made this available as a curated 'GEO Dataset' or a 'GEO Profile'.  I could then search the GEO Profile to see their analysis of the particular genes I want to learn about.

Here's a phylogenetic tree showing the relationships between the concatenated toxin/antitoxin sequences in the species they're found in.  (I should also have a tree showing the relationships of the species themselves, but I haven't drawn that yet.)

So I'll start by searching BLAST with the amino acid sequence of HI0660, the H. influenzae toxin, and limiting the search to Streptococcus species.  Because I want to find the genes, not the genomes, I'll do a BLASTP search against Streptococcus protein sequences in the database.  OK, this gets me species and accession numbers for the (mostly 'hypothetical') proteins, but not usually strain identifiers or gene names.  

Will I be able to use a gene accession number to search GEO?  Apparently yes!
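(In case it helps anyone else: GEO can also be queried programmatically through NCBI's E-utilities.  This sketch just builds the esearch URL against the GEO Profiles database, `db=geoprofiles`; I haven't automated the actual fetching.)

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint
EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def geo_profiles_query(term):
    """Build an E-utilities esearch URL against GEO Profiles
    for a gene name or identifier."""
    return EUTILS + '?' + urlencode({'db': 'geoprofiles', 'term': term})

print(geo_profiles_query('HI0957'))
```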

Now let's try searching GEO for the gene accession number of a gene I'm sure it has in at least one Profile:  The H. influenzae CRP gene:

But GEO does find it if I search for it by its gene name: 'HI0957'.  

I think the red bars are expression levels in the six samples of the study.  Yes, and clicking on them gets me the actual data from the study.  Now I can see the Y-axis and discover that, although the red bars look dramatically different, the differences are actually quite small (range 6300-6300).

So this control search tells me that I need to know gene names (or whatever identifiers are being used) for the genes whose expression I want to learn about.  Hmmm...  I wonder if GEO provides any help for this issue.

I also tried starting with a TBLASTN search, which will get me the position in the genome where the homolog is encoded.  I can then look at the genome sequence, find the position, and see if the gene has a name.  The Streptococcus pneumoniae homolog of HI0660 (the H. influenzae toxin) is SP_1143, but searching GEO for this finds nothing.