UBC's Faculty Pension Plan, for dummies

Yesterday I went to an information session about UBC's Faculty Pension Plan, designed for faculty approaching the age when we have to decide what to do with the money the Plan has accumulated on our behalf.  Even if we choose not to retire (or not yet), by age 71 contributions to the Plan will end and we must make a decision.

Until retirement or age 71, the Plan has been taking money from our salaries and from UBC's pockets, and investing it in something they call their Balanced Fund.  (Yes, if you know what you're doing you can change the investment mix.)  The info session told us that this fund does pretty well, and because it's a big pot of money ($2 billion) the management fees are very low.

At retirement or age 71, you have three choices.

  1. If you want to manage your money yourself, you can take it out of the Plan.  But if you don't use it for an annuity or retirement fund you'll have to pay taxes on the full amount. 
  2. You can leave it in the UBC Plan's Balanced Fund and treat it as a retirement fund, using regular withdrawals to support your retirement or other goals.  You pay tax on these withdrawals.  Depending on the details (RRIF vs LIF), there are upper and lower limits on the withdrawal amounts.  Your goal may be to have spent all your money by the time you die, or to leave what's left to your family or other beneficiaries.
  3. You can purchase a UBC Plan annuity ('Variable Payment Life Annuity') that will pay you a relatively constant amount every month until you die.  The amount depends on how much money you pay when you purchase it, how old you are at this time, and a bit on how well the Balanced Fund is doing (that's the 'Variable Payment' part).  IMPORTANTLY, it doesn't depend on whether you are male or female.
In a normal annuity (not purchased from the UBC Plan), a woman gets less money per month than a man who invests the same initial amount.  Here's a quote from an outraged article in the Globe and Mail.
Here’s the reality behind the numbers, based on the table published Monday, May 2: A 65-year-old woman who gives $100,000 to a major insurance company will get an annuity of about $474 a month, while a man of the same age spending the same amount will get $519. A woman who waits until the age of 71 to buy her annuity will get $548 monthly, while a man of the same age will get $603.
But the differing returns are because the woman is likely to live longer.  On average, women get about the same amounts from their annuities over their lifespan as men do.

The UBC Annuity Plan doesn't do this.  It ignores the differences in male and female survivorship, so men and women who purchase the same annuity amount at the same age get the same monthly income.  This is really good for women because we can expect to continue getting our payments for 5 or 6 years longer than a man would.  If payments are $3300/month (UBC's predictions for a $500,000 purchase in 2017), this is about $200,000!
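Just to check that figure with back-of-the-envelope arithmetic (the monthly payment is UBC's prediction quoted above; the 5.5-year midpoint is my assumption):

```python
# Rough check of the ~$200,000 figure quoted above.
monthly_payment = 3300      # UBC's predicted payment for a $500,000 purchase (2017)
extra_years = 5.5           # midpoint of the 5-6 extra years a woman can expect payments
extra_income = monthly_payment * 12 * extra_years
print(f"Extra lifetime income: ${extra_income:,.0f}")   # about $218,000
```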

The UBC Plan's use of a single gender-neutral survivorship table creates another benefit.  The table doesn't just average the survivorships for men and women.  Instead it weights these by the proportions of men and women in the pool of people buying UBC annuities.  Since this pool is currently 80% men, the survivorships will be lower than they would otherwise be, so the calculated monthly payouts will be higher. 
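As a very rough illustration of why the 80/20 pool helps (the life expectancies below, and the shortcut of spreading the purchase price evenly over the expected payment-years, are my own simplifications, not the Plan's actuarial method; real payouts also include investment returns, so the absolute numbers are unrealistically low):

```python
# Toy comparison of pool-weighted vs 50/50 gender-neutral payouts.
purchase = 500_000
male_years, female_years = 19, 24      # assumed expected years of payments from age 65
for label, male_fraction in [("80/20 pool", 0.80), ("50/50 average", 0.50)]:
    expected_years = male_fraction * male_years + (1 - male_fraction) * female_years
    monthly = purchase / (expected_years * 12)
    print(f"{label}: ~${monthly:,.0f}/month over {expected_years:.1f} expected years")
# The 80%-male pool gives fewer expected payment-years, hence a higher monthly payout.
```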

Ways to test contamination control

In the previous post I proposed an alternative method to control for contaminating recipient DNA in our donor DNA uptake samples.  Because we've never done this before (and probably nobody else has either), we need ways to check that it accomplishes what we want.

Here's one way:
  • We already have made samples of pure donor DNA reads (from strain NP or GG) that have been deliberately contaminated with reads from the recipient Rd (10% Rd, 90% NP or GG). These REMIX samples have already been mapped to the donor genomes.
  • Make a second set of these samples, using the same pure donor samples but this time contaminating them to 10% with an independent set of Rd reads - pretend this is a 'contaminated uptake' sample.  
  • Map the new 'contaminated uptake' samples onto the donor reference genome.
  • Divide the coverage at each position in the contaminated uptake samples by the coverage in the corresponding contaminated input samples.
  • Examine plots to see how coverage differs across the genome in the two contaminated samples and in the pure donor samples.
If the method works as planned, the ratio of coverages should be close to 1 at most positions.

For comparison, we can calculate and plot the ratios of coverage when the 'contaminated uptake' coverage is divided by the corresponding pure input coverage.
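Here's a minimal sketch of how I imagine the ratio test would look in code, assuming per-position coverages have already been pulled out of the BED files into simple position/depth tables (the file names are placeholders):

```python
# Sketch of the coverage-ratio test described above.
import numpy as np
import matplotlib.pyplot as plt

def load_coverage(path):
    """Return an array of per-position read depths from a 'position depth' file."""
    return np.loadtxt(path, usecols=1)

uptake_contam = load_coverage("NP_fake_uptake_10pctRd.cov")   # new 10%-Rd 'contaminated uptake' sample
input_contam  = load_coverage("NP_REMIX_10pctRd.cov")         # existing REMIX 'contaminated input' sample
input_pure    = load_coverage("NP_pure_input.cov")            # uncontaminated input

# Avoid division by zero at positions with no input coverage.
ratio_contam = np.where(input_contam > 0, uptake_contam / input_contam, np.nan)
ratio_pure   = np.where(input_pure  > 0, uptake_contam / input_pure,  np.nan)

plt.plot(ratio_contam, lw=0.5, label="contaminated / contaminated (should sit near 1)")
plt.plot(ratio_pure,   lw=0.5, label="contaminated / pure input")
plt.xlabel("Genome position"); plt.ylabel("Coverage ratio"); plt.legend(); plt.show()
```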


Aside:  Way back we spent a lot of time wondering why the GG-short experiment had different peak uptake ratios than the NP-short experiment.  Most uptake-ratio peaks in the NP-short experiment had heights between 3 and 4 (very few above 5).  But most of the peak heights in the GG-short experiment were between 2 and 3, with many others having much higher peaks.

Now that I've been thinking about consequences of contamination, I wonder if this difference could have been caused by the much lower contamination levels in the GG samples (NP-short: 13.9, 17.4, 21.5; GG-short: 2.7, 4.8, 7.7). 

New contamination analysis suggests alternate strategy

Yesterday's post considered the factors affecting our current strategy for removing the contaminating Rd-derived reads from our 'uptake' samples.  Since then I've looked at the read-mapping results using our new perfected reference sequences.  The problem of reads that have no SNPs and thus map ambiguously to both Rd and NP/GG is worse than we thought.  Below I'll describe this data and the new strategy I want to consider.

The former post-doc sent us a spreadsheet with all the summary data from mapping each of our samples ('input', 'uptake' and controls) to various combinations of genomes.  I'll start by considering what we see when we map the 'input' samples (NP and GG) and the recipient sample (Rd) onto artificial genomes consisting of an Rd-genome 'chromosome' and an input-genome 'chromosome' (NP or GG).  These samples have no contamination, so any reads that map to the wrong 'chromosome' will do so because of these factors (also described in the previous post):

  1. Sequencing errors:  Some NP-derived (or GG-derived) reads will have sequencing errors that create the Rd alleles of SNPs, causing them to align better to the Rd homolog. 
  2. Shared segments: Although the Rd and NP (or GG) genome sequences differ on average by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical, including each genome's six copies of the ribosomal RNA genes.  Reads from within these segments cannot be attributed to one source and are randomly assigned locations by the mapping program.  Half of them will be assigned to Rd and half to NP (or GG), regardless of the actual level of contamination, and all will be given MAPQ scores = 0.
Here are the results.  I've deleted the columns with the raw coverage data and just show the % of reads in each sample that were incorrectly assigned.



Depending on the combination of reference genomes used, between 4.75% and 7.61% of the reads are assigned to the incorrect genome, either due to sequencing errors or because the segment they cover is identical in both genomes (does not include any SNP positions).  

The likelihood that a read will be mis-mapped depends on its length - short reads are more likely to not include any SNP positions.  This probably explains the lower mis-assignment rates of the 'long' NP and GG samples.
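A quick back-of-the-envelope calculation supports the read-length effect.  It assumes SNPs are spread uniformly at ~2.5% density (they aren't, since the shared identical segments described below are the bigger contributor, but the length dependence still holds):

```python
# Probability that a read contains no SNP, under a naive uniform-SNP model.
snp_density = 0.025                      # ~2-3% divergence in alignable regions
for read_length in (50, 100, 150, 250):
    p_no_snp = (1 - snp_density) ** read_length
    print(f"{read_length} bp read: {p_no_snp:.1%} chance of containing no SNP")
```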

Very little of this mis-assignment is due to sequencing errors.  The table below shows how the % mis-assignments change when the reads with MAPQ scores of 0 are removed (XXX rows).


Excluding the ambiguously mapping reads dramatically reduces the mis-mapping, to between 0.05% and 0.21%.  This residual mis-mapping reflects reads that did not match the correct genome, but matched a SNP allele in the incorrect genome.  This will be due to sequencing errors that change the SNP allele.

With this information we can calculate the actual fraction of reads that are ambiguously mapped.  Because the mapping program splits the ambiguously mapping reads 50:50 between the two genomes, the reads we see on the wrong genome are only half of them, so:

fraction of reads ambiguously mapped = 2 * (% mis-mapped reads - % due to sequencing errors)
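Plugging in the ranges above (the pairing of the low and high values is just for illustration):

```python
# 2 * (mis-mapped fraction - sequencing-error fraction) estimates the fraction of all
# reads that map ambiguously, since only half of those land on the wrong genome.
for mis_mapped, seq_errors in [(0.0475, 0.0005), (0.0761, 0.0021)]:
    ambiguous = 2 * (mis_mapped - seq_errors)
    print(f"mis-mapped {mis_mapped:.2%} -> roughly {ambiguous:.1%} of reads map ambiguously")
```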

This is a lot more than I expected!  Only a small part of the mis-mapping is due to the 6 ribosomal RNA operons.  Each operon has about 4.5 kb of identical sequence, which will cause about 25 kb (1.4%) of the genome to map ambiguously.  Most of the ambiguous mapping must thus be due to identical sequences in other parts of the genome.  These are likely to be mainly in coding sequences that are either highly conserved or that are identical by descent due to previous recombination between recent ancestors of these strains.

New plan for contamination correction:  Our original contamination-correction plan was to remove the contaminating reads from the uptake samples.  We planned to use mapping to concatenated 'genomes' to identify and discard reads that mapped unambiguously to Rd and reads that mapped ambiguously to both genomes.  But the above analysis shows that we would be discarding a large fraction of the non-contaminating (NP or GG) reads from many parts of the genome (probably the most conserved parts).

So here's an alternative.  Instead of trying to remove the contamination from the 'uptake' samples (the numerator in our uptake-ratio calculation), let's add fake contamination to the 'input' samples (the denominator in our calculation).  This idea was suggested by some fake-contamination samples the former post-doc created to do some previous tests.

Here's how it would work (a rough code sketch of steps 2-6 follows the list):
  1. First we estimate the actual contamination level of each 'uptake' sample, by mapping it onto the concatenated pair of genomes and using the above analysis to correct for the mis-mapping.   (This calculation is shown below.)
  2. Then we create, for each uptake sample, a fake 'input+contamination' sample that matches the Rd contamination of the corresponding uptake sample, by adding the appropriate number of (randomly-selected?) Rd reads to its reads. 
  3. Then we map each sample's reads (uptake and input+contamination) to the genome it's supposed to represent, not paying any attention to possible contamination.
  4. Then we normalize the coverages of all these mapped samples to the same size (reads per million bp or ???).
  5. Then we calculate the uptake ratios for each uptake sample, by dividing the normalized 'uptake' coverage at each genome position by the corresponding normalized 'input+contamination' coverage. 
  6. Finally, we calculate mean uptake ratio at each position for the sets of three replicate uptake samples.  
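A minimal sketch of steps 2-6, assuming reads have been loaded into lists from FASTQ files and coverages are per-position numpy arrays; the function names are placeholders, not anything from our existing pipeline, and the reads-per-million normalization is just one option for step 4:

```python
# Sketch of the 'add fake contamination to the input' strategy.
import random
import numpy as np

def spike_reads(donor_reads, rd_reads, contamination):
    """Add randomly chosen Rd reads so they make up `contamination` of the final pool.
    Assumes enough Rd reads are available to sample from."""
    n_rd = int(len(donor_reads) * contamination / (1 - contamination))
    return donor_reads + random.sample(rd_reads, n_rd)

def normalize(coverage):
    """Scale per-position coverage so samples of different sizes are comparable."""
    return coverage * 1e6 / coverage.sum()

def uptake_ratio(uptake_cov, input_plus_contam_cov):
    """Per-position uptake ratio; positions with zero input coverage become NaN."""
    u, i = normalize(uptake_cov), normalize(input_plus_contam_cov)
    return np.where(i > 0, u / i, np.nan)

def mean_ratio(replicate_ratios):
    """Step 6: average the three replicate ratio tracks position by position."""
    return np.nanmean(np.vstack(replicate_ratios), axis=0)
```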


Actual contamination:  The calculations to estimate the actual contamination level of each 'uptake' sample took a bit of algebra to set up (see *** below), but I checked them a couple of ways and I'm pretty sure they're correct.


The final column is the calculated contamination level for each sample.  This is the fraction of the DNA in the sample that came from the chromosomes of the recipient Rd cells instead of from donor NP or GG DNA taken up by these cells.  These values are surprisingly close to the values in yesterday's post, which were calculated in a completely different way, using only the reads that did not map ambiguously.  That's fine - it tells us that there's nothing special about the ambiguously mapping reads.

I'm not absolutely sure that this strategy will eliminate the effects of contamination (we can test it with a fake-contaminated uptake sample).  If it does, we should see that the uptake ratio peaks are better correlated with the quality-scores of the uptake sequences (the large scatter we saw previously should be reduced).  And it will certainly reduce the problems associated with the many regions with low or zero coverage created by discarding reads with MAPQ=0.

What next: The first step is for the grad student and former post-doc to read and comment on this post.  If they agree that what I'm proposing is a reasonable approach, we'll test it.  

What about using spanning coverage instead of single reads? The grad student is now generating the spanning-coverage data, but the goal of using it to correctly position the ambiguous reads is challenging.  In principle we won't need to tag individual reads as ambiguous if we use this new contamination-correction strategy.

***  The contamination calculation:

The input and control analysis above tells us what fraction of the reads mapping to Rd in each uptake sample results from mis-assignment of ambiguously mapping reads rather than from contamination.

I worked out the calculation by instead considering an imaginary sample whose contamination level C was known:
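The algebra itself isn't reproduced here, but one way it could be set up (my reconstruction, under the simplifying assumption that essentially all contaminating Rd reads map to Rd) is:

```python
# My reconstruction of the contamination calculation, not necessarily the spreadsheet's formula.
# m     = fraction of a pure donor sample's reads assigned to Rd (from the input/control analysis)
# f_obs = fraction of an uptake sample's reads assigned to Rd
# If the true contamination is C:  f_obs = C + (1 - C) * m,  so  C = (f_obs - m) / (1 - m)
def contamination(f_obs, m):
    return (f_obs - m) / (1 - m)

print(contamination(f_obs=0.20, m=0.06))   # illustrative numbers only
```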



Coverage and contamination in our DNA-uptake dataset

We finally have the corrected reference genome sequences for our DNA uptake deep-sequencing data.  The genome sequences from NCBI had errors and polymorphisms relative to the specific strains we used in the experiments, so the former post-doc used the sequence data from this experiment to create reference sequences that perfectly match the DNAs we used.

The bright-coloured table below shows 8 of our 16 samples.  These are the samples that examined uptake of DNA from strain 'NP' (86-028NP) into strain 'Rd'.  Not included are the equivalent 8 samples that used DNA from strain 'GG' (PittGG).  These reads have been mapped onto a reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome, and all reads that map equally well to both genomes have been removed (reasons for this are explained below).   (The numbers in the table are from our earlier preliminary data analysis, so they will probably change a little bit with the new analyses.)

Issues to consider:

Reads or spanning coverage:  Should our analysis use coverage data derived from considering each sequencing read independently (what we have been doing) or from the 'spanning-coverage' sequences created by considering each pair of paired-end reads plus the intervening sequence?


We agree that spanning coverage will be better.  The former post-doc was going to create the BED files with the spanning coverage data but he got bogged down in other work, so the grad student is now working on this.  For now we're using the read-based BED files.
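For concreteness, here's roughly what spanning (fragment) coverage means in code, using pysam on a hypothetical sorted BAM of properly paired reads; the real pipeline works from BED files, so this is just to illustrate the fragment-level counting:

```python
# Spanning coverage: count each read pair once, over the full fragment it came from.
import numpy as np
import pysam

bam = pysam.AlignmentFile("uptake_sample.bam", "rb")   # placeholder file name
genome_length = bam.lengths[0]                         # assumes a single reference sequence
spanning = np.zeros(genome_length, dtype=int)

for read in bam:
    # Use only the forward-strand mate of each proper pair, so each fragment is counted once,
    # from the leftmost mapped base through the gap to the end of the mate.
    if read.is_proper_pair and not read.is_reverse and read.template_length > 0:
        start = read.reference_start
        end = start + read.template_length
        spanning[start:min(end, genome_length)] += 1
```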

Mean coverage:  The numbers in the 'NP mean' column give the average number of reads from each sample that aligned to each genome position.  Mean coverage ranges from 166 reads (sample UP2) to 482 reads (sample UP9).  This is plenty for our purposes.

Variation in coverage across the genome:  Our preliminary analyses did not give us a good view of this, for complicated reasons.  In the next day or so the grad student should be generating these graphs for the 'input' samples (the UP13 and UP15 samples mapped onto NP, and the UP14 and UP16 samples mapped onto GG).  This will let us see how even the coverage is across each genome, and how reproducible the variation is for each genome's two input samples (short and long).

Our concern is that there may be positions or segments with very low coverage (less than 10 reads). Such low coverage would reduce our ability to detect the effects of DNA uptake biases, which requires dividing the mean 'uptake' coverage at each position by the 'input' coverage.  If the low coverage is due to random noise in the analysis, it will be evened out for the 'uptake' data because we have three replicates for each DNA length.  But we have only a single 'input' replicate for each DNA length, so the divisor would still retain all the noise.

The relative roles of noise and biases will be checked by comparing the long and short input samples to each other, and the replicate uptake samples to each other.  Given our overall high coverage, the big problem may not be random events but sequence biases in the procedures used for library creation and sequencing.  Unfortunately these would not be averaged out over the three uptake replicates.

One complication is the presence in the genomes of repeated sequences, especially the six copies of the 16S and 23S ribosomal RNA genes.  The mapping software doesn't know which copy reads from within these segments are derived from, so for each such read it randomly chooses one repeat copy as the mapped location, and assigns the read a map quality score (MAPQ) of 0.  This random assignment is not a problem when the sample consists of reads from a single genome with no contamination, but it contributes to the difficulty of eliminating reads from Rd DNA that contaminates the uptake samples (described below).


Presence of contaminating Rd DNA in the uptake samples:  The 'uptake' samples consist of NP (or GG) DNA recovered after being taken up by Rd cells.  The recovery process is not perfect, and each sample contains a significant amount of contaminating Rd DNA, whose reads need to be removed before the effects of uptake biases can be determined.

The former post-doc found that the Rd reads could be removed by aligning each uptake sample to a fake reference 'genome' consisting of the Rd and NP genomes, concatenated as if they were two chromosomes of a single genome.  In this 'competitive' alignment, almost all the contaminating Rd-derived reads are mapped to the Rd 'chromosome' (where they can be ignored) and almost all the desired NP-derived reads are mapped to the NP 'chromosome' (where they can be analyzed for evidence of uptake biases).

Why 'almost all' rather than 'all':  Several factors will cause reads to be mapped to the wrong 'chromosomes' in this contamination-removal step.
  1. Sequencing errors:  First and least important, some NP-derived reads will have sequencing errors that create the Rd alleles of Rd-NP SNPs, causing them to align better to the Rd homolog.  This will slightly raise the apparent baseline level of Rd contamination but isn't a serious problem.  The reverse will also happen, but we can ignore those reads too, especially because Rd-derived reads are the minority.  
  2. Repeated segments:  Second, the rRNA genes of Rd and NP have identical sequences, so when these are randomly assigned locations (see above), half of them will be assigned to Rd and half to NP, regardless of the actual level of contamination, and all will be given MAPQ scores = 0.  This is potentially a serious problem, since it reduces the apparent NP-derived coverage of these regions by 50%.  Because there are no uptake sequences in the rRNA genes, provided we treat the input and uptake samples identically this will only affect our ability to see effects of uptake biases in sequences within a few hundred bp of the repeated segment.   
  3. Shared segments:  Third, although NP and Rd genome sequences differ, on average, by 2-3% in alignable regions (= most of their genomes), there will be some segments where their sequences are identical. The mapping software will treat reads from these segments just like it treats reads from the rRNA repeats, randomly placing them at one of the two possible positions (in Rd or NP) and assigning them MAPQ scores = 0.
(Factors 2 and 3 are not significant right at the edges of repeated and shared segments, but become more important as the distance from the nearest SNP approaches the average read length.  Use of spanning coverage rather than single reads should improve our ability to identify and remove the contamination.)
The former post-doc handled factors 2 and 3 by removing from the analysis all reads that had MAPQ=0.  The data in the table above shows the read coverages AFTER the reads with MAPQ=0 have been removed.  The UP13 and UP15 input samples still have a small fraction (about 0.1%) of Rd contamination rather than 0% contamination.  This may be due to Factor 1 above.
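For reference, this MAPQ filtering can be done with something like the following pysam snippet (or equivalently with `samtools view -q 1`); the file names are placeholders:

```python
# Drop the ambiguously mapped (MAPQ=0) reads from a BAM file.
import pysam

with pysam.AlignmentFile("uptake_vs_Rd_NP_concat.bam", "rb") as bam, \
     pysam.AlignmentFile("uptake_mapq_filtered.bam", "wb", template=bam) as out:
    for read in bam:
        if read.mapping_quality > 0:    # keep only reads with an unambiguous best location
            out.write(read)
```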

For the ribosomal repeats (and other repeats) removing all the MAPQ=0 reads may be valid, because we genuinely do not know where they came from.  But it's more complicated for the shared segments.  If uptake coverage were even over the genome, we'd know that most of the reads came from NP.  But uptake coverage is not even, especially for the 'short' DNA samples (that's the whole point of the experiment).  Positions close to uptake sequences will have high NP coverage, and only a small contribution from Rd contamination.  Positions far from uptake sequences will have very low NP coverage, so the even contribution from the Rd contamination will have a big effect.
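Some toy numbers (mine, not from the data) make that last point concrete:

```python
# Why an even Rd contribution matters much more at low-coverage positions than at uptake peaks.
rd_contamination_depth = 10              # assumed roughly even across the genome
for np_depth in (300, 20):               # near an uptake sequence vs far from one
    total = np_depth + rd_contamination_depth
    print(f"NP depth {np_depth}: contamination is {rd_contamination_depth / total:.0%} of the coverage")
```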