The big progress came because we realized that we needed to stop looking at the data for the whole genome and instead examine a representative 5 kb segment. This has allowed us to relate the results of each analysis to the specific sequence features and uptake data for each position in the segment. So now we have a pretty good understanding of what the various analyses can show us, and what they can't.
Rather than detailing what we learned, here I want to consider what our goals are, and what steps we should take.
Goals: For the analysis of transformation frequencies (the bigger project this work is part of), we want to know how much of the variation in transformation frequencies across the genome is due to differences in DNA uptake. In principle this could just be a number, e.g. 37%.
I guess one (mindless) way to do this would be just to subtract the differences in uptake from the differences in transformation. I don't know whether the former post-doc has done this - I'm pretty sure we haven't discussed it.
A second approach would be to determine the extent to which the already-characterized effect of USS (uptake signal sequences) on DNA uptake explains differences in transformation across the genome. Doing this doesn't require any of the new DNA-uptake sequencing data, just the sequence of the genome of the DNA source. The former post-doc has done simple versions of this, and he has a rotation student working on a more sophisticated version.
We (the PhD student and I) are instead using the new sequence data to improve our understanding of how DNA sequences determine how efficiently a fragment will be taken up by a competent cell. This better understanding can be then used to predict the contribution of uptake to the transformation differences (as above), but its main value is more direct - understanding how DNA sequence differences affect uptake will help us understand the evolution of uptake biases and uptake signal sequences, in H. influenzae and other organisms.
So what have we learned so far:
Size distribution of the input DNA: We don't yet have the direct DNA-analyzer data on length distribution. But we can indirectly estimate this by looking at the graphs of uptake ratio as a function of genome position. Positions that are more than 500 bp from the center of an uptake peak (location of a USS) have a very small uptake ratio (~ 0.01, often not distinguishable from zero). This means that almost all of the fragments in the short DNA sample were shorter than than 500 bp. The mid-height widths of the (well-separated) peaks are about 400-500 bp, indicating that the average fragment was about 200-250 bp. I haven't taken the time to get the best image for this analysis,so we can be more precise than this.
Importance of USS: It's abundantly clear that most of the variation in uptake seen in our 'short' DNA sample is due to the locations of 'USS', sequences with strong matches to the USS motif. Most fragments containing a strong match (score > 20 with the 'genomic' scoring matrix) are taken up several hundred times more efficiently than fragments without a good match.
We've only examined 5 kb in detail, but so far all the uptake peaks we've examined are centred on positions with strong USS scores. The height of the peak correlates with the score.
Importance of the USS scoring matrix: We have two types of position-weight matrices for scoring how well a sequences matches an uptake-promoting motif.
The first is the 'genomic' matrix that the PhD student has been using so far., shown in the figure below. It's based on analysis of abundant USS elements in the H. influenzae Rd genome, identified using the Gibbs Motif Sampler (Maughan et al. 2010). In the figure each bar represents a position in the motif, and its height represents the 'information content' at that position (the sum of the weighted values of each base at that position in the table).
The genomic analysis means that this matrix doesn't directly represent the preferences of the uptake machinery, but rather some combination of these preferences with other factors affecting how sequences accumulate in the genome over evolutionary time.
The second type of matrix comes from the former post-doc's direct analysis of uptake biases, done using a synthetic DNA fragment containing a degenerate USS (Mell et al. 2012). This 'uptake' matrix gives a motif with a strong consensus only for the a much smaller region, with only four very important bases.
We haven't yet analyzed any genome uptake data using this matrix, but it's high on our priority list. We expect similar results with both matrices, but the uptake matrix may be better because it's directly based on uptake data.
How will we decide if it's 'better'? Here, 'better' means that position USS scores better predict the uptake ratios of nearby sequences. We're still working our way to deciding the best way to do this. In addition to the USS score from the matrix, the prediction will need to consider how far the position is from the nearest 'USS' (on a list using a good score cutoff), whether fragments containing it are likely to contain more than one 'USS', the size distribution of the DNA fragments in the prep). Maybe some of this would be incorporated in a matrix of USS scores and distances...
Ideally (i.e. if computational time and resources were unlimited), for each focal position whose uptake we want to predict, the uptake prediction would incorporate:
- the USS scores at each distance from it (two scores for each distance), weighted by our observed correlation between USS score and height of uptake ratio peak
- For each distance, a weighting factor that reflects the probability that the focal position is in the same DNA fragment as the sequence being scored (based on the measured size distribution of the input DNA prep)
- A factor reflecting the interactions between USS scores at different positions, weighted by the probability that both USS would be in the same fragment.
In practice, our job is to characterize these effects and then distill the important ones into a computationally simple prediction algorithm.