A future postdoc and I are replanning an experiment I discussed a couple of years ago and included in an unsuccessful grant proposal, designed to finally find out the real sequence specificity of the DNA uptake machinery. At that time I only proposed a little bit of sequencing, but sequencing keeps getting cheaper so this time we're going to do it right. This blog post is to help me get the numbers clear.
In my original plan we'd start with a pool of mutagenized (or degenerately synthesized) USS, either as short fragments (say 50bp) or inserts in a standard plasmid. The USS in the pool would have, at each position of the 30bp USS, a 91% probability of carrying the consensus base and a 3% probability of carrying any one of the 3 other bases. The flanking sequences might be mutagenized too, or the mutagenized fragment might have been ligated to a nonmutagenized tag or plasmid sequence. We'd give this DNA to competent cells, reisolate the DNA that had bound to the cells, and sequence it. We'd also sequence the 'input' pool. The differences between the 'input' and 'bound' pool sequences would define the specificity.
Only about 20 positions of the USS show a significant consensus, so in the analysis below I'm going to assume that the USS is 20bp long. On average each USS would have about 2 changes away from the consensus in these 2obp, but the range would be broad. Figuring out how broad is part of the reason for this post.
For example, what fraction of the fragments would have no differences from the consensus in the relevant 20bp? That's (0.91)^20, which Google says is about 0.15. What about fragments with 1 difference? I think that's about 20 * 0.09 * (0.91)^19, because there are 20 different possible locations for the difference. That's about 0.3. I fear the calculation will be more complicated for the larger numbers of differences. A similar calculation as the second one above gives 0.56 as the frequency of USSs with 2 mismatches, but that's unlikely to be correct because the sum of 0.15 + 0.3 +0.56 =1.01, leaving no possibility of USSs with more than two differences from the consensus. So for the analysis below I'll just use some fake numbers that I think are plausible: 0.1 with no difference, 0.25 with one difference, 0.3 with two differences, 0.2 with three differences, and 0.15 with four or more differences.
How is a sequencespecific uptake system likely to act on this variation? Let's assume that fragments with 'perfect' consensus USSs are taken up with probability 1. For the simplest version, let's assume all oneposition differences have the same effect, reducing binding/uptake to 0.1, and all twoposition differences reduce it to 0.01, etc. The 'bound' pool would then be expected to contain 0.78 perfects, 0.195 oneoffs, 0.023 twooffs, 0.0015 threeoffs and 0.0001 USSs with four or more differences.
How much sequencing of the 'bound' pool would we have to do to have included all of the threeoff sequences (i.e. sequenced each version at least once)? Only one sequence in about 1000 will be a threeoff, and there are about 7000 different ways to combine the positions of the changes (20*19*18). But there are three ways to be different at each position that's different... Yikes, I'm in over my head.
OK, let's do it for the oneoffs first. There are 20 different locations for the difference, and three possible differences at each location, so that's 60 different versions. About 0.2 of all sequences will be in this class, so we will need to sequence a few thousand USSs to get reasonable coverage. What about the twooffs. There are 20*19=380 possible pairs of locations for the differences, and 9 possible differences for each pair of locations, so that's 3420 different sequences. Only about .023 of the fragments in the pool would have two differences, so we'd need to sequence about 150,000 USSs to get onefold coverage, say about a million sequences to get good coverage. For the threeoffs, the numbers are .0015, 6840, and 27, giving almost 200,000 different sequences, with about 1.5x10^8 sequences needed for onefold coverage (say 10^9 for reasonable coverage).
If I instead assume that each mismatch reduces binding/uptake by only 0.5, then the 'bound' pool would have 0.3 perfects, 0.37 oneoffs, 0.22 twooffs, 0.07 threeoffs and 0.03 USS with four or more differences from the consensus. The corresponding numbers of fragments sequenced for reasonable coverage of oneoffs, twooffs and threeoffs would be about 1000, 100,000, and 10^8.
A more realistic model should have some positions more important that others, because that's the main thing we want to find out. What if differences at one of the positions reduces uptake to 0.1, and differences at the other 19 reduce it only to 0.5? We'd of course recover more of the differences in the lessimportant positions than of the differences in the important positions.
How does thinking about this change the amount of sequencing we'd need to do? If all we're interested in is the different degrees of importance of positions considered singly, then we'd easily see this by doing, say, enough to get 100fold coverage of the important single mismatches. Even 100fold coverage of the unimportant positions would be quite informative, as we only need enough to be confident that differences at the one 'important' position are underrepresented in our sequences. So 10,000 USS sequences from the 'bound' pool would be plenty to detect various degrees of underrepresentation of differences at the important positions.
But what if we also want to detect how differences at different positions interact? For example, what if having two differences beside each other is much worse for uptake than having two widely separated differences. Or if having differences at positions 3 and 17 is much worse than having differences at other combinations of positions? Or having A and C at positions 3 and 17 is much worse than having other nonconsensus bases at those positions?
We would certainly need many more sequences to detect the scarcity of particular combinations of two or more differences. The big question is, how many? Consider just the twooffs. 100,000 sequences would let us get almost all of the nonimportant twooff variants at least once, and most of them about 510 times. But that wouldn't be enough to confidently conclude that the missing ones were not just due to chance  we'd need at least 10 times that much sequence.
How much sequencing is that? If the fragments are 50bp, and we want, say, 10^6 of them, that's 5x10^7 bp of sequence. Future postdoc, that's a modest amount, right?
Given a sufficiently large data set, we do have access to software that would detect correlations between sequences at different positions (we used it for the correlation analysis in the USS manuscript).
Once we had identified candidates for significantly underrepresented sequences, I wonder if there if a technique we could use to go back to the pool and confirm that these sequences were genuinely underrepresented? Something analogous to validating microarray results with QPCR? Maybe the reverse? Ideally, we would have an oligo array with every possible twooff USS, and hybridize our bound pool to it. Probably not worth the trouble.
The other reason I'm writing this post is to figure out how much DNA we'd need to start with, in order to end up with enough sequences to define the specificity. If the 'bound' pool contained 1 microgram of 50bp fragments  that would be 10^13 fragments. This should be enough to encompass way more diversity than we would ever be able to sequence. To get 1 microgram we'd need to start with an awful lot of cells, but even if we cut this down to 1 ng we'd still be fine.
 Home
 Angry by Choice
 Catalogue of Organisms
 Chinleana
 Doc Madhattan
 Games with Words
 Genomics, Medicine, and Pseudoscience
 History of Geology
 Moss Plants and More
 Pleiotropy
 Plektix
 RRResearch
 Skeptic Wonder
 The Culture of Chemistry
 The Curious Wavefunction
 The Phytophactor
 The View from a Microbiologist
 Variety of Life
Field of Science


On Patrick Blackett, the ideal experimental physicist, and what it takes to excel at interdisciplinary research3 hours ago in The Curious Wavefunction





Zika virus poses a greater threat than we thought1 week ago in Genomics, Medicine, and Pseudoscience


Once again, with the real bicyclomycin!1 week ago in RRResearch




Citizen Science Works!3 months ago in Games with Words






post doc job opportunity on ribosome biochemistry!1 year ago in Protein Evolution and Other Musings

Growing the kidney: reblogged from Science Bitez1 year ago in The View from a Microbiologist


Blogging Microbes Communicating Microbiology to Netizens1 year ago in Memoirs of a Defective Brain

Rule of 6ix has moved2 years ago in Rule of 6ix




The Lure of the Obscure? Guest Post by Frank Stahl4 years ago in Sex, Genes & Evolution


The Large Picture Blog Has Moved4 years ago in The Large Picture Blog

Lab Rat Moving House4 years ago in Life of a Lab Rat

Goodbye FoS, thanks for all the laughs4 years ago in Disease Prone

Branson getting into microbial diversity in the deep sea5 years ago in The Greenhouse
Not your typical science blog, but an 'open science' research blog. Watch me fumbling my way towards understanding how and why bacteria take up DNA, and getting distracted by other cool questions.
5 comments:
Markup Key:
 <b>bold</b> = bold
 <i>italic</i> = italic
 <a href="http://www.fieldofscience.com/">FoS</a> = FoS
Subscribe to:
Post Comments (Atom)
"If the fragments are 50bp, and we want, say, 10^6 of them, that's 5x10^7 bp of sequence. Future postdoc, that's a modest amount, right?"
ReplyDeleteSounds pretty reasonable, me thinks. We should probably think in terms of number of sequence reads, rather than base pairs sequenced. So my sources say singleend sequencing by Solexa reliably picks up ~30 base pairs per read, so that's our target size.
And the number of reads ranges from 115 million, with the usual number in the neighborhood of 5 million.
So... With your approximations and 5 million reads of the input library:
10% perfect match > 500K
25% 1 mismatch > 1.25 million reads of 60 sequences (~20,000X)
30% 2 mismatch > 1.5 million reads of ~3500 sequences (~400X)
20% 3 mismatch > 1 million reads of ~200,000 sequences (5X)
15% 4 mismatch > 750k ... (<<1X)
So it looks like the approach works well for up to 2 mismatches, somewhat for 3 mismatches, and not at all for 4 mismatches.
What about the "unconserved" positions? If we do the same kind of 9% substitution for the remaining 10 bps in our 30mer USS, will we ruin our ability to look at correlations between conserved positions?
I think I found an error in your calculations of how many distinct 2, 3, and 4 base mismatches there'd be.
ReplyDeleteSo for 2 mismatches, 20*19 overestimates. Instead, it's 19+18+17+...+1, or 20*19/2 = 190.
For 3 mismatches, it's 190*18, since for each 2 base mismatch, there are 18 additional sites that could be the 3rd mismatch.
For 4 mismatches, I think it's 190*(18*17/2)...
errors states sequences
0 1 1
1 20 60
2 190 1,710
3 3,230 87,210
4 29,070 2,354,670
In the same way, the % of the library with a given mismatch number changes.
ReplyDeleteSo for 2 mismatches, I think it's (20*19/2)*(0.09^2)*(0.91^18), or 28.2%, closer to your arbitrary estimate than the calculated 56%
I'm still completely flummoxed by 3 mismatches, so maybe this is still all off...
@ FP comment 1:
ReplyDeleteI think also mutagenizing parts of the fragment that don't affect uptake (its full length) wouldn't change the numbers.
EUREKA! It's the binomial probability...
ReplyDeleten = mer length = 20
k = # mismatches
p = freqency mismatch = 0.09
q = frequency match = 0.91
so the binomial is,
n!/(k!(nk)!)*(p^k)*(q^(nk))
and then for different k:
P(k=0) = 15.2%
P(k=1) = 30.0%
P(k=2) = 28.2%
P(k=3) = 16.7%
P(k=4) = 7.0%
P(k=5) = 2.2%
...
P(k=20) = 1.2e21
sum = 100%