What are the steps (the challenges)?
- synthesizing the degenerate USS
- putting the degenerate USS in a plasmid vector
- having cells selectively take up plasmids from the pool
- reisolating the plasmids that have been taken up
- sequencing representative USS from the two plasmid pools (input and taken-up)
- interpreting the sequences (characterizing the sequence diversity)
Steps 1 and 2. We'll obtain the degenerate USS as a pool of degenerate oligos, each with a 20 nt adapter on its 3' end. We can then make double-stranded versions by priming with a sequence complementary to the adapter, using Klenow polymerase or Taq. Taq will leave A overhangs on both 3' ends, which we'll use to ligate the USSs into pTOPO or another TA-cloning vector. We'll order the oligos from a supplier, specifying the 30 nt USS consensus and the flanking adapter sequence, and that each position should contain 3% of each of the other three nucleotides. They will probably need to be synthesized as a special batch, using specially prepared input nucleotide mixes (e.g. for A positions, 91% A and 3% each of G, T and C).
The 9% degeneracy was chosen to give an average of about 3 non-consensus bases per USS (30 positions x 0.09 = 2.7), but I haven't done the math to predict the expected distribution. This matters because we don't want the mix to consist mostly of USS variants that are just as good as the consensus. The USS is 30 nt long, with 9 positions that show no consensus (bases at these positions probably don't influence uptake at all), leaving 21 positions that matter, some probably mattering much more than others. That means the average USS will have only about 2 non-consensus bases at positions that matter, and many will have the perfect consensus there. I suspect this degeneracy is too low - I'll work out the distribution with the post-doc today.
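To make that worry concrete, here's the binomial calculation (a quick sketch, using the 21 informative positions and 9% per-position degeneracy from above):

```python
from math import comb

# Per-position chance of a non-consensus base: 3% each of the
# three alternative bases = 9% total (the planned degeneracy).
p = 0.09

def mismatch_dist(n_positions, p, k_max=6):
    """Binomial P(exactly k non-consensus bases) for k = 0..k_max."""
    return [comb(n_positions, k) * p**k * (1 - p)**(n_positions - k)
            for k in range(k_max + 1)]

# Distribution over the 21 positions that appear to matter for uptake
for k, prob in enumerate(mismatch_dist(21, p)):
    print(f"{k} mismatches at informative positions: {prob:.3f}")

# Fraction of oligos that are perfect consensus at those positions
print(f"perfect at informative positions: {0.91**21:.3f}")
```

This says roughly 14% of the oligos will be perfect consensus at the informative positions and another ~29% will differ at only one of them - which supports my suspicion that 9% degeneracy may be too low.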
Step 3. The cells need to be as competent as possible (to maximize uptake and recovery), but they also need to be given the DNA at a concentration that maximizes their selectivity. So they shouldn't be given so much DNA that the uptake system is saturated. 200 ng/ml of DNA is saturating for standard preps of competent cells, so we'll give these cells 100 ng/ml. We usually have ~ 10^9 cells/ml, so if they each take up 10 plasmids we'll have a maximum of ~10^10 plasmids to reisolate and sequence (assuming no losses). Depending on other constraints, the whole experiment could also be done with short USS-containing fragments rather than plasmids.
Step 4. Reisolating the plasmids that have been taken up may be easiest if we use a rec-2 mutant, because these cells can't take DNA past the first stage of uptake. The post-doc is getting ready to do some preliminary tests of DNA recovery, using radiolabeled DNAs. If the plasmids remain un-nicked they can be isolated using a miniprep kit. If we can't reisolate them away from the chromosomal DNA we can instead PCR-amplify the USS sequences, using flanking plasmid and adapter sequences for the primers and the minimum number of PCR cycles to maintain diversity (to reduce amplification artefacts). We'd better think carefully about much diversity our recovered DNA will have.
Step 5. The post-doc and I had a long discussion about the sequencing yesterday. For thorough sequencing of the input and taken-up pools we'd probably use one or two Illumina lanes - I need to be coached on the details of how this works. But for preliminary characterization (for the grant proposals) we could get by with much less sequencing - even sequencing just 100 of the recovered USSs should be enough to demonstrate that things are working. If the recovered USSs are in plasmids, we can just transform these into E. coli and do conventional Sanger/capillary sequencing of some inserts - this wold be inefficient but not overly expensive. If the recovered DNA is in short linear fragments (from PCR?), I was originally suggesting that we'd ligate these end-to-end into blocks of 5-20 (depending on size) and clone these for conventional sequencing. But the post-doc pointed out that big problems can arise when sequencing repeats, so cloning and sequencing them singly might still be faster and cheaper than troubleshooting these problems.
Step 6. Analyzing 100 preliminary sequences for the grant proposal data will be no big deal, but analyzing many thousands from the Illumina runs will require a more sophisticated approach which I haven't thought through yet.