Field of Science

Super-ultra-high-throughput sequencing? done cheap?

A potential collaborator/coworker has suggested what I think would be a great experiment, if we can find a way to do it economically.  But it would require sequencing to very high coverage, and I know almost nothing about the current economics of that.

Basically, we would want to sequence a pool of Haemophilus influenzae DNA from a transformation experiment between a donor and a recipient whose (fully sequenced) genomes differ at about 3% of positions, as well as by larger insertions, deletions and rearrangements.  The genomes are about 2 Mb in length.  The DNA fragments could have any desired length distribution, and the amount of DNA is not limiting.

Ideally we would want to determine the frequencies of all the donor-specific sequences in this DNA.  For now I'll limit the problem to detecting the single-nucleotide differences (SNPs).  And although we would eventually want to have information about length/structure of recombination tracts, for now we can consider the frequency of each SNP in the DNA pool as an independent piece of information.

At any position, about 1-5% of the fragments in the pool would have a base derived from the donor strain.  This means that simply sequencing to, say, 100-fold coverage of the 2 Mb genome (2 x 10^8 bases of sequence) would be sufficient to detect most of the SNPs, but with only about 1-5 donor-derived reads expected per position it could not establish their relative frequencies.  Increasing the coverage 10-fold would give us lots of useful information, but even higher coverage would be much better.

The collaborator and the metagenomics guy in the next office both think Solexa sequencing is probably the best choice among the currently available technologies.  Readers, do you agree?  Are there web pages that compare and contrast the different technologies?  What should I be reading?

One possibility I need to check out is skipping sequencing entirely and instead using hybridization of the DNA pool to arrays to directly measure the relative frequencies of the SNPs.  Because the pool will contain mainly recipient sequences, we'd want to use an array optimized to detect donor sequences against this background.  How short do oligos need to be to detect SNPs?  How specific can they be for a rare donor base against a high background of recipient sequences?  Is this the kind of thing Nimblegen does?


  1. Hey Rosie,

    Been a while since I looked into this sort of thing, but it sounds like it's up Nimblegen's alley. Maybe contact Mickael in the Doebeli lab, I can't recall which technology he wound up choosing, or why, but he might've done the background research.

  2. The forum is useful. You can find threads there on 454 vs. Illumina (Solexa) vs. SOLiD.

    I don't have any experience of 454 or SOLiD, but I do handle, on average, several lanes of Solexa data a week, including bacterial. I'm in software and analysis, not in the lab, so I can't advise on that side of things.

    There's quite a range of free and open source read mappers and assemblers that will work on a desktop machine. With a small genome you may well find that further processing the data with Perl is acceptable, in terms of time/space, since that's where you already have some expertise.

