Our new post-doc is planning a project that fits wonderfully with our research questions. He's going to incubate competent cells with fragments of chromosomal DNA from a different H. influenzae strain (about 2% sequence divergence), and reisolate the DNA that has passed through different stages of uptake and recombination. The pools of DNA (each potentially about 10^10 different molecules, taken up by about 10^9 different cells) will then be intensively sequenced with the latest ultra-high-throughput technology. By simply comparing how many times each donor sequence appears in each pool, we will learn an enormous amount.
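The comparison of how often each donor sequence appears in each pool can be sketched in a few lines. This is only an illustration, not the post-doc's actual pipeline: the function name `uptake_ratios`, the pseudocount, and the idea of keying counts by a sequence or genome-window identifier are all assumptions made for the example.

```python
def uptake_ratios(input_counts, recovered_counts, pseudocount=0.5):
    """Relative enrichment of each sequence in a recovered pool vs. the input pool.

    Both arguments are dicts mapping a sequence (or genome-window) identifier
    to its read count. Counts are converted to within-pool frequencies so that
    pools sequenced to different depths can be compared. The pseudocount
    (an assumption of this sketch) avoids division by zero for unseen sequences.
    """
    keys = set(input_counts) | set(recovered_counts)
    in_total = sum(input_counts.values()) + pseudocount * len(keys)
    rec_total = sum(recovered_counts.values()) + pseudocount * len(keys)
    ratios = {}
    for key in keys:
        in_freq = (input_counts.get(key, 0) + pseudocount) / in_total
        rec_freq = (recovered_counts.get(key, 0) + pseudocount) / rec_total
        ratios[key] = rec_freq / in_freq
    return ratios
```

A ratio above 1 means the sequence is over-represented in the recovered pool relative to the input DNA; a ratio below 1 means it was taken up (or recombined) less often than average.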
The biggest thing we want to understand is the bias of the DNA uptake machinery towards the uptake signal sequence (USS) motif. Sites matching this motif are very common in the H. influenzae genome (~2000 sites), and we have characterized them in great detail, but we know very little about their role in uptake. This is a big problem, because we think that their role in uptake is the ultimate cause of their accumulation in the genome. We know that cells prefer sequences perfectly matching the core consensus (AAGTGCGGT) 10-100-fold over unrelated sequences. Our previous attempts to find out more about the effects of mismatches and of changes to the flanking-sequence consensus haven't gotten very far, partly because we were testing each variant USS separately, and partly because the results were not very consistent (high variation between experiments). This new project will not only measure the relative uptake of all the >2000 sequences closely matching the consensus, it will measure the uptake of all the other sequences in the genome too. The resolution can be manipulated by changing the sizes of the chromosomal DNA fragments used as input (100 bp, 1 kb, 10 kb), and if more diversity is wanted we can repeat the analysis with mixtures of DNAs from other strains and closely related species. We will represent this full uptake bias as a position-weight matrix for the 30+ bp of the extended USS, in the same way as we now represent the consensus of the USS in the genome. Comparing the two matrices will tell us the extent to which the sequence motifs found in the genome are produced by the bias of the uptake machinery.
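One way an uptake-bias position-weight matrix might be built is to weight each aligned site by how well it was taken up. The sketch below is a rough illustration under stated assumptions: `weighted_pwm` is a hypothetical helper, the sites are assumed to be already aligned and of equal length, and the uniform background is a simplification (a real analysis would use the genome's actual base composition).

```python
import math

BASES = "ACGT"

def weighted_pwm(sites, weights, background=None, pseudocount=1.0):
    """Log2-odds position-weight matrix from aligned, equal-length sites.

    Each site contributes its weight (for example a measured uptake ratio)
    instead of a plain count of 1. Setting every weight to 1 gives the
    ordinary genome-consensus style matrix, so the two can be compared.
    Returns a list with one dict (base -> log2 odds) per position.
    """
    if background is None:
        # Assumption: uniform background, chosen only to keep the sketch simple.
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start each column from background-scaled pseudocounts.
        totals = {b: pseudocount * background[b] for b in BASES}
        for site, weight in zip(sites, weights):
            base = site[pos]
            if base in totals:  # skip ambiguous bases such as N
                totals[base] += weight
        col_sum = sum(totals.values())
        pwm.append({b: math.log2((totals[b] / col_sum) / background[b])
                    for b in BASES})
    return pwm
```

Calling this once with uptake-derived weights and once with all weights equal to 1 would give the two matrices to compare, position by position.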
The project will answer a lot of other questions, both mechanistic and evolutionary. WHY the uptake machinery is biased is our biggest research question. We think that the bias arises for mechanistic reasons; i.e. the sequences favoured by the uptake machinery are those that are easiest to take up. One issue that the post-doc's experiments will resolve is whether the direction of DNA uptake is set by the orientation of the USS. This will tell us quite a bit about the uptake mechanism. The uptake analysis will also let us define a practical USS - the set of sequences that are consistently taken up much better than other sequences. At present most analysis has considered only sequences perfectly matching the core consensus of the motif. This is an easy place to draw a line between USS and not-a-USS but it is very unlikely to correspond with reality.
We may also be able to find out whether a distinct DNA-binding step precedes the initiation of uptake. DNA uptake is conventionally treated as being initiated after DNA is specifically bound to the cell surface. But we don't know if DNA does stick to competent cells in a way that it doesn't to non-competent cells, or if it does, whether the binding is sequence-specific. In principle I think that any protein-catalyzed step can be modeled as having separate binding and catalysis stages, but as a first step we just want to find out whether the protein or proteins that initially bind the DNA also play direct roles in its uptake. The alternative would be a protein that just binds DNA and holds it close to the cell surface, increasing its local concentration so it's more likely to encounter the uptake machinery. If we did find such a protein we'd want to find out whether it had any sequence specificity. The post-doc's experiments won't be able to tell us whether the protein exists, but he may be able to look for a specific DNA fraction that has bound to competent cells but not started uptake.
The final part of his experiments will sequence DNA from a pool of 10^9 cells that have recombined fragments of donor DNA into their chromosomes. This will be the biggest sequencing challenge, as only about 1% of each genome will be recombined donor sequences. But it will give a big payoff for understanding both the mechanistic factors limiting recombination and the evolutionary consequences of recombination between diverging lineages.
For example, are USS sufficiently close together that they cause recombination to be high and even across the whole genome? Or does recombination reveal substantial disparities in coverage that correlate with differences in USS spacing? The answer is likely to depend on the lengths of the DNA fragments the cells are given; fragments shorter than the average USS spacing will likely give large disparities that will be evened out when the fragments are large enough that they usually contain at least one good USS.
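The dependence on fragment length can be made concrete by asking what fraction of randomly placed fragments would contain at least one USS. The sketch below is a simplified illustration, not a planned analysis: `fraction_with_uss` is a hypothetical helper, each USS is treated as a single point rather than a 30+ bp motif, the chromosome is treated as circular, and fragment start points are assumed to be uniformly distributed.

```python
def fraction_with_uss(uss_positions, genome_length, fragment_length):
    """Fraction of fragment start positions whose fragment covers >= 1 USS.

    A fragment starting at s covers positions s .. s + fragment_length - 1
    (wrapping around the circular genome). A fragment misses every USS only
    if it fits entirely inside a gap between two consecutive USS, so each
    gap of size g contributes max(0, g - fragment_length) missing starts.
    """
    sites = sorted(set(p % genome_length for p in uss_positions))
    if not sites:
        return 0.0
    missing = 0
    for i, pos in enumerate(sites):
        nxt = sites[(i + 1) % len(sites)]
        # Distance to the next site, wrapping around; a single site
        # leaves one gap spanning the whole genome.
        gap = (nxt - pos) % genome_length or genome_length
        missing += max(0, gap - fragment_length)
    return 1.0 - missing / genome_length
```

Running this with real USS coordinates at 100 bp, 1 kb and 10 kb would show how quickly the uneven coverage expected for short fragments disappears as fragments grow past the typical USS spacing.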