Field of Science

Lab decoration

To celebrate getting back into the lab, I spent a little while
improving the decor by my bench.

Starting experiments again

I'm (finally) starting an experiment. This is the test of whether cells can take up intact closed circular plasmid DNA. The original experiment (Kahn et al 1983 - pdf here - compare lanes A and G of Fig. 3) used plasmids that had been cut, labeled with 32-P, and religated; the religation trapped some supercoils, allowing the gels to show that supercoiling was preserved in the DNA that had been taken up. Our new 32-P won't arrive until sometime next week, so I'm going to do a practice run using cold DNA. If I'm lucky, the cells will take up enough that I can see it in a gel.

The post-doc showed me the stock of the plasmid I'll use (pUSS-1, left by a previous post-doc). So this afternoon I'm just running a quick gel to check the state of this DNA. And pouring some plates to streak out the wildtype and rec-2 cells I'll be using. If the DNA is supercoiled I'll try it out tomorrow on some wildtype competent cells I have frozen.

We've also ordered the tester-kit of four kinds of streptavidin-coated magnetic beads with different surface properties. They're much bigger than I wanted, but I figure I can at least use them to find out whether H. influenzae cells stick to any of (or all of?) these beads in the absence of DNA. If they do, the planned experiments may not work.

The new post-doc's plan

Our new post-doc is planning a project that fits wonderfully with our research questions. He's going to incubate competent cells with fragments of chromosomal DNA from a different H. influenzae strain (about 2% sequence divergence), and reisolate the DNA that has passed through different stages of uptake and recombination. The pools of DNA (each potentially about 10^10 different molecules, taken up by about 10^9 different cells) will then be intensively sequenced with the latest ultra-high-throughput technology. By simply comparing how many times each donor sequence appears in each pool, we will learn an enormous amount.

The biggest thing we want to understand is the bias of the DNA uptake machinery towards the uptake signal sequence (USS) motif. Sites matching this motif are very common in the H. influenzae genome (~2000 sites), and we have characterized them in great detail, but we know very little about their role in uptake. This is a big problem, because we think that their role in uptake is the ultimate cause of their accumulation in the genome. We know that cells prefer sequences perfectly matching the core consensus (AAGTGCGGT) 10-100-fold over unrelated sequences. Our previous attempts to find out more about the effects of mismatches and of changes to the flanking-sequence- consensus haven't gotten very far, partly because we were testing each variant USS separately, and partly because the results were not very consistent (high variation between experiments). This new project will not only measure the relative uptake of all the >2000 sequences closely matching the consensus, it will measure the uptake of all the other sequences in the genome too. The resolution can be manipulated by changing the sizes of the chromosomal DNA fragments used as input (100 bp, 1 kb, 10 kb), and if more diversity is wanted we can repeat the analysis with mixtures of DNAs from other strains and closely related species. We will represent this full uptake-bias as a position-weight matrix for the 30+ bp of the extended USS, in the same way as we now represent the consensus of the USS in the genome. Comparing the two matrices will tell us the extent to which the sequence motifs found in the genome correspond are produced by the bias of the uptake machinery.

The project will answer a lot of other questions, both mechanistic and evolutionary. WHY the uptake machinery is biased is our biggest research question. We think that the bias arises for mechanistic reasons; i.e. the sequences favoured by the uptake machinery are those that are easiest to take up. One issue that the post-doc's experiments will resolve is whether the direction of DNA uptake is set by the orientation of the USS. This will tell us quite a bit about the uptake mechanism. The uptake analysis will also let us define a practical USS - the set of sequences that are consistently taken up much better than other sequences. At present most analysis has considered only sequences perfectly matching the core consensus of the motif. This is an easy place to draw a line between USS and not-a-USS but it is very unlikely to correspond with reality.

We may also be able to find out whether a distinct DNA-binding step precedes the initiation of uptake? DNA uptake is conventionally treated as being initiated after DNA is specifically bound to the cell surface. But we don't know if DNA does stick to competent cells in a way that it doesn't to non-competent cells, or if it does, whether the binding is sequence-specific. In principle I think that any protein-catalyzed step can be modeled as having separate binding and catalysis stages, but as a first step we just want to find out whether whatever protein or proteins that initially bind the DNA also play direct roles in its uptake. The alternative would be a protein that just binds DNA and holds it close to the cell surface, increasing its local concentration so it's more likely to encounter the uptake machinery. If we did find such a protein we'd want to find out whether it had any sequence specificity. The post-doc's experiments won't be able to tell us whether the protein exists, but he may be able to look for a specific DNA fraction that has bound to competent cells but not started uptake.

The final part of his experiments will sequence DNA from a pool of 10^9 cells that have recombined fragments of donor DNA into their chromosomes. This will be the biggest sequencing challenge, as only about 1% of each genome will be recombined donor sequences. But it will give a big payoff for understanding both the mechanistic factors limiting recombination and the evolutionary consequences of recombination between diverging lineages.

For example, are USS sufficiently close together that they cause recombination to be high and even across the whole genome? Or does recombination reveal substantial disparities in coverage that correlate with differences in USS spacing? The answer is likely to depend on the lengths of the DNA fragments the cells are given; fragments shorten than the average USS spacing will likely give large disparities that will be evened out when the fragments are large enough that they usually contain at least one good USS.

tiny magnetic beads?

I was about to order a tester-kit of streptavidin-coated magnetic beads from Invitrogen (4 different bead surface types, to find the surface that gives the least background), when I realized that these weren't the tiny 50 nm beads I had in mind but 1 ยต beads that are as big as H. influenzae cells. I don't want to use great big beads if I can get smaller ones (yes, I realize that 1 ยต is hardly 'great big').

Now I can't find the supplier that had the 50 nm streptavidin-coated beads at all. The best I can find is a European company, Ademtech, that has 100 and 200 nm beads. I've sent them an email asking if they have a Canadian supplier, but I'm not optimistic. Maybe I should just try the tester kit of big beads, and then scale down...

Invitrogen charges more than $500 for a stand that holds 16 microfuge tubes against rare-earth magnets, to separate the beads from the liquid they're suspended in. The 6-tube one is only about $150. But, I can buy a sampler of 50 NdFeB magnets of assorted sizes (1/8 inch to 1/2 inch, rods and discs) from Lee Valley Tools for $16.50.

Real USS spacings are not random


The co-author only needed about 20 minutes to write the little perl script I'd asked for, and here are the results I generated with it. The blue bars show the spacing distribution of the real H. influenzae uptake sequence cores, and the red bars show the spacing distribution of the same number of random positions in a sequence of the same size (based on 10 replicates).

So the statistical advisor was correct; the real distributions have far fewer close spacings than expected for a random sequence. This is despite the the presence of two forces expected to increase close spacings - USS pairs serving as transcription terminators, and general clustering of USS in non-coding sequences. As hypothesized in yesterday's post, the discrepancy between real and random is stronger when we eliminate the terminator effect by only considering spacings between USS in the same orientation.

And the previous mathematical analysis was right, the real spacing distribution is more even than the random one - not only are there fewer close USS than expected, there are fewer far-apart USS too. (This is more obvious in the upper graph.)

Now I need to modify the perl script that locates uptake sequences so it will find the Neisseria DUS, and then do the same analysis for the N. meningitidis genome. The previously published analysis of these spacings simply compared the mean spacing of DUS with the mean length of apparent recombination tracts, concluding that the similar lengths was consistent with the hypothesis that DUS arise or spread by recombination.

Spacing of uptake sequences

I've been looking more at the spacings of the uptake sequences that arise in our model genomes and that are present in the real H. influenzae genome. First, in the panel on the left, is what we see in our simulated genomes. I only have data for simulations done with relatively small recombination fragments because the others take a very long time to run.

The notable thing (initially I was surprised) is that almost all of the uptake sequences in these genomes are at least one fragment length apart. That is, if the fragments that recombined with the genome were 25 bp (top histogram), none of the resulting evolved uptake sequences are in the 0-10 and 10-20 bins. If the fragments were 500 bp (bottom histogram, not to the same scale), none are in any bin smaller than 500 bp.

For the 100 bp fragments I've plotted both the spacings between perfect US (middle green histogram) and the spacings between pooled perfect and singly mismatched US (blue histogram below the green one). When the mismatched US are included the distribution is tighter, but there are still very few closer than 500 bp.

This result makes sense to me, because the uptake bias only considers the one best-matched US in a fragment. So a fragment that contained 2 well-matched USs is no better off than a fragment with one, and selection will be very weak on USs that are within a fragment's length of another US.

I talked this result over with my statistics advisor, and we agreed that there's not much point in worrying about whether the distributions are more even than expected for random spacing, because these are so obviously not random spacing.

After getting these results I decided I should take my own look at the spacings of the real uptake sequences in the H. influenzae genome, and these are shown in the graphs in the lower panel. I first checked the spacings of all the uptake sequences that are in the same orientation; the upper graph shows the pooled data for spacings between 'forward' USS and spacings between 'reverse' USS. Let's consider this graph after the one below it.

The lower graph shows the spacings between all of the USSs, regardless of which orientation each is in. This is the data that has been previously analyzed and found to be non-random (in a mathematical analysis that's too sophisticated for me to follow).

We don't see the same abrupt drop-off at close spacings that's in the simulated genomes, but the statistics advisor thought the drop-off at close spacings might be significant. Rather than making a theoretical prediction, he suggested just generating randomly spaced positions (same number of occurrences as the real USS, and in the same length of sequence as the real genome) and comparing their distributions to the real ones in my graph. I'm asking my co-author to write me a little perl script to do this (yes, I probably could write it myself, but she can write it faster). I'll use it to generate 10 random pseudo-genomes worth of spacings, and compare their distributions to my real genome.

There are good reasons to expect the real distribution to not be random. First, uptake sequences are preferentially found outside of genes. In H. influenzae 30% of the USS are in the 10% of the genome that's non-coding, and this would be expected to cause some clustering, increasing the frequencies of close spacings. Second, USS often occur in closely spaced, oppositely oriented pairs that act as transcriptional terminators. This is also expected to increase the frequencies of close spacings. I can't easily correct for the first effect, but I can correct for the second one by only considering USS that are in the same orientation, and that's what I've done in the upper graph.

There we see a much stronger tendency to avoid close spacings. The numbers aren't really comparable because I'm only considering half as many USS in each orientation, but the overall shapes of the two distributions suggest fewer close spacings in the same-orientations graph.

So, two things to do: First, predict the spacings of random positions, using the perl script that isn't written yet. Second, I need to get better data for the simulated evolved genomes, because these were initiated with fake genomes consisting of pure strings of uptake sequences, and the uptake sequences in the resulting evolved genomes had not originated de novo but instead had been preserved from some of the originating ones. I need to redo this with evolved genomes that began as random sequences, but these are taking longer to simulate.

Added later: And a third thing to do: Reread the paper by Treangen et al., who investigated the spacings of the Neisseria DUSs and compared them to the average lengths of recombination tracts.

Uptake of closed circular DNA

H. influenzae is reported to efficiently take up closed circular DNA as well as linear DNA. The DNA can be re-isolated intact after the cells have been treated with DNase I, so it is thought to be in the periplasm. I've based much of my thinking about the mechanism of DNA uptake on this result, but I've never tested it for myself. Now's the time, because the ideas in the DNA uptake proposal depend on this being true.

How to do it? How was the original experiment done? (First step is to find the paper...) Would they have used radioactively labeled DNA, and if so how did they label it without compromising its supercoiled state? (Prep the plasmid from E. coli grown in the presence of massive amounts of 32-P? Yikes!) Or maybe they did a Southern blot to detect it? Find the paper...

Basic plan: Incubate competent cells (wild type or rec-2 mutant) with supercoiled plasmid containing a USS - pUSS1 isolated from E. coli using a kit should be fine, except for the labeling question. Treat cells with DNase I and wash them thoroughly to get rid of external DNA. Lyse cells and prep DNA using a plasmid-prep kit. Run in a gel. If the recovery is really high I might be able to see the unlabeled plasmid, especially if I use one of those ethidium alternatives that's much more sensitive. How high could recovery be? Say I have 10^9 cells, each takes up 10 plasmid molecules. That's only 10^10 plasmids; if they're 2kb long that's 2x10^10kb of DNA, which is about 0.02 micrograms. 20 ng can easily be seen in a gel, especially with a fancy dye, but I'd be out of luck if uptake is less or recovery is poor (for example, if the DNA is nicked the kit won't work).

Ideas from lab meeting

At today's lab meeting we discussed the feasibility of the six Specific Aims from the 2007 CIHR proposal on DNA uptake, and considered what kinds of preliminary data we might generate. I'll discuss each separately here.
  • "What is the H. influenzae uptake specificity? A pool of USSs that have been intensively but randomly mutagenized and then selected for the ability to be taken up by competent cells will be sequenced to fully specify the uptake bias."
The new-post-doc is beginning experiments to show that he can reisolate DNA that has been specifically bound by competent cells. He also suggested a better way to make the pool of randomly mutated USS. Rather than using the degenerate 30-bp USS oligos as primers with a mutagenesis kit, he suggests instead creating 50- to 60-bp oligos with 20 bp at the 3' end outside of the USS. He'll then use Klenow polymerase to fill in the second strand of these oligos, with a primer complementary to the 3' end, and clone these to make our diverse library of degenerate USSs. Sequencing of some of these will provide preliminary data that our input is correct.
  • What forces act on DNA during uptake? Laser-tweezer analysis of USS-dependent uptake by wild type and mutant cells will reveal the forces acting on the DNA at both the outer and inner membranes.
This work was begun by a physics grad student from the lab with the tweezers (across town). I subsequently revised the plan to make things simpler, more efficient, and better controlled. I have a stock of the styrene beads somewhere (with streptavidin), and I've already checked that our standard MAP7 DNA has a suitable length distribution. So I need to put biotin ends on the DNA. I have biotinylated nucleotides but don't remember how I was going to attach these to the DNA ends. I could cut the DNA with a rare-cutting restriction enzyme (SmaI?) and then fill-in with the nucleotide (better read my old notes). SmaI sites are sufficiently rare that most molecules in the MAP7 DNA prep would have biotin only on one end. Then I need to bind the DNA to the beads. That's easy, but how do I check how much DNA is on the beads? Measure DNA concentration by Absorbance at 260? (Does styrene absorb UV?) Then I need to arrange with my physicist collaborator to try them out, using Bacillus subtilis first because its cells are nice and big and it doesn't care about the sequence of the DNA it takes up.
  • Does the USS polarize the direction of uptake? Using magnetic beads to block uptake of either end of a small DNA fragment will show whether DNA uptake is symmetric around the asymmetric USS.
This time I need to lay out clearly how answering this question would help distinguish between different models of uptake. We tried this experiment unsuccessfully about 5 years ago with much bigger streptavidin-sepharose beads (we later realized that the beads were way bigger than the cells, and porous to boot). I need to do more careful thinking of how this would be done and what we might measure.
  • Does the USS increase DNA flexibility? Cyclization of short USS-containing fragments will reveal whether the USS causes DNA to be intrinsically bent or flexible, and whether ethylation or nicking can replace parts of the USS.
The RA pointed out that a former post-doc tried unsuccessfully to get this cyclization assay working. The problems are all nicely recorded in her blog, as well as being documented in her lab book. Hmmm, what to do? I found a 2009 paper about DNA kinking that used nicking and 2 nt gaps as controls; they didn't see significant kinking with their nicked DNAs. I was going to test whether ethidium causes restriction enzymes to nick circles instead of cutting both strands, but the post-doc and RA tell me that we can now buy 'nickase' derivatives of restriction enzymes instead.
  • Which proteins interact with incoming DNA? Cross-linking proteins to DNA tagged with magnetic beads, followed by HPLC-MS, will be used to isolate and identify proteins that directly contact DNA on the cell surface.
It would be good to at least show that the cross-linking works... The RA says that she and a former post-doc tried using formaldehyde to cross-link DNA and proteins and got nothing but gunk. So I think this will take a lot of optimization. Start by getting the magnetic beads on the DNA. Start by ordering some magnetic beads.
  • Which proteins determine USS specificity? Heterologous complementation with homologs from the related Actinobacillus pleuropneumoniae (which recognizes a variant USS) will identify the proteins responsible for this specificity.
I made the plasmids for this work several years ago, but they were not well-tolerated by the host cells. I should pull them out and see if I can get some results.

Before any of this, I or the post-doc should carefully test whether H. influenzae can indeed take up closed circular DNA intact. I've been citing an old report of this for years, but it's critical to my model of DNA uptake so I really need to test it myself. I'll write a separate post about this.

Babes

Preliminary experiments for our next grant proposals

It's my turn to do lab meeting tomorrow afternoon, and I'm going to talk about the preliminary data needed for the CIHR and NIH grant proposals I'm planning to submit in September. In preparation I've been rereading the CIHR proposal that didn't get funded two years ago, and the reviewers' critiques of it.

Basically it's a lovely proposal with one massive flaw. The experiments are a fine mix of safe and risky, and the proposal is beautifully written and illustrated. BUT, there's no preliminary data to show that the experiments we propose are likely to work. (The reviewers also grumbled that I was being sneaky in submitting proposals to two different committees in the same competition. Such greed is, of course, un-Canadian, but that won't be an issue this time.)

So now it's time to think more strategically (i.e. about strategy) than I usually do. Which of the 6 lines of investigation I propose are easiest to generate preliminary data for? Which are the most severely weakened by their lack of preliminary data?

Here are the Specific Aims from that proposal, with brief notes about preliminary data in italics:
  1. "What is the H. influenzae uptake specificity? A pool of USSs that have been intensively but randomly mutagenized and then selected for the ability to be taken up by competent cells will be sequenced to fully specify the uptake bias." The new-post-doc is about to start doing preliminary experiments for this and related work, so we should at least be able to show that we can reisolate DNA that has been specifically bound by competent cells.
  2. "What forces act on DNA during uptake? Laser-tweezer analysis of USS-dependent uptake by wild type and mutant cells will reveal the forces acting on the DNA at both the outer and inner membranes." This work was begun by a physics grad student from the lab with the tweezers (across town). We have all the components ready, and I'm keen to work on this myself.
  3. "Does the USS polarize the direction of uptake? Using magnetic beads to block uptake of either end of a small DNA fragment will show whether DNA uptake is symmetric around the asymmetric USS." We've tried this unsuccessfully in the past with much bigger beads. The magnetic beads may also be valuable for the post-doc's DNA-reisolation work - I wonder what form preliminary data should take.
  4. "Does the USS increase DNA flexibility? Cyclization of short USS-containing fragments will reveal whether the USS causes DNA to be intrinsically bent or flexible, and whether ethylation or nicking can replace parts of the USS." We should at least try this cyclization assay - it shouldn't be a big deal.
  5. "Which proteins interact with incoming DNA? Cross-linking proteins to DNA tagged with magnetic beads, followed by HPLC-MS, will be used to isolate and identify proteins that directly contact DNA on the cell surface." It would be good to at least show that the cross-linking works...
  6. "Which proteins determine USS specificity? Heterologous complementation with homologs from the related Actinobacillus pleuropneumoniae (which recognizes a variant USS) will identify the proteins responsible for this specificity." I made the plasmids for this work several years ago, but they were not well-tolerated by the host cells. I should pull them out and see if I can get some results.

For the Discussion?

We need to remember that the uptake bias specified by the position-weight matrix the model uses isn't the effective bias in the run. Instead, the effective bias changes within each cycle and between cycles. In each cycle it typically starts with even the perfect-US fragments having only a 10% chance of recombining, and decreases gradually to a point where the disparity between perfect and less-than-perfect US is substantially reduced. How substantially depends on how many US the genome has at that cycle.

The Lewontin approach to confusion

(I should be thinking about DNA uptake projects but instead I'm still trying to understand why so few US accumulate in our simulations.)

I'm (at last) trying the approach I learned from Dick Lewontin when I was a beginning post-doc. When you're trying to understand a confusing situation where multiple factors are influencing the outcome, consider the extreme cases rather than trying to think realistically. Create imaginary super-simple situations where most of the variables have been pushed to their extremes (i.e. always in effect or never in effect). This lets you see the effect of each component in isolation much more clearly than if you just held the other variables constant at intermediate values. Then you can go back and apply the new insights to a more complex but more realistic situation.

So I've created versions of our model where the input DNA consists of just end-to-end uptake sequences, where the genome doesn't mutate (the fragments being recombined do have mutations), where the probability of recombination is effectively 1.0 for fragments with perfect uptake sequences and almost zero for fragments with less-than-perfect ones, and where this probability doesn't change during the run. The genome is small (2 kb) to speed up the run and the fragments are only 25 bp, so recombination doesn't bring in a lot of background mutation.

These runs reach equilibria with 45-50 uptake sequences in their 2 kb genomes. If they had one uptake sequence every 25 bp they would have 80 in the genome. Maybe this is as good as it can get - it's certainly way better than what we had previously.

Eliminating the background mutation in the genome makes a surprisingly large difference; a run with it present had only 16 perfect uptake sequences. I wonder if, with this eliminated, the amount of the genome recombined per generation now has much less effect on the final equilibrium? In the runs I've just done, fragments equivalent to 100% of the genome are recombined, replacing about 63% of the genome in each cycle. So I've now run one with half this recombination - it reaches the same equilibrium.

I had previously thought that the reason our runs reached equilibria that were independent of mutation rate was some complex effect of the weird bias-reduction component of our simulations. But in these new 'extreme' runs I had the bias reduction turned off, and I got the same equilibria with mutation rates of 0.01 and 0.001. I also tested a more-extreme versions of our position-weight matrix, where a singly mismatched uptake sequence was 1000-fold less likely to recombine rather than the usual 100-fold less, and this gave the same equilibrium as the usual matrix. So I tried using a weaker matrix, with only a 10-fold bias for each position. This gave about 30 uptake sequences in the 2 kb, a much less dramatic reduction than I had expected. Not surprisingly there were also more mismatched uptake sequences at the equilibrium.

This is excellent progress for two reasons. First, I now have a much better grasp of what factors determine the outcomes of our more realistic simulations. Second, I now know how to generate evolved genomes with a high density of uptake sequences, which we can analyze to determine how the uptake sequences are spaced around the chromosome.

The one big problem

Now that we have more data from our model and I've summarized it, we're left with one big problem: Why don't more uptake sequences accumulate in our simulated evolved genomes?

In all the generic simulations the bias favouring perfect matches to the 10 bp consensus US is very strong - they're favoured 100-fold over sequences with one mismatch to the consensus. If nothing else is limiting I would expect the final evolved genomes to have about one of these perfect US in each uptake-sized segment. That is, if the fragments taken up in the run were 1000 bp, I would expect the evolved genome to have about one perfect US per 1000 bp. If the simulated fragments were 100 bp, I'd expect about one perfect US per 100 bp. But my impression is that we get a lot less than this. The outcome appears to be independent of mutation rate, so the explanation shouldn't be that accumulation is limited by mutational decay of uptake sequences.

The other variable I could change is how quickly the bias is reduced when fewer than the desired number of fragments have recombined. At present it's 0.75. Increasing it would make the early cycles go slower, but with such a high mutation rate this isn't a big problem. I'll try a couple of runs using 0.9 and 0.95.

I wrote 'my impression is' above because I haven't really gone back and redone/rechecked the simulations that should have given the most accumulation. These would be ones where the amount of the genome replaced by recombination each generation is almost 100%, or where the genome mutation rate is set to zero. I'll look carefully at what's already done and queue up some new runs today.

This is a really important issue. We need to understand what is limiting US accumulation in the model, and I don't think we do yet.

Update: I've checked the runs my co-author did before she left. With 100 bp fragments, replacing 50% of the genome in each cycle, she always got less than one perfect US per kb. A run that started with 200 perfects seeded in a 200 kb synthetic sequence ended with 33, not 2000. A run that started with a random sequence ended with 18 perfects. These may not have been the final equilibria, but , in the last half of both runs the numbers of perfects were wandering around between 10 and 100, never more. In these runs the mutation rates for the genomes were very low (the rates for the fragments were 0.001), so the problem isn't background mutational decay of perfect US (the numbers of singly mismatched US at equilibrium were also low, about 100-150).

While I was traveling I did some runs with 500% of the genome replaced - in practice this means that most parts of the genome would have replaced several times before the cycle ended, and only a few bits would have not been replaced at all. These runs had less sensitivity because the genomes were only 20 kb, but they did give higher scores, with about 30-40 perfect US in the 20 kb genome. With this much recombination, the mutation rate of the genome itself becomes irrelevant as the almost the whole mutated genome is replaced by recombination in each cycle.

Should I repeat these runs, and do some with fragments only 50 bp long too? The only reason would be to increase my confidence that these are real equilibrium outcomes. I had terminated my runs when the scores seemed stable, but maybe they would creep up if left long enough (maybe using a seeded genome isn't a very good test of the true approach to equilibrium). The results were the same with mutation rates of 0.001 and 0.01, so I could use a rate of 0.01 and set it to run for 10,000 cycles, which is more than 10 times longer than my other runs went for.

More US-variation progress

The manuscript about variation in uptake sequences is still a mess, but today I'm going to send it off to my co-author anyway because my brain is tired of dealing with it. The Introduction isn't bad, and the Results is at least complete in that it describes all the analyses and their results (even though a couple of the model analyses aren't actually done yet). But the Discussion is still a shambles. It's not that we don't have important ideas to put in the Discussion, but I'm too saturated with details to think about them.

The draft figures are done. Some of them are far from polished, but they show what the data is. A couple of others are sketches because the data is incomplete. Lots of new runs to generate the missing data have already finished on the new server, and other (longer) runs are ongoing.

One analyses I'm working on is the spatial distribution of uptake sequences around the genome. Years ago a paper reported that USS are spaced more evenly than would be expected for randomly positioned sequences and suggested this might mean they had an intracellular role in chromosome maintenance. I'm wondering if this even spacing might instead by a consequence of the recombination events they arise by - if so we would expect the uptake sequences that arise in our simulations to also be relatively evenly spaced. A recent paper on Neisseria DUS found spacing to have roughly the same properties as the average lengths of recombination tracts (but this was a very weak constraint).

To do this analysis we need to know the locations of our uptake sequences in our evolved simulated genomes, but our model doesn't report this. On Monday I tried using the Gibbs motif sampler to find these. It easily found the perfect occurrences but needed some nudging to find most of the singly and doubly mismatched ones. This will be useful for making logos that show the evolved motifs, but not very suitable for analyzing spacing. My co-author has now modified a little perl script she had written so it lists the positions of the uptake sequences. The version she sent was for the USS 9 bp core, so I modified it to find the 10 bp generic US that the simulations had been generating. So now I have my first list of spacings.

The USS spacing analysis was done by Sam Karlin (noted Stanford mathematician interested in genomics), using some fancy math he'd invented. I emailed my local statistics expert about the best way to analyze the spacings, and he recommended that I use simulations (rather than probability theory) to find out what a random distribution should look like, and then compare our spacings to this.

To do this I think I need to write a little script that generates a specified number of random positions in a string of specified length (e.g. get 243 positions in a 200,000 sequence, to match the data I had for an evolved genome) and calculates the variance of the spacing distribution. The script needs to do this many times (10,000, suggested the expert!), and generate some summary that I can use to decide whether the spacings of the 243 uptake sequences in my evolved genome is significantly different from random spacing. The expert is off at the evolution meetings right now - I'll consult with him when he gets back.

In the interim I need to get back to thinking about DNA uptake, to plan the preliminary experiments and analyses we want to get done for our upcoming proposals.

US-variation progress

Progress on both the simulation and manuscript-writing fronts:

The minor snags with using the new server were easily resolved, and yesterday I queue'd up all of the runs needed for the data on mutation-rate effects. Some of them I queue'd up twice because I made mistakes in the settings the first time - one result of this is that we'll have five rather than four replicates of the key runs. And all but two of the runs had finished by this morning - the two still running were doing 50,000 cycles rather than 10,000.

The new writing plan is to generate a draft manuscript (very crude in spots) and draft figures (some in final form, a few just sketches because I don't have the final simulation data yet), and send the whole thing off to my co-author. She can then revise the draft while I generate and analyze the rest of the data and create the semi-final figures. Our goal is to have the semi-final version of the whole thing ready when we meet up at the Gordon Conference on Microbial Population Biology (end of July), and to do the final revisions together then.

Getting organized

Yesterday I wrote an outline for the simulation section of our US-variation manuscript, based what I could remember about the new simulations I ran while I was traveling, and sent it to my co-author. Then I started going through all these simulations to figure out exactly what I had actually done and learned.

I printed out some blank table sheets and filled in a line for each run, listing the key settings and the results (US score, had the genome reached equilibrium). There were about 45 runs, but I'd been sensible enough to keep the related ones together and delete all the false starts so it only took me a couple of hours to sort them all out. Now (once I hear back from my co-author), I need to decide where the gaps are and what replicates are needed, and start queueing the runs on the new server.

Unexpected perl-run success

One of the responses to my question about fast computers was an offer to use a computer cluster at Dalhousie University. The guys in charge (Rob and Karl) set me up with an account and a password but I haven't had time to try it out until this afternoon.

Rob had tried to run our program and run into a problem - it wouldn't read the desired settings from the settings input file. This appeared to be due to some idiosyncracy with how our code was being interpreted by their perl compiler. When he tweaked the code to use a slightly different command that did the same thing, it worked. But I wasn't looking forward to having to go through our code and make these changes.

Karl had thoughtfully sent me some instructions when he sent me my password:
The cluster uses the `module` system for loading and unloading the environment variables required for the different applications, quick how to is as follows

Show the currently installed modules
`module avail`

Load the module for this shell only, not future shells
`module load mauve/2.2.0`

Load the module for all future shells
`module initadd mauve/2.2.0`

Remove the module
`module initrm mauve/2.2.0`

Beyond that pretty standard SGE install.
Unfortunately this was gibberish to me, as I don't know what a 'module' or a 'shell' is, or how to do a standard SGE install. I also couldn't understand the online documentation for remote use, which talked about things I'd never heard of, such as RSA auth KeyFiles and NX CLIENTs and Gnome.

But wotthehell, I figured I might as well try doing the same things I do to install and run my program on our local WestGrid cluster. So I used Fugu to log on to Rob and Karl's cluster (that worked!) and to copy my program and its two input files into my cluster directory (that worked too!). Then I opened a Mac Terminal window, used ssh to log on to the cluster (that worked!) and told it to run my program. Based on Rob's experience I was expecting a string of error messages, but the program ran fine! In fact it ran about 60% faster than it does on my laptop, which is about five times faster than it runs on WestGrid.

Now I just have to decide what runs need to be done. This will require sorting through everything I did while I was away and what I learned from it, and then consulting with my former-post-doc co-author. Expect a post tomorrow describing this.

Commenting inconsistencies

Our new lab blogs process comments quite differently.  On this blog and on Shoe Science, you just click 'post comment' (after entering your comment in the box) and it appears on the blog.  On No DNA Control and RY's blog, clicking 'post comment' gets you a message that your comment could not be posted.  Then you try clicking 'preview comment' and a preview window appears, along with a captcha (one of those word-recognition tests) to complete.  Then when you click 'submit comment' your comment appears.  I wonder if it's the captcha that creates the problem - I'll need to test this on another computer because after posting one comment the problem has disappeared.

simulation plans?

I'm back in Vancouver, slowly working my way through the three-week backlog of emails that I glanced at and then disregarded.  These include lots of suggestions from helpful blog and evoldir readers about how I might be able to speed up my computer simulations of uptake sequence evolution.  I'll try to send a group reply to them all, but I'll summarize the ideas here. (Note that this summary is being generated BEFORE I go back and read through all the emails, so it's not the final version.)

My original query had asked about the availability of a 'mainframe' or other fast computer, hoping for something that was at least ten times faster than my MacBook Pro laptop.  The first thing I learned from the responses is that mainframes don't exist any more - they've all been replaced by clusters and grids, where large numbers of individually-slow computers are networked together.

Many replies asked if I could convert my code so it would take advantage of one of these clusters.  Unfortunately the steps in the simulation are all very sequential; each takes information from the previous step and generates information for the next step.  Even if they could be run in parallel, I don't have the computer skills to rewrite the code to take advantage of this.  But I realized that I can quite easily do multiple slow runs with small sub-genomes instead of one run with a big genome, and then manually combine the outputs.  For my present goals I only need to get data for a few slow runs with big genomes, so doing this manually wouldn't be a big deal (much easier than learning how to automate it).  The only concern is edge effects, but I think these will be minor and I can run a test to confirm this.

Some responses pointed me to an Amazon service called 'elastic compute cloud' - I haven't yet gone beyond looking at the front page, but this is certainly worth investigating.

Several people generously offered me time on computers under their control - either personal computers that are somewhat faster than mine or clusters under their control.  The local WestGrid computers I could use are three times slower than my laptop, so I'm going to check out one of these I've been given an account on, located on the other side of the country.

But I still need to work out exactly what big slow runs I need data for, and I also need to discuss this with my main co-author (former post-doc).  Deciding what needs to be done is the real bottleneck.

Chalk talk!

Alex Kondrashov gave a dynamic chalk talk yesterday, making the point that conventional population genetics approaches to the evolution of sex were probably a dead end - sex can't be selected for under the usual assumptions.

I was inspired by this (also by the lovely long blackboard and the availability of coloured chalk) so I did my talk this afternoon as a chalk talk too.  I started by pointing out that, because meiotic sex evolved once early in eukaryotes, any acceptable explanation must solve a problem experienced by diploids and haploids, uni- and multicellular organisms, and obligately and facultively sexual organisms.  This constraint rules out many of the explanations and examples proposed at this meeting.  Then I said that, because bacteria don't have sex (don't have any process evolved to cause recombination of chromosomal genes), any acceptable explanation of meiotic sex must also solve a problem that bacteria don't have.  Then I went over the evidence that bacteria indeed don't have sex.  I was in an adrenalin fog and so have no idea how much over my allocated 20 minutes I went, or whether I ever looked at the audience.  But Matt Meselson thinks I'm right!