The manuscript about variation in uptake sequences is still a mess, but today I'm going to send it off to my co-author anyway because my brain is tired of dealing with it. The Introduction isn't bad, and the Results is at least complete in that it describes all the analyses and their results (even though a couple of the model analyses aren't actually done yet). But the Discussion is still a shambles. It's not that we don't have important ideas to put in the Discussion, but I'm too saturated with details to think about them.
The draft figures are done. Some of them are far from polished, but they show what the data is. A couple of others are sketches because the data is incomplete. Lots of new runs to generate the missing data have already finished on the new server, and other (longer) runs are ongoing.
One analysis I'm working on is the spatial distribution of uptake sequences around the genome. Years ago a paper reported that USS are spaced more evenly than would be expected for randomly positioned sequences and suggested this might mean they had an intracellular role in chromosome maintenance. I'm wondering if this even spacing might instead be a consequence of the recombination events they arise by. If so, we would expect the uptake sequences that arise in our simulations to also be relatively evenly spaced. A recent paper on Neisseria DUS found spacing to have roughly the same properties as the average lengths of recombination tracts (but this was a very weak constraint).
To do this analysis we need to know the locations of the uptake sequences in our evolved simulated genomes, but our model doesn't report this. On Monday I tried using the Gibbs motif sampler to find them. It easily found the perfect occurrences but needed some nudging to find most of the singly and doubly mismatched ones. This will be useful for making logos that show the evolved motifs, but it isn't very suitable for analyzing spacing. My co-author has now modified a little Perl script she had written so it lists the positions of the uptake sequences. The version she sent was for the USS 9 bp core, so I modified it to find the 10 bp generic US that the simulations had been generating. So now I have my first list of spacings.
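For anyone who wants to try the same kind of position listing, here is a minimal Python sketch. It is not my co-author's Perl script, just an illustration of the idea. The function names, the `max_mismatches` parameter and the placeholder motif in the usage note are all made up for the example, and it only scans one strand.

```python
def find_motif_positions(genome, motif, max_mismatches=0):
    """Return (start, mismatches) for every window of the genome that
    matches the motif with at most max_mismatches mismatched bases.

    Hypothetical helper: scans the given strand only, 0-based positions.
    """
    m = len(motif)
    hits = []
    for start in range(len(genome) - m + 1):
        window = genome[start:start + m]
        # Count the positions where the window and the motif disagree.
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def spacings(positions):
    """Distances between consecutive positions, after sorting them."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]
```

Usage would be something like `hits = find_motif_positions(genome, "GATTACAGGC", max_mismatches=1)` followed by `spacings([start for start, _ in hits])`, where `"GATTACAGGC"` is only a placeholder and not the real uptake sequence. With a permissive mismatch limit, overlapping windows can both count as hits, so the list may need filtering before the spacings mean anything.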
The USS spacing analysis was done by Sam Karlin (noted Stanford mathematician interested in genomics), using some fancy math he'd invented. I emailed my local statistics expert about the best way to analyze the spacings, and he recommended that I use simulations (rather than probability theory) to find out what a random distribution should look like, and then compare our spacings to this.
To do this I think I need to write a little script that generates a specified number of random positions in a string of specified length (e.g. 243 positions in a 200,000 bp sequence, to match the data I have for an evolved genome) and calculates the variance of the spacing distribution. The script needs to do this many times (10,000, suggested the expert!) and generate some summary that I can use to decide whether the spacing of the 243 uptake sequences in my evolved genome is significantly different from random spacing. The expert is off at the evolution meetings right now - I'll consult with him when he gets back.
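A rough sketch of what such a script might look like, in Python. This is only my guess at the approach until I've talked to the expert. The function names are made up, the genome is treated as a linear string unless `circular=True`, and the one-sided empirical p-value (the fraction of random replicates whose spacing variance is at least as small as the observed one) is my assumption about a sensible summary, not something he prescribed.

```python
import random
import statistics


def spacing_variance(positions, genome_length, circular=False):
    """Population variance of the gaps between consecutive positions.

    With circular=True the gap that wraps around the end of the genome
    is included too (an assumption; the default treats it as linear).
    """
    pos = sorted(positions)
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    if circular:
        gaps.append(genome_length - pos[-1] + pos[0])
    return statistics.pvariance(gaps)


def null_variances(n_sites, genome_length, n_reps=10000, seed=None,
                   circular=False):
    """Spacing variances for n_reps sets of randomly placed sites."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_reps):
        # Draw n_sites distinct positions uniformly along the genome.
        positions = rng.sample(range(genome_length), n_sites)
        results.append(spacing_variance(positions, genome_length, circular))
    return results


def even_spacing_p_value(observed_variance, null):
    """Fraction of random replicates at least as evenly spaced as the
    observed sites (variance <= observed), with a +1 correction so the
    estimate is never exactly zero."""
    hits = sum(1 for v in null if v <= observed_variance)
    return (hits + 1) / (len(null) + 1)
```

For my evolved genome this would be `null = null_variances(243, 200000)`, followed by `even_spacing_p_value(spacing_variance(my_positions, 200000), null)`, where `my_positions` stands for the list of uptake sequence positions. A small value would suggest the uptake sequences are more evenly spaced than random positions would be.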
In the interim I need to get back to thinking about DNA uptake, to plan the preliminary experiments and analyses we want to get done for our upcoming proposals.
I played with this stuff when looking for significant associations between putative laterally transferred genes a few years ago. A hugely fun problem that has never percolated to the top of the Important column since then.
I think David Sankoff had some measures for this that were similar to the ones I made up. I suspect that his will be better :^>
wv: 'obine' - pertaining to a pair of sheep.