
Estimating equilibrium with noisy data

I'm trying to find a better way to identify an 'equilibrium' genome score for our simulations of uptake sequence evolution. Ideally I would just let the runs go until the score was stable, but two factors complicate this. First, the scores are very noisy due to genetic drift, so there is no stable score. In principle this can be smoothed out by taking the mean score over a sufficiently long interval, but in practice the runs take a long time.

I have been working on complicated averaging strategies to deal with this, but for many runs I thought I could just stop trying to be economical with computer time. So I queued up a few very long runs, but this didn't really solve the problem, because the mean scores still drift up and down.

Finally I tried just plotting the 'grand mean' scores of the up-from-random-sequence-genome and down-from-seeded-sequence-genome runs ('up' and 'down' runs) on the same graph, and taking the average of the final values. The 'grand mean' score is the geometric mean of the genome scores at every cycle from the start of the run. For up runs it's an underestimate of the recent mean, and for the down runs it's an overestimate, but in both cases it's much smoother than the recent mean. I'll paste a graph in later, but these give what look like pretty good stable estimates of equilibrium values. So I think I'll just state in the methods that runs were continued until the up and down grand mean scores differed by no more than two-fold, and then the final scores were averaged to give an equilibrium score.
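To make the stopping rule concrete, here's a minimal Perl sketch of the grand-mean bookkeeping and the two-fold test. The score generators, names and numbers are placeholders of my own, not the actual simulation code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Running 'grand mean' = geometric mean of all genome scores so far.
# Accumulate log-scores and exponentiate only when a value is needed.
sub make_grand_mean {
    my ($sum_log, $n) = (0, 0);
    return sub {
        my ($score) = @_;
        $sum_log += log($score);
        $n++;
        return exp($sum_log / $n);
    };
}

# Placeholder score generators standing in for the real simulation's
# noisy per-cycle genome scores (purely illustrative).
sub up_score   { my $c = shift; return  50 * (1 - exp(-$c / 2000)) + rand(5) + 1 }
sub down_score { my $c = shift; return 100 * exp(-$c / 2000) + 50 + rand(5) }

my $up_mean   = make_grand_mean();   # up-from-random-sequence run
my $down_mean = make_grand_mean();   # down-from-seeded-sequence run

for my $cycle (1 .. 20_000) {
    my $up   = $up_mean->( up_score($cycle) );
    my $down = $down_mean->( down_score($cycle) );

    # Stop once the two grand means differ by no more than two-fold,
    # then average them to give the equilibrium estimate.
    if ($cycle > 100 && $down / $up <= 2 && $up / $down <= 2) {
        printf "Equilibrium estimate %.1f at cycle %d\n", ($up + $down) / 2, $cycle;
        last;
    }
}
```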

Simulating uptake of degenerate USSs

In addition to using our uptake simulation to generate surrogate data for uptake of chromosomal fragments, I think I can also use it to generate surrogate data for the uptake experiments that start with degenerate pools based on perfect uptake sequences. In these experiments we'll generate a pool of synthetic DNA fragments (~70 bp in the latest plan, I think, although our CIHR proposal says 200 bp) with a 30 bp internal USS that is 12% degenerate at each position (or maybe the fragment is degenerate throughout its length). We plan to reisolate fragments from this pool that cells have taken up, sequence these, and analyze the patterns of variation.

I can give the uptake simulation a 'genome' consisting of just this fragment sequence, and instruct it to take up fragments of this full length (so every fragment will start out the same). I can then set the fragment-specific mutation rate to 12%, so that each fragment will be independently mutagenized to this level before being scored for possible uptake. I can test the effects of different uptake biases by using different matrices for the scoring, and different ways of using the matrices (additive, multiplicative). (I'll also ask my mathematical colleagues if there is any function that gives effects intermediate between additive and multiplicative.) If our proposed fragment is degenerate only in its USS, I'll use a USS-length fragment; if it's degenerate throughout I'll use the full length sequence.
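Here's a rough Perl sketch of the fragment-specific mutagenesis step as I picture it: each position has a 12% chance of being replaced by one of the other three bases (chosen equally, which is an assumption on my part). The subroutine and starting sequence are illustrative, not the real US.pl code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mutagenize a fragment: each position has a fixed chance ($rate)
# of being replaced by one of the three other bases.
sub mutagenize {
    my ($seq, $rate) = @_;
    my @bases = split //, $seq;
    for my $b (@bases) {
        next unless rand() < $rate;
        my @others = grep { $_ ne $b } qw(A C G T);
        $b = $others[ int rand @others ];   # foreach aliases, so this edits @bases
    }
    return join '', @bases;
}

# Example: a 30 bp USS-like sequence, 12% degeneracy per position.
my $uss = 'AAAGTGCGGTCAATTTTTCTGGTATTTTTT';
print mutagenize($uss, 0.12), "\n" for 1 .. 5;
```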

Then I'll give the pool of simulated-uptake sequences (and a control pool of degenerate but unselected sequences) to the post-doc for analysis. He's already done a very nice preliminary analysis of the simulated chromosomal fragments.

I've spent the last 48 hours watching simulations of uptake sequence evolution run (very slowly) and fiddling with code and queue-submission files and settings, but now it all seems to be falling into place.

First the former post-doc said that she thought the int was indeed an error, and that it probably arose when the coding-expert undergrad copied the code from another probabilistic step to the recombination step. She also made the excellent suggestion that we use the additive matrix in the US-variation paper.

Now that the erroneous int has been eliminated from the simulation code, runs with additive scoring do indeed work. So do runs with the less-extreme tenfold bias, and runs using the Gibbs position weight matrices derived from the H. influenzae and N. meningitidis genomes. None of these gives as strong an accumulation as the 100-fold bias matrix I have been using, but that's OK.

Testing the additive scoring led to several other improvements. I increased the resolution of the recombination decision by introducing a lower-cutoff for the random number range, as I mentioned in the previous post. I also found that when using the additive scoring we no longer needed to have the bias gradually decrease during each cycle, but could just leave it fixed. And I rediscovered a problem with scoring the genome additively. [Aside: scoring is done in two situations - first to decide how well the different positions in an individual fragment correspond to the matrix, so a decision about recombination can be made, and second to track the accumulation of preferred sequences in the whole genome, to follow the progress of the simulation.] When individual fragments are scored, only the highest position-score is used, and the scores of all other positions are discarded. But when the genome is scored we add up the scores of all the positions in the genome. With multiplicative scoring the low scores are many orders of magnitude lower than the good scores, so including them in the sum has little effect. But with additive scoring even really lousy matches to the matrix have significant scores, and the sum of these swamps out the effect of the rarer good scores. The solution to this problem is to only use additive scoring for the individual fragments, and to always score the whole genomes by using the matrix multiplicatively.
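A toy Perl sketch of the distinction (not the real US.pl code, and with a made-up matrix): the fragment score keeps only the best additive window, while the genome score sums multiplicative window scores so that lousy matches contribute almost nothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Toy position-weight matrix: preferred base scores 10, others score 1
# (like the fake matrix), stored as $matrix[$position]{$base}.
my @matrix;
for my $pref (split //, 'AAGTGCGGTCA') {
    push @matrix, { A => 1, C => 1, G => 1, T => 1 };
    $matrix[-1]{$pref} = 10;
}

# Score one window either additively or multiplicatively.
sub window_score {
    my ($window, $mode) = @_;
    my @b = split //, $window;
    my ($add, $mul) = (0, 1);
    for my $i (0 .. $#matrix) {
        my $w = $matrix[$i]{ $b[$i] };
        $add += $w;
        $mul *= $w;
    }
    return $mode eq 'add' ? $add : $mul;
}

# Fragment scoring (for the recombination decision): slide the matrix
# along the fragment and keep only the best window's additive score.
sub fragment_score {
    my ($frag) = @_;
    return max map { window_score(substr($frag, $_, scalar @matrix), 'add') }
               0 .. length($frag) - @matrix;
}

# Genome scoring (to track progress): sum over all windows, but score
# each window multiplicatively so poor matches barely register.
sub genome_score {
    my ($genome) = @_;
    my $total = 0;
    $total += window_score(substr($genome, $_, scalar @matrix), 'mult')
        for 0 .. length($genome) - @matrix;
    return $total;
}

my $seq = 'TTATCGATTAAAGTGCGGTCAATTGCATTA';   # contains a perfect core match
printf "best additive window:  %d\n", fragment_score($seq);
printf "multiplicative genome: %g\n", genome_score($seq);
```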

I also spent some time relearning how to run the simulation on the computer cluster in Nova Scotia that a colleague kindly gave me access to, and modifying the program to run properly in this different situation (just trivial fiddly stuff). I kept queueing runs on this system, thinking they were running OK, and then looking at the error file "just in case" and finding that they had aborted at the start because I forgot to specify a file, or specified it wrong, or specified it in the wrong order. But now I have about 15 runs going overnight!

My new plan for this part of the US-variation manuscript (it's most of the Results) is to start with the additively scored matrix, both because that's how position-weight matrices are commonly used in other situations and because this means we don't have to explain about reducing the bias during the cycles. I'll start with the 100-fold matrix, and show how we detect equilibrium, and that genome length doesn't affect the outcome (score per kb). Then I'll explain about mutation rates (genomic and fragment-specific). The preliminary runs I did today indicated that the mutation rate of the fragments doesn't affect the equilibrium score, which I continue to find surprising. I had suspected that this was somehow due to using the bias-reduction feature with the multiplicative scoring, but now I see it with additive scoring and fixed bias. But I have a more detailed set of runs going overnight which should make the situation clearer. Then I'll consider how the amount of recombination (amount of genome recombined in each cycle) affects the outcome, and the effect of fragment length on the number and spacing of uptake sequences.

Then I'll consider the effects of different matrices, first the ten-fold matrix I tried today, and then the multiplicatively scored 100-fold and ten-fold matrices, and then the Gibbs matrices. This last step links the model back to the Gibbs analyses that the Results section starts with.

Two more ideas about scoring sequences with a matrix

1. While the post-doc and I were fussing with the Perl simulation, figuring out the best settings for scoring sequences with the Gibbs position-weight matrix, we also tested the effect of calculating the score in each window of the sequence by adding the scores of the individual positions in the window, rather than by multiplying them. In fact the original versions of the simulation did add the position scores, but we changed that because it didn't seem to give enough resolution (not enough difference in score between what our experiments had told us were good and bad uptake sequences).

Below I think I'll be able to explain how to get better resolution, but when the post-doc and I were discussing the additive and multiplicative versions we realized that of course neither additive nor multiplicative scoring has any chance of being realistic. That's because the proteins that interact with the sequence don't use arithmetic to decide whether to act on a particular sequence. Instead each base at each position will affect the probability of uptake in its own way. Thus the additive and multiplicative ways of applying the matrix should probably be seen as the extremes of how the real interactions might work.

So when we present this analysis in our proposal to NIH, I think we should include the additive case as well as the multiplicative case, with the no-selection sequences as a control. The additive case will show only modest selection for fragments containing good matches to the matrix, and the multiplicative version will show very strong selection. And if we're lucky enough to have some real uptake data by then, we can compare that to the two kinds of simulated data; it should fall somewhere in between. (Ideally I'd put a graphic here showing what I expect these to look like, but I'll leave that for the post-doc to do once he's analyzed the simulated data.)

2. Here's a way to improve the resolution of the additive selection: In the previous post I said that the simple selection method started by choosing a random number from the range between zero and the maximum possible score. But the range doesn't need to start at zero, and for the additive matrix it certainly shouldn't. Instead maybe it should start at the minimum possible score, or at the average score, or at some higher-than average score.
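Something like this, in Perl (names and the choice of lower cutoff are mine, just to illustrate the idea):

```perl
# Accept/reject a fragment under additive scoring, drawing the random
# threshold from [$min, $max] instead of [0, $max].  $min might be the
# lowest possible additive score (every position mismatched), or the
# average score of random sequence, or something higher still.
sub accept_fragment {
    my ($score, $min, $max) = @_;
    my $threshold = $min + rand($max - $min);
    return $score >= $threshold;   # a best-possible score is always accepted
}
```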

Why would we have used "int(rand(" rather than "rand("?

On Tuesday night I successfully modified our Perl simulation model so it printed out lists of sequences of simulated fragments that it had found acceptable, based on their scores using a position-weight matrix. But I couldn't get it to choose the best fragments scored with the Gibbs-derived matrix; instead it just gave me a list of randomly chosen fragment sequences. Scoring with a fake matrix worked OK.

So tonight I was reading through the code, trying to remind myself of how the model uses each fragment's score and Perl's random number generator to decide whether or not to accept the fragment for recombination (or, in this case, for printing). The logic is simple, but its implementation surprised me.

Here's the logic: First choose a random number from the range between zero and the maximum possible score. Then compare it to the fragment's score. If the score is lower, discard the fragment. If the score is higher, accept the fragment. So a fragment with the best possible score will always be accepted, and a fragment whose score is 50% of the best will get accepted half the time.

But the code the simulation actually used added a step. When it got each random number, it converted it to its integer value (instead of rand($max) it did int(rand($max))). This didn't create a problem with our fake matrix, because the preferred bases at each of its 10 positions all had value 10, giving the total matrix a maximum score of 10^10, and taking the integer values just stripped off irrelevant decimal fractions. But the Gibbs matrix has true probabilities in its elements, so its maximum score is only 1.9 x 10^-7. This means that all the random numbers will be ≤ 1.9 x 10^-7, so their integer values will all be zero. And this means that all the scores will be larger than the random number, so all the fragments will be accepted.
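Here's a minimal reconstruction (not the original code) of the two versions of the acceptance step, showing why int() ruins things when the maximum possible score is less than 1:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $max_score = 1.9e-7;    # maximum possible score with the Gibbs matrix

# Buggy version: int() truncates every random number below 1 down to 0,
# so any positive fragment score beats the threshold and everything is accepted.
my $buggy_threshold = int(rand($max_score));   # always 0

# Fixed version: keep the fractional random number.
my $threshold = rand($max_score);              # uniform on [0, $max_score)

# With the fake matrix ($max_score = 10**10) int() only stripped the
# irrelevant decimals, which is why the bug was invisible there.

my $fragment_score = 5.0e-8;                   # a mediocre match to the matrix
print 'buggy: ', ($fragment_score > $buggy_threshold ? 'accept' : 'reject'), "\n";
print 'fixed: ', ($fragment_score > $threshold       ? 'accept' : 'reject'), "\n";
```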

I don't know why the int function was included in the original program, but we tried deleting it and the program still ran fine. And now the program works as it should with the Gibbs matrix, preferentially accepting fragments with higher scores. So I used it to collect the positions of 200 preferred fragments from each of the forward and reverse-complement strands of the first 50 kb of the H. influenzae genome, and an equal number of control fragments (with the program set to accept every fragment it scored).

As a bonus, solving this problem may also have solved the problem I was having with the real version of the simulation - the one we are using to model the evolution of uptake sequences. In the past I couldn't see any evolution of uptake sequences when I used a Gibbs-derived matrix, only with the fake matrix. At the time I thought this must be because the Gibbs matrix didn't incorporate strong enough selectivity, and that was worrisome for my understanding of how uptake sequences evolve. But now I suspect that the problem may just have been that int function!

But I still would feel better if I knew why the student who did the coding for us included the int step. Did we discuss it and decide that this was the right thing to do, or did he just do it on his own initiative? I've checked the old versions of the program and it's there from the beginning. Maybe it's some sort of Perl good-practice thing.

Preliminary data goals for the NIH proposal

I took a turn at lab meeting this afternoon, and used the time to brainstorm about what experiments and analyses we should try to get done before the NIH deadline in early February. The work for the first NIH Aim will also be preliminary data for a resubmission of the CIHR proposal I just submitted, should that become necessary. I had hoped to also go over other work for the CIHR resubmission but didn't get to it before my time was up, so I'm going to do that next week (remembering to also update the lab meeting schedule).

Goals for Aim 1: Here we have two goals. The big one is to reisolate enough clean DNA from the periplasm that we can either send it for Illumina sequencing (if we have the opportunity to do this - the post-doc has a contact) or characterize it with conventional sequencing, and then do some analysis.

The second goal is to generate and analyze some surrogate sequence data, just to show that we know how to analyze it. By first doing this surrogate analysis, we'll be all set to quickly analyze whatever real sequence data we are able to get before the NIH deadline. Made bold by my recent success in adapting our Perl simulation of uptake sequence evolution to score DNA sequences, I've claimed that I can easily modify the simulation code so it generates sets of DNA sequences from a genome, a random-fragment set and a high-uptake set. Ideally these sets would contain fragments with a range of lengths, but for now I'll just generate several sets of each type (e.g. a 0.5 kb random set, a 1 kb random set, a 2 kb random set) and mix them together.
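A minimal sketch of what the random-fragment generator could look like: pull fragments of a fixed length from random positions and random strands of the genome. The subroutine, the toy genome and the set sizes are all illustrative; the high-uptake set would then be made by passing such fragments through the scoring and acceptance steps described in other posts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Draw $n fragments of length $len from random positions in $genome,
# taking either strand with equal probability.
sub random_fragments {
    my ($genome, $len, $n) = @_;
    my @frags;
    for (1 .. $n) {
        my $start = int rand(length($genome) - $len + 1);
        my $frag  = substr($genome, $start, $len);
        if (rand() < 0.5) {                  # reverse-complement half of them
            $frag = reverse $frag;
            $frag =~ tr/ACGT/TGCA/;
        }
        push @frags, $frag;
    }
    return @frags;
}

# Toy genome; the real sets would come from the H. influenzae sequence.
my $genome = join '', map { (qw(A C G T))[int rand 4] } 1 .. 10_000;

# Mix 0.5 kb, 1 kb and 2 kb random sets together, as described above.
my @pool = map { random_fragments($genome, $_, 100) } (500, 1000, 2000);
print scalar(@pool), " fragments in the mixed random pool\n";
```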

Goals for Aim 2: The big issue here is whether we can successfully reisolate input DNA from the cytoplasm. This DNA is expected to be single-stranded, and to be only 5-10% of the total DNA in the cytoplasm (the rest is double-stranded chromosomal DNA). Traditional methods of separating ssDNA from dsDNA rely on their differing physical properties; both hydroxyapatite and BND-cellulose can be used. An alternative I just thought of is to use random hexamers to prime DNA synthesis on the single-stranded fraction of total cytoplasmic DNA, including a biotinylated nucleotide in the dNTP mix. The new DNA can then be separated from the rest using streptavidin agarose or beads, and the template strands eluted by denaturing the DNA.

One complication is that the chromosomes will contain short single-stranded segments, but these are likely to be a small contribution relative to the input DNA. Another complication is that the DNA in the cytoplasm is transient. We'll use a rec-1 mutant so the DNA doesn't recombine with the chromosome, but degradation by cellular nucleases is certainly a concern.

Goals for Aim 3: Here we plan to characterize the results of chromosomal recombination, by deeply sequencing a very large pool of DNA from cells transformed with DNA of another strain. Because there is no DNA reisolation step, I think the best preliminary data we could have is the sequence of one transformed strain (lab strain transformed with DNA from a diverged strain). We can easily get this information from one Illumina lane, and we may be able to get one lane done well before the NIH deadline.

The plan is for the post-doc to first transform a cloned novobiocin-resistance (novR) allele into the most transformable of the three sequenced diverged strains. This can be done very quickly because the former post-doc left a tube of the novR DNA, and the transformation can be done by overnight co-culture of cells and DNA. The next step is to grow up and prepare DNA from one of the novR transformants, and to use that DNA to transform the standard strain KW20. Then we (i.e. the post-doc) just need to prep DNA from one of the transformants and pass it on to the Illumina specialists. They will shear the DNA to the appropriate length for Illumina sequencing, ligate the Illumina adapters onto the ends, and sequence it for us. We can use this sequence to specify the expected density of recombinant segments in the pooled chromosomes we propose to sequence, and the length distribution of recombination tracts.

Juggling data into the US-variation manuscript

I have two uptake sequence analyses I want to publish, and I'm trying to decide whether they would fit into the US-variation manuscript. One is the analysis of covariation between different USS positions and between different DUS positions (see here for example), and the other is BLAST analysis of the amounts of within-species sequence variation seen at different USS positions (see here). Both are certainly about variation in uptake sequences, and in previous blog posts I've described them as parts of this manuscript, but since then I've gotten bogged down in thinking that they don't really belong.
But let's assume that they do belong in this manuscript, and consider the next question of where they would fit best and how they would be connected with the other parts. They have to go after the first major section, which describes the analyses of the H. influenzae and N. meningitidis genomes with the Gibbs Motif Sampler, because they use datasets this analysis generated. Should they go before the other major section, which describes the Perl model of uptake sequence evolution, or after it? These two major sections fit well together, because the model makes predictions that can be tested with the Gibbs datasets.
What if the model went first, using the generic uptake sequence? It could then be followed by the Gibbs analyses, applied both to the model's output and to the real genomes. And then by more versions of the model, using the position-weight matrices generated from the real genomes. We could then do the other analyses of the genomic data....
Here's an attempt at an outline using this order:
Intro:
Emphasize the goal of understanding the cause of uptake sequences. Summarize what we know of their properties and the evidence that they arise by point mutation and spread by transformation more efficiently than other sequences because they are preferentially taken up (cartoon figure contrasting drive and beneficial-variation models). We have developed a model of this evolutionary process, which we describe and test below.
Results:
  1. The model (how it works, with a cartoon figure).
  2. We run the simulations to equilibrium (evaluated as score or US count), using a 10 bp generic position-weight matrix.
  3. Characterization of model: Effects of (i) genome length (each cycle takes longer); (ii) mutation rate (getting to equilibrium takes more cycles); and (iii) recombination rate (determines equilibrium score, below saturation).
  4. Properties of equilibrium sequences (i): Proportions of perfect and mismatched occurrences. Use Gibbs to find them? Compare with direct counts of perfects, one-offs and two-offs? Explain why Gibbs is, in principle, better?
  5. Properties of equilibrium sequences (ii): Spacings between occurrences (found by Gibbs). These are more even than random, as are the spacings between real uptake sequences. The spacing depends on the length of the recombining fragments.
  6. Use Gibbs to reanalyze the Perl output sequences.
  7. Use Gibbs to reanalyze the genome sequences.
  8. Repeat the key Perl runs using the position-weight matrices from the Gibbs analyses of real genomes.
  9. Compare what we find.
  10. Do other reanalyses and new analyses of the Gibbs output of the genome analyses, including variation in subsets of genome sequences (no interesting results but worthwhile anyway). This would include the covariation analysis and the BLAST analysis of within-species sequence variation.
That seems workable, and as sensible as the previous order. Now I need to decide whether it's sufficiently better than the old order to be worth the trouble of rearranging the whole Results section.
A couple of hours later: I don't think it is. For now I'm going to treat the manuscript as two distinct parts. First, the Gibbs analysis of the genome and sub-genome sequences, and analyses using the occurrences these identify, including the covariation and within-species variation analyses. Second, the Perl simulation of uptake sequence evolution, and comparison of its results with the genome analyses.

More about Gibbs scores

Yesterday the post-doc and I were again trying to understand the scoring system used by the Gibbs Motif Sampler. It seems peculiar to us because, in a search where most occurrences of the motif get scores close to or equal to 1.00, it produces a lot of sites that score zero, as illustrated in the histograms above. In general, sites that are scored higher also better match the position-weight matrix that describes the motif, but the scaling seems weird. All of these sites, even the ones scoring 0.00, are pretty good matches to the motif (see below, copied from an earlier post), and we don't understand why there are a lot of sites with 0.00 scores but few or (usually) none with scores between 0.01 and 0.50.

The Gibbs documentation isn't very helpful, describing the score as "the probability of the motif belonging to the alignment". And the legend provided with the scores at the end of the run only says "Column 6 : Probability of Element". I decided to email my helpful Gibbs expert to ask how the score is calculated, but while searching for his email address I found that I'd already asked him in early 2007 (!). Here's what he said:
Well, it’s a little confusing. For the near optimal solution (the one labeled NEAR OPTIMAL RESULTS), the probability given is the probability of the particular site according to the motif model for that motif type. It is calculated by computing the posterior probability of each of the sites according to the motif probability model. In your first example, its calculated as P(A in position1)*P(C in position 2)…. The probability is normalized so that the sum of the probability of the particular site belonging to each possible motif type and background sums to 1. The first site below is pretty definitely a good fit for the model, the second is not so great and might be better described as a background site. Both of the statements you list below are true in the sense that the probability correlates with how well the site fits the model.

Why are there low probability sites in the model? When the MAP is calculated, the program tries to maximize the MAP. Sometimes, there is a marginal improvement to the MAP by adding these low probability sites. The frequency solution will usually not include these sites, so in some ways it gives a better picture of the motif model.
First about the 'normalization': I don't understand what he means, but I think I don't need to worry about it. I've drawn a sketch of what I think might be going on - the output first calculates a score for every position in the genome, and then puts only those above some criterion into its alignment. But before it reports those aligned sequences it recalculates their scores, maybe using some normalization procedure that stretches out the middle, cramming most of the scores that would have been in the lower half down to zero.
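Here's my guess at the calculation, as a Perl sketch (definitely not the Gibbs Motif Sampler's actual code, and the real normalization may involve more alternatives than just one motif and background): the probability of a site under the motif model is the product of the per-position probabilities, and it's then normalized against the probability of the same site under a background base-composition model.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Probability of a site under a position-weight matrix of per-base
# probabilities: P(A in position 1) * P(C in position 2) * ...
sub p_site_given_motif {
    my ($site, $pwm) = @_;            # $pwm->[$pos]{$base} = probability
    my @b = split //, $site;
    my $p = 1;
    $p *= $pwm->[$_]{ $b[$_] } for 0 .. $#b;
    return $p;
}

# Probability of the same site under a background model of independent
# bases drawn from the genome's base composition.
sub p_site_given_background {
    my ($site, $bg) = @_;             # $bg->{$base} = genome-wide frequency
    my $p = 1;
    $p *= $bg->{$_} for split //, $site;
    return $p;
}

# Normalized 'probability of element': motif vs background, summing to 1.
sub site_score {
    my ($site, $pwm, $bg) = @_;
    my $pm = p_site_given_motif($site, $pwm);
    my $pb = p_site_given_background($site, $bg);
    return $pm / ($pm + $pb);
}

# Tiny worked example with a 3 bp motif (illustrative numbers only).
my $pwm = [ { A => 0.80, C => 0.05, G => 0.10, T => 0.05 },
            { A => 0.10, C => 0.10, G => 0.70, T => 0.10 },
            { A => 0.05, C => 0.05, G => 0.10, T => 0.80 } ];
my $bg  = { A => 0.31, C => 0.19, G => 0.19, T => 0.31 };   # roughly H. influenzae base composition
printf "AGT: %.3f\n", site_score('AGT', $pwm, $bg);   # good match, near 1
printf "CCA: %.3f\n", site_score('CCA', $pwm, $bg);   # poor match, near 0
```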

Now about the 'background': In one sense his statement that the low-scoring sites in the alignment should be treated as background seems to be wrong, because even the sites with scores of 0.00 have significant matches to the matrix (see figure below). But maybe sites with this strength of match to the matrix are expected to occur by chance in a genome sequence of this length and base composition, so we can't assume that they result from selection for the process/effect of interest. In that sense they might indeed be considered background. I would think that the decision about background status would also have to take into account how often such sites are expected by chance and how often they're seen. For example, if we found 100 occurrences where theory predicts 10, we would conclude that most are due to non-background forces (though we wouldn't know which ones).

Silane-treated coverslips?

I'm hoping to get started learning how to work with the optical-tweezers apparatus soon. One of the things I want to test is adhesion of H. influenzae cells to silane-coated coverslips, as this will eliminate the need to restrain the cells by suction from a micropipette. The RA has already found that cells don't stick to uncoated coverslips, nor to ones coated with BSA. I had been hoping that I could just order them from Sigma or somewhere, but now it looks like preparing the coverslips might be the first of many very fussy procedures.

A Google search didn't find anyone selling them, but it found lots of protocols for making them. They all involve precleaning the coverslips by boiling in strong acid or soaking in strong base or scrubbing with bathtub cleanser (Bon Ami was used), and then dipping or soaking them in solutions containing 3-aminopropyltriethoxy-silane/epoxy-silane/TMA silane/trichloro-octenyl-silane/methacryloxypropyltrimethoxysilane, and then air-drying or baking or ??? and storing in a desiccator or under unspecified conditions. I have no idea what kind of silane I should use or how to prepare the coverslips, so I've emailed the author of the Bacillus subtilis paper that did tweezers analysis, to ask what she did. I've also emailed my physicist collaborator (whose apparatus I'll be using) to ask if she has experience with this.

I think I'll also email around our institute, to find out if anyone here is routinely using silane-treated cover slips. Maybe they can (1) give me one to test, and (2) show me how to make them.

To do list

Last night's post began to list the things I need to get done in the next ~6 months. I'll try to elaborate on it now, adding other goals.

Finish and submit the US-variation manuscript: This is the only manuscript we have in the works right now (except one on teaching that's fallen by the wayside). Hmmm, that in itself is probably a cause for concern.

Write the NIH proposal: Now that NIH has cut its proposal page limit to 12 pages, including figures but not references (?), perhaps we can just revise (and revise and revise) the 10-page proposal the post-doc submitted for his fellowship application. Yesterday I discovered that HERRO, the UBC group that supports research in the health sciences, is offering a half-day workshop on writing NIH grants at the end of the month. They don't seem to have advertised this to the UBC community, just announced it on their web page, but a colleague and I tracked it down and signed up.

Get preliminary data for the NIH proposal: The proposal has 3 (or maybe 4) parts, each consisting of sequencing specific very diverse DNA samples we've purified from transformed cells. For the first and third of these parts we can do a small-scale experiment to show that the components all work, if necessary doing the preliminary sequencing by conventional methods rather than Illumina. The second part is the most difficult, but we should be able to get preliminary data there too.

Get more preliminary data for the CIHR proposal: This will be critical if this proposal is rejected. The first part of this proposal is the same as the first part of the NIH proposal, so that new data will do double duty. But we also need more data for the second part, making and analyzing mutants and testing candidate DNA-binding proteins, and for the third part, optical tweezers measurements and studies of USS conformation and flexibility. The optical tweezers part is the most risky but that makes it the one most in need of evidence that it can be done. It's also my own project and I want to get started having fun with it soon.

Grant proposal's done!

I clicked "Submit" on Sunday night. It was wonderful writing it with the research associate and post-doc - we made a good team. If the proposal doesn't get funded it will be only because we don't have much preliminary data yet.

But I didn't get to relax because I headed off to Chicago on Monday morning to give a talk at Northwestern School of Medicine. I learned lots of interesting things in my talks with various faculty and students. One was something I should have considered for the proposal: the problem of fitting DNA through the narrow secretin pore in the outer membrane, especially if the DNA has to share the pore with a type IV pilus. The same pore is used to pass proteins out of the cell, and the same size problem arises - some of the proteins are probably too big to fit through the pore. So maybe we're wrong in attributing to the pore the size it appears to be when purified - maybe it's stretchy. Another particularly interesting thing was some unexpected properties of the DNA uptake specificity of Neisseria gonorrhoeae. I won't spell them out because they're not published yet, but we'll want to do some experiments to see if the same things happen in H. influenzae.

Now it's time for some careful thinking about what to try to get done in the next 6 months. I plan to submit a proposal to NIH in early February, and we need to obtain preliminary data. So we need to decide what data will be the most compelling. I also want to be prepared with more preliminary data in case CIHR rejects our DNA uptake proposal - that would need to be resubmitted by March 1. Some of the data we'd be getting for NIH will also be good for CIHR, but others will be different. And I need to get the US-variation manuscript finished. As I recall, I need to run some more simulations and then the former post-doc and I need to finish the writing.

A better analysis of the data



The previous post had a rather silly way of handling the matrix-score data. Now I've just graphed the actual scores on a log scale.

One line of Perl code changed = new data!

This afternoon the post-doc and I were discussing the uptake-sequence analysis we're proposing for the CIHR grant. We wanted a way to compare the actual uptake of genome fragments with that predicted by their scores with the matrix I'd derived from the genome sequence using the Gibbs motif sampler. I realized that we might be able to use my existing Perl program (the one that evolves uptake sequences in simulated genomes) to score these.

So, after I burned out on rewriting paragraphs in the proposal, I opened up the giant Perl program US.pl. In the 965 lines of code I found just the spot I wanted, and told the program to print the score it had just calculated. I put the first 10 kb of the H. influenzae genome sequence into a text file, and changed the program's settings file so it would score this sequence but wouldn't try to evolve it. And I created a scoring matrix using the output from one of my old Gibbs runs on the H. influenzae genome.

It ran! I pasted the output into a Word file and converted the '-' signs of the exponents of the scores (range -32 to -8) to a tab mark, and reopened the file in Excel. I then subtracted these values from 32, so now the lowest pseudoscore value was zero and the highest was 24. Then I plotted this 'score' against the row numbers.
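In retrospect the same pseudoscore could have been computed directly, without the Word-and-Excel fiddling. A sketch, assuming the scores are available as plain numbers: the pseudoscore is just 32 plus the base-10 exponent of the score, so 10^-32 maps to 0 and 10^-8 maps to 24.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(floor);

# Convert a multiplicative matrix score (roughly 10^-32 .. 10^-8) to the
# 0..24 pseudoscore used for plotting: 32 + (base-10 exponent of the score).
sub pseudoscore {
    my ($score) = @_;
    return 32 + floor( log($score) / log(10) );
}

printf "%d\n", pseudoscore(1.9e-8);    # 24, a near-perfect USS match
printf "%d\n", pseudoscore(3.0e-32);   # 0, a lousy match
```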

And here's the result (black bars). To check that I hadn't screwed up, I looked up the position of the highest score in the Excel file (score = 24, at position 5319) and checked the sequence I'd started with. Sure enough, at position 5319 I found AAAGTGCGGTCAATTTTTCTGGTATTTTTT, a near-perfect match to the USS consensus.

What this graph doesn't show very well is the large number of positions with very low scores. So here's another graph (blue bars) to the same scale, but showing only the first 300 positions. Now we see that most positions have scores less than 10.

Versions of these figures will go in the Appendix of the proposal, as preliminary data.

H. influenzae and influenza


Haemophilus influenzae got its name because (1) it likes to grow on medium that contains blood (it needs heme and NAD, which blood has lots of) and (2) it was identified as commonly present in the lungs and sputum of patients ill with influenza in the 1918 global epidemic. It was initially suspected of being the causative agent of influenza, before the virus was discovered. If you're interested in the history of medicine, I strongly recommend the book The Great Influenza, by John M. Barry.

Anyway, the new H1N1 influenza epidemic probably means we'll see a rise in H. influenzae pneumonia. I'm mentioning this in my grant application, in the paragraph on the medical relevance of H. influenzae.

A terminology problem

I've only just realized how strongly the terms we use have been affecting our thinking about uptake sequences. I'll try to write about the problem here, but the nature of terminology problems makes it hard to write about them...

First some history of uptake sequences: It's been known since 1974 (Scocca, Poland and Zoon) that competent H. influenzae cells take up H. influenzae DNA fragments in preference to fragments from unrelated species, and since 1979 that this is a sequence preference (i.e. not a modification preference). In 1980 Danner et al. showed that H. influenzae cells preferentially take up fragments containing the 11bp sequence AAGTGCGGTCA, and that ethylation of positions in and close to this sequence interferes with uptake. They correctly concluded that the species preference is seen because H. influenzae DNA is greatly enriched for this sequence (or for similarly-acting sequences). [They presciently also proposed that the uptake bias might be at least partly responsible for this enrichment, a hypothesis we've since formalized in our molecular drive model.] Later Goodgal showed that the 9 bp sequence AAGTGCGGT was probably sufficient to give strong uptake. In 1995 Smith et al. analyzed the newly available H. influenzae genome sequence for the 9 bp sequence and found 1465 (later 1471) perfect matches. Alignment of these also showed a strong consensus for two 6 bp AT-rich segments on one side of the 9 bp sequence. And in the last five or ten years we showed that these sequences are the consensus of a diverse motif that accumulates in the genome by point mutation and homologous recombination, and attempted to measure the preference of the uptake machinery for different variants of this consensus.

The terminology problem: So we need terminology for several different phenomena, all of which have at least sometimes been called uptake sequences or uptake signal sequences (USSs). There's the short sequence initially identified as the uptake sequence; the 9 bp version of this is commonly referred to as the uptake signal sequence or USS and is often the only sequence considered in analyses of uptake. Then there's the consensus derived from analysis of the genome sequence, with the 9 bp sequence called the core USS and the flanking AT-rich segments called segments 2 and 3. Then there's the position-weight matrix description of the genomic enrichment, which we refer to only in convoluted ways. And finally there's the true bias of the DNA uptake machinery, which we have only very poor estimates of and no term for.

Until recently nobody knew enough about DNA uptake and the genome to clearly distinguish between these, and this confusion has led to a lot of muddled thinking. We've been among the main perpetrators of this confusion, using 'USS' interchangeably for genomic sequences and uptake biases. But now we're proposing to obtain a very detailed description of the 'true' uptake bias, and to compare it to the genomic motif, so I need to carefully choose the terms I use, and to explicitly draw the reader's attention to them.

I could call the genomic motif the genomic USS motif, and its consensus the genomic USS consensus. We think it really has evolved as a motif, just like those associated with the preferences of known DNA-binding proteins, but we haven't yet identified the proteins and we aren't defining it by the proteins' preferences (which we don't know) but by its abundance in the genome. Although this consensus is useful in some experiments and simulations, it isn't much use when considering what's actually in the genome. If the genomic USS motif does approximate the real bias of the uptake machinery, then counting the number of occurrences of the core consensus (1471) is quite misleading about the actual uptake of different parts of the genome. (If it doesn't, counting is even more misleading.)

Should I call the real uptake bias just that (real uptake bias)? Or the real uptake motif? I could leave off the real, and just use uptake motif. Like the genomic USS motif, it would be described by a position-weight matrix. What would I call the consensus of this, which I am proposing to use in additional experiments in place of the genomic USS consensus? Should I avoid using the USS abbreviation when referring to the matrix and consensus of the real uptake bias? This would help the reader keep things straight, but would mean getting rid of the customary description of this bias as the USS.

I could keep it simple, using genomic motif and genomic consensus, and uptake motif and uptake consensus. Or I could insert USS into each of these terms (genomic USS motif etc.). I think I'll try the latter for now - it's a bit wordy but has the virtue of being unambiguous, and if I later come up with something better I can easily make the needed changes.

Later:
I overlooked one more terminology problem. How do I refer to individual sequences in the genome that might (or might not) promote uptake, but that haven't ever been tested? We used to distinguish between fragments that contain 'USS' and those that don't, but the motif/molecular drive perspective predicts a smooth continuum between sequences, with those that strongly promote uptake at one end and those that resist uptake at the other. It gives no justification for distinguishing between 'real' USS and sequences that look a bit like USS. Should I call them all USS-like sequences? Candidate USS sequences? Fragments with different degrees of correspondence to the genomic USS motif?