I think I've done enough analysis of the USS and DUS motifs, at least for now.
First, what was I hoping to find out? (What were the questions this might answer?) I had already done a thorough Gibbs-based analysis of the USSs in the H. influenzae genome. But the other genomes had only been searched for 'perfect' and 'singly-mismatched' USS cores, and only for perfect DUSs. The patterns found were enough to confirm that the genomes have the same general type of USS or DUS as their relatives, but not enough to detect more subtle differences. So I wanted to know how different related genomes' motifs are. I also wanted to get a better idea of the distribution of variation within each genome. Do different genomes have the same proportions of strongly-matched and weakly-matched sites, or do some genomes have a lot more poorly-matched sites than others?
I've now used the Gibbs motif sampler to search both strands of all the genome sequences known to have uptake signal sequences (USS or DUS). I used a 'fragmentation mask' for each search - one complex mask for all of the USS searches and a simple 12-position mask for the DUS searches. From the pooled forward- and reverse-orientation sites found by each pair of searches, I took enough of the top-scoring sites to give me 1.5 times the known number of 'perfect' 9bp USS cores, or of the original 10bp DUSs. These sites were used to make the logos.
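The cutoff step can be sketched in a few lines of Python. This is only an illustration of the idea, not the script actually used: the function name `select_top_sites`, the `(score, position, strand)` tuple layout, and rounding the cutoff up to a whole number of sites are all assumptions made for the example.

```python
import math

def select_top_sites(forward_sites, reverse_sites, n_perfect_cores, factor=1.5):
    """Pool the sites from both search orientations and keep the best ones.

    Each site is a hypothetical (score, position, strand) tuple.
    The number kept is factor * n_perfect_cores, rounded up (an assumption).
    """
    pooled = list(forward_sites) + list(reverse_sites)
    # Highest-scoring sites first.
    pooled.sort(key=lambda site: site[0], reverse=True)
    n_keep = math.ceil(factor * n_perfect_cores)
    return pooled[:n_keep]

# Toy example: 2 known perfect cores, so keep the top 3 sites.
forward = [(0.9, 10, "+"), (0.2, 50, "+")]
reverse = [(0.8, 30, "-"), (0.5, 70, "-")]
top = select_top_sites(forward, reverse, n_perfect_cores=2)
```

With real sampler output the scores would come from the Gibbs results for each orientation, and `n_perfect_cores` from a separate count of perfect cores in that genome.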
The first conclusion is illustrated by the two Neisseria logos at the left. The Neisseria motifs are very similar, not only in sequence but in the relative strengths of the different positions. I found only minor differences in DUS site density, and the genomes all had about the same balance of strong and weak matches. Several explanations are possible. First, these species may be very closely related, so their uptake machinery and genomic DUSs haven't had time to diverge. Second, divergence may be selected against because all mutations that change the uptake machinery's specificity reduce its efficiency. Third, maybe selection favours the ability to take up DNA from different members of the genus. All three 'species' live in the human nasopharynx (though N. gonorrhoeae is more often found elsewhere).
The next three logos are for different species in the Hin clade of the Pasteurellaceae. They show the range of minor variation I found in this group. The overall sequence consensus is very similar, but the relative strengths of different positions vary, and some of the weaker positions prefer different bases. I'm fairly confident that these differences reflect significant differences in the accumulated USSs, because each logo is based on more than 2000 sites. I think they probably do reflect differences in the biases of the respective uptake machineries, but demonstrating such subtle differences experimentally would be a lot of work, and probably not worth the trouble.
The final logo is for a genome from the Apl clade of the Pasteurellaceae. As expected from the motif previously derived from perfect+one-off USSs, it has a different core sequence (ACAAGCGGT rather than AAGTGCGGT) and a longer right-hand flanking motif. The surprising thing to me is the weak consensus of the first two motif positions, much weaker than most of the non-core positions. This probably means that it was a mistake to treat the 9bp sequence as the core for this clade. Really the core is only 7bp. This means that my use of 1.5 times the number of 9bp cores was probably too stringent a criterion for the Apl clade USSs. I did notice that this cutoff removed many high-scoring sites from the analysis. I should probably go back and do the cutoff analysis again, after scoring the number of perfect 7bp cores in these genomes. Maybe I'll do that tonight.
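Re-scoring the perfect cores is a simple both-strand count. Here is a minimal sketch, assuming the 7bp core is the 9bp Apl core minus its first two positions (AAGCGGT, my reading of the logo rather than an established motif) and that overlapping matches are counted separately. The function names and the toy sequence are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_occurrences(seq, motif):
    """Count exact matches of motif in seq, including overlapping ones."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count

def count_perfect_cores(genome, core):
    """Count perfect cores on both strands of a genome sequence."""
    genome = genome.upper()
    core = core.upper()
    rc_core = reverse_complement(core)
    n = count_occurrences(genome, core)
    # Avoid double-counting if the core is its own reverse complement.
    if rc_core != core:
        n += count_occurrences(genome, rc_core)
    return n

# Toy sequence (not real genome data) with one site on each strand.
toy_genome = "TTAAGCGGTTTACCGCTTAA"
n_7bp = count_perfect_cores(toy_genome, "AAGCGGT")
n_9bp = count_perfect_cores(toy_genome, "ACAAGCGGT")
```

Multiplying the 7bp count by 1.5 would then give the new number of top-scoring sites to keep for the Apl clade genomes.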