I began using the Gibbs motif search program just to get a better estimate of 'the' USS consensus. But as I get into it (which I have to do to get it working effectively) I'm discovering more and more questions I can answer with it. Here is what I have so far:
1. Does the direction in which a sequence is replicated affect the consensus of its USSs? That is, do USSs whose "forward" orientation sequence is found in the leading strand during DNA replication have a slightly different consensus than those whose "reverse" orientation sequence is in the leading strand? I raised this issue at the end of an earlier post, but I haven't tested it yet. All I need to do is get the sequence of each strand of the genome, chop both at the origin and terminus of replication, and then put the parts together so that all the leading-strand sequences are in one file and all the lagging-strand sequences are in another. Then I run both files through the motif search.
2. For USSs in coding regions (this is most USSs), does their orientation with respect to the direction of transcription and translation affect their consensus? This seemed very easy to do, as I was able to download from TIGR a file containing the nucleotide sequences of all the H. influenzae coding regions. I had to tidy away the small number of non-ACGT bases (I replaced them with Ns, using Word) and thought I was good to go. I would just have the program search for the forward motif and for the reverse motif, and see if the two motifs were complementary. But for some reason the motif-search program has a much harder time finding a motif in the set of gene sequences than in the set of whole-genome sequences. I'm beginning to suspect that it prefers its sequences in big fragments rather than in gene-sized pieces. This seems a bit improbable, so I'll wait until I've tested it before emailing the expert for help. Tomorrow I'll test it by getting the post-doc to use her clever Perl program to chop the genome into 500 bp fragments (in FASTA format), and then compare how the program handles these with how it handles the usual 9 kb fragments.
3. Do the bases at different positions in the motif interact with each other in a way that makes some combinations more likely than others? Put another way, in the aligned set of sequences the motif search produces, position 16 is a T 20% of the time, and position 17 is a T 44% of the time. If the two positions are independent of each other, we'd expect both to be Ts 8.8% of the time (0.20 × 0.44). But if the effects of the two Ts work together well, in a way that, for example, an A and a T don't, then more than 8.8% of the sequences might have TT here. We can use programs designed to detect the kind of genetic interaction called 'linkage disequilibrium' to detect these effects.
We didn't know what program(s) to use, so we posted a brief description of what we wanted to do onto the 'Evoldir' emailing list, where it would be seen by almost every evolutionary biologist on the planet. We got four very helpful replies, and soon I'll understand the programs well enough to answer our question.
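To make the arithmetic in question 3 concrete, here is a minimal Python sketch of the expected-versus-observed comparison for one pair of positions. It is not one of the programs recommended on Evoldir, just the basic calculation they build on. The function name `pair_disequilibrium` and the toy sequences are made up for illustration, and the code assumes the aligned sites are equal-length strings and that positions are counted from 0 (so the post's positions 16 and 17 would be columns 15 and 16).

```python
def pair_disequilibrium(aligned, i, j, base_i="T", base_j="T"):
    """Compare observed and expected co-occurrence of two bases.

    aligned: list of equal-length aligned motif sequences (hypothetical input).
    i, j:    0-based column indices of the two positions.
    Returns (expected, observed, D), where expected is the product of the
    two single-position frequencies and D = observed - expected.
    """
    n = len(aligned)
    if n == 0:
        raise ValueError("need at least one aligned sequence")

    # Frequency of each base at its own position, ignoring the other position.
    p_i = sum(seq[i] == base_i for seq in aligned) / n
    p_j = sum(seq[j] == base_j for seq in aligned) / n

    # Frequency with which both bases occur together in the same sequence.
    observed = sum(seq[i] == base_i and seq[j] == base_j for seq in aligned) / n

    expected = p_i * p_j  # what independence between the positions predicts
    return expected, observed, observed - expected


# Toy two-column alignment (invented data, not real USS sites).
toy_sites = ["TT", "TT", "AA", "AA"]
expected, observed, d = pair_disequilibrium(toy_sites, 0, 1)
print(f"expected={expected:.3f} observed={observed:.3f} D={d:.3f}")
```

A positive D means the pair occurs together more often than independence predicts, and a negative D means less often. Deciding whether a given D is bigger than chance would produce still needs a statistical test, which is where the dedicated programs come in.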