The second of my three questions about the USS motif was whether, for the large subset of USSs that are in genes, the orientation of USSs with respect to the proteins they help code for affects their motif consensus. So my plan was to assemble all the coding sequences of the genome, all oriented in the direction their proteins are specified (not in the direction their DNA is replicated), and to then compare the motifs of the USSs in the two possible USS orientations.
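Orienting every gene "in the direction its protein is specified" comes down to reverse-complementing the genes annotated on the minus strand. Here is a minimal sketch of that step, not the actual procedure used for the TIGR download. The function names and the "+"/"-" strand labels are hypothetical, and the example string is the 9 bp core usually quoted for the Haemophilus influenzae USS, used here only as sample input.

```python
# Hypothetical helpers for illustration; names and strand labels are assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_coding(seq: str, strand: str) -> str:
    """Return the gene as read 5'->3' along its coding strand.

    `strand` is assumed to be "+" when the annotated sequence already
    reads in the coding direction and "-" when it must be flipped.
    """
    if strand == "+":
        return seq
    if strand == "-":
        return reverse_complement(seq)
    raise ValueError(f"unknown strand label: {strand!r}")


if __name__ == "__main__":
    print(orient_coding("AAGTGCGGT", "-"))
```

Once every gene is oriented this way, a USS on the coding strand and a USS on the opposite strand show up as two different strings in the file, which is what makes the two-orientation comparison possible.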
Assembling the sequences seemed straightforward (download in one file from TIGR, remove unwanted characters). But the motif-search program couldn't find the USSs I knew were there (see last week's post). I spent a week or more running tests and variations to figure out what was going wrong, because I didn't feel I understood the problem well enough to explain it clearly in an email to the helpful expert. Was the number of sequences over the limit? Were the 'N's I'd had to insert causing problems? Were the sequences too short? Was the problem dominant or recessive to a well-behaved sequence?
Yesterday the same problem appeared in some new sequence files, and then was corrected (see previous post). I wasn't entirely sure what I'd done that made the difference, but this did give me confidence that the problem with my gene sequence files was in the formatting, and my prime suspect was the hated carriage returns. These are a nightmare for Unix beginners like me - they're often invisible, they come in several incompatible flavours (Mac vs PC vs Unix), and Unix/Linux is very fussy about them. I can't remember exactly what I did, but I think it involved global search-and-destroy missions against carriage returns in both Word and Unix, then global restoration of the important returns in Word, then a passage through the text editor Mi to convert any Mac-style returns to Unix ones. And presto, the problem seems to be solved!
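The three flavours differ only in which bytes end a line: classic Mac files use a lone carriage return (CR), PC files use a carriage return plus line feed (CRLF), and Unix files use a lone line feed (LF). As a hedged alternative to the Word-and-Mi round trip described above, the sketch below counts each kind of line ending in a file's bytes and converts them all to Unix style. The function names are made up for illustration, and this is not what was actually done to the gene sequence files.

```python
# Illustrative sketch only; function names are hypothetical.

def count_line_endings(data: bytes) -> dict:
    """Count CRLF (PC), lone CR (classic Mac) and lone LF (Unix) endings."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def normalize_newlines(data: bytes) -> bytes:
    """Convert every line ending to a Unix LF.

    CRLF is collapsed first so that a PC line ending does not turn
    into two line breaks when the lone CRs are replaced afterwards.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    mixed = b">gene1\rACGTNN\r\nACGT\n"
    print(count_line_endings(mixed))
    print(normalize_newlines(mixed))
```

Counting first has the advantage of showing whether a file is actually mixed before anything gets changed, so the important returns never have to be restored by hand.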
So while I've been sleeping the program has been busy searching the gene sequence file for USS motifs, and later this morning I hope to be able to compare the forward- and reverse-direction motifs. We know that protein coding constraints do affect the reading frame that USSs are found in - for each USS orientation there's a preferred reading frame that USSs are best tolerated in. So it's reasonable to suspect that the USS consensuses might also differ between the orientations. If they do, we'll understand a bit more about how natural selection acts on USSs.
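To make the reading-frame point concrete: if every gene in the file starts at its first codon, the frame of a USS is just its start position modulo 3. The sketch below tallies exact matches to a core sequence by orientation and frame. It is only an illustration, not the motif-search program's method. The 9 bp core string, the function names, and the assumption that each sequence begins in frame are all mine, and exact string matching will miss the imperfect USSs that a real motif search is meant to find.

```python
from collections import Counter

# Assumed 9 bp core, used here only as an example search string.
USS_CORE = "AAGTGCGGT"
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tally_frames(genes, core=USS_CORE):
    """Count exact core matches by (orientation, reading frame).

    Each gene is assumed to be oriented in its coding direction and to
    start at the first base of its first codon, so frame = start % 3.
    """
    patterns = (("forward", core), ("reverse", reverse_complement(core)))
    counts = Counter()
    for seq in genes:
        seq = seq.upper()
        for label, pattern in patterns:
            start = seq.find(pattern)
            while start != -1:
                counts[(label, start % 3)] += 1
                start = seq.find(pattern, start + 1)
    return counts


if __name__ == "__main__":
    toy_genes = ["ATGAAGTGCGGTTAA", "ATGCACCGCACTTTAA"]
    for key, n in sorted(tally_frames(toy_genes).items()):
        print(key, n)
```

A strong skew toward one frame within each orientation would be the pattern the paragraph above describes.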
Later: No, I was overly optimistic. The program can find a short version of the USS motif (10 bp), but it finds nothing when asked to search for the full-length motif (22 bp). I suspect it needs a stronger prior expectation than just the spacing I'm giving it. Maybe I'll try suggesting the consensus sequence.