The sxy manuscript has been on hold, partly because one of the two grad students involved in it was in the far north. But he's back, and the manuscript is close enough to being finished that I'm hopeful it will be done soon ('soon' being an elastic term here). So I need to switch my attention to it and away from the motif searches for the USS-defined manuscript that have been consuming my brain power lately.
But before I stop, I'm seeing how much I can get finished. I ran and analyzed the leading-strand and lagging-strand searches; their motifs are indeed identical to the composite one I posted.
The more-stringent and less-stringent searches gave the results I expected: the more-stringent search found fewer sites with better mean scores, and the less-stringent search found more sites with poorer mean scores. I used the run that gave the most sites and the worst mean scores to do a correlation analysis. (Having more sites that are imperfectly matched to the consensus increases the power of this analysis to detect weak interactions.)
The goal of the correlation analysis is to find out whether the bases at different positions of the USS interact. For example, the most common bases at position 17 are A and G, and the most common bases at position 21 are T and A. If we find that the individual USSs that have A at position 17 usually have T at position 21, and those that have G at position 17 usually have A at position 21, we would conclude that the bases at these positions interact during DNA uptake. Said another way, we'd conclude that USSs with a G at position 17 function better if they have an A at position 21.
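To make the idea concrete, here is a minimal sketch of one common way to score this kind of dependence between two alignment columns: mutual information. It is only an illustration, not necessarily the calculation MatrixPlot or my colleague's software performs. The function name, the toy sequences and the 0-based column indices are all assumptions invented for the example.

```python
from collections import Counter
from math import log2


def mutual_information(sites, i, j):
    """Mutual information (in bits) between columns i and j of aligned sites.

    `sites` is a list of equal-length strings; i and j are 0-based indices.
    A value near 0 means the two columns vary independently.
    """
    n = len(sites)
    pair_counts = Counter((s[i], s[j]) for s in sites)
    i_counts = Counter(s[i] for s in sites)
    j_counts = Counter(s[j] for s in sites)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = count * n / (i_counts[a] * j_counts[b])
        mi += p_ab * log2(ratio)
    return mi


# Toy two-column "alignments" (hypothetical data, not real USS sites).
# Column 0 stands in for position 17, column 1 for position 21.
coupled = ["AT", "AT", "GA", "GA"]      # A always pairs with T, G with A
independent = ["AT", "AA", "GT", "GA"]  # every combination equally common

print(mutual_information(coupled, 0, 1))
print(mutual_information(independent, 0, 1))
```

In the coupled toy set, knowing the base in the first column tells you the base in the second, so the score is high; in the independent set it is zero. With real sites the counts are finite and noisy, so small positive scores would need to be compared against a shuffled control before reading anything into them.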
Results: MatrixPlot found only weak correlations, and only between a few adjacent positions in two clusters. A colleague has kindly used software he wrote to also analyze a preliminary data set for us; I'm going to ask him if he can test the big set. Before doing this, I realized that I only have half the data, as I only did the low-stringency searches on the forward strand. So I've queued up more searches, with the same and even lower stringency, on both forward and reverse-complement strands.
And, finally, some of the gene searches are working, thanks to fine-tuning advice from the helpful Gibbs expert. These runs are searching the sequences of only the parts of the genome that code for proteins, to see if the direction of coding affects the motif. I had to split the gene set into four parts, and two of these managed to find their motifs. So I've queued up more replicates, using more seeds, and also runs looking for the reverse-strand motif.
And last night I read over the Introduction and improved parts of it, though it still needs more work.