tag:blogger.com,1999:blog-32079676.post5582171489362285846..comments2024-02-13T21:22:02.522-08:00Comments on RRResearch: Positive control progress on the USS modelRosie Redfieldhttp://www.blogger.com/profile/06807912674127645263noreply@blogger.comBlogger4125tag:blogger.com,1999:blog-32079676.post-52848919649925152532008-04-22T16:54:00.000-07:002008-04-22T16:54:00.000-07:00Hi Rosie,I only used that trick to calculate the H...Hi Rosie,<BR/><BR/>I only used that trick to calculate the Hamming distance because it's very terse and I wanted to save space. A better example of the same would be something like this<BR/><BR/>sub hamming_distance {<BR/>my ($seq1, $seq2) = @_;<BR/><BR/>my $distance = 0;<BR/>if ($seq1 ne $seq2) {<BR/> my $len1 = length($seq1);<BR/> my $len2 = length($seq2);<BR/><BR/> unless ($len1 == $len2) {<BR/> die "unequal lengths: $seq1, $seq2\n";<BR/> }<BR/><BR/> for (my $i = 0; $i < $len1; ++$i) {<BR/> unless (substr($seq1, $i, 1) eq substr($seq2, $i, 1)) {<BR/> ++$distance;<BR/> }<BR/><BR/> return $distance;<BR/>}<BR/><BR/>That's what I'd use in "real code" because it's obvious what's going on. Only if it were acting as a bottleneck would I change it to something more complex.<BR/><BR/>I had a look at the Wikipedia entry - it seems pretty comprehensive. I like the material at bielefeld - it's what I used to read when I was doing bench microbiology.<BR/><BR/><A HREF="http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/node2.html#SECTION00023000000000000000" REL="nofollow">Distance and Similarity</A><BR/><BR/>If you have working code, I wouldn't change it for efficiency reasons unless both 1) your program is running intolerably slowly and 2) you have run it in the Perl profiler and demonstrated that the bit you are thinking of changing really is the bottleneck. (A profiler tells you what proportion of the program runtime is spent in each function.)Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-32079676.post-31474077310315961592008-04-22T15:23:00.000-07:002008-04-22T15:23:00.000-07:00Hi Keith,We did it before I read your comment. Co...Hi Keith,<BR/><BR/>We did it before I read your comment. Conveniently, the scores are integers between 0 and 10.<BR/><BR/>The 'Hamming distance' method appears to calculate exactly what we have been using as a USS-match score; we've been using a more cumbersome method to calculate it. Is the Wikipedia entry a good place to find out about this?<BR/><BR/>I'll have to read your post more carefully to see if your tally method is also more efficient than ours. I also need to read about hashtables; we used an array, which I suspect is much less efficient.<BR/><BR/>Thanks,<BR/><BR/>RosieRosie Redfieldhttps://www.blogger.com/profile/06807912674127645263noreply@blogger.comtag:blogger.com,1999:blog-32079676.post-4168887690349020172008-04-22T09:46:00.000-07:002008-04-22T09:46:00.000-07:00Heh, of course, if your scores aren't integers or ...Heh, of course, if your scores aren't integers or are sparsely distributed this might not work so well. You would simply end up the zillions of table entries, all with a frequncy of 1.<BR/><BR/>D'oh! I should think before I type.<BR/><BR/>In that case, you could use a hashtable where the keys are bin numbers (e.g. 0-9 for ten bins) and each score is tested with a switch statement to see in which bin it belongs.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-32079676.post-46732524641219558392008-04-22T09:23:00.000-07:002008-04-22T09:23:00.000-07:00Collecting the frequencies should be pretty easy. ...Collecting the frequencies should be pretty easy. The example below makes a random genome and searches it for "ATTG" with a sliding window, recording the Hamming distance between the query and the window sequences. Each distance score is recorded as the key of a hashtable, with the corresponding hash value incrementing each time a new match with that score is found.<BR/><BR/>#!/software/bin/perl<BR/><BR/>use strict;<BR/>use warnings;<BR/><BR/># Generate a random genome<BR/>my $genome_len = 100;<BR/>my $nuc_idx = {0 => 'A', 1 => 'C', 2 => 'G', 3 => 'T'};<BR/><BR/>my @nuc_distrib;<BR/>for (my $i = 0; $i < $genome_len; ++$i) {<BR/> push(@nuc_distrib, int(rand(4)));<BR/>}<BR/><BR/>my $genome = pack("A" x $genome_len,<BR/> map { $nuc_idx->{$_} } @nuc_distrib);<BR/><BR/># Here's our query<BR/>my $query = "ATTG";<BR/>my $query_len = length($query);<BR/><BR/># Hashtable to store results<BR/>my %hamming_freqs;<BR/><BR/># Search the genome<BR/>my $end = $genome_len - $query_len + 1;<BR/>for (my $i = 0; $i < $end; ++$i) {<BR/> my $dist = hamming_distance(substr($genome, $i, $query_len), $query);<BR/><BR/> # Here we collect frequencies<BR/> $hamming_freqs{$dist}++;<BR/>}<BR/><BR/># Print results<BR/>foreach my $dist (sort keys %hamming_freqs) {<BR/> print "Distance $dist occurred ", $hamming_freqs{$dist}, " times\n";<BR/>}<BR/><BR/><BR/># Calculate Hamming distance between two sequences<BR/>sub hamming_distance {<BR/> my ($seq1, $seq2) = @_;<BR/><BR/> my $distance = 0;<BR/> if ($seq1 ne $seq2) {<BR/> $distance = (($seq1 ^ $seq2) =~ tr/\001-\255//)<BR/> }<BR/><BR/> return $distance;<BR/>}Anonymousnoreply@blogger.com