Comments on RRResearch: Positive control progress on the USS model

2008-04-22T16:54:00.000-07:00

Hi Rosie,

I only used that trick to calculate the Hamming distance because it's very terse and I wanted to save space. A clearer version of the same thing would be something like this:

sub hamming_distance {
    my ($seq1, $seq2) = @_;

    my $distance = 0;
    if ($seq1 ne $seq2) {
        my $len1 = length($seq1);
        my $len2 = length($seq2);

        unless ($len1 == $len2) {
            die "unequal lengths: $seq1, $seq2\n";
        }

        # Count the positions at which the two sequences differ
        for (my $i = 0; $i < $len1; ++$i) {
            unless (substr($seq1, $i, 1) eq substr($seq2, $i, 1)) {
                ++$distance;
            }
        }
    }

    return $distance;
}

That's what I'd use in "real code" because it's obvious what's going on. Only if it were acting as a bottleneck would I change it to something more complex.

I had a look at the Wikipedia entry and it seems pretty comprehensive. I also like the material at Bielefeld; it's what I used to read when I was doing bench microbiology.

Distance and Similarity

If you have working code, I wouldn't change it for efficiency reasons unless both 1) your program is running intolerably slowly and 2) you have run it in the Perl profiler and demonstrated that the bit you are thinking of changing really is the bottleneck. (A profiler tells you what proportion of the program runtime is spent in each function.)
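Short of a full profiler run, the core Benchmark module gives a quick side-by-side timing of two candidate implementations. The sketch below is only an illustration under stated assumptions: the sub names hamming_loop and hamming_xor, the two test sequences and the one-second run time are invented for the example. It times the functions in isolation, so it says nothing about whether either one is the bottleneck in a real program; that still needs the profiler.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Readable version: compare the sequences one character at a time
sub hamming_loop {
    my ($seq1, $seq2) = @_;
    die "unequal lengths: $seq1, $seq2\n" unless length($seq1) == length($seq2);

    my $distance = 0;
    for (my $i = 0; $i < length($seq1); ++$i) {
        ++$distance if substr($seq1, $i, 1) ne substr($seq2, $i, 1);
    }
    return $distance;
}

# Terse version: XOR the strings and count the non-zero bytes
sub hamming_xor {
    my ($seq1, $seq2) = @_;
    return (($seq1 ^ $seq2) =~ tr/\001-\377//);
}

# Hypothetical test sequences, made up for this example
my ($seq_a, $seq_b) = ("AAGTGCGGT", "AAGTGCCGA");

# A negative count means "run each sub for at least that many CPU seconds"
cmpthese(-1, {
    loop => sub { hamming_loop($seq_a, $seq_b) },
    xor  => sub { hamming_xor($seq_a, $seq_b) },
});
```

cmpthese prints a small table of rates and relative speeds. The numbers will vary from machine to machine, so treat them as a rough guide only.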

2008-04-22T15:23:00.000-07:00

Hi Keith,

We did it before I read your comment. Conveniently, the scores are integers between 0 and 10.

The 'Hamming distance' method appears to calculate exactly what we have been using as a USS-match score; we've been using a more cumbersome method to calculate it. Is the Wikipedia entry a good place to find out about this?

I'll have to read your post more carefully to see if your tally method is also more efficient than ours. I also need to read about hashtables; we used an array, which I suspect is much less efficient.

Thanks,

Rosie

2008-04-22T09:46:00.000-07:00

Heh, of course, if your scores aren't integers or are sparsely distributed this might not work so well. You would simply end up with zillions of table entries, all with a frequency of 1.

D'oh! I should think before I type.

In that case, you could use a hashtable where the keys are bin numbers (e.g. 0-9 for ten bins) and each score is tested with a switch statement to see in which bin it belongs.
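The binning idea could be sketched roughly as follows. This is a minimal sketch, not something tested against real USS scores: the example scores, the 0 to 1 score range and the names bin_for and %bin_freqs are all made up for illustration. It also swaps the switch statement for a bit of arithmetic, which finds the bin number in one step.

```perl
use strict;
use warnings;

# Hypothetical scores in the range 0..1; real scores would come from the model
my @scores = (0.03, 0.12, 0.15, 0.58, 0.61, 0.99, 1.0);

my $num_bins  = 10;
my $min_score = 0;
my $max_score = 1;

# Map a score to a bin number between 0 and $num_bins - 1
sub bin_for {
    my ($score) = @_;
    my $bin = int(($score - $min_score) / ($max_score - $min_score) * $num_bins);

    # The maximum score would land in bin $num_bins, so fold it into the last bin
    $bin = $num_bins - 1 if $bin >= $num_bins;
    return $bin;
}

# Tally how many scores fall into each bin
my %bin_freqs;
$bin_freqs{bin_for($_)}++ foreach @scores;

foreach my $bin (sort { $a <=> $b } keys %bin_freqs) {
    print "Bin $bin: $bin_freqs{$bin} scores\n";
}
```

Bins that receive no scores never appear as keys, so the hashtable stays small even when the scores are sparse.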

2008-04-22T09:23:00.000-07:00

Collecting the frequencies should be pretty easy. The example below makes a random genome and searches it for "ATTG" with a sliding window, recording the Hamming distance between the query and the window sequences. Each distance score is recorded as the key of a hashtable, with the corresponding hash value incrementing each time a new match with that score is found.

#!/software/bin/perl

use strict;
use warnings;

# Generate a random genome
my $genome_len = 100;
my $nuc_idx = {0 => 'A', 1 => 'C', 2 => 'G', 3 => 'T'};

my @nuc_distrib;
for (my $i = 0; $i < $genome_len; ++$i) {
    push(@nuc_distrib, int(rand(4)));
}

my $genome = pack("A" x $genome_len,
                  map { $nuc_idx->{$_} } @nuc_distrib);

# Here's our query
my $query = "ATTG";
my $query_len = length($query);

# Hashtable to store results
my %hamming_freqs;

# Search the genome
my $end = $genome_len - $query_len + 1;
for (my $i = 0; $i < $end; ++$i) {
    my $dist = hamming_distance(substr($genome, $i, $query_len), $query);

    # Here we collect frequencies
    $hamming_freqs{$dist}++;
}

# Print results
# Sort numerically so that distances of 10 or more would still come out in order
foreach my $dist (sort { $a <=> $b } keys %hamming_freqs) {
    print "Distance $dist occurred ", $hamming_freqs{$dist}, " times\n";
}

# Calculate Hamming distance between two sequences
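sub hamming_distance {
    my ($seq1, $seq2) = @_;

    my $distance = 0;
    if ($seq1 ne $seq2) {
        # XOR the strings byte by byte, then count the non-zero bytes.
        # Escapes in tr are octal, so \377 (decimal 255) is the top of the byte range.
        $distance = (($seq1 ^ $seq2) =~ tr/\001-\377//);
    }

    return $distance;
}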
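To see why the terse version works, it may help to look at the intermediate XOR result. The snippet below is just an illustration; the two sequences are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sequences that differ only at the third position
my ($seq1, $seq2) = ("ATTG", "ATCG");

# String XOR works byte by byte: equal characters give "\0",
# unequal characters give a non-zero byte
my $xor = $seq1 ^ $seq2;

# %vd prints the ordinal of each character, joined by dots
printf "XOR bytes: %vd\n", $xor;

# tr with an empty replacement list only counts the characters in the range
my $mismatches = ($xor =~ tr/\001-\377//);
print "Mismatches: $mismatches\n";
```

This prints "XOR bytes: 0.0.23.0" and "Mismatches: 1". Only the third position differs ('T' is 84, 'C' is 67, and 84 XOR 67 is 23), so only one byte is non-zero.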