Positive control progress on the USS model

The simplest version of our new Perl model of USS evolution has progressed to the state where it runs correctly. This afternoon I've been doing lots of runs, both with a 'positive control' version that replaces a random genome position with a single perfect USS core in every cycle, and with a test version that mutates random fragments, scores them for goodness of match to the USS motif, and then recombines the best-matched one back into the genome. Tomorrow the undergrad and I are going to create a modified version, to try a different way of having the fragments' scores determine whether they recombine with the genome.
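A minimal sketch of one such cycle might look like the following. The core sequence, fragment count, and per-position mutation probability here are placeholders, not the model's real values:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of one cycle of the test version: mutate random fragments,
# score each against the USS core, recombine the best into the genome.
# $uss_core and the parameters below are placeholder assumptions.
my @nucs     = ('A', 'C', 'G', 'T');
my $uss_core = "AAGTGCGGT";            # placeholder core sequence
my $core_len = length($uss_core);

# Score = number of positions matching the core (higher is better)
sub match_score {
    my ($frag) = @_;
    my $score = 0;
    for my $i (0 .. $core_len - 1) {
        ++$score if substr($frag, $i, 1) eq substr($uss_core, $i, 1);
    }
    return $score;
}

# One cycle over a genome string: sample fragments, mutate each
# position with probability $mu, and splice the best-scoring
# fragment back in at its original position.
sub one_cycle {
    my ($genome, $n_frags, $mu) = @_;
    my ($best_frag, $best_pos, $best_score) = ('', 0, -1);
    for (1 .. $n_frags) {
        my $pos  = int(rand(length($genome) - $core_len + 1));
        my $frag = substr($genome, $pos, $core_len);
        for my $i (0 .. $core_len - 1) {
            substr($frag, $i, 1) = $nucs[int(rand(4))] if rand() < $mu;
        }
        my $s = match_score($frag);
        ($best_frag, $best_pos, $best_score) = ($frag, $pos, $s)
            if $s > $best_score;
    }
    substr($genome, $best_pos, $core_len) = $best_frag;
    return $genome;
}
```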

With the positive-control version I've been examining the effect of changing the genomic mutation rate. If the mutation rate is zero, the only limit to USS accumulation is the way insertion of a new USS disrupts existing USSs. (This happens only because each 10-bp fragment is changed to a perfect USS before recombination, and so bears no relation to the original sequence at that position.) Not surprisingly, more USSs are present at equilibrium when the mutation rate is zero, and fewer when the mutation rate is 0.01 or 0.05 changes per position per cycle. The rate of increase in genome score is largely independent of the mutation rate. Because only a single USS is inserted per cycle, the number of cycles to equilibrium depends on the length of the genome.
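The two operations being compared here could be sketched like this (hypothetical minimal versions, not the model's actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Positive control: overwrite a random window of the genome with a
# perfect USS core, regardless of what was there (which is how a new
# insertion can disrupt a USS placed in an earlier cycle).
sub insert_perfect_uss {
    my ($genome, $core) = @_;
    my $pos = int(rand(length($genome) - length($core) + 1));
    substr($genome, $pos, length($core)) = $core;
    return $genome;
}

# Genomic mutation: each position changes with probability $rate per
# cycle. This is a sketch; a real mutation step might exclude the
# original base when choosing the replacement.
sub mutate_genome {
    my ($genome, $rate) = @_;
    my @nucs = ('A', 'C', 'G', 'T');
    for my $i (0 .. length($genome) - 1) {
        substr($genome, $i, 1) = $nucs[int(rand(4))] if rand() < $rate;
    }
    return $genome;
}
```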

Wait - good idea! I think we need to add some code to give us the frequency of each sliding-window score at the end of the run. This would let us make a histogram of how many USS-like sequences the genome has at the beginning of the run, and how many it has accumulated at the end. Basically, as the sliding window is scoring the match to the motif at each position, it should record the score in a tally (number of sites scoring 0, number scoring 1, number scoring 2, ..., number scoring 10). I could write some inefficient code to do this (barring about a thousand syntax errors - I really should go back and reread the first few chapters of Beginning Perl for Bioinformatics), but this sounds like something the undergrad might have learned an efficient way to do.
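Since the scores are integers from 0 to 10, the tally itself can be kept in a simple 11-slot array indexed by score. A sketch, with a made-up list standing in for the sliding window's output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally sliding-window scores 0..10 into an array indexed by score.
# @window_scores is a made-up example standing in for whatever the
# sliding window actually produces.
my @window_scores = (0, 3, 3, 7, 10, 3, 0);

my @tally = (0) x 11;              # slots for scores 0 through 10
$tally[$_]++ for @window_scores;

for my $score (0 .. 10) {
    print "Score $score: $tally[$score] sites\n";
}
```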

Did I learn anything else from the positive control runs? If the genome is very short the program runs very fast but the scores are noisy (no surprise there). I learned that I have no practical insight into how a sequence's USS 'score' reflects the quality of its matches to the motif - that's why we need the tally. I played around with the 'threshold' we use as a cutoff for insignificant matches to the USS consensus, but I think we can't really understand what this accomplishes until we have the score tally.

I also did some runs with the test version (not the positive control). The results of these mostly served to reinforce the importance of the genomic mutations. Under the present recombination system, USSs can't accumulate in the genome because they mutate away faster than they're improved by recombination. I tried turning off the mutation of the genome, so that mutation only happens to fragments that are about to be scored for possible uptake. Even with this 'cheating', the genome's USS score crept up slowly and then plateaued at what looks (without the tally) to be a genome with only weak USS sites.

4 comments:

  1. Collecting the frequencies should be pretty easy. The example below makes a random genome and searches it for "ATTG" with a sliding window, recording the Hamming distance between the query and the window sequences. Each distance score is recorded as the key of a hashtable, with the corresponding hash value incrementing each time a new match with that score is found.

    #!/software/bin/perl

    use strict;
    use warnings;

    # Generate a random genome
    my $genome_len = 100;
    my $nuc_idx = {0 => 'A', 1 => 'C', 2 => 'G', 3 => 'T'};

    my @nuc_distrib;
    for (my $i = 0; $i < $genome_len; ++$i) {
        push(@nuc_distrib, int(rand(4)));
    }

    my $genome = pack("A" x $genome_len,
                      map { $nuc_idx->{$_} } @nuc_distrib);

    # Here's our query
    my $query = "ATTG";
    my $query_len = length($query);

    # Hashtable to store results
    my %hamming_freqs;

    # Search the genome with a sliding window
    my $end = $genome_len - $query_len + 1;
    for (my $i = 0; $i < $end; ++$i) {
        my $dist = hamming_distance(substr($genome, $i, $query_len), $query);

        # Here we collect frequencies
        $hamming_freqs{$dist}++;
    }

    # Print results (sorting the keys numerically, not as strings)
    foreach my $dist (sort { $a <=> $b } keys %hamming_freqs) {
        print "Distance $dist occurred ", $hamming_freqs{$dist}, " times\n";
    }

    # Calculate Hamming distance between two sequences
    sub hamming_distance {
        my ($seq1, $seq2) = @_;

        my $distance = 0;
        if ($seq1 ne $seq2) {
            # XOR the strings and count the non-null bytes (mismatches)
            $distance = ($seq1 ^ $seq2) =~ tr/\0//c;
        }

        return $distance;
    }

  2. Heh, of course, if your scores aren't integers or are sparsely distributed this might not work so well. You would simply end up with zillions of table entries, all with a frequency of 1.

    D'oh! I should think before I type.

    In that case, you could use a hashtable where the keys are bin numbers (e.g. 0-9 for ten bins) and each score is tested with a switch statement to see in which bin it belongs.
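Something like this, for instance. The bin index can also be computed arithmetically rather than tested case by case (handy since Perl before 5.10 has no built-in switch); the score range and bin count here are assumptions for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bin arbitrary (possibly fractional) scores into a fixed number of
# bins, keyed by bin index in a hashtable. The range [0,1) and the
# ten bins are assumptions, as is the made-up score list.
my ($min, $max, $nbins) = (0, 1, 10);
my $width = ($max - $min) / $nbins;

my @scores = (0.05, 0.15, 0.95, 0.999);
my %bin_freqs;
for my $score (@scores) {
    my $bin = int(($score - $min) / $width);
    $bin = $nbins - 1 if $bin >= $nbins;   # clamp the top edge into the last bin
    $bin_freqs{$bin}++;
}

for my $bin (sort { $a <=> $b } keys %bin_freqs) {
    print "Bin $bin: $bin_freqs{$bin}\n";
}
```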

  3. Hi Keith,

    We did it before I read your comment. Conveniently, the scores are integers between 0 and 10.

    The 'Hamming distance' method appears to calculate exactly what we have been using as a USS-match score; we've been using a more cumbersome method to calculate it. Is the Wikipedia entry a good place to find out about this?

    I'll have to read your post more carefully to see if your tally method is also more efficient than ours. I also need to read about hashtables; we used an array, which I suspect is much less efficient.

    Thanks,

    Rosie

  4. Hi Rosie,

    I only used that trick to calculate the Hamming distance because it's very terse and I wanted to save space. A better example of the same would be something like this

    sub hamming_distance {
        my ($seq1, $seq2) = @_;

        my $distance = 0;
        if ($seq1 ne $seq2) {
            my $len1 = length($seq1);
            my $len2 = length($seq2);

            unless ($len1 == $len2) {
                die "unequal lengths: $seq1, $seq2\n";
            }

            for (my $i = 0; $i < $len1; ++$i) {
                unless (substr($seq1, $i, 1) eq substr($seq2, $i, 1)) {
                    ++$distance;
                }
            }
        }

        return $distance;
    }

    That's what I'd use in "real code" because it's obvious what's going on. Only if it were acting as a bottleneck would I change it to something more complex.

    I had a look at the Wikipedia entry - it seems pretty comprehensive. I like the material at Bielefeld - it's what I used to read when I was doing bench microbiology.

    Distance and Similarity

    If you have working code, I wouldn't change it for efficiency reasons unless both 1) your program is running intolerably slowly and 2) you have run it in the Perl profiler and demonstrated that the bit you are thinking of changing really is the bottleneck. (A profiler tells you what proportion of the program runtime is spent in each function.)

