The simplest version of our new Perl model of USS evolution has progressed to the state where it runs correctly. This afternoon I've been doing lots of runs, both with a 'positive control' version that replaces a random genome position with a single perfect USS core in every cycle, and with a test version that mutates random fragments, scores them for goodness of match to the USS motif, and then recombines the best-matched one back into the genome. Tomorrow the undergrad and I are going to create a modified version, to try a different way of having the fragments' scores determine whether they recombine with the genome.
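To make the positive-control step concrete, here's a rough sketch of what one cycle does (this isn't our actual script; the 10bp 'perfect core' and the toy genome below are just placeholders, and the mutation and scoring steps are left out):
#!/usr/bin/perl
use strict;
use warnings;
# Rough sketch of one positive-control cycle: overwrite a random 10bp
# window of the genome with a perfect USS core. Placeholders only --
# the core sequence and the genome are made up.
my $genome = join('', map { (qw(A C G T))[int(rand(4))] } 1 .. 1000);
my $core   = "AAGTGCGGTC";                       # placeholder 10bp core
my $pos    = int(rand(length($genome) - length($core) + 1));
substr($genome, $pos, length($core)) = $core;    # replace that window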
With the positive-control version I've been examining the effect of changing the genomic mutation rate. If the mutation rate is zero, the only limit to USS accumulation is the way insertion of a new USS disrupts existing USSs. (This happens only because each 10bp fragment is changed to a perfect USS before recombination, and so bears no relation to the original sequence at that position.) Not surprisingly, more USSs are present at equilibrium when the mutation rate is zero, and fewer when it is 0.01 or 0.05 changes per position per cycle. The rate of increase in genome score is largely independent of the mutation rate. Because only a single USS is inserted per cycle, the number of cycles to equilibrium depends on the length of the genome.
Wait - good idea! I think we need to add some code to give us the frequency of each sliding-window score at the end of the run. This would let us make a histogram of how many USS-like sequences the genome has at the beginning of the run, and how many it has accumulated by the end. Basically, as the sliding window scores the match to the motif at each position, it should record the score in a tally (number of sites scoring 0, number scoring 1, number scoring 2, ... number scoring 10). I could write some inefficient code to do this (barring about a thousand syntax errors - I really should go back and reread the first few chapters of Beginning Perl for Bioinformatics), but this sounds like something the undergrad might have learned an efficient way to do.
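Something like this rough sketch is what I have in mind (the consensus sequence and the toy genome are just placeholders, and the 'score' here is simply the number of the window's 10 positions that match the consensus):
#!/usr/bin/perl
use strict;
use warnings;
# Rough sketch of the score tally. The consensus and genome are
# placeholders; a window's score is how many of its 10 positions
# match the consensus (0..10).
my $genome    = join('', map { (qw(A C G T))[int(rand(4))] } 1 .. 1000);
my $consensus = "AAGTGCGGTC";                 # placeholder 10bp consensus
my @tally     = (0) x 11;                     # counts for scores 0..10
for my $i (0 .. length($genome) - 10) {
    my $window = substr($genome, $i, 10);
    my $score  = 0;
    for my $j (0 .. 9) {
        $score++ if substr($window, $j, 1) eq substr($consensus, $j, 1);
    }
    $tally[$score]++;                         # one more site with this score
}
print "$tally[$_] sites scored $_\n" for 0 .. 10;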
Did I learn anything else from the positive control runs? If the genome is very short the program runs very fast but the scores are noisy (no surprise there). I learned that I have no practical insight into how a sequence's USS 'score' reflects the quality of its matches to the motif - that's why we need the tally. I played around with the 'threshold' we use as a cutoff for insignificant matches to the USS consensus, but I think we can't really understand what this accomplishes until we have the score tally.
I also did some runs with the test version (not the positive control). The results of these mostly served to reinforce the importance of the genomic mutations. Under the present recombination system, USSs can't accumulate in the genome because they mutate away faster than they're improved by recombination. I tried turning off mutation of the genome, so that mutation only happens to fragments that are about to be scored for possible uptake. Even with this 'cheating', the genome's USS score crept up slowly and then plateaued at what looks (without the tally) like a genome with only weak USS sites.
4 comments:
Collecting the frequencies should be pretty easy. The example below makes a random genome and searches it for "ATTG" with a sliding window, recording the Hamming distance between the query and each window sequence. Each distance score becomes a key in a hashtable, and the corresponding value is incremented every time a window with that score is found.
#!/software/bin/perl
use strict;
use warnings;
# Generate a random genome
my $genome_len = 100;
my $nuc_idx = {0 => 'A', 1 => 'C', 2 => 'G', 3 => 'T'};
my @nuc_distrib;
for (my $i = 0; $i < $genome_len; ++$i) {
push(@nuc_distrib, int(rand(4)));
}
# Convert the digit list to a string of A/C/G/T characters
my $genome = pack("A" x $genome_len,
map { $nuc_idx->{$_} } @nuc_distrib);
# Here's our query
my $query = "ATTG";
my $query_len = length($query);
# Hashtable to store results
my %hamming_freqs;
# Search the genome
my $end = $genome_len - $query_len + 1;
for (my $i = 0; $i < $end; ++$i) {
my $dist = hamming_distance(substr($genome, $i, $query_len), $query);
# Here we collect frequencies
$hamming_freqs{$dist}++;
}
# Print results
foreach my $dist (sort { $a <=> $b } keys %hamming_freqs) {
print "Distance $dist occurred ", $hamming_freqs{$dist}, " times\n";
}
# Calculate Hamming distance between two sequences
sub hamming_distance {
my ($seq1, $seq2) = @_;
my $distance = 0;
if ($seq1 ne $seq2) {
# XOR the strings: identical characters give a null byte, mismatches give
# non-null bytes, which tr counts without changing the string
$distance = (($seq1 ^ $seq2) =~ tr/\001-\377//);
}
return $distance;
}
Heh, of course, if your scores aren't integers or are sparsely distributed this might not work so well. You would simply end up with zillions of table entries, all with a frequency of 1.
D'oh! I should think before I type.
In that case, you could use a hashtable where the keys are bin numbers (e.g. 0-9 for ten bins) and each score is tested with a switch statement to see in which bin it belongs.
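For example, a minimal sketch of that binning idea (the example scores and the bin width are made up, and integer division on the score stands in for an explicit switch):
#!/usr/bin/perl
use strict;
use warnings;
# Bin arbitrary scores into ten bins keyed by bin number (0..9).
# The scores and the 0-10 range here are made-up examples.
my @scores    = (0.12, 3.7, 3.9, 8.25, 9.99);
my $max_score = 10;
my $n_bins    = 10;
my $bin_width = $max_score / $n_bins;
my %bin_freqs;
for my $score (@scores) {
    my $bin = int($score / $bin_width);          # which bin this score falls in
    $bin = $n_bins - 1 if $bin >= $n_bins;       # clamp scores at the top edge
    $bin_freqs{$bin}++;
}
print "bin $_: $bin_freqs{$_}\n" for sort { $a <=> $b } keys %bin_freqs;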
Hi Keith,
We did it before I read your comment. Conveniently, the scores are integers between 0 and 10.
The 'Hamming distance' method appears to calculate exactly what we have been using as a USS-match score; we've been using a more cumbersome method to calculate it. Is the Wikipedia entry a good place to find out about this?
I'll have to read your post more carefully to see if your tally method is also more efficient than ours. I also need to read about hashtables; we used an array, which I suspect is much less efficient.
Thanks,
Rosie
Hi Rosie,
I only used that trick to calculate the Hamming distance because it's very terse and I wanted to save space. A better example of the same thing would be something like this:
sub hamming_distance {
    my ($seq1, $seq2) = @_;
    my $distance = 0;
    if ($seq1 ne $seq2) {
        my $len1 = length($seq1);
        my $len2 = length($seq2);
        unless ($len1 == $len2) {
            die "unequal lengths: $seq1, $seq2\n";
        }
        # Compare the two sequences position by position
        for (my $i = 0; $i < $len1; ++$i) {
            unless (substr($seq1, $i, 1) eq substr($seq2, $i, 1)) {
                ++$distance;
            }
        }
    }
    return $distance;
}
That's what I'd use in "real code" because it's obvious what's going on. Only if it were acting as a bottleneck would I change it to something more complex.
I had a look at the Wikipedia entry - it seems pretty comprehensive. I like the material at Bielefeld - it's what I used to read when I was doing bench microbiology.
Distance and Similarity
If you have working code, I wouldn't change it for efficiency reasons unless both 1) your program is running intolerably slowly and 2) you have run it in the Perl profiler and demonstrated that the bit you are thinking of changing really is the bottleneck. (A profiler tells you what proportion of the program runtime is spent in each function.)
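For what it's worth, here's one way to run the profiler (the script name is a stand-in; Devel::DProf comes bundled with older perls, and dprofpp summarizes its output):
# Profile the model script (script name is a stand-in), then summarize.
perl -d:DProf uss_model.pl   # writes profiling data to tmon.out
dprofpp                      # reports time spent in each subroutine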