In the old days, computer programs ran on big centrally located computers that belonged to universities or big corporations, not on personal desktops or laptops.  It just occurred to me that this might still be possible.  I'd gladly pay a modest sum to run a few of my simulations on something that was, say, 10-100 times faster than the MacBook Pro I'm typing this on.  I tried googling "buy time on fast computer" and other permutations, but couldn't find anything (lots of "When is it time to buy a new fast computer" pages).
I think that there must be places to do this.  Perhaps one of my occasional readers will know.  But, in case you don't, I'm going to send an email to the evoldir list, which reaches thousands of evolutionary biologists.
Talk to your computer science department. I'd be amazed if there's no high-performance computing cluster there. It may not be 100 times as fast as your laptop but there's probably some system you have access to that offers a 10-fold speedup.
Hi Rosie,
There are services such as Amazon's EC2 (Elastic Compute Cloud) that let you run programs on large numbers of virtual computers, where you pay only for the units of compute that you use.
Given the tiny data sizes that you describe, it sounds like there is some intensive compute going on somewhere. If you have the resources, I would take another look at the way the algorithm is implemented. Can it be improved by changing the data structures or tricks such as memoization? If that seems fine, you may be pushing the limits of plain Perl's suitability for your problem. In that case, other avenues could include looking at specialist Perl modules such as PDL (Perl Data Language).
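To make the memoization suggestion concrete, here is a minimal sketch in Python rather than Perl. The function `score_site` and the toy sequences are hypothetical stand-ins, not anything from Rosie's program; the point is only that a repeated call with the same arguments is answered from a cache instead of being recomputed. (Perl ships a core `Memoize` module that serves the same purpose.)

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function body really runs


@lru_cache(maxsize=None)
def score_site(site: str, motif: str) -> int:
    """Hypothetical 'expensive' step: count positions where site matches motif."""
    global calls
    calls += 1
    return sum(1 for a, b in zip(site, motif) if a == b)


# The same short sequences turn up repeatedly, so most calls hit the cache.
sites = ["ACGT", "ACGA", "ACGT", "ACGT", "ACGA"]
scores = [score_site(s, "ACGT") for s in sites]
print(scores, calls)
```

This only pays off if the program really does score the same inputs more than once; if every sequence is new, the cache just costs memory.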
The terms you are looking for are "on demand computing" or "utility computing". Or, potentially, "supercomputing".
For most of these suggestions, you will likely have to spend time reimplementing your software to use the new system, which I am sure you do not want to spend too much time doing.
With Amazon's EC2 service, you get access to high-end computers one at a time, billed only for the time that you use. Each computer won't be any faster than yours, but you will be able to get access to lots of computers for only the time you need. So, this system would probably be best if you want to run a large number of experiments in parallel (maybe changing a variable between instances). I would bet frequent commenter Deepak Singh could point you in useful directions: http://www.linkedin.com/in/dsingh
The Sun Grid at network.com looks like it is able to run Perl programs, but I'm not sure if it is faster:
http://biowiki.org/SunGridEngineExamples
We have a bioinformatics cluster that you're welcome to test drive.
Check your email :^>
Rosie - you might be interested in checking out WestGrid [ http://www.westgrid.ca/ ], in particular its UBC node aka Glacier [ http://www.westgrid.ca/support/quickstart/glacier ] which is suitable for computationally intensive serial jobs. Although I have never used it to run Perl, since it is an off-the-shelf Linux cluster I would expect it to have Perl installed by default. Also, it's free!
Wow, 5 comments already!
I've used WestGrid (a cluster shared between several nearby universities) for other parts of this project (e.g. most of the Gibbs analyses). But their individual computers are VERY SLOW, and rewriting our program to use the grid feature would be an absurd waste of time for the small amount of work I want to do now.
The program is computationally intensive. In each cycle it does a bit of analysis on some short sequences, checks the results, changes a parameter a bit, and does the analysis on some more sequences. It keeps doing this until a requirement is satisfied; sometimes this takes hundreds of iterations for a single cycle. And I want to run at least 100,000 cycles.
The nature of the program also means that cloud computing isn't an option. I'm simulating evolution, so each cycle takes as input the results of the previous cycle.
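A toy sketch of why this kind of run is hard to spread across machines. This is not the actual simulation: `run_cycle` is a hypothetical stand-in that just mutates one random base, and the names and seed are assumptions. What it shares with the real program is the shape of the loop, where each cycle consumes the output of the one before it.

```python
import random


def run_cycle(genome: str, rng: random.Random) -> str:
    """Toy stand-in for one cycle: replace one random position with a random base."""
    pos = rng.randrange(len(genome))
    base = rng.choice("ACGT")
    return genome[:pos] + base + genome[pos + 1:]


def evolve(genome: str, cycles: int, seed: int = 0) -> str:
    """Run the cycles strictly one after another."""
    rng = random.Random(seed)
    for _ in range(cycles):
        # Cycle N needs the genome produced by cycle N-1, so the cycles
        # cannot be handed out to different machines and run side by side.
        genome = run_cycle(genome, rng)
    return genome


final = evolve("ACGT" * 5, cycles=100, seed=1)
print(final)
```

With a loop like this, extra machines only help if there are several independent runs to do (different settings, different starting genomes); a single run goes only as fast as one processor can take it.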
Rob, thanks for the offer. I'll send you the program and the settings file; maybe you can see if it runs on your system, and check how long a simple run takes.
I don't know enough about this to be much help, but SFU has a Beowulf cluster.
I shared your post on FriendFeed:
http://friendfeed.com/the-life-scientists/c7309267/rrresearch-can-we-buy-time-on-shared-computer
The Life Scientists room in particular has become quite the community hub, and a good place to ask for advice of all sorts.
I don't know how your Perl script is set up or written, but there are many limiting factors here.
One is, if as you say all 100,000 cycles are dependent on the previous one, running your script on a larger machine won't necessarily save you running time. Say your MacBook is a 2.3 GHz machine; even if you run on a cluster with 3.0 GHz Xeons you won't be able to gain much, maybe 5-10% depending on how optimized your code is. Your script, if not threaded or parallel, won't take advantage of the faster machine.
Two, as mentioned above, if it's not threaded or parallel, nothing will be accomplished on a cluster without some modifications. You will need to make some changes to the calculations and find a way to distribute them to other machines. That means either some external application, such as Hadoop or MapReduce (again, I have no idea how your code is set up), or some major code change in your script.
Three, it's Perl. It's an interpreted language, so the code is not compiled; hence some of the gain you might get from running on a faster CPU might be nullified by running on an interpreter that is not optimized for the environment you're running in. Simulations and CPU-intensive applications like the one you are running are better suited to compiled languages. You may be able to create some C++ code that will do the calculations, and wrap it with a Perl script that will take care of the string parsing, etc. Again, it will depend on how your code was designed.
Four, there are many DNA sequence simulators available. One is DAWG, which is a wonderful application and fast. It can be compiled on Macs and gives you a lot of features. It might have some of the things you're currently doing.
Five, I don't know if you already did, but it would help a lot if you posted your code. I understand you preach OS, but that's half of the mass. Either Science is open, or not. Without your code, we cannot help, because we don't know what you're doing with it. That would also help you, as you will be able to define more precisely what type of help you want.
Paulo
..
To everyone who suggested modifying my code, thanks, but I'm not competent to make any serious changes to it, and not prepared to learn how at present. Maybe later, after this manuscript is done.
Now that I realize mainframe computers no longer exist, I see two alternatives. 1. I could decide that I don't need to do the very long runs I was considering. 2. I could break up the genome I would be simulating into many short genomes and run them all on the WestGrid machines.
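Alternative 2 might look roughly like the sketch below, written in Python purely for illustration. Everything here is an assumption: the function names, the toy mutation step, and above all the premise that the short genomes can evolve without influencing one another. If that premise holds, each piece is an independent job that could go to a different machine.

```python
import random


def split_genome(genome: str, segment_length: int) -> list:
    """Cut the genome into consecutive pieces of at most segment_length bases."""
    return [genome[i:i + segment_length]
            for i in range(0, len(genome), segment_length)]


def evolve_segment(task: tuple) -> str:
    """Toy stand-in for a full run on one short genome (one random change per cycle)."""
    segment, cycles, seed = task
    rng = random.Random(seed)  # separate seed per piece keeps runs reproducible
    for _ in range(cycles):
        pos = rng.randrange(len(segment))
        segment = segment[:pos] + rng.choice("ACGT") + segment[pos + 1:]
    return segment


def evolve_in_pieces(genome: str, segment_length: int, cycles: int, mapper=map) -> list:
    """Evolve every piece independently; mapper decides where the work runs."""
    segments = split_genome(genome, segment_length)
    tasks = [(seg, cycles, seed) for seed, seg in enumerate(segments)]
    return list(mapper(evolve_segment, tasks))


pieces = evolve_in_pieces("ACGT" * 6, segment_length=8, cycles=50)
print(pieces)
```

Here the default `mapper=map` runs the pieces one after another on a single machine. On a cluster, each task would instead be submitted as its own job; no change to the per-piece code is needed, which is what makes this route cheaper than rewriting the program to run in parallel.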