In the old days, computer programs ran on big centrally located computers that belonged to universities or big corporations, not on personal desktops or laptops. It just occurred to me that buying time on such a machine might still be possible. I'd gladly pay a modest sum to run a few of my simulations on something that was, say, 10-100 times faster than the MacBook Pro I'm typing this on. I tried googling "buy time on fast computer" and other permutations, but couldn't find anything useful (lots of "When is it time to buy a new fast computer" pages).
I think that there must be places to do this. Perhaps one of my occasional readers will know. But, in case you don't, I'm going to send an email to the evoldir list, which reaches thousands of evolutionary biologists.
Talk to your computer science department. I'd be amazed if there's no high-performance computing cluster there. It may not be 100 times as fast as your laptop but there's probably some system you have access to that offers a 10-fold speedup.
Hi Rosie,
There are services such as Amazon's EC2 (Elastic Compute Cloud) that let you run programs on large numbers of virtual computers, where you pay only for the units of compute that you use.
Given the tiny data sizes that you describe, it sounds like there is some intensive compute going on somewhere. If you have the resources, I would take another look at the way the algorithm is implemented. Can it be improved by changing the data structures or tricks such as memoization? If that seems fine, you may be pushing the limits of plain Perl's suitability for your problem. In that case, other avenues could include looking at specialist Perl modules such as PDL (Perl Data Language).
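For example, if the analysis repeatedly calls some pure function with the same arguments, Perl's core Memoize module can cache its results with almost no code changes. A minimal sketch (the function is a made-up placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Memoize;

    # Placeholder: imagine score_motif() is a pure function that gets
    # called over and over with the same (sequence, position) arguments.
    sub score_motif {
        my ($seq, $pos) = @_;
        my $score = 0;
        # ... expensive calculation would go here ...
        return $score;
    }

    # One line turns on caching: repeated calls with identical arguments
    # return the stored result instead of recomputing it.
    memoize('score_motif');

Whether this helps at all depends on how much repetition there really is, which only profiling the code would show.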
The terms you are looking for are "on demand computing" or "utility computing". Or, potentially, "supercomputing".
For most of these suggestions you would have to spend time reimplementing your software to use the new system, which I am sure you do not want to do.
With Amazon's EC2 service, you get access to high-end computers one at a time, billed only for the time you use. Each computer won't be any faster than yours, but you can get access to lots of computers for only as long as you need them. So this system would probably be best if you want to run a large number of experiments in parallel, maybe changing a variable between instances (see the sketch below). I would bet frequent commenter Deepak Singh could point you in useful directions: http://www.linkedin.com/in/dsingh
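As a rough sketch of the parallel-experiments idea (the script name and command-line options are made up), a small driver using the CPAN module Parallel::ForkManager can launch several independent runs at once:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Run up to 4 simulations at the same time; each child process
    # launches one run of a hypothetical simulation script.
    my $pm = Parallel::ForkManager->new(4);

    my @mutation_rates = (0.001, 0.005, 0.01, 0.05);   # example parameter values

    for my $rate (@mutation_rates) {
        $pm->start and next;                # fork; the parent moves on to the next value
        system('perl', 'simulation.pl',     # made-up script name and options
               '--mutation-rate', $rate,
               '--out', "results_$rate.txt");
        $pm->finish;                        # the child exits when its run is done
    }
    $pm->wait_all_children;                 # block until every run has completed

On EC2 the same idea applies, except that each run would go to its own instance rather than its own process.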
The Sun Grid at network.com looks like it can run Perl programs, but I'm not sure whether it would be any faster:
http://biowiki.org/SunGridEngineExamples
We have a bioinformatics cluster that you're welcome to test drive.
Check your email :^>
Rosie - you might be interested in checking out WestGrid [ http://www.westgrid.ca/ ], in particular its UBC node, aka Glacier [ http://www.westgrid.ca/support/quickstart/glacier ], which is suitable for computationally intensive serial jobs. Although I have never used it to run Perl, since it's an off-the-shelf Linux cluster I would expect it to have Perl installed by default. Also, it's free!
Wow, 5 comments already!
I've used WestGrid (a cluster shared between several nearby universities) for other parts of this project (e.g. most of the Gibbs analyses). But their individual computers are VERY SLOW, and rewriting our program to use the grid feature would be an absurd waste of time for the small amount of work I want to do now.
The program is computationally intensive. In each cycle it does a bit of analysis on some short sequences, checks the results, changes a parameter a bit, and does the analysis on some more sequences. It keeps doing this until a requirement is satisfied; sometimes this takes hundreds of iterations for a single cycle. And I want to run at least 100,000 cycles.
The nature of the program also means that cloud computing isn't an option. I'm simulating evolution, so each cycle takes as input the results of the previous cycle.
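In outline, it's something like this toy version (made-up numbers and a fake "analysis"; the real code is much more complicated):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy outline only, not the real code: an inner loop that repeats until a
    # requirement is met, inside an outer loop of cycles, where each cycle
    # starts from what the previous cycle produced.
    my $parameter = 0.5;                          # carried over from cycle to cycle
    for my $cycle (1 .. 1000) {                   # the real runs need ~100,000 cycles
        my $result;
        do {
            $result = rand() * $parameter;        # stands in for analyzing short sequences
            $parameter *= 1.01 if $result <= 0.4; # change the parameter a bit and try again
        } until ($result > 0.4);                  # requirement satisfied
        # ... in the real program, this cycle's output sequences become
        #     the input for the next cycle ...
    }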
Rob, thanks for the offer. I'll send you the program and the settings file; maybe you can see if it runs on your system, and check how long a simple run takes.
I don't know enough about this to be much help, but SFU has a Beowulf cluster.
I shared your post on FriendFeed:
http://friendfeed.com/the-life-scientists/c7309267/rrresearch-can-we-buy-time-on-shared-computer
The Life Scientists room in particular has become quite the community hub, and a good place to ask for advice of all sorts.
I don't know how your Perl script is set up or written, but there are many limiting factors here.
One is that if, as you say, each of the 100,000 cycles depends on the previous one, running your script on a larger machine won't necessarily save you running time. Say your MacBook is a 2.3 GHz machine; even if you run on a cluster with 3.0 GHz Xeons you won't gain much, maybe 5-10%, depending on how optimized your code is. If your script is not threaded or parallel, it won't take advantage of the faster machine.
Two, as mentioned above, if it's not threaded or parallel, nothing will be gained from a cluster without some modifications. You would need to restructure the calculations and find a way to distribute them across machines, either with an external framework such as Hadoop (a MapReduce implementation) or with major changes to your script (again, I have no idea how your code is set up).
Three, it's Perl. Perl is an interpreted language, so the code is not compiled, and whatever gain you get from a faster CPU may be nullified by running in an interpreter that is not optimized for your environment. Simulations and CPU-intensive applications like the one you're running are better suited to compiled languages. You may be able to write some C++ code that does the calculations and wrap it in a Perl script that takes care of the string parsing, etc. Again, it will depend on how your code was designed.
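For instance, the CPAN module Inline::C lets you move just the hot inner calculation into C (plain C rather than C++, but the idea is the same) while keeping everything else in Perl. A made-up example, since I obviously don't know what your inner loop actually computes:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical: the inner-loop calculation (here, counting mismatches
    # between two equal-length sequences) is written in C and compiled
    # automatically the first time the script runs.
    use Inline C => q{
        int count_mismatches(char* a, char* b) {
            int n = 0;
            int i;
            for (i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
                if (a[i] != b[i]) n++;
            }
            return n;
        }
    };

    print count_mismatches("ACGTACGT", "ACGTTCGA"), "\n";   # prints 2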
Four, there are many DNA sequence simulators available. One is DAWG, which is a wonderful and fast application. It can be compiled on Macs and gives you a lot of features; it may already do some of the things you're currently doing.
Five, I don't know if you already have, but it would help a lot if you posted your code. I understand you advocate open science, but that's only half of it: either science is open or it isn't. Without your code we cannot help, because we don't know what it is doing. Posting it would also help you, as you would be able to define more precisely what type of help you want.
Paulo
To everyone who suggested modifying my code, thanks, but I'm not competent to make any serious changes to it, and not prepared to learn how at present. Maybe later, after this manuscript is done.
Now that I realize mainframe computers no longer exist, I see two alternatives. 1. I could decide that I don't need to do the very long runs I was considering. 2. I could break up the genome I would be simulating into many short genomes and run them all on the WestGrid machines.
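For option 2, the splitting step itself would be trivial. Something like this (file names made up) would chop one long sequence into 10 kb pieces, one file per piece:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Rough sketch with made-up file names: read one long sequence from
    # genome.txt (sequence only, no header line) and write it out as
    # many 10 kb pieces, each in its own file, to run as separate jobs.
    my $chunk_size = 10_000;

    open my $in, '<', 'genome.txt' or die "Can't open genome.txt: $!";
    my $genome = do { local $/; <$in> };   # slurp the whole file
    close $in;
    $genome =~ s/\s+//g;                   # strip newlines and spaces

    my $piece = 0;
    for (my $start = 0; $start < length $genome; $start += $chunk_size) {
        $piece++;
        my $chunk = substr $genome, $start, $chunk_size;
        open my $out, '>', sprintf('genome_piece_%03d.txt', $piece)
            or die "Can't write piece $piece: $!";
        print {$out} $chunk, "\n";
        close $out;
    }
    print "wrote $piece pieces\n";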