Forum: Slurm, Son of Grid Engine, Mesos, and Hadoop YARN vs. HTCondor and Torque
6.3 years ago
Shicheng Guo ★ 9.6k

Hi All,

Does anyone have a comparison of these high-throughput computing frameworks? Which one is the best choice for a high-throughput computing setup today (e.g., compare TACC vs. SDSC vs. HTCondor)?

Thanks

PBS HPC SDSC Torque
6.3 years ago

They're mostly the same at the end of the day; it's more a question of (1) choosing something that will still be supported in 5-10 years (the various SGEs keep losing support) and (2) finding someone locally willing to administer it. We switched from one of the umpteen SGE variants to Slurm a few years ago and are pretty happy. It's still getting regular updates and is widely used, so it's not going anywhere. The same can be said for Torque and LSF. HTCondor is a bit different, since most people would only use it if they need some of its more specialized features, e.g., moving datasets to nodes that lack shared filesystems (you can probably do this with other resource managers; I've never checked) or scavenging resources from otherwise unused computers.

BTW, regarding Hadoop YARN, I imagine that'd be most useful if your cluster used Hadoop. There are vanishingly few bioinformatics applications that natively support Hadoop, so I'm not really sure you'd end up gaining anything except headaches.

6.3 years ago
h.mon 35k

The "best" will probably depend on the size and architecture of your computing grid, and on the technical staff at hand to manage it. The only one I have hands-on administration experience with is Torque+Maui, which is relatively simple, but I would only recommend it for small clusters.

This Wikipedia link has some information, but it is very incomplete:

https://en.wikipedia.org/wiki/Comparison_of_cluster_software

Edit: I just discovered that Torque is no longer open source (it had a restrictive license, which some considered non-free):

http://www.adaptivecomputing.com/products/torque/

Note: As of June 2018, Adaptive Computing is offering Torque and Torque Support for purchase. For more information, please fill out the request form and we will respond as soon as possible.

6.3 years ago

Matt Maurano wrote some wrappers to submit SGE scripts through a Slurm scheduler: https://github.com/mauranolab/sge2slurm
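To give a flavor of what such a wrapper has to handle, here is a toy Python translator for a handful of common SGE "#$" directives. It is a hypothetical sketch, not how sge2slurm actually works: it maps only -N, -o, -e, -t and -cwd, assumes the site's shared-memory parallel environment is named "smp", and rejects anything else rather than guessing. Output paths are copied verbatim even though SGE and Slurm expand placeholders differently ($JOB_ID vs. %j).

```python
from typing import Optional

# A few SGE flags that map one-to-one onto sbatch long options.
SIMPLE_FLAGS = {"-N": "--job-name", "-o": "--output", "-e": "--error", "-t": "--array"}


def translate_directive(line: str) -> Optional[str]:
    """Translate one '#$ ...' line into an '#SBATCH ...' line, or None if it can be dropped."""
    tokens = line[len("#$"):].split()
    if not tokens:
        return None
    flag, args = tokens[0], tokens[1:]
    if flag == "-cwd":
        # sbatch already starts jobs in the submission directory by default.
        return None
    if flag in SIMPLE_FLAGS and len(args) == 1:
        return f"#SBATCH {SIMPLE_FLAGS[flag]}={args[0]}"
    # Assumption: the site's shared-memory parallel environment is called "smp".
    if flag == "-pe" and len(args) == 2 and args[0] == "smp":
        return f"#SBATCH --cpus-per-task={args[1]}"
    raise ValueError(f"no translation for SGE directive: {line!r}")


def translate_script(text: str) -> str:
    """Rewrite SGE header directives and the array-index variable for Slurm."""
    out = []
    for line in text.splitlines():
        if line.startswith("#$"):
            new = translate_directive(line)
            if new is not None:
                out.append(new)
        else:
            # Covers both $SGE_TASK_ID and ${SGE_TASK_ID}.
            out.append(line.replace("SGE_TASK_ID", "SLURM_ARRAY_TASK_ID"))
    return "\n".join(out)
```

For example, a script whose header contains "#$ -N perm" and "#$ -t 1-100" comes out with "#SBATCH --job-name=perm" and "#SBATCH --array=1-100", and its body refers to $SLURM_ARRAY_TASK_ID instead of $SGE_TASK_ID.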

One feature Slurm offers, which I can't recall whether our older SGE setup had, is the ability to submit job arrays, which is useful for simulations or permutation tests (Monte Carlo, etc.).
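As an illustration of how an array task typically finds its slot, here is a minimal Python sketch of one chunk of a permutation test. It reads SLURM_ARRAY_TASK_ID (or SGE_TASK_ID under SGE, which is set to "undefined" outside arrays) and uses it as the random seed, so each task draws different permutations. The data and the number of permutations per task are made up for illustration.

```python
import os
import random
from statistics import mean
from typing import Sequence


def task_id(default: int = 1) -> int:
    """Return this array task's index from the scheduler's environment."""
    # Slurm sets SLURM_ARRAY_TASK_ID; SGE sets SGE_TASK_ID.
    for var in ("SLURM_ARRAY_TASK_ID", "SGE_TASK_ID"):
        value = os.environ.get(var)
        if value and value != "undefined":
            return int(value)
    return default  # not running as an array task, e.g. a local test run


def permutation_chunk(x: Sequence[float], y: Sequence[float], n_perm: int, seed: int) -> int:
    """Count label permutations whose |mean difference| is at least the observed one."""
    rng = random.Random(seed)  # per-task seed, so tasks draw different permutations
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(x)], pooled[len(x):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return hits


if __name__ == "__main__":
    tid = task_id()
    # Hypothetical toy data; a real job would read its own inputs.
    x = [4.1, 5.0, 6.2, 5.5]
    y = [3.9, 4.2, 4.8, 4.0]
    print(tid, permutation_chunk(x, y, n_perm=1000, seed=tid))
```

Submitted as, say, a 100-task array (sbatch --array=1-100, with %a in the --output pattern so each task gets its own log), every task prints its own count; summing the counts and dividing by the total number of permutations gives a Monte Carlo p-value estimate.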

We've had some pain points with Slurm, mainly in configuring the fair-share priority mechanism and some other parameters, which led to less effective use of the cluster than desired when a lot of jobs are thrown at it.
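For context on what is being tuned: under Slurm's multifactor priority plugin, a job's priority is a weighted sum of factors that each lie between 0 and 1, and the classic fair-share factor is F = 2^(-U/S/d), where U is the account's normalized usage, S its normalized shares, and d the FairShareDampeningFactor. Newer Slurm releases default to the Fair Tree algorithm instead, so treat this Python sketch as an illustration of the idea only; it uses a subset of the real factors, and the weights and usage numbers are hypothetical.

```python
from typing import Dict


def fairshare_factor(usage: float, shares: float, dampening: float = 1.0) -> float:
    """Classic Slurm fair-share factor F = 2**(-U/S/d).

    usage (U) and shares (S) are the account's normalized usage and shares;
    dampening (d) plays the role of FairShareDampeningFactor (default 1).
    """
    if shares <= 0 or dampening <= 0:
        raise ValueError("shares and dampening must be positive")
    return 2.0 ** (-usage / shares / dampening)


def job_priority(factors: Dict[str, float], weights: Dict[str, int]) -> float:
    """Weighted sum of priority factors (a subset of Slurm's multifactor terms)."""
    return sum(weights[name] * value for name, value in factors.items())


# Hypothetical weights standing in for PriorityWeightFairshare / PriorityWeightAge.
WEIGHTS = {"fairshare": 10000, "age": 1000}

if __name__ == "__main__":
    # Two accounts with equal shares; "heavy" has used 90% of recent cycles.
    heavy = job_priority({"fairshare": fairshare_factor(0.9, 0.5), "age": 0.5}, WEIGHTS)
    light = job_priority({"fairshare": fairshare_factor(0.1, 0.5), "age": 0.5}, WEIGHTS)
    print(round(heavy), round(light))
```

With a large fair-share weight relative to age, the heavy user's equally old jobs sit well behind the light user's (a fair-share factor of about 0.29 vs. 0.87 here); getting these weight ratios right is much of the tuning effort described above.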

We also have some job submission timeout issues that are hard to find much about online. This seems to be a not-uncommon problem with Slurm deployments, and no one seems to know what causes it.
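While the root cause is being chased, one client-side stopgap is to retry submissions that fail with a timeout, backing off between attempts. The Python sketch below assumes sbatch is on PATH and that transient failures can be recognized by "timed out" in sbatch's error output (as in Slurm's "Socket timed out on send/recv operation" message); any other error is raised immediately. The retry policy itself is an assumption, not a documented Slurm recommendation.

```python
import subprocess
import time
from typing import Callable, Sequence


def sbatch_submit(args: Sequence[str]) -> str:
    """Run sbatch --parsable and return the job ID (requires sbatch on PATH)."""
    result = subprocess.run(
        ["sbatch", "--parsable", *args],
        capture_output=True, text=True, check=True,
    )
    # --parsable prints "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]


def is_transient(err: subprocess.CalledProcessError) -> bool:
    # Assumption: only timeouts are worth retrying.
    return "timed out" in (err.stderr or "").lower()


def submit_with_retry(
    submit: Callable[[], str],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call submit(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return submit()
        except subprocess.CalledProcessError as err:
            if attempt == attempts - 1 or not is_transient(err):
                raise
            sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
    raise ValueError("attempts must be at least 1")
```

A call would look like submit_with_retry(lambda: sbatch_submit(["job.sh"])), with job.sh standing in for a real batch script. Doubling the delay keeps a burst of retries from piling even more requests onto a controller (slurmctld) that is already slow to respond.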

I'd definitely suggest looking into an ironclad support contract of some kind, regardless of what scheduler you go with. Also, put the cluster through its paces with various levels of load testing from the start, to figure out what needs tweaking for your setup.


SGE does have job arrays (http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto), as does Torque+Maui. But I've seen one Torque+Maui system become unstable when large arrays were submitted, causing Maui to crash, and I found some posts about the subject as well, so the problem is not uncommon.
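If big arrays destabilize the scheduler, one mitigation worth trying (untested here against the Maui crashes described) is to submit them in smaller chunks and cap how many tasks run at once. This Python sketch targets Slurm's array syntax, where a "%N" suffix limits concurrently running tasks; SGE has its own throttle option (-tc). Because Slurm's MaxArraySize caps the task index value itself, not just the count, each chunk reuses indices from 0 and carries an offset that the job script would add to SLURM_ARRAY_TASK_ID. The script name run_task.sh and the OFFSET variable are hypothetical.

```python
from typing import List, Tuple


def array_chunks(n_tasks: int, chunk: int, max_running: int) -> List[Tuple[int, str]]:
    """Split n_tasks into (offset, --array spec) pairs, one per sbatch call.

    Each spec covers indices 0..size-1 of one chunk and is throttled with '%'
    so that at most max_running tasks of that chunk run concurrently.
    """
    if n_tasks < 1 or chunk < 1 or max_running < 1:
        raise ValueError("n_tasks, chunk and max_running must all be positive")
    chunks = []
    for offset in range(0, n_tasks, chunk):
        size = min(chunk, n_tasks - offset)
        chunks.append((offset, f"0-{size - 1}%{max_running}"))
    return chunks


if __name__ == "__main__":
    for offset, spec in array_chunks(2500, 1000, 50):
        # Hypothetical job script; it would use $((OFFSET + SLURM_ARRAY_TASK_ID))
        # as its global task index.
        print(f"sbatch --array={spec} --export=ALL,OFFSET={offset} run_task.sh")
```

Note that the "%" limit applies per array job, so three chunks submitted at once could still run up to 150 tasks together; submitting the chunks one after another (or chaining them with --dependency) keeps the total load lower.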

6.3 years ago
GenoMax 147k

The best in terms of features is probably LSF (which is not cheap, and that is probably the reason it is not on your list).
