Question

How can I learn cluster computing?

6

10.3 years ago

sviatoslav.kendall ▴ 970

I have access to my institution's supercomputer, and I recognize that knowing how to use a cluster-computing environment is a valuable skill for a bioinformatician, but I do not know how to go about learning to use one.

I imagine there must be some good tutorials out there that I could use to learn the basics. Can someone point me in the right direction?

genome next-gen RNA-Seq • 5.2k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by sviatoslav.kendall ▴ 970

4

Are you sure that your institution doesn't have a tutorial session or workshop series? They almost always do, because it keeps questions like these from flooding their inbox :)

What cluster software is it running? I may have some notes handy. You might be able to infer the cluster management software by typing one of the following on the command line:

man bsub
man msub
man qsub

Let me know if one of those commands gives you a manpage.
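If you'd rather not page through four manpages, here is a minimal sketch that just checks which submission commands are on your PATH. The function name detect_schedulers is made up for illustration. As a rough guide, bsub points to LSF, msub to Moab, qsub to PBS/TORQUE or SGE, and sbatch to SLURM; a site can have more than one installed, so treat the output as a hint rather than proof.

```shell
# Hypothetical helper: print every known submission command found on PATH.
detect_schedulers() {
    for cmd in bsub msub qsub sbatch; do
        # command -v succeeds only if the shell can resolve the name
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

detect_schedulers
```

An empty result most likely means you are not on a login node, or that the scheduler is loaded through an environment module.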

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Dan D 7.4k

1

I'll add man sbatch to that list.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Devon Ryan 105k

0

That would be my first suggestion too; get in touch with the IT people, ask about courses or online material. They usually provide at least some basic guides, in the hope that fewer people will break their system :)

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Neilfws 49k

0

Asking internal IT first is both necessary and the easiest route, not least to learn how LSF (or whatever platform is in use) has been set up, since some options may have been made mandatory. For example, submitting a job might be as easy as bsub < myscript.sh, but you will probably also need to say how much memory and run time you want.

The man pages are certainly authoritative and worth referring to but they might give the impression that submitting jobs is more complicated than it actually is in practice!
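To make the bsub < myscript.sh pattern a bit more concrete, here is a rough sketch of what such a script could look like with memory and run time requested. The job name, file names and resource values are placeholders I made up, and the unit used by -M is configured per site, so check with your IT people before copying the numbers.

```shell
# Write a small LSF job script; bsub reads the "#BSUB" lines as options.
cat > myscript.sh <<'EOF'
#!/bin/bash
# Job name (hypothetical)
#BSUB -J test_job
# Number of job slots (cores)
#BSUB -n 1
# Run-time limit, given as hours:minutes
#BSUB -W 00:10
# Memory limit; the unit is set by the site configuration
#BSUB -M 1000
# Files for stdout and stderr; %J expands to the job ID
#BSUB -o test_job.%J.out
#BSUB -e test_job.%J.err
echo "Running on $(hostname)"
EOF

# Submit only if LSF is actually installed on this machine.
if command -v bsub >/dev/null 2>&1; then
    bsub < myscript.sh
else
    echo "bsub not found; wrote myscript.sh without submitting"
fi
```

Keeping the options inside the script rather than on the command line means the resource request is recorded alongside the commands it applies to.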

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by dariober 15k

0

man pages for LSF are horrible!

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by brentp 24k

0

Glad to hear I'm not the only one! Between that and man curl it's a tough competition.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by dariober 15k

0

They do offer such courses but somewhat infrequently and I just missed the last one.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by sviatoslav.kendall ▴ 970

2

Bring the IT folks coffee and/or beer and I bet they'll give you the quick version of the course (or at the very least give you the slides).

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Devon Ryan 105k

0

man bsub brings up a page about "LSF jobs"

man qsub brings up a page about "PBS job"

I guess they've got both types of cluster software. Found a couple of tutorials online but would still be happy to take a look at any notes you have to share.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by sviatoslav.kendall ▴ 970

2

Do you know how to use Linux and bash? If so, then using a queuing system is a relatively small step. Usually if you can do:

echo "some long command" | bash

then it can run as:

echo "some long command" | qsub -e msg.err -o msg.out

and you simply have to learn about 5 common flags to reserve the correct number of CPUs and amount of memory.
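As a sketch of what those common flags might look like, here is the same pipe with resource requests added. The flags follow PBS/TORQUE conventions (SGE and PBS Pro spell their resource requests differently), and submit_pbs, my_job and the values are placeholders, not recommendations.

```shell
# Hypothetical wrapper around a PBS/TORQUE-style qsub call.
submit_pbs() {
    # $1 = the command line to run on a compute node
    if command -v qsub >/dev/null 2>&1; then
        # -N sets the job name, each -l requests a resource,
        # -e and -o name the stderr and stdout files.
        echo "$1" | qsub -N my_job \
            -l nodes=1:ppn=4 \
            -l mem=8gb \
            -l walltime=02:00:00 \
            -e msg.err -o msg.out
    else
        echo "qsub not found; nothing submitted for: $1"
    fi
}

submit_pbs "hostname"
```

Here nodes=1:ppn=4 asks for four cores on one node, mem for total memory, and walltime for the wallclock limit as hh:mm:ss. Many sites also expect a -q option naming the queue.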

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 10.3 years ago by brentp 24k

1

Maybe not a technical skill, but you should learn good practices and common courtesies. Always benchmark your programs/tasks for memory usage and CPU usage efficiency. Usually your goal is to either decrease the wallclock time needed to perform some task, or utilize multiple nodes to overcome some hardware limitation (e.g. memory). You should always look for ways to achieve these goals while utilizing your hardware as efficiently as possible.

A few pointers:

Don't clog the nodes or the queue with terrible scripts. Sometimes it can't be avoided, but if you do it all the time, people will go looking for a length of steel pipe when they see that you have 1200 24-core nodes each running a single-threaded Perl script that takes 13 hours, with 3k more of these waiting in the queue. An initial investment in performance is often worth the effort: you'll save walltime and people won't hate you.
If your system is heterogeneous, with different groups of nodes that have different numbers of cores or amounts of memory, or different interconnects, be mindful of what you use. Don't run 32 1 GB Python scripts on a 32-core node with 500 GB of RAM. If you're not using MPI, avoid nodes with InfiniBand/InfiniPath interconnects.
Be mindful of the funding sources used to build the machine. Sometimes a department or the university/facility will pay for the whole cluster. In other cases the system is paid for in parts from a number of different labs/groups. In this case if you have access to the nodes, use them, but be mindful about who paid for what. When in doubt, ask if your jobs are causing problems.
Always do small test runs before production. Try a single job on a node or two and make sure it is working. If you're having problems with production runs, move back to a small testing size. Don't sit there submitting huge numbers of jobs if you're troubleshooting or still developing.
Each cluster is different: the hardware, the software, the admin/IT support, the number of users and the common types of usage all vary. It is always useful to remember which shortcuts you can and can't take because of the features specific to your system. The types and level of usage, and how your batch system schedules things, affect how you can best run jobs. Sometimes it is faster to have a few nodes do more work than to have jobs sitting in the queue waiting for nodes to open up. The amount of hardware can also affect how you program: if you have tons of RAM, you can be lazier about memory management and how you load data. This can come back to hurt you later if you relax too much.
Pay attention to how the file system is set up and how/what is backed up. What nodes can see what directories/file systems? Are there differences in the types/speed of the drives used? How often are directories backed up? What is the maximum size of the snapshots that can be taken? Are daily backups different sizes than monthly/weekly?
This isn't usually the case in academic settings, but you may be paying per unit of usage: wall time, CPU time and/or storage.
Efficiency, not speedup, is what you're after. Don't use twice as many cores if it only saves you a few hours of runtime. If your jobs are relatively fast, don't use huge numbers of nodes just to save time. Wait a bit longer. Even embarrassingly parallel problems can stop scaling once you run up against other hardware limits.
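
To put a number on the last pointer, one common definition is speedup = T(1 core) / T(N cores) and efficiency = speedup / N. The sketch below uses that definition; the function name and the timings are invented for illustration, so substitute wallclock times from your own benchmark runs.

```shell
# Hypothetical helper: compute speedup and parallel efficiency.
parallel_efficiency() {
    # $1 = wallclock seconds on 1 core
    # $2 = wallclock seconds on N cores
    # $3 = N
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        speedup = t1 / tn
        printf "speedup=%.2f efficiency=%.2f\n", speedup, speedup / n
    }'
}

# Made-up timings: one hour serial, 10 minutes on 8 cores, 500 s on 16 cores.
parallel_efficiency 3600 600 8
parallel_efficiency 3600 500 16
```

With these invented numbers, doubling from 8 to 16 cores saves under two minutes while efficiency falls from 0.75 to 0.45, which is the kind of trade the pointer warns against.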

In general you want to be more prudent than usual. HPCs are great: you can do huge amounts of stuff in parallel, but remember that "stuff" can mean getting work done or it can mean "creating a disaster". Though the damage isn't permanent, it isn't fun to find out that your 1200 jobs run in parallel made 1200 messes.

In addition to learning the typical batch systems, you may want to explore the various tools and means of parallelization your cluster has to offer. These range from relatively simple tools like GNU parallel, to software-specific parallelization (e.g. MATLAB), to simple code-based approaches (e.g. R's snow package), to more complex ones (e.g. MPI). You may need to use these to develop custom tools, or you may need to know how they work in order to use software from others (e.g. MrBayes).
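
As a small taste of the simple end of that range, here is a sketch that counts lines in several files at once with GNU parallel, falling back to xargs -P when GNU parallel isn't installed. The function name and the .count output suffix are made up, and line counting is only a stand-in for whatever per-file task you really have.

```shell
# Hypothetical helper: run one "wc -l" per input file, several at a time.
count_lines_parallel() {
    # $1 = number of simultaneous jobs; remaining arguments = input files
    njobs=$1
    shift
    if parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # {} is replaced by each file name listed after :::
        parallel -j "$njobs" 'wc -l < {} > {}.count' ::: "$@"
    else
        # Fallback: xargs runs up to $njobs processes at once
        printf '%s\0' "$@" |
            xargs -0 -P "$njobs" -I{} sh -c 'wc -l < "$1" > "$1.count"' _ {}
    fi
}

# Example call (file names are placeholders):
#   count_lines_parallel 4 sample1.fastq sample2.fastq sample3.fastq
```

Inside a batch job, the number of simultaneous jobs should match the cores you actually requested from the scheduler, not the number of cores on the node.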

As others have stated, it isn't difficult to get started if you're already familiar with shell/*nix environments and can program. You may want to see if your school/company/entity has courses or classes on HPC. It can be very useful as the classes typically use whatever system you'll be working on. They'll also go more in depth into different areas of HPC/parallel computing which can be useful in the long run. I've found that they're pretty good places for networking and developing contacts you can approach with technical questions.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by pld 5.1k

Ram · Answer 1 · 2014-12-05

I worked at Vanderbilt for several years, and their compute cluster team was fantastic. They did weekly workshops to teach new users how to properly submit jobs. These slides will hopefully be very helpful to you. They'll show you how to construct a job submission, query for existing jobs, and check resource availability. Skip to slide 16:

http://www.accre.vanderbilt.edu/docs/Intro_to_Cluster.pdf

Based on the comments I think your cluster is using PBS/TORQUE, so hopefully those slides will be applicable. To check, just make a quick shell script job submission and see if it successfully executes after you qsub it.

Ram · Answer 2 · 2014-12-04

2

10.3 years ago

Vivek ★ 2.7k

The most commonly used schedulers are LSF and Sun Grid Engine (SGE). If you search for LSF + tutorial or Sun Grid Engine + tutorial, you'll find links to a bunch of quick start guides on various university webpages.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Vivek ★ 2.7k

Ram · Answer 3 · 2014-12-05

1

10.3 years ago

Ron ★ 1.2k

I think this is a very good resource

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Ron ★ 1.2k

Ram · Answer 4 · 2014-12-05

0

10.3 years ago

873243 • 0

SLURM is a widely used, highly modular and scalable resource manager for clusters (see the comment above mentioning man sbatch; sbatch is the command for submitting a script to a SLURM cluster). You can send an executable program to be run on the cluster.
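
For illustration, here is a minimal sketch of a script that could be handed to sbatch. The job name, file names and resource values are placeholders of mine, and your site may require extra options such as a partition or an account.

```shell
# Write a small SLURM batch script; sbatch reads the "#SBATCH" lines as options.
cat > slurm_job.sh <<'EOF'
#!/bin/bash
# Job name shown in the queue (hypothetical)
#SBATCH --job-name=test_job
# One task using one CPU core
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Memory per node and wallclock limit (hh:mm:ss)
#SBATCH --mem=1G
#SBATCH --time=00:10:00
# File for stdout and stderr; %j expands to the job ID
#SBATCH --output=test_job.%j.out
echo "Running on $(hostname)"
EOF

# Submit only if SLURM is actually installed on this machine.
if command -v sbatch >/dev/null 2>&1; then
    sbatch slurm_job.sh
else
    echo "sbatch not found; wrote slurm_job.sh without submitting"
fi
```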

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by 873243 • 0

Ram · Answer 5 · 2014-12-05

0

10.3 years ago

5heikki 11k

If you know how to use a shell, you're already there. There are just a few more utilities you need to master, like ssh, scp and qsub. If you don't, well, the shell is all the practice you need.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by 5heikki 11k

Ram · Answer 6 · 2014-12-05

I first met SGE/OGE, and people told me to RTFM.

So many things to learn and to understand.

I then learned that there is a parallel implementation of GNU make for SGE (and SLURM).

Just run

qmake -j 10

instead of

make -j 10

No more problems.

see also: How To Organize A Pipeline Of Small Scripts Together? , Standard simple format to describe a bioinformatics analysis pipeline ...