I have been using an HPC cluster for a few years now and regularly need to submit jobs that process large numbers (often over 100) of large files, such as BAM files.
Despite some experience, I feel I am lacking an understanding of some of the basic concepts, such as:
- How to estimate how much RAM and runtime a job will need - I know, it's mostly based on experience and no one can ever answer that for me
- The relationship between how much RAM you give a job and its runtime: are these two parameters independent? Does one affect how long you sit in the queue more than the other?
So my question is:
Does anyone know of a nice book/online resource that explains these basic concepts and ideas? I find myself struggling to answer these simple questions, and the documentation out there is very often geared towards explaining complicated details about how supercomputers work. I am interested in all that, but I would like to start with a dumbed-down version that focuses on how to submit jobs properly.
Any ideas? I should say, the cluster I use runs the Sun Grid Engine (SGE) system.
This depends on knowing the tool and quite a bit of trial and error. Start by figuring out how to parallelize runs as much as possible, then optimize the RAM, wall time and number of cores for each parallel chunk, and likewise for the master thread.
Start off with 16 GB RAM, 4-8 cores and a wall time of 48-72 hours, then tune from there. There are a whole lot of variables that go into the process.
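As a rough sketch of what that can look like as an SGE array job (the parallel environment name `smp`, the resource names `h_vmem`/`h_rt`, and the input list `bam_list.txt` are assumptions and vary by cluster):

```bash
#!/bin/bash
# Hypothetical SGE array job: one task per BAM file.
# Resource names and the parallel environment are site-specific.
#$ -N process_bams          # job name
#$ -cwd                     # run from the submission directory
#$ -pe smp 4                # 4 cores per task (assumed PE name)
#$ -l h_vmem=4G             # on many sites h_vmem is per core, so 4 x 4G = 16G total
#$ -l h_rt=48:00:00         # 48 h wall time per task
#$ -t 1-100                 # 100 array tasks, one per input file
#$ -o logs/$JOB_NAME.$TASK_ID.out   # logs/ must already exist
#$ -e logs/$JOB_NAME.$TASK_ID.err

# Pick the Nth BAM from a pre-made list of inputs (hypothetical file).
BAM=$(sed -n "${SGE_TASK_ID}p" bam_list.txt)

# Placeholder for the real analysis step; samtools is used purely as an example.
samtools flagstat "$BAM" > "${BAM%.bam}.flagstat"
```

Submit it with `qsub process_bams.sh`, then watch the per-task memory and runtime to decide whether the requests are too generous or too tight.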
I doubt you will find any resource that explains these, because it's something you have to figure out for yourself through trial and error. It depends on the program you are using and the data you are processing. There are basically two approaches: 1) be extremely generous for each job and request more memory and time than you could possibly need, or 2) request only the bare minimum memory and time and see if the job completes successfully; if not, bump them up a little and try again.
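If you go with approach 2, SGE's accounting makes the bumping-up less blind: once a job finishes (or gets killed), you can ask what it actually used. The job ID below is just a placeholder:

```bash
# After a job finishes (or dies), query the accounting database for what it used.
qacct -j 1234567 | grep -E 'maxvmem|ru_wallclock|exit_status'

# For a job that is still queued or running, inspect the requested resources and current usage.
qstat -j 1234567 | grep -E 'hard resource_list|usage'
```

Compare `maxvmem` and `ru_wallclock` against what you requested, add a safety margin, and use those numbers for the next batch of similar inputs.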
m93, people have invested time to answer your question.
If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted.
Some programs are written to store their working data in memory. Others are written to work on sorted or otherwise predictably organized data from file streams. Still others work best doing a mix of both. It depends on your program and input.
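For instance (samtools is used here purely as an illustration, not necessarily your tool), some programs let you control that trade-off explicitly:

```bash
# samtools sort buffers reads in memory and spills to temporary files on disk.
# -m caps memory *per thread* and -@ sets the thread count, so this run needs
# roughly 4 x 1G = 4G of RAM plus scratch space for the spill files.
samtools sort -m 1G -@ 4 -o sample.sorted.bam sample.bam
```

Give it a larger -m and it spills less and runs faster; give it less and it trades speed for a smaller memory footprint.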
Without knowing what you're doing, this is a tough question to answer with specifics. Yet:
Giving more memory to a program that uses a constant amount of memory will not change how fast it runs; it will just waste memory. However, if you can split the work and run many instances of that program, each working concurrently on a small piece of the problem, then more memory helps the overall task finish sooner, because your aggregate memory use will be at most M x N for a constant per-job memory cost M and N jobs.
Also, job schedulers have an easier time moving many small-memory jobs from the wait queue into the run queue than one monolithic large-memory job, which may need to wait until queue conditions allow allotment of a large chunk of memory.
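A toy comparison of the two submission patterns; the script names, memory figures and runtimes below are made up:

```bash
# (a) One monolithic job: the scheduler must find a node with 64G free
#     for 72 hours before it can start, so it may sit in the queue a long time.
qsub -l h_vmem=64G -l h_rt=72:00:00 run_everything.sh

# (b) The same work as 100 small tasks (M = 2G each, so at most
#     M x N = 200G in aggregate if all 100 run at once, spread across nodes).
#     Each 2G / 2-hour slot is easy to backfill, so tasks tend to start sooner
#     and the whole batch usually finishes earlier.
qsub -t 1-100 -l h_vmem=2G -l h_rt=02:00:00 run_one_chunk.sh
```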
Actually, it would help the forum if you could post a few tips/suggestions, as you have years of experience submitting bioinformatics jobs to an HPC cluster. Take dummy data or public-domain data and walk us through to the end. At the least, point to your blog/GitHub repo for scripts. m93