I’m considering parallelizing the downloading of multiple FASTQ files from FTP sources using a SLURM job scheduler. Specifically, I’m thinking of using a job array where each element corresponds to either a single FASTQ file (for single-end reads) or two FASTQ files (for paired-end reads). Would this be an efficient approach for handling large-scale downloads on a high-performance computing cluster? Are there potential pitfalls or better alternatives for managing such downloads in parallel?
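For concreteness, here is a minimal sketch of the job-array idea described above. The manifest file name, the array size, and the `urls_for_task` helper are all assumptions made for illustration: each manifest line holds one URL (single-end) or two whitespace-separated URLs (paired-end), and the `%2` suffix caps how many array tasks run at once so the FTP server is not hit with too many connections.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=fastq-dl
#SBATCH --array=1-3%2        # 3 samples (assumed), at most 2 tasks running at once
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# Hypothetical manifest: one sample per line, holding one URL (single-end)
# or two whitespace-separated URLs (paired-end).
manifest=${MANIFEST:-manifest.txt}

# urls_for_task N FILE: print the URLs found on line N of FILE, one per line.
urls_for_task() {
    sed -n "${1}p" "$2" | tr -s ' \t' '\n' | sed '/^$/d'
}

# Only download when actually running as an array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f "$manifest" ]; then
    urls_for_task "$SLURM_ARRAY_TASK_ID" "$manifest" | while read -r url; do
        # --continue resumes a partial file if the task is requeued.
        wget --continue --tries=3 "$url"
    done
fi
```

This would be submitted with `sbatch`, with the `--array` range adjusted to the number of lines in the manifest.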
Thank you for the suggestion. To clarify, are you recommending submitting a single job to the scheduler with --cpus-per-task=${threads} and then running GNU Parallel within that job using --jobs ${threads}?
Yep, that should work.
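A rough sketch of that single-job setup, in case it helps. The URL list file name, the log file name, and the value of 4 CPUs are assumptions for illustration; `SLURM_CPUS_PER_TASK` is the variable Slurm exports when `--cpus-per-task` is set, and it plays the role of `${threads}` here.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=fastq-dl
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4    # assumed value; this is ${threads} in the question

# Inside the job Slurm exports SLURM_CPUS_PER_TASK; fall back to 1 elsewhere.
threads=${SLURM_CPUS_PER_TASK:-1}

# Hypothetical input: a plain-text file with one FTP URL per line.
url_list=${URL_LIST:-urls.txt}

if [ -n "${SLURM_JOB_ID:-}" ] && [ -f "$url_list" ]; then
    # One wget per URL, at most $threads running at the same time.
    # --joblog records the exit status of each download for later checking.
    parallel --jobs "$threads" --joblog download.log \
        wget --continue --tries=3 {} :::: "$url_list"
else
    echo "dry run: would start up to ${threads} concurrent downloads from ${url_list}"
fi
```

Downloads are mostly limited by the network rather than the CPU, so the number of concurrent jobs is probably bounded more by what the FTP server tolerates than by the core count.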
I've seen this configuration before and I don't get the reason behind it. Doesn't it defeat the point of having an HPC with a scheduler?
I don't think so. This setup just assumes that most jobs are highly parallel and will be using the full set of resources of one or more nodes. In practice, this is not always true, in which case there will be some level of inefficient resource usage.
The cluster I work on has, over time, decided to take an "opt-in" approach to node sharing, where users who know they don't need an entire node can voluntarily submit to queues designed for multitenancy. So far, this seems to work pretty well.
Fair enough, but it seems like a strong assumption. If I submit a job asking for 1 GB of memory and 1 CPU, I don't see why that job should fully occupy a node that is presumably much larger than that. I mean, it's the scheduler's task to find out where my job should go to make the best use of resources.
I actually more-or-less agree with you. I suspect that the next HPC system we set up will probably go from an "opt-in" to "opt-out" approach, forcing users to share node resources by default.
This kind of configuration is quite common on HPCs with very different users, mainly with different backgrounds/interests (e.g. academic and industrial users).
The HPC admins I have come across usually describe it as a 'security' measure: node exclusivity ensures you can't snoop around to see what other people are doing or interfere with their jobs.
In my experience, it usually does allow jobs from the same user to run on the same server/node (if the resource requests allow it).
I don't mean to be a pain, just procrastinating and thinking aloud here... A malicious user can still ssh from the head node to a compute node and see who is running what. Regarding interfering with jobs, if a regular user can do that then I'm inclined to think that either the cluster or the scheduler is not correctly managed (well, I once killed other people's jobs because I filled the /tmp directory).
There are ways to prevent this. A cluster can be set up so that you cannot ssh to a node where you don't have an active job running. Can that restriction be circumvented? Sure. In general, one would like to believe that one's colleagues are not looking to do something malicious.
It's fairly straightforward to prevent ssh to compute nodes without a job allocation.