Is a SLURM job array submission suitable for parallelizing multiple FASTQ file downloads?
6 weeks ago
kalavattam ▴ 280

I’m considering parallelizing the downloading of multiple FASTQ files from FTP sources using a SLURM job scheduler. Specifically, I’m thinking of using a job array where each element corresponds to either a single FASTQ file (for single-end reads) or two FASTQ files (for paired-end reads). Would this be an efficient approach for handling large-scale downloads on a high-performance computing cluster? Are there potential pitfalls or better alternatives for managing such downloads in parallel?

parallel-computing HPC SLURM FASTQ • 1.2k views
6 weeks ago
Dave Carlson ★ 2.1k

Assuming your Slurm cluster is configured like the one I use, where each job gets exclusive access to one or more compute nodes, each element in the Slurm job array will run on a separate compute node. If so, that will probably be quite inefficient, since an individual download is unlikely to use more than one core (or a handful of cores).

As an alternative, a tool like GNU parallel would work well for parallelizing multiple downloads within a single Slurm job.

That said, if your cluster allows multi-tenant jobs (multiple jobs on the same node at the same time), your Slurm array idea might work without wasting too many resources.
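
If node sharing is available, the job-array idea from the question could look roughly like the sketch below. It rests on assumptions that are not in the thread: a hypothetical manifest `urls.tsv` with one line per sample (one URL for single-end, two tab-separated URLs for paired-end), `wget` as the download tool, a hypothetical helper `urls_for_task`, and placeholder values for the array range and resource requests. The `%4` suffix on `--array` caps how many elements run at once, which may help avoid overloading the FTP server or your storage.

```shell
#!/bin/bash
#SBATCH --job-name=fastq-dl
#SBATCH --array=1-100%4       # placeholder range; %4 = at most 4 elements at once
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# Hypothetical manifest: one line per sample, holding one URL (single-end)
# or two tab-separated URLs (paired-end).
MANIFEST="${MANIFEST:-urls.tsv}"

# Print the URLs found on line $1 of manifest $2, one URL per line.
urls_for_task() {
    sed -n "${1}p" "$2" | tr '\t' '\n' | sed '/^$/d'
}

# Download only when Slurm has assigned this process an array index.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    urls_for_task "$SLURM_ARRAY_TASK_ID" "$MANIFEST" | while read -r url; do
        wget --continue --tries=3 "$url" || exit 1   # fail the task on error
    done
fi
```

Before submitting with `sbatch`, the array range would need to match the number of lines in the manifest.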

As an alternative, a tool like GNU parallel would work well for parallelizing multiple downloads within a single Slurm job.

Thank you for the suggestion. To clarify, are you recommending submitting a single job to the scheduler with --cpus-per-task=${threads} and then running GNU Parallel within that job using --jobs ${threads}?

Yep, that should work.
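
For anyone finding this later, a minimal sketch of that single-job setup might look like the following. The file name `urls.txt` (one FASTQ URL per line, so a paired-end sample takes two lines), the choice of `wget`, the job log name, and the value of `--cpus-per-task` are all assumptions for illustration, not something specified above.

```shell
#!/bin/bash
#SBATCH --job-name=fastq-dl
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8     # placeholder; exported as SLURM_CPUS_PER_TASK

# Hypothetical input: a text file with one FASTQ URL per line.
URL_LIST="${URL_LIST:-urls.txt}"

# Match the number of concurrent downloads to the CPUs Slurm allocated,
# falling back to 1 when run outside a job.
threads="${SLURM_CPUS_PER_TASK:-1}"

if [ -r "$URL_LIST" ]; then
    # {} is replaced by each URL read from the file named after ::::
    parallel --jobs "$threads" --joblog download.joblog \
        wget --continue --tries=3 {} :::: "$URL_LIST"
else
    echo "URL list not found: $URL_LIST" >&2
fi
```

Since downloads are mostly network-bound rather than CPU-bound, `--jobs` does not strictly have to equal the CPU count; tying the two together just keeps the request and the workload easy to reason about.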

Slurm cluster is configured like the one I use, where each job gets exclusive access to one or more compute nodes

I've seen this configuration before and I don't get the reason behind it. Doesn't it defeat the point of having an HPC with a scheduler?

I don't think so. This setup just assumes that most jobs are highly parallel and will be using the full set of resources of one or more nodes. In practice, this is not always true, in which case there will be some level of inefficient resource usage.

The cluster I work on has, over time, decided to take an "opt-in" approach to node sharing, where users who know they don't need an entire node can voluntarily submit to queues designed for multitenancy. So far, this seems to work pretty well.

This setup just assumes that most jobs are highly parallel and will be using the full set of resources of one or more nodes.

Fair enough, but it seems a strong assumption. If I submit a job asking for 1 GB of memory and 1 CPU, I don't see why that job should fully occupy a node that is presumably much larger than that. I mean, it's the scheduler's task to find out where my job should go to make the best use of resources.

I actually more or less agree with you. I suspect that the next HPC system we set up will probably go from an "opt-in" to an "opt-out" approach, forcing users to share node resources by default.

This kind of configuration is quite common on HPCs with very different users, mainly with different backgrounds/interests (e.g., academic and industrial users).

HPC admins I came across usually point to the fact that it is a 'security' measure, as in: node exclusivity ensures you can't snoop around to see what other people are doing or interfere with their jobs.

In my experience it usually does allow jobs from the same user to run on the same server/node (if the resource requests allow it).

HPC admins I came across usually point to the fact that it is a 'security' measure, as in: node exclusivity ensures you can't snoop around to see what other people are doing or interfere with their jobs.

I don't mean to be a pain, just procrastinating and thinking aloud here... A malicious user can still ssh from the headnode to a working node and see who is running what. Regarding interfering with jobs, if a regular user can do that then I'm inclined to think that either the cluster or the scheduler is not correctly managed (well, I once killed other people's jobs because I filled the /tmp directory).

A malicious user can still ssh from the headnode to a working node and see who is running what.

There are ways to prevent this. A cluster can be set up so that you cannot ssh to a node where you don't have an active job running. Can that restriction be circumvented? Sure. In general, one would like to believe that one's colleagues are not looking to do something malicious.

It's fairly straightforward to prevent ssh to compute nodes without a job allocation.
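
One common mechanism for this is Slurm's `pam_slurm_adopt` PAM module. The thread does not name a method, so treat that as an assumption; the helper `pam_adopt_enabled` below is hypothetical, and the exact PAM and `slurm.conf` settings vary by site and Slurm version. The sketch shows the typical configuration lines as comments and a small check an admin could run on a compute node.

```shell
# Typical pieces of a pam_slurm_adopt setup (site details will differ):
#   /etc/pam.d/sshd on compute nodes:   account required pam_slurm_adopt.so
#   slurm.conf:                         PrologFlags=contain

# Hypothetical helper: succeed if PAM file $1 has an active (uncommented)
# account line that loads pam_slurm_adopt.
pam_adopt_enabled() {
    grep -Eq '^[[:space:]]*-?account[[:space:]].*pam_slurm_adopt\.so' "$1"
}

# Example check against the sshd PAM stack, if it is readable on this host.
if [ -r /etc/pam.d/sshd ] && pam_adopt_enabled /etc/pam.d/sshd; then
    echo "pam_slurm_adopt is enabled for sshd"
fi
```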

6 weeks ago
GenoMax 148k

I’m thinking of using a job array

As long as the storage your cluster uses can support the necessary bandwidth, this should be feasible. The next things to come into play will be the cluster network and interconnect, which I assume are fast Ethernet, InfiniBand, or similar. If your data source supports it, you may want to look into Aspera or a similar optimized data transfer tool/protocol, which may be the most efficient option for large transfers.

On top of everything, your local firewall/network setup may end up limiting the network flows. Aspera can actually use the full bandwidth of the available network, up and/or down.
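
As a rough illustration of the Aspera route, here is a sketch that assumes the source is ENA, that its FTP paths under `ftp.sra.ebi.ac.uk` are also served by the Aspera host `fasp.sra.ebi.ac.uk`, and that `ASPERA_KEY` points to the key file shipped with your Aspera client. None of that is stated in the thread, the two function names are hypothetical, and the 300 Mbit/s cap is a placeholder you would tune to your network.

```shell
# Convert an ENA FTP URL (or host/path string) into an Aspera source spec.
ena_ftp_to_aspera() {
    local path
    path="${1#ftp://}"                  # drop the scheme if present
    path="${path#ftp.sra.ebi.ac.uk/}"   # drop the FTP host name
    printf 'era-fasp@fasp.sra.ebi.ac.uk:%s\n' "$path"
}

# Fetch one file with ascp into directory $2 (default: current directory).
#   -Q  fair transfer policy      -T  disable encryption
#   -l  maximum transfer rate     -P  SSH port used for authentication
#   -i  private key file (ASPERA_KEY is an assumed environment variable)
aspera_fetch() {
    ascp -QT -l 300m -P 33001 -i "$ASPERA_KEY" \
        "$(ena_ftp_to_aspera "$1")" "${2:-.}"
}
```

Calling `aspera_fetch` once per URL slots into either a job array or a GNU Parallel run in place of an FTP download command.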
