Is a SLURM job array submission suitable for parallelizing multiple FASTQ file downloads?
2 days ago
kalavattam ▴ 280

I’m considering parallelizing the downloading of multiple FASTQ files from FTP sources using a SLURM job scheduler. Specifically, I’m thinking of using a job array where each element corresponds to either a single FASTQ file (for single-end reads) or two FASTQ files (for paired-end reads). Would this be an efficient approach for handling large-scale downloads on a high-performance computing cluster? Are there potential pitfalls or better alternatives for managing such downloads in parallel?

parallel-computing HPC SLURM FASTQ • 646 views

"I’m thinking of using a job array"

As long as your cluster's storage can sustain the necessary bandwidth, this should be feasible. The next factors to come into play are the cluster network and interconnect, which I assume are fast Ethernet, InfiniBand, etc. If your data source supports it, you may want to look into Aspera or a similar optimized data-transfer tool or protocol, which may be more efficient for large transfers.

On top of everything, your local firewall or network setup may end up limiting the network flows. Aspera can actually use the full bandwidth of the available network, up and/or down.
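
If the firewall or storage does become the bottleneck, one way to keep a job array from opening too many connections at once is the `%` suffix of `--array`, which caps how many elements run simultaneously. A minimal sketch, assuming a hypothetical manifest with one sample per line and a hypothetical array script named `download_array.sh`; the default cap of 8 is arbitrary:

```shell
#!/bin/bash
# Usage (hypothetical): ./submit_downloads.sh manifest.txt 8

# Build an --array specification of the form 1-N%LIMIT, where N is the number
# of newline-terminated lines in the manifest ($1) and LIMIT ($2) caps how
# many array elements Slurm runs at the same time.
array_spec() {
    n=$(wc -l < "$1")
    echo "1-$((n))%$2"
}

# Submit only when sbatch is available and a manifest file was given.
if command -v sbatch >/dev/null 2>&1 && [ -f "${1:-}" ]; then
    sbatch --array="$(array_spec "$1" "${2:-8}")" download_array.sh
fi
```

A three-line manifest with a cap of 2 would be submitted as `--array=1-3%2`, so at most two downloads hit the network at once.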


This seems like an appropriate answer, along with the comment below. If you’d prefer to add this as a comment instead of a reply, I’d be happy to mark it as an accepted answer.

2 days ago
Dave Carlson ★ 1.9k

Assuming your Slurm cluster is configured like the one I use, where each job gets exclusive access to one or more compute nodes, each element in the Slurm job array will run on a separate compute node. If so, that will probably be quite inefficient, since each individual download is unlikely to use more than one core (or a handful of cores).

As an alternative, a tool like GNU parallel would work well for parallelizing multiple downloads within a single Slurm job.

That said, if your cluster allows multi-tenant jobs (multiple jobs on the same node at the same time), your Slurm array idea might work without wasting too many resources.
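
For the array route on a node-sharing cluster, each element could request just one core and a little memory. A minimal sketch, assuming a hypothetical manifest file with one sample per line, holding one URL for single-end reads or two whitespace-separated URLs for paired-end reads; the file names, resource values, and array range are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=fastq_dl
#SBATCH --array=1-4          # one element per manifest line; adjust to its length
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=04:00:00

# Hypothetical inputs, overridable through the environment.
manifest="${MANIFEST:-manifest.txt}"
outdir="${OUTDIR:-fastq}"

# Print the URL(s) found on line $2 of manifest file $1, one per output line.
urls_for_task() {
    sed -n "${2}p" "$1" | tr -s ' \t' '\n' | sed '/^$/d'
}

# Only download when running as an array element under Slurm.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    mkdir -p "$outdir"
    urls_for_task "$manifest" "$SLURM_ARRAY_TASK_ID" | while read -r url; do
        # --continue resumes partial files if the element is requeued.
        wget --continue --tries=3 --directory-prefix="$outdir" "$url" || exit 1
    done
fi
```

Each element reads the manifest line matching its `SLURM_ARRAY_TASK_ID`, so a paired-end sample downloads both mates inside the same element.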


"As an alternative, a tool like GNU parallel would work well for parallelizing multiple downloads within a single Slurm job."

Thank you for the suggestion. To clarify, are you recommending submitting a single job to the scheduler with --cpus-per-task=${threads} and then running GNU Parallel within that job using --jobs ${threads}?
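
Concretely, I have something like the following in mind. This is only a sketch, assuming a hypothetical `urls.txt` with one FTP URL per line (a paired-end sample contributes two lines) and that both GNU Parallel and `wget` are available on the compute node; the resource values are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=fastq_dl
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00

# Hypothetical inputs, overridable through the environment.
url_list="${URL_LIST:-urls.txt}"
outdir="${OUTDIR:-fastq}"

# Number of simultaneous downloads: the CPUs Slurm allocated to this task,
# falling back to 1 when the variable is not set (e.g. outside Slurm).
n_jobs() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Only start downloading when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    mkdir -p "$outdir"
    # :::: reads arguments from a file; {} is replaced by each URL in turn.
    parallel --jobs "$(n_jobs)" --joblog "$outdir/parallel.log" \
        wget --continue --tries=3 --directory-prefix="$outdir" {} :::: "$url_list"
fi
```

The job log records the exit status of each download, which should make it easier to spot and rerun failures.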


Yep, that should work.


"Slurm cluster is configured like the one I use, where each job gets exclusive access to one or more compute nodes"

I've seen this configuration before and I don't get the reason behind it. Doesn't it defeat the point of having an HPC with a scheduler?


I don't think so. This setup just assumes that most jobs are highly parallel and will be using the full set of resources of one or more nodes. In practice, this is not always true, in which case there will be some level of inefficient resource usage.

The cluster I work on has, over time, decided to take an "opt-in" approach to node sharing, where users who know they don't need an entire node can voluntarily submit to queues designed for multitenancy. So far, this seems to work pretty well.


"This setup just assumes that most jobs are highly parallel and will be using the full set of resources of one or more nodes."

Fair enough, but it seems a strong assumption. If I submit a job asking for 1 GB of memory and 1 CPU, I don't see why that job should fully occupy a node that is presumably much larger than that. I mean, it's the scheduler's task to find out where my job should go to make the best use of resources.


I actually more-or-less agree with you. I suspect that the next HPC system we set up will probably go from an "opt-in" to "opt-out" approach, forcing users to share node resources by default.
