I’m considering parallelizing the downloading of multiple FASTQ files from FTP sources using a SLURM job scheduler. Specifically, I’m thinking of using a job array where each element corresponds to either a single FASTQ file (for single-end reads) or two FASTQ files (for paired-end reads). Would this be an efficient approach for handling large-scale downloads on a high-performance computing cluster? Are there potential pitfalls or better alternatives for managing such downloads in parallel?
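For concreteness, here is a minimal sketch of the kind of array script I have in mind (the manifest file `urls.txt`, the array bounds, and the resource requests are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=fastq_dl
#SBATCH --array=1-100%10    # placeholder range; %10 caps concurrent tasks to avoid saturating the link
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=04:00:00

# urls.txt is a hypothetical manifest: one FTP URL per line for single-end
# runs, or two whitespace-separated URLs on one line for paired-end runs.
line=$(sed -n "${SLURM_ARRAY_TASK_ID}p" urls.txt)

for url in $line; do
    # -c resumes partial files if a task is requeued after a failure
    wget -c -P fastq/ "$url"
done
```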
As long as the storage your cluster uses can support the necessary bandwidth, this should be feasible. The next thing to come into play is the cluster network and interconnect, which I assume is fast Ethernet, InfiniBand, or similar. If your data source supports it, you may want to look into Aspera or a similar optimized data-transfer tool/protocol, which may be more efficient for bulk transfers. On top of everything, your local firewall/network setup may end up limiting the network flows.
Aspera can actually use the full bandwidth of the available network, up and/or down.
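For instance, pulling a run from ENA over Aspera looks roughly like this; the key location is the commonly documented Aspera Connect default, and the accession path is purely illustrative, so adjust both for your source:

```bash
# Illustrative only: the accession and remote path are assumptions,
# following ENA's /vol1/fastq/<prefix>/<run>/ layout.
ascp -QT -l 300m -P33001 \
  -i "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
  era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR000/SRR000001/SRR000001.fastq.gz \
  fastq/
```

Here `-l 300m` caps the transfer rate at 300 Mbps; raising or lowering it is the usual knob for staying within what your site's network and the remote server will tolerate.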