Question

How to use GNU parallel to download SRA files

1

6.9 years ago

Bioinfonext ▴ 470

Hi,

How can I use GNU parallel to speed up downloading SRA files with the command below?

nohup /mnt//sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --split-3 --gzip SRR1785709 SRR1785715 SRR1785721 SRR1785728 SRR1785734 SRR1785742 SRR1785744 >nohup.out &

RNA-Seq • 4.1k views

ADD COMMENT • link updated 6.3 years ago by Min Dai ▴ 160 • written 6.9 years ago by Bioinfonext ▴ 470

1

If possible use EBI-ENA to get the fastq files directly.

Consider that you may be saturating incoming bandwidth on the network connection (once you get this to work). If you are on a shared machine/cluster that can cause issues for others.
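
As a rough illustration of the ENA route, the sketch below builds the FTP directory that would hold the FASTQ files for a run accession. The directory layout (first six characters, then a zero-padded subdirectory for accessions longer than nine characters) is an assumption about how ENA organises vol1/fastq, and ena_fastq_dir is a hypothetical helper name; check the paths against ENA's own file report for your runs before relying on it.

```shell
# Build the ENA FASTQ directory URL for a run accession.
# ASSUMPTION: layout is vol1/fastq/<first 6 chars>/[<padded suffix>/]<accession>
ena_fastq_dir() {
  acc=$1
  prefix=$(printf '%s' "$acc" | cut -c1-6)
  base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$prefix"
  case ${#acc} in
    9)  # e.g. SRR000001: no extra subdirectory
        printf '%s/%s\n' "$base" "$acc" ;;
    10) # e.g. SRR1785709: subdirectory 00 + last digit
        sub=$(printf '%s' "$acc" | cut -c10)
        printf '%s/00%s/%s\n' "$base" "$sub" "$acc" ;;
    11) # subdirectory 0 + last two digits
        sub=$(printf '%s' "$acc" | cut -c10-11)
        printf '%s/0%s/%s\n' "$base" "$sub" "$acc" ;;
    *)  echo "unsupported accession length: $acc" >&2
        return 1 ;;
  esac
}

# Print the directory for one of the accessions in the question.
ena_fastq_dir SRR1785709
```

The file names inside that directory depend on the library layout (single or paired), so list the directory before downloading rather than guessing them.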

ADD REPLY • link 6.9 years ago by GenoMax 148k

1

@Pierre's parallel tutorial.

ADD REPLY • link 6.9 years ago by GenoMax 148k

GenoMax · Answer 1 · 2018-01-24

3

6.9 years ago

tiago211287 ★ 1.5k

As the sizes of the datasets have increased, we have found that the traditional methods of FTP or HTTP do not have the performance characteristics needed to support this load of data. FTP performance degrades proportionally with the number of hops or switches the data must take to get to you. Aspera performance does not degrade with distance. Aspera is typically 10 times faster than FTP and reduces the chance of drops or time-outs in the middle of a transfer. Best-case transfer rates for ascp are ~600 Mbps, while typical rates are closer to 100-200 Mbps. (Source: Aspera Transfer Guide)

GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them

On the server/cluster, do the following:

Download Aspera Connect for Linux.

Run the installer shell script:

./aspera-connect-Version-linux-64.sh

Aspera will be installed under:

$HOME/.aspera/connect/bin/

Make a text file with the accession IDs. One way is to redirect cat into a new file, paste the IDs, and end the input with Ctrl+D:

cat > accessions.txt
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059

Use GNU parallel

parallel  --max-procs 1 --xapply ascp -v -k 1 -l50m -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR134/{1}/{1}.sra $HOME/OUTPUT/FOLDER/ :::: accessions.txt

Explanation

--max-procs 1 -> downloads only one item at a time

-v -> verbose mode

-k 1 -> allows you to resume incomplete transfers

-l50m -> limits the bandwidth to 50 Mbps (about 6 MB/s)

-i -> path to the SSH key file used for authentication (asperaweb_id_dsa.openssh, shipped with Aspera Connect)

{SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file

path to the files: /sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra
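
The command above hard-codes the SRR/SRR134 part of the path, so it only works for accessions starting with SRR134. A minimal sketch of deriving the full remote path from each accession, following the pattern just described, could look like this (sra_path is a hypothetical helper name, not part of Aspera or the SRA Toolkit):

```shell
# Build the remote .sra path for an accession, following the pattern
# /sra/sra-instant/reads/ByRun/sra/<type>/<first 6 chars>/<acc>/<acc>.sra
sra_path() {
  acc=$1
  type=$(printf '%s' "$acc" | cut -c1-3)     # SRR, ERR or DRR
  prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR134
  printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$type" "$prefix" "$acc" "$acc"
}

# Print the path for the first accession in the example list.
sra_path SRR1346053
```

Looping over accessions.txt and writing the resulting paths to a file gives a list that can be fed to parallel in place of the bare accession IDs.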

Convert all .sra files to raw FASTQ files:

find "$PWD" -name "*.sra" | parallel --max-procs N fastq-dump --split-files {1}

N = number of simultaneous fastq-dump instances (at most the number of available cores).
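
For the command in the original question, where fastq-dump fetches the runs itself, a sketch along the same lines is to generate one fastq-dump command per accession and let parallel run them. It assumes fastq-dump and GNU parallel are on the PATH; print_cmds and the RUN_DOWNLOADS guard are hypothetical names used only for this illustration, and 2 jobs is an arbitrary, deliberately low choice.

```shell
# Accessions from the original question, space-separated.
accessions="SRR1785709 SRR1785715 SRR1785721 SRR1785728 SRR1785734 SRR1785742 SRR1785744"

# Print one fastq-dump command line per accession given as an argument.
print_cmds() {
  for acc in "$@"; do
    printf 'fastq-dump --split-3 --gzip %s\n' "$acc"
  done
}

# Nothing is downloaded unless RUN_DOWNLOADS=1 is set (hypothetical guard).
if [ "${RUN_DOWNLOADS:-0}" = "1" ]; then
  # Word splitting of $accessions is intentional here.
  # With no command argument, parallel executes each input line as a command.
  print_cmds $accessions | parallel --jobs 2
fi
```

Running print_cmds on its own first shows exactly which commands would be executed, which is a cheap check before starting any transfers.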

ADD COMMENT • link updated 6.9 years ago by GenoMax 148k • written 6.9 years ago by tiago211287 ★ 1.5k

2

While all this is great information, OP (Bioinfonext) should definitely talk with the local cluster admins before doing this. It could put a lot of load on the head node (if run there) and/or clog the network to the point that no one else can get anything done.

ADD REPLY • link 6.9 years ago by GenoMax 148k

0

You are totally right. I made that mistake myself when I was learning. One way to use this without upsetting coworkers is to limit the bandwidth to 50 Mbps or less with "-l50m" and to always set parallel's --max-procs parameter to a low value.

Talking with the admin is a good idea.

ADD REPLY • link 6.9 years ago by tiago211287 ★ 1.5k

0

If you're adding --max-procs and setting it to a single job, there's no point parallelising...? (Concerns about OP sucking up all the bandwidth aside.)

ADD REPLY • link 6.9 years ago by Joe 21k

0

Actually, there is: with --max-procs 1 you avoid having to write a loop.

ADD REPLY • link 6.8 years ago by tiago211287 ★ 1.5k

0

IMO, in general there's no point in trying to parallelize downloads. It will not magically increase your download bandwidth, nor increase the speed at which any decently configured server serves you files. Instead, it might lead to the server flagging you and banning your IP address.

ADD REPLY • link 6.3 years ago by 5heikki 11k

score 0 · Answer 2 · 2018-02-12

0

6.9 years ago

sutturka ▴ 190

There is a parallel-fastq-dump utility available which might be useful. I have yet to test its performance and will update the answer soon.

ADD COMMENT • link 6.9 years ago by sutturka ▴ 190

0

Provided you have no I/O bottlenecks, it is a very nice wrapper around fastq-dump. There is now also fasterq-dump available in the current SRA Toolkit. I have not tested it yet.

ADD REPLY • link 6.3 years ago by ATpoint 86k

score 0 · Answer 3 · 2018-08-27

0

6.3 years ago

Min Dai ▴ 160

seq 5260 5274 | parallel -j 8 wget -P ~/GSE62129 ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160/SRR160{}/SRR160{}.sra

Refer to: https://www.slashroot.in/how-run-multiple-commands-parallel-linux

ADD COMMENT • link 6.3 years ago by Min Dai ▴ 160