Hi,
How can I use GNU parallel to speed up the SRA download in the command below?
nohup /mnt//sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --split-3 --gzip SRR1785709 SRR1785715 SRR1785721 SRR1785728 SRR1785734 SRR1785742 SRR1785744 >nohup.out &
As the sizes of the datasets have increased, we have found that the traditional methods of FTP or HTTP do not have the performance characteristics needed to support this load of data. FTP performance degrades proportionally with the number of hops or switches the data must take to get to you. Aspera performance does not degrade with distance. Aspera is typically 10 times faster than FTP and reduces the chance of drops or time-outs in the middle of a transfer. Best-case transfer rates for ascp are ~ 600 Mbps, while typical rates are closer to 100-200 Mbps. [Aspera Transfer Guide][1]
GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them
On the server or cluster:
Download Aspera Connect for Linux.
Run the installer script:
./aspera-connect-Version-linux-64.sh
Aspera will be installed under:
$HOME/.aspera/connect/bin/
Make a text file with the accession IDs. One way is to cat into an empty file and paste the IDs, then end the input with Ctrl+D:
cat > accessions.txt
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059
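The same file can also be written without the interactive cat session. A minimal sketch, assuming the seven accessions listed above and an output file named accessions.txt in the current directory:

```shell
# Write one accession per line; printf repeats the format for each argument.
printf '%s\n' \
  SRR1346053 SRR1346054 SRR1346055 SRR1346056 \
  SRR1346057 SRR1346058 SRR1346059 > accessions.txt

# Sanity check: count the lines that were written.
wc -l < accessions.txt
```

This form is easier to reuse inside a script or a job submission file.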
Use GNU parallel
parallel --max-procs 1 --xapply ascp -v -k 1 -l50m -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR134/{1}/{1}.sra $HOME/OUTPUT/FOLDER/ :::: accessions.txt
Explanation
--max-procs 1 -> downloads only 1 item at a time
-v -> verbose mode
-k 1 -> allows incomplete transfers to be resumed
-l50m -> limits the bandwidth to 50 Mbps (about 6 MB/second)
-i -> key file used for public-key authentication (asperaweb_id_dsa.openssh, installed with Aspera Connect)
{SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file
path to the files: /sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra
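The path layout above can be derived from the accession itself. The following is only a sketch of that rule: sra_path is a hypothetical helper name, not part of Aspera or the SRA Toolkit, and it assumes accessions that start with a three-letter SRR, ERR, or DRR prefix.

```shell
# Build the remote .sra path for one accession, following the layout above.
# sra_path is a hypothetical helper name used only for illustration.
sra_path() {
  acc=$1
  prefix=$(printf '%s' "$acc" | cut -c1-3)   # SRR, ERR or DRR
  first6=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR134
  printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$prefix" "$first6" "$acc" "$acc"
}

# Print the path for the first accession in the list.
sra_path SRR1346053
```

Deriving the path this way avoids hard-coding SRR/SRR134 when the accession list mixes prefixes.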
Convert all .sra files to raw FASTQ files:
find "$PWD" -name "*.sra" | parallel --max-procs N fastq-dump --split-files {1}
N = number of simultaneous instances (at most the number of cores available to process requests).
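One possible way to choose N is to take the core count and cap it. This is a sketch only: pick_jobs is a hypothetical helper name, and the cap of 4 is an arbitrary assumption that should be agreed with whoever administers the machine.

```shell
# pick_jobs is a hypothetical helper: it prints the smaller of the online
# core count and the cap passed as its first argument.
pick_jobs() {
  cap=$1
  # Try nproc first, then getconf, and fall back to 1 if neither is available.
  cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  if [ "$cores" -lt "$cap" ]; then
    echo "$cores"
  else
    echo "$cap"
  fi
}

# Print the job count that would be used with a cap of 4 (assumed value).
pick_jobs 4

# Intended use (not run here):
#   find "$PWD" -name "*.sra" | parallel --max-procs "$(pick_jobs 4)" fastq-dump --split-files {1}
```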
While all this is great information, OP (Bioinfonext) should definitely talk with the local cluster admins before doing this. It could put a lot of load on the head node (if run there) and/or clog the network so that no one else can get anything done.
You are totally right. I made that mistake myself when I was learning. One way to use this without upsetting coworkers is to limit the bandwidth to 50 Mbps or less ("-l50m") and always set the --max-procs parameter in parallel to a low value.
Talking with the admin is a good idea.
IMO there is generally no point in trying to parallelize downloads. It will not magically increase your download bandwidth, nor will it increase the speed at which any decently configured server serves you files. Instead, it might lead to the server flagging you and banning your IP address.
There is a parallel-fastq-dump utility available which might be useful. I have yet to test its performance and will update the answer soon.
seq 5260 5274 | parallel -j 8 wget -P ~/GSE62129 ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160/SRR160{}/SRR160{}.sra
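Before launching the downloads, it can help to preview which URLs the command above would request. A minimal sketch, where list_urls is a hypothetical helper name and the URL pattern is copied from the command above:

```shell
# Print the URLs that the seq | parallel wget command would request.
# list_urls is a hypothetical helper name used only for this preview.
list_urls() {
  base='ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160'
  for n in $(seq 5260 5274); do
    printf '%s/SRR160%s/SRR160%s.sra\n' "$base" "$n" "$n"
  done
}

list_urls
```

GNU parallel's own --dry-run option gives a similar preview by printing the commands instead of running them.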
Refer to: https://www.slashroot.in/how-run-multiple-commands-parallel-linux
If possible, use EBI-ENA to get the fastq files directly.
Keep in mind that once you get this to work, you may saturate the incoming bandwidth of the network connection. On a shared machine or cluster, that can cause problems for others.
@Pierre's parallel tutorial.