Question

Is There A Faster Replacement For Fastq-Dump (From The Sra-Toolkit)?

19

11.3 years ago

Devon Ryan 105k

I occasionally need to reprocess previously published datasets, often stored in the Sequence Read Archive (SRA). For the most part, I need the raw fastq files, so I use fastq-dump. While that gets the job done, it's annoyingly slow for what it's actually doing. Has anyone come across a program that can extract reads in fastq format from SRA files more quickly? While I could presumably write a faster program, I'd like to avoid reinventing the wheel if possible.

I should note that I'm aware that many datasets are available in fastq format via ENA, but unfortunately not all of them are.

fastq sra • 20k views

ADD COMMENT • link updated 3.8 years ago by al-ash ▴ 210 • written 11.3 years ago by Devon Ryan 105k

4

Incidentally, the way fastq-dump works has many other limitations: the way it handles the internet connection and its "security handshakes" (what those are, I don't know) gets in the way.

It is the only bioinformatics program so far that does not work on Bash on Windows! Think about that for a second. For my book I had to find a simple replacement for it and came up with wonderdump, a replacement for the network access part of fastq-dump. It uses plain curl for the download, which does indeed work much faster than the regular fastq-dump.

http://data.biostarhandbook.com/scripts/wonderdump.sh
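
The idea described above is to let curl handle the download and leave only the format conversion to fastq-dump. A minimal sketch of that split follows. It is not the actual wonderdump script: the function name fetch_and_dump and the DRY_RUN switch are illustrative assumptions, and you have to supply a download URL that is valid for your run.

```shell
#!/usr/bin/env bash
# Sketch only: download an .sra file with curl, then convert the local copy.
# fetch_and_dump and DRY_RUN are hypothetical names, not part of any toolkit.
fetch_and_dump() {
    local url="$1"
    local outdir="${2:-.}"
    local file="${outdir}/$(basename "$url")"
    local run=""
    # With DRY_RUN=1 the commands are printed instead of executed.
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    # -L follows redirects, -C - resumes a partial download,
    # --fail turns HTTP errors into a non-zero exit status.
    $run curl --fail -L -C - -o "$file" "$url" || return 1
    # fastq-dump also accepts a path to a local .sra file,
    # so the download itself is left entirely to curl.
    $run fastq-dump --split-files --outdir "$outdir" "$file"
}

# Example (not run here; replace the placeholder with a real URL):
# DRY_RUN=1 fetch_and_dump "<url-of-your-run>.sra" /tmp
```

Running it with DRY_RUN=1 first shows the two commands without touching the network, which is a cheap way to check the paths before starting a large download.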

ADD REPLY • link 8.5 years ago by Istvan Albert 102k

2

Last I checked, Japan's SRA mirror has not yet moved over to the binary SRA format, so you can still grab fastqs from there. Might be a useful workaround.

ADD REPLY • link 11.3 years ago by Ying W ★ 4.3k

0

Oh the amount of time noticing that would have saved me! :P Good on the Japanese for so far avoiding the SRA format annoyance!

ADD REPLY • link 11.3 years ago by Devon Ryan 105k

1

Honestly, I really doubt it, but I would agree that this binary SRA format is a major PITA.

ADD REPLY • link 11.3 years ago by Istvan Albert 102k

0

I also doubt anyone has written a second method if this one works and is supported. Dumping from SRA isn't a task you have to do repeatedly and therefore is not a target for optimization. Try to use your fastest hard drives, assuming IO is the bottleneck. Many Linux systems have a RAM drive at /dev/shm that you should look into for vastly faster IO.
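
A minimal sketch of the RAM-drive suggestion, assuming a Linux system where /dev/shm may or may not be writable. The helper name pick_scratch and the file names in the commented example are illustrative assumptions. Keep in mind that anything written to /dev/shm occupies memory, so the output has to fit in RAM.

```shell
#!/usr/bin/env bash
# Sketch only: choose a RAM-backed scratch directory when one is available.
# pick_scratch is a hypothetical helper name.
pick_scratch() {
    local preferred="${1:-/dev/shm}"
    if [ -d "$preferred" ] && [ -w "$preferred" ]; then
        echo "$preferred"
    else
        # Fall back to the usual temporary directory.
        echo "${TMPDIR:-/tmp}"
    fi
}

# Example (not run here): write the FASTQ output to the scratch area, then move it.
# scratch="$(pick_scratch)"
# fastq-dump --split-files --outdir "$scratch" SRR1509032.sra
# mv "$scratch"/SRR1509032_*.fastq /path/to/final/location/
```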

ADD REPLY • link 11.3 years ago by karl.stamm 4.1k

2

I/O is not the bottleneck. In fact, I can invoke an instance for every core on my workstation and still not max out I/O (or even come very close, for that matter).
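
Running one instance per core, as described above, could look roughly like the following sketch. The names dump_all and DUMP_CMD are illustrative assumptions, as are the placeholder file names; the .sra files are assumed to be local already.

```shell
#!/usr/bin/env bash
# Sketch only: convert several local .sra files at once, one process per core.
# dump_all and DUMP_CMD are hypothetical names; DUMP_CMD can be overridden for testing.
dump_all() {
    local jobs
    # nproc is part of GNU coreutils; fall back to a single job if it is missing.
    jobs="$(nproc 2>/dev/null || echo 1)"
    # NUL-separated names keep file names with spaces intact;
    # -n 1 passes one file per invocation, -P runs up to $jobs invocations at once.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" ${DUMP_CMD:-fastq-dump --split-files}
}

# Example (not run here; placeholder file names):
# dump_all run1.sra run2.sra run3.sra
```

This only helps when there are several runs to convert, since each fastq-dump process still handles its own file on a single core.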

ADD REPLY • link 11.3 years ago by Devon Ryan 105k

0

fastq-dump will never be a good choice. The download speed is not so fast, and it always comes with some confusing problems, such as: 2016-11-12T08:33:35 fastq-dump.2.7 err: item not found while constructing within virtual database module - the path 'SRR1286321' cannot be opened as database or table. I prefer to use wget or curl.

ADD REPLY • link 8.5 years ago by Shicheng Guo ★ 9.6k

score 2 · Answer 1 · 2018-02-23

2

7.2 years ago

ATpoint 88k

I found parallel-fastq-dump quite useful. It is a wrapper by Renan Valieris that uses the -N and -X options of fastq-dump to convert multiple chunks of the SRA file in parallel, merging the chunks into the final fastq after successful conversion. It requires python3 and worked well in my hands. It is easy to install with conda: conda install parallel-fastq-dump
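
To illustrate the chunking idea, here is a minimal sketch of how a run could be cut into contiguous spot ranges for -N (first spot) and -X (last spot). It is not code from parallel-fastq-dump; the helper name spot_ranges is an illustrative assumption, and the total spot count is assumed to be known from the run's metadata.

```shell
#!/usr/bin/env bash
# Sketch only: split spots 1..total into at most n contiguous ranges.
# spot_ranges is a hypothetical helper, not code taken from parallel-fastq-dump.
spot_ranges() {
    local total="$1" n="$2"
    local size=$(( (total + n - 1) / n ))   # ceiling division
    local start=1 end
    while [ "$start" -le "$total" ]; do
        end=$(( start + size - 1 ))
        # Clamp the last range to the final spot.
        if [ "$end" -gt "$total" ]; then end="$total"; fi
        echo "$start $end"
        start=$(( end + 1 ))
    done
}

# Example (not run here): print one fastq-dump command per range.
# spot_ranges 1000000 4 | while read -r first last; do
#     echo fastq-dump -N "$first" -X "$last" SRR1509032.sra
# done
```

Each range can then be converted by its own process, and the per-range outputs concatenated in range order to give the final file.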

Or simply get data directly in fastq format: Fast download of FASTQ files from the European Nucleotide Archive (ENA)

ADD COMMENT • link 5.9 years ago by ATpoint 88k

0

Indeed. The links for direct download of fastq files from the ENA archive, generated via sra-explorer, give me a download speed of ~15 MB/s, which is orders of magnitude faster than fastq-dump.
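
For those who prefer to script the lookup of ENA fastq links rather than use a web interface, a rough sketch follows. The endpoint, query parameters and report layout are assumptions based on the ENA Portal API filereport service and should be checked against the current ENA documentation; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. The endpoint and field names below are assumptions based on the
# ENA Portal API "filereport" service; verify them before relying on this.
ena_report_url() {
    local acc="$1"
    echo "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${acc}&result=read_run&fields=fastq_ftp&format=tsv"
}

# Read a filereport TSV on stdin and print one download URL per line.
# Assumes a header row, and that the last column holds ';'-separated
# paths without a scheme prefix.
fastq_urls_from_report() {
    awk -F '\t' 'NR > 1 && $NF != "" {
        n = split($NF, parts, ";")
        for (i = 1; i <= n; i++) print "ftp://" parts[i]
    }'
}

# Example (not run here, needs network access):
# curl -s "$(ena_report_url SRR1509032)" | fastq_urls_from_report | xargs -n 1 curl -L -O
```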

ADD REPLY • link 3.8 years ago by al-ash ▴ 210

score 0 · Answer 2 · 2016-08-04

sam-dump seems a lot faster for me:

sam-dump ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038893/SRR1509032/SRR1509032.sra | head | grep -v '^@' | awk '{print "@"$1"\n"$10"\n+\n"$11}'

Or if your data is paired:

sam-dump <your_data> | grep -v '^@' | awk 'NR%2==1 {print "@"$1"\n"$10"\n+\n"$11}' > samplename_1.fastq 
sam-dump <your_data> | grep -v '^@' | awk 'NR%2==0 {print "@"$1"\n"$10"\n+\n"$11}' > samplename_2.fastq

( awk lines stolen from here: http://www.cureffi.org/2013/07/04/how-to-convert-sam-to-fastq-with-unix-command-line-tools/ )

score 0 · Answer 3 · 2018-02-23

0

7.2 years ago

sutturka ▴ 190

Please check my answer in this thread. It might be useful.

ADD COMMENT • link 7.2 years ago by sutturka ▴ 190

score 0 · Answer 4 · 2019-06-05

0

5.9 years ago

sschmeier ▴ 120

Old thread, but have a look at fasterq-dump: https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump
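
A minimal usage sketch, assuming a reasonably recent sra-tools installation. The helper name fasterq_cmd, the default thread count and the output directory are illustrative assumptions; the wiki page linked above is the reference for the actual options.

```shell
#!/usr/bin/env bash
# Sketch only: build a fasterq-dump call for one run accession.
# fasterq_cmd is a hypothetical helper that prints the command so it can be inspected first.
fasterq_cmd() {
    local acc="$1" threads="${2:-4}" outdir="${3:-.}"
    # --split-files writes the reads of each spot to separate files;
    # --threads and --outdir set the worker count and the destination directory.
    echo "fasterq-dump --split-files --threads ${threads} --outdir ${outdir} ${acc}"
}

# Example (not run here): execute the printed command, then compress the result,
# since fasterq-dump writes uncompressed FASTQ.
# eval "$(fasterq_cmd SRR1509032 8 out)"
# gzip out/SRR1509032_*.fastq
```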

ADD COMMENT • link 5.9 years ago by sschmeier ▴ 120