I occasionally need to reprocess previously published datasets, often stored in the Sequence Read Archive (SRA). For the most part, I need the raw fastq files, so I use fastq-dump. While that gets the job done, it's annoyingly slow for what it's actually doing. Has anyone come across a program that can more quickly extract reads in fastq format from SRA files? While I could presumably write a faster program, I'd like to avoid reinventing the wheel if possible.
I should note that I'm aware that many datasets are available in fastq format via ENA, but unfortunately not all of them are.
Incidentally, the way fastq-dump works has many other limitations: the way it handles the internet connection and its "security handshakes" (what those are I don't know) get in the way.
It is the only bioinformatics program so far that does not work on Bash on Windows! Think about that for a second. For my book I had to find a simple replacement for it and came up with wonderdump, a replacement for the network access of fastq-dump. It uses plain, faster curl for the download, and it does indeed work much faster than the regular fastq-dump: http://data.biostarhandbook.com/scripts/wonderdump.sh
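The idea described above is to separate the download from the conversion: fetch the .sra file with a plain HTTP/FTP client, then point fastq-dump at the local copy. Below is a minimal Python sketch of that idea, not the wonderdump script itself. The function names are hypothetical, the URL of the .sra file is something you have to supply yourself, and it assumes fastq-dump is on your PATH.

```python
import subprocess
import urllib.request
from pathlib import Path


def dump_command(sra_path, outdir="."):
    """Build the fastq-dump call for a local .sra file (no network involved)."""
    # --split-files writes paired reads to separate files;
    # --outdir selects where the fastq files are written.
    return ["fastq-dump", "--split-files", "--outdir", str(outdir), str(sra_path)]


def fetch_then_dump(url, workdir="."):
    """Download the .sra file at `url` (caller-supplied), then convert it locally."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    sra_path = workdir / url.rsplit("/", 1)[-1]
    if not sra_path.exists():
        # Download to a temporary name so an interrupted transfer
        # is never mistaken for a complete file on the next run.
        part = sra_path.with_name(sra_path.name + ".part")
        urllib.request.urlretrieve(url, str(part))
        part.rename(sra_path)
    subprocess.run(dump_command(sra_path, workdir), check=True)
    return sra_path
```

Keeping the command construction in its own function makes it easy to check what would be run before touching the network.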
Last I checked, Japan's SRA mirror has not yet moved over to the binary SRA format, so you can still grab fastqs from there. Might be a useful workaround.
Oh the amount of time noticing that would have saved me! :P Good on the Japanese for so far avoiding the SRA format annoyance!
Honestly, I really doubt it, but I would agree that this binary SRA format is a major PITA.
I also doubt anyone has written a second method if this one works and is supported. Dumping from SRA isn't a task you have to do repeatedly and therefore is not a target for optimization. Try to use your fastest hard drives, assuming I/O is the bottleneck. Many Linux systems have a RAM drive at /dev/shm that you should look into for vastly faster I/O.
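If you want to try the /dev/shm suggestion, something like the following might do for picking an output directory. This is only a sketch: pick_scratch_dir is a made-up helper name, and it falls back to the system temp directory when /dev/shm is missing or not writable.

```python
import os
import tempfile


def pick_scratch_dir(preferred="/dev/shm"):
    """Return a fresh temporary directory, on the RAM-backed drive if usable."""
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        base = preferred
    else:
        base = None  # let tempfile use the system default (e.g. /tmp)
    return tempfile.mkdtemp(prefix="sra_", dir=base)
```

The returned path can then be passed to fastq-dump via --outdir. Keep in mind that anything written under /dev/shm occupies RAM, so move the fastq files elsewhere and delete the directory when you are done.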
I/O is not a bottleneck. In fact, I can invoke an instance for every core on my workstation and still not max out I/O (or come very close, for that matter).
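For anyone wanting to run one instance per core as described, here is a rough Python sketch. The helper names are hypothetical, the accession list is whatever you supply, and it assumes fastq-dump is on your PATH.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_fastq_dump(accession):
    """Run one fastq-dump process and return its exit code."""
    return subprocess.run(["fastq-dump", "--split-files", accession]).returncode


def dump_all(accessions, runner=run_fastq_dump, workers=None):
    """Run `runner` on every accession, at most `workers` at a time."""
    accessions = list(accessions)
    workers = workers or os.cpu_count() or 1
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(accessions, pool.map(runner, accessions)))
```

Calling dump_all with a list of run accessions returns a mapping from accession to exit code, so failed runs (non-zero codes) can be retried afterwards.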
fastq-dump will never be a good choice. Its download speed is not very fast, and it always comes with some confusing problems, such as: 2016-11-12T08:33:35 fastq-dump.2.7 err: item not found while constructing within virtual database module - the path 'SRR1286321' cannot be opened as database or table. I prefer to use wget or curl.