Question

Forum:Fastest way from SRA to fastq PE reads

3

Entering edit mode

8.8 years ago

Avi Srivastava ▴ 130

fastq-dump is slowwwww.

I found this as the fastest way.

vdb-dump SRR453569.sra | awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }' - > SRR453569_1.fq
vdb-dump SRR453569.sra | awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2 + 1) }' - > SRR453569_2.fq

Can also be combined using making a copy in memory and writing.

If it may help someone save time.

I see order of magnitude difference.

My run looks like this:

/usr/bin/time fastq-dump --split-3 SRR453569.sra
Read 4032514 spots for SRR453569.sra
Written 4032514 spots for SRR453569.sra
370.25user 3.18system 6:17.10elapsed 99%CPU (0avgtext+0avgdata 46324maxresident)k
0inputs+5064544outputs (0major+246626minor)pagefaults 0swaps

and for vdb it's like:

/usr/bin/time vdb-dump  -I -f fastq  SRR453569.sra | awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }' - > SRR453569_1.fq
23.49user 3.23system 0:58.20elapsed 45%CPU (0avgtext+0avgdata 330488maxresident)k
1026496inputs+64outputs (6major+635833minor)pagefaults 0swaps

UPDATE: Sorry I forgot to add flags for vdb-dump above, new command is:

vdb-dump -I -f fastq SRR453569.sra | awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }' - > SRR453569_1.fq
vdb-dump -I -f fastq SRR453569.sra | awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2 + 1)}' - > SRR453569_2.fq

Again they can be combined

UPDATE2: FASTEST: combined

$ /usr/bin/time vdb-dump  -I -f fastq SRR493371.sra | tee >(awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }' - > SRR493371_tee_1.fq) | awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2 + 1)}' - > SRR493371_tee_2.fq

UPDATE3: use latest version of fastq-dump i.e. 2.5.XX, previous version have issues with running time.

fastq fastq-dump SRA vdb-dump RNA-Seq • 4.3k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Avi Srivastava ▴ 130

2

Entering edit mode

... or just download the FASTQ files from ENA and skip all of the stupidity related to the SRA format ...

ADD REPLY • link 8.8 years ago by george.ry ★ 1.2k

2

Entering edit mode

It is not easy to understand why you can save the time, or else, when you save the time, you have missed some important filter or checking? I think we need email the author of vdb-dump, can be clear about this question.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Shicheng Guo ★ 9.5k

0

Entering edit mode

That's true I also think I am missing something since difference is substantial. Mailing authors sounds like a good idea.

ADD REPLY • link 8.8 years ago by Avi Srivastava ▴ 130

Ram · Answer 1 · 2016-02-16

6

Entering edit mode

8.8 years ago

piet ★ 1.9k

Before you state anything about speed of a new method, you should seriously check, that the new method works as expected. In this case you will just dump a big pile of mess into a file, but nothing similar to FASTQ. The awk script you have used is intended to fix some deficiencies in malformated FASTQ files, but it is in no way suited to filter the output of vdb-dump.

ADD COMMENT • link 8.8 years ago by piet ★ 1.9k

0

Entering edit mode

I missed some flags for vdb-dump, try now.

ADD REPLY • link 8.8 years ago by Avi Srivastava ▴ 130

1

Entering edit mode

I missed some flags for vdb-dump, try now.

Indeed, vdb-dump has an option for dumping in FASTQ format, I did not knew about that.

Nevertheless, your proposed procedure is still producing incorrect results. The sequences from the second call are one nucleotide longer than they should be. That is a typical programming error, and you should have already found if you would have compared your output with fastq-dump. The second command should read:

vdb-dump -I -f fastq SRR453569.sra | awk 'NR%2==1 { print $0 "/2" } ; NR%2==0 { print substr($0,length($0)/2 + 1) }' - > SRR453569_2.fq

Moreover, in my hands a single run of either fastq-dump (which dumps forward and backward reads in one go) or a single run of vdb-dump take about the same time. This it not surprising since both executables have been build from the same code base. Thus, in my hands your whole procedure takes about double the time than fastq-dump.

Benchmarking is peculiar here, since you are reading and writing large amounts of data from and to disk. Most likely, performance will be I/O bond, which means that you are benchmarking the speed of your hard disk rather than the CPU time spent by the programs. And the first invocation of either fastq-dump or vdb-dump may cause a hidden download of the reference sequence, which will slow down its execution significantly.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by piet ★ 1.9k

0

Entering edit mode

Thanks for the correction in "/2" but it's trivial, in the meantime we are missing the real point. Why would we get two different timing results when using code from same code base i.e. we are having two different timing results when generating fastq using fastq-dump and other using vdb-dump. Also I didn't quite understand your logic of not able to benchmark time here, if code is from same codebase then they should give approx same CPU% usage even if they are IO bound, but that's not the case. I am just raising a concern as you can see big difference in running time, there is a possibility I am missing something. btw what's your run look like?? Also while dumping through vdb it's not true that it'll take double the time of fastq, we can image the .sra one time in memory and dump it two time. But I haven't tried that Ill post it as soon as I try.

My run looks like this:

/usr/bin/time fastq-dump --split-3 SRR453569.sra
Read 4032514 spots for SRR453569.sra
Written 4032514 spots for SRR453569.sra
370.25user 3.18system **6:17.10elapsed** 99%CPU (0avgtext+0avgdata 46324maxresident)k
0inputs+5064544outputs (0major+246626minor)pagefaults 0swaps

and for vdb it's like:

/usr/bin/time vdb-dump  -I -f fastq  SRR453569.sra | awk 'NR%2==1 { print $0 "/1" } ; NR%2==0 { print substr($0,0,length($0)/2) }' - > SRR453569_1.fq
23.49user 3.23system **0:58.20elapsed** 45%CPU (0avgtext+0avgdata 330488maxresident)k
1026496inputs+64outputs (6major+635833minor)pagefaults 0swaps

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Avi Srivastava ▴ 130

1

Entering edit mode

Here are my timings for SRR493371. I have repeated both commands several times. Here are two representative runs:

/usr/bin/time sh -c '/usr/local/lib/sratoolkit/vdb-dump.2.5.7 -I -f fastq ./SRR493371.sra | awk '\''NR%2==1 {print $0} NR%2==0 {print substr($0,0,length($0)/2)}'\'' > vdb-dump.fastq'

  236.23user 20.02system 5:00.30elapsed 85%CPU (0avgtext+0avgdata 329536maxresident)k
  5722536inputs+16422672outputs (8major+1136248minor)pagefaults 0swaps

/usr/bin/time /usr/local/lib/sratoolkit/fastq-dump.2.5.7 --split-3 ./SRR493371.sra

  183.33user 10.79system 4:46.70elapsed 67%CPU (0avgtext+0avgdata 38788maxresident)k
  5722072inputs+32109600outputs (2major+380966minor)pagefaults 0swaps

ls -l 
  -rw-r--r-- 1 piet users 3284339473 Feb 19 12:20 SRR493371.sra
  -rw-r--r-- 1 piet users 8408407764 Feb 19 12:45 vdb-dump.fastq
  -rw-r--r-- 1 piet users 8220054540 Feb 19 12:50 SRR493371_1.fastq
  -rw-r--r-- 1 piet users 8220054540 Feb 19 12:50 SRR493371_2.fastq

Result: fastq-dump or a single run of vdb-dump/awk are about equally fast.

Remarks:

The combined vdb-dump/awk command cannot be executed directly with time. It should be either run as a shell script or executed in a subshell. Quoting the awk mess is really tricky.
The first run is always much slower than any further invocations. The kernel caches disk data for succeeding jobs. I have used a system with 16 GB RAM with no other jobs running. The execution time of repeated runs varies widely on a hard disk drive. A RAM disk (tmpfs) yields much more consistent timings.
Both approaches emit rather bloated description lines. fastq-dump will execute about 15 % faster, if option '-F' used. This results in much shorter descriptions and a 15 % smaller FASTQ file (with SRR493371).
If input and output files are placed on different physical drives (to reduce concurrency between reading and writing), both approaches will run significantly faster.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by piet ★ 1.9k

0

Entering edit mode

OK I think I got the problem, They have released newer version of SRA toolkit in late-December 2015 in which I think they've solved the problem of time. I was using the previous version, which was giving timing issues.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 8.8 years ago by Avi Srivastava ▴ 130

Ram · Answer 2 · 2016-02-19

Some more analysis on human experiment SRA: SRR493371. One difference which I see in output of vdb-dump and fastq-dump was the naming convention of reads i.e.

@SRR493371.sra.1 ILLUMINA:201:C00K3ACXX:8:1101:1308:2049 length=202
@SRR493371.1 ILLUMINA:201:C00K3ACXX:8:1101:1308:2049 length=101

is the word .sra in the middle so if I get rid of that then file generated by them are exactly same.

courtesy of Rob we are able to combine awk script into one.

$ /usr/bin/time vdb-dump  -I -f fastq SRR493371.sra | awk -F " " '{if(NR % 4 == 1 || NR % 4 == 3) { split($0, a, "length="); l=a[2]/2; str=a[1]"length="l; sub(/\.sra\./, ".", str); print str > "SRR493371_vdb_1.fq"; print str > "SRR493371_vdb_2.fq"} else { s=length($0); print substr($0,0,s/2) > "SRR493371_vdb_1.fq"; print substr($0, s/2 + 1) > "SRR493371_vdb_2.fq" }}' -                                               
145.96user 12.12system 9:18.39elapsed 28%CPU (0avgtext+0avgdata 335168maxresident)k
0inputs+0outputs (0major+5596758minor)pagefaults 0swaps

$ /usr/bin/time fastq-dump --split-3 SRR493371.sra
Read 23544153 spots for SRR493371.sra
Written 23544153 spots for SRR493371.sra
2153.91user 22.63system 38:04.00elapsed 95%CPU (0avgtext+0avgdata 40140maxresident)k
6414736inputs+32109600outputs (5major+411058minor)pagefaults 0swaps

$ du  ./*  
8027404 ./SRR493371_1.fastq
8027404 ./SRR493371_2.fastq
3207368 ./SRR493371.sra
8027404 ./SRR493371_vdb_1.fq
8027404 ./SRR493371_vdb_2.fq