I am using abyss-pe (latest version, 1.3.1) to do a de novo assembly of Illumina PE100 data. To save disk space and to limit the I/O bottleneck (I hoped), I compressed my input FASTQ files with bzip2 (ABySS can handle compressed files by default).
When running the abyss-pe script, it starts by reading and decompressing the compressed file. However, it uses the non-parallel version of bzip2 (htop shows me `bunzip2 -c filename ...`), which is quite slow. So my guess is that the file is stored uncompressed on disk again (~25 GB, versus 7 GB compressed), since it will not fit in 24 GB of memory.
- I would like to replace the default bunzip2 with pbzip2 (parallel bzip2), but cannot find it in the run scripts. Does anyone know where to set those defaults, or should it be done by hacking the scripts?
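One workaround that avoids hacking the scripts themselves might be to shadow bunzip2 on PATH with a wrapper that forwards to pbzip2. This is a sketch I have not tested against abyss-pe itself; the directory name and the k/name/in values in the comment are placeholders:

```shell
# Create a directory holding a bunzip2 wrapper that forwards to pbzip2.
mkdir -p "$HOME/pbzip2-wrapper"
cat > "$HOME/pbzip2-wrapper/bunzip2" <<'EOF'
#!/bin/sh
# Forward all arguments to pbzip2 in decompress mode (uses all cores).
exec pbzip2 -d "$@"
EOF
chmod +x "$HOME/pbzip2-wrapper/bunzip2"

# Then prepend the wrapper directory to PATH when launching abyss-pe, e.g.:
# PATH="$HOME/pbzip2-wrapper:$PATH" abyss-pe k=64 name=asm in='reads.fq.bz2'
```

Since abyss-pe apparently invokes plain `bunzip2 -c`, whatever it finds first on PATH should get used; the `-c` flag is accepted by pbzip2 as well.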
- Since the file is decompressed on disk, my guess is there is NO gain in I/O for this particular tool? I found nothing about this in the ABySS documentation.
Actually, no: parallel bzip2 is faster than parallel gzip on decompression. pigz can only use 2 threads on decompression, while pbzip2 can use all available cores/CPUs. On compression pigz is indeed faster, but we usually decompress a lot more than we compress. If you want to see for yourself, a script to do that is here: https://gist.github.com/946108
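For a quick sanity check without the full gist, something along these lines works. It is only a sketch: it skips any tool that is not installed, uses ~10 MB of random data as a stand-in for real reads (too small for meaningful numbers, so scale it up), and only has one-second timing resolution:

```shell
# Generate ~10 MB of placeholder data, then compress it with each available
# tool and time the decompression pass.
head -c 10000000 /dev/urandom > bench_input
: > bench_results.txt
for tool in gzip pigz bzip2 pbzip2; do
    if command -v "$tool" > /dev/null 2>&1; then
        "$tool" -c bench_input > "bench_input.$tool"      # compress
        start=$(date +%s)
        "$tool" -d -c "bench_input.$tool" > /dev/null     # decompress (timed)
        echo "$tool: $(( $(date +%s) - start ))s" >> bench_results.txt
    fi
done
cat bench_results.txt
rm -f bench_input bench_input.*
```

With real FASTQ-sized inputs the pigz-vs-pbzip2 gap on decompression should show up clearly on a multi-core machine.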
You know, these optimizations can be very tricky, with unexpected consequences. I would not take any documentation at face value. Measure it: then you will know whether you are actually saving time/resources overall.
Update: it seems that the mapping/assembly already starts during decompression, so maybe their strategy is different and they do not dump the whole uncompressed file temporarily on disk... Thereby I question what the actual gain from pbzip2 would be, beyond the obvious speedup in the initial reading of the compressed file.
@Istvan I agree that we should measure it; that is maybe not the hardest part. Adapting ABySS to use pbzip2 without a hack would be welcome.
I do not know the case for ABySS, but in general, for large data sets, one should use gzip or even `gzip -1` instead of bzip2. bzip2 is way too slow for tens of GB of data; gzip decompression is ~10x faster, if not more.
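To illustrate the level tradeoff concretely, here is a minimal sketch. The file name is a placeholder, and base64-encoded random bytes are only a rough stand-in for real read data:

```shell
# ~20 MB of text-like placeholder data (a stand-in for real FASTQ reads).
head -c 15000000 /dev/urandom | base64 > sample.txt

gzip -1 -c sample.txt > sample.fast.gz   # fastest, lightest compression
gzip -9 -c sample.txt > sample.best.gz   # slowest, densest compression
ls -l sample.txt sample.fast.gz sample.best.gz
```

On real data the `-1` pass is typically several times faster than `-9` while losing only a modest amount of ratio, which is the point being made above: for throwaway intermediate files, speed matters more than the last few percent of compression.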
Thanks Jan. I was about to quote your test results :)
@Jan van Haarst: okay, I was mainly talking about single-threaded applications. The blockwise structure of bzip2 indeed makes it easier to parallelize. Perhaps I should parallelize bgzip some day; I am sure it would easily beat pbzip2 on speed. Good to know, thanks.