I am using abyss-pe (latest version, 1.3.1) to do a de novo assembly of Illumina PE100 data. To save disk space and to limit the I/O bottleneck (I hoped), I compressed my input FASTQ files with bzip2 (ABySS can handle compressed files by default).
When running the abyss-pe script, it starts by reading and decompressing the compressed file. However, it uses the non-parallel version of bzip2 (htop shows me `bunzip2 -c filename ...`), which is quite slow. So my guess is that the file is stored uncompressed on disk again (~25 GB, versus 7 GB compressed), since it will not fit in 24 GB of memory.
- I would like to replace the default bunzip2 with pbzip2 (parallel bzip2), but cannot find it in the run scripts. Does anyone know where to set those defaults, or should it be done by hacking the scripts?
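One workaround that avoids hacking the scripts themselves might be to shadow bunzip2 on PATH with a wrapper that forwards to pbzip2. This is a sketch I have not tested against abyss-pe itself; the directory name and the k/name/in values in the comment are placeholders:

```shell
# Create a directory holding a bunzip2 wrapper that forwards to pbzip2.
mkdir -p "$HOME/pbzip2-wrapper"
cat > "$HOME/pbzip2-wrapper/bunzip2" <<'EOF'
#!/bin/sh
# Forward all arguments to pbzip2 in decompress mode (uses all cores).
exec pbzip2 -d "$@"
EOF
chmod +x "$HOME/pbzip2-wrapper/bunzip2"

# Then prepend the wrapper directory to PATH when launching abyss-pe, e.g.:
# PATH="$HOME/pbzip2-wrapper:$PATH" abyss-pe k=64 name=asm in='reads.fq.bz2'
```

Since abyss-pe apparently invokes plain `bunzip2 -c`, whatever it finds first on PATH should get used; the `-c` flag is accepted by pbzip2 as well.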
- Since the file is decompressed on disk, my guess is there is NO gain in I/O for this particular tool? I found nothing about this in the ABySS documentation.
Actually, no: parallel bzip2 is faster than parallel gzip on decompression. pigz can only use 2 threads on decompression, while pbzip2 can use all available cores/CPUs. On compression pigz is indeed faster, but we usually decompress a lot more than we compress. If you want to see for yourself, a script to do that is here: https://gist.github.com/946108
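For a quick sanity check without the full gist, something along these lines works. It is only a sketch: it skips any tool that is not installed, uses ~10 MB of random data as a stand-in for real reads (too small for meaningful numbers, so scale it up), and only has one-second timing resolution:

```shell
# Generate ~10 MB of placeholder data, then compress it with each available
# tool and time the decompression pass.
head -c 10000000 /dev/urandom > bench_input
: > bench_results.txt
for tool in gzip pigz bzip2 pbzip2; do
    if command -v "$tool" > /dev/null 2>&1; then
        "$tool" -c bench_input > "bench_input.$tool"      # compress
        start=$(date +%s)
        "$tool" -d -c "bench_input.$tool" > /dev/null     # decompress (timed)
        echo "$tool: $(( $(date +%s) - start ))s" >> bench_results.txt
    fi
done
cat bench_results.txt
rm -f bench_input bench_input.*
```

With real FASTQ-sized inputs the pigz-vs-pbzip2 gap on decompression should show up clearly on a multi-core machine.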
You know, these optimizations can be very tricky, with unexpected consequences. I would not take any documentation at face value. Measure it: then you will know whether you are actually saving time/resources overall.
Update: it seems that the mapping/assembly already starts during decompression, so maybe their strategy is different and they do not dump the whole uncompressed file temporarily on disk... Thereby I question what the actual gain from pbzip2 would be, beyond the obvious speedup in the initial reading of the compressed file.
@Istvan I agree that we should measure it; that is maybe not the hardest part. Adapting ABySS to use pbzip2 without a hack would be welcome.
I do not know the case for ABySS, but in general, for large data sets, one should use gzip or even `gzip -1` instead of bzip2. bzip2 is way too slow for tens of GB of data; gzip decompression is ~10x faster, if not more.
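To illustrate the level tradeoff concretely, here is a minimal sketch. The file name is a placeholder, and base64-encoded random bytes are only a rough stand-in for real read data:

```shell
# ~20 MB of text-like placeholder data (a stand-in for real FASTQ reads).
head -c 15000000 /dev/urandom | base64 > sample.txt

gzip -1 -c sample.txt > sample.fast.gz   # fastest, lightest compression
gzip -9 -c sample.txt > sample.best.gz   # slowest, densest compression
ls -l sample.txt sample.fast.gz sample.best.gz
```

On real data the `-1` pass is typically several times faster than `-9` while losing only a modest amount of ratio, which is the point being made above: for throwaway intermediate files, speed matters more than the last few percent of compression.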
Thanks Jan. I was about to quote your test results :)
@Jan van Haarst: okay, I was mainly talking about single-threaded applications. The blockwise structure of bzip2 indeed makes it easier to parallelize. Perhaps I should parallelize bgzip some day; I am sure it would easily beat pbzip2 on speed. Good to know, thanks.