I've had some difficulties implementing this in pipelines because it randomly fails sometimes.
Are there any other programs that can be used in its stead?
"because it randomly fails sometimes."

clumpify is not the program that does the compression. BBTools programs use the pigz program for parallel compression when it is available. If you do not want to use it, simply add pigz=f to your commands to fall back to the system gzip program.
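As a minimal sketch (input/output filenames are placeholders), disabling pigz for both compression and decompression might look like:

```shell
# Hypothetical example: pigz=f makes clumpify.sh use the system gzip for output;
# unpigz=f does the same for reading gzipped input.
# in.fq.gz and out.fq.gz are placeholder filenames.
clumpify.sh in=in.fq.gz out=out.fq.gz pigz=f unpigz=f
```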
That said, your server may be using an older version of pigz (which you may want to update), or pigz may not be installed at all, in which case try installing it.
There is a section of options for compression that you can play with. Check the in-line help.
If you are using multiple threads, the storage system needs to be fast enough for the reads and writes to keep up. If you don't have access to fast storage, reducing the number of threads would be another suggestion.
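A sketch of a reduced-thread run, assuming the BBTools t= parameter (which caps the number of worker threads; filenames are placeholders):

```shell
# Hypothetical example: cap clumpify.sh at 4 threads so slower storage
# can keep up with the read/write load.
clumpify.sh in=in.fq.gz out=out.fq.gz t=4
```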
reorder=f Reorder clumps for additional compression.
Reordering is off by default, so you must actually be turning it on in your jobs. Reordering facilitates compression but does not itself perform the compression. Your thread title and text made it sound like your jobs were being affected by the compression step itself.
On large datasets clumpify can take hundreds of GB of RAM (since it needs to keep a large amount of sequence data in memory), so I would check whether the failures you are seeing are caused by the job running out of RAM.
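If memory is the issue, BBTools wrapper scripts accept a -Xmx flag that is passed through to the JVM; a sketch, with the heap size and filenames as placeholders to adjust for your node:

```shell
# Hypothetical example: give the JVM an explicit 100 GB heap rather than
# relying on auto-detection; raise or lower to match your node's memory.
clumpify.sh -Xmx100g in=in.fq.gz out=out.fq.gz
```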
The only alternative that I know of is Picard MarkDuplicates. That will need aligned data, for one, and then has its own set of requirements (read groups, for example).
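A sketch of the Picard route, assuming an aligned, coordinate-sorted BAM with read groups already set (all filenames are placeholders):

```shell
# Hypothetical example: Picard MarkDuplicates on an aligned, sorted BAM.
# Requires read groups in the input; metrics.txt receives the duplication stats.
java -jar picard.jar MarkDuplicates \
    I=aligned.sorted.bam \
    O=marked.bam \
    M=metrics.txt
```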