Our team has been having storage space issues; we predict that we will not have enough disk space to store the files generated by our pipelines. Standard file compressors (gzip, bzip2, 7zip) weren't cutting it, so I started experimenting with file-specific compressors. This is where Google spat out 'Genozip'.
I've managed to reproduce its claimed compression ratios on fastq.gz, vcf.gz and BAM files within a timeframe comparable to standard compression tools. I was not able to compress CRAM though (code in comments). It also has a utility that lets the user read the doubly compressed files into stdout without decompressing them first.
I'm quite impressed with Genozip. It seems to be the best option but I remain a little skeptical as I haven't found any forum posts discussing it.
Has anybody had any experience with Genozip, or can anybody recommend another file compressor?
Documentation: https://genozip.readthedocs.io/
This is not really answering your question, but it may be useful.
Disk storage is relatively cheap compared to time and effort needed to test various compression algorithms and then to actually compress the files. Don't know if that would solve your problem, but there are 8-10 TB hard disks available for under $200. Also, in my experience using almost anything other than gzip (say, 7z or xz in their strongest compression modes) will get you within 2%-5% of those tools that claim the best compression. Is it really worth the effort to squeeze out the last couple of percent?
We aren't ruling out buying more storage but it's being left as a last resort. According to my testing with paired-end FASTQ files, conventional compression methods (gzip, bzip2, 7zip, rar) and the specialized DSRC using their highest compression factor gave me compression ratios of between 5 and 8. Genozip gave me a compression ratio of 21 with the same files.
That's lossless compression to ~4% of the original file size and ~23% of the gzipped file size.
I've tested this multiple times with different files and it seems to be legitimate, which is why I wonder why this tool hasn't been getting any more attention.
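For anyone who wants to check how the percentages relate to the ratios, here is a minimal sketch of the arithmetic. It takes the rounded figures quoted in this thread as inputs (21 for Genozip, 5 for gzip at the low end of the 5 to 8 range); the function names are made up for illustration. With these rounded inputs it gives roughly 4.8% and 23.8%, so the ~4% and ~23% quoted above presumably come from the unrounded measurements.

```python
def size_fraction(ratio: float) -> float:
    """Fraction of the original size that remains at a given compression ratio."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 / ratio


def relative_size(ratio_new: float, ratio_old: float) -> float:
    """Size of the file from the new tool as a fraction of the old tool's output."""
    return size_fraction(ratio_new) / size_fraction(ratio_old)


# Rounded ratios taken from the thread (assumptions, not new measurements).
genozip_ratio = 21.0
gzip_ratio = 5.0

print(f"vs. original: {size_fraction(genozip_ratio):.1%}")
print(f"vs. gzipped:  {relative_size(genozip_ratio, gzip_ratio):.1%}")
```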
I see that a ratio of 21 is being claimed, but how much time does that compression add (assuming the same amount would be needed for decompression)? If you are a smaller lab it may be worth investing that time, but for large projects that may simply not be worth it.
Compression time was comparable to that of standard compression tools.
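If someone wants to put numbers on both ratio and time for the general-purpose compressors, a rough sketch along these lines may help. It only covers gzip, bzip2 and xz through Python's standard library; Genozip and DSRC are external command-line tools and are deliberately not called here. The synthetic FASTQ-like input and the names `fake_fastq` and `benchmark` are hypothetical, and random reads will not compress like real sequencing data, so treat the output as a template for your own files rather than as a result.

```python
import bz2
import gzip
import lzma
import random
import time


def fake_fastq(n_reads: int = 2000, read_len: int = 100, seed: int = 0) -> bytes:
    """Build synthetic FASTQ-like records (hypothetical stand-in for real data)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data: bytes) -> dict:
    """Return {name: (compression_ratio, seconds)} for stdlib compressors."""
    codecs = {
        "gzip": lambda d: gzip.compress(d, compresslevel=9),
        "bzip2": lambda d: bz2.compress(d, compresslevel=9),
        # preset 6 is the xz default; higher presets need much more memory
        "xz": lambda d: lzma.compress(d, preset=6),
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(data) / len(compressed), elapsed)
    return results


data = fake_fastq()
for name, (ratio, seconds) in benchmark(data).items():
    print(f"{name:6s} ratio={ratio:5.2f} time={seconds:.3f}s")
```

To test real files, replace `fake_fastq()` with the bytes of an uncompressed FASTQ file and time the specialized tools separately with whatever wrapper you already use.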