3.3 years ago
Divon
Genozip is a new(ish) compression tool for genomic files. It usually compresses 2-5 times better than standard compression (e.g. .gz), and it works on all common genomic file formats. I am its developer.
It is a lot more than just a compressor, though: it also has some interesting analytical capabilities.
Installation, documentation and source code: http://genozip.com
Publication: A Universal Extensible Genomic Data Compressor
Feedback / feature requests would be more than welcome.
Note: this tool is not open source, but it is free for non-commercial use, and the source code is available.
This seems like a great tool which has been seriously overlooked.
I've been testing it out and was able to reproduce the compression ratios you claimed in your 2021 paper with fastq.gz, vcf.gz, and .bam files. I'm having trouble with CRAM files, however:
The original sample.cram is 10GB, while the output sample.cram.genozip is 15GB. I was given the message: "FYI: header of HTS154_3.cram has contig '1' (and maybe others, too), missing in /scratch/mpace21/GRCh37_latest_genomic.ref.genozip. No harm."
Any suggestions?
Hi Matthew, I sent you a response on the other thread as well, repeating here in case you didn't see it.
First, thank you for your kind words; it is very rewarding to hear.
Can you please send a small sample (e.g. the first 10k lines) of the CRAM to support@genozip.com? I will look into it.
From GitHub: Yes, Genozip can compress already-compressed files (.gz .bz2 .xz .bam .cram).
Generally, compressing already-compressed data does not work well, so this is a rather amazing computational result.
Well, kinda :) What Genozip does is uncompress the existing compression and then re-compress with the better Genozip compression.
I have just posted some benchmarks showing Genozip's performance with a variety of file types: https://www.genozip.com/our-product
Enjoy :)