Reliable Fastq compression programs
9.1 years ago

I would like to use a compression algorithm other than gzip for long-term storage of fastq files. Decompressing and recompressing could free about half the disk space I am using, which would save me approximately 20-30 TB now and more in the future. (Funny, for some people this is probably big and for others ridiculously small :) ) The bottom line is that disk space is money, and as we scale up the costs will probably grow faster than linearly.

I have found a few contenders, but the most solid of them is fqzcomp-4.6 (http://sourceforge.net/projects/fqzcomp/files/). It boasts very good compression rates (compressed files about half the size of gzip's), is fast enough, and installs easily.

Now these are important data, so I need to be sure that in 2 or 5 or 10 years I will be able to get them back.

I saw this other question (fastq compression tools of choice), but I really want to know if one of these tools is a) good enough to improve compression b) dependable. The article cited by Charles Plessy doesn't help much in this regard.

For those of you that work in big groups, institutes, organisations... What fastq compression tool would you recommend from the point of view of data safety? What have you used with success?

EDIT

I did a quick compression ratio comparison for gzip, bz2, and fqzcomp. I used the default parameter values for gzip and bz2 and -Q {0,3,5} -s5+ -e -q 3 for fqzcomp. Here are the compression ratios (compressed size relative to the uncompressed files):

Algo     ratio
gzip     0.272
bz2      0.218
fqzcomp  0.101 - 0.181 (depending on the Q param value)

bz2 / gzip      0.801
fqzcomp / gzip  0.371
fqzcomp / bz2   0.463

From these figures, we can see that bz2 reduces files only by an additional ~20% when compared to gzip.

However, fqzcomp files can be as much as 2.7 times smaller than gzip files and 2.2 times smaller than bz2 files. This is why I am seriously considering this algorithm. Of course, these figures will change with different fastq files, and they assume you do not care much about the quality values (for Ion Proton, which we sequence in high volume, we actually don't care that much), but the potential gain is considerable.
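As a quick sanity check on the arithmetic, the relative figures can be recomputed from the three ratios in the table. This is only a sketch: the numbers are the ones quoted above (using the best fqzcomp case, 0.101), and the function names are made up for illustration.

```python
# Compression ratios quoted in the table (compressed size / uncompressed size).
# 0.101 is the best fqzcomp case; the worst case quoted was 0.181.
ratios = {"gzip": 0.272, "bz2": 0.218, "fqzcomp": 0.101}


def relative_size(tool: str, baseline: str) -> float:
    """Size of `tool` output as a fraction of `baseline` output."""
    return ratios[tool] / ratios[baseline]


def fold_smaller(tool: str, baseline: str) -> float:
    """How many times smaller `tool` output is than `baseline` output."""
    return ratios[baseline] / ratios[tool]


if __name__ == "__main__":
    pairs = [("bz2", "gzip"), ("fqzcomp", "gzip"), ("fqzcomp", "bz2")]
    for tool, baseline in pairs:
        print(
            f"{tool} / {baseline}: {relative_size(tool, baseline):.3f} "
            f"({fold_smaller(tool, baseline):.1f}x smaller)"
        )
```

Swapping 0.101 for 0.181 gives the pessimistic end of the fqzcomp range.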

fastq compression reliability • 18k views

Even assuming the absolute worst case, a computer 5-10 years from now being entirely incompatible with your toolkit of choice (which frankly seems so unlikely as to be a non-issue), you would still be fine so long as you have the source code for the "custom" or specialized compression and extraction tools. 5-10 years from now, you could run today's operating systems (and older compilers) in a virtual machine or container, such as VirtualBox or Docker, to extract the archives. To help rebuild the environment down the road, it might help to keep a manifest with each archive: a README that describes the host, the development tools, and their versions.
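A manifest like that can be generated rather than typed by hand. Here is a minimal sketch using only the Python standard library; the field names and the list of tools to probe (gcc, make, gzip) are assumptions chosen for illustration, so substitute whatever your compressor actually needs to build and run.

```python
import json
import platform
import shutil
import subprocess
import sys
from datetime import datetime, timezone


def tool_version(command: str, flag: str = "--version") -> str:
    """Return the first line a tool prints for its version flag, or a note if unavailable."""
    if shutil.which(command) is None:
        return "not found on PATH"
    try:
        result = subprocess.run(
            [command, flag], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"could not run: {exc}"
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "no version output"


def build_manifest(tools) -> dict:
    """Describe the host and the versions of the given tools (hypothetical layout)."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "tools": {name: tool_version(name) for name in tools},
    }


if __name__ == "__main__":
    # Print the manifest; redirect it to a file stored next to the archive.
    print(json.dumps(build_manifest(["gcc", "make", "gzip"]), indent=2))
```

Storing the output next to each archive, together with a copy of the compressor's source tarball, covers the "rebuild it later" scenario described above.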


Good idea about the manifest and info about the system.


Is that enough to define dependability? What about bugs that are hard to spot? I don't know, I just want to find a tool I can trust with our group's data :) Mainly, I would like to get recommendations from big users too.


I guess it may depend on the details of your use case, but to me this seems like a lot of worry over a very very tiny part of the problem. I think that as long as you pick a reasonable compression tool where you have the source code and enough instructions to build it on a clean system, you'll be fine.

I would be much more worried about other aspects of the long-term storage problem, like adequate off-site backups and catastrophic fail-safe plans at the physical level. For instance, what if your backup provider goes out of business, the AC and emergency power in your datacenter fails and all your hard drives are damaged, or your systems are compromised and a rogue user tries to permanently delete many files? Or maybe something simpler, like system or backup account credentials being lost or forgotten as people move on? At your timescale and data sizes, I might even be worried about (relatively) far-fetched problems like bit rot.

So in sum, I guess I think that if decompressing the data is the only problem you have after X years, you would be lucky. I would personally focus on having enough independent off-site backups and then think about applying redundancy at other filesystem layers, like maybe with ZFS.
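For the bit rot concern specifically, one common safeguard is to record a checksum for every archive when it is written and to re-verify the list periodically on each copy. The sketch below is one way to do that in Python; the manifest layout (digest, two spaces, relative path) mimics sha256sum output, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(directory: Path, manifest: Path) -> None:
    """Record one 'digest  relative/path' line per file under `directory`."""
    lines = []
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        if path.resolve() == manifest.resolve():
            continue  # do not checksum the manifest itself
        lines.append(f"{sha256_of(path)}  {path.relative_to(directory).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n")


def verify_checksums(directory: Path, manifest: Path) -> list:
    """Return the relative paths that are missing or whose content has changed."""
    damaged = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        path = directory / name
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

A checksum only detects damage; repairing it still requires an intact copy elsewhere or redundancy at the filesystem level.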


A non-lossy compression tool should be deterministic, which I'd think is a minimum standard for dependability. In other words, if you run the same input bytes through the same compression or extraction algorithm in the same environment, you should get the same expected bytes as output on repeated trials. If you can take the environment out of the equation with a VM, then you just need to worry about the compression tool, which you could probably set up post-compression extraction tests to verify functionality.
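A post-compression extraction test of the kind described here could look roughly like the following. It is a sketch only: the Python standard-library codecs stand in for whatever FASTQ compressor is being evaluated, and the function names are invented for the example. One detail worth knowing is that gzip stores a timestamp in its header, so the compressed bytes are only repeatable when that timestamp is pinned (here with mtime=0), even though the decompressed content is identical either way.

```python
import bz2
import gzip
import hashlib
import lzma

# Stand-in codecs: any tool exposing compress/decompress could be plugged in the same way.
CODECS = {
    # mtime=0 pins the timestamp gzip writes into its header.
    "gzip": (lambda data: gzip.compress(data, mtime=0), gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def round_trip_ok(data: bytes, codec: str) -> bool:
    """Compress, decompress, and compare digests of the original and restored bytes."""
    compress, decompress = CODECS[codec]
    return digest(decompress(compress(data))) == digest(data)


def is_deterministic(data: bytes, codec: str, trials: int = 3) -> bool:
    """Check that repeated compression of the same input yields identical bytes."""
    compress, _ = CODECS[codec]
    return len({digest(compress(data)) for _ in range(trials)}) == 1


if __name__ == "__main__":
    sample = b"@read1\nACGTNACGT\n+\nIIII!IIII\n" * 1000
    for name in CODECS:
        print(name, round_trip_ok(sample, name), is_deterministic(sample, name))
```

Running the round-trip check on every file before deleting the original is what actually protects the data; determinism alone says nothing about whether the restored bytes are correct.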


Considering the files are on different servers, how would you go about spinning up a VM to (de)compress hundreds of files, weighing a few dozen TB in total, across different systems? It doesn't seem too fun to me. I need something that will work, now and in the foreseeable future, on *NIX machines without giving me or a possible descendant headaches.

Also, I don't see the link between deterministic and dependable. It just means it will give the same output for the same input, not that the code is well written and not bug ridden.

I am looking for input about quality fastq compression tools (a very specific need) that I can depend on. Your answer is basically that all tools are equal as long as they work in a VM. I don't think I can agree.


So what did you settle on?


Yannick, I am very interested too since we made our lossless compression algorithm Lossless ALAPY Fastq Compressor (now for MacOS X with 10-20% improved speed and compression ratio) and we think it is worth mentioning.


I never felt 100% sure about any of the alternative compression tools, so I continued depending on gzip at the cost of having to buy more disk space. I would still love to find a better way, but a major tradeoff is that fasta.gz and fastq.gz can be read by most bioinformatics software, so deviating from that format means a bit more work. This could be fine for long-term storage though.

Has anybody adopted something other than gzip or bz2 and would like to report?


Have you tried feeding the compressed files to your tools through process substitution in Linux, <(...), or by reading from stdin?

fastqc <(decompress compressed.file)
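The same idea applies inside a script: the decompressed stream can be consumed record by record without ever writing an uncompressed copy to disk. A minimal Python sketch follows, assuming plain four-line FASTQ records and choosing the decompressor from the file extension; read_fastq and OPENERS are names made up for this example, and a format like .fqz would need its own tool piped in instead.

```python
import bz2
import gzip
import lzma
from pathlib import Path
from typing import Iterator, Tuple

# Map file extensions to standard-library openers; anything else is read as plain text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) while decompressing on the fly."""
    opener = OPENERS.get(Path(path).suffix, open)
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'
```

Memory use stays flat because only one record is held at a time, though the CPU cost of decompression is still paid on every pass over the file.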

You know, Clumpify's output is still just a gzipped fastq/fasta. The only difference is that the order of the reads is changed. So it's 100% compatible with all software that can read gzipped fastq. Or bzipped, for that matter.


I agree with you @Eric.

I always get data in gzip format. It is really a tiring process to uncompress it and recompress it with fqz_comp into .fqz files. Moreover, these new file formats aren't as software-friendly as gz: they have to be decompressed into a bigger fastq stream, for example with fastqc <(decompress compressed) or when running bwa mem, which costs additional memory and increases computation time. I am not sure how such jobs will behave when run in parallel. Also, many compression tools alter your NGS data, for example by removing NNN bases or poor-quality reads, or by clustering similar reads to reduce file size, which can give misleading results at the fastqc step.

Please check this post https://www.uppmax.uu.se/support/faq/resources-faq/how-should-i-compress-fastq-format-files/

And caveat section of fqz_comp https://github.com/jkbonfield/fqzcomp/blob/master/README.md

Moreover, I found discordance between the fastq decompressed by fqz_comp and the original fastq file, because of the modifications the tool makes. If you run into an error later in your analysis, you are always left suspecting that something went wrong during compression, or that you introduced a bias somewhere, such as in the fastqc reports.

7.5 years ago

Eric, I am very interested in what you selected for fastq compression. We developed the Lossless ALAPY Fastq Compressor (now with stdin/stdout support), and technically it is reliable, based on about 2000 different files that were compressed and decompressed, with md5 sums compared and found to be identical in all cases. It is available for free on GitHub, https://github.com/ALAPY/alapy_arc, as a compiled binary for Linux and Windows. We hope GitHub will still be around in 10 years and will keep distributing these files. Could you please tell us what you need in order to consider a compression tool reliable and dependable?

Overall, I am very interested in your thoughts about ALAPY Compressor in general, i.e. whether the compression ratio, memory usage, features, etc. are good.

9.1 years ago

BAM gives you some compression over fastq.gz (particularly if you map and sort first). And pigz produces gzip files, faster than gzip, which allows you to increase the compression level. There's also bz2, which has a parallel implementation and gives better compression than gz.

If you want to be confident in recovering the data at some point in the future... I would go with one of those rather than something that is not widely used.


I've gone the bz2 way for longer-term storage. I haven't really studied the compression algorithm in detail, so I don't know how much your data affects the level of compression. As recent examples, 2 x 8.9G fastq files compressed into a 3.3G tar.bz2 archive and 2 x 21G fastq files compressed into a 6.9G tar.bz2 archive. So, overall a roughly 5-6X reduction in size. In these cases, the reads were quite heterogeneous (QC'd HiSeq-sequenced metagenomes). LZMA could perhaps achieve a better ratio still, although its memory requirements may prevent its use.


Hi, by fastq do you actually mean fastq.gz? Because if you mean fastq, then 8.9G to 3.3G seems less efficient than compressing with gz, or am I wrong?


2 x 8.9G fastq (17.8G in total) into 3.3G (total), i.e. a ca. 5.4X reduction.


In most cases, I do not have access to a reference genome or it would be incomplete and I would lose sequences, so BAM is not an option.

I know about pigz and use it on my computer but compression speed is not the major issue here. Space is. I am not convinced yet I want to go the more risky route of using a less known compression tool without some encouraging user stories from people who handle lots of data.

Thanks for the opinion.


You can store unaligned reads in BAM just fine (all the mapping-related information is just blank). The sequences, quality scores, and read names are all there, so it's effectively lossless.

This has been proposed/pushed before (see here) and discussed in various places like here and here, if you want to see some responses to it. For me, I like the cleanliness of the approach (a "universal" format), but it hasn't really caught on with the datasets and experiments that we see or do.


Actually, if you save them as aligned BAM, the BAM files will usually be larger than fastq.gz, I think. However, maybe you are right that unaligned BAM will be a little smaller than fastq.gz.


About pigz - I think the point was not just about speed. Gzip offers several levels of compression, but the higher levels are slower to compress/decompress. Using parallelized compression, such as pigz or pbgzip, can remove such speed bottlenecks and open the door to higher compression levels.
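To get a feel for what the levels buy, the size and time at a few levels can be measured directly. The sketch below uses Python's zlib module, which implements the same DEFLATE algorithm that gzip and pigz use, on made-up FASTQ text; the synthetic data, the level choices, and the function names are all assumptions, and real reads will behave differently, so run it on a sample of your own files before drawing conclusions.

```python
import random
import time
import zlib


def synthetic_fastq(n_reads: int = 2000, read_len: int = 100, seed: int = 0) -> bytes:
    """Build made-up FASTQ text with random bases and qualities (illustration only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def level_report(data: bytes, levels=(1, 6, 9)) -> dict:
    """Map each DEFLATE level to (compressed_bytes, seconds); timings vary by machine."""
    report = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        if zlib.decompress(compressed) != data:
            raise ValueError(f"round trip failed at level {level}")
        report[level] = (len(compressed), elapsed)
    return report


if __name__ == "__main__":
    data = synthetic_fastq()
    for level, (size, seconds) in level_report(data).items():
        print(f"level {level}: {size / len(data):.3f} of original, {seconds:.3f}s")
```

Whatever level is chosen, the output is still an ordinary DEFLATE stream, so compatibility with downstream tools is unaffected.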

8.4 years ago
John 13k

I tried LFQC, but it had a bug where, if the bundled precompiled binaries (for lpaq and zpaq) didn't work, the Ruby script that controls them would still print "created successfully!", then delete the workspace after moving an empty tar file over the top of your original fastq.

I then tried fqzcomp, and it's really, really fast, and the output is tiny (for ENCODE's ENCFF001LCY.fq, which is 600MB+ after gzip, fqzcomp got it down to 250MB), but it has an unfortunate condition where it can't decompress what it wrote :/

[Image: screenshot of the fqzcomp decompression failure]

You do not want to be stuck with a "Floating point exception 8" error 10 years from now, that's for sure, so I think the answer is "everything is terrible, just stick to lzma" :) You'll only squeeze out a few more MB with the other tools. If, however, you start using some of the lossy functions of the other tools, file size will drop quickly: for example, binning quality scores, renaming the IDs to just 1, 2, 3, 4..., converting all Ns to low-quality As, or not retaining the original order of the fastq and instead sorting entries by what compresses best. But these methods all lose some information, so it might not be that exciting.
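To make two of those lossy transformations concrete, here is a small Python sketch of quality binning and read renaming. The bin boundaries and representative scores are hypothetical values picked for the example, not the scheme of any particular tool, and Phred+33 encoding is assumed.

```python
# Hypothetical bins: (lowest score, highest score, representative score).
DEFAULT_BINS = ((0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 93, 37))


def bin_quality(quality: str, bins=DEFAULT_BINS) -> str:
    """Replace each Phred+33 score with the representative score of its bin."""
    out = []
    for char in quality:
        score = ord(char) - 33
        for low, high, representative in bins:
            if low <= score <= high:
                score = representative
                break
        out.append(chr(score + 33))
    return "".join(out)


def make_lossy(records):
    """Rename reads 1, 2, 3... and bin qualities; names and exact scores are lost for good."""
    return [
        (str(index), sequence, bin_quality(quality))
        for index, (_, sequence, quality) in enumerate(records, start=1)
    ]


if __name__ == "__main__":
    reads = [("SRR000001.1 some description", "ACGTN", "!+5?I")]
    print(make_lossy(reads))
```

Fewer distinct quality symbols and shorter names give a general-purpose compressor much less entropy to encode, which is where the size reduction comes from, but neither step can be undone.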


I would be sort of wary of any "lossless" compression that uses floating point anywhere...

I like lzma for personal use, but it seems pretty slow compared to low-compression or multithreaded gzip, for a big-data production environment. Is there a multithreaded implementation?

Also, when you say "not retaining the original order of the fastq and instead sort entries by what compresses best"... technically that's kind of not lossy. At least with Illumina, I think you can probably recover the original order (or close to it) by looking at the read names, and I've never thought the order was important except when diagnosing machine problems.

But if you are willing to discard that information, you might want to try Clumpify, a tool I wrote. It re-orders reads so that sequences sharing kmers are close together, quickly and without any mapping, and using an arbitrarily small amount of memory (how small depends on your system's file handle limit). This allows gzip to compress error-free reads generated from a bacterial genome down to the size of around the bacterial genome, even if there is, say, 40x coverage. This near-perfect compression requires you to replace the quality scores with a fixed value, and give the reads very short names (like 1, 2, 3, etc), and it works much better on long, single-ended reads (or merged pairs). But even with paired reads containing sequencing errors, and using the raw quality scores and names, you get a substantial increase in compression. The output fastq file is still a valid fastq file, and for purposes where you don't care about the order of the sequences (which are most purposes I care about), it will be no different... except faster in most cases like mapping, assembly, or kmer-counting, due to improved caching and branch prediction from similar reads being adjacent. Of course if you rename the reads with numbers you get better compression and can easily recover the original order as well.
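The reordering idea can be illustrated with a toy version in Python. To be clear, this is not how Clumpify is implemented; it is a simplified stand-in that sorts reads by their smallest canonical k-mer (a minimizer) so that reads sharing that k-mer end up adjacent, and the value k=16 and all function names are assumptions for the example.

```python
import gzip

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement; characters other than ACGT (such as N) are left as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def minimizer(seq: str, k: int = 16) -> str:
    """Smallest canonical k-mer of a read, used here as its sort key."""
    best = None
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        canonical = min(kmer, revcomp(kmer))
        if best is None or canonical < best:
            best = canonical
    return best if best is not None else seq  # reads shorter than k sort by themselves


def clump(records, k: int = 16):
    """Reorder (name, sequence, quality) records so reads sharing a minimizer are adjacent."""
    return sorted(records, key=lambda record: minimizer(record[1], k))


def gzip_size(records) -> int:
    """Size in bytes of the records written as FASTQ text and gzip-compressed."""
    text = "".join(f"@{n}\n{s}\n+\n{q}\n" for n, s, q in records)
    return len(gzip.compress(text.encode("ascii")))
```

Comparing gzip_size(records) with gzip_size(clump(records)) on a sample of real reads shows whether reordering helps for a given dataset; the records themselves are unchanged, only their order differs.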

P.S. I should note that core-counts are increasing, while IPC and frequencies have stagnated, and essentially been flat for 4 years on workloads I care about. 10 years from now, I expect multithreaded compression and decompression to be very important; a fast program capable of using 128 threads is crippled if it can only decompress at 160 MB/s, roughly the current limit of gzip... let alone lzma, which on my computer is many times slower.


Yes, a floating point error for something that's lossless and doesn't contain floating point numbers is a bit weird, but hey - at least it didn't delete my data -_-;

For parallel lzma, the official xz tool (the newer compressor, which does lzma2 compression) has a --threads option, but I've never used it. I'm unsure if you need to decompress with the same number of threads, or what exactly threads means here in terms of speedup. There is also a GitHub project called pxz, for "parallel xz", which looks like it's the lzma cousin of pigz, but however you slice it, your point about it not being anywhere close to as fast as gzip for big data is valid.

Clumpify, on the other hand, is an approach I haven't seen anywhere else. Usually in the encoding step before compression, data is split into names/DNA/quality to help out the compressor, and converted to binary. You only sort on the DNA and don't convert to binary, which I actually prefer since that's where obscure "floating-point-esque" errors usually crop up, and it's going to be really fast compared to the other methods. And the output is valid FASTQ, and that FASTQ will be processed faster by downstream tools. I think that's a really, really neat idea, and it should probably be built into the sequencers that output FASTQs in the first place. It might not get the compressed file size down as low as fqzcomp, but I think it answers the OP's question perfectly by being undoubtedly the most reliable method.


Note: Clumpify.sh is part of BBMap suite.

3.3 years ago
Divon ▴ 230

You might also want to try my software Genozip - it provides much better compression than .gz, and has several other advantages too:

  • You can compress whole directories directly into a tar file: genozip *.fq.gz --tar mydata.tar (or to include subdirs and preserve directory structure: find mydir/ | genozip -T - --tar mydata.tar )

  • It is highly scalable with cores - it has been tested with 100+ cores.

  • It can compress FASTQ, BAM, VCF and many other genomic formats.

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

-divon
