fastq compression tools of choice
9.2 years ago
Richard ▴ 590

Hi all,

I'll be trying out a few compression tools for fastq files. So far on my list I have the following:

  1. dsrc
  2. lrzip
  3. gzip
  4. bgzf

Anyone have any good/poor experience with any of the above, or other options?

I'll be trying them all and plotting compression ratio against compression and decompression CPU time, but I'm interested to hear whether anyone has a reason not to consider any of the above, or whether there are other tools that should be considered.

Indexing and RAM usage are not of concern.
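
A comparison of compression ratio against compression and decompression CPU time could be sketched roughly as below. This is only an illustration under stated assumptions: it uses the codecs that ship with Python's standard library (gzip, bzip2, xz) on a small synthetic FASTQ produced by a hypothetical `make_fastq` helper. dsrc and lrzip are not covered, since they would have to be run as external programs, and real reads will compress differently from random ones.

```python
import bz2
import gzip
import lzma
import random
import time


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build a small synthetic FASTQ as bytes (a stand-in for real data)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data, codecs):
    """Return one row per codec: compression ratio and CPU seconds."""
    rows = []
    for name, (compress, decompress) in codecs.items():
        t0 = time.process_time()
        packed = compress(data)
        t1 = time.process_time()
        unpacked = decompress(packed)
        t2 = time.process_time()
        assert unpacked == data  # round-trip sanity check
        rows.append({
            "tool": name,
            "ratio": len(data) / len(packed),  # original size / compressed size
            "comp_s": t1 - t0,
            "decomp_s": t2 - t1,
        })
    return rows


# Standard-library codecs only; default compression settings.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

results = benchmark(make_fastq(), CODECS)
for row in results:
    print(f"{row['tool']:6s} ratio={row['ratio']:.2f} "
          f"comp={row['comp_s']:.3f}s decomp={row['decomp_s']:.3f}s")
```

The rows can be passed to any plotting library to draw ratio against time. For a real evaluation the synthetic input should be replaced with actual FASTQ files, and each measurement repeated several times.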

EDIT Oct 28, 2015: We have tested lrzip, gzip, dsrc, bzip2, and others and found that by far dsrc is the best tool for fastq compression. It is the fastest to compress and has the highest compression ratio. Are there other folks out there using dsrc?

Thanks,
Richard

fastq compression • 9.4k views

What do you need out of compression? Fast compression time? Fast extraction time? Best compression efficiency? Low run-time memory usage? Do you need indexing (random access)?

Compression is a deep subject. Different algorithms have different characteristics that make them suitable for different use cases. You probably need to specify your criteria, first, before this becomes an answerable question.

9.2 years ago
Charles Plessy ★ 2.9k

You may be interested in the article published in PLOS ONE (2013;8(3):e59190) by James K. Bonfield and Matthew V. Mahoney: Compression of FASTQ and SAM Format Sequencing Data.


The article is good, but it is not enough to guide your choice. What we need are well-tested tools. Gzip is trustworthy and unlikely to contain many bugs that would be detrimental to your data. What about all the new tools presented in the article? Which ones are dependable?

9.2 years ago

I would say that the gzip format will make the compressed file compatible with more applications.


And if you opt for block-gzip compression, you get that backwards-compatibility, plus the ability to utilize multiple cores for compression and decompression, e.g. via pbgzip. (It comes at a small cost in compression ratio compared to normal gzip, but since you get to utilize multiple cores you can probably recover that by increasing the gzip compression level.)
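
The idea behind block-gzip can be illustrated with a simplified sketch. It relies on the fact that a concatenation of independent gzip members is itself a valid gzip stream, so blocks can be compressed in parallel and still be read by any gzip-aware tool. This is not the actual BGZF format (which adds extra header fields to each block) and not how pbgzip is implemented. The `block_gzip` and `make_fastq` helpers, the 64 KiB block size, and the worker count are all assumptions made for illustration.

```python
import gzip
import random
from concurrent.futures import ThreadPoolExecutor


def make_fastq(n_reads=5000, read_len=100, seed=1):
    """Build a synthetic FASTQ as bytes (a stand-in for real data)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def block_gzip(data, block_size=64 * 1024, level=6, workers=4):
    """Compress fixed-size blocks independently and concatenate the members."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = list(
            pool.map(lambda block: gzip.compress(block, compresslevel=level), blocks)
        )
    return b"".join(members)


data = make_fastq()
blocked = block_gzip(data)                    # many small gzip members
whole = gzip.compress(data, compresslevel=6)  # one ordinary gzip member
restored = gzip.decompress(blocked)           # a plain gzip reader handles both

print("single stream:", len(whole), "bytes")
print("blocked      :", len(blocked), "bytes")
print("round trip ok:", restored == data)
```

Comparing the two printed sizes gives a feel for the ratio cost of compressing blocks independently, and rerunning `block_gzip` with a higher `level` shows how much of it can be recovered.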

9.2 years ago

Algorithms that do well with text compression are probably worth investigating, insofar as uncompressed FASTQ is structured text. This site offers a pretty comprehensive comparison of various algorithms as applied to different corpora (Wikipedia, XML, etc.).
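
One way to explore the "structured text" angle is to split a FASTQ file into its identifier, base, and quality streams and compress each separately, so that a general-purpose text compressor sees more homogeneous input. The sketch below is a minimal illustration, assuming strict four-line records with a bare `+` separator line. The `split_streams`, `join_streams`, and `make_fastq` helpers are hypothetical names, and whether the split actually helps depends on the data and the compressor.

```python
import gzip
import random


def make_fastq(n_reads=2000, read_len=100, seed=2):
    """Build a synthetic FASTQ as bytes (a stand-in for real data)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def split_streams(fastq_bytes):
    """Split four-line FASTQ records into header, sequence, and quality lists."""
    lines = fastq_bytes.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    return lines[0::4], lines[1::4], lines[3::4]


def join_streams(headers, seqs, quals):
    """Rebuild the FASTQ text; assumes the separator line was a bare '+'."""
    return b"".join(
        h + b"\n" + s + b"\n+\n" + q + b"\n"
        for h, s, q in zip(headers, seqs, quals)
    )


data = make_fastq()
headers, seqs, quals = split_streams(data)

whole_size = len(gzip.compress(data))
split_size = sum(
    len(gzip.compress(b"\n".join(stream))) for stream in (headers, seqs, quals)
)

print("whole file  :", whole_size, "bytes")
print("three parts :", split_size, "bytes")
print("lossless    :", join_streams(headers, seqs, quals) == data)
```

Swapping `gzip.compress` for another text compressor makes it easy to try the same comparison with other algorithms.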

3.3 years ago
Divon ▴ 230

You might also want to take a look at my Genozip program. It produces at least 2x better compression than .gz for FASTQ, and often a lot more. It can also compress other genomic formats like BAM and VCF.

genozip --make-reference hs37d5.fa.gz   # prepare a reference file

genozip file-R1.fq.gz file-R2.fq.gz --pair --reference hs37d5.ref.genozip   # compress paired-end FASTQ files

genozip *.fq.gz --tar mydata.tar   # compress an entire directory directly into a tar file

Once you have the genozipped file, you can operate on it directly:

genocat --downsample 20 file-R1+2.fq.genozip   # downsample and shard
genocat --grep ACTGGGTC file-R1+2.fq.genozip   # search for specific reads
genocat --interleaved file-R1+2.fq.genozip   # display in interleaved format
genocat --coverage file-R1+2.fq.genozip   # estimated coverage per chromosome

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor


Please create a single tools post to announce your program. That would be the best way to do this instead of posting in multiple old threads related to fastq compression.


Thanks, good suggestion!

9 months ago
Xi • 0

Check out repaq: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10552150/ I compared it with SPRING. On my dataset, SPRING achieves an average compression ratio of 64% compared to fq.gz and 13.84% compared to fq, while repaq achieves an average of 54% compared to fq.gz and 11.83% compared to fq. But it takes almost twice as long to compress, so be aware. My tests were done with 32 threads.
