More highly compressed formats for raw/aligned reads and variant tables have been around for some time, but I think adoption has been slow.
Our current disk space usage is making us take another look at switching to file formats that offer better compression than vanilla FASTQ, BAM and BCF.
For example:
- CRAM instead of BAM
- CRAM(unmapped) instead of FASTQ
- uBAM (unmapped BAM) instead of FASTQ
- DRAGEN ORA (from Illumina/Enancio) instead of FASTQ
- spVCF instead of VCF/BCF
- etc.
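For the first option (CRAM instead of BAM), a minimal sketch of the conversion using `samtools view`. The function names and file names are hypothetical placeholders, and the commands are only built here, not executed; running them assumes samtools is installed and that the same reference FASTA used for alignment is available.

```python
import shlex


def bam_to_cram_cmd(bam, cram, reference):
    """Build a samtools command that converts an aligned BAM to CRAM."""
    # -C selects CRAM output; -T names the reference FASTA used for encoding.
    return ["samtools", "view", "-C", "-T", reference, "-o", cram, bam]


def cram_to_bam_cmd(cram, bam, reference):
    """Build the reverse command, for tools that cannot read CRAM directly."""
    # -b selects BAM output; the reference is needed again for decoding.
    return ["samtools", "view", "-b", "-T", reference, "-o", bam, cram]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = bam_to_cram_cmd("sample.bam", "sample.cram", "ref.fa")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Note that a reference-compressed CRAM can only be decoded with the matching reference, so the reference has to be archived alongside the data.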
At least these aspects are important when considering new file formats:
- compression factor / file size reduction to be gained
- lossy or lossless
- still biologically meaningful
- technically compatible with current pipelines and tools (e.g. bwa/gatk/bcftools, IGV)
- open (source) file format / API specification
We care most about improved compression / reduced file size for the FASTQ and BAM files, and less about improved compression for BCF.
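To make the first aspect concrete, here is a small sketch of how compression factor and file size reduction relate. The function names and the byte counts are made up for illustration; they are not measurements.

```python
def compression_factor(original_bytes, compressed_bytes):
    """How many times smaller the new file is (e.g. 5.0 means 5x)."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    return original_bytes / compressed_bytes


def size_reduction(original_bytes, compressed_bytes):
    """Fraction of disk space saved (e.g. 0.8 means 80% smaller)."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    return 1 - compressed_bytes / original_bytes


if __name__ == "__main__":
    # Hypothetical sizes: a 100 GB file that shrinks to 20 GB.
    before, after = 100 * 10**9, 20 * 10**9
    print(f"factor: {compression_factor(before, after):.1f}x")
    print(f"saved:  {size_reduction(before, after):.0%}")
```

So a 5x factor corresponds to an 80% reduction, while a 2x factor already gives 50%: the gains flatten out quickly as the factor grows.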
Did you / your organization already make the switch to file formats that offer better compression than vanilla FASTQ/BAM/BCF?
How did the switch turn out, looking for example at the aspects listed above?
Relevant external blog post and benchmark:
The Illumina/Enancio format (fastq.ora) offers ca. 5x extra compression over fastq.gz according to their FAQ. But I think the format is closed, so you have to rely on Illumina keeping their converter available and free to use. https://www.illumina.com/company/about-us/mergers-acquisitions/enancio.html/
Illumina will have a vested interest in keeping this format supported (should it catch on) for a long time, since they are in the business of selling sequencers. The question is whether other technologies adopt it or go their own way. That is when we will have problems with competing formats causing additional headaches for end-users.
I don't understand why Illumina doesn't just publish the format, since they are indeed in the business of selling sequencing machines/kits. An open format would improve adoption, i.e. fastq.ora becoming an open industry standard. And the format does not seem too difficult to copy or improve upon: quick-and-dirty mapping against a reference, then encoding the difference (or exact match) of read vs. reference.
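To illustrate what I mean by encoding the difference of read vs. reference, here is a toy sketch. This is not the ORA format (which is not public); it is only the general idea, and it assumes the mapping position is already known and ignores indels, quality scores and read names. All names are made up.

```python
def encode_read(read, reference, pos):
    """Encode a read as (pos, length, mismatches) against the reference.

    pos is the 0-based mapping position; mismatches is a list of
    (offset_in_read, base) pairs. An exact match gives an empty list.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    mismatches = [
        (i, base)
        for i, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return pos, len(read), mismatches


def decode_read(encoded, reference):
    """Rebuild the original read from its encoding and the reference."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    read = "ACGATA"  # differs from the reference at one position
    encoded = encode_read(read, reference, 4)
    print(encoded)  # (4, 6, [(3, 'A')])
    assert decode_read(encoded, reference) == read
```

A read that matches exactly costs only a position and a length, which is where most of the gain over storing the bases themselves would come from. A real format also has to deal with quality values, which are usually the harder part to compress.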
Illumina just spent (a good bit of money?) to acquire that technology. They are also the dominant player in the market so perhaps they don't have an immediate need to make the technology (my guess is it is not a simple format) public. Perhaps if a competitor announces an open/comparable technology (and if it looks like it may start getting adopted) then they would face some pressure. In any case, they are making decompressors available for free so end-users are not locked out of the data.
You have to be careful with the free decompression software. For example, the latest version of DRAGEN (3.10) allows paired-end compression, which has many advantages: a better compression rate, and the possibility of piping the decompressed output straight into analysis software without needing mkfifo. However, this compression uses version 2.6 of ORA, while the decompressor is still at 2.5.5 and cannot decompress paired-end files.
Ion Torrent uses uBAM rather than FASTQ. We actually lose information if uBAM is converted to FASTQ.
Edit: I haven't used this tool, but it seems to offer lossless BAM to CRAM conversion for Ion Torrent reads.