Hello,
I have a lot of BAM files (nearly 500), each about 10 GB; in total my data occupies 7 TB. I know BAM files are already compressed. Does it make sense to compress the ones I do not use into a single tgz file? Or any other format?
The CRAM format is one option for you: it provides significantly better lossless compression than BAM.
CRAM generation is actually faster than BAM generation in samtools, at least at the default compression levels. CRAM decoding is slower than BAM though unless you're I/O bound, in which case CRAM will be faster due to being smaller.
See https://github.com/samtools/www.htslib.org/pull/23/commits/6a123b6aa7e677c899799cf615b6ca27659193d0 (not merged yet sadly) for some modern benchmarks.
For archival, you have to be certain the reference will be around for as long as the archive too. Either cache a copy of it with your files or use the embedded reference mode of CRAM. You can do this with
samtools view -O cram,embed_ref in.bam -o out.cram
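One benefit of the embedded reference is that decoding later should not need the original FASTA. A minimal sketch, reusing the out.cram produced above (file names are just placeholders):
samtools view -b -o back.bam out.cram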
You might also want to try Genozip, which compresses BAM better than CRAM does, and also compresses many other genomic formats such as FASTQ, VCF (and even CRAM).
Documentation: https://www.genozip.com.
Full disclosure: I am the developer of Genozip
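In case it helps, a rough sketch of the basic round trip as I understand it from the Genozip docs (file names are placeholders; check the documentation for the exact options, e.g. whether a --reference is recommended for your BAMs):
genozip file.bam              # produces file.bam.genozip
genounzip file.bam.genozip    # restores file.bam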
Compressing BAMs using gzip will not be worth the effort, as the time spent compressing/decompressing them will cost more than the space you end up saving overall.
Look for archival solutions. If you're dealing with 500 BAMs, you are most probably working at an institution that has an HPC cluster with storage and archival options.
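If the goal is simply to bundle rarely-used files for tape or cold storage rather than to shrink them, a plain tar without gzip is usually sufficient, since the BAMs are already compressed internally. A minimal sketch with placeholder names:
tar -cf cold_bams.tar sample1.bam sample2.bam    # bundle without recompressing
tar -xf cold_bams.tar sample1.bam                # pull a single file back out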
Not many tools accept CRAM input, so if you ever need to do anything with these files they will have to be reconverted. Take that into account when making the decision.

I did the maths on how long it takes to recover the AWS CPU cost of a BAM to CRAM conversion (based on a spot price some arbitrary time ago) through the reduction in AWS standard S3 storage charges. At that point it happened to be around 1 day! Obviously longer for cheaper storage tiers.
I didn't do the reverse costs - CRAM to BAM - but it'll be a similar order of magnitude.
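For reference, the reconversion is a single samtools call. A sketch, assuming the same reference FASTA used for encoding (not needed if the reference was embedded; file names are placeholders):
samtools view -b -T ref.fa -o restored.bam archived.cram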
If you absolutely must keep the BAM format, it's always possible to uncompress them first (zcat in.bam > in.u.bam) and then recompress using another tool with far superior compression ratios, such as bsc or mcm. It'll still likely be considerably larger than CRAM, though, and it'll take considerably longer. The process can be reversed, ending with bgzip to recompress the BAM.
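A sketch of that round trip, assuming the bsc command-line tool (exact options may differ between builds; file names are placeholders):
zcat in.bam > in.u.bam             # strip the BGZF/gzip layer
bsc e in.u.bam in.u.bam.bsc        # recompress with bsc
bsc d in.u.bam.bsc in.u.bam        # later: decompress again
bgzip -c in.u.bam > in.bam         # restore a standard BGZF-compressed BAM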
Do I have to use reference genomes as well? Something like
samtools view -T ref.fa -C -o file.cram file.bam
? Or is it possible to avoid it?

No, you have to specify a genome.
When I use the command line above, will it create the CRAM file and will the BAM disappear? Thanks to all, your help was very useful!
No it won't. Well-written tools (actually, all non-destructive tools) don't overwrite or delete their input files.
I see that CRAM lossless compression reduces the BAM size from 1.7 to 1.4, so from 10 TB I'll go down to 6-7 TB. I guess this is the maximum compression? How do you archive BAM files in your clusters?