BAM files compression
4.8 years ago
User000 ▴ 710

Hello,

I have a lot of BAM files (nearly 500), each 10 GB. In total my data occupies 7 TB. I know BAM files are already compressed. Does it make sense to compress the ones I do not use into a single tgz file? Or into any other format?

bam • 13k views

Compressing BAMs with gzip will not be worth the effort: the time spent compressing and decompressing them will cost more than the space you end up saving.
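
One way to check this on your own data is to gzip a single BAM and compare byte counts. A minimal sketch, assuming `gzip`, `wc` and `awk` are available; `recompress_gain` and the file name are hypothetical:

```shell
# Print the original size, the size after an extra gzip -9 pass, and the ratio.
# The input file is not modified; the gzip output is only counted.
recompress_gain() {
    orig=$(wc -c < "$1") || return 1
    [ "$orig" -gt 0 ] || return 1          # avoid dividing by zero on empty files
    regz=$(gzip -9 -c "$1" | wc -c)
    awk -v o="$orig" -v r="$regz" \
        'BEGIN { printf "original=%d gzipped=%d ratio=%.3f\n", o, r, r / o }'
}

# Example (hypothetical file name):
# recompress_gain sample.bam
```

A ratio close to 1.000 means the second gzip pass saved essentially nothing.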

Look for archival solutions. If you're dealing with 500 BAMs, you are most probably working for an institution that has an HPC cluster with storage and archival options.


Not many tools accept CRAM input, so if you ever need to do anything with these files they will have to be converted back. Take that into account when making the decision.


I did the maths on how long it takes for the reduction in AWS standard S3 storage charges to pay back the AWS CPU cost (based on a spot price some arbitrary time ago) of a BAM to CRAM conversion. At that point it happened to be around 1 day! Obviously it takes longer for cheaper storage tiers.
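
The break-even point is the one-off conversion cost divided by the daily storage saving. A small sketch of that arithmetic; `breakeven_days` is a hypothetical helper and the numbers in the example are made up for illustration, not the prices used above:

```shell
# Days until storage savings repay a one-off conversion cost.
# All arguments are illustrative inputs, not measured AWS prices:
#   $1 = CPU cost of converting one file (USD)
#   $2 = space saved by the conversion (GB)
#   $3 = storage price (USD per GB per month, a month taken as 30 days)
breakeven_days() {
    awk -v cpu="$1" -v saved="$2" -v price="$3" \
        'BEGIN { printf "%.1f\n", cpu / (saved * price / 30) }'
}

# Hypothetical numbers: 0.05 USD of CPU, 5 GB saved, 0.03 USD per GB-month
# breakeven_days 0.05 5 0.03
```

With those made-up inputs the function prints 10.0; a cheaper storage tier (smaller third argument) pushes the figure up.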

I didn't do the reverse costs - CRAM to BAM - but it'll be a similar order of magnitude.

If you absolutely must keep BAM format, it's always possible to uncompress the files first (zcat in.bam > in.u.bam) and then recompress them using another tool with a far superior compression ratio, such as bsc or mcm. The result will still likely be considerably larger than CRAM, though, and it will take considerably longer. The process can be reversed, ending with bgzip to recompress the BAM.
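
A minimal sketch of that round trip. It uses xz as a stand-in compressor, because bsc and mcm have their own command-line syntax that is not guessed at here. `ARCHIVER`, `BGZIP`, the function names and the file names are all hypothetical; the only assumption about the archiver is that it follows the gzip-style `-c`/`-d` convention:

```shell
# Stand-in compressor (xz); swap in another tool per its own documentation.
ARCHIVER=${ARCHIVER:-xz}
# bgzip ships with htslib and writes the BGZF blocks that BAM readers expect.
BGZIP=${BGZIP:-bgzip}

# BAM (BGZF, gzip-compatible) -> raw BAM stream -> archive
# gzip -dc is the portable spelling of zcat.
bam_to_archive() {
    gzip -dc "$1" | "$ARCHIVER" -9 -c > "$2"
}

# archive -> raw BAM stream -> BGZF-compressed BAM
archive_to_bam() {
    "$ARCHIVER" -dc "$1" | "$BGZIP" -c > "$2"
}

# Example (hypothetical file names):
# bam_to_archive in.bam in.u.bam.xz
# archive_to_bam in.u.bam.xz restored.bam
```

The restored file holds the same uncompressed content but need not be byte-identical to the original BAM, so compare checksums of the `gzip -dc` output rather than of the files themselves.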


Do I have to use a reference genome as well? Something like samtools view -T ref.fa -C -o file.cram file.bam? Or is it possible to avoid it?


No, you cannot avoid it: you have to specify a reference genome.
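
A minimal sketch of that conversion wrapped in a shell function, assuming samtools is on the PATH. `bam_to_cram`, `DRY_RUN` and the paths are hypothetical names introduced here; the samtools options are the ones from the command quoted in the question:

```shell
# Convert one BAM to CRAM against a reference FASTA.
# The CRAM is written next to the BAM with the extension swapped.
# With DRY_RUN=1 the command is printed instead of executed.
bam_to_cram() {
    ref=$1
    bam=$2
    cram=${bam%.bam}.cram
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "samtools view -T $ref -C -o $cram $bam"
    else
        samtools view -T "$ref" -C -o "$cram" "$bam"
    fi
}

# Example (hypothetical paths): convert every BAM in a directory
# for f in /data/bams/*.bam; do bam_to_cram ref.fa "$f"; done
```

Running the loop once with `DRY_RUN=1` shows the commands before committing CPU time to 500 files.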


When I use the command line above, will it create the CRAM file and delete the BAM? Thanks to all, your help was very useful!


No, it won't. Well-written tools (non-destructive ones, at any rate) don't overwrite or delete input files.
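
Since the BAM stays in place, removing it is a separate, explicit step. A cautious sketch that deletes the BAM only after two basic checks on the CRAM; `remove_bam_if_cram_ok` and the `SAMTOOLS` override are hypothetical, and matching record counts are a sanity check, not proof that every field survived:

```shell
# SAMTOOLS can be overridden, e.g. to point at a specific build.
SAMTOOLS=${SAMTOOLS:-samtools}

# Remove a BAM only if its CRAM passes quickcheck and both files
# report the same number of alignment records.
remove_bam_if_cram_ok() {
    ref=$1
    bam=$2
    cram=$3
    "$SAMTOOLS" quickcheck "$cram" || return 1
    n_bam=$("$SAMTOOLS" view -c "$bam") || return 1
    n_cram=$("$SAMTOOLS" view -c -T "$ref" "$cram") || return 1
    if [ "$n_bam" = "$n_cram" ]; then
        rm -- "$bam"
    else
        echo "record counts differ ($n_bam vs $n_cram); keeping $bam" >&2
        return 1
    fi
}

# Example (hypothetical file names):
# remove_bam_if_cram_ok ref.fa file.bam file.cram
```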


I see that lossless CRAM compression reduces the BAM size from 1.7 to 1.4, so from 10 T I would go to roughly 8 T. I guess this is the maximum compression? How do you archive BAM files on your clusters?
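
A projection like this is a linear scaling of the ratio seen on one file. A small sketch for checking it; `project_size` is a hypothetical helper, and it assumes the sampled file is representative of the whole collection:

```shell
# Scale a total size by a compression ratio observed on one sample file.
#   $1 = total size now
#   $2 = sample size before conversion
#   $3 = sample size after conversion
# Units are whatever the caller uses, as long as $2 and $3 match.
project_size() {
    awk -v total="$1" -v before="$2" -v after="$3" \
        'BEGIN { printf "%.1f\n", total * after / before }'
}

# Example with the figures quoted in this post:
# project_size 10 1.7 1.4
```

With the figures quoted in this post, `project_size 10 1.7 1.4` prints 8.2.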

4.8 years ago

Or any other format?

CRAM

4.8 years ago
JC 13k

As you pointed out, BAM is already compressed, so tgz will not help. However, you can convert the files to CRAM, or use another reference-based method, to reduce file size.

4.8 years ago
Rm 8.3k

CRAM is the option for you: it gives significantly better lossless compression than BAM.

4.8 years ago
jkbonfield ★ 1.3k

CRAM generation is actually faster than BAM generation in samtools, at least at the default compression levels. CRAM decoding is slower than BAM though unless you're I/O bound, in which case CRAM will be faster due to being smaller.

See https://github.com/samtools/www.htslib.org/pull/23/commits/6a123b6aa7e677c899799cf615b6ca27659193d0 (not merged yet sadly) for some modern benchmarks.

For archival, you have to be certain the reference will be around for as long as the archive too. Either cache a copy of it with your files or use the embedded reference mode of CRAM. You can do this with

samtools view -O cram,embed_ref in.bam -o out.cram
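
For the other option, keeping a copy of the reference with the files, a minimal sketch follows. `archive_reference` and the paths are hypothetical; `cksum` is used only because it is available everywhere, and a stronger hash such as `sha256sum` is preferable where installed:

```shell
# Copy the reference (and its .fai index, if present) next to the archive
# and record a checksum so later corruption or mix-ups can be detected.
archive_reference() {
    ref=$1
    dest=$2
    name=$(basename "$ref")
    mkdir -p "$dest" || return 1
    cp "$ref" "$dest/" || return 1
    if [ -f "$ref.fai" ]; then
        cp "$ref.fai" "$dest/" || return 1
    fi
    # Run cksum inside the destination so the recorded path is relative.
    ( cd "$dest" && cksum "$name" > "$name.cksum" )
}

# Example (hypothetical paths):
# archive_reference /refs/ref.fa /archive/project1
```

Re-running `cksum` on the archived copy and comparing it with the `.cksum` file confirms the reference is still intact before decoding old CRAMs.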
3.4 years ago
Divon ▴ 230

You might also want to try Genozip, which compresses BAM better than CRAM does, and also compresses many other genomic formats such as FASTQ, VCF (and even CRAM).

Documentation: https://www.genozip.com.

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

Full disclosure: I am the developer of Genozip


Is Genozip a commercial tool? I ask because of the .com URL. Is it FOSS?


It is not FOSS, but it is free for non-commercial use, and the source code is available on GitHub.
