Hello,
I have a lot of BAM files (nearly 500), each about 10 GB; in total my data occupies 7 TB. I know BAM files are already compressed. Does it make sense to compress the ones I do not use into a single tgz file? Or any other format?
The CRAM format is one option for you: it provides significantly better lossless compression than BAM.
CRAM generation is actually faster than BAM generation in samtools, at least at the default compression levels. CRAM decoding is slower than BAM though unless you're I/O bound, in which case CRAM will be faster due to being smaller.
See https://github.com/samtools/www.htslib.org/pull/23/commits/6a123b6aa7e677c899799cf615b6ca27659193d0 (not merged yet sadly) for some modern benchmarks.
For archival, you have to be certain the reference will be around for as long as the archive too. Either cache a copy of it with your files or use the embedded reference mode of CRAM. You can do this with
samtools view -O cram,embed_ref in.bam -o out.cram
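One benefit of the embedded reference is that decoding later should not need the original FASTA. A minimal sketch, reusing the out.cram produced above (file names are just placeholders):
samtools view -b -o back.bam out.cram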
You might also want to try Genozip, which compresses BAM better than CRAM does, and also compresses many other genomic formats such as FASTQ, VCF (and even CRAM).
Documentation: https://www.genozip.com.
Full disclosure: I am the developer of Genozip
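In case it helps, a rough sketch of the basic round trip as I understand it from the Genozip docs (file names are placeholders; check the documentation for the exact options, e.g. whether a --reference is recommended for your BAMs):
genozip file.bam              # produces file.bam.genozip
genounzip file.bam.genozip    # restores file.bam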
Compressing BAMs using gzip will not be worth the effort, as the time spent compressing/decompressing them will cost more than the space you end up saving overall.
Look for archival solutions. If you're dealing with 500 BAMs, you are most probably working at an institution that has an HPC cluster with storage and archival options.
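If the goal is simply to bundle rarely-used files for tape or cold storage rather than to shrink them, a plain tar without gzip is usually sufficient, since the BAMs are already compressed internally. A minimal sketch with placeholder names:
tar -cf cold_bams.tar sample1.bam sample2.bam    # bundle without recompressing
tar -xf cold_bams.tar sample1.bam                # pull a single file back out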
Not many tools accept CRAM input, so if you ever need to do anything with these files they will have to be reconverted. Take that into account when making the decision.

I did the maths on how long it takes to recover the AWS CPU cost of a BAM to CRAM conversion (based on a spot price some arbitrary time ago) through the reduction in AWS standard S3 storage charges. At that point it happened to be around 1 day! Obviously longer for cheaper storage tiers.
I didn't do the reverse costs - CRAM to BAM - but it'll be a similar order of magnitude.
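For reference, the reconversion is a single samtools call. A sketch, assuming the same reference FASTA used for encoding (not needed if the reference was embedded; file names are placeholders):
samtools view -b -T ref.fa -o restored.bam archived.cram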
If you absolutely must keep the BAM format, it's always possible to uncompress them first (zcat in.bam > in.u.bam) and then recompress using another tool with far superior compression ratios, such as bsc or mcm. It'll still likely be considerably larger than CRAM, though, and it'll take considerably longer. The process can be reversed, ending with bgzip to recompress the BAM.
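A sketch of that round trip, assuming the bsc command-line tool (exact options may differ between builds; file names are placeholders):
zcat in.bam > in.u.bam             # strip the BGZF/gzip layer
bsc e in.u.bam in.u.bam.bsc        # recompress with bsc
bsc d in.u.bam.bsc in.u.bam        # later: decompress again
bgzip -c in.u.bam > in.bam         # restore a standard BGZF-compressed BAM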
Do I have to use reference genomes as well? Something like
samtools view -T ref.fa -C -o file.cram file.bam
? Or is it possible to avoid it?

No, you have to specify a genome.
When I use the command line above, will it create the CRAM file and will the BAM disappear? Thanks to all, your help was very useful!
No it won't. Well-written tools (actually, all non-destructive tools) don't overwrite or delete their input files.
I see that CRAM lossless compression reduces the BAM size from 1.7 to 1.4, so from 10 TB I'll go down to 6-7 TB. I guess this is the maximum compression? How do you archive BAM files in your clusters?