I tried to compress 5 BAM files using:
tar -czvf original_bams.tar.gz *.bam
The resulting file sizes ("ll --block-size=M") are:
8067M file1.bam
6962M file2.bam
10662M file3.bam
7794M file4.bam
7346M file5.bam
40828M original_bams.tar.gz
The archive is only 3 MB smaller than the combined size of the BAM files (40831M). Is this expected? I know that there is CRAM (which I will turn to next), but I'm surprised to see that good old .tar.gz has essentially zero effect.
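One likely explanation for the numbers above: BAM is already compressed (it uses BGZF, a gzip-compatible block format), so a second gzip pass over it has almost nothing left to remove. The sketch below is not run on BAM files; it reproduces the same effect with plain gzip on a made-up test file, and all file and variable names in it are hypothetical.

```shell
# Illustration only: gzip a compressible file, then gzip the result again.
workdir=$(mktemp -d)

# ~2 MB of random bytes encoded as base64 text: compressible, like sequence text.
head -c 2000000 /dev/urandom | base64 > "$workdir/sample.txt"

gzip -c "$workdir/sample.txt"    > "$workdir/sample.txt.gz"     # first pass
gzip -c "$workdir/sample.txt.gz" > "$workdir/sample.txt.gz.gz"  # second pass

# $(( ... )) strips any padding that wc adds on some systems.
raw_size=$(( $(wc -c < "$workdir/sample.txt") ))
once_size=$(( $(wc -c < "$workdir/sample.txt.gz") ))
twice_size=$(( $(wc -c < "$workdir/sample.txt.gz.gz") ))

echo "raw:           $raw_size bytes"
echo "gzipped once:  $once_size bytes"
echo "gzipped twice: $twice_size bytes"
```

The first pass should shrink the file noticeably, while the second pass should leave the size almost unchanged, which is the same pattern as the tar.gz of BAM files.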
CRAM is good for archiving. It can take ~24 hours to create a CRAM file from a ~30 GB BAM file, and the result will probably be ~60% of the BAM's size. Check whether your BAM files already have binned quality scores; if not, try binning them while creating the CRAM, since that has a nontrivial impact on the size.
That seems like a really long time. Do you have benchmarks?
Not really - I was only running trials, converting one really small BAM file and one large BAM file to check compression ratios.
You'll actually get better compression by converting them to sam.gz (or, better yet, sam.bz2), and the process is quite fast using pigz/pbzip2.
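A minimal sketch of that conversion, assuming samtools, pigz and pbzip2 are installed and on PATH. The function names are hypothetical, and the output is written next to the input with the .bam suffix replaced.

```shell
# Sketch: stream a BAM out as SAM text and pipe it into a parallel compressor.
# Assumes samtools, pigz and pbzip2 are available; function names are made up.

bam_to_sam_gz() {
    # $1 = input BAM; -h keeps the SAM header in the output
    samtools view -h "$1" | pigz -c > "${1%.bam}.sam.gz"
}

bam_to_sam_bz2() {
    # Same idea with pbzip2, usually smaller output at the cost of more CPU time
    samtools view -h "$1" | pbzip2 -c > "${1%.bam}.sam.bz2"
}

# Example usage (not run here):
#   for f in *.bam; do bam_to_sam_gz "$f"; done
```

To get a BAM back later, something like `pigz -dc file1.sam.gz | samtools view -b -o file1.bam -` should work. Note that the compressed SAM is not indexed, so this only suits archiving.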