8 months ago
geocarvalho
Hi, I was trying to decide which algorithm from samtools to use for CRAM compression, and I noticed that the BAM files recovered from CRAM are about 10 GiB smaller than the original BAM file. Do you know what information I am losing with this transformation?
$ docker run -v $PWD:$PWD quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1 samtools view -@ 14 -T $PWD/hg38.fa -C --output-fmt-option archive -o $PWD/SAMPLE-P_archive.cram $PWD/SAMPLE-P.bam
$ docker run -v $PWD:$PWD quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1 samtools view -@ 14 -T $PWD/hg38.fa --input-fmt-option archive -o $PWD/SAMPLE-P_unarchive.bam $PWD/SAMPLE-P_archive.cram
-rw-rw-r-- 1 where where 61G Mar 14 21:25 SAMPLE-P.bam
-rw-rw-r-- 1 where where 9.0M Mar 14 21:25 SAMPLE-P.bam.bai
-rw-r--r-- 1 root root 18G Mar 15 00:16 SAMPLE-P_archive.cram
-rw-r--r-- 1 root root 51G Mar 15 01:02 SAMPLE-P_unarchive.bam
-rw-r--r-- 1 root root 18G Mar 15 01:23 SAMPLE-P_small.cram
-rw-r--r-- 1 root root 51G Mar 15 01:47 SAMPLE-P_unsmall.bam
-rw-r--r-- 1 root root 19G Mar 15 01:58 SAMPLE-P_normal.cram
-rw-r--r-- 1 root root 51G Mar 15 02:14 SAMPLE-P_unormal.bam
-rw-r--r-- 1 root root 21G Mar 15 03:13 SAMPLE-P_fast.cram
-rw-r--r-- 1 root root 51G Mar 15 03:35 SAMPLE-P_unfast.bam
Don't depend on file sizes for any decisions. Look inside/compare the reads.
Just as gzip -1 and gzip -9 give different file sizes for the same input, two BAMs with identical content can differ greatly in size. That may or may not be the cause here; you'd have to decompress to test. (Note there's no point in --input-fmt-option archive, as the input format is self-describing, but I wonder if it somehow enabled archive mode for the BAM output, which would indeed be something like gzip -9.)
The best thing, though, is to convert both to SAM and compare them. Htslib comes with a compare_sam.pl tool in its test directory to aid such things. It's slow, as it's not designed for anything other than testing, but it may help give some confidence.
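For files this large, a quick first pass before reaching for compare_sam.pl could be to stream each file as SAM and checksum only the alignment records, so nothing has to be written to disk. This is only a sketch: the function name records_checksum is made up for illustration, and a byte-exact match is a strict test, so a mismatch is a reason to look closer at the records rather than proof of data loss.

```shell
#!/bin/sh
# Hypothetical helper: read SAM text on stdin, drop header lines
# (they start with '@'), and print a checksum of the remaining records.
records_checksum() {
    grep -v '^@' | cksum
}

# Assumed usage, with samtools on PATH (file names taken from the question):
#   samtools view SAMPLE-P.bam           | records_checksum
#   samtools view SAMPLE-P_unarchive.bam | records_checksum
# Matching output means the decoded records are byte-identical.
```

If the two checksums agree, the size difference comes from how the BAM was compressed and not from missing data. If they differ, diffing a small region (for example one chromosome) should show which fields changed.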
Also, if you know you'll be sticking to samtools/htslib/noodles-derived tools for decoding CRAM, then you could also try -O cram,archive,version=3.1 to get maximum compression. Although frankly "archive" is typically too extreme IMO. It's good compression, but "small" is often a better tradeoff. Try both and see.
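To try the profiles side by side, something like the following could generate one command per profile. It is a dry run that only prints the commands; the function name print_cram_commands and the output naming scheme are assumptions for illustration, while the -O cram,PROFILE,version=3.1 form follows the option quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one samtools command per CRAM
# profile, so the resulting file sizes and run times can be compared.
# $1 = input BAM, $2 = reference FASTA
print_cram_commands() {
    bam=$1
    ref=$2
    for profile in fast normal small archive; do
        echo "samtools view -T $ref -O cram,$profile,version=3.1 -o ${bam%.bam}_$profile.cram $bam"
    done
}

# Assumed usage: review the printed commands, then pipe them to sh to run them.
#   print_cram_commands SAMPLE-P.bam hg38.fa
```

Timing each run as well as comparing sizes helps show whether the extra CPU cost of "archive" is worth the space saved over "small".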