6.4 years ago
James Reeve
As part of my pipeline I'm using the Picard program SortSam to order the reads in my BAM file by their position (SORT_ORDER=coordinate). However, when I run this command, my output file takes up much less space than the input.
java -Djava.io.tmpdir=[tmp-directory] -jar picard.jar SortSam \
I=before-sort.bam \
O=after-sort.bam \
SORT_ORDER=coordinate
du before-sort.bam
= 44131980 KB
du after-sort.bam
= 28874760 KB
Have I lost data, or does SortSam have a filtering step I don't know about?
I checked my files and they have the same number of reads. Thanks for the help.
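For anyone who wants to repeat that check, here is a minimal sketch of one way to compare read counts. It assumes samtools is installed and on the PATH. The name same_read_count is a hypothetical helper made up for this example, not part of samtools or Picard. It relies on samtools view -c, which prints the number of alignment records instead of the records themselves.

```shell
# Hypothetical helper: succeed only if both BAM files hold the same number of records.
# Assumes samtools is on the PATH; `samtools view -c` prints a record count.
same_read_count() {
    local before after
    before=$(samtools view -c "$1") || return 2
    after=$(samtools view -c "$2") || return 2
    echo "before=$before after=$after"
    [ "$before" -eq "$after" ]
}

# Usage:
#   same_read_count before-sort.bam after-sort.bam && echo "no reads lost"
```

If the counts match, the size difference is not caused by dropped reads.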
Do you know why my file is about 35% smaller after sorting? This is remarkable compression from a program that I assumed only rearranges the data.
How did you create before-sort.bam? Maybe you used a very low compression level for it? I think the standard compression level for most tools is 5 (on a scale of 0-9). If I remember correctly, the size difference between an uncompressed BAM and a standard BAM from a normal samtools view -b is typically around 20%, but I have no expert knowledge of compression, so do not take me as a reference ^^
I compressed from SAM to BAM using samtools view -b. SortSam sets the default compression level to 5 (20%). It seems a previous post (Sam To Bam - Loss Of Data Or Just Great Compression?) found SortSam to be very efficient when converting SAM to BAM. I guess part of the SortSam program compresses the files.
OK, I see. I never cared too much about file sizes as our HPC cluster has almost 350 TB of space, so I often leave intermediate files completely uncompressed to save time.
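To get a feel for the two effects discussed in this thread, here is a toy sketch using plain gzip. BAM uses BGZF, which is gzip-compatible, but everything else here is an assumption for illustration: the records are synthetic text lines rather than real alignments, compressed_size is a hypothetical helper, and the sizes it prints say nothing precise about real BAM files. It compares the same records unsorted and sorted at level 5, then the unsorted records at levels 1 and 9. Sorting tends to put similar records next to each other, which usually helps the compressor.

```shell
# Toy illustration only: synthetic text lines stand in for alignment records.
# compressed_size is a hypothetical helper; usage: compressed_size LEVEL FILE
compressed_size() {
    gzip -c "-$1" "$2" | wc -c | tr -d ' '
}

tmp=$(mktemp -d)

# 20000 fake records with random 8-digit "positions" (not real BAM data).
awk 'BEGIN { srand(1); for (i = 0; i < 20000; i++) printf "chr1\t%08d\tread\n", int(rand() * 10000000) }' > "$tmp/unsorted.txt"
sort "$tmp/unsorted.txt" > "$tmp/sorted.txt"

# Same records, same level 5: only the order differs.
unsorted_size=$(compressed_size 5 "$tmp/unsorted.txt")
sorted_size=$(compressed_size 5 "$tmp/sorted.txt")

# Same order, lowest vs highest gzip level.
fast_size=$(compressed_size 1 "$tmp/unsorted.txt")
best_size=$(compressed_size 9 "$tmp/unsorted.txt")

echo "level 5: unsorted=$unsorted_size sorted=$sorted_size bytes"
echo "unsorted: level 1=$fast_size level 9=$best_size bytes"

rm -rf "$tmp"
```

If the sorted file comes out smaller at the same level, that suggests ordering alone can account for part of a size drop, separately from whatever compression level each tool uses.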