I use Picard tools to convert a BWA alignment SAM file into a sorted BAM file. Normally this works well, but for a small number of samples I am getting VERY small BAM files, e.g. SAM file = 42G, BAM file = 831M. Samtools produces the same BAM file size. If I take the BAM and convert it back to SAM, the 42G file is reproduced.
I'm confused as to why the BAM file is so small when, for the majority of other samples, the BAM file is ~1/4 of the size of the SAM file, i.e. it should be about 10G here.
I'm using picard 1.77 and this command:
java -Xmx${JAVMEM} -jar ${pic_dir}/SortSam.jar SO=coordinate INPUT=${out_dir}/"4_"${SAMPLE_ABB}"_BWA_pe12.sam" OUTPUT=${out_dir}/"5_"${SAMPLE_ABB}"_BWA_pe12.bam" VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true MAX_RECORDS_IN_RAM=500000 TMP_DIR=${tmp_dir}
I would investigate this file in more detail. A compression ratio of about 50-fold is very surprising, so much so that it makes me suspect there is no useful information in your file; otherwise it wouldn't compress so well.
Agreed. I was thinking it might be a highly targeted experiment, where they got 10000x depth on a very small number of regions. You'd expect those results to be highly compressible, since many of the sequences would be identical.
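To illustrate why stacked, near-identical reads would compress so well, here is a rough Python sketch. It is only a qualitative stand-in: BAM is compressed with BGZF, which is built on DEFLATE, but BAM also stores records in a binary encoding, so the numbers will not match real BAM sizes. The helper names (fake_sam_lines, compression_ratio), the constant quality string, and the read counts are all made up for the example.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = text.encode("ascii")
    return len(raw) / len(zlib.compress(raw, 6))


def fake_sam_lines(n_reads: int, n_distinct: int, read_len: int = 100, seed: int = 0) -> str:
    """Build SAM-like text in which only n_distinct different sequences occur.

    Hypothetical data: random ACGT sequences and a constant quality string.
    """
    rng = random.Random(seed)
    templates = [
        "".join(rng.choice("ACGT") for _ in range(read_len))
        for _ in range(n_distinct)
    ]
    qual = "I" * read_len
    lines = []
    for i in range(n_reads):
        seq = templates[i % n_distinct]
        pos = 1000 + i % n_distinct
        lines.append(f"read{i}\t0\tchr1\t{pos}\t60\t{read_len}M\t*\t0\t0\t{seq}\t{qual}")
    return "\n".join(lines) + "\n"


# Every read has its own sequence (whole-genome-like diversity).
diverse = fake_sam_lines(2000, 2000)
# The same handful of sequences repeated over and over (very deep, tiny target).
stacked = fake_sam_lines(2000, 5)

print(f"diverse reads: {compression_ratio(diverse):.1f}x")
print(f"stacked reads: {compression_ratio(stacked):.1f}x")
```

If the stacked case compresses several times better than the diverse one, that is at least consistent with the targeted-experiment explanation, though it says nothing about whether that is what happened in this particular file.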
@Clare Typically, a BAM file is nearly a factor of four smaller than the corresponding SAM file, as you observed for your other samples. The size of the final BAM file depends on the number of reads and the compression algorithm. How many reads do you have in this file?
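One way to answer the read-count question without extra tools is to stream the SAM file and tally a few things. This is a minimal sketch, assuming a standard tab-separated SAM with header lines starting with "@"; the function name summarize_sam and the sample_size cut-off are invented for illustration, and distinct sequences are only tracked for the first sample_size records so that memory stays bounded on a 42G file.

```python
import io


def summarize_sam(handle, sample_size=1_000_000):
    """Count reads, unmapped reads, and distinct sequences in a SAM stream."""
    total = 0
    unmapped = 0
    seen = set()  # distinct sequences among the first sample_size records only
    for line in handle:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        total += 1
        if int(fields[1]) & 0x4:  # FLAG bit 0x4 means the read is unmapped
            unmapped += 1
        if total <= sample_size:
            seen.add(fields[9])  # column 10 is SEQ
    return {"reads": total, "unmapped": unmapped, "distinct_in_sample": len(seen)}


# Tiny made-up example so the sketch runs on its own.
example = io.StringIO(
    "@HD\tVN:1.0\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n"
)
result = summarize_sam(example)
print(result)
```

For the real file you would pass an open file handle instead of the StringIO object. A very low number of distinct sequences relative to the read count, or a very high unmapped fraction, would both be worth reporting back here.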