I am not too familiar with the subject, neither with the details of compression algorithms nor with the exact specification of the BAM format, but I think it is just a matter of more or less efficient compression.
Various implementations of zlib exist, and the samtools authors have run some benchmarks showing significant differences between them, at least at lower compression levels.
But even with the exact same implementation and compression level, the sizes of compressed files may vary if their content is reordered. The reason is that all compression works by storing similar content only once and referencing that piece of information at multiple locations. By design, zlib is limited to a 32 KB window, which was a sensible choice in the early '90s when it was conceived. So if sorting brings more similar content closer together and into that 32 KB window (e.g. by a different handling of mates or secondary alignments), it allows for a more efficient compression. This is nicely exemplified by clumpify from the BBTools suite, which can shrink compressed FastQ files by another 50% or so just by reordering reads by sequence similarity.
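Just to illustrate the idea (a sketch only; BBTools' clumpify.sh is assumed to be on the PATH and the file names are placeholders):

    # group reads by sequence similarity so the compressor's small window
    # sees near-identical reads next to each other
    clumpify.sh in=reads.fastq.gz out=reads_clumped.fastq.gz

    # optionally remove exact duplicates while clumping
    clumpify.sh in=reads.fastq.gz out=reads_clumped_dedup.fastq.gz dedupe=t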
Since today's computers can hold much more information in memory, even in mobile and embedded environments, the window of a compression algorithm can be increased for much better compression. Facebook's Zstandard, for example, has no inherent limit and can address terabytes of memory, although in practice it mostly operates with windows somewhere between 1 and 8 MB. However, it has a long-range mode, --long, that allows for a maximum window size of 2 GB. You would probably be surprised how much that can shrink a Fasta file of some reference genome interspersed with long transposons. Also, zstd --train is a really cool feature, because it builds a dictionary of repeating patterns informed by the content that needs to be compressed. Once trained, using a dictionary works really well if thousands of similar, small files need to be compressed individually.
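A rough sketch of both features (file names are placeholders; the flags are standard zstd options):

    # long-range mode with a 2 GB window (windowLog 31);
    # decompression then also needs --long=31 (or a matching --memory setting)
    zstd -19 --long=31 reference.fa -o reference.fa.zst

    # train a dictionary on many small, similar files, then compress each one with it
    zstd --train small_files/*.txt -o shared.dict
    zstd -D shared.dict small_files/sample_001.txt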
But I'm getting off topic - ultimately you just wanted to know if everything was fine with your BAM files, and it is ;-)
Generally, do not use file size as a metric for any QC. The extent of compression can change depending on what data end up near each other.
If the file was already sorted, then perhaps it was compressed more when you tried sorting it again. You can control the level of compression with the -l option of samtools sort.
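A possible invocation, with placeholder file names (-l takes 0-9; -@ just adds threads):

    # re-sort (or merely re-compress) at the highest gzip level
    samtools sort -l 9 -@ 4 -o resorted_l9.bam input.bam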
I've noticed what OP sees too - there's definitely a difference in size. I chalked it up to differences in compression levels; besides, there is no easy way to check if two BAMs are identical in terms of content - we'd also need to examine the significance of each difference, which IMO is a pain.
The BAM header may not contain all the PG lines. To make sure that both STAR's BAM and samtools' BAM contain the same number of reads, run:
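For instance, something along these lines (file names are placeholders; samtools flagstat gives the same counts with extra detail):

    # total record counts should match between the two files
    samtools view -c Aligned.sortedByCoord.out.bam
    samtools view -c resorted.bam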
If these are the same, then IMHO there is nothing to worry about. The size difference may be caused by a different compression level (the default for gzip is 6), a different block size, or, if you somehow got unmapped reads in the input BAM, by the samtools minimizer option.
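The last two possibilities are easy to check or try, assuming a reasonably recent samtools (file names are placeholders):

    # count unmapped records (FLAG bit 0x4)
    samtools view -c -f 4 input.bam

    # sort unmapped reads by sequence minimiser so similar sequences
    # end up next to each other and compress better
    samtools sort -M -o minimiser_sorted.bam input.bam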
Curious why that can happen? OP is re-sorting a sorted file (not sure why).
You are right that in the OP's case there is nothing indicating any missing PG lines. I meant a general-case scenario where "two BAM headers match 100%" may not reflect differences in the BAM bodies, i.e. bedtools intersect outputs BAMs without the relevant PG line.

I was re-sorting to double-check (and to double-guess, apparently);
also, multiple functions from the rseqc suite were printing an explicit warning saying that the input should be coordinate-sorted by samtools. I understood this to mean that samtools introduces some tags or whatnot, exclusive to it, during its sorting step, and that its output might therefore differ from STAR's sorted output files. Basically, I was bullied by an analysis suite's warning messages.
samtools idxstats outputs identical results for both files, so I assume the difference in size is due to the compression then.
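One way to run that comparison, assuming both BAMs are indexed (file names are placeholders):

    # per-reference read counts should be identical
    diff <(samtools idxstats star_sorted.bam) <(samtools idxstats resorted.bam)

    # stricter: compare the alignment records themselves, ignoring the header
    # (only meaningful if both files are sorted in the same order)
    diff <(samtools view star_sorted.bam) <(samtools view resorted.bam)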