Question

Why samtools sort input file is 4.3 Gb but output become 3.5Gb....

1

Entering edit mode

8.7 years ago

bxia ▴ 180

Try to sort the bam file, but the input is 4.3 Gb, output is 3.5 Gb

command line:

samtools sort 231.bam 231.sorted.bam

ChIP-Seq • 2.4k views

ADD COMMENT • link updated 8.7 years ago by John 13k • written 8.7 years ago by bxia ▴ 180

score 6 · Answer 1 · 2016-05-30

This is because BAM files are block compressed, and sorted data is more easily compressible than unsorted data (repetitions and similar records tend to be closer together, and the underlying compression algorithm can do a better job). This is a well-documented phenomenon, and there are even a number of compression tools (e.g. SCALCE, MINCE, ORCOM, etc.) that are built on this principle of compression "boosting".

score 1 · Answer 2 · 2016-05-30

Just as Rob has said, sorting helps out the compression algorithm as reads sorted by position will have very similar DNA, eg:

AAAAAGTGTT
AAAAAGTGTT
AAAAGTGTTA
AAAAGTGTTA
AAAGTGTTAA
...

Also the chromosome name will often be the same, as will the first few bits of the position value.

From a more theoretical point, which I think is interesting enough to worth mentioning, sorting is actually all about removing information from the data. This might sound weird, but in terms of entropy, thats exactly what's happening:

 Original         Sorted
    5               1
    3               2
    4               3
    2               4
    1               5
   10               6
    9               7
    8               8
    7               9
    6              10
   14              11
   16              12
   15              13
   17              14
   13              15
   11              16
   12              17

You might say that these two lists contain exactly the same amount of information in them, but you'd be wrong. Although there's a 1:1 mapping of items in the lists, I encoded a message in the original list, where if the number on the next row is higher than the number on the current row, we write down a 1, else a 0. Can you read the message?

Thus, sorting the list reduce the amount of information/entropy in the dataset. This is, fundamentally, why sorting helps compression.