Question

Is there any benefit in sorting a sam/bam file by coordinates vs. queryname?

3

5.8 years ago

O.rka ▴ 740

I'm running the dropseq pipeline, and at one step the SAM file gets sorted. There appears to be an option to sort either by coordinate or by queryname. Is there a benefit to either of these?

alignment • 4.6k views

ADD COMMENT • link updated 5.8 years ago by h.mon 35k • written 5.8 years ago by O.rka ▴ 740

1

Sorting by coordinate is more efficient for visualisation in, for instance, a genome browser (because the file is queried by genomic location rather than by read name). I can't think of one immediately, but I'm sure sorting by name also has its use cases.
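To make the "queried by genomic location" point concrete, here is a minimal sketch using made-up records. It is not how a BAM index actually works (real indexes use bins and file offsets), but it shows why a coordinate-sorted order allows a region lookup without scanning every record. The names `coord_sorted` and `reads_starting_in` are hypothetical.

```python
from bisect import bisect_left

# Toy alignments as (start, read_name), already sorted by start coordinate.
# All values are invented for illustration.
coord_sorted = [(100, "r7"), (150, "r2"), (400, "r9"), (420, "r1"), (900, "r4")]
starts = [start for start, _ in coord_sorted]

def reads_starting_in(lo, hi):
    """Return names of reads whose start lies in [lo, hi), via binary search."""
    first = bisect_left(starts, lo)   # first record with start >= lo
    last = bisect_left(starts, hi)    # first record with start >= hi
    return [name for _, name in coord_sorted[first:last]]

region_hits = reads_starting_in(140, 430)
print(region_hits)
```

With a name-sorted list the same query would have to inspect every record, since position and file order are unrelated.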

ADD REPLY • link 5.8 years ago by lieven.sterck 15k

1

I think paired-end reads would be guaranteed to be adjacent in a sort-by-name, whereas it wouldn't necessarily be so with a coordinate sorted BAM.

ADD REPLY • link 5.8 years ago by manuel.belmadani ★ 1.4k

0

True. Counting fragments (each defined by its two mates) from paired-end data requires name sorting. Tools like featureCounts will reorder the BAM by query name if you specify paired-end input.

ADD REPLY • link 5.8 years ago by ATpoint 86k

score 6 · Accepted Answer · 2019-03-11

6

5.8 years ago

h.mon 35k

Reading a file sequentially is faster than random access, and keeping in memory only the information needed for a calculation is more efficient than keeping the whole file. Some tasks are easier with one sort order than the other, because the BAM can then be read sequentially and only part of the data needs to be kept in memory.

For example, marking duplicates (which, for paired reads, is done by looking at the 5' mapping positions of both reads) is a lot easier for BAMs sorted by position, because reads that are close together inside the BAM are then guaranteed to be close on the genome as well. If they weren't, one would need to scan the whole file to build a hash of reads per position in order to mark duplicates.
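As a rough sketch of that idea (not the algorithm of any particular tool), the toy below marks duplicates in a single pass over coordinate-sorted records while holding only one position's worth of state in memory. It is deliberately simplified: single-end reads, records keyed directly by 5' position, and the first read seen at a position/strand is the one kept. All record values and the name `mark_duplicates` are invented.

```python
from itertools import groupby

# Toy records: (chrom, five_prime_pos, strand, read_name), coordinate-sorted.
records = [
    ("chr1", 100, "+", "a"),
    ("chr1", 100, "+", "b"),   # same position and strand as "a"
    ("chr1", 100, "-", "c"),
    ("chr1", 250, "+", "d"),
    ("chr2", 100, "+", "e"),
]

def mark_duplicates(sorted_records):
    """Yield (read_name, is_duplicate), keeping state for one position at a time."""
    for _, group in groupby(sorted_records, key=lambda r: (r[0], r[1])):
        seen_strands = set()          # reset for every new (chrom, pos)
        for _chrom, _pos, strand, name in group:
            is_dup = strand in seen_strands
            yield name, is_dup
            seen_strands.add(strand)

flags = dict(mark_duplicates(records))
print(flags)
```

The `groupby` call only works because equal positions are adjacent; on unsorted input the same logic would need a dictionary covering every position in the file.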

Conversely, counting reads mapping to features is easier for name-sorted files, as read pairs are next to each other, and secondary / supplementary alignments are next to primary alignments. Again, if they weren't, one would need to scan the whole file to build a hash of read names per feature mapped.
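The same streaming pattern applies here, just grouped by read name instead of position. The following is a minimal sketch with invented data and a made-up counting rule (a fragment is counted only if its alignments agree on exactly one feature); real counters such as featureCounts have more elaborate rules. `count_fragments` is a hypothetical helper.

```python
from collections import Counter
from itertools import groupby

# Toy name-sorted alignments: (read_name, mate_number, feature or None).
name_sorted = [
    ("frag1", 1, "geneA"),
    ("frag1", 2, "geneA"),
    ("frag2", 1, "geneB"),
    ("frag2", 2, None),      # mate falls outside any feature
    ("frag3", 1, "geneA"),
    ("frag3", 2, "geneB"),   # mates disagree, so the fragment is ambiguous
]

def count_fragments(alignments):
    """Count one fragment per feature, reading name-sorted input in one pass."""
    counts = Counter()
    for _, group in groupby(alignments, key=lambda a: a[0]):
        features = {feat for _, _, feat in group if feat is not None}
        if len(features) == 1:        # unambiguous assignment only
            counts[features.pop()] += 1
    return counts

fragment_counts = count_fragments(name_sorted)
print(dict(fragment_counts))
```

Because all alignments of a fragment arrive together, each fragment can be decided and forgotten immediately, with no per-name hash over the whole file.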

Most immediately, of course, as an end user you have to check which sort order your tool of choice expects and which settings it needs.

ADD COMMENT • link 5.8 years ago by h.mon 35k

1

My memory is a bit fuzzy on this, but I recall a discussion that a name- or coordinate-sorted BAM file can compress better (similar characters close together compress better). If no one else comments on this, I will check on it tomorrow.

ADD REPLY • link 5.8 years ago by GenoMax 148k

0

You are right, coordinate-sorted bam files are smaller than unsorted bam with the same compression level. In fact, even fastq files can be further compressed by clustering similar sequences, as is done by clumpify.sh from the BBTools package.

My intuition says name sorting wouldn't help much, if at all, to further compress a bam file.

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Do you recall whether name-sorted BAM files are smaller or larger (not sure by how much) than coordinate-sorted ones made from the same file? One of them should be smaller, since name-sorted BAMs will have similar fastq headers near each other. Based on your comment about clumpify, my feeling is that the name-sorted BAM may be smaller (not by a lot, but still) than the same file sorted by coordinate. If you don't check on it tonight, I will check it tomorrow.

ADD REPLY • link 5.8 years ago by GenoMax 148k

1

I don't recall, but I just ran some quick tests with small BAMs I have around (example files from several installed programs). I name-sorted and coordinate-sorted these files and compared the sizes:

ls -l -S

total 817060
-rw-r--r-- 1 hmon hmon 474147656 Mar 12 00:10 nsorted_f1.bam
-rw-r--r-- 1 hmon hmon 323868932 Mar 12 00:07 csorted_f1.bam
-rw-r--r-- 1 hmon hmon  16269272 Mar 12 00:17 nsorted_f4.bam
-rw-r--r-- 1 hmon hmon  15985146 Mar 12 00:17 csorted_f4.bam
-rw-r--r-- 1 hmon hmon   2596560 Mar 12 00:20 nsorted_f2.bam
-rw-r--r-- 1 hmon hmon   1762304 Mar 12 00:20 csorted_f2.bam
-rw-r--r-- 1 hmon hmon   1290775 Mar 12 00:17 nsorted_f3.bam
-rw-r--r-- 1 hmon hmon    736082 Mar 12 00:17 csorted_f3.bam

Coordinate-sorted files were always smaller.

ADD REPLY • link 5.8 years ago by h.mon 35k

0

Thanks for checking that. I guess having chromosome names lined up makes for better compression than the read names.

ADD REPLY • link 5.8 years ago by GenoMax 148k

0

It is not the chromosome names that compress better, it is the sequences - they are generally longer than chromosome names. When sorting by coordinate, similar or identical reads end up next to each other, improving compression.
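A quick way to see the effect on sequences alone is a toy simulation like the one below. It is only a sketch under stated assumptions: a random synthetic genome, error-free reads, plain `zlib` in place of BAM's BGZF blocks (both are DEFLATE-based), and the original sampling order standing in for name order. The constants and the helper `compressed_size` are invented for the example.

```python
import random
import zlib

random.seed(0)  # fixed seed so the toy run is repeatable

GENOME_LEN, READ_LEN, N_READS = 100_000, 100, 5_000
genome = "".join(random.choice("ACGT") for _ in range(GENOME_LEN))

# Random read start positions; this list order stands in for name order,
# which is unrelated to genomic position.
starts = [random.randrange(GENOME_LEN - READ_LEN) for _ in range(N_READS)]

def compressed_size(start_positions):
    """Bytes after zlib-compressing the reads concatenated in the given order."""
    text = "\n".join(genome[s:s + READ_LEN] for s in start_positions)
    return len(zlib.compress(text.encode("ascii"), 6))

name_order_size = compressed_size(starts)
coord_order_size = compressed_size(sorted(starts))
print(name_order_size, coord_order_size)
```

After sorting, each read overlaps heavily with its neighbours, so the compressor finds long repeated substrings within its window; in the unsorted order most overlapping reads are too far apart to be matched.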

I also faintly remember a thread where this issue was discussed in detail, but I can't find it. However, the question is not new, and has been discussed several times, e.g.:

sorting a BAM produces a smaller file than the original

Size of BAM file reduces after sorting with samtools

ADD REPLY • link 5.8 years ago by h.mon 35k