I'm running the dropseq pipeline and there is a part where the samfile gets sorted. It looks like there is an option to either sort by coordinates or by queryname. Is there a benefit to either of these?
Reading a file sequentially is faster than random access, and keeping in memory just the information necessary for some calculation is more efficient than keeping the whole file. Some tasks are easier to perform with one sort order than with the other, because the bam can then be read sequentially and only part of the data needs to be kept in memory.
For example, marking duplicates (which, for paired reads, is done by looking at 5' mapping positions of both reads) is a lot easier for bams sorted by position, because you guarantee the reads physically closer inside the bam are also close on the genome. If they weren't, one would need to scan the whole file to build a hash of reads per position in order to mark duplicates.
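To make the point above concrete, here is a minimal sketch of streaming duplicate marking on coordinate-sorted input. The record layout `(chrom, five_prime_pos, strand, read_name)` and the function name `mark_duplicates_sorted` are made up for illustration; real tools work on BAM records and, for paired reads, also use the mate's position. The only thing it is meant to show is that just the reads at the current position have to be held in memory.

```python
from itertools import groupby

# Hypothetical, simplified records: (chrom, five_prime_pos, strand, read_name).
# Assumption: the input is already sorted by (chrom, five_prime_pos).
def mark_duplicates_sorted(records):
    """Yield (record, is_duplicate) while holding only one position in memory."""
    for _, same_pos in groupby(records, key=lambda r: (r[0], r[1])):
        seen_strands = set()  # reset for every new position
        for rec in same_pos:
            strand = rec[2]
            is_dup = strand in seen_strands  # same position and strand seen before
            seen_strands.add(strand)
            yield rec, is_dup

records = [
    ("chr1", 100, "+", "readA"),
    ("chr1", 100, "+", "readB"),  # same position and strand as readA
    ("chr1", 100, "-", "readC"),
    ("chr1", 250, "+", "readD"),
    ("chr2", 100, "+", "readE"),
]
flags = [(rec[3], dup) for rec, dup in mark_duplicates_sorted(records)]
print(flags)
```

With unsorted input the same logic would need a dictionary keyed by every position in the file, which is the hash described above.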
Conversely, counting reads mapping to features is easier for name-sorted files, as read pairs are next to each other, and secondary / supplementary alignments are next to primary alignments. Again, if they weren't, one would need to scan the whole file to build a hash of read names per feature mapped.
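A rough sketch of why name order helps here: when all alignments of a fragment are adjacent, each fragment can be counted as soon as its name changes. The `(read_name, feature)` tuples, the `count_fragments` function and the rule "count only if all alignments agree on one feature" are assumptions for illustration, not how any particular counting tool is implemented.

```python
from collections import Counter
from itertools import groupby

# Hypothetical records: (read_name, feature); feature is None when unassigned.
# Assumption: the input is name-sorted, so all alignments of a fragment are adjacent.
def count_fragments(name_sorted_records):
    counts = Counter()
    for _, alignments in groupby(name_sorted_records, key=lambda r: r[0]):
        features = {feat for _, feat in alignments if feat is not None}
        if len(features) == 1:  # count the fragment once, only if unambiguous
            counts[features.pop()] += 1
    return counts

records = [
    ("frag1", "geneA"), ("frag1", "geneA"),  # both mates in geneA
    ("frag2", "geneA"), ("frag2", "geneB"),  # ambiguous, skipped
    ("frag3", "geneB"), ("frag3", None),     # one mate unassigned
]
fragment_counts = count_fragments(records)
print(dict(fragment_counts))
```

Only one fragment's alignments are in memory at a time; no per-name hash over the whole file is needed.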
Of course, most immediately as an end-user, one has to pay attention to which settings are necessary and which sorting order is expected by the tool of choice.
You are right, coordinate-sorted bam files are smaller than unsorted bam with the same compression level. In fact, even fastq files can be further compressed by clustering similar sequences, as is done by clumpify.sh from the BBTools package.
My intuition says name sorting wouldn't help much, if anything, to further compress a bam file.
Do you recall whether name-sorted BAM files are smaller or larger (not sure by how much) than coordinate-sorted ones made from the same file? One should be smaller, since name-sorted BAMs will have similar fastq headers near each other. Based on your comment about clumpify, my feeling is the name-sorted bam may be smaller (not by a lot, but still) than the same file sorted by coordinates. If you don't check on it tonight I will check it tomorrow.
I don't recall, but I just made some quick tests with small bams I have around (example files from several installed programs). I name- and coordinate-sorted these files and compared sizes:
ls -l -S
total 817060
-rw-r--r-- 1 hmon hmon 474147656 Mar 12 00:10 nsorted_f1.bam
-rw-r--r-- 1 hmon hmon 323868932 Mar 12 00:07 csorted_f1.bam
-rw-r--r-- 1 hmon hmon  16269272 Mar 12 00:17 nsorted_f4.bam
-rw-r--r-- 1 hmon hmon  15985146 Mar 12 00:17 csorted_f4.bam
-rw-r--r-- 1 hmon hmon   2596560 Mar 12 00:20 nsorted_f2.bam
-rw-r--r-- 1 hmon hmon   1762304 Mar 12 00:20 csorted_f2.bam
-rw-r--r-- 1 hmon hmon   1290775 Mar 12 00:17 nsorted_f3.bam
-rw-r--r-- 1 hmon hmon    736082 Mar 12 00:17 csorted_f3.bam
Coordinate-sorted files were always smaller.
It is not the chromosome names that compress better, it is the sequences - they are generally longer than chromosome names. When sorting by coordinate, similar or identical reads end up next to each other, improving compression.
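The effect can be reproduced on toy data. The sketch below simulates reads from a random "genome", writes them as simple tab-separated lines (not real BAM records), and compresses the text with zlib, the same DEFLATE family BAM uses. Genome size, read length, read count and the `compressed_size` helper are arbitrary choices for illustration, so the exact numbers mean nothing; only the direction of the difference is of interest.

```python
import random
import zlib

random.seed(0)

# Simulated data (all sizes are arbitrary assumptions for this sketch)
genome = "".join(random.choice("ACGT") for _ in range(200_000))
read_len = 100
reads = []
for i in range(10_000):
    start = random.randrange(len(genome) - read_len)
    reads.append((f"read{i:05d}", start, genome[start:start + read_len]))

def compressed_size(records):
    """Size in bytes of the zlib-compressed, tab-separated text of the records."""
    text = "\n".join(f"{name}\t{pos}\t{seq}" for name, pos, seq in records)
    return len(zlib.compress(text.encode(), 6))

# Coordinate order puts overlapping sequences next to each other;
# name order scatters them, although the names themselves become similar.
csorted_size = compressed_size(sorted(reads, key=lambda r: r[1]))
nsorted_size = compressed_size(sorted(reads, key=lambda r: r[0]))
print("coordinate-sorted:", csorted_size, "name-sorted:", nsorted_size)
```

In this toy setup the gain from adjacent overlapping sequences outweighs the gain from adjacent similar names, in line with the file sizes listed above.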
I also faintly remember a thread where this issue was discussed in detail, but I can't find it. However, the question is not new, and has been discussed several times, e.g.:
Sorting by coordinates is more efficient for visualisation in, for instance, a genome browser, because the file is queried by genomic location rather than by name. I can't think of one immediately, but I'm sure that sorting on names also has its use cases.
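As a simplified analogy for why location queries need coordinate order: once positions are sorted, a region can be located by binary search instead of scanning everything. The position list and the `reads_starting_in` function are made up for illustration; real BAM indexes (.bai) use a binning scheme plus a linear index over the compressed file rather than a plain sorted list.

```python
from bisect import bisect_left, bisect_right

# Hypothetical read start positions on one chromosome, coordinate-sorted
starts = [105, 230, 231, 480, 950, 1200, 1310]

def reads_starting_in(sorted_starts, region_start, region_end):
    """Return the start positions inside [region_start, region_end] via binary search."""
    lo = bisect_left(sorted_starts, region_start)
    hi = bisect_right(sorted_starts, region_end)
    return sorted_starts[lo:hi]

hits = reads_starting_in(starts, 200, 1000)
print(hits)
```

On a name-sorted list the same query would have to inspect every read, since position is unrelated to order in the file.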
I think paired-end reads would be guaranteed to be adjacent in a sort-by-name, whereas it wouldn't necessarily be so with a coordinate sorted BAM.
True. Quantification of paired-end reads when counting fragments (defined by the two mates) requires name-sorting. Tools like featureCounts will reorder the BAMs by query name, provided you specify paired-end input.