I have a WGS dataset, and when I attempt to use it with the sambamba depth
command, it gives the error sambamba-depth: All files must be coordinate-sorted.
What is the reason for this, and why is coordinate sorting required?
In an unsorted BAM file, reads can be in any random order. In a coordinate-sorted BAM file, reads are in the order in which they map to the reference genome. When they're sorted that way, to find the depth at a certain position, the program only needs to navigate to that position and count all reads that overlap it. As soon as it finds a read that starts beyond that position, the algorithm can stop looking.
In an unsorted BAM, the algorithm will need to look at every single read in the entire file before it's sure that all reads aligned to the position of interest are accounted for.
If your input has 500 positions, the sorted approach will mean going through the file once, jumping to each of the 500 positions. The unsorted approach will mean going through the file 500 times, which is extremely inefficient.
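To make the difference concrete, here is a minimal Python sketch. It is not how sambamba is implemented; the reads are hypothetical `(start, end)` pairs (0-based, end-exclusive) on a single chromosome, and the function names are made up for illustration. A real tool would also use the BAM index to jump near the position instead of starting from the first read.

```python
# Each read is a hypothetical (start, end) pair: 0-based, end-exclusive,
# all on the same chromosome. This is an illustration, not sambamba's code.

def depth_sorted(reads, pos):
    """Depth at `pos` when `reads` is sorted by start coordinate."""
    depth = 0
    for start, end in reads:
        if start > pos:
            # Sorted input: every later read also starts past `pos`,
            # so nothing further in the file can overlap it.
            break
        if end > pos:
            depth += 1
    return depth


def depth_unsorted(reads, pos):
    """Depth at `pos` with no ordering guarantee: every read is inspected."""
    return sum(1 for start, end in reads if start <= pos < end)


unsorted_reads = [(40, 90), (0, 50), (120, 170), (10, 60), (45, 95)]
sorted_reads = sorted(unsorted_reads)

print(depth_sorted(sorted_reads, 48))      # stops at the read starting at 120
print(depth_unsorted(unsorted_reads, 48))  # has to look at all five reads
```

Both functions give the same answer; the only difference is that the sorted version is allowed to stop early, which is what makes a single pass over a whole-genome file feasible.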
Hi, you can look at this figure from the mosdepth software. By traversing a sorted BAM file from the beginning, one can obtain depth information. This algorithm will be faster and more memory-efficient. I believe the principle of Sambamba depth is the same.
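As a rough illustration of that idea, the sketch below counts +1 at each read start and -1 at each read end, then takes a cumulative sum to get per-base depth. This is my reading of the approach mosdepth describes, not code from mosdepth or sambamba; the `(chrom, start, end)` tuples, `chrom_lengths`, and the function name are all hypothetical. The part that depends on sorting is that each chromosome arrives as one contiguous run, so only one array has to be in memory at a time.

```python
from itertools import accumulate, groupby


def per_chrom_depth(sorted_reads, chrom_lengths):
    """Yield (chrom, per-base depth list), one chromosome at a time.

    Assumes reads are coordinate-sorted, so all reads for a chromosome
    are adjacent and its array can be finished before the next one starts.
    """
    for chrom, group in groupby(sorted_reads, key=lambda r: r[0]):
        # One extra slot so a read ending at the chromosome end fits.
        delta = [0] * (chrom_lengths[chrom] + 1)
        for _, start, end in group:
            delta[start] += 1  # coverage begins at `start`
            delta[end] -= 1    # and stops just before `end`
        # The running total of the deltas is the depth at each base.
        yield chrom, list(accumulate(delta))[:-1]


# Hypothetical toy input: 0-based, end-exclusive coordinates.
chrom_lengths = {"chr1": 10, "chr2": 6}
reads = [("chr1", 0, 4), ("chr1", 2, 7), ("chr1", 3, 10), ("chr2", 1, 5)]

depths = dict(per_chrom_depth(reads, chrom_lengths))
print(depths["chr1"])
print(depths["chr2"])
```

With unsorted input, reads for `chr1` could turn up again after `chr2`, so the program would have to keep an array for every chromosome in memory until the end of the file.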
I'm moving this to a comment, as "use a different tool with the same requirement" is not an answer to "why is it a requirement".