Hi all, I am currently aligning and sorting some high-depth DNA data (WGS, WES). I run gatk SortSam with SORT_ORDER="queryname" before marking duplicates, mainly because it makes the output consistent: when I coordinate sort and then mark duplicates, I commonly see positions with multiple duplicates where a different read is marked as the primary read on each run, with the rest marked as duplicates.
With queryname sorting, the same read is always marked as the primary at a given position and the rest as duplicates. However, I noticed that for these high-depth samples, duplicate marking fails:
INFO 2020-04-01 10:39:36 MarkDuplicates Read 1,494,000,000 records.
Elapsed time: 05:36:07s. Time for last 1,000,000: 15s. Last read position: 19:56,701,510
INFO 2020-04-01 10:39:36 MarkDuplicates Tracking 52 as yet unmatched pairs. 0 records in RAM.
[Wed Apr 01 10:42:15 CDT 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 338.82 minutes.
Runtime.totalMemory()=31823757312
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: Exception writing ReadEnds to file.
...
Caused by: java.io.IOException: No space left on device
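
For reference, the two workflows I am comparing look roughly like this (paths are placeholders and extra options are omitted):

# queryname-sorted workflow (fails with the error above)
gatk SortSam -I aligned.bam -O qsorted.bam --SORT_ORDER queryname
gatk MarkDuplicates -I qsorted.bam -O marked.bam -M dup_metrics.txt

# coordinate-sorted workflow (completes successfully)
gatk SortSam -I aligned.bam -O csorted.bam --SORT_ORDER coordinate
gatk MarkDuplicates -I csorted.bam -O marked.bam -M dup_metrics.txt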
I am not looking for answers on how to fix this; I have seen many suggested fixes and none of them work. My question is:
When I coordinate sort and then mark duplicates for this same sample, it completes successfully. Why is MarkDuplicates so much more expensive (here, in temporary disk space) with queryname sorting than with coordinate sorting?