Hi all, I am currently aligning and sorting some high-depth DNA data (WGS, WES). I run gatk SortSam with SORT_ORDER="queryname" before marking duplicates, mainly because it makes the output consistent: when I coordinate sort and then mark duplicates, I commonly see positions with multiple duplicates where a different read is marked as the primary read on each run, with the rest marked as duplicates.
With queryname sorting, the same read is always marked as the primary at a given position and the rest as duplicates. However, I noticed that for these high-depth samples, duplicate marking fails:
INFO 2020-04-01 10:39:36 MarkDuplicates Read 1,494,000,000 records.
Elapsed time: 05:36:07s. Time for last 1,000,000: 15s. Last read position: 19:56,701,510
INFO 2020-04-01 10:39:36 MarkDuplicates Tracking 52 as yet unmatched pairs. 0 records in RAM.
[Wed Apr 01 10:42:15 CDT 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 338.82 minutes.
Runtime.totalMemory()=31823757312
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: Exception writing ReadEnds to file.
...
Caused by: java.io.IOException: No space left on device
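
For reference, the two workflows I am comparing look roughly like this (paths are placeholders and extra options are omitted):

# queryname-sorted workflow (fails with the error above)
gatk SortSam -I aligned.bam -O qsorted.bam --SORT_ORDER queryname
gatk MarkDuplicates -I qsorted.bam -O marked.bam -M dup_metrics.txt

# coordinate-sorted workflow (completes successfully)
gatk SortSam -I aligned.bam -O csorted.bam --SORT_ORDER coordinate
gatk MarkDuplicates -I csorted.bam -O marked.bam -M dup_metrics.txt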
I am not looking for answers on how to fix this; I have seen many suggested fixes and none of them work. My question is:
When I coordinate sort and then mark duplicates for this same sample, it completes successfully. Why is MarkDuplicates so much more expensive (here, in temporary disk space) with queryname sorting than with coordinate sorting?