Hi all,
We're revising our variant calling pipeline for NGS data from cancer patients, and a question came up:
Which step should come first, and why: base recalibration or marking duplicates?
Currently we recalibrate bases first and then mark duplicates.
I'm asking because we originally based part of our pipeline on the following article, which says to recalibrate bases and then mark duplicates: http://www.htslib.org/workflow/#mapping_to_variant
However, the following Broad Institute best practices page says the opposite: mark duplicates first, then recalibrate bases. I saw the same order in another paper as well: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS
Thanks in advance!
Alon
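Per the GATK Best Practices workflow (https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png), mark duplicates first, then run base recalibration. As for why: GATK's BaseRecalibrator filters out reads flagged as duplicates by default, so marking them first keeps duplicate reads from being counted when the recalibration model is built.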
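To make the order concrete, here is a minimal Python sketch that only builds the command lines and does not run them. It assumes GATK4-style syntax (the workflow image linked above is for GATK 3.6, where the invocation differs), and all file names (`sample.sorted.bam`, `ref.fasta`, `known_sites.vcf.gz`, and the intermediate outputs) are hypothetical placeholders. The helper `build_commands` is an illustration, not part of any GATK tooling.

```python
from typing import List


def build_commands(bam: str, ref: str, known_sites: str) -> List[List[str]]:
    """Return the three steps in Best Practices order (GATK4 syntax assumed)."""
    marked = "marked.bam"    # hypothetical intermediate file names
    table = "recal.table"
    return [
        # 1. Flag duplicates first, so later steps can skip them.
        ["gatk", "MarkDuplicates", "-I", bam, "-O", marked,
         "-M", "dup_metrics.txt"],
        # 2. Build the recalibration table from the duplicate-marked BAM.
        ["gatk", "BaseRecalibrator", "-I", marked, "-R", ref,
         "--known-sites", known_sites, "-O", table],
        # 3. Apply the recalibration to the same duplicate-marked BAM.
        ["gatk", "ApplyBQSR", "-I", marked, "-R", ref,
         "--bqsr-recal-file", table, "-O", "recal.bam"],
    ]


if __name__ == "__main__":
    for cmd in build_commands("sample.sorted.bam", "ref.fasta",
                              "known_sites.vcf.gz"):
        print(" ".join(cmd))
```

The point to notice is that the BAM fed to BaseRecalibrator is the output of MarkDuplicates, not the raw sorted BAM.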