ReadsPipelineSpark marking duplicates differently than MarkDuplicates

I am using GATK 4.1.0.0 to mark duplicates in, and recalibrate, my BAM. My current workflow is: 1. MarkDuplicates, 2. BaseRecalibrator, 3. ApplyBQSR.

Recently I have wanted to replace these with Spark-enabled pipelines to increase efficiency. I came across ReadsPipelineSpark, which marks duplicates in the BAM, but it flags a slightly different number of reads as duplicates (the total read count is the same).

for i in results_* ; do
    echo $i
    samtools view $i/bams/sampleA/sampleA.bam | wc -l          # total reads
    samtools view -F 1024 $i/bams/sampleA/sampleA.bam | wc -l  # reads NOT flagged as duplicates
done
results_ReadsPipelineSpark
315570
242745
results_regular
315570
243265

I am running both with the SUM_OF_BASE_QUALITIES duplicate scoring strategy (the default for both tools).

Does anybody understand why the two results would differ?
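One way to investigate is to extract the read names flagged as duplicates (SAM flag 1024) by each pipeline and diff the two sets, so you can inspect the specific reads the tools disagree on. Below is a minimal sketch of that approach; the `samtools` invocation is standard, but the fabricated `read1`..`read4` name lists here stand in for the real output, since in practice they would come from the two BAMs:

```shell
# In practice, generate each list from the corresponding BAM, e.g.:
#   samtools view -f 1024 results_ReadsPipelineSpark/bams/sampleA/sampleA.bam \
#       | cut -f1 | sort -u > spark_dups.txt
# Here we fabricate tiny example lists to illustrate the comparison step.
printf 'read1\nread2\nread3\n' > spark_dups.txt
printf 'read2\nread3\nread4\n' > regular_dups.txt

# comm -3 suppresses lines common to both sorted files, leaving only
# the reads that exactly one pipeline marked as a duplicate.
comm -3 spark_dups.txt regular_dups.txt
```

Once you have the disagreeing read names, pulling their full records with `samtools view` and comparing mapping positions and base qualities may show whether the difference comes from tie-breaking within duplicate sets or from a genuinely different grouping of reads.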

alignment gatk DNA-Seq
