4.0 years ago by from the mountains
I am using GATK 4.1.0.0 to mark duplicates and recalibrate base qualities in my BAM. My current workflow is: 1. MarkDuplicates 2. BaseRecalibrator 3. ApplyBQSR
Recently I have wanted to replace these with the Spark-enabled pipelines to improve efficiency. I came across ReadsPipelineSpark, which also marks duplicates in the BAM, but it flags a slightly different number of duplicate reads (the total read count is the same).
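For context, a rough sketch of the non-Spark commands I run (reference and known-sites paths are placeholders, not my real ones):

```
gatk MarkDuplicates -I sampleA.bam -O sampleA.md.bam -M sampleA.md_metrics.txt
gatk BaseRecalibrator -I sampleA.md.bam -R ref.fasta --known-sites known.vcf -O recal.table
gatk ApplyBQSR -I sampleA.md.bam -R ref.fasta --bqsr-recal-file recal.table -O sampleA.recal.bam
```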
for i in results_*; do
  echo "$i"
  samtools view "$i"/bams/sampleA/sampleA.bam | wc -l
  samtools view -F 1024 "$i"/bams/sampleA/sampleA.bam | wc -l
done
results_ReadsPipelineSpark
315570
242745
results_regular
315570
243265
I am running both with the SUM_OF_BASE_QUALITIES duplicate scoring strategy (the default for both).
Does anybody understand why the two results would differ?
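To narrow this down, one thing I tried is diffing the sets of read names each pipeline flagged as duplicate (0x400). A sketch, assuming the same output paths as in the loop above (`comm` requires sorted input):

```shell
# Read names marked duplicate by the regular pipeline
samtools view -f 1024 results_regular/bams/sampleA/sampleA.bam \
  | cut -f1 | sort -u > regular_dups.txt

# Read names marked duplicate by ReadsPipelineSpark
samtools view -f 1024 results_ReadsPipelineSpark/bams/sampleA/sampleA.bam \
  | cut -f1 | sort -u > spark_dups.txt

# Names flagged only by the regular pipeline (column 1 unique to file 1)
comm -23 regular_dups.txt spark_dups.txt
```

Feeding a few of the resulting read names back through samtools view should show whether the disagreement clusters at particular positions or duplicate sets.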