Hi all
I have a quick question about Picard MarkDuplicates. I have ATAC-seq data, already filtered for mitochondrial and unmapped reads. The initial file has about 59 million reads. When I run MarkDuplicates like so:
java -jar /exports/igmm/eddie/hill-lab/Zoe/References_and_Scripts/picard-tools-2.5.0/picard.jar MarkDuplicates I=Mutant1_paired_align_subMitoUnc_sorted.bam O=Mutant1_align_filtered.bam M=Mutant1_test_metrics.txt REMOVE_DUPLICATES=true
the output file has 39 million reads.
However if I run it with REMOVE_DUPLICATES=FALSE and then use samtools to remove the 1024 flagged reads I end up with 56 million reads. I really can't understand why using REMOVE_DUPLICATES=TRUE causes such a difference. Should the output of both methods not be similar? Thanks in advance!
All the best, Zoe
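For anyone unsure about the flag arithmetic: 1024 is the decimal form of the SAM flag bit 0x400 (PCR or optical duplicate), and samtools view -F drops any record that has at least one of the listed bits set. Below is a minimal Python sketch of that filter logic. It is not samtools' actual code, and the function name passes_filter and the example flags are made up for illustration.

```python
DUPLICATE = 0x400  # SAM flag bit for PCR or optical duplicates (1024 in decimal)


def passes_filter(flag, require=0, exclude=0):
    """Mimic the -f/-F logic of samtools view (illustrative only).

    A record is kept only if every bit in `require` is set (-f)
    and no bit in `exclude` is set (-F).
    """
    return (flag & require) == require and (flag & exclude) == 0


# Made-up example flags: 99 is a properly paired first mate,
# 1123 is the same flag with the duplicate bit added (99 + 1024).
for flag in (99, 1123):
    print(flag, passes_filter(flag, exclude=DUPLICATE))
```

With exclude=0x400, the record flagged 99 is kept and the one flagged 1123 is dropped, which is what "removing the 1024 flagged reads" amounts to.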
Hi, thanks for the reply. But I'm still a bit confused. It's not that using Samtools rmdup is removing fewer reads, I have never even tried it. It's just that when I remove the reads flagged as duplicates by MarkDuplicates, rather than using its own REMOVE_DUPLICATES=TRUE option, I get different results. Should the REMOVE_DUPLICATES=TRUE option not just remove the reads it is flagging? (It appears to remove a hell of a lot more.)
"However if I run it with REMOVE_DUPLICATES=FALSE and then use samtools to remove the 1024 flagged reads I end up with 56 million reads"
" It's not that using Samtools rmdup is removing fewer reads, I have never even tried it"
How do you expect an answer if you can't give clear information in the question?
1) State clearly what your goal is.
2) State clearly what you have done.
3) State clearly the results you have got from what you did.
4) State clearly what is confusing you.
Sorry if that wasn't clear enough.
Original read file: 59 million reads. The commands are as follows:
java -jar /exports/igmm/eddie/hill-lab/Zoe/References_and_Scripts/picard-tools-2.5.0/picard.jar MarkDuplicates I=Mutant1_paired_align_subMitoUnc_sorted.bam O=Mutant1_align_filtered.bam M=Mutant1_test_metrics.txt REMOVE_DUPLICATES=true
Output: 39 million reads
OR
java -jar /exports/igmm/eddie/hill-lab/Zoe/References_and_Scripts/picard-tools-2.5.0/picard.jar MarkDuplicates I=Mutant1_paired_align_subMitoUnc_sorted.bam O=Mutant1_align_filtered.bam M=Mutant1_test_metrics.txt REMOVE_DUPLICATES=false
Then:
samtools view -F 0x400 Mutant1_align_filtered.bam > Mutant1_align_filtered_2.bam
Output: 56 million reads
Question: should the number of reads flagged 0x400 not match the number of reads removed when REMOVE_DUPLICATES=true? In this case it does not: one approach removes about 3 million reads and the other about 20 million.
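One way to narrow this down is to tally the records in the REMOVE_DUPLICATES=false output by flag, so that both counts are taken the same way. The Python sketch below is only an illustration and does not claim to explain the gap. It assumes the input is SAM text (for example, lines produced by samtools view on the marked file). The function name tally_sam_flags and the sample records are made up. Note that it counts alignment records rather than read names, and reports secondary and supplementary alignments separately because they are extra records for reads that are already counted.

```python
from collections import Counter

DUPLICATE = 0x400      # PCR or optical duplicate
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def tally_sam_flags(lines):
    """Count alignment records in SAM text, split by the duplicate bit."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        counts["records"] += 1
        if flag & DUPLICATE:
            counts["duplicate"] += 1
        if flag & (SECONDARY | SUPPLEMENTARY):
            counts["secondary_or_supplementary"] += 1
    # Records that `samtools view -F 0x400` would keep.
    counts["kept_by_-F_0x400"] = counts["records"] - counts["duplicate"]
    return counts


# Made-up records for illustration; real input would come from the BAM file.
sample = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    "read1\t147\tchr1\t200\t60\t50M\t=\t100\t-150\t*\t*",
]
for name, value in sorted(tally_sam_flags(sample).items()):
    print(name, value)
```

If kept_by_-F_0x400 matches the record count of the REMOVE_DUPLICATES=true output, the difference probably lies in how the reads were counted. If it does not match, the two output files were likely not produced from the same marking step.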