Hello Bio Stars,
I have a question about duplicate removal from a BAM file. I used two tools: "samtools rmdup" and Picard MarkDuplicates.
I would like to understand why the two tools remove different numbers of duplicate reads.
The documentation says:
Picard MarkDuplicates: The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
Samtools rmdup: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set
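To make the quoted samtools rule concrete, here is a minimal Python sketch of "identical external coordinates, keep the pair with the highest mapping quality". It is only an illustration on toy data, not the actual samtools code: the `Pair` record, its fields, and the function name are hypothetical, and real tools also take orientation and other details into account.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """Hypothetical, simplified record for one read pair."""
    name: str
    chrom: str
    start: int  # leftmost coordinate of the pair
    end: int    # rightmost coordinate of the pair
    mapq: int   # mapping quality used to pick the survivor


def dedup_by_external_coords(pairs):
    """Keep one pair per (chrom, start, end), preferring the highest MAPQ."""
    best = {}
    for p in pairs:
        key = (p.chrom, p.start, p.end)
        # First pair seen for this key, or a strictly better MAPQ, wins.
        if key not in best or p.mapq > best[key].mapq:
            best[key] = p
    return list(best.values())


pairs = [
    Pair("a", "chr1", 100, 350, 60),
    Pair("b", "chr1", 100, 350, 30),  # same outer coordinates as "a", lower MAPQ
    Pair("c", "chr1", 100, 360, 60),  # different end, so not a duplicate of "a"
    Pair("d", "chr2", 100, 350, 20),  # different chromosome
]
kept = dedup_by_external_coords(pairs)
print([p.name for p in kept])
```

A tool that defines the key differently (for example, from the 5 prime positions of each read) or picks the survivor by a different score will keep a different set of reads from the same input.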
Before removing duplicates
57757837 + 0 in total (QC-passed reads + QC-failed reads)
3902505 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
57757837 + 0 mapped (100.00% : N/A)
53855332 + 0 paired in sequencing
27555734 + 0 read1
26299598 + 0 read2
44132336 + 0 properly paired (81.95% : N/A)
45979358 + 0 with itself and mate mapped
7875974 + 0 singletons (14.62% : N/A)
278406 + 0 with mate mapped to a different chr
123930 + 0 with mate mapped to a different chr (mapQ>=5)
Samtools rmdup result
command: samtools rmdup -S input.bam output.bam
17595767 + 0 in total (QC-passed reads + QC-failed reads)
1161712 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
17595767 + 0 mapped (100.00% : N/A)
16434055 + 0 paired in sequencing
8398895 + 0 read1
8035160 + 0 read2
14057950 + 0 properly paired (85.54% : N/A)
14586355 + 0 with itself and mate mapped
1847700 + 0 singletons (11.24% : N/A)
75278 + 0 with mate mapped to a different chr
38992 + 0 with mate mapped to a different chr (mapQ>=5)
Picard MarkDuplicates result
command: java -jar /apps/picard.jar MarkDuplicates I=input.bam O=output.bam M=marked_dup_metrics.txt REMOVE_DUPLICATES=true
41909982 + 0 in total (QC-passed reads + QC-failed reads)
3902505 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
41909982 + 0 mapped (100.00% : N/A)
38007477 + 0 paired in sequencing
19124624 + 0 read1
18882853 + 0 read2
34855826 + 0 properly paired (91.71% : N/A)
36456720 + 0 with itself and mate mapped
1550757 + 0 singletons (4.08% : N/A)
244146 + 0 with mate mapped to a different chr
107980 + 0 with mate mapped to a different chr (mapQ>=5)
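To put the three flagstat reports side by side, the short Python sketch below computes how many records each tool removed, using only the "in total" and "secondary" counts reported above. The variable names and dictionary layout are just for illustration.

```python
# "in total" and "secondary" counts copied from the flagstat reports above.
before = {"total": 57757837, "secondary": 3902505}
after = {
    "samtools rmdup": {"total": 17595767, "secondary": 1161712},
    "Picard MarkDuplicates": {"total": 41909982, "secondary": 3902505},
}

removed = {}
for tool, counts in after.items():
    removed_total = before["total"] - counts["total"]
    removed_secondary = before["secondary"] - counts["secondary"]
    removed[tool] = (removed_total, removed_secondary)
    pct = 100.0 * removed_total / before["total"]
    print(f"{tool}: removed {removed_total} records ({pct:.1f}% of input), "
          f"of which {removed_secondary} were secondary alignments")
```

In these numbers, samtools rmdup removed far more records than Picard, and it also dropped secondary alignments, whereas the secondary count is unchanged in the Picard output.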
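Because they use different algorithms. Picard MarkDuplicates seems better. Note also that the -S option tells samtools rmdup to treat paired-end reads as single-end reads, which can remove many more reads than a pair-aware comparison.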
When you ask this kind of question, please provide the command lines you used so that people can reproduce what you did.
I have edited my query.