Picard vs Samtools duplicate removal
2
2
Entering edit mode
8.2 years ago
conorproud89 ▴ 20

Duplicate reads have first been removed using picard:

java -jar -Xmx3g picard/dist/picard.jar MarkDuplicates INPUT=input.bam OUTPUT=output.bam METRICS_FILE=output.dup_metrics CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT

When I run samtools flagstat on the output bamfile I get the following:

2182812 + 0 in total (QC-passed reads + QC-failed reads)
226710 + 0 duplicates
2176925 + 0 mapped (99.73%:-nan%)
2182812 + 0 paired in sequencing
1091406 + 0 read1
1091406 + 0 read2
2156992 + 0 properly paired (98.82%:-nan%)
2171322 + 0 with itself and mate mapped
5603 + 0 singletons (0.26%:-nan%)
9776 + 0 with mate mapped to a different chr
7030 + 0 with mate mapped to a different chr (mapQ>=5)

So presumably the file still contains 226710 duplicate reads. If I filter out duplicates according to this table:

Flag        Chr     Description
0x0001      p       the read is paired in sequencing
0x0002      P       the read is mapped in a proper pair
0x0004      u       the query sequence itself is unmapped
0x0008      U       the mate is unmapped
0x0010      r       strand of the query (1 for reverse)
0x0020      R       strand of the mate
0x0040      1       the read is the first read in a pair
0x0080      2       the read is the second read in a pair
0x0100      s       the alignment is not primary
0x0200      f       the read fails platform/vendor quality checks
0x0400      d       the read is either a PCR or an optical duplicate

using the command:

samtools view -F 400 output.bam | wc -l

I get 506072, not: total reads (2182812) - duplicates (226710) = 1956102

My question is why does samtools flagstat indicate that there are still duplicates present after running picard tools, and why are these figures inconsistent when I attempt to filter out duplicates using samtools? I've been asked to remove duplicates for a project. At the moment I am very confused as to which method I should use given the inconsistencies in results.

sequencing alignment next-gen Assembly • 14k views
ADD COMMENT
7
Entering edit mode
8.2 years ago

Picard MarkDuplicates marks duplicates, rather than removing them. Instead, when it finds a duplicate read it sets the duplicate flag to true and then outputs it. To remove the duplicates you need to add

REMOVE_DUPLICATES=true

to the command line.

ADD COMMENT
5
Entering edit mode
8.2 years ago

samtools view -F 400 output.bam

To filter out duplicates use -F 1024 or -F 0x0400, not 400. See also https://broadinstitute.github.io/picard/explain-flags.html

ADD COMMENT

Login before adding your answer.

Traffic: 1620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6