Duplicate reads have first been removed using picard:
java -jar -Xmx3g picard/dist/picard.jar MarkDuplicates INPUT=input.bam OUTPUT=output.bam METRICS_FILE=output.dup_metrics CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT
When I run samtools flagstat on the output bamfile I get the following:
2182812 + 0 in total (QC-passed reads + QC-failed reads)
226710 + 0 duplicates
2176925 + 0 mapped (99.73%:-nan%)
2182812 + 0 paired in sequencing
1091406 + 0 read1
1091406 + 0 read2
2156992 + 0 properly paired (98.82%:-nan%)
2171322 + 0 with itself and mate mapped
5603 + 0 singletons (0.26%:-nan%)
9776 + 0 with mate mapped to a different chr
7030 + 0 with mate mapped to a different chr (mapQ>=5)
So presumably the file still contains 226710 duplicate reads. If I filter out duplicates according to this table:
Flag Chr Description
0x0001 p the read is paired in sequencing
0x0002 P the read is mapped in a proper pair
0x0004 u the query sequence itself is unmapped
0x0008 U the mate is unmapped
0x0010 r strand of the query (1 for reverse)
0x0020 R strand of the mate
0x0040 1 the read is the first read in a pair
0x0080 2 the read is the second read in a pair
0x0100 s the alignment is not primary
0x0200 f the read fails platform/vendor quality checks
0x0400 d the read is either a PCR or an optical duplicate
using the command:
samtools view -F 400 output.bam | wc -l
I get 506072, not: total reads (2182812) - duplicates (226710) = 1956102
My question is why does samtools flagstat indicate that there are still duplicates present after running picard tools, and why are these figures inconsistent when I attempt to filter out duplicates using samtools? I've been asked to remove duplicates for a project. At the moment I am very confused as to which method I should use given the inconsistencies in results.