MarkDuplicates gives too few reads
1
0
Entering edit mode
2.3 years ago
wormball ▴ 10

Hello!

I am trying to call mutations from tumor only human samples (enriched with gene panel) from illumina. One of the steps includes marking duplicates like this:

picard MarkDuplicates  INPUT=$ALIGNED_BAM OUTPUT=$MARKDUP_BAM  METRICS_FILE=$MARKDUP_METRICS  REMOVE_DUPLICATES='false' ASSUME_SORTED='true'  DUPLICATE_SCORING_STRATEGY='SUM_OF_BASE_QUALITIES'  OPTICAL_DUPLICATE_PIXEL_DISTANCE='100'   VALIDATION_STRINGENCY='LENIENT' QUIET=true VERBOSITY=ERROR

But this step returns conspicuously few reads, e. g. at one mutant locus i have:

Before:                          After:

chr1:155 904 799                 chr1:155 904 799
Total count: 8041                Total count: 323
A : 8 (0%, 8+, 0- )              A : 0
C : 7903 (98%, 4353+, 3550- )    C : 259 (80%, 163+, 96- )
G : 0                            G : 0
T : 130 (2%, 0+, 130- )          T : 64 (20%, 0+, 64- )
N : 0                            N : 0

You may notice that it selectively removes mostly the prevalent (reference) reads.

However in the loci not covered by the gene panel and therefore less represented it acts less eagerly, e. g.:

Before:                          After:

chr1:155 902 971                 chr1:155 902 971
Total count: 48                  Total count: 18
A : 38 (79%, 24+, 14- )          A : 14 (78%, 6+, 8- )
C : 0                            C : 0
G : 10 (21%, 10+, 0- )           G : 4 (22%, 4+, 0- )
T : 0                            T : 0
N : 0                            N : 0

Is this the way it is supposed to work? If no, how can i make it leave more of my precious reads? If yes, what could be a problem with my sequencing?

Thanks in advance.

MarkDuplicates illumina gatk picard • 2.2k views
ADD COMMENT
0
Entering edit mode

You have remove duplicates set to false so no reads should be removed.

ADD REPLY
0
Entering edit mode

I already have REMOVE_DUPLICATES='false' . It writes file about as big as the input file, but IGV does not show most of the reads in the resulting file, and other GATK tools seem to ignore them.

ADD REPLY
1
Entering edit mode

Check overrepresented reads in Fastqc report and the number of duplicate reads with samtools flagstat. If the layout is paired, check library complexity with picard EstimateLibraryComplexity.

ADD REPLY
0
Entering edit mode

What does this mean?

samtools flagstat before.bam
147235735 + 0 in total (QC-passed reads + QC-failed reads)
17012633 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
146920983 + 0 mapped (99.79% : N/A)
130223102 + 0 paired in sequencing
65111551 + 0 read1
65111551 + 0 read2
1065758 + 0 properly paired (0.82% : N/A)
129908350 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
4594246 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat after.bam 
147235735 + 0 in total (QC-passed reads + QC-failed reads)
17012633 + 0 secondary
0 + 0 supplementary
98431530 + 0 duplicates
146920983 + 0 mapped (99.79% : N/A)
130223102 + 0 paired in sequencing
65111551 + 0 read1
65111551 + 0 read2
1065758 + 0 properly paired (0.82% : N/A)
129908350 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
4594246 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

picard EstimateLibraryComplexity https://pastebin.com/sCi7iNHh

ADD REPLY
0
Entering edit mode

I found that i passed two first read fastq files instead of first and second reads. :( However with correct input files it stays basically the same.

ADD REPLY
0
Entering edit mode

Can you share the EstimateLibraryComplexity for the current alignment and the samtools flagstat output after the mark duplicates step is performed?

ADD REPLY
0
Entering edit mode
samtools flagstat _S3.MarkDuplicates.bam 
148208552 + 0 in total (QC-passed reads + QC-failed reads)
17985450 + 0 secondary
0 + 0 supplementary
94933505 + 0 duplicates
147744925 + 0 mapped (99.69% : N/A)
130223102 + 0 paired in sequencing
65111551 + 0 read1
65111551 + 0 read2
122185128 + 0 properly paired (93.83% : N/A)
129457530 + 0 with itself and mate mapped
301945 + 0 singletons (0.23% : N/A)
5061890 + 0 with mate mapped to a different chr
3421754 + 0 with mate mapped to a different chr (mapQ>=5)

picard EstimateLibraryComplexity https://pastebin.com/KiecjBD7

ADD REPLY
0
Entering edit mode

In IGV make sure you have "filter duplicate reads" option unchecked in Preferences --> Alignments That may simply be the reason why IGV is not showing the reads after dup marking.

dups

ADD REPLY
1
Entering edit mode
2.3 years ago
ATpoint 85k

Gene panel enrichment is expected to have large duplication rates, that is why duplicates are generally not remove in these types of assay. Gene panels are typically created with custom primers / probes that do produce identical or near-identical amplicons. It is the nature of the assay that duplication rate (duplication = fragment with identical start/end coordinates after mapping) is excessive, just by design.

ADD COMMENT
1
Entering edit mode

Duplicates are not being removed in this case. I think the confusion is because IGV does not show the duplicates if the preference setting above is set to filter duplicate reads. So when @wormball checks the dup marked files in IGV they seem to think that reads were removed.

ADD REPLY
0
Entering edit mode

I know that they are not removed, but when i run Mutect2 on this data, it shows mutation read depth of about several hundreds or even less (despite peak read depths over 10 thousand), so i think it ignores such duplicates.

Also i wonder whether gatk ignores improper read pairs. Mate reads are aligned to different chromosomes

ADD REPLY
0
Entering edit mode

On the other hand if they were fragmentation coincidences, they should generally have only one coinciding end of two, however my case they have both ends coinciding. So i think they are actually PCR artifacts. Mate reads are aligned to different chromosomes

ADD REPLY
0
Entering edit mode

Gene panels with amplicons have no fragmentation. The PCR creates the amplicons based on the primers right away.

ADD REPLY
0
Entering edit mode

I was told that in this case mechanical fragmentation was used.

ADD REPLY

Login before adding your answer.

Traffic: 1790 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6