Hi, I am using the Picard CollectHsMetrics tool to obtain a bunch of statistics about the regions in my targeted panel, such as mean coverage overall and per interval. Since the panel is small, there is a huge number of duplicate reads, and CollectHsMetrics only takes unique reads into account, so the reported mean coverage is smaller than it actually is. Is there a switch that I can turn on/off so that all reads (including duplicates) are counted, not just unique reads?
Here is how I run it:
java -jar picard.jar CollectHsMetrics I=sample.bam O=sample.realigned.metrics TI=intervals.bed PER_TARGET_COVERAGE=sample.target_region.coverage BI=intervals.bed REFERENCE_SEQUENCE=hg19.fa
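(One caveat in case it trips anyone else up: the Picard builds I'm aware of expect TI=/BI= to point at files in Picard's interval_list format rather than plain BED. If your version rejects the BED files, a conversion along these lines should work; the file names simply mirror the command above, and hg19.dict is assumed to be the sequence dictionary for hg19.fa:)

java -jar picard.jar CreateSequenceDictionary R=hg19.fa O=hg19.dict
java -jar picard.jar BedToIntervalList I=intervals.bed O=intervals.interval_list SD=hg19.dict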
BTW, the input file (sample.bam) has already been through duplicate marking. Even if I run CollectHsMetrics before marking duplicates, the results are the same.
Any help will be appreciated. Thanks!
Usually, if there are N duplicate reads, N-1 of them are marked as duplicates; the "best" copy is left unmarked. Here's a little example of a before/after of Picard MarkDuplicates:
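(This is a mocked-up illustration rather than real output: the read names, positions, and CIGARs are invented, and only the columns that matter here are shown. The second column is the SAM FLAG field; 1024 is the 0x400 "PCR or optical duplicate" bit.)

java -jar picard.jar MarkDuplicates I=before.bam O=after.bam M=dup_metrics.txt

Before (all flags 0):
A1  0     chr1  1000  60  50M
A2  0     chr1  1000  60  50M
A3  0     chr1  2000  60  50M
A4  0     chr1  3000  60  50M

After (only one of the two duplicates gets flagged):
A1  1024  chr1  1000  60  50M
A2  0     chr1  1000  60  50M
A3  0     chr1  2000  60  50M
A4  0     chr1  3000  60  50M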
I've made A1 and A2 duplicates, and A3 and A4 unique, and Picard correctly marks only A1 as a duplicate.
So, to answer your question: coverage, to me anyway, is the number of bases of the genome covered by sequenced reads. To others, it's the amount of the genome covered by fragments/templates that enter the sequencer. For others still, it's the amount of the genome covered by signal, which obviously depends on your assay.
In all of these situations, however, duplicates should not affect the calculated coverage. Do you mean depth?
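(To make the distinction concrete, with invented numbers: if a 1,000 bp target is covered end-to-end by 600 aligned 100 bp reads, then

mean depth = total aligned bases / target length = (600 x 100) / 1,000 = 60X

while the coverage/breadth of that target is simply 100%, and stays 100% whether or not half of those reads are duplicates.)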
Yes, sorry, I meant depth. HsMetrics is reporting a mean depth of 2X in regions where I can see from the BAM file in IGV that there are 30 reads mapping at that location. Please advise.
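For what it's worth, here is the sanity check I can run on one such region to see whether the duplicate flag accounts for the gap (the coordinates below are just a placeholder):

samtools view -c -F 0x400 sample.bam chr1:1234000-1234200   # duplicate-flagged reads excluded, roughly what HsMetrics counts
samtools view -c sample.bam chr1:1234000-1234200            # every record overlapping the region, duplicates included

(This assumes sample.bam is coordinate-sorted and indexed.)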