Hi, I am using the Picard CollectHsMetrics tool to obtain a bunch of statistics about the regions in my targeted panel, such as mean coverage overall and per interval. Since the panel is small, there is a huge number of duplicate reads, and CollectHsMetrics only takes unique reads into account, so the reported mean coverage is smaller than it actually is. Is there a switch that I can turn on/off so that all reads (including duplicates) are counted, not just unique reads?
Here is how I run it:
java -jar picard.jar CollectHsMetrics I=sample.bam O=sample.realigned.metrics TI=intervals.bed PER_TARGET_COVERAGE=sample.target_region.coverage BI=intervals.bed REFERENCE_SEQUENCE=hg19.fa
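(One caveat in case it trips anyone else up: the Picard builds I'm aware of expect TI=/BI= to point at files in Picard's interval_list format rather than plain BED. If your version rejects the BED files, a conversion along these lines should work; the file names simply mirror the command above, and hg19.dict is assumed to be the sequence dictionary for hg19.fa:)

java -jar picard.jar CreateSequenceDictionary R=hg19.fa O=hg19.dict
java -jar picard.jar BedToIntervalList I=intervals.bed O=intervals.interval_list SD=hg19.dict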
BTW, the input file (sample.bam) has already been through duplicate marking. Even if I run CollectHsMetrics before marking duplicates, the results are the same.
Any help will be appreciated. Thanks!
Usually, if there are N duplicate reads, N-1 of them are marked as duplicates; the "best" copy is left unmarked. Here's a little example of a before/after of Picard MarkDuplicates:
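(This is a mocked-up illustration rather than real output: the read names, positions, and CIGARs are invented, and only the columns that matter here are shown. The second column is the SAM FLAG field; 1024 is the 0x400 "PCR or optical duplicate" bit.)

java -jar picard.jar MarkDuplicates I=before.bam O=after.bam M=dup_metrics.txt

Before (all flags 0):
A1  0     chr1  1000  60  50M
A2  0     chr1  1000  60  50M
A3  0     chr1  2000  60  50M
A4  0     chr1  3000  60  50M

After (only one of the two duplicates gets flagged):
A1  1024  chr1  1000  60  50M
A2  0     chr1  1000  60  50M
A3  0     chr1  2000  60  50M
A4  0     chr1  3000  60  50M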
I've made A1 and A2 duplicates, and A3 and A4 unique, and Picard correctly marks only A1 as a duplicate.
So, to answer your question: coverage, to me anyway, is the number of bases of the genome covered by sequenced reads. To others, it's the amount of the genome covered by fragments/templates that enter the sequencer. For others still, it's the amount of the genome covered by signal, which obviously depends on your assay.
In all of these situations, however, duplicates should not affect the calculated coverage. Do you mean depth?
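(To make the distinction concrete, with invented numbers: if a 1,000 bp target is covered end-to-end by 600 aligned 100 bp reads, then

mean depth = total aligned bases / target length = (600 x 100) / 1,000 = 60X

while the coverage/breadth of that target is simply 100%, and stays 100% whether or not half of those reads are duplicates.)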
Yes, sorry, I meant depth. HsMetrics is reporting a mean depth of 2X in regions where I can see from the BAM file in IGV that there are 30 reads mapping at that location. Please advise.
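For what it's worth, here is the sanity check I can run on one such region to see whether the duplicate flag accounts for the gap (the coordinates below are just a placeholder):

samtools view -c -F 0x400 sample.bam chr1:1234000-1234200   # duplicate-flagged reads excluded, roughly what HsMetrics counts
samtools view -c sample.bam chr1:1234000-1234200            # every record overlapping the region, duplicates included

(This assumes sample.bam is coordinate-sorted and indexed.)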