Clarification on sequence duplication levels and Qualimap estimated duplication rate

20 months ago

keremozdel9 ▴ 10

Hello,

I'm working with Illumina paired-end reads. To increase coverage depth, I merged the data from two different runs of the same sample. First, I removed duplicates using Picard MarkDuplicates:

java -jar picard.jar MarkDuplicates \
  INPUT=input.bam \
  OUTPUT=dedup_output.bam \
  METRICS_FILE=dup_metrics.txt \
  REMOVE_DUPLICATES=true

Then, when I checked the quality of my alignment, I observed a duplication rate of 46% in the Qualimap results. According to the Qualimap documentation, Qualimap estimates duplication from the start positions of read alignments. In FastQC, the estimated sequence duplication rate is around 27%. Here is my MarkDuplicates output:

Metric                            Value
LIBRARY                           Unknown Library
UNPAIRED_READS_EXAMINED           14184319
READ_PAIRS_EXAMINED               1029778028
SECONDARY_OR_SUPPLEMENTARY_RDS    25072251
UNMAPPED_READS                    52291303
UNPAIRED_READ_DUPLICATES          8394435
READ_PAIR_DUPLICATES              248696417
READ_PAIR_OPTICAL_DUPLICATES      9802236
PERCENT_DUPLICATION               0.243901
ESTIMATED_LIBRARY_SIZE            1822194394
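As a sanity check on these metrics, PERCENT_DUPLICATION can be recomputed from the other columns. This is a minimal sketch, assuming Picard's documented definition in which each read pair counts as two reads; the `percent_duplication` helper is just an illustrative name, not part of Picard.

```python
# Values copied from the MarkDuplicates metrics shown above.
metrics = {
    "UNPAIRED_READS_EXAMINED": 14184319,
    "READ_PAIRS_EXAMINED": 1029778028,
    "UNPAIRED_READ_DUPLICATES": 8394435,
    "READ_PAIR_DUPLICATES": 248696417,
}


def percent_duplication(m):
    """Fraction of examined reads flagged as duplicates.

    Assumption: every read pair contributes two reads to both the
    numerator and the denominator.
    """
    dup_reads = m["UNPAIRED_READ_DUPLICATES"] + 2 * m["READ_PAIR_DUPLICATES"]
    examined_reads = m["UNPAIRED_READS_EXAMINED"] + 2 * m["READ_PAIRS_EXAMINED"]
    return dup_reads / examined_reads


print(round(percent_duplication(metrics), 6))
```

The result agrees with the reported 0.243901, so the roughly 24% figure describes the input BAM before duplicates were removed, not the deduplicated output.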

I am confused by this higher duplication rate in the Qualimap results, and I have two questions on this topic:

1. Is there a distinction between sequence duplication levels and the duplication rate estimated by Qualimap?
2. Could the observed 46% duplication rate estimated by Qualimap be attributed to the merging of data at the beginning of the analysis?
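To make the first question concrete, here is a toy sketch of how the two kinds of estimate could diverge. It is only an illustration under my own assumptions: the reads are made up, and neither function is Qualimap's or Picard's actual implementation. One count treats any reads sharing a start position as duplicates, the other also requires the same strand and mate position.

```python
from collections import Counter

# Hypothetical toy reads: (chrom, start, strand, mate_start).
reads = [
    ("chr1", 100, "+", 300),
    ("chr1", 100, "+", 300),  # identical start, strand and mate position
    ("chr1", 100, "+", 350),  # same start, different mate position
    ("chr1", 100, "-", 20),   # same start, opposite strand
    ("chr1", 200, "+", 420),
]


def duplicate_fraction(keys):
    """Fraction of reads whose key was already seen at least once."""
    counts = Counter(keys)
    extra = sum(n - 1 for n in counts.values())
    return extra / len(keys)


# Start-position-only estimate (assumed simplification).
start_only = duplicate_fraction([(chrom, start) for chrom, start, _, _ in reads])
# Pair-aware estimate: start, strand and mate position must all match.
pair_aware = duplicate_fraction(reads)

print(start_only, pair_aware)
```

In this toy data the start-only estimate is higher than the pair-aware one, which is the kind of gap I am wondering about in my real results.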

Thank you in advance!

picard deduplication qualimap