Clarification on sequence duplication levels and Qualimap estimated duplication rate
20 months ago

Hello,

I'm working with Illumina paired-end reads. I merged the data from two different runs of the same sample to increase coverage depth. First, I removed duplicates using Picard MarkDuplicates:

java -jar picard.jar MarkDuplicates \
  INPUT=input.bam \
  OUTPUT=dedup_output.bam \
  METRICS_FILE=dup_metrics.txt \
  REMOVE_DUPLICATES=true

Then, when I checked the quality of my alignment, Qualimap reported a duplication rate of 46%. According to the Qualimap documentation, Qualimap estimates duplication from the start positions of read alignments. In FastQC, the estimated sequence duplication rate is around 27%. Also, here is my MarkDuplicates output:

LIBRARY                          Unknown Library
UNPAIRED_READS_EXAMINED          14184319
READ_PAIRS_EXAMINED              1029778028
SECONDARY_OR_SUPPLEMENTARY_RDS   25072251
UNMAPPED_READS                   52291303
UNPAIRED_READ_DUPLICATES         8394435
READ_PAIR_DUPLICATES             248696417
READ_PAIR_OPTICAL_DUPLICATES     9802236
PERCENT_DUPLICATION              0.243901
ESTIMATED_LIBRARY_SIZE           1822194394
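
The PERCENT_DUPLICATION value can be cross-checked against the other columns. This is a minimal sketch, assuming the formula from the Picard duplication metrics definitions, (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES) / (UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED). The shell variable names are illustrative only.

```shell
# Values copied from the MarkDuplicates metrics above.
unpaired_examined=14184319
pairs_examined=1029778028
unpaired_dups=8394435
pair_dups=248696417

# Each read pair counts as two reads in both numerator and denominator
# (assumed Picard definition of PERCENT_DUPLICATION).
pct=$(awk -v ue="$unpaired_examined" -v pe="$pairs_examined" \
          -v ud="$unpaired_dups" -v pd="$pair_dups" \
          'BEGIN { printf "%.6f", (ud + 2 * pd) / (ue + 2 * pe) }')
echo "PERCENT_DUPLICATION = $pct"
```

The result agrees with the reported 0.243901, so the roughly 24% figure counts duplicates per read, with paired reads weighted twice.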

I am confused by the higher duplication rate in the Qualimap results, and I have two questions on this topic:

  1. Is there a distinction between sequence duplication levels and the duplication rate estimated by Qualimap?
  2. Could the observed 46% duplication rate estimated by Qualimap be attributed to the merging of data at the beginning of the analysis?
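
As a rough illustration of what a start-position-based estimate measures, the toy sketch below counts reads whose chromosome and start coordinate were already seen. It assumes, for illustration only, that the rate is the fraction of such reads; the exact calculation Qualimap uses may differ, and the coordinates are made up. Picard's duplicate criteria are not modeled here.

```shell
# Toy alignment records: chromosome and 1-based start position (made-up values).
positions='chr1 100
chr1 100
chr1 100
chr1 250
chr2 40
chr2 40'

# A read counts as a duplicate if its (chromosome, start) pair was seen before.
# Adding 0 keeps the output numeric when no duplicates are found.
dup_rate=$(printf '%s\n' "$positions" | awk '
  { if (seen[$1 FS $2]++) dups++ }
  END { printf "%.2f", (dups + 0) / NR }')
echo "position-based duplication rate: $dup_rate"
```

Because this kind of count looks only at where a read starts, it is a different quantity from a rate based on identical read sequences, which is the distinction the first question asks about.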

Thank you in advance!

picard deduplication qualimap • 641 views