Hello,
I'm working with Illumina paired-end reads. I merged the data from two different runs of the same sample to increase coverage depth. First, I removed duplicates using Picard MarkDuplicates:
java -jar picard.jar MarkDuplicates \
INPUT=input.bam \
OUTPUT=dedup_output.bam \
METRICS_FILE=dup_metrics.txt \
REMOVE_DUPLICATES=true
Then, when I checked the quality of my alignment, I observed a duplication rate of 46% in the Qualimap results. According to the Qualimap documentation, Qualimap estimates duplication from the start positions of read alignments. In FastQC, the estimated sequence duplication rate is around 27%. Here is my MarkDuplicates output as well:
LIBRARY                          Unknown Library
UNPAIRED_READS_EXAMINED          14184319
READ_PAIRS_EXAMINED              1029778028
SECONDARY_OR_SUPPLEMENTARY_RDS   25072251
UNMAPPED_READS                   52291303
UNPAIRED_READ_DUPLICATES         8394435
READ_PAIR_DUPLICATES             248696417
READ_PAIR_OPTICAL_DUPLICATES     9802236
PERCENT_DUPLICATION              0.243901
ESTIMATED_LIBRARY_SIZE           1822194394
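As a sanity check on the metrics above, I tried to reproduce PERCENT_DUPLICATION from the other columns. This is only a sketch: it assumes Picard's documented definition, in which each read pair counts as two reads, and the variable names are my own.

```python
# Values copied from the MarkDuplicates metrics above.
unpaired_reads_examined = 14184319
read_pairs_examined = 1029778028
unpaired_read_duplicates = 8394435
read_pair_duplicates = 248696417

# Assumption: each pair contributes two reads to both numerator and denominator.
duplicate_reads = unpaired_read_duplicates + 2 * read_pair_duplicates
examined_reads = unpaired_reads_examined + 2 * read_pairs_examined

percent_duplication = duplicate_reads / examined_reads
print(round(percent_duplication, 6))  # 0.243901, the reported value
```

So the roughly 24% reported by MarkDuplicates is a fraction of the mapped reads examined before removal, which is a different denominator and a different definition from the Qualimap and FastQC figures.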
I am confused by the higher duplication rate in the Qualimap results, and I have two questions about it:
- Is there a distinction between sequence duplication levels and the duplication rate estimated by Qualimap?
- Could the observed 46% duplication rate estimated by Qualimap be attributed to the merging of data at the beginning of the analysis?
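To make the first question concrete, here is a toy sketch of the two notions as I understand them. It is not the actual algorithm of Qualimap or FastQC; the reads, the `duplication_rate` helper, and the choice of keys are all made up for illustration.

```python
def duplication_rate(keys):
    """Fraction of items that repeat an already-seen key."""
    keys = list(keys)
    return 1 - len(set(keys)) / len(keys)

# Hypothetical reads: (chromosome, alignment start, sequence).
reads = [
    ("chr1", 100, "ACGTAC"),
    ("chr1", 100, "ACGTAC"),  # same start, same sequence
    ("chr1", 100, "ACGTTC"),  # same start, different sequence
    ("chr1", 100, "ACGAAC"),  # same start, different sequence
    ("chr1", 250, "GGCATT"),
    ("chr1", 400, "TTAGCA"),
]

# Position-based: reads sharing a start position count as duplicates.
by_position = duplication_rate((chrom, start) for chrom, start, _ in reads)

# Sequence-based: only reads with identical sequence count as duplicates.
by_sequence = duplication_rate(seq for _, _, seq in reads)

print(by_position, by_sequence)
```

In this toy data the position-based rate is higher than the sequence-based one, because reads that start at the same coordinate but differ in sequence are still counted. I wonder whether something similar, amplified by the higher coverage after merging, explains my numbers.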
Thank you in advance!