Picard MarkDuplicates and EstimateLibraryComplexity - incongruency in duplicates percentage
22 months ago
ipediez93 ▴ 40

Hi everyone,

I'm currently processing some RNA-seq samples and using different tools to extract metrics. I'm confused by the outputs I'm getting from Picard MarkDuplicates and EstimateLibraryComplexity. For the same sample, MarkDuplicates reports that 76% of reads are duplicates, whereas EstimateLibraryComplexity reports 29%. Since both tools come from the same toolkit and I ran them with default parameters, I expected their duplicate estimates to agree, but I might be missing something.

Could you help me understand where this difference comes from?

My code and the outputs are below.

picard \
    MarkDuplicates \
    -I "$outDir/${sample}_sorted.bam" \
    -O "$outDir/${sample}_marked.bam" \
    -M "$outDir/mkDup_dupMetrics.txt" \
    --REMOVE_DUPLICATES false \
    --VALIDATION_STRINGENCY STRICT \
    > "$outDir/mark_duplicates.out" 2> "$outDir/mark_duplicates.err"

picard \
    EstimateLibraryComplexity \
    --INPUT "$outDir/${sample}_sorted.bam" \
    --OUTPUT "$outDir/libComplexityPicard.txt"

MarkDuplicates summary output

htsjdk.samtools.metrics.StringHeader
MarkDuplicates --INPUT sample_sorted.bam --OUTPUT test_picard/sample_marked.bam --METRICS_FILE test_picard/mkDup_dupMetrics.txt --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY STRICT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
htsjdk.samtools.metrics.StringHeader
Started on: Thu Feb 09 10:24:08 UTC 2023

METRICS CLASS        picard.sam.DuplicationMetrics
LIBRARY                         Unknown Library
UNPAIRED_READS_EXAMINED         4435
READ_PAIRS_EXAMINED             31846923
SECONDARY_OR_SUPPLEMENTARY_RDS  8078425
UNMAPPED_READS                  0
UNPAIRED_READ_DUPLICATES        4027
READ_PAIR_DUPLICATES            24327944
READ_PAIR_OPTICAL_DUPLICATES    306141
PERCENT_DUPLICATION             0.763913
ESTIMATED_LIBRARY_SIZE          7642236

EstimateLibraryComplexity summary output

htsjdk.samtools.metrics.StringHeader
EstimateLibraryComplexity --INPUT sample_sorted.bam --OUTPUT test_picard/libComplexityPicard.txt --MIN_IDENTICAL_BASES 5 --MAX_DIFF_RATE 0.03 --MIN_MEAN_QUALITY 20 --MAX_GROUP_RATIO 500 --MAX_READ_LENGTH 0 --MIN_GROUP_COUNT 2 --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 2279706 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
htsjdk.samtools.metrics.StringHeader
Started on: Thu Feb 09 10:34:29 UTC 2023

METRICS CLASS        picard.sam.DuplicationMetrics
LIBRARY                         Unknown
UNPAIRED_READS_EXAMINED         0
READ_PAIRS_EXAMINED             31420425
SECONDARY_OR_SUPPLEMENTARY_RDS  0
UNMAPPED_READS                  0
UNPAIRED_READ_DUPLICATES        0
READ_PAIR_DUPLICATES            9199932
READ_PAIR_OPTICAL_DUPLICATES    79302
PERCENT_DUPLICATION             0.292801
ESTIMATED_LIBRARY_SIZE          42796687
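
As a sanity check, both PERCENT_DUPLICATION values can be recomputed from the count columns above, which suggests the two tools report the metric in the same way and differ only in which reads they count as duplicates. The sketch below assumes the formula given in the Picard DuplicationMetrics documentation; the helper name percent_dup is made up for illustration and is not part of Picard.

```shell
# Recompute PERCENT_DUPLICATION from four DuplicationMetrics count columns.
# Assumed formula (from the Picard DuplicationMetrics documentation):
#   (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES)
#   / (UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED)
# percent_dup is a hypothetical helper, not a Picard command.
percent_dup() {
    # args: unpaired_examined pairs_examined unpaired_dups pair_dups
    awk -v ue="$1" -v pe="$2" -v ud="$3" -v pd="$4" \
        'BEGIN { printf "%.6f\n", (ud + 2 * pd) / (ue + 2 * pe) }'
}

percent_dup 4435 31846923 4027 24327944   # counts from the MarkDuplicates table
percent_dup 0 31420425 0 9199932          # counts from the EstimateLibraryComplexity table
```

The two calls should give back the 0.763913 and 0.292801 reported in the metrics files, so the gap comes from the duplicate counts themselves (24327944 versus 9199932 read pairs), not from how the percentage is calculated.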
22 months ago
ipediez93 ▴ 40

I finally found an answer buried in the GATK forum. I'm leaving it here in case others wonder the same thing:

"We would expect some differences in these metrics because the tools do not work the same. MarkDuplicates uses alignment information to determine duplicates. EstimateLibraryComplexity determines duplicates from the bases of the reads, allowing for some error, and ignoring the reference."
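
To make the quoted distinction concrete, here is a deliberately simplified toy sketch. It is not Picard's actual algorithm: it uses single-end reads, exact matching, and invented data, whereas MarkDuplicates works with the ends and orientation of both mates and EstimateLibraryComplexity compares read-pair sequences with some tolerance for mismatches. The helpers count_dups and toy_reads, and the read records themselves, are hypothetical.

```shell
# Toy illustration only: count "duplicates" with two different grouping keys.
# count_dups and toy_reads are hypothetical helpers; the reads are made up.
count_dups() {
    # $1: comma-separated list of 1-based columns that form the grouping key
    awk -v cols="$1" '
        BEGIN { n = split(cols, c, ",") }
        {
            key = ""
            for (i = 1; i <= n; i++) key = key SUBSEP $(c[i])
            if (seen[key]++) dups++   # every read after the first in its group
        }
        END { print dups + 0 }
    '
}

toy_reads() {
    # columns: name chrom pos strand sequence
cat <<'EOF'
r1 chr1 100 + ACGTACGTAC
r2 chr1 100 + ACGTACGTAC
r3 chr1 100 + ACGTACGGAC
r4 chr1 100 + ACCTACGTAC
r5 chr2 500 - GGGTTTAAAC
r6 chr2 500 - GGGTTTAAAC
EOF
}

toy_reads | count_dups 2,3,4   # key = alignment (chrom, pos, strand)
toy_reads | count_dups 5       # key = read sequence, reference ignored
```

On this toy input the alignment key flags 4 of the 6 reads as duplicates, while the sequence key flags only 2, because r3 and r4 map to the same position as r1 but differ in sequence. Whether the same effect explains the 76% versus 29% gap in a real sample is an assumption that would need checking against the data.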
