Hi everyone,
I'm currently processing some RNA-seq samples and using different tools to extract metrics. In particular, I'm confused by the outputs I'm getting from Picard MarkDuplicates and EstimateLibraryComplexity. For the same sample, MarkDuplicates estimates that 76% of reads are duplicates, whereas EstimateLibraryComplexity estimates only 29%. Since both tools come from the same toolkit and I ran them with default parameters, I expected their duplicate estimates to agree, but I might be missing something.
Could you help me understand where this difference comes from?
My commands and their outputs are below.
picard \
    MarkDuplicates \
    -I "$outDir/${sample}_sorted.bam" \
    -O "$outDir/${sample}_marked.bam" \
    -M "$outDir/mkDup_dupMetrics.txt" \
    --REMOVE_DUPLICATES false \
    --VALIDATION_STRINGENCY STRICT \
    > "$outDir/mark_duplicates.out" 2> "$outDir/mark_duplicates.err"
picard \
    EstimateLibraryComplexity \
    --INPUT "$outDir/${sample}_sorted.bam" \
    --OUTPUT "$outDir/libComplexityPicard.txt"
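To compare the two runs without opening each file, the headline value can be pulled out of a metrics file with a few lines of awk. This is only a sketch: picard_metric is a made-up helper name, and it assumes the file is tab-separated with the column header row directly below the "## METRICS CLASS" line, which is how the metrics files look on disk as far as I can tell.

```shell
# Hypothetical helper (not part of Picard): print one named column from the
# first data row of a Picard metrics file.
# Assumptions: tab-separated columns; header row directly below "## METRICS CLASS".
# Usage: picard_metric <metrics_file> <COLUMN_NAME>
picard_metric() {
    awk -F'\t' -v col="$2" '
        /^## METRICS CLASS/ { want_header = 1; next }
        want_header {
            # Find the index of the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            want_header = 0; want_row = 1; next
        }
        want_row { if (idx) print $idx; exit }
    ' "$1"
}

# Example calls for the two files written above:
#   picard_metric "$outDir/mkDup_dupMetrics.txt" PERCENT_DUPLICATION
#   picard_metric "$outDir/libComplexityPicard.txt" PERCENT_DUPLICATION
```

Splitting on tabs rather than whitespace matters here because the LIBRARY value can contain a space ("Unknown Library").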
MarkDuplicates summary output
htsjdk.samtools.metrics.StringHeader
MarkDuplicates --INPUT sample_sorted.bam --OUTPUT test_picard/sample_marked.bam --METRICS_FILE test_picard/mkDup_dupMetrics.txt --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY STRICT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
htsjdk.samtools.metrics.StringHeader
Started on: Thu Feb 09 10:24:08 UTC 2023
METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 4435 31846923 8078425 0 4027 24327944 306141 0.763913 7642236
EstimateLibraryComplexity summary output
htsjdk.samtools.metrics.StringHeader
EstimateLibraryComplexity --INPUT sample_sorted.bam --OUTPUT test_picard/libComplexityPicard.txt --MIN_IDENTICAL_BASES 5 --MAX_DIFF_RATE 0.03 --MIN_MEAN_QUALITY 20 --MAX_GROUP_RATIO 500 --MAX_READ_LENGTH 0 --MIN_GROUP_COUNT 2 --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 2279706 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
htsjdk.samtools.metrics.StringHeader
Started on: Thu Feb 09 10:34:29 UTC 2023
METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown 0 31420425 0 0 0 9199932 79302 0.292801 42796687
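As a sanity check, both PERCENT_DUPLICATION values can be reproduced from the counts in their own rows. The sketch below assumes the formula is (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES) / (UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED), which is my reading of the DuplicationMetrics definition. The helper name percent_duplication is made up for illustration.

```shell
# Hypothetical helper (not part of Picard): recompute PERCENT_DUPLICATION
# from the counts in one metrics row.
# Assumed formula:
#   (unpaired dups + 2 * pair dups) / (unpaired examined + 2 * pairs examined)
# Arguments:
#   $1 UNPAIRED_READS_EXAMINED    $2 READ_PAIRS_EXAMINED
#   $3 UNPAIRED_READ_DUPLICATES   $4 READ_PAIR_DUPLICATES
percent_duplication() {
    awk -v ue="$1" -v pe="$2" -v ud="$3" -v pd="$4" \
        'BEGIN { printf "%.6f\n", (ud + 2 * pd) / (ue + 2 * pe) }'
}

percent_duplication 4435 31846923 4027 24327944   # MarkDuplicates row, prints 0.763913
percent_duplication 0 31420425 0 9199932          # EstimateLibraryComplexity row, prints 0.292801
```

So each reported percentage is consistent with its own counts. The two tools examined a similar number of read pairs (31846923 vs 31420425) but flagged very different numbers of them as duplicates (24327944 vs 9199932), which is the gap I am trying to understand.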