GATK MarkDuplicates metrics issues
3.4 years ago
aka ▴ 10

Hello,

I have RNA-seq data that I processed with an aligner, and I then ran MarkDuplicates with this command:

    "java -Xmx4g -jar $PICARD MarkDuplicates I={input.bam_sort} O={output.final_bam_MARKUP} --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 M={output.metric} ASO=coordinate && samtools flagstat {output.final_bam_MARKUP} > {output.flag} "

The resulting flagstat output reports 59374462 duplicates:

    174209662 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    3114348 + 0 supplementary
    59374462 + 0 duplicates
    112681507 + 0 mapped (64.68% : N/A)
    171095314 + 0 paired in sequencing
    85547657 + 0 read1
    85547657 + 0 read2
    94945264 + 0 properly paired (55.49% : N/A)
    100496082 + 0 with itself and mate mapped
    9071077 + 0 singletons (5.30% : N/A)
    5415164 + 0 with mate mapped to a different chr
    4365520 + 0 with mate mapped to a different chr (mapQ>=5)

However, when I look at the metrics file below, the numbers do not seem to match: PERCENT_DUPLICATION is 0,5419, which I read as 0,54% duplicates.

    ## htsjdk.samtools.metrics.StringHeader
    # MarkDuplicates INPUT=[../results/mapped_reads/bwa_mem/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_sort_mapping.bam] OUTPUT=../results/mapped_reads/bwa_mem/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_mapping_MARK.bam METRICS_FILE=../results/mapped_reads/bwa_mem/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_mapping_METRIC.txt ASSUME_SORT_ORDER=coordinate    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    ## htsjdk.samtools.metrics.StringHeader
    # Started on: Thu Jul 15 13:36:58 CEST 2021

    ## METRICS CLASS    picard.sam.DuplicationMetrics
    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
    Unknown Library 9071077 50248041    3114348 61528155    8471292 25451585    981225  0,5419  31261767

    ## HISTOGRAM    java.lang.Double
    BIN CoverageMult    all_sets    optical_sets    non_optical_sets
    1.0 1,008057    12935692    0   13177329
    2.0 1,210093    6039282 847530  6042497

Is the duplicate count in my flagstat correct? According to flagstat, about 34% of the reads are duplicates (59374462 of 174209662).
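The two reports can be compared with some quick arithmetic. The sketch below is only a sanity check, not part of either tool. It assumes PERCENT_DUPLICATION is defined as in Picard's DuplicationMetrics documentation, that is (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES) / (UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED). The variable names are made up for illustration, and the numbers are copied by hand from the outputs above.

```python
# Values copied from the flagstat output above.
flagstat_total = 174_209_662
flagstat_duplicates = 59_374_462

# Values copied from the DuplicationMetrics row above.
unpaired_reads_examined = 9_071_077
read_pairs_examined = 50_248_041
secondary_or_supplementary = 3_114_348
unmapped_reads = 61_528_155
unpaired_read_duplicates = 8_471_292
read_pair_duplicates = 25_451_585

# Picard counts pairs while flagstat counts reads,
# so each pair is counted as two reads here.
duplicate_reads = unpaired_read_duplicates + 2 * read_pair_duplicates
examined_reads = unpaired_reads_examined + 2 * read_pairs_examined

# Assumed Picard definition: duplicates over examined (mapped, primary) reads.
picard_fraction = duplicate_reads / examined_reads

# flagstat-style ratio: duplicates over every record in the BAM.
flagstat_fraction = flagstat_duplicates / flagstat_total

# Examined reads plus the records Picard reports separately.
all_reads = examined_reads + secondary_or_supplementary + unmapped_reads

print(duplicate_reads == flagstat_duplicates)  # True
print(all_reads == flagstat_total)             # True
print(f"{picard_fraction:.4f}")                # 0.5419
print(f"{flagstat_fraction:.4f}")              # 0.3408
```

If that definition holds, both reports agree on the number of duplicate reads and differ only in the denominator: the metrics file divides by the examined reads, while the 34% divides by all records, including unmapped and supplementary ones. It would also mean that 0,5419 is a fraction with a decimal comma, so roughly 54%, not 0,54%.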

And why do I have so many duplicates? I think the lab did more PCR amplification. I am going to run a differential expression analysis with DESeq2: is it better to remove the duplicates or to keep them?

Thank you in advance,

Aka

Markduplicates Picard RNASeq • 980 views