3.4 years ago
aka
Hello,
I have RNA-seq data that I aligned and then ran through Picard MarkDuplicates with this command:
"java -Xmx4g -jar $PICARD MarkDuplicates I={input.bam_sort} O={output.final_bam_MARKUP} --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 M={output.metric} ASO=coordinate && samtools flagstat {output.final_bam_MARKUP} > {output.flag} "
The resulting flagstat reports 59374462 duplicates:
174209662 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
3114348 + 0 supplementary
59374462 + 0 duplicates
112681507 + 0 mapped (64.68% : N/A)
171095314 + 0 paired in sequencing
85547657 + 0 read1
85547657 + 0 read2
94945264 + 0 properly paired (55.49% : N/A)
100496082 + 0 with itself and mate mapped
9071077 + 0 singletons (5.30% : N/A)
5415164 + 0 with mate mapped to a different chr
4365520 + 0 with mate mapped to a different chr (mapQ>=5)
However, when I look at the metrics file below, the numbers don't seem to match: PERCENT_DUPLICATION is 0,5419, which I read as 0.54% duplicates...
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[../results/mapped_reads/bwa_mem/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_sort_mapping.bam] OUTPUT=../results/mapped_reads/bwa_mem/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_mapping_MARK.bam METRICS_FILE=../results/mapped_reads/bwa_mem/3373-1_CCGCGGTT-CTAGCGCT-AHV5HLDSXY_L004_mapping_METRIC.txt ASSUME_SORT_ORDER=coordinate MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Thu Jul 15 13:36:58 CEST 2021
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
Unknown Library	9071077	50248041	3114348	61528155	8471292	25451585	981225	0,5419	31261767
## HISTOGRAM java.lang.Double
BIN CoverageMult all_sets optical_sets non_optical_sets
1.0 1,008057 12935692 0 13177329
2.0 1,210093 6039282 847530 6042497
Is the number of duplicates in my flagstat correct? According to flagstat, 34% of the reads are duplicates.
And why do I have so many duplicates? I suspect the lab ran extra PCR cycles. I am going to do differential expression analysis with DESeq2: is it better to remove the duplicates or keep them?
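To make the comparison concrete, here is a small sketch of the arithmetic using only the numbers pasted above. It assumes that Picard counts duplicates per pair while flagstat counts individual reads, and that PERCENT_DUPLICATION is a fraction defined as (UNPAIRED_READ_DUPLICATES + 2 * READ_PAIR_DUPLICATES) / (UNPAIRED_READS_EXAMINED + 2 * READ_PAIRS_EXAMINED). The variable names are just labels for this example, not part of either tool.

```python
# Values copied from the flagstat output and the Picard metrics row above.
flagstat_total = 174_209_662
flagstat_mapped = 112_681_507
flagstat_supplementary = 3_114_348
flagstat_duplicates = 59_374_462

unpaired_reads_examined = 9_071_077
read_pairs_examined = 50_248_041
unpaired_read_duplicates = 8_471_292
read_pair_duplicates = 25_451_585

# Assumption: Picard reports pairs, flagstat reports reads,
# so every pair contributes two reads.
picard_duplicate_reads = unpaired_read_duplicates + 2 * read_pair_duplicates
picard_examined_reads = unpaired_reads_examined + 2 * read_pairs_examined

# Same duplicate count, two different denominators:
# Picard divides by the examined (mapped, primary) reads,
# the 34% figure divides by every record in the BAM.
picard_fraction = picard_duplicate_reads / picard_examined_reads
flagstat_fraction = flagstat_duplicates / flagstat_total

print(picard_duplicate_reads == flagstat_duplicates)  # True
print(picard_examined_reads == flagstat_mapped - flagstat_supplementary)  # True
print(f"Picard fraction:   {picard_fraction:.4f}")    # 0.5419
print(f"flagstat fraction: {flagstat_fraction:.4f}")  # 0.3408
```

Under those assumptions the duplicate counts agree exactly, 0,5419 would be a fraction (about 54%) written with a decimal comma rather than 0.54%, and the 34% differs only because it is taken over all records, including unmapped and supplementary ones.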
Thank you in advance,
Aka