Question

GATK MarkDuplicates output and bam

6.3 years ago

bioguy24 ▴ 230

The command below runs and produces the dup_metrics shown underneath. I am trying to interpret them but don't really understand the metrics. Also, in the BAM file, does the PG tag MarkDuplicates mean that the read is a duplicate? I am using public data, and if I run GATK BQSR, do I need to remove the duplicates or just mark them? Thank you :).

for file in /home/cmccabe/Desktop/fastq/*.bam
do
    # File name without the directory, e.g. NA12878.bam
    bname=$(basename "$file")
    echo "The bam file is: $bname"
    # Sample name: file name without .bam, up to the first '-'
    sample=$(basename "$file" .bam | cut -d- -f1)
    echo "The matching sample is: $sample"
    java -XX:ParallelGCThreads=16 -jar /home/cmccabe/Desktop/fastq/picard/build/libs/picard.jar MarkDuplicates \
      I="/home/cmccabe/Desktop/fastq/$bname" \
      O="/home/cmccabe/Desktop/fastq/${sample}_marked_duplicates.bam" \
      M="/home/cmccabe/Desktop/fastq/${sample}_marked_dup_metrics.txt"
done

dup_metrics.txt

## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/home/cmccabe/Desktop/fastq/NA12878.bam] OUTPUT=/home/cmccabe/Desktop/fastq/NA12878_marked_duplicates.bam METRICS_FILE=/home/cmccabe/Desktop/fastq/NA12878_marked_dup_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Tue Aug 07 11:25:52 CDT 2018

## METRICS CLASS    picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 118135  15851082    89664   270033  106517  13703167    6997    0.864632    2149266

## HISTOGRAM    java.lang.Double
BIN VALUE
1.0 1.000002
2.0 1.000629
3.0 1.000629

PG:Z:MarkDuplicates

GATK MarkDuplicates • 3.9k views

score 1 · Answer 1 · 2018-08-07

6.3 years ago

finswimmer 16k

Hello,

The explanation of the metrics can be found here. A duplication rate of 86% is very high. What kind of data is this?
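As a rough cross-check of that 86% figure, PERCENT_DUPLICATION can be recomputed from the counts in the metrics row: duplicate reads (unpaired duplicates plus two reads per duplicate pair) divided by reads examined (unpaired reads plus two per pair). The sketch below assumes that formula; the helper name dup_fraction is made up for illustration, and the numbers are the ones from the metrics file in the question.

```shell
# Recompute the duplication fraction from four counts in a metrics row.
# Arguments, in order:
#   UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED
#   UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES
# dup_fraction is a hypothetical helper, not part of Picard.
dup_fraction() {
    LC_ALL=C awk -v ue="$1" -v pe="$2" -v ud="$3" -v pd="$4" \
        'BEGIN { printf "%.6f\n", (ud + 2 * pd) / (ue + 2 * pe) }'
}

# Values taken from the "Unknown Library" row above.
dup_fraction 118135 15851082 106517 13703167
```

With these inputs the result is 0.864632, which agrees with the PERCENT_DUPLICATION column.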

Marking duplicates without removing them is fine. Most other tools respect the duplicate flag and handle marked reads themselves.
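On the PG tag question: PG:Z:MarkDuplicates on a read only records that MarkDuplicates processed it (the metrics header shows ADD_PG_TAG_TO_READS=true), so by itself it does not mean the read is a duplicate. Marked duplicates are identified by the 0x400 (decimal 1024) bit in the SAM FLAG field. A minimal sketch of that bit test follows; the helper name and the FLAG values are made up for illustration.

```shell
# Succeed (exit status 0) if a SAM FLAG value has the duplicate bit set.
# 0x400 = 1024 is the "PCR or optical duplicate" bit in the SAM format.
# is_duplicate_flag is a hypothetical helper for illustration only.
is_duplicate_flag() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# Illustrative FLAG values: 99 (no duplicate bit), 1123 = 99 + 1024,
# 1171 = 147 + 1024.
for flag in 99 1123 1171; do
    if is_duplicate_flag "$flag"; then
        echo "$flag: duplicate"
    else
        echo "$flag: not duplicate"
    fi
done
```

Assuming samtools is installed, `samtools view -c -f 1024 NA12878_marked_duplicates.bam` should count the reads that carry this bit in the marked BAM.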

You should think about whether BQSR is really necessary. If your overall base quality is fine and you do not have low-diversity data, the impact of BQSR will very likely be negligible, and the step is then just a waste of time.

fin swimmer

This is public exome data using NA1278. I will have to read up on Illumina analysis, as I am very new to it. So the BQSR step seems to depend on base quality; is there a particular tool to use for checking that? Currently, I am aligning with bwa-mem, then running MarkDuplicates and BQSR. Maybe I should add a quality step after alignment. Thank you :).

6.3 years ago by bioguy24 ▴ 230

"This is public exome data using NA1278"

You mean NA12878? If it's public, can you link to it?

An exome dataset with a duplication rate of 86%? How was the library prep done? How was this sequenced (platform, single-end/paired-end, number of cycles)?

fin swimmer

6.3 years ago by finswimmer 16k

Yes, sorry for the typo. I downloaded it from BaseSpace, combined read groups 1 and 2, aligned, and ran MarkDuplicates. I have to look into it more, as I am not sure of the details. Thank you :).

6.3 years ago by bioguy24 ▴ 230

So this was NextSeq data with a PF of 130,580,858; RG1 had 56,000,000 and RG2 had 55,000,000, with Q30 ~ 87%. Thank you :).

https://basespace.illumina.com/analyses/113086485/results/140322187

6.3 years ago by bioguy24 ▴ 230