The command below runs and produces the dup_metrics shown below. I am trying to interpret the metrics but don't really understand them. Also, in the BAM file, does the PG tag MarkDuplicates mean the read is a duplicate? I am using public data; if I run GATK BQSR, do I need to remove the duplicates or just mark them? Thank you :).
for file in /home/cmccabe/Desktop/fastq/*.bam
do
    # File name without the directory, e.g. NA12878.bam
    bname=$(basename "$file")
    echo "The bam file is: $bname"
    # Sample name: the file name without .bam, cut at the first '-'
    sample=$(basename "$file" .bam | cut -d- -f1)
    echo "The matching sample is: $sample"
    java -XX:ParallelGCThreads=16 -jar /home/cmccabe/Desktop/fastq/picard/build/libs/picard.jar MarkDuplicates \
        I="/home/cmccabe/Desktop/fastq/$bname" \
        O="/home/cmccabe/Desktop/fastq/${sample}_marked_duplicates.bam" \
        M="/home/cmccabe/Desktop/fastq/${sample}_marked_dup_metrics.txt"
done
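To see which sample name the loop derives for a given BAM path, the `basename | cut` step can be tried on its own. This is only a sketch: `sample_from_bam` is a hypothetical helper name, and the second path is a made-up example of a hyphenated file name, not a file from this thread.

```shell
# sample_from_bam is a hypothetical helper that mirrors the loop's
# "basename ... .bam | cut -d- -f1" step.
sample_from_bam() {
    # Strip the directory and the .bam suffix, then keep the text
    # before the first '-' (the whole name if there is no '-').
    basename "$1" .bam | cut -d- -f1
}

sample_from_bam /home/cmccabe/Desktop/fastq/NA12878.bam
# Hypothetical hyphenated name, for illustration only:
sample_from_bam /tmp/NA12878-rg1.bam
```

Both calls print `NA12878`, so two input files that differ only after the first `-` would write to the same output names.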
dup_metrics.txt
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates INPUT=[/home/cmccabe/Desktop/fastq/NA12878.bam] OUTPUT=/home/cmccabe/Desktop/fastq/NA12878_marked_duplicates.bam METRICS_FILE=/home/cmccabe/Desktop/fastq/NA12878_marked_dup_metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Tue Aug 07 11:25:52 CDT 2018
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
Unknown Library	118135	15851082	89664	270033	106517	13703167	6997	0.864632	2149266
## HISTOGRAM java.lang.Double
BIN VALUE
1.0 1.000002
2.0 1.000629
3.0 1.000629
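As a rough aid to reading the metrics row above, PERCENT_DUPLICATION can be recomputed from the other columns. The sketch assumes the usual Picard definition, in which each read pair counts as two reads: (UNPAIRED_READ_DUPLICATES + 2 × READ_PAIR_DUPLICATES) / (UNPAIRED_READS_EXAMINED + 2 × READ_PAIRS_EXAMINED). `percent_duplication` is a hypothetical helper name, not a Picard command.

```shell
# percent_duplication is a hypothetical helper, not part of Picard.
# Arguments, in order:
#   $1 UNPAIRED_READS_EXAMINED   $2 READ_PAIRS_EXAMINED
#   $3 UNPAIRED_READ_DUPLICATES  $4 READ_PAIR_DUPLICATES
percent_duplication() {
    # Pairs count as two reads in both numerator and denominator.
    awk -v ue="$1" -v pe="$2" -v ud="$3" -v pd="$4" \
        'BEGIN { printf "%.6f\n", (ud + 2 * pd) / (ue + 2 * pe) }'
}

# Counts taken from the NA12878 metrics row in this post.
percent_duplication 118135 15851082 106517 13703167
```

With the counts from this run the result is 0.864632, matching the reported PERCENT_DUPLICATION, i.e. roughly 86% of the examined reads were marked as duplicates.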
PG:Z:MarkDuplicates
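On the PG tag question: as far as I understand, with ADD_PG_TAG_TO_READS=true (as in the header above) the PG tag is written to the reads the program processed, so it does not by itself identify duplicates. Duplicates are indicated by bit 0x400 (decimal 1024) of the SAM FLAG field. A minimal sketch of that bit test follows; `has_duplicate_flag` is a hypothetical helper and the FLAG values are illustrative, not taken from this BAM.

```shell
# has_duplicate_flag is a hypothetical helper: it succeeds when the
# SAM FLAG value in $1 has the duplicate bit (0x400 = 1024) set.
has_duplicate_flag() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# Illustrative FLAG values: 99 and 147 are a typical proper pair,
# 1123 and 1171 are the same values with the duplicate bit added.
for flag in 99 147 1123 1171; do
    if has_duplicate_flag "$flag"; then
        echo "$flag duplicate"
    else
        echo "$flag not-duplicate"
    fi
done

# With samtools installed (an assumption), the flagged reads in the
# output BAM could be counted with:
#   samtools view -c -f 1024 NA12878_marked_duplicates.bam
```

Because REMOVE_DUPLICATES=false in this run, the duplicates stay in the output BAM and are only flagged, so downstream tools that honor the flag can skip them.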
This is public exome data using NA1278. I will have to read up on Illumina analysis; I am very new to it. The BQSR step seems to depend on base quality; is there a particular tool to use? Currently I am aligning with bwa-mem, then running MarkDuplicates and BQSR. Maybe a quality step after alignment would help. Thank you :).
You mean NA12878? If it's public, can you link to it?
An exome dataset with a duplication rate of 86%? How was the library prep done? How was this sequenced (platform, single-end/paired-end, number of cycles)?
fin swimmer
Yeah, sorry for the typo. I downloaded it from BaseSpace, combined read groups 1 and 2, aligned, and ran MarkDuplicates. I have to look into it more, as I am not sure of the details. Thank you :).
So this was NextSeq data with a PF of 130,580,858; RG1 had 56,000,000 and RG2 had 55,000,000, with Q30 ~ 87%. Thank you :).
https://basespace.illumina.com/analyses/113086485/results/140322187