Hi, I ran the following command to mark and remove duplicates in my WGS data from the Illumina HiSeq platform:
java -Xms4g \
-jar /usr/local/picard-tools-1.129/picard.jar \
MarkDuplicates \
INPUT=2102.bwa.sam.sort.bam \
OUTPUT=2102.bwa.sam.sort.rmdup1.bam \
METRICS_FILE=2102.bwa.sam.sort.rmdup.txt2 \
REMOVE_DUPLICATES=true \
VALIDATION_STRINGENCY=LENIENT
I found that my input file was 6.8G, whereas the output file was 7.0G. Moreover, I didn't find any duplicates removed from the file, either when visualizing in IGV or when comparing on the command line with samtools, i.e.:
diff -c <(samtools view 2102.bwa.sam.sort.bam | cut -f -9) <(samtools view 2102.bwa.sam.sort.rmdup1.bam | cut -f -9) | less
Any suggestions on what I am missing?
Thanks,
Ravi
Try using /usr/local/picard-tools-1.129/MarkDuplicates.jar directly. Alternatively, you can try samtools rmdup to test whether the results change.
Hi, thank you for the suggestion. Going to try it now.
Hi, the picard-tools-1.129 directory doesn't have a MarkDuplicates.jar file. The only files available are:
So I don't think I have the option to use 'MarkDuplicates' the way you suggested?
Can you please post the content (or the basic stats) of the metrics file that was output? It should be 2102.bwa.sam.sort.rmdup.txt2, as you specified in your command line.
Thank you. Yes, here is the content of the above file (I shortened the file names while posting this query):
I should also add that the discrepancy in size is likely due to the fact that MarkDuplicates adds an entry to each record in the BAM file. See the statement which starts at line 182:
https://github.com/broadinstitute/picard/blob/1dc88674926819984de793bfc1bf04847d1fff1a/src/java/picard/sam/markduplicates/MarkDuplicates.java
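One way to check for such a per-record entry without reading the Java source is to look at the optional fields of the output BAM. The sketch below assumes the added entry is a PG:Z: tag (the thread does not name it, so treat that as an assumption), that samtools is on the PATH, and it uses a hypothetical helper name, pg_tagged. It counts how many of the first N records carry that tag.

```shell
# Sketch: count records among the first N that carry a PG:Z: optional tag.
# Assumptions: samtools is installed; pg_tagged is a hypothetical helper name;
# the per-record entry being looked for is a PG:Z: tag.
pg_tagged() {
    bam="$1"
    n="${2:-1000}"    # number of records to inspect (default 1000)
    # Optional SAM fields start at column 12, so only those columns are checked.
    samtools view "$bam" | head -n "$n" |
        awk -F'\t' '{ for (i = 12; i <= NF; i++) if ($i ~ /^PG:Z:/) { c++; break } }
                    END { print c + 0 }'
}
```

Running pg_tagged on both the input and the output BAM and comparing the two numbers would show whether the tag was added by this step.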
Yes, I was also assuming that Picard might have added some 'mark' to the .bam file, which would increase its size, but at the same time I have read that the overall file size should be reduced after duplicate removal. So I am a bit confused by my file's size (and number of lines). Sorry, I can't read Java. Thanks for the comment, Ravi
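File size is a weak proxy here, because BAM compression and per-record tags both affect it; counting records is more direct. A minimal sketch, assuming samtools is on the PATH and using a hypothetical helper name, bam_counts: it prints the total number of alignment records and the number carrying the duplicate flag (0x400, decimal 1024).

```shell
# Sketch: report record counts for a BAM instead of relying on file size.
# Assumptions: samtools is installed; bam_counts is a hypothetical helper name.
bam_counts() {
    bam="$1"
    total=$(samtools view -c "$bam")           # all alignment records
    dups=$(samtools view -c -f 1024 "$bam")    # records with the duplicate flag (0x400) set
    printf '%s\ttotal=%s\tduplicate_flagged=%s\n' "$bam" "$total" "$dups"
}
```

Calling bam_counts 2102.bwa.sam.sort.bam and then bam_counts 2102.bwa.sam.sort.rmdup1.bam gives two lines to compare. With REMOVE_DUPLICATES=true, one would expect the output's total to be lower than the input's by the number of duplicates removed, and no duplicate-flagged records to remain in the output.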
Thanks! There are definitely some duplicates there that Picard caught. Looks like your duplication rate is just under 1%.
Try running this command on both your input BAM and your MarkDuplicates output BAM:
What value do you get for each file?