Hi,
I am using Picard to remove duplicates from BAM files. The manual says that it matters how the BAM files are sorted and you would get different results for different ways of sorting. I wanted to test how it affects with my data. So, I sorted my paired-end BAM file using either query name or coordinates using samtools (with -n option or without -n option, respectively). I get different output BAM files (different number of alignments) as mentioned in the manual. However, the duplication statistics in "Duplication Matrics" (output from "M" option) file is same for both. Shouldn't it be different for different ways the BAM files was sorted? Am I missing something? Please let me know. I have provided both commands below for your review.
Thank you in advance.
Best, Ravi
Command used for coordinate sorted BAM file:
time java -jar picard.jar MarkDuplicates I=coord.sorted.bam O=coordSorted_marked.bam M=coordSorted_marked.txt TAG_DUPLICATE_SET_MEMBERS=true TAGGING_POLICY=All REMOVE_DUPLICATES=true ASSUME_SORT_ORDER=coordinate READ_NAME_REGEX=null
Command used for query name sorted BAM file:
time java -jar picard.jar MarkDuplicates I= readName.sorted.bam O=querySorted_marked.bam M=querySorted_marked.txt TAG_DUPLICATE_SET_MEMBERS=true TAGGING_POLICY=All REMOVE_DUPLICATES=true ASSUME_SORT_ORDER=queryname READ_NAME_REGEX=null
Duplicate statistics for coordinate sorted BAM file:
Unknown Library 390015 32881469 0 6736693 292349 8862584 0 0.272361 49457941
Duplicate statistics for query name sorted BAM file:
Unknown Library 390015 32881469 0 6736693 292349 8862584 0 0.272361 49457941
Picard markduplicates is marking reads as duplicate based on their sequence. Why should the result change whether you sort the file based on co-ordinates or name?
If you are looking to identify reads as
optical
duplicates (as opposed toPCR
duplicates) based on their cluster co-ordinates, then you need to useclumpify.sh
from BBMap suite: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq FilesHi @genomax,
Because the Picard manual says so. My understanding of this based on their manual is as follows. In case of single-end reads, you can predict PCR duplicates by finding reads that map at a same location on the genome. In case of paired-end reads, you could use the paired-end information to more accurately predict PCR duplicates (by looking for read-pairs that have both ends mapped at a same location on the genome). But it might be computationally challenging to find read-pairs from a bam file unless the bam file is sorted by read names (reads in a pair have same name). My understanding is that Picard doesn't look for read pairs in the input bam file explicitly, unless they are one after the other. If the bam file is sorted by read name, the reads of a pairs will be one after the other, and Picard does the right thing of using paired-end information for calling PCR duplicates. If the bam file is sorted by chromosomal location, Picard may not use the paired-end information. Hence, it would matter how the bam file is sorted in case of paired-end data (which is the case in my data). I get different bam outputs (after removing PCR duplicates) depending on how I sorted the input bam file. However the matrix file is identical for different ways of sorting the input bam file. Therefore, I am trying to understand why is this the case.
Thank you for mentioning about clumpify.sh. I will keep that in mind if I wanted to find optical duplicates (Picard does that too by the way), but I am mostly interested in PCR duplicates at this point.
Thanks.
Best, Ravi
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.