How to interpret the "Matrics" file of MarkDuplicates (Picard)
1
0
Entering edit mode
6.3 years ago
ravipatel4 ▴ 50

Hi,

I am using Picard to remove duplicates from BAM files. The manual says that it matters how the BAM files are sorted and you would get different results for different ways of sorting. I wanted to test how it affects with my data. So, I sorted my paired-end BAM file using either query name or coordinates using samtools (with -n option or without -n option, respectively). I get different output BAM files (different number of alignments) as mentioned in the manual. However, the duplication statistics in "Duplication Matrics" (output from "M" option) file is same for both. Shouldn't it be different for different ways the BAM files was sorted? Am I missing something? Please let me know. I have provided both commands below for your review.

Thank you in advance.

Best, Ravi

Command used for coordinate sorted BAM file:

time java -jar picard.jar MarkDuplicates I=coord.sorted.bam O=coordSorted_marked.bam M=coordSorted_marked.txt TAG_DUPLICATE_SET_MEMBERS=true TAGGING_POLICY=All REMOVE_DUPLICATES=true ASSUME_SORT_ORDER=coordinate READ_NAME_REGEX=null

Command used for query name sorted BAM file:

time java -jar picard.jar MarkDuplicates I= readName.sorted.bam O=querySorted_marked.bam M=querySorted_marked.txt TAG_DUPLICATE_SET_MEMBERS=true TAGGING_POLICY=All REMOVE_DUPLICATES=true ASSUME_SORT_ORDER=queryname READ_NAME_REGEX=null

Duplicate statistics for coordinate sorted BAM file:

Unknown Library 390015  32881469    0   6736693 292349  8862584 0   0.272361    49457941

Duplicate statistics for query name sorted BAM file:

Unknown Library 390015  32881469    0   6736693 292349  8862584 0   0.272361    49457941
picard MarkDuplicates • 4.9k views
ADD COMMENT
0
Entering edit mode

Picard markduplicates is marking reads as duplicate based on their sequence. Why should the result change whether you sort the file based on co-ordinates or name?

If you are looking to identify reads as optical duplicates (as opposed to PCR duplicates) based on their cluster co-ordinates, then you need to use clumpify.sh from BBMap suite: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files

ADD REPLY
0
Entering edit mode

Hi @genomax,

Because the Picard manual says so. My understanding of this based on their manual is as follows. In case of single-end reads, you can predict PCR duplicates by finding reads that map at a same location on the genome. In case of paired-end reads, you could use the paired-end information to more accurately predict PCR duplicates (by looking for read-pairs that have both ends mapped at a same location on the genome). But it might be computationally challenging to find read-pairs from a bam file unless the bam file is sorted by read names (reads in a pair have same name). My understanding is that Picard doesn't look for read pairs in the input bam file explicitly, unless they are one after the other. If the bam file is sorted by read name, the reads of a pairs will be one after the other, and Picard does the right thing of using paired-end information for calling PCR duplicates. If the bam file is sorted by chromosomal location, Picard may not use the paired-end information. Hence, it would matter how the bam file is sorted in case of paired-end data (which is the case in my data). I get different bam outputs (after removing PCR duplicates) depending on how I sorted the input bam file. However the matrix file is identical for different ways of sorting the input bam file. Therefore, I am trying to understand why is this the case.

Thank you for mentioning about clumpify.sh. I will keep that in mind if I wanted to find optical duplicates (Picard does that too by the way), but I am mostly interested in PCR duplicates at this point.

Thanks.

Best, Ravi

ADD REPLY
0
Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY
3
Entering edit mode
6.2 years ago
h.mon 35k

When marking duplicates in position sorted files, only mapped reads are analyzed. When marking duplicates in query name sorted files, unmapped reads will also be analyzed - unmapped reads may be marked as duplicates with query name sorted files. My interpretation of Picard MarkDuplicates (which may be incorrect) is that: 1) there is no difference in how mapped reads are analyzed for name or position sorted files, 2) for unmapped reads in name sorted files, probably only optical duplication is analyzed, but not PCR duplication.

So its not that there should be differences, but there may be differences, between the output of both methods. In your particular case, you disabled optical duplication detection (READ_NAME_REGEX=null), so I expect there will be no differences between both methods.

ADD COMMENT
0
Entering edit mode

Hi @h.mon,

I see your point. But I think MarkDuplicates's behavior is different from your interpretation, because, there are different number of mapped reads in query name sorted input vs coordinate sorted input bam.

More importantly, the issue I am trying to resolve is the discrepancies between output bam files and output matrics files. My bam outputs are different (i.e. different number of output alignments) when I sort my input files in different ways. But the matrics file is same irrespective of the way I sort the input bam file. I am trying to resolve this discrepancy. I hope I could explain the issue clearly.

Thank you.

ADD REPLY
2
Entering edit mode

You will see differences regarding exactly which reads/pairs are marked as duplicate depending on the sort order of the file, but the same number of reads/pairs will get marked as duplicates regardless (if that weren't the case the tool would be broken).

ADD REPLY
1
Entering edit mode

How are the bam files different? How are you measuring or checking the differences?

but the matrix file is same

I believe you are talking about the metrics file, correct?

ADD REPLY
2
Entering edit mode

Hi h.mon,

Yes, I meant "matrics" file. Sorry about that.

I performed a test to see if the number of mapped reads are same in the output irrespective of the way the input is sorted. But that is not true. The MarkDuplicate's behavior seems different from your interpretation.

EDIT: I think I made a mistake in the command that used to count mapped reads in a bam file. It turns out that the number of mapped reads in output bam files is same regardless of how the file was sorted, meaning that your interpretation of how MarkDuplicates work looks correct. Sorry about the confusion.

ADD REPLY

Login before adding your answer.

Traffic: 1799 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6