I have been analyzing Transcription Factor ChIP-Seq data for several months and I am still perpetually confused about "handling of duplicates" during peak calling with MACS2. This may seem like a topic raised many a time by various users, and I have gone through multiple posts without arriving at any concrete solution.
- Some analysts claim they always remove duplicates.
- Few say they don't remove duplicates, but instead rely on filtering reads with high qualities (samtools -q 20 or samtools -q 30) and retaining all reads thereafter.
- Some have recommended using Picard to remove dupliactes, though that doesn't essentially clear the confusion around the original question.
My questions are as follows
- From what I have read, there are "good" duplicates and "bad duplicates". How do I know beforehand which is which? For instance, my ChIP-Seq samples have an average of 30-40 million reads. The sequencing depth is understandably very high and naturally there will be a lot of duplicates. If I remove all duplicates in MACS2, then what was the point of increasing the sequencing depth?
- Some of my samples in their FastQC reports show high sequence duplication levels, and the % sequences remaining after being de-duplicated varies from 15-30%. Does this mean that MACS2 will perform peak calling with only 20% of my total sequenced reads? This seems like a loss of both money and resources, and what can I do to avoid it?
- Once in a while some ChIP-Seq samples turn out with lesser redundancy rate (duplicates as per MACS2 report), say 15% or 30%. But in most of the cases, there are around 50-60% duplicates. Though I had followed the same protocol while performing ChIP, I don't have an answer why those few samples showed low redundancy rates compared to others.
- In MACS2, there is an option to specify a maximum of times a read is duplicated for it to be considered. Should I use this parameter at all?
- In paired-end ChIP-Seq do I need to provide the parameters fragment size 'd' or --extsize? I have used macs2 predictd and there it says that it works only for singe-end reads. Do I skip providing these parameters and let MACS2 decide what to use?
I request anyone with expertise in handling ChIP-Seq data to answer my queries and also believe it will be of help to many other budding analysts.
Couple of things to note:
I used Picard previously to mark duplicates. Here I have a question. Can MACS2 inherently identify which reads are marked as duplicates by Picard or do I have to specify them with any additional parameter? If it detects that Picard Markdup has been run (by sam flag 1024), will it shut off its inherent duplicate-removing step and instead ignore all reads marked by Picard?
~As far as I'm aware flag 1024 is how MACS2 detects duplicates. Without having run MarkDuplicates MACS2's duplicate removal does nothing.~
EDIT: Actually, looking at the latest versions of the documentation, its not clear if that is true any more. It looks from the text of the manual like deduplication marking might now be built in.