Hi,
The purpose of removing duplicates is to mitigate the effects of PCR amplification bias. However, this step can also remove reads that are genuine distinct fragments rather than PCR copies, thereby discarding useful information. Some studies suggest that duplicate removal is unnecessary because its impact on variant calling is minimal (see link).
What do you recommend? Do you know of any paper or study that points out the benefits of duplicate removal?
Yes, do it as part of your standard pipeline. PCR bias is what it is, a bias, and therefore mostly undirected and not reproducible for a given DNA fragment. For this reason I somewhat doubt that a single study such as the one you linked can settle the matter comprehensively. PCR bias may or may not be present depending on the sample prep method and the polymerase used. I always mark duplicates with samblaster, like:

aligner (...) | samblaster --ignoreUnmated | samtools view -o out.bam

You are free to use any tool of your choice.

Sorry, I have a bunch of .bam files in which duplicates are likely marked. How can I check whether the duplicates have already been removed, or are only marked and awaiting removal?
You don't need to remove them. Any proper variant caller (or NGS software in general) will ignore reads flagged as duplicates. If you still want to check whether they have been removed, you could take a subset of the files and rerun any duplicate-detection tool; these typically output a summary of how many reads were flagged as duplicates.
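A quick way to check, assuming samtools is available, is to count reads carrying the duplicate flag (bit 0x400, i.e. 1024, in the SAM FLAG field) with samtools view -c -f 1024 in.bam: a nonzero count means duplicates are marked but still present, while zero means they were either removed or never marked. The flag test itself is just bit arithmetic; here is a minimal sketch in Python (the FLAG values are made up for illustration):

```python
# Bit 0x400 of the SAM FLAG field means "PCR or optical duplicate".
DUP_FLAG = 0x400  # 1024

def is_duplicate(flag: int) -> bool:
    """Return True if the SAM FLAG marks the read as a duplicate."""
    return bool(flag & DUP_FLAG)

# Hypothetical FLAG values as they might appear in column 2 of a SAM file:
flags = [99, 147, 1123, 1171]  # 1123 = 99 + 1024, 1171 = 147 + 1024
n_dup = sum(is_duplicate(f) for f in flags)
print(f"{n_dup} of {len(flags)} reads are flagged as duplicates")
```

If the count over a whole file is nonzero, the BAM still contains the marked reads; after actual removal (e.g. samtools view -b -F 1024 in.bam) the count drops to zero.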
Ideally they are marked with GATK, and that information is already taken care of by any standard variant caller.