I know samtools rmdup is obsolete and markdup should be used instead. My old pipeline used rmdup and now I'm trying to upgrade it to use markdup.
When comparing the results of the two with default settings, rmdup removes more reads on my test dataset (185M reads remaining with rmdup vs 188M with markdup). According to the manual, markdup by default removes PCR duplicates but not optical duplicates, and I think that's what rmdup does too (rmdup does not have an option for dealing with optical duplicates).
Where does this difference come from? How can I reproduce results similar to samtools rmdup using samtools markdup?
Thanks!
If it is not documented then it is unlikely that rmdup did that. Still, why bother with something like this? I recommend just using markdup (since it is the currently recommended tool within samtools) and then proceeding with the analysis. One can spend a lot of time on these low-level things, but in the end there is no benefit in overthinking it.
I was thinking the same recently... Since I need to do variant calling, I was wondering whether we should remove duplicate reads or just mark them. Will it affect the variant calling? If I just use markdup, will the duplicates be ignored?
Yes, a proper variant calling tool will ignore duplicates if these are marked as such.
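To illustrate what "marked" means here: marking a duplicate sets bit 0x400 (decimal 1024, "PCR or optical duplicate") in the FLAG field of the SAM record, and a downstream tool can skip any read with that bit set. Below is a minimal Python sketch of that check. The two records are made up for illustration, and the helper names is_duplicate and drop_duplicates are hypothetical; this is not how samtools or any particular variant caller implements it.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def is_duplicate(sam_line: str) -> bool:
    """Return True if the alignment line has the duplicate bit set."""
    flag = int(sam_line.split("\t")[1])  # FLAG is the second SAM column
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Keep header lines (starting with '@') and unmarked alignments."""
    return [ln for ln in sam_lines if ln.startswith("@") or not is_duplicate(ln)]


# Toy records (illustrative only): same position, second one marked (99 + 1024).
records = [
    "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "read2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
]

kept = drop_duplicates(records)
print(len(kept))  # 1: only the unmarked read is left
```

So marking followed by filtering on the flag gives the same reads as removing the duplicates outright, while keeping them in the file in case they are needed later. On a real BAM file the same filter can be applied with something like samtools view -F 1024.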