Rmdup removes reads, so you'd need markdup -r
to do the same thing. However that's not ideal way of working anyway. It's better to leave all the data intact and just filter on-the-fly while processing, as then you can always go back from BAM to FASTQ still meaning you don't need to keep a separate FASTQ copy.
The rmdup -S
option doesn't seem to have an equivalent that I can see, but I'm not sure what it was for anyway. If you have pair information, why not use it to get a more accurate determination of duplicates?
Finally, markdup
man page does imply optical duplicates aren't processed by default ("Default is 0 to not look for optical duplicates"), but it's easy enough to enable it as described there. I've no idea what rmdup did, but there's no mention of "optical" anywhere in its source code so I think you're correct in that it didn't cover this case.
Note markdup
runs on a coordinate sorted file, that has previously been through samtools fixmate -m
while name collated. I don't know what sort order rmdup
needs as it claims to want "sorted" without actually defining if it's name or position! I assume position sorted though?
Note the strategy of the two halves of markdup (fixmate + mark) in different sort orders is a way to improve performance. The aligner output will be name collated, so we can cache some data about supplementaries, secondaries, quality values for mate pairs, etc at that stage while all the data is collated together. After the normal position sort step has taken place, we can then rapidly determine the duplicates by looking at the alignment positions coupled with the earlier cached data. This avoids internal sorting steps used by e.g. elPrep, meaning both the fixmate and markdup phases are streaming operations.
Just a note about optical duplicates. In
markdup
(andrmdup
) optical duplicates are a subset of duplicates. The optical duplicate-d
option inmarkdup
labels duplicates as either PCR or optical in the auxiliary tags, but it does not add any more duplicates. There will be the same number of duplicates with or without the optical option being set.So if you are committed to removing the duplicates entirely then I would not bother setting the optical option. All it does is label a set of duplicates that you are subsequently removing from the file and will never see. If you do not set
-d
then the program will run faster (as it is doing less work).