Picard's MarkDuplicates tool is very useful. As far as I know it's the standard for identifying duplicates in BAM files. However, a lengthy discussion with an investigator about the relationship between optical duplication and sequence complexity got me thinking more deeply about its current methodology.
It seems that in the case of a low-complexity library, a substantial number of library duplicates would incorrectly be labeled as optical duplicates simply because they sit close together on the flowcell surface. If this is true, then perhaps a way to prevent this would be to compare the quality scores of the suspected optical duplicates. If there's a substantial divergence in the quality scores between two clusters whose coordinates indicate close proximity, that would suggest the reads originate from separate clusters and are therefore library duplicates rather than optical duplicates.
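To make the idea concrete, here is a minimal sketch of the proposed check. This is not Picard code; the function names, the pixel-distance threshold, and the quality-divergence threshold are all illustrative placeholders for whatever defaults the tool would actually use.

```python
# Hypothetical sketch: classify a duplicate pair as optical vs. library
# using both cluster proximity and base-quality divergence.
# Thresholds (100 px, 5.0 Phred units) are illustrative, not Picard defaults.

def mean_quality(phred_string):
    """Mean Phred score of a read's base qualities (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in phred_string) / len(phred_string)

def classify_duplicate(x1, y1, qual1, x2, y2, qual2,
                       max_pixel_distance=100, max_quality_divergence=5.0):
    """Call 'optical' only if the clusters are close on the flowcell AND
    their quality profiles agree; otherwise treat as a library duplicate."""
    close = (abs(x1 - x2) <= max_pixel_distance and
             abs(y1 - y2) <= max_pixel_distance)
    divergence = abs(mean_quality(qual1) - mean_quality(qual2))
    if close and divergence <= max_quality_divergence:
        return "optical"
    return "library"
```

Two nearby clusters with matching high qualities would be called optical, while a nearby pair whose qualities diverge sharply (e.g. Phred 40 vs. Phred 2) would be kept as a library duplicate.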
Is this line of thinking correct?
If no one has a good argument to the contrary, I'll update the Picard MarkDuplicates tool to add an option for considering quality scores. There's a question of what threshold to use for the quality-score divergence, but that can be set to some default, much like the distance threshold variable currently present in the tool (DEFAULT_OPTICAL_DUPLICATE_DISTANCE).
I would think that there would be significant divergence in quality scores due to optical duplication (in that the peripheral faux-clusters would have lower quality scores). Have you compared the apparent optical duplication rate between low- and regular-complexity libraries? That comparison would be informative.
Perhaps a metric could be developed that, given a library's complexity, would quantify the expected level of duplication that also looks optical. Then MarkDuplicates should only remove duplicates as optical until this expected level is reached.
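One way such a metric might be sketched: estimate the expected number of library duplicates from the complexity via the standard uniform-sampling (Lander-Waterman style) model, then scale by the chance that an unrelated duplicate pair falls within the optical-distance threshold anyway. Everything here is an assumption for illustration; in particular `p_proximity` would have to come from cluster-density data and is not a Picard quantity.

```python
import math

# Hypothetical sketch: given an estimated library complexity C (distinct
# molecules) and N sequenced reads, estimate how many genuine library
# duplicates to expect, and how many of those will land close enough
# together on the flowcell to masquerade as optical duplicates.

def expected_duplicates(n_reads, library_complexity):
    """Expected duplicate reads under uniform sampling of C molecules:
    E[unique] = C * (1 - exp(-N / C)), duplicates = N - E[unique]."""
    unique = library_complexity * (1.0 - math.exp(-n_reads / library_complexity))
    return n_reads - unique

def expected_optical_lookalikes(n_reads, library_complexity, p_proximity):
    """Library duplicates expected to fall within the optical-distance
    threshold purely by chance. p_proximity is an assumed per-pair
    probability derived from cluster density."""
    return expected_duplicates(n_reads, library_complexity) * p_proximity
```

For a low-complexity library (say 1M reads from 100K molecules) the expected duplicate count is large, so even a small `p_proximity` yields many library duplicates that look optical, which is the scenario the proposed quality-score check is meant to rescue.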