Picard's MarkDuplicates tool is very useful. As far as I know it's the standard for identifying duplicates in BAM files. However, a lengthy discussion with an investigator about the relationship between optical duplication and sequence complexity got me thinking more deeply about its current methodology.
It seems that in the case of a low-complexity library, a substantial number of library duplicates would incorrectly be labeled as optical duplicates simply because they sit close together on the flowcell surface. If this is true, then perhaps a way to prevent this would be to compare the quality scores of the suspected optical duplicates. If there's a substantial divergence in the quality scores between two clusters whose coordinates indicate close proximity, that would suggest the reads originate from separate clusters and are therefore library duplicates rather than optical duplicates.
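To make the idea concrete, here is a minimal sketch of the proposed check. This is not Picard code; the function names, the pixel-distance threshold, and the quality-divergence threshold are all illustrative placeholders for whatever defaults the tool would actually use.

```python
# Hypothetical sketch: classify a duplicate pair as optical vs. library
# using both cluster proximity and base-quality divergence.
# Thresholds (100 px, 5.0 Phred units) are illustrative, not Picard defaults.

def mean_quality(phred_string):
    """Mean Phred score of a read's base qualities (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in phred_string) / len(phred_string)

def classify_duplicate(x1, y1, qual1, x2, y2, qual2,
                       max_pixel_distance=100, max_quality_divergence=5.0):
    """Call 'optical' only if the clusters are close on the flowcell AND
    their quality profiles agree; otherwise treat as a library duplicate."""
    close = (abs(x1 - x2) <= max_pixel_distance and
             abs(y1 - y2) <= max_pixel_distance)
    divergence = abs(mean_quality(qual1) - mean_quality(qual2))
    if close and divergence <= max_quality_divergence:
        return "optical"
    return "library"
```

Two nearby clusters with matching high qualities would be called optical, while a nearby pair whose qualities diverge sharply (e.g. Phred 40 vs. Phred 2) would be kept as a library duplicate.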
Is this line of thinking correct?
If no one has a good argument to the contrary, I'll update the Picard MarkDuplicates tool to add an option for considering quality scores. There's a question of what threshold to use for the quality-score divergence, but that can be set to some default, much like the distance threshold variable currently present in the tool (DEFAULT_OPTICAL_DUPLICATE_DISTANCE).
I would think that there would be significant divergence in quality scores due to optical duplication (in that the peripheral faux-clusters would have lower quality scores). Have you compared the apparent optical duplication rate between low- and regular-complexity libraries? That comparison would be informative.
Perhaps a metric could be developed that, given a library's complexity, would quantify the expected level of duplication that also looks optical. Then MarkDuplicates should only remove duplicates as optical until this expected level is reached.
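One way such a metric might be sketched: estimate the expected number of library duplicates from the complexity via the standard uniform-sampling (Lander-Waterman style) model, then scale by the chance that an unrelated duplicate pair falls within the optical-distance threshold anyway. Everything here is an assumption for illustration; in particular `p_proximity` would have to come from cluster-density data and is not a Picard quantity.

```python
import math

# Hypothetical sketch: given an estimated library complexity C (distinct
# molecules) and N sequenced reads, estimate how many genuine library
# duplicates to expect, and how many of those will land close enough
# together on the flowcell to masquerade as optical duplicates.

def expected_duplicates(n_reads, library_complexity):
    """Expected duplicate reads under uniform sampling of C molecules:
    E[unique] = C * (1 - exp(-N / C)), duplicates = N - E[unique]."""
    unique = library_complexity * (1.0 - math.exp(-n_reads / library_complexity))
    return n_reads - unique

def expected_optical_lookalikes(n_reads, library_complexity, p_proximity):
    """Library duplicates expected to fall within the optical-distance
    threshold purely by chance. p_proximity is an assumed per-pair
    probability derived from cluster density."""
    return expected_duplicates(n_reads, library_complexity) * p_proximity
```

For a low-complexity library (say 1M reads from 100K molecules) the expected duplicate count is large, so even a small `p_proximity` yields many library duplicates that look optical, which is the scenario the proposed quality-score check is meant to rescue.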