Optical duplicates?
3.5 years ago
abascalfederico ★ 1.2k

While investigating a weird mutation-calling artefact, I found that a considerable fraction of those artefacts (20-30%, a very rough estimate) share similarities in their coordinates/tiles. We are using a conservative threshold of 2500 pixels to flag optical duplicates from NovaSeq S4 flow cells. The pairs in the following examples are further apart than 2500 pixels, yet they show striking similarities (only showing lane:tile:x:y). A separation of 1000 in tile numbers is very frequent... Update: I just read that the thousands digit of the tile number (1 or 2) indicates whether it is "top" or "bottom" (not sure what that means).

3:1338:9489:28416
3:1338:9489:12195

4:1308:18385:15890
4:2308:17861:17644

3:2630:7835:29684
3:1630:10818:30624

Does anyone have an idea of what may be going on?
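To make the comparison concrete, here is a minimal Python sketch that splits each lane:tile:x:y string and reports how the two reads of a pair differ. It assumes the commonly described four-digit tile layout (thousands digit = surface, hundreds digit = swath, last two digits = tile within the swath); that layout and the names `ReadPos`, `parse_pos` and `compare` are illustrative assumptions, not taken from any particular tool.

```python
from typing import NamedTuple


class ReadPos(NamedTuple):
    lane: int
    surface: int  # thousands digit of the tile number (assumed: 1 or 2)
    swath: int    # hundreds digit (assumed layout)
    tile: int     # last two digits (assumed layout)
    x: int
    y: int


def parse_pos(s: str) -> ReadPos:
    """Parse a 'lane:tile:x:y' string into its components."""
    lane, tile, x, y = (int(field) for field in s.split(":"))
    return ReadPos(lane, tile // 1000, (tile // 100) % 10, tile % 100, x, y)


def compare(a: str, b: str) -> dict:
    """Summarise how two read positions differ."""
    pa, pb = parse_pos(a), parse_pos(b)
    return {
        "same_lane": pa.lane == pb.lane,
        "same_surface": pa.surface == pb.surface,
        "same_swath_and_tile": (pa.swath, pa.tile) == (pb.swath, pb.tile),
        "dx": abs(pa.x - pb.x),
        "dy": abs(pa.y - pb.y),
    }


pairs = [
    ("3:1338:9489:28416", "3:1338:9489:12195"),
    ("4:1308:18385:15890", "4:2308:17861:17644"),
    ("3:2630:7835:29684", "3:1630:10818:30624"),
]
for a, b in pairs:
    print(a, b, compare(a, b))
```

For the three pairs above, this shows that the first pair sits on the same tile with identical x, while the other two differ only in the thousands digit of the tile number.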

duplicate optical • 4.0k views

2500 may be too low for NovaSeq; that number is more appropriate for HiSeq 3K/4K. @Brian Bushnell recommends 12000 for NovaSeq. As I recall, X:Y coordinates do not directly translate to pixels, and I think Illumina was not willing to give out the mapping information.


Thank you! We'll increase to 12000. How much higher can one go?

In many cases I see reads that are closer than 2500 (our threshold) but on different surfaces of the flow cell. They must be the same cluster. However, biobambam at least does not flag them if they are on different surfaces. Any idea what's going on here? Is that how other OD flagging programs work too?

If anyone can recommend a reference or post on the optimal OD thresholds that would be much appreciated too.
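As a rough illustration of why cross-surface pairs can escape flagging: optical-duplicate checks are generally described as requiring the same lane and the same tile before comparing x/y offsets against the threshold, so two reads whose tile numbers differ only in the thousands digit are never compared. The Python sketch below is a simplified stand-in under that assumption, not biobambam's actual implementation; `is_optical_duplicate` and its `ignore_surface` switch are hypothetical.

```python
def parse(s):
    """Split 'lane:tile:x:y' into four integers."""
    lane, tile, x, y = (int(field) for field in s.split(":"))
    return lane, tile, x, y


def is_optical_duplicate(a, b, max_dist=2500, ignore_surface=False):
    """Return True if two reads would count as optical duplicates.

    Assumption: reads must share lane and tile, and both the x and y
    offsets must be within max_dist. With ignore_surface=True the
    thousands digit of the tile number is dropped before comparing.
    """
    lane_a, tile_a, xa, ya = parse(a)
    lane_b, tile_b, xb, yb = parse(b)
    if ignore_surface:
        tile_a %= 1000
        tile_b %= 1000
    if lane_a != lane_b or tile_a != tile_b:
        return False
    return abs(xa - xb) <= max_dist and abs(ya - yb) <= max_dist


a, b = "4:1308:18385:15890", "4:2308:17861:17644"
print(is_optical_duplicate(a, b))                       # tiles differ
print(is_optical_duplicate(a, b, ignore_surface=True))  # surface digit ignored
```

With the second example pair from the question, the default check rejects the pair because the tile numbers differ, whereas ignoring the surface digit brings it within a 2500 threshold.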


> How much higher can one go?

Devon Ryan did some empirical testing of this over at SeqAnswers. It probably does not make sense to go much higher.

> but on different surfaces of the flow cell. They must be the same cluster.

If the reads are on different surfaces, then they are unlikely to be from the same cluster. Illumina likes to call these cluster duplicates (identical sequences in nearby wells). Just to be sure, are you using clumpify.sh from the BBMap suite for this analysis?


No, we are not using clumpify.sh. We are using biobambam.

Umm... I don't think the same read is on both surfaces, but could the machine be seeing the cluster from both sides? It really is a pattern I'm seeing repeatedly.

Reading that post, it is interesting that saturation isn't reached until 20000. I wonder how many reads you would lose with 20000, depending on your duplicate rate... I may try to reproduce Devon's analysis.
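One simple way to look at this is to sweep the distance threshold and count how many candidate duplicate pairs each value would flag. The Python sketch below is only an assumption about how such a sweep could be set up, not Devon's actual analysis; `flagged_by_threshold` is a hypothetical helper, and the toy offsets are the (dx, dy) values of the three pairs in the question with the surface digit ignored.

```python
def flagged_by_threshold(offsets, thresholds):
    """Count pairs whose x and y offsets are both within each threshold.

    offsets: iterable of (dx, dy) absolute offsets between two reads with
    identical sequence (assumed to be otherwise comparable, e.g. same lane).
    """
    offsets = list(offsets)
    return {
        t: sum(1 for dx, dy in offsets if dx <= t and dy <= t)
        for t in thresholds
    }


# Toy input only: offsets of the three example pairs from the question.
example_offsets = [(0, 16221), (524, 1754), (2983, 940)]
counts = flagged_by_threshold(example_offsets, [2500, 12000, 20000])
for threshold, n in counts.items():
    print(threshold, n)
```

On real data, plotting the flagged count against the threshold would show where the curve flattens, which is the saturation point being discussed.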


The number above was specifically meant to be used with clumpify.sh from the BBMap suite. I don't know how that will translate to biobambam.

> but the machine may be seeing the cluster from both sides?

I doubt that. My understanding is that the imaging should be precise with a laser doing the scanning.


Just want to confirm that I've consistently observed an increased number of duplicates on opposite surfaces across three sequencers in two facilities (iSeq, NextSeq, NovaSeq). See the picture, which shows the log2 of the number of duplicates between different tiles of a flow cell.

[Image: log2 of the number of duplicates between different tiles of a flow cell]
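For anyone who wants to build a similar tile-by-tile summary, a minimal Python sketch follows. It assumes the reads have already been grouped by identical sequence, with each read given as a lane:tile:x:y string; the function names are hypothetical, and this is not necessarily how the picture above was produced.

```python
import math
from collections import Counter
from itertools import combinations


def tile_pair_counts(duplicate_groups):
    """Count how often each pair of tiles shares a duplicated sequence.

    duplicate_groups: iterable of lists of 'lane:tile:x:y' strings, one
    list per set of reads with identical sequence (assumed input format).
    """
    counts = Counter()
    for group in duplicate_groups:
        tiles = [int(pos.split(":")[1]) for pos in group]
        for t1, t2 in combinations(tiles, 2):
            counts[tuple(sorted((t1, t2)))] += 1
    return counts


def log2_counts(counts):
    """Convert raw tile-pair counts to log2 scale."""
    return {pair: math.log2(n) for pair, n in counts.items()}


groups = [
    ["4:1308:18385:15890", "4:2308:17861:17644"],
    ["4:1308:100:200", "4:2308:150:260"],
    ["3:2630:7835:29684", "3:1630:10818:30624"],
]
counts = tile_pair_counts(groups)
print(log2_counts(counts))
```

An excess of duplicates between opposite surfaces would show up as high values for tile pairs that differ only in the thousands digit, such as (1308, 2308).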
