While investigating a weird mutation-calling artefact, I found that a considerable fraction of those artefacts (20-30%, very rough estimate) have similar coordinates/tiles. We are using a conservative threshold of 2500 pixels to flag optical duplicates on NovaSeq S4 flow cells. The following examples are further apart than 2500 pixels, but they show striking similarities (only showing lane:tile:x:y). A difference of 1000 between tile numbers is very frequent... Update: I just read that the thousands digit (1 or 2) indicates whether it is "top" or "bottom" in the tile (not sure what that means).
3:1338:9489:28416
3:1338:9489:12195
4:1308:18385:15890
4:2308:17861:17644
3:2630:7835:29684
3:1630:10818:30624
Does anyone have an idea of what may be going on?
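To make the pattern in the examples above easier to see, here is a minimal Python sketch that parses the lane:tile:x:y strings and compares each pair. It assumes only what is stated in the post: the thousands digit of the tile number (1 or 2) is the top/bottom indicator, and the remaining three digits identify the tile position. The names ReadPos, parse_pos and describe_pair are made up for illustration.

```python
from typing import NamedTuple


class ReadPos(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int

    @property
    def surface(self) -> int:
        # Thousands digit of the tile number: 1 = "top", 2 = "bottom".
        return self.tile // 1000

    @property
    def tile_on_surface(self) -> int:
        # Tile number with the surface digit stripped off.
        return self.tile % 1000


def parse_pos(text: str) -> ReadPos:
    """Parse a 'lane:tile:x:y' string into integers."""
    lane, tile, x, y = (int(field) for field in text.split(":"))
    return ReadPos(lane, tile, x, y)


def describe_pair(a: ReadPos, b: ReadPos) -> dict:
    """Summarise how two read positions relate to each other."""
    return {
        "same_lane": a.lane == b.lane,
        "same_tile_position": a.tile_on_surface == b.tile_on_surface,
        "same_surface": a.surface == b.surface,
        "dx": abs(a.x - b.x),
        "dy": abs(a.y - b.y),
    }


# The three example pairs from the post.
pairs = [
    ("3:1338:9489:28416", "3:1338:9489:12195"),
    ("4:1308:18385:15890", "4:2308:17861:17644"),
    ("3:2630:7835:29684", "3:1630:10818:30624"),
]

for first, second in pairs:
    print(first, second, describe_pair(parse_pos(first), parse_pos(second)))
```

For the second and third pairs this reports the same lane and the same tile position but different surfaces, which is the "separation of 1000" pattern described in the post.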
2500 may be too low a number for NovaSeq. That number is more appropriate for HiSeq 3K/4K. @Brian Bushnell recommends 12000 for NovaSeq. As I recall, X:Y coordinates do not directly translate to pixels. I think Illumina was not willing to give out the mapping information.
Thank you! We'll increase to 12000. How much higher can one go?
In many cases I see reads that are closer than 2500 (our threshold) but on different surfaces of the flow cell. They must be the same cluster. However, biobambam at least does not flag them if they are on different surfaces. Any idea what's going on here? Is that how other optical duplicate (OD) flagging programs work too?
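As a rough illustration of the behaviour described above, the sketch below flags a pair as an optical duplicate only when both reads share a lane and tile and lie within a distance threshold, with an optional switch that ignores the surface digit. This is not biobambam's (or any other tool's) actual implementation: the Euclidean distance, the function name is_optical_duplicate and the across_surfaces switch are all assumptions made for illustration.

```python
import math


def parse(text):
    """Split 'lane:tile:x:y' into four integers."""
    lane, tile, x, y = (int(field) for field in text.split(":"))
    return lane, tile, x, y


def is_optical_duplicate(a, b, max_dist=2500, across_surfaces=False):
    """Return True if two reads would be flagged under this toy rule.

    across_surfaces=True drops the thousands digit of the tile number,
    so the same tile position on the other surface counts as a match
    (an assumption, not something any tool is known to do).
    """
    lane_a, tile_a, xa, ya = parse(a)
    lane_b, tile_b, xb, yb = parse(b)
    if lane_a != lane_b:
        return False
    if across_surfaces:
        tile_a %= 1000
        tile_b %= 1000
    if tile_a != tile_b:
        return False
    # Euclidean distance in the coordinate units of the read name.
    return math.hypot(xa - xb, ya - yb) <= max_dist


a, b = "4:1308:18385:15890", "4:2308:17861:17644"
print(is_optical_duplicate(a, b))                        # strict tile match
print(is_optical_duplicate(a, b, across_surfaces=True))  # surface ignored
```

With the second example pair from the first post, the strict tile match does not flag the reads, while ignoring the surface digit puts them within the 2500 threshold.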
If anyone can recommend a reference or post on the optimal OD thresholds that would be much appreciated too.
Devon Ryan had done some empirical testing over at SeqAnswers for this. It probably does not make sense to go much higher.
If the reads are on different surfaces then they are unlikely to be from the same cluster. Illumina likes to call these cluster duplicates (identical sequences in nearby wells). Just to be sure, are you using clumpify.sh from the BBMap suite for this analysis?
No, we are not using clumpify.sh. We are using biobambam.
Umm... I don't think the same read is on both surfaces but the machine may be seeing the cluster from both sides? It is really a pattern I'm seeing repeatedly.
Reading that post, it is interesting that saturation isn't reached until 20000. I wonder how many reads you would lose with 20000, depending on your duplicate rate... I may try to reproduce Devon's analysis.
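One way to get a feel for this is to sweep the distance threshold and count how many duplicate pairs each value would flag. The sketch below is not Devon's analysis; it is a toy version that uses only the three example pairs from the first post, ignores the surface digit entirely, and assumes Euclidean distance. The function names are made up for illustration.

```python
import math

# The three example pairs from the first post (lane:tile:x:y).
EXAMPLE_PAIRS = [
    ("3:1338:9489:28416", "3:1338:9489:12195"),
    ("4:1308:18385:15890", "4:2308:17861:17644"),
    ("3:2630:7835:29684", "3:1630:10818:30624"),
]


def pair_distance(a, b):
    """Euclidean x/y distance between two reads, ignoring lane and tile."""
    _, _, xa, ya = (int(field) for field in a.split(":"))
    _, _, xb, yb = (int(field) for field in b.split(":"))
    return math.hypot(xa - xb, ya - yb)


def flagged_by_threshold(distances, thresholds):
    """Count how many pair distances fall at or below each threshold."""
    return {t: sum(d <= t for d in distances) for t in thresholds}


distances = [pair_distance(a, b) for a, b in EXAMPLE_PAIRS]
counts = flagged_by_threshold(distances, [2500, 12000, 20000])
for threshold, n in counts.items():
    print(f"threshold {threshold}: {n} of {len(distances)} pairs flagged")
```

On real data the input would be the distances between all same-sequence read pairs, and the curve of flagged pairs against threshold shows where it flattens out.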
The number above was specifically meant to be used for clumpify.sh from the BBMap suite. I don't know how that will translate to biobambam.
I doubt that. My understanding is that the imaging should be precise with a laser doing the scanning.
Just want to confirm that I've observed an increased number of duplicates on the opposite surfaces consistently across three sequencers in two facilities (iSeq, NextSeq, NovaSeq). See picture, which is log2 of number of duplicates between different tiles of a flowcell.
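For anyone who wants to build a similar tile-by-tile summary, here is a minimal sketch that counts duplicate pairs per pair of tiles and takes log2 of each count. It is not the code behind the picture: the input format (pairs of lane:tile:x:y strings already known to be duplicates of each other) and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def tile_pair_log2_counts(duplicate_pairs):
    """Map each unordered (tile, tile) pair to log2 of its duplicate count."""
    counts = Counter()
    for a, b in duplicate_pairs:
        tile_a = int(a.split(":")[1])
        tile_b = int(b.split(":")[1])
        # Sort so that (1308, 2308) and (2308, 1308) share one key.
        counts[tuple(sorted((tile_a, tile_b)))] += 1
    return {tiles: math.log2(n) for tiles, n in counts.items()}


# The example pairs from the first post, treated here as duplicate pairs.
example = [
    ("3:1338:9489:28416", "3:1338:9489:12195"),
    ("4:1308:18385:15890", "4:2308:17861:17644"),
    ("3:2630:7835:29684", "3:1630:10818:30624"),
]

for tiles, value in sorted(tile_pair_log2_counts(example).items()):
    print(tiles, value)
```

An excess of duplicates on opposite surfaces would show up as high values for tile pairs whose numbers differ by exactly 1000.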