Hello everyone,
After dealing with multiple failed Illumina sequencing runs on our NovaSeq X platform, which demultiplexed mostly into the 'unknown' bin due to an index-cycle error rate many times higher than our previous instruments, I have developed a new demultiplexing program that can recover these runs using statistical analysis. Extensive testing and validation have demonstrated that it works on a wide variety of barcode-failed runs, recovering a majority of the unknown bin in all cases except poly-G barcodes (the case where there was no recorded signal during the barcode-reading cycles). Furthermore, it accomplishes this without increasing crosstalk compared to Illumina's BCL Convert integrated demultiplexer.
This tool is compatible with all Illumina platforms (or indeed other platforms, as long as the read headers follow the Illumina format and contain the barcodes) and should generally increase yield in all cases (including mostly successful runs) when the yield is below ~93% (meaning 7% unknown), and especially in cases where there was an illumination or imaging failure on certain cycles. But it's provisionally named "NovaDemux" because the design goal was to fix persistent barcode-sequencing failures on NovaSeq X, where it can routinely increase yield from 15%-50% up to >85% for 10bp dual unique indexes. It works with non-unique pairs and single indexes too.
Even for mostly successful runs - say, 85% yield with 15% of reads sent to unknown - in my testing, often some of the individual libraries in a "successful" pool still had very low yields that were recovered by NovaDemux. In other words, Illumina's demultiplexing success rate can be highly variable between libraries in the same lane; so while increasing a lane's yield from 85% to 93% might not sound like a huge gain, that can mean that some individual libraries jumped from 5% yield to 90% yield, salvaging otherwise lost experiments.
Due to the potential commercial value, NovaDemux is under review by LBL's legal department, and it currently appears that it will be closed-source and not free for commercial use, but possibly free or reduced-cost for academic use. Right now I don't even know how much benefit it would bring on average. In our lab, depending on the product type (such as whether it is PCR-amplified), we have demultiplexing yields of anywhere from 13% to 93%. For pools with 13% yield using Illumina's demultiplexer, NovaDemux can increase the yield to 85%, changing the NovaSeq X from unusable to completely viable. For the pools where the barcode reads worked correctly, Illumina already gives a ~93% yield, so NovaDemux has little impact aside from maybe a 1% increase in yield. But I don't know how widespread these issues are outside of JGI.
So - if you have any interest in this, OR if you want to disclose some data points about what machine you use and what the %unknown is after demultiplexing, please post here or contact me privately. I can also provide free demos if you're interested in trying it out to see how much it increases your yield (something I'm also curious about); alternatively, if you have a lane with a bad yield, you can send me the headers with no data and I can run it on those to tell you what the yield (per-library and overall) would be. You can generate this by, e.g., "reformat.sh in=reads.fastq.gz out=reads.header.gz", which keeps only the read headers and strips the actual read data. The headers should look something like this:
@LH00223:25:22H5GLLT3:5:1101:1259:1032 1:N:0:TGCGCTTA+TGCGCTTA
Additionally, a file containing expected barcodes is required, with one barcode (or barcode pair) per line, like this:
AAGCGCAT+AAGCGCAT
CTAGCAAG+CTAGCAAG
GTGCTTAC+GTGCTTAC
...etc
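If you just want a quick look at your own exact-match rate before sending anything, the two inputs above are easy to work with. Here is a minimal Python sketch, assuming a headers-only file in the Illumina format shown above. The function names are hypothetical, the extra header lines and barcodes are made up for illustration, and this is plain exact matching, not what NovaDemux does.

```python
from collections import Counter

def barcode_from_header(header):
    """Return the barcode field from an Illumina-style read header."""
    # The barcode is everything after the last colon, e.g.
    # '... 1:N:0:TGCGCTTA+TGCGCTTA' -> 'TGCGCTTA+TGCGCTTA'.
    return header.rstrip().rsplit(":", 1)[-1]

def count_barcodes(header_lines):
    """Count how often each observed barcode occurs in a headers-only file."""
    return Counter(barcode_from_header(h) for h in header_lines if h.startswith("@"))

def exact_match_yield(counts, expected):
    """Fraction of reads whose barcode exactly matches an expected barcode."""
    total = sum(counts.values())
    matched = sum(n for bc, n in counts.items() if bc in expected)
    return matched / total if total else 0.0

# Illustrative input only: the first header is the example above,
# the others are invented variations of it.
headers = [
    "@LH00223:25:22H5GLLT3:5:1101:1259:1032 1:N:0:TGCGCTTA+TGCGCTTA",
    "@LH00223:25:22H5GLLT3:5:1101:1260:1032 1:N:0:AAGCGCAT+AAGCGCAT",
    "@LH00223:25:22H5GLLT3:5:1101:1261:1032 1:N:0:AAGCGCAT+AAGCGGAT",
    "@LH00223:25:22H5GLLT3:5:1101:1262:1032 1:N:0:GGGGGGGG+GGGGGGGG",
]
expected = {"AAGCGCAT+AAGCGCAT", "CTAGCAAG+CTAGCAAG", "GTGCTTAC+GTGCTTAC"}

counts = count_barcodes(headers)
print(exact_match_yield(counts, expected))
```

In practice you would read the header lines from your decompressed headers file and the expected barcodes from the one-per-line barcode file instead of hard-coding them.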
Note that NovaDemux is for processing fastq files. It does not do base calling and does not use BCL or image files. As a result, you can use it to reprocess old data as long as you have the full lane of fastqs (including the unknown file); so that irreplaceable library that failed demultiplexing a year ago? It may be salvageable, as long as you have the fastqs for that lane.
Supporting figures, taken from some slides I prepared for internal use.
HDist 0, 1, and 2 correspond to 0, 1, and 2 mismatches allowed in demultiplexing. The green bar is NovaDemux (running in its standard 'probabilistic' mode). Lane 1 was a particularly bad sequencing failure; the reads were fine, but the barcodes had too many errors to demultiplex. It's worth noting that in this figure the Hamming-distance results are not actually Illumina results, but come from running NovaDemux in Hamming-distance mode, which is approximately the same thing. These dual unique 10bp barcodes were designed to ensure at least Hamming distance 2 between any pair, but demultiplexing them while allowing 2 mismatches caused substantially increased crosstalk, and our applications tend to be very crosstalk-sensitive.
This slide shows the crosstalk (barcode-hopping) after demultiplexing using either 1-mismatch demultiplexing or NovaDemux. Despite the increased yield, NovaDemux generally reduces crosstalk. This was a good plate for crosstalk analysis because, of the 89 libraries, there were ~82 distinct species, allowing reference-based quantification of demultiplexing results. That's not possible for most pools, with resequencing/RNA-seq of one or two organisms, or metagenomes with no reference. NovaDemux is called BBTools here because it's an early slide.
NovaDemux versus Illumina per-library read count for 72 pools across 9 runs; 3100 libraries total. The circled area is libraries that had low yield using Illumina's software but were recovered with NovaDemux. The X-axis values are all actual read counts from actual Illumina demultiplexing. The actual settings for Illumina's demultiplexing seemed to be allowing 1 mismatch in most cases, but I'm told Illumina sometimes adjusts that on a per-library basis depending on the other barcodes present in the pool.
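For anyone unfamiliar with what "allowing N mismatches" means in practice, here is a minimal Python sketch of Hamming-distance barcode assignment. It is an illustration only, not Illumina's or NovaDemux's implementation: the function names are hypothetical, and it assumes the mismatch budget applies to each index separately and that a read is assigned only when exactly one expected barcode qualifies.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def assign(observed, expected, max_mismatches):
    """Return the single expected barcode within the mismatch budget, else None."""
    obs_parts = observed.split("+")
    hits = []
    for candidate in expected:
        cand_parts = candidate.split("+")
        if len(cand_parts) != len(obs_parts):
            continue
        # Assumption: each index (i7, i5) gets its own mismatch budget.
        if all(len(o) == len(c) and hamming(o, c) <= max_mismatches
               for o, c in zip(obs_parts, cand_parts)):
            hits.append(candidate)
    # Ambiguous or unmatched barcodes go to the unknown bin (None).
    return hits[0] if len(hits) == 1 else None

expected = ["AAGCGCAT+AAGCGCAT", "CTAGCAAG+CTAGCAAG", "GTGCTTAC+GTGCTTAC"]

print(assign("AAGCGCAT+AAGCGGAT", expected, 0))  # one error in index2, no mismatches allowed
print(assign("AAGCGCAT+AAGCGGAT", expected, 1))  # same barcode, one mismatch allowed
print(assign("GGGGGGGG+GGGGGGGG", expected, 2))  # poly-G barcode
```

Raising the budget rescues more reads but also shrinks the safety margin between expected barcodes, which is where the extra crosstalk at 2 mismatches comes from.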
Very few small/medium sequencing centers likely have NovaSeq X. This is the first I have heard of this issue. But it would be interesting to see if this post sets off an "Aha" moment for someone who has been encountering this issue.
Can you tell us how many samples JGI pools with 10x10 indexes, and whether they are made into a super pool that runs across all lanes? Also curious about what the data yield is per lane in billions of reads (supposed to be total 25 B per flowcell?).
I examined nine 10B runs and five 25B runs (we just started getting 25B flowcells a few months ago). I did not pay close attention but it did not seem to me that the level of multiplexing varied much between the flowcells even though the 25Bs yielded many more reads (~18B reads per 10B flowcell and ~54B reads per 25B flowcell; that's reads, not pairs, so the numbers of active dots approximately match the flowcell names). Those numbers are across 8 lanes; so ~2B and ~7B reads per lane. But the lanes were multiplexed anywhere from 8-way to 400-way; mostly around 30-way, I think. Not all of them were dual unique; some used all the same index2, for example, due to the nature of the experiment.
As far as I know we don't do whole-flowcell super-pools; all lanes are independent with different barcodes. Occasionally we have 2 lanes of the same pool, often for platform validation, but normally the different lanes are used for totally different experiment types which may lead to one lane failing while the others look fine.
Hi everyone, NovaDemux is now available and free for use in the latest release, BBTools 39.08. Since NovaDemux itself is closed-source this means that I set up a server where the barcode-processing logic occurs; the client side just counts barcodes and sends counts to the server, which sends back a map of which observed barcode should be demultiplexed into which file. The command is the same, e.g.:
So please try it out on any lane (particularly NovaSeq X) that has a high unknown rate after demultiplexing.
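To picture the client/server split described above: the client only tallies barcodes, and what comes back is a lookup table from observed barcode to destination. Here is a rough Python sketch of that data flow. It is not the actual BBTools client or its protocol: `stand_in_server` is a hypothetical local function replacing the real network call, and its exact-match logic is only a placeholder for the closed-source assignment step.

```python
from collections import Counter

def count_barcodes(barcodes):
    """Client side: tally how often each observed barcode occurs."""
    return Counter(barcodes)

def stand_in_server(counts, expected):
    """Hypothetical stand-in for the remote step.

    Takes barcode counts and returns a map {observed barcode: expected barcode}.
    Placeholder logic: exact matches only.
    """
    return {bc: bc for bc in counts if bc in expected}

def route_reads(barcodes, assignment):
    """Client side: bin each read by looking up its barcode in the returned map."""
    bins = Counter()
    for bc in barcodes:
        bins[assignment.get(bc, "unknown")] += 1
    return bins

# Made-up observed barcodes, one per read.
observed = [
    "AAGCGCAT+AAGCGCAT",
    "AAGCGCAT+AAGCGCAT",
    "CTAGCAAG+CTAGCAAG",
    "GGGGGGGG+GGGGGGGG",
]
expected = {"AAGCGCAT+AAGCGCAT", "CTAGCAAG+CTAGCAAG", "GTGCTTAC+GTGCTTAC"}

counts = count_barcodes(observed)               # what would be sent
assignment = stand_in_server(counts, expected)  # what would come back
bins = route_reads(observed, assignment)
print(dict(bins))
```

Note that in this picture only barcode counts leave the client; the read sequences themselves are never part of the exchange.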