There's another new error mode in Illumina's NovaSeqX line of sequencers. Fortunately, this mainly affects assembly rather than resequencing so it won't impact most people. Unfortunately, I work at JGI which mainly does assembly, so it does impact me.
The issue is artifactual poly-G sequences. These have been common since Illumina went to a 2-dye system where G is the 'dark' base (IIRC they had a platform with C being the dark base, but it's G for NovaSeqX). So in the past we dealt with it by adding a step to trim poly-Gs on the end of reads, which is where they occurred (unrelated to the poly-G that occurred due to short-insert/adapter-dimer reads sequencing off the end of the adapter). Now, however, they are occurring everywhere in the read, and are not pure - they are interspersed with non-G bases, making detection very difficult. Furthermore, they are so abundant - affecting ~2% of the reads in the library I'm currently examining - that they get added to assemblies.
Because they are so common, because they are not pure homopolymers, because Illumina no longer reports valid quality scores (in fact, most of this problem can be eliminated by recalibrating the quality scores), and particularly because SPAdes aggressively assembles erroneous low-depth stuff even in isolate mode, it's really hard to get rid of all of these artifactual reads. So I had to write a new tool and alter some of the existing ones to get rid of enough to ensure artifactual poly-Gs did not make it into the assembly (of an unknown organism, so you can't check against a reference), without any false-positive removals (which reduce assembly contiguity due to eliminating real genomic poly-G sequences).
First off, I wrote a new tool called PolyFilter. Usage: polyfilter.sh in=reads.fq.gz out=polyfiltered.fq.gz
This gets rid of reads with artifactual poly-Gs by using a 2-part classification system and 4 different classifiers. The classifiers are read entropy (aka complexity; generally these artifactual reads have low complexity), kmer depth (usually the error-mode reads have very low depth somewhere), quality score (this is only useful for recalibrated reads), and poly-G content (obviously, poly-G artifactual reads have a poly-G in them somewhere). Specifically, each classifier has 2 thresholds, low and high. If any read is considered "high" by one of the thresholds it is automatically discarded - so, for example, if a read has an average entropy below 0.25, that would be considered "high" under the entropy classification, and automatically discarded. This doesn't really remove many reads, though. The only classifier that removes a lot of reads based on the "high" threshold is the poly-G one, which uses 29 as "high" and 20 as "low". So, any read with a poly-G of at least length 29 will be removed summarily, as that is not genomic.
The "low" thresholds require another "low" threshold to be met, which is partly why I named it "polyfilter". Specifically, if a read meets the "low" threshold for quality, entropy, or depth (by default, average quality under 12.5, entropy under 0.67, or kmer depth of 2 or less for at least 15% of the read's kmers), then it will be discarded only if it also has a homopolymer meeting the "low" threshold (default length 20). This means that only suspicious reads that are suspicious from multiple angles are discarded, and prevents any real genomic sequences from being discarded (except in super-low-depth stuff like metagenomes, but depth-1 stuff mostly can't be assembled anyway).
Furthermore, the default polymer lengths 20 (low) and 29 (high) are allowed to contain non-polymer-bases, up to a certain percent, controlled by the "purity" flag (default 0.85, so up to 15% of the sequence can be non-G). Also, you can choose which bases you want to filter; by default "polymers=GC" so it will look for poly-G or poly-C (not poly-GC) but sometimes there are poly-A and poly-T artifacts too; I just did not set "polymer=GCAT" as the default because that interferes with RNA-seq, but as long as you are dealing with DNA it's best to look for all 4.
It turns out that you can simply run "bbduk.sh in=reads.fq out=clean.fq k=29 hdist=2 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG" and that will get rid of most of the poly-G junk - around 95% of it. But that was not sufficient to keep artifactual poly-Gs out of our assemblies, which is why I wrote PolyFilter.
While PolyFilter is pretty good, and it removes ~97% of the poly-G artifactual reads in my tests - it is still not sufficient! So in the latest (39.09) release of BBTools you can look at assemble_polyg_isolate_v1.sh and assemble_polyg_meta_v1.sh (under pipelines) for the additional steps necessary to get a poly-G free assembly from NovaSeqX data. Essentially, the process is:
1) Adapter-trim reads. This is an essential first step because short-insert reads have poly-Gs for a different reason, and also, aligning reads with adapter sequence will give you incorrect results that ruin quality-score recalibration. 2) Recalibrate quality scores. This is essential. This requires a mapped sam file. It can be from pulling your spiked-in phix out of the lane and mapping it to the phix reference, or assembling an isolate from the library and mapping the library to it, or if you are doing resequencing, simply mapping the reads to the reference. The reads don't need to be from the same library but they DO need to be from the same run and lane. 3) FilterByTile, removes reads from bad areas of the flowcell. 4) PolyFilter, removes most of the poly-G reads. 5) BBDuk poly-G trimming (improved in v39.09), removes residual poly-Gs near the read ends (which is a large portion of them).
After that there are more steps that are only of interest to people doing genome assembly which is a tiny fraction of Biostars, but are necessary to ensure an optimal assembly free of nongenomic poly-G artifacts. The above steps should be of general relevance to anyone using a NovaSeqX and applicable to any library type. And a good idea idea for people using other Illumina platforms, too. The above steps won't completely remove all the poly-G reads - you still need to do some error-correction and depth-filtering to accomplish that - but it reduces them to a very low level that won't interfere with your results in most cases, and when doing resequencing (sequencing of an organism that already has a reference), it tends to not matter too much because the artifactual poly-G reads tend to just not map.
By the way, I've spent the last 3 weeks validating this approach. And even on a 6Mbp, 70% GC organism (which would be expected to have a lot of natural poly-G and poly-C sequences due to the GC content) it has zero false-positive read removals (from PolyFilter), so it should be quite safe (false-positives in step 3 and 5 don't matter because it is impossible to eliminate coverage of a genomic feature with those commands, by design). I'm a strong adherent to the philosophy of "First, do no harm" when preprocessing reads so I always ensure that they improve the assembly of multiple libraries before espousing them. In this case Illumina's data is so bad that some harm is warranted since you can't trust any raw NovaSeqX read that contains a poly-G, but I'm highly confident that this approach will not damage anything you put through it, and will generally lead to a better assembly. Or variant-calling, or whatever you want to do. It will not remove all of the artifactual poly-G reads, especially from from single-cell, metagenomic, or super-low-depth resequencing experiments because some steps are depth-sensitive, so either you need to have >20x average depth or use the "assemble_polyg_meta_v1.sh" as a reference, since that is depth-agnostic.
Thank you, Brian!
I just got my metagenomic sequencing data from a NovaSeqX run and I am gonna test this right now.
It would be interesting to see prevalence of this in data from other sites. Since there are not that many NovaX out in wild as of now this is a very specific issue. I am kind of surprised that Illumina has not addressed this yet.
I should receive more metagenomic sequencing data in two months from a different company that also uses the NovaSeqX.
Please let us know the rate of poly-Gs in your data; I've only tested our machine so far, but it has affected every single run. BBDuk with "k=29 hdist=2 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG" is a good estimate, though it undercounts a little since it is a "safe" filter (no false positives). "k=25" gives a more accurate estimate but is no longer safe for filtering. Maybe I'll add a little more parsing so you can just say "literal=polyg" because typing all those Gs is getting annoying...
If you're going to assemble the data, take a look at assemble_polyg_meta_v1.sh. Because it uses less stringent depth-based filtering than the isolate version, I can't guarantee it will totally eliminate artifactual poly-Gs in the assembled contigs like (I think) I can for isolates, but it should greatly reduce it. Our current isolate assemblies (prior to the poly-G removal script) were often getting a majority of their contigs infected with poly-Gs.
I just added "literal=poly*" as a BBDuk flag. That will be in v39.10. It can be "literal=polyg" or "literal=polygc" (which is equivalent to "literal=GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC"). Not actually necessary, but it's kind of embarrassing to have to type really long command lines. So for all of them you would type "literal=polya,polyc,polyg,polyt". But actually you don't need to do all 4 because by default (unless you set rcomp=f) BBDuk will also look for the reverse-complement, so "literal=polyg,polya" is sufficient. But be careful about cleaning poly-As because they occur naturally in some RNA-seq libraries.
Hi Brian,
Thanks for all the great work.
Any info about whether this effect is related to the new XLEAP chemistry which IIRC still has G as the dark base but now switched C to the dual-color base instead of A? Maybe someone has a comparison to the old SBS.
Probably not. I have not looked specifically but on NextSeq 2K where XLEAP is now available we have not heard any complaints. This may be a NovaSeqX specific thing. Or perhaps specific to how JGI makes/runs their libraries. We may get a clue from @andres.
From multiQC the percentage of polyG (overrepresented sequences) range from 0% to 13%
I did a quick try with just two samples by running not the entire pipeline but only the following steps:
I can tell you that most of these polyG artifact are likely removed during the clumpify step because in step 3 and 4 the total number of reads removed are less than 0.26% and 0.07%
I will keep you updated after I run the entire pipeline, but if you need it I can share in private a couple of raw data to play around
Thanks! I would not have thought that Clumpify's optical duplicate removal would have much effect, but maybe it does, since we still don't know the mechanism of the poly-G error mode. I'll have to test that. It could also be that your library was not affected as badly as some. Our 2024 data shows poly-G rates varying from ~0.1% to ~10%, centered at around 1% on a log scale, so there is substantial run to run variability.
Incidentally, it seems to affect R2 at a much higher rate than R1 (~3x for the library I'm currently looking at).
This is from a single run and just from our 18 samples. The average polyG rate was 4.4% (multiQC report).
I did a co-assembly with Megahit without taking care of the artifacts, and the number of contigs affected by polyG was 2.36% (22050 contigs out of 934694).
The min contigs len. was set to 500 bp, so I believe more polyG artifacts can be found in contigs shorter than that. After a quick check of the bbduk.sh output, most of the polyG artifacts were located at the start or the end of the contigs.
2.36% is really good; I have not investigated Megahit but SPAdes loves to assemble these things. We only use Megahit when SPAdes runs out of memory.
It's true that most poly-Gs are located on the contig ends, but it's not entirely true, which makes it difficult. I'll investigate SPAdes vs Megahit in more detail. It could be that we need to move to Megahit for metagenomes. It would be very helpful if you could compare poly-G contig rates from both SPAdes and Megahit.
a co-assembly with spades in --meta mode run out of memmory (Max RAM available is 365 GB). But I can compare the two assembler on the sample that seems to be more affected by the poly-Gs
You can run BBCMS to clean up the reads a bit prior to assembly... we do that for all of our big metagenomes. Something like:
bbcms.sh in=reads.fq out=ecc.fq ecc tossjunk mincount=2 hcf=0.6
Of course there are other steps too, like adapter trimming and human removal, all of which reduce the kmer cardinality of the dataset (which is the driver of memory usage). You can see the kmer cardinality with "loglog.sh in=reads.fq", which is a direct predictor of the amount of memory Spades requires (I don't remember the exact multiplier though).
I ran some tests on the library I'm currently working with.
Clumpify optical duplicate removal on raw reads: 5.815% Clumpify optical duplicate removal on adapter-trimmed reads: 5.930%
Poly-G detection (k=29, hdist=2) on raw reads: 2.34% Poly-G detection on adapter-trimmed reads: 1.67%
Poly-G detection on deduped raw reads: 2.48% Poly-G detection on deduped adapter-trimmed reads: 1.77%
The rates of poly-Gs on deduplicated reads are higher, which indicates that for this library, deduplication does not mitigate the issue, and poly-G reads are more likely to have random sequence (not replicated in nearby dots).
My bad, you are actually right Brian. I was reading the log file wrong.
The other sample the total number of deduplicated reads discarded reads was 5.6%
Thanks for the update; all information is useful since we still don't know the origin of this problem.
There are a lot of poly-g reads. I have no idea why there are so many poly-c reads. There really shouldn't be any. You can do "polymers=GCAT" if you want, to detect all of them, and generally the A and T ones are low (unless you are doing some kind of RNA-seq poly-A pulldown) so it's a good indicator of artifactual problems.
In your case... That's a lot of poly-Gs. Like I said, our data varied between 0.1% and 10% on an exponential scale (so even if the mode is 1%, the average would be higher) but wow, that's high.
OMG, I've been unraveling this problem in my NovaseqX data for months and going crazy, so many failed QC / assemblies. I was just coming to the same conclusion about these polymer artifacts, everything is as you say- and then I find you posted all this some weeks ahead of me! Thank you so much Brian, I'm gonna try this asap, fingers crossed. Glad to know I'm not totally nuts here!
this post was pure gold for me.
All the seq. companies that I use for my projects now have a NovaSeqX, and these are big companies. So, there are hundreds of users out there with NovaSeqX data who are probably unaware of this problem.
Such users should be letting the sequencing companies/Illumina know about the issue directly. I know Brian has tried his best from JGI end but more people Illumina hears from (and more wide publicity this gets) will help nudge them toward creating a lasting solution.
While BBMap (actually @Brian) helps, this seems like a fundamental NovaSeqX technology issue that Illumina needs to address at source.
Well, I presume, several sequencing facilities may already be in contact Illumina regarding this matter without disclosing that. Such information typically falls under non-disclosure agreements.
Before publicly raising concerns, one wants to be confident that it is not an operator error or attributable to a single batch of bad consumables, but something general. There are also not too many operators that can underpin a generalized error by comparing more than one device, or are able to sequence the same library on...say...a NovaSeq 6000 and a NovaSeq X Plus.
Also mind that commercial operators, who need to offer competitive prices, likely prefer maintaining a good relationship with Illumina and favorable discounts, rather than stirring up allegations that may jeopardize Illuminas stock price?
If they are doing this then perfect. Was talking with Brian about this for over a year and originally it was unclear if this was an issue specific for metagenomic libraries (which is what they seem to mostly do on NovaX). But it does seem like others have this problem as well. It is unclear if people have been talking with Illumina/their sequencing provider or if many keep thinking that this is an issue with their machine/data and thus keeping it to themselves and try to find individual workarounds.