Oxford Nanopore reads from 10.4.1 flow cells have an accuracy of about 99%. After HERRO (https://www.biorxiv.org/content/10.1101/2024.05.18.594796v2) the error rate drops by a further 1-2 orders of magnitude. I assembled several eukaryotic genomes from such reads (coverage ~40x) with Hifiasm and analyzed the contigs with BUSCO. I also polished the contigs with NextPolish (using Nanopore and Illumina reads) and repeated the BUSCO analyses. The BUSCO results before and after polishing were the same. It seems that short reads are no longer required for genome polishing. What do you think about this?
A small note: for the correction I didn't use HERRO itself, but "dorado correct", which is the module of Dorado (https://github.com/nanoporetech/dorado) that is based on HERRO.
Can you clarify which model was used with dorado: HAC or SUP? SUP calling takes almost 2-3x as long, and afaik ONT recommends sticking with HAC for most use cases. SUP calling does appear to improve the quality of the data, though.
Thanks. All of this is likely going to be dependent on the genome being sequenced and its characteristics. I assume you are not referring to human genomes in this post.
colindaven: Have you ever checked whether dorado correct generates better results (with, say, HAC data) than calling with "super accuracy" in the first place? In other words, are those two operations equivalent? If not, is dorado correct able to add value on top of super accuracy calls?
Edit: Based on shelkmike's note above, it looks like dorado correct can/should only be used with SUP data.
"dorado correct" undoubtedly adds additional value to SUP reads. The "99% accuracy" that I mentioned in the post was for SUP reads. However, SUP reads after "dorado correct" have accuracy around 99.9%. I tested this by aligning reads to the plastid genome of the species that I study now (alignment to the nuclear genome would give less accurate results because of heterozygosity).
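For anyone who wants to put such numbers on a common scale, here is a minimal sketch of how identity from alignment counts maps to a Phred-style Q value. The function names and the example counts are made up for illustration; they are not output from dorado or from any aligner, and the identity definition used here (matches over all aligned columns) is only one of several in use.

```python
import math

def identity(matches, mismatches, insertions, deletions):
    """Fraction of aligned columns that are matches (one common definition)."""
    aligned = matches + mismatches + insertions + deletions
    if aligned == 0:
        raise ValueError("no aligned columns")
    return matches / aligned

def phred(accuracy):
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1 - accuracy)

# Hypothetical counts for reads aligned to a plastid reference.
acc = identity(matches=9990, mismatches=4, insertions=3, deletions=3)
print(f"identity = {acc:.4f}, Q = {phred(acc):.1f}")

# 99% corresponds to about Q20, 99.9% to about Q30.
print(f"{phred(0.99):.1f} {phred(0.999):.1f}")
```

Going from 99% to 99.9% is one order of magnitude fewer errors, i.e. roughly Q20 to Q30.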
The fraction of the genome that you analyse with BUSCO is limited: it only looks at the proteome (== the set of regions to be 'translated' into proteins), and the majority of a typical genome is not included in that set! So you will miss all of that when focusing only on the BUSCO results.
Moreover, you will not necessarily pick up even 'errors' in the proteome, as single-nucleotide differences might not affect the resulting protein (== wobble positions), and even if they do, BUSCO might still flag the protein as OK (== present) and count it as such.
As stated by others, tools like QUAST already do a more comprehensive analysis.
Bottom line: yes, accuracy is high to very high nowadays, but that does not eliminate the need for polishing, or better said the advantage of still doing it, preferably with highly accurate short reads. Depending on your goal or the level of quality you pursue, you might benefit not at all, slightly, or highly from doing it. (E.g., if you're only interested in a global picture of the genome, there is little to gain from it.)
EDIT: depending on your goals polishing has actually never been a strict requirement ...
Errors in the consensus of Nanopore reads are predominantly indels (https://pubmed.ncbi.nlm.nih.gov/38978005/). So, when they occur in genes, they would usually cause frameshifts, thus affecting BUSCO results.
I've directly observed miniprot reading over indels in reading frames and reporting the protein as present in Pangene. So miniprot lacks sensitivity in this use case.
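To make the contrast with silent substitutions concrete, here is a toy sketch comparing a third-position (wobble) substitution with a one-base deletion in a homopolymer run. The coding sequence is invented for illustration, the translation uses the standard genetic code, and whether BUSCO actually flags such a gene depends on the tool's own scoring.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(seq):
    """Translate complete codons in frame 0; a trailing partial codon is ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

# Invented toy CDS: ATG GCT GAA AAA CTG TAA
cds = "ATGGCTGAAAAACTGTAA"

# Wobble substitution: GCT -> GCC, both encode alanine.
silent = cds[:5] + "C" + cds[6:]

# One A dropped from the AAAAA homopolymer run: everything downstream shifts frame.
deleted = cds[:7] + cds[8:]

for name, seq in [("original", cds), ("wobble SNP", silent), ("1-bp deletion", deleted)]:
    print(f"{name:14s} {translate(seq)}")
```

The substitution leaves the protein untouched, while the deletion changes every downstream residue and loses the stop codon.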
Regarding the HAC-or-SUP question: I used SUP. The article about HERRO states that its neural network was trained on SUP reads, so I doubt that it will work equally well on HAC reads.