Question

Are short reads still required for polishing?

4 months ago

shelkmike ★ 1.5k

Oxford Nanopore reads from 10.4.1 flow cells have an accuracy of about 99%. After HERRO (https://www.biorxiv.org/content/10.1101/2024.05.18.594796v2), the error rate decreases by a further 1-2 orders of magnitude. I assembled several eukaryotic genomes from such reads (coverage ~40x) with Hifiasm and analyzed the contigs with BUSCO. I also polished the contigs with NextPolish (using both Nanopore and Illumina reads) and repeated the BUSCO analyses. The BUSCO results before and after polishing were the same. It seems that short reads are no longer required for genome polishing. What do you think about this?

polishing assembly Nanopore Dorado Illumina • 1.1k views

ADD COMMENT • link updated 4 months ago by lieven.sterck 15k • written 4 months ago by shelkmike ★ 1.5k

A small note: for the correction I didn't use HERRO itself, but "dorado correct", which is the module of Dorado (https://github.com/nanoporetech/dorado) that is based on HERRO.

ADD REPLY • link 4 months ago by shelkmike ★ 1.5k

"Oxford Nanopore reads from 10.4.1 flow cells have an accuracy of about 99%."

Can you clarify which model was used with dorado: HAC or SUP? SUP calling takes roughly 2-3x as long, and afaik ONT recommends sticking with HAC for most use cases. SUP calling does appear to improve the quality of the data, though.

ADD REPLY • link 4 months ago by GenoMax 150k

SUP. The article about HERRO states that its neural network was trained on SUP reads. Thus, I doubt that it will work equally well on HAC reads.

ADD REPLY • link 4 months ago by shelkmike ★ 1.5k

Thanks. All of this is likely going to be dependent on the genome being sequenced and its characteristics. I assume you are not referring to human genomes in this post.

ADD REPLY • link 4 months ago by GenoMax 150k

score 0 · Answer 1 · 2024-11-27

4 months ago

colindaven 7.4k

Yes, accuracy has improved massively this year with Q26 simplex reads and dorado correct. But are the assemblies perfect? And what is perfect?

It depends on the genome. Check the gene, transcript or protein level accuracy of the assembled genomes. Your mileage will vary.

I've also found QUAST to be good for comparing pre and post polishing assemblies. BUSCO is not informative at this quality level, as you note.

ADD COMMENT • link 4 months ago by colindaven 7.4k

colindaven: Have you ever checked whether dorado correct generates better results (with, say, HAC data) than calling with "super accuracy" in the first place? In other words, are those two operations equivalent? If not, is dorado correct able to add value on top of super accuracy calls?

Edit: Based on shelkmike's note above, it looks like correct can/should only be used with SUP data.

ADD REPLY • link 4 months ago by GenoMax 150k

"dorado correct" undoubtedly adds additional value to SUP reads. The "99% accuracy" that I mentioned in the post was for SUP reads. However, SUP reads after "dorado correct" have accuracy around 99.9%. I tested this by aligning reads to the plastid genome of the species that I study now (alignment to the nuclear genome would give less accurate results because of heterozygosity).

ADD REPLY • link 4 months ago by shelkmike ★ 1.5k
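
For readers comparing the figures quoted in this thread (about 99% for SUP reads, Q26 simplex reads, around 99.9% after "dorado correct"), accuracy and Phred quality are related by Q = -10 * log10(error rate). A minimal Python sketch of that conversion follows; the function names are illustrative only and are not part of dorado or any other tool mentioned here.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (strictly between 0 and 1) to a Phred Q score."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred Q score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# Accuracies quoted in the thread, expressed on the Phred scale.
for acc in (0.99, 0.999):
    print(f"{acc:.1%} accuracy ~ Q{accuracy_to_phred(acc):.0f}")

# The Q26 figure mentioned for simplex reads, expressed as an accuracy.
print(f"Q26 ~ {phred_to_accuracy(26):.2%} accuracy")
```

On this scale, 99% corresponds to roughly Q20 and 99.9% to roughly Q30, so each extra "nine" is a tenfold drop in error rate.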

Exactly. Also, for completeness: I have never tried using dorado correct with HAC basecalling mode.

ADD REPLY • link 4 months ago by colindaven 7.4k

score 0 · Answer 2 · 2024-11-27

4 months ago

lieven.sterck 15k

The fraction of the genome that you analyse with BUSCO is limited: it only looks at the proteome (== the set of regions to be 'translated' into proteins), and the majority of a typical genome is not included in that set! You will miss all of that when focusing only on the BUSCO results.

Moreover, you will not necessarily pick up even the 'errors' in the proteome: single nucleotide differences might not affect the resulting protein (== wobble positions), and even if they do, BUSCO might still flag the protein as OK (== present) and count it as such.
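
To make the wobble-position point concrete, here is a minimal Python sketch. The sequences are made up for illustration, and the codon table is only a small excerpt of the standard genetic code; it shows the translation logic, not how BUSCO itself scores a gene.

```python
# Excerpt of the standard genetic code: only the codons used below.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "GAT": "D", "GAC": "D",                          # aspartate
    "GAA": "E", "GAG": "E",                          # glutamate
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


true_cds = "GCTGATGAA"      # hypothetical true sequence: Ala-Asp-Glu
wobble_error = "GCCGATGAA"  # third-position error in codon 1 (GCT -> GCC)
missense_error = "GCTGAAGAA"  # third-position error in codon 2 (GAT -> GAA)

print(translate(true_cds))
print(translate(wobble_error))    # same protein: the error is invisible at protein level
print(translate(missense_error))  # one residue changes, but the protein is still "present"
```

The first error leaves the protein unchanged, and the second changes a single residue, which a presence/absence check on the protein could still accept.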

As stated by others, tools like QUAST already do a more comprehensive analysis.

Bottom line: yes, the accuracy is high to very high nowadays, but that does not eliminate the need for polishing (or, better said, the advantage of still doing it), preferably with super accurate short reads. Depending on your goal and the level of quality you pursue, you might benefit not at all, slightly, or highly from doing it anyway (e.g. if you're only interested in a global picture of the genome, there is little to gain from doing it).

EDIT: depending on your goals polishing has actually never been a strict requirement ...

ADD COMMENT • link 4 months ago by lieven.sterck 15k

Errors in the consensus of Nanopore reads are predominantly indels (https://pubmed.ncbi.nlm.nih.gov/38978005/). So, when they occur in genes, they would usually cause frameshifts, thus affecting BUSCO results.

ADD REPLY • link 4 months ago by shelkmike ★ 1.5k

I've directly observed miniprot reading through indels in reading frames and reporting the protein as present in Pangene. So miniprot lacks sensitivity in this use case.

metaeuk, the mapper used in BUSCO, appears to require intact reading frames though. https://github.com/soedinglab/metaeuk

In any case, most of your errors will be outside genes. Do you care about repeats? Promoters? Enhancers?

What is your goal - just a "good enough" genome for comparison, or a real reference genome to be used extensively for SNP calling etc?

ADD REPLY • link 4 months ago by colindaven 7.4k

True to some extent, indeed, but all indels whose length is a multiple of 3 will still go unnoticed :-)

And even truncated genes will be counted by BUSCO if they do not become too short ...
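
The multiple-of-3 point comes down to simple reading-frame arithmetic, sketched below in Python. The helper name `causes_frameshift` is made up for illustration and is not part of BUSCO, miniprot, or metaeuk; it says nothing about how those tools actually score a gene.

```python
def causes_frameshift(indel_length: int) -> bool:
    """Return True if an insertion or deletion of this many bases shifts the reading frame."""
    if indel_length <= 0:
        raise ValueError("indel_length must be a positive number of bases")
    # Codons are 3 bases long, so only lengths that are not a multiple of 3
    # move the downstream codons out of frame.
    return indel_length % 3 != 0


for length in (1, 2, 3, 4, 6):
    effect = "frameshift" if causes_frameshift(length) else "in-frame (adds or removes whole codons)"
    print(f"{length} bp indel: {effect}")
```

An in-frame indel still adds or removes amino acids, but the rest of the protein is translated as before, so a check that only looks for an intact reading frame would not flag it.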

ADD REPLY • link 4 months ago by lieven.sterck 15k