Question

For high-quality telomere-to-telomere assemblies, is short-read polishing still necessary?

19 months ago

Mark ▴ 30

I'm new to DNA sequencing, but our lab has bought a lot of nanopore equipment to sequence algae samples de novo. I'm not caught up on the genome assembly software literature, but I know that nanopore assemblies are plagued by homopolymer indel errors. Are current assembly tools/pipelines robust enough to mitigate homopolymer errors, assuming a read depth of around 50X across the genome? I'm using R10 chemistry, if that makes a difference.

Thanks.

genome illumina sequencing nanopore wgs • 1.6k views

updated 19 months ago by Buffo ★ 2.4k • written 19 months ago by Mark ▴ 30

score 2 · Accepted Answer · 2023-09-14

19 months ago

colindaven 7.4k

Polishing with Illumina reads using tools like HyPo and Racon is still standard practice. Also use Medaka for the initial long-read polishing.

That said, ONT data has got massively better in the last two years and duplex will be another big advance.
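Since homopolymer indels are the specific worry in the question, one rough way to see what polishing changed is to tally homopolymer runs in the draft and the polished contigs and compare them. The snippet below is only a minimal sketch of that idea: `homopolymer_runs` and the `min_len=5` threshold are hypothetical choices for illustration, not part of HyPo, Racon, or Medaka.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=5):
    """Return (start, base, length) for every homopolymer run >= min_len.

    min_len=5 is an arbitrary illustrative threshold, not a published cutoff.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        # Only report runs of real bases; skip runs of N or other symbols.
        if length >= min_len and base in "ACGT":
            runs.append((pos, base, length))
        pos += length
    return runs


# Toy sequence with one run of six A's starting at 0-based position 4.
print(homopolymer_runs("ACGTAAAAAAGGC"))
```

Running this on the same contig before and after polishing, and looking at where run lengths differ, gives a quick feel for how many homopolymer calls the short reads actually changed.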

score 1 · Accepted Answer · 2023-09-14

In short, yes. Best practices for generating T2T diploid genome assemblies presently use HiFi and ONT-UL reads to build the genomic basis, then polish using, for instance, parental short-read data, Hi-C, or Strand-seq (for phasing and for repeat regions in the resulting graph). Verkko and hifiasm both achieve ~Q60 using this recipe alone, with a few assembly errors remaining. Polishing draws on shorter but more accurate read sets in addition to the long reads.
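To put ~Q60 in perspective: assuming the usual Phred convention (Q = -10 log10 of the per-base error probability), Q60 corresponds to roughly one error per million bases. A minimal sketch of the conversion follows; the 100 Mb genome size is a made-up example value, not a figure from this thread.

```python
import math


def phred_to_error_rate(q):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p):
    """Phred-scaled quality for a per-base error probability."""
    return -10 * math.log10(p)


def expected_errors(q, genome_size_bp):
    """Expected number of erroneous bases in an assembly of the given size."""
    return phred_to_error_rate(q) * genome_size_bp


# Hypothetical 100 Mb genome, chosen only for illustration.
genome_size_bp = 100_000_000
for q in (30, 40, 50, 60):
    print(f"Q{q}: ~{expected_errors(q, genome_size_bp):,.0f} expected errors")
```

Each 10-point step in Q is a tenfold drop in errors, so the gap between a Q40 draft and a Q60 assembly is two orders of magnitude in remaining errors.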

However, HPRC (which overlaps heavily with the T2T consortium) and other production consortia (VGP, ERGA) are moving away from linked reads and optical mapping. At this point HiFi or ONT R10 Duplex reads give better resolution than linked reads, which are also reported to have less even coverage. Optical maps, Bionano for instance, have far fewer sites for nicking enzymes in centromeric regions, but apparently still have some utility elsewhere (e.g. telomeric regions), where ONT UL read coverage drops off anyway.

If things keep going like this, short reads and optical genome mapping will have increasingly limited utility in a few years' time.

score 1 · Accepted Answer · 2023-09-14

In my experience, it mainly depends on three variables.

First is the sequencing technology. PacBio HiFi can produce high-quality reads, matching Illumina quality in some cases. MinION is still a different story: I've attended a couple of workshops recently, and it seems the recent chemistry has improved substantially, but not enough for de novo assemblies without polishing.

Second is the coverage. In my experience, you might not find any difference between a 50X PacBio HiFi assembly and its polished version (bacterial or other small genomes). It is different for MinION, for the reasons listed above.
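As a quick sanity check on figures like 50X, mean depth is simply the total number of sequenced bases divided by the genome size. This is a minimal sketch; the read lengths and genome size below are invented toy values.

```python
def mean_depth(read_lengths, genome_size_bp):
    """Mean sequencing depth: total sequenced bases / genome size."""
    return sum(read_lengths) / genome_size_bp


def bases_needed(target_depth, genome_size_bp):
    """Total bases of sequence required to reach a target mean depth."""
    return target_depth * genome_size_bp


# Toy example: 500 reads of 10 kb over a 100 kb genome.
print(mean_depth([10_000] * 500, 100_000))

# Yield needed for 50X on a hypothetical 100 Mb genome.
print(bases_needed(50, 100_000_000))
```

Note that this is an average only. Repeat-rich or extreme-GC regions can sit well below the mean, which is part of why the same nominal depth behaves differently from genome to genome.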

Third, consider genome size and complexity. For instance, sequencing a typical chromosome of the human genome is not the same as sequencing chromosome X or Y, which are full of repeats and have a very low %G+C. You'd need higher coverage and longer reads for those. So, yes, size (and complexity) matters.

I wouldn't say there is a general rule, nor is it a "best practice" to do it by default. Every case is different, and it is worth analyzing each one independently.

Hope it helps.