Question

How to handle multiple contig outputs from de novo assemblers when one contig is desired?

4.0 years ago

Trombone Engineer • 0

I am working with Oxford Nanopore MinION data for small genomes that I am trying to assemble with de novo assembly tools. For training, I have a few datasets with reference genomes and have been comparing various de novo assemblers. So far Unicycler has performed best, but I have not been able to find much information on polishing or otherwise handling multiple separate contigs when one long contig is desired. Sometimes the same assembler outputs 1 contig, and other times it outputs many separate contigs, even though there is enough overlap to hypothetically connect them.

I completed some genome polishing tutorials, such as one for Nanopolish, but realized that polishing may not do what I want: combining separate contigs into one draft genome sequence. What are the designated tools to accomplish this task? Should I expect to do it manually with a visualization or mapping tool? Is alignment or MSA helpful for this task?

Additionally, is there a reason why state-of-the-art assembly tools are unable to complete these assemblies automatically (into a single contig, that is)? I do not believe I have any unsequenced stretches, since my genomes are so small.

Assembly Nanopore contig de novo • 2.5k views

updated 4.0 years ago by Mensur Dlakic ★ 28k • written 4.0 years ago by Trombone Engineer • 0

The vast majority of genome assemblies deposited in databases such as GenBank do not include complete chromosomes as a continuous sequence. Is there some particular reason why contigs aren't good enough for you?

4.0 years ago by 5heikki 11k

If you have related genomes, reference-based scaffolding tools such as these may be useful:

https://github.com/malonge/RaGOO

https://github.com/combogenomics/medusa

3.9 years ago by colindaven 7.0k

score 2 · Answer 1 · 2020-12-09

There are many reasons why most genomes come out incomplete after assembly. Some of them are: sequencing errors, sequence repeats, uneven coverage, inherent difficulty in cloning or amplifying certain genomic regions, sample contamination, poor technical handling, and bad luck. By the way, deep coverage on its own is not enough, especially for NGS methods that use short reads.

Assembly programs are designed to combine fragments in an intelligent way, which involves more than simply spotting an overlap between the fragments. The overlap needs to be long enough, free of mismatches (especially in the middle part), and supported by a good number of reads. Even if you have verified that one or more of these criteria are fulfilled, chances are that there isn't enough confidence to join the contigs reliably. You can inspect the assembly graphs if you wish to confirm or override the assembler's decisions.
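
To make the overlap criteria a bit more concrete, here is a minimal Python sketch that checks whether the end of one contig overlaps the start of another, given a minimum overlap length and a mismatch limit. The function name `suffix_prefix_overlap` and its parameters `min_len` and `max_mismatches` are hypothetical, chosen only for illustration. Real assemblers do considerably more (read support, graph context, both strands), so a hit here is at most a hint that a join is worth inspecting, not evidence that it is correct.

```python
def suffix_prefix_overlap(left, right, min_len=20, max_mismatches=0):
    """Find the longest overlap between the end of `left` and the start of `right`.

    Returns (overlap_length, mismatches), or None if no overlap of at least
    `min_len` bases has `max_mismatches` mismatches or fewer.
    Only substitutions are considered; indels and the reverse strand are ignored.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one qualifies.
    for length in range(longest_possible, min_len - 1, -1):
        tail = left[-length:]   # last `length` bases of the left contig
        head = right[:length]   # first `length` bases of the right contig
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return length, mismatches
    return None


if __name__ == "__main__":
    # Toy sequences, far shorter than real contigs.
    contig_a = "ACGTACGTTTGACCA"
    contig_b = "TTGACCAGGGTACA"
    print(suffix_prefix_overlap(contig_a, contig_b, min_len=5))
```

With nanopore data, a strict `max_mismatches=0` would rarely be satisfied before polishing, which is one reason an overlap that looks obvious by eye may still not be joined.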