How to handle multiple contig outputs from de novo assemblers when one contig is desired?
4.0 years ago

I am working with Oxford Nanopore MinION data for small genomes that I am trying to assemble with de novo assembly tools. For training, I have a few datasets with reference genomes and have been comparing various de novo assemblers. So far Unicycler has performed best, but I have not been able to find much information on polishing or otherwise handling multiple separate contigs when one long contig is desired. Sometimes the same assembler outputs one contig, and other times it outputs many separate contigs, even though there appears to be enough overlap to connect them.

I completed some genome polishing tutorials, such as the one for Nanopolish, but realized that polishing may not do what I want, which is to combine separate contigs into one draft genome sequence. What are the designated tools for this task? Should I expect to do it manually with a visualization or mapping tool? Is pairwise alignment or multiple sequence alignment (MSA) helpful here?

Additionally, is there a reason why state-of-the-art assembly tools are unable to complete these assemblies automatically (into a single contig, that is)? I do not believe I have any unsequenced stretches, since my genomes are so small.

Assembly Nanopore contig de novo • 2.5k views

The vast majority of genome assemblies deposited in databases such as GenBank do not include complete chromosomes as continuous sequences. Is there a particular reason why contigs aren't good enough for you?


If you have genomes of related organisms, reference-based scaffolding tools such as these may be useful:

https://github.com/malonge/RaGOO

https://github.com/combogenomics/medusa

4.0 years ago
Mensur Dlakic ★ 28k

There are many reasons why most genomes come out incomplete after assembly. Some of them are: sequencing errors, sequence repeats, uneven coverage, inherent difficulty in cloning or amplifying certain genomic regions, sample contamination, poor technical handling, and bad luck. By the way, deep coverage on its own is not enough, especially for NGS methods that use short reads.

Assembly programs are designed to combine fragments in an intelligent way, which involves more than simply spotting an overlap between the fragments. The overlap needs to be long enough, free of mismatches (especially in the middle part), and supported by a good number of reads. Even if you have verified that one or more of these criteria are fulfilled, chances are that the assembler did not have enough confidence to join the contigs reliably. You can inspect the assembly graphs if you wish to confirm or override the assembler's decisions.
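
To get a rough feel for the first two criteria (overlap length and mismatches), here is a minimal Python sketch that looks for the longest suffix of one contig matching the prefix of another. The function name `suffix_prefix_overlap` and its parameters `min_len` and `max_mismatches` are made up for illustration and are not taken from any assembler. It only tests one orientation, ignores reverse complements and indels, and knows nothing about read support, so treat a hit as a hint to look closer, not as proof that two contigs should be joined.

```python
def suffix_prefix_overlap(left, right, min_len=20, max_mismatches=0):
    """Length of the longest suffix of `left` that matches a prefix of `right`.

    Hypothetical helper for illustration only: plain position-by-position
    comparison, no indels, no reverse complement, no read-support check.
    Returns 0 if no overlap of at least `min_len` bases is found.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one passes.
    for size in range(longest_possible, min_len - 1, -1):
        tail = left[-size:]
        head = right[:size]
        mismatches = sum(1 for a, b in zip(tail, head) if a != b)
        if mismatches <= max_mismatches:
            return size
    return 0


# Toy sequences, not real data.
contig_a = "ACGTACGTTTGACCA"
contig_b = "TTGACCAGGGT"
overlap = suffix_prefix_overlap(contig_a, contig_b, min_len=5)
print(overlap)
```

With real contigs you would read the sequences from the assembler's FASTA output and use a much larger `min_len`. The scan is quadratic in the worst case, which is acceptable for a handful of contigs from a small genome but not for large assemblies.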
