Question

Why aren't Hamiltonian paths used in genome assembly?

6

Entering edit mode

9.3 years ago

abetusk ▴ 90

My basic question is why aren't genome assemblers using an underlying Hamiltonian path algorithm?

My basic, high level understanding of genome assemblers is that they consider nodes in a graph created from short kmer sequences and connect a directed edge when the suffix of one node is the prefix of another. An Eulerian path is found in the resulting graph and this is what constitutes the final assembly. Since the genome has low entropy, redundant and repeated regions, methods like these are needed to resolve the ambiguity that comes with having read lengths that are smaller than the repeated regions.

I found a Nature article that I thought gave a nice overview. From the article:

What's more, there is no known efficient algorithm for finding a Hamiltonian cycle in a large graph with millions (let alone billions) of nodes. ... the computational burden of this approach was so large that most next-generation sequencing projects have abandoned it.

The article goes on to mention that the Hamiltonian path problem is NP-Complete.

This confuses me because there are known almost sure polynomial time algorithms that find Hamiltonian paths in Erdos-Renyi random graphs. Further, there are many solvers that use a variety of heuristics to aid in search.

I understand that genome assemblers using Eulerian paths probably do a "good enough" job so it's probably a moot point but is there any real reason why finding Hamiltonian paths hasn't been used in genome assembly? Did researchers hear that it was NP-Complete and get scared away from the prospect? Is there any work to create an assembler that use an underlying Hamiltonian path algorithm?

Assembly genome • 4.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.3 years ago by abetusk ▴ 90

Ram · Accepted Answer · 2015-09-10

In my view, reducing overlap assembly to the Hamilton Path problem is just an illusion. I could not find the full text of old literatures -- among the papers I know, such a formulation seems to first appear in the 2001 Euler paper, a paper that objects to OLC. Even if there are earlier papers on this formulation, the modern theory on OLC as is established by Myers et al has nothing to do with Hamilton Path reduction. It almost seems to me that the Hamilton reduction was introduced only to promote the Eulerian approach.

Anyway, on real data, there may be many dead ends, artifactual reads and missing or false overlaps. Strict Hamilton paths often don't make sense. In addition, we usually require very low misassembly rate. That is best achieved by breaking contigs whenever there are ambiguities -- rather than seeking "optimal" Hamilton Paths that may be sensitive to all kinds of errors. Furthermore, most heuristic overlap-based assemblers has average time complexity better than O(N^2). I guess those approximate Hamilton Path finders can't achieve this?

For long reads, OLC is still the king. To my limited knowledge, the vast majority (if not all) of large genomes sequenced with >=1kb reads were assembled with OLC-based assemblers. If you are interested in de novo assembly, I would recommend to read Myers et al's papers in 1995 (overlap graph), 2000 (celera assembler) and 2005 (string graph). These are the proven theory on OLC and are still used today for PacBio and nanopore assemblies.