My basic question is why aren't genome assemblers using an underlying Hamiltonian path algorithm?
My basic, high level understanding of genome assemblers is that they consider nodes in a graph created from short kmer sequences and connect a directed edge when the suffix of one node is the prefix of another. An Eulerian path is found in the resulting graph and this is what constitutes the final assembly. Since the genome has low entropy, redundant and repeated regions, methods like these are needed to resolve the ambiguity that comes with having read lengths that are smaller than the repeated regions.
I found a Nature article that I thought gave a nice overview. From the article:
What's more, there is no known efficient algorithm for finding a Hamiltonian cycle in a large graph with millions (let alone billions) of nodes. ... the computational burden of this approach was so large that most next-generation sequencing projects have abandoned it.
The article goes on to mention that the Hamiltonian path problem is NP-Complete.
This confuses me because there are known almost sure polynomial time algorithms that find Hamiltonian paths in Erdos-Renyi random graphs. Further, there are many solvers that use a variety of heuristics to aid in search.
I understand that genome assemblers using Eulerian paths probably do a "good enough" job so it's probably a moot point but is there any real reason why finding Hamiltonian paths hasn't been used in genome assembly? Did researchers hear that it was NP-Complete and get scared away from the prospect? Is there any work to create an assembler that use an underlying Hamiltonian path algorithm?
How good or bad the Hamiltonian path solver is depends on the algorithm and graph. I'm not sure if it could do better than O(N^2) but it might.
For those of us that aren't up on the lingo, OLC = "Overlap Layout Consensus".
And for those that are lazy, here are the links to some of the papers referenced:
Thanks lh3!