General Guides

[Assembly of large genomes using second-generation sequencing1 Genome Research (2010)
A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies in PLoS ONE (2010)

Biostar Tags

Back to the Table of Contents

assembly papers • 12k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 12.6 years ago by Istvan Albert 102k

Gjain · Answer 1 · 2013-02-19

Here are some historically important papers on the theory of de novo assembly (not in the practical aspect). It was edited from the note I took for writing the fermi paper. I was quite confused with the history and theory on de novo assembly at that time. The note is a little lousy but might be useful to someone in case.

Staden, 1979; Gingeras et al, 1979: first attempt to use a computer program to assist sequence assembly.
Peltola et al, 1984: first mathematical formulation sequence assembly and the emerge of the overlap-layout-consensus (OLC) paradidgm. OLC consists of three steps: 1) overlap: finding overlaps between all pairs of reads; 2) layout: arrange the reads based on overlaps; 3) consensus: derive the final sequence. In that paper, the authors tried to find a layout that minimizes the assembly.
Myers, 1995 (a legendary figure in computional biology); Kececioglu and Myers, 1995: probably the first graph representation of sequence assembly - the overlap graph. In an overlap graph, a vertex represents a read and an edge represents an overlap. We can find the layout by finding the optimal path through each vertex - an NP-hard Hamilton problem. However, Myers et al. did not stop there. They proposed transitive reduction, a procedure to dramatically reduce the redundancy of overlaps. And after transitive reduction, solving sequence assembly is not a Hamilton problem any more.
Idury and Waterman, 1995 (the Waterman in Smith-Waterman): the birth of the de Bruijn paradigm. I remember that a paper mentioned that both Myers and Waterman participated the 4th DIMACS Implementation Chanllenges in 1994. They proposed two distinct graph representations of sequence assembly, which have such a fundamental incluence even today.
Myers et al, 2000: publication of Celera assembler, arguably the best assembler for short-gun data at that time. This assembler is largely a practical implementation of the theory in their 1995 papers.
Pevzner et al, 2000: publication of the Euler assembler, the first assembler based on the de Bruijn graph. It solved two major problems with the original theory published in 1995 with spectrum error correction and read threading, a procedure to re-impose the full-length read information to the de Bruijn graph. The paper unfortunately spread a wrong notion which can still be seen in a few reviews nowadays: OLC is an NP-hard Hamilton problem, but de Bruijn graph can be solved in linear time and thus is more advanced. However, in fact, Gene Myers with his Celera assembler never tried to reduce OLC to a Hamilton problem. I will come back to the time complexity of de Bruijn later.
Myers, 2005: string graph. String graph is largely a different representation of overlap graph.
Medvedev et al, 2007: de novo assembly is NP-hard, which is true for both the OLC and the de Bruijn paradigm. A catch is that although constructing the de Bruijn graph and finding the optimal path is linear, resolving the optimal read threading is NP-hard and without threading, we cannot fully use all the information in reads.
Zerbino et al, 2009: the publication of velvet. This is a landmark paper on short-read assembly. Graph cleaning (e.g. tip trimming and bubble popping) was first introduced in this paper and is now widely used in essentially every short-read assembler.
Simpson and Durbin, 2010: the first application of FM-index in sequence assembly. This paper resolved a long-standing difficulty in applying OLC to short reads: finding the overlaps. After this paper, two other fast overlap finder were also published (Dinh and Rajasekaran, 2011; Gonnella and Kurtz, 2012) which are claimed to be much faster and more lightweight.
Earl et al, 2011: Assemblathon 1. In my view, this is undoubtedly the best paper on the evaluation of de novo assemblers.

Gjain · Answer 2 · 2013-02-19

5

Entering edit mode

12.1 years ago

fo3c ▴ 450

Sequence assembly demystified

Niranjan Nagarajan and Mihai Pop. 2013.

Nature Reviews Genetics, 14:157-167, doi:10.1038/nrg3367

ADD COMMENT • link updated 12.1 years ago by Gjain 5.8k • written 12.1 years ago by fo3c ▴ 450

0

Entering edit mode

neat title! - very important when writing any paper ... :-)

ADD REPLY • link 12.1 years ago by Istvan Albert 102k

score 2 · Answer 3 · 2012-09-27

2

Entering edit mode

12.5 years ago

Daniel Standage 4.1k

Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA. 2011. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research, 22:557-567, doi:10.1101/gr.131383.111.

ADD COMMENT • link 12.5 years ago by Daniel Standage 4.1k