Question

How To Assess The Quality Of An Assembly? (Is There No Magic Formula?)

28

12.3 years ago

diltsjeri ▴ 470

Hi,

I'm having a difficult time finding a consensus method for assessing the quality of an assembly.

Are there "best" methods to use based on the organism type, technology, and sequence quality? I know N50 is a value I should use to assess assembly quality, but is it the only metric?

Thanks.

assembly quality next-gen • 48k views

ADD COMMENT • link updated 3.2 years ago by WANG ▴ 10 • written 12.3 years ago by diltsjeri ▴ 470

2

'Quality' can be a very subjective thing. The Assemblathons, as well as contests like GAGE and dnGASP, seem to indicate that assemblies can be high quality in a few areas of interest, but it is hard to make an assembly that excels in all aspects of quality. If you are only interested in one aspect of assembly quality, e.g. finding genes in a genome assembly, then it may not matter whether scaffolds are really long (e.g. > 10 Mbp), only that scaffolds mostly contain whole genes.

N50 can tell you something about the typical length of scaffolds and/or contigs (it is a length-weighted median rather than a mean). Comparing the N50 values of two assemblies is meaningless unless the assemblies are the same size. It is also possible to artificially raise N50 by deliberately excluding short contigs/scaffolds and/or increasing the padding of Ns within scaffolds. One of the figures we include in the Assemblathon 2 paper suggests that N50 can be a semi-useful predictor of assembly quality. Some of the most highly ranked assemblies had high N50 values, but not all of them did, and some assemblies with high N50 values did not rank as highly.
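To make the point about excluding short contigs concrete, here is a minimal sketch in Python. The contig lengths are made up for illustration, and `n50` is just a helper name chosen here, not taken from any particular tool.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths in bp (illustrative only).
contigs = [800, 600, 400] + [100] * 12

n50_all = n50(contigs)                                  # all contigs kept
n50_filtered = n50([c for c in contigs if c >= 400])    # short contigs dropped

print(n50_all, n50_filtered)  # prints "400 600"
```

Nothing about the underlying sequence changed between the two calls. Only the reporting threshold did, yet N50 went up, which is why two N50 values are hard to compare unless the totals match.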

To give you a succinct, but somewhat disappointing, answer to your question, I would say:

There is no magic formula.

ADD REPLY • link 12.2 years ago by kbradnam ▴ 20

0

Lately I have been following the methods listed here:

BUSCO/CEGMA for checking the core genes
Map RNA-Seq reads and unigenes derived from transcriptome assembly
Map Proteins from closely related species
Map constituent reads that were used to form the assembly and check their depth and mappability
Distribution of NGx (10,50,70,90 etc)
Distribution of contig lengths
Check presence of duplicate contigs and other contaminants (easiest way is to submit the genome to NCBI)
Bases constituting the assembly.
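For the NGx item in the list above, a rough sketch of how such a profile could be computed is shown below. It assumes you already have an estimate of the genome size; the lengths, the genome size, and the function name `ngx` are all illustrative assumptions.

```python
def ngx(lengths, genome_size, x):
    """Return NGx: the length L such that sequences of length >= L
    cover at least x% of genome_size (not of the assembly itself).

    Returns None when the whole assembly is too small to reach x%.
    """
    target = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


# Hypothetical contig lengths (bp) and estimated genome size.
contigs = [900, 700, 500, 300, 200, 100, 100]
genome_size = 3000

profile = {x: ngx(contigs, genome_size, x) for x in (10, 50, 70, 90)}
print(profile)  # prints "{10: 900, 50: 700, 70: 500, 90: 100}"
```

Because the denominator is the genome size rather than the assembly size, NGx values from different assemblies of the same genome can be compared directly, and a sharp drop between thresholds shows where the assembly becomes fragmented.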

ADD REPLY • link 6.8 years ago by harishk0201 ▴ 130

score 18 · Answer 1 · 2013-01-24

18

12.3 years ago

zam.iqbal.genome ★ 1.9k

N50 is most definitely not the only thing to look at. How you should assess an assembly basically depends on what you want to do with it.

You could check out this paper recently submitted to the arXiv:

http://arxiv.org/pdf/1301.5406

"Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species"

Keith R. Bradnam (1), Joseph N. Fass (1), Anton Alexandrov (36), Paul Baranay (2), Michael Bechner (39), İnanç Birol (33), Sébastien Boisvert (10,11), Jarrod A. Chapman (20), Guillaume Chapuis (7,9), Rayan Chikhi (7,9), Hamidreza Chitsaz (6), Wen-Chi Chou (14,16), Jacques Corbeil (10,13), Cristian Del Fabbro (17), T. Roderick Docking (33), Richard Durbin (34), Dent Earl (40), Scott Emrich (3), Pavel Fedotov (36), Nuno A. Fonseca (30,35), Ganeshkumar Ganapathy (38), Richard A. Gibbs (32), Sante Gnerre (22), Élénie Godzaridis (11), Steve Goldstein (39), Matthias Haimel (30), Giles Hall (22), David Haussler (40), Joseph B. Hiatt (41), Isaac Y. Ho (20), Jason Howard (38), Martin Hunt (34), Shaun D. Jackman (33), David B Jaffe (22), Erich Jarvis (38), Huaiyang Jiang (32), et al. (55 additional authors not shown)

and also the previous Assemblathon paper. Also check out papers by Steven Salzberg and Mihai Pop on this subject, plus the references within all of the above. There are many others which I can't think of off the top of my head; I'm sure others will suggest some.

best Zam

ADD COMMENT • link 12.3 years ago by zam.iqbal.genome ★ 1.9k

3

As you mentioned GAGE, I am actually concerned with this evaluation. For small genomes, the authors intentionally mix 50% short-insert reads and 50% long-insert reads by thinning the source data. When assembling, they largely treat the two types of reads the same apart from orientation and insert size. If the assembler does not consider the exceptionally high chimeric rate of long-insert reads, the performance will be very bad, as is shown in the table. However, in practice, short-insert reads are cheaper and of much better quality than long-insert reads. A better approach would be to sequence more short-insert reads, assemble them first, and then use long-insert reads only to build scaffolds. As such, GAGE might be evaluating a scenario that does not represent best practice.

Assemblathon 1/2 is truly amazing; I like it a lot.

ADD REPLY • link 12.3 years ago by lh3 33k

0

Hi Heng. I didn't mention GAGE at all, I mentioned Steven Salzberg. I was thinking of papers like these:

http://bioinformatics.oxfordjournals.org/content/21/24/4320.full

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0021400

http://books.google.co.uk/books?hl=en&lr=&id=UrKGLrmpRZAC&oi=fnd&pg=PA163&dq=info:vd-c54xEXwAJ:scholar.google.com&ots=tIw9P31XE9&sig=0L3wqezpwzlFJtcH28HEunu_ZHc&redir_esc=y#v=onepage&q&f=false

http://www.biomedcentral.com/content/pdf/gb-2008-9-3-r55.pdf

cheers

Zam

ADD REPLY • link 12.3 years ago by zam.iqbal.genome ★ 1.9k

0

Yeah, their reviews are very good. Thanks for the clarification.

ADD REPLY • link 12.3 years ago by lh3 33k

score 8 · Answer 2 · 2013-01-24

8

12.3 years ago

Madelaine Gogol 5.3k

I like the paper answer above, but if you're just looking for some additional measuring sticks besides N50, you could also think about:

Number of contigs
Length of longest/shortest contigs
Average length of contigs
Total length of all contigs
Length of 10/100/1000/10000 longest contigs
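The measures in the list above are cheap to compute once you have the contig lengths. Here is a small sketch; it reads "length of the k longest contigs" as their combined length, which is an assumption about what was meant, and the function name and example lengths are illustrative.

```python
def contig_summary(lengths, top_ks=(10, 100, 1000, 10000)):
    """Summarise contig lengths: count, extremes, mean, total, and the
    combined length of the k longest contigs for each k in top_ks."""
    if not lengths:
        raise ValueError("no contigs given")
    ordered = sorted(lengths, reverse=True)
    summary = {
        "count": len(ordered),
        "longest": ordered[0],
        "shortest": ordered[-1],
        "mean": sum(ordered) / len(ordered),
        "total": sum(ordered),
    }
    for k in top_ks:
        # Slicing past the end is safe: it simply takes every contig.
        summary[f"top_{k}_total"] = sum(ordered[:k])
    return summary


# Hypothetical contig lengths in bp; small k values so the toy data is useful.
stats = contig_summary([900, 700, 500, 300, 200, 100, 100], top_ks=(2, 5))
for name, value in stats.items():
    print(f"{name}\t{value}")
```

Comparing `top_k_total` against `total` gives a quick feel for how much of the assembly sits in its largest pieces.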

ADD COMMENT • link 12.3 years ago by Madelaine Gogol 5.3k

score 8 · Answer 3 · 2013-01-24

8

12.3 years ago

Manu Prestat 4.1k

I would add the number of annotations you can obtain from your contigs, or the number of ORFs you can predict, as estimates of "information content".

ADD COMMENT • link 12.3 years ago by Manu Prestat 4.1k

score 8 · Answer 4 · 2013-01-25

8

12.3 years ago

Rayan Chikhi ★ 1.6k

QUAST and FRCurve are two recent tools that should definitely be considered when evaluating assemblies.

QUAST computes a comprehensive set of classical metrics. It can reproduce the GAGE benchmark.

FRCurve computes newer metrics related to correctness.

ADD COMMENT • link 12.3 years ago by Rayan Chikhi ★ 1.6k

score 7 · Answer 5 · 2018-05-16

You can use QUAST (QUality ASsessment Tool), which evaluates genome assemblies by computing various metrics, including:

N50: the length for which the collection of all contigs of that length or longer covers at least 50% of the assembly length
L50: the minimum number X such that the X longest contigs cover at least 50% of the assembly
NG50: like N50, but the contigs must cover at least 50% of the reference genome length rather than the assembly length
NA50 and NGA50: like N50 and NG50, but computed on blocks aligned to the reference instead of whole contigs
Number of N's per 100 kbp and GC %
Misassemblies: misassembled and unaligned contigs, or contig bases
Genes and operons covered

QUAST generates a clear report, which helps you assess your genome assembly.
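To show what a few of these numbers mean, here is a toy Python sketch of L50, N's per 100 kbp, and GC %. It is not QUAST's code: the function names and sequences are made up, and GC % here is taken over unambiguous A/C/G/T bases only, which is an assumption that may differ from QUAST's own denominator.

```python
def l50(lengths):
    """Smallest number of longest sequences covering >= 50% of the total."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return count
    return 0  # empty input


def n_per_100kbp(sequences):
    """Number of N characters per 100,000 bases of assembly."""
    total = sum(len(s) for s in sequences)
    n_count = sum(s.upper().count("N") for s in sequences)
    return 100_000 * n_count / total if total else 0.0


def gc_percent(sequences):
    """GC content as a percentage of A/C/G/T bases (Ns are ignored)."""
    gc = acgt = 0
    for s in sequences:
        s = s.upper()
        gc += s.count("G") + s.count("C")
        acgt += sum(s.count(base) for base in "ACGT")
    return 100 * gc / acgt if acgt else 0.0


# Tiny made-up "assembly", for illustration only.
assembly = ["ACGTNNACGT", "GGCC", "ATAT"]
lengths = [len(s) for s in assembly]

print(l50(lengths))               # prints "1"
print(round(gc_percent(assembly), 1))  # prints "50.0"
print(round(n_per_100kbp(assembly), 2))
```

For a real assembly you would let QUAST compute these, since it also handles alignment-based metrics such as NA50 and misassemblies that need a reference.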

Good Luck

Ram · Answer 6 · 2013-01-24

6

12.3 years ago

earonesty ▴ 250

I use a dup-mer-21 calculation to compare assemblies, based on this conversation:

http://www.homolog.us/blogs/2012/06/26/what-is-wrong-with-n50-how-can-we-make-it-better-part-ii/

Source code:

http://ea-utils.googlecode.com/svn/trunk/clipper/contig-stats

This lets you know if there is excessive chimerism ... a common error.

ADD COMMENT • link 12.3 years ago by earonesty ▴ 250

1

The article correctly points out that evaluating N50 only is frequently misleading, but the last paragraph is questionable. When there is ambiguity about whether A should be connected to B or to C, the right decision is not to perform any joining. If we force a join, we will get longer N50 at the cost of high error probability at the junction. An aggressive assembler will get longer N50 but more misassemblies in that case.

ADD REPLY • link 12.3 years ago by lh3 33k

0

Which is what the dup-mer-21 will detect... overaggressive assemblers. You should see the same kmer represented in multiple locations when the assembler is more aggressively calling connections in its graph than it should.

It's easy to produce a single contig. It's hard to get it right.
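For readers who want to see the idea without reading the linked source, here is a minimal sketch of counting duplicated k-mers across contigs. It is an illustration only and is not the contig-stats implementation: the function names, the use of canonical (strand-independent) k-mers, and the way the percentage is defined are all assumptions made here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse
    complement, so both strands are counted as the same k-mer."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def duplicated_kmer_stats(contigs, k=21):
    """Count distinct k-mers that occur more than once in the assembly.

    Returns (number of duplicated k-mers, percentage of distinct k-mers
    that are duplicated). K-mers containing N are skipped.
    """
    counts = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    distinct = len(counts)
    duplicated = sum(1 for c in counts.values() if c > 1)
    pct = 100 * duplicated / distinct if distinct else 0.0
    return duplicated, pct


# Toy example with k=3 so the result is easy to check by hand.
dup_count, dup_pct = duplicated_kmer_stats(["AAACCC", "AAAGGG"], k=3)
print(dup_count, round(dup_pct, 1))
```

This keeps every k-mer in memory, so it only suits small genomes; the number it gives is meaningful only relative to a trusted assembly of a similar organism.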

ADD REPLY • link 11.6 years ago by earonesty ▴ 250

0

@earonesty: Could you please explain how to interpret dup-mer-cnt and dup-pct-21 when comparing assemblies? Should they be high or low?

ADD REPLY • link 11.3 years ago by Prakki Rama ★ 2.7k

0

They should be "comparable to expected". In other words, you should benchmark against an existing high-quality assembly. Some k-mer duplication is, of course, expected. What the "correct" number is varies from organism to organism. As a rule, I would expect a longer genome to have more.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.9 years ago by earonesty ▴ 250

score 3 · Answer 7 · 2013-01-24

Regardless of your biological question, I think looking at length statistics alone can be very misleading and uninformative because 1) the percentage of Ns in scaffolds may be very high and 2) there is always some level of contamination (from organelles, but also other species, possibly) in draft shotgun assemblies, in my experience. How you define "quality" is important to your assessment of the assembly, but the common goal is to try and represent the actual genomic sequence of an organism, so some things to check are:

Sequence content of contigs/scaffolds.
Levels of contamination (aside from sequence contamination, there are also assembly artifacts to be aware of, as others mentioned).
Gene content/accuracy.

The last two points can be assessed by looking at the reference genome or gene models, respectively, of your species or a closely related species. There are many recent papers on comparing genome assemblies, so I won't list any papers or tools (too easy to google), but I will mention a method for inferring the gene content. CEGMA uses a set of core genes conserved across eukaryotes and may be biologically informative, especially if your organism is a non-model species and you have no transcriptome or even a closely related species for comparison.
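As a small illustration of the first point (Ns inflating length statistics), the sketch below splits scaffolds at runs of N and compares the scaffold-level N50 with the contig-level N50. The sequences, the `min_gap` parameter, and the function names are hypothetical choices made for this example.

```python
import re


def n50(lengths):
    """Length L such that sequences of length >= L cover >= 50% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def split_at_gaps(scaffold, min_gap=1):
    """Split a scaffold into contigs at runs of at least min_gap Ns."""
    pieces = re.split(f"[Nn]{{{min_gap},}}", scaffold)
    return [p for p in pieces if p]


def percent_n(scaffolds):
    """Percentage of all scaffold bases that are N."""
    total = sum(len(s) for s in scaffolds)
    n_count = sum(s.upper().count("N") for s in scaffolds)
    return 100 * n_count / total if total else 0.0


# Two made-up scaffolds; the first is half gap.
scaffolds = ["A" * 40 + "N" * 50 + "C" * 10, "G" * 30]
contigs = [c for s in scaffolds for c in split_at_gaps(s)]

scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for c in contigs])
print(scaffold_n50, contig_n50, round(percent_n(scaffolds), 1))
```

A large gap between the two N50 values, together with a high N percentage, suggests that the scaffold statistics say more about gap padding than about assembled sequence.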

score 0 · Answer 8 · 2013-10-16

0

11.6 years ago

Prakki Rama ★ 2.7k

You can also check this thread: "Assessing The Quality Of De Novo Assembled Data".

ADD COMMENT • link 11.6 years ago by Prakki Rama ★ 2.7k

score 0 · Answer 9 · 2018-07-03

0

6.8 years ago

alslonik ▴ 320

We also use BUSCO (https://busco.ezlab.org/), along with QUAST (already mentioned) and statistics such as scaffold sizes, percentage of gaps, N50, etc.

ADD COMMENT • link 6.8 years ago by alslonik ▴ 320

score 0 · Answer 10 · 2022-02-08

0

3.2 years ago

WANG ▴ 10

I have a question: how can I assess the quality of each contig? I am testing an assembly method on a simulated sample. Are there any methods that directly compare the assembled contigs with the ground truth and provide measurement scores at the sequence level?

ADD COMMENT • link 3.2 years ago by WANG ▴ 10