Question

Assessing The Quality Of De Novo Assembled Data

25

12.8 years ago

Prakki Rama ★ 2.7k

Are there any other ways of assessing the quality of assembled data obtained from different assemblers apart from metrics like N50 and assembly size?

I know of a few, such as:

BLASTing the contigs and checking the percentage that return hits.
Checking the percentage of reads that map back to the contigs of each assembly.

Any other ideas are appreciated.

assembly next-gen • 24k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 12.8 years ago by Prakki Rama ★ 2.7k

score 20 · Answer 1 · 2012-12-12

20

12.3 years ago

Nikolay Vyahhi ★ 1.3k

QUAST (QUality ASsessment Tool for Genome Assembly) can be used to assess the quality of genome assemblies (both de novo and reference-based):

http://bioinf.spbau.ru/quast
http://sourceforge.net/p/quast (code)
http://quast.bioinf.spbau.ru (QUAST server, beta)

ADD COMMENT • link 12.1 years ago by Nikolay Vyahhi ★ 1.3k

0

QUAST paper was published in Bioinformatics — http://bioinformatics.oxfordjournals.org/content/early/2013/02/18/bioinformatics.btt086.abstract

ADD REPLY • link 12.2 years ago by Nikolay Vyahhi ★ 1.3k

0

I also highly recommend QUAST for these tasks

ADD REPLY • link 11.4 years ago by Hayssam ▴ 280

Neilfws · Answer 2 · 2012-06-18

13

12.8 years ago

Markf ▴ 290

Not to beat my own drum (too much), but I've written a tool that can be useful for this. The idea is that you map all raw reads back to the assembled genome and then assess which read pairs map and, more importantly, which map but at an unexpected distance. The tool takes a BAM input file, processes it, and then allows you to generate plots.

See: https://github.com/mfiers/hagfish

cheers Mark

ADD COMMENT • link 12.8 years ago by Markf ▴ 290

0

I was just looking for a coverage-plotting library! Great coincidence, thanks!!!

ADD REPLY • link updated 12.8 years ago by Neilfws 49k • written 12.8 years ago by Philipp Bayer 8.8k

0

If you need help/advice, let me know. If you need fixes/features, you can add them to the issue list on GitHub.

ADD REPLY • link 12.8 years ago by Markf ▴ 290

0

Interesting. But what if the paired-end reads are already mated?

ADD REPLY • link 12.3 years ago by Manu Prestat 4.1k

0

What do you mean by "already mated"?

If you align paired-end reads to your assembly, the insert size shouldn't be too large or too small. If it is too large, that indicates your assembly includes regions that do not exist; if it is too small, that indicates your assembly is missing a region. If a region in the assembly is not bridged by any read pairs, that suggests the region doesn't exist in reality.
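The insert-size reasoning above can be sketched in a few lines of Python. This is only an illustration, not how hagfish or any other tool does it: the function name `classify_inserts` and the `tolerance` parameter are made up for the example, and the input is assumed to be a plain list of absolute insert sizes (for instance, taken from the TLEN column of a SAM/BAM file). The expected range is estimated from the data itself using the median and the median absolute deviation.

```python
from statistics import median


def classify_inserts(insert_sizes, tolerance=3.0):
    """Label each insert size as 'ok', 'too_small' or 'too_large'.

    The expected range is median +/- tolerance * MAD, so a handful of
    extreme pairs does not distort the estimate. Both the function name
    and the default tolerance are illustrative assumptions.
    """
    center = median(insert_sizes)
    deviations = [abs(size - center) for size in insert_sizes]
    mad = median(deviations) or 1  # avoid a zero-width window
    low = center - tolerance * mad
    high = center + tolerance * mad

    labels = []
    for size in insert_sizes:
        if size < low:
            labels.append("too_small")  # assembly may be missing sequence here
        elif size > high:
            labels.append("too_large")  # assembly may contain extra sequence here
        else:
            labels.append("ok")
    return labels


sizes = [300, 310, 290, 305, 295, 1500, 40]
print(classify_inserts(sizes))
```

In practice you would look for regions where many flagged pairs cluster together, since isolated outliers are expected in any library.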

ADD REPLY • link 12.3 years ago by Philipp Bayer 8.8k

0

I meant (pre-)assembly of the two reads that belong to each pair (if the combination of read and insert sizes allows it), as done by tools like FLASH: http://genomics.jhu.edu/software/FLASH/index.shtml. In that case, relying on a tool that studies the mapping of the pairs would be useless.

ADD REPLY • link 12.3 years ago by Manu Prestat 4.1k

score 5 · Answer 3 · 2013-02-12

Old topic, but this was just published. I'm curious how well it performs and will hopefully be testing it myself this week or soon (whenever time permits)

http://www.ncbi.nlm.nih.gov/pubmed/23303509

http://sc932.github.com/ALE/about.html

Abstract
MOTIVATION:
Researchers need general purpose methods for objectively evaluating the accuracy of single and metagenome assemblies and for automatically detecting any errors they may contain. Current methods do not fully meet this need because they require a reference, only consider one of the many aspects of assembly quality or lack statistical justification, and none are designed to evaluate metagenome assemblies.
RESULTS:
In this article, we present an Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

Ram · Answer 4 · 2015-08-28

5

9.6 years ago

Prakki Rama ★ 2.7k

One more for the list: assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs (BUSCO for short). It replaces the discontinued CEGMA.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.6 years ago by Prakki Rama ★ 2.7k

Ram · Answer 5 · 2012-12-12

I don't have experience with any tools that estimate quality based on re-mapping reads to the de novo assembled sequence, and I'll have to check some of these out. I typically use the following metrics to compare the relative quality of my genome assemblies.

N50 and N90
number of contigs or scaffolds
length of the longest contig or scaffold
combined length of all contigs or scaffolds
% CEGs (conserved core eukaryotic genes) mapped

For this last one, I use the CEGMA method [1] to identify genes that are highly conserved among all eukaryotes (implementation available at http://korflab.ucdavis.edu/datasets/cegma). The more of these conserved genes CEGMA is able to identify, the more confidence I have in the quality of the assembly and my ability to accurately annotate other genes in that genome.

[1] Parra G, Bradnam K, Korf I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23: 1061-1067, doi:10.1093/bioinformatics/btm071.
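For anyone who wants to compute the length-based metrics from the list above without a dedicated tool, here is a minimal sketch in Python. It assumes you already have the contig (or scaffold) lengths as a list of integers; the function names `nx` and `assembly_stats` are just illustrative.

```python
def nx(lengths, fraction):
    """Return the Nx value for a list of contig lengths.

    Contigs are sorted longest first; Nx is the length of the contig at
    which the running total first reaches `fraction` of the assembly size
    (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("no contig lengths given")


def assembly_stats(lengths):
    """Collect the basic length-based metrics in one dictionary."""
    return {
        "n_contigs": len(lengths),
        "longest": max(lengths),
        "total_length": sum(lengths),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
    }


print(assembly_stats([100, 80, 60, 40, 20]))
```

Keep in mind these numbers only describe contiguity; they say nothing about whether the contigs are correct.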

score 3 · Answer 6 · 2012-12-12

3

12.3 years ago

Manu Prestat 4.1k

In addition to the others mentioned, I think an ORF prediction step can give you a quick and informative way to compare several assemblies.

ADD COMMENT • link 12.3 years ago by Manu Prestat 4.1k

score 2 · Answer 7 · 2012-09-12

2

12.6 years ago

Ketil 4.2k

I also wrote a pipeline to assess de novo assemblies. It's not particularly strong in the plotting department, but it will use a variety of data (454, Illumina, DNAseq, RNAseq, ESTs, proteomes, etc.) and calculate a bunch of numbers, in addition to internal metrics like N50 sizes and nucleotide counts, that let you compare your candidate drafts. More info at http://blog.malde.org/posts/assembly-evaluation.html

ADD COMMENT • link 12.6 years ago by Ketil 4.2k

0

The figures are very appealing. I sweated blood trying to install Haskell and the dependencies, without success, so in the end I was unable to use your pipeline.

ADD REPLY • link 9.6 years ago by Prakki Rama ★ 2.7k

0

I think I've found all the main dependencies in conda repos (you can search for them on anaconda.org), so it should be as simple as a few conda commands. And if you don't use conda already (especially with the Bioconda channel), you should all start right now :p

PS: I also found Haskell in brew, but I don't know how to easily search for the other packages (I don't have brew on my system, to avoid PATH collisions with conda, so I can't just "brew search").

ADD REPLY • link 6.4 years ago by jena ▴ 320

Ram · Answer 8 · 2015-10-14

2

9.5 years ago

Prakki Rama ★ 2.7k

One can also assess the number of misassembly errors in the genome using tools like REAPR and misSEQuel. That would also give a nice gauge of how good the assembled genome is.

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by Prakki Rama ★ 2.7k

Ram · Answer 9 · 2018-08-28

This is an old topic, but here is a list of the tools I currently use:

Quast: probably the most used tool to get assembly statistics such as N50, number of contigs, size of the largest contig, etc. http://quast.bioinf.spbau.ru/manual.html
Busco: this is a set of manually curated orthologous genes, useful to check if the gene content of your assembly is good. https://busco.ezlab.org/
Speaking of gene space, if you have access to RNA-seq data for your species, you can see if it aligns correctly to your assembly.
KAT: if you have access to the original reads, you can assess the completeness and duplication of your assembly (cf. 4.5 here: https://kat.readthedocs.io/en/latest/walkthrough.html#genome-assembly-analysis-using-k-mer-spectra). This tool uses a k-mer approach, so you do not need a reference genome.
REAPR: I have not used this one, but it can be used to assess the correctness of your assembly. https://www.sanger.ac.uk/science/tools/reapr
Mummer: to align your assembly to a closely related species (http://mummer.sourceforge.net/manual/). Bonus: it can produce dotplots for a nice visualisation of the results.

Moreover, the very interesting paper for the assemblathon 2 (https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10) describes how they assessed the different assemblies.
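To give a rough idea of what a reference-free k-mer check looks like, here is a toy sketch in Python. It is not KAT's implementation, just the general idea under simplifying assumptions: sequences are held in memory as plain strings, k-mers containing N are not treated specially, and the names `kmer_completeness`, `k` and `min_count` are made up for the example. K-mers seen at least `min_count` times in the reads are assumed to be real genomic sequence rather than sequencing errors, and the function reports what fraction of them appear in the assembly.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield each k-mer of seq in canonical form (the smaller of the
    k-mer and its reverse complement), so strand does not matter."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, kmer[::-1].translate(COMPLEMENT))


def kmer_completeness(reads, contigs, k=21, min_count=2):
    """Fraction of 'solid' read k-mers that are present in the assembly."""
    read_counts = Counter(km for read in reads for km in canonical_kmers(read, k))
    solid = {km for km, n in read_counts.items() if n >= min_count}
    assembly = {km for contig in contigs for km in canonical_kmers(contig, k)}
    if not solid:
        return 0.0
    return len(solid & assembly) / len(solid)


reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
print(kmer_completeness(reads, ["ACGTACG"], k=4))
```

For real data sets you would want a proper k-mer counter, since holding every k-mer in a Python dictionary does not scale to whole genomes.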

score 0 · Answer 10 · 2018-08-27

A couple of other parameters to judge by are:

Mapping RNA-seq data, either your own or from a closely related species, to the assembly.
Duplicated contigs in your assembly. I have seen this in cases involving multiple genomes where the unique genomic content is very low.
If you can predict ORFs, one of the better approaches is to annotate them against UniProt or with InterProScan to see how many ORFs get annotated. The number should be pretty close to that of the closest organism of your choice. Generally speaking, most plants have about 25,000-30,000 genes. Most bacteria have around 1,000 genes per Mb.
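The bacterial rule of thumb above makes for a quick sanity check. Here is a small Python sketch, assuming you know the assembly size in base pairs and the number of predicted genes; the function names are illustrative, and the default of 1,000 genes per Mb is only the rough figure quoted above, not a precise constant.

```python
def expected_bacterial_genes(assembly_bp, genes_per_mb=1000):
    """Rough expected gene count for a bacterial assembly of the given size."""
    return assembly_bp * genes_per_mb / 1_000_000


def gene_count_ratio(predicted_genes, assembly_bp, genes_per_mb=1000):
    """Ratio of predicted to expected genes; values far from 1 deserve a closer look."""
    return predicted_genes / expected_bacterial_genes(assembly_bp, genes_per_mb)


# Example: a 4.6 Mb assembly with 4,140 predicted genes.
print(expected_bacterial_genes(4_600_000))
print(round(gene_count_ratio(4140, 4_600_000), 2))
```

A ratio well below 1 could point to a fragmented assembly that breaks genes apart, while a ratio well above 1 could point to duplicated contigs or contamination, so treat it as a prompt for further checks rather than a verdict.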