Question

What are the best approaches to evaluate a genome assembly using the 'intrinsic' data?

3

10.3 years ago

fhsantanna ▴ 620

I have assembled four bacterial genomes from MiSeq paired-end sequencing data using the following steps:

Assembly using CLC Workbench;
Assembly using SPAdes;
Assembly using the A5 pipeline;
Merging of the three assemblies using CISA;
Quality check of the assemblies using QUAST.

To detect misassemblies, QUAST relies on a reference genome. However, for most of my draft genomes I do not have a suitable reference (they differ too much from the genomes deposited in GenBank).

So my question is: how can I validate a genome assembly using intrinsic data? For example, when using read mapping, what criteria indicate that a region needs to be corrected? What is the best software for this purpose?

Thanks

assembly genome validation • 7.1k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by fhsantanna ▴ 620

6

10.3 years ago

Mikael Huss 4.8k

You could try FRCbam, ALE or REAPR. All of these are supposed to evaluate assemblies without the need for a reference genome. However, my experience with them is quite limited.

ADD COMMENT • link 10.3 years ago by Mikael Huss 4.8k

0

I don't (yet) know about the other two, but FRCbam actually needs _two_ libraries: a paired-end (PE) library and a mate-pair (MP) library. The original poster seems to have only a PE library. I don't know whether there are any "hacks" to get FRCbam to work correctly on such data.

ADD REPLY • link 10.3 years ago by cedric.laczny ▴ 50

2

10.3 years ago

lexnederbragt ★ 1.3k

First, the best assembly depends on your research question. Do you need just presence/absence of genes, or is this going to be the reference genome for a larger study?

Second, in addition to the other answers, you could do an annotation, and check which assembly seems to be more complete.

ADD COMMENT • link 10.3 years ago by lexnederbragt ★ 1.3k

Ram · Accepted Answer · 2014-12-11

3

10.3 years ago

Leszek 4.2k

I don't know of any ad hoc solution, but you can try looking at:

fraction of reads that aligned - if many reads did not align, your assembly is probably missing some regions
fraction of reads with concordant pairing (e.g. from samtools flagstat) - if this is low, you likely have rearrangements or a highly fragmented assembly
pairwise genome alignments (e.g. with nucmer or lastal) of your assemblies to check for large inconsistencies between them

It's always good to compare against the chromosomes of a related species to check whether your assembly makes sense.
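
The first two checks above can be sketched in a few lines of Python, assuming the reads have already been mapped back to the assembly and the alignments are available as SAM text (for example from `samtools view -h`). The function name `mapping_summary` is hypothetical, and the way secondary and supplementary records are skipped is a simplifying assumption, so the numbers may differ slightly from what samtools flagstat reports.

```python
def mapping_summary(sam_lines):
    """Return (fraction_mapped, fraction_properly_paired) from SAM text lines.

    Only primary records are counted: secondary (0x100) and
    supplementary (0x800) alignments are skipped (an assumption of this sketch).
    """
    total = mapped = paired = proper = 0
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
        if flag & 0x1:  # read comes from a paired library
            paired += 1
            if flag & 0x2 and not flag & 0x4:  # aligned as a proper pair
                proper += 1
    frac_mapped = mapped / total if total else 0.0
    frac_proper = proper / paired if paired else 0.0
    return frac_mapped, frac_proper


# Toy input: one properly paired read pair and one unmapped pair.
example = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t99\tcontig1\t1\t60\t4M\t=\t101\t104\tACGT\tIIII",
    "r1\t147\tcontig1\t101\t60\t4M\t=\t1\t-104\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(mapping_summary(example))
```

Comparing these two fractions across the candidate assemblies gives a quick, reference-free way to rank them; what counts as "low" depends on the data set and is not defined here.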

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 10.3 years ago by Leszek 4.2k

0

Should I use corrected reads or raw ones? I mapped the raw ones to the contigs and most of them did not map...

ADD REPLY • link 10.3 years ago by fhsantanna ▴ 620

0

I use raw reads, as modern aligners are quite good at aligning even poor-quality reads. If a lot of your reads fail to align, it doesn't necessarily mean your assembly is wrong. You can check your read quality, e.g. with FastQC.
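
As a rough illustration of the kind of read-quality check mentioned above (this is not FastQC itself, which reports far more), here is a minimal Python sketch that computes the mean Phred score per read. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the threshold of 20 are hypothetical choices made for the example.

```python
def mean_read_qualities(fastq_lines, offset=33):
    """Return the mean Phred quality of each read.

    Assumes strict four-line FASTQ records and Phred+33 encoding.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    means = []
    # The quality string is the fourth line of every record.
    for i in range(3, len(lines), 4):
        qual = lines[i]
        if qual:
            means.append(sum(ord(c) - offset for c in qual) / len(qual))
    return means


def fraction_low_quality(fastq_lines, threshold=20, offset=33):
    """Fraction of reads whose mean quality is below `threshold` (arbitrary cutoff)."""
    means = mean_read_qualities(fastq_lines, offset)
    return sum(m < threshold for m in means) / len(means) if means else 0.0


# Toy input: one high-quality read ('I' = Q40) and one poor read ('#' = Q2).
reads = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "####",
]
print(mean_read_qualities(reads))
print(fraction_low_quality(reads))
```

If a large share of the unmapped reads also have low mean quality, poor read quality rather than a bad assembly may explain the low mapping rate.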

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Leszek 4.2k