Question

How To Improve Whole Genome Assembly Quality

1

11.2 years ago

HG ★ 1.2k

Hi everyone, I am new to sequence assembly. I have started a project with Illumina whole-genome sequencing data for 50 E. coli genomes. I did all the assemblies with SPAdes and checked their quality with QUAST. On average I got around 100 contigs per genome. Can anyone suggest how to improve the assembly quality, that is, how to reduce the number of contigs, increase the N50 value, and reduce the gaps?

Thank you in advance for any suggestions.
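For readers unfamiliar with the metrics named in the question, contig count and N50 can be computed directly from an assembly FASTA. The following is a minimal sketch, not what QUAST itself runs; the function names are made up for illustration. QUAST also filters out short contigs by default, so its reported numbers may differ slightly from an unfiltered count.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # stays None until the first ">" header is seen
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Largest length L such that contigs of length >= L cover at least
    half of the total assembly size. Returns 0 for an empty assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Collect the basic contiguity statistics discussed in the question."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

Opening a contigs file and passing the file handle to `read_fasta_lengths`, then the result to `summarize`, gives a quick way to compare assemblies of the same genome side by side.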

• 9.3k views

ADD COMMENT • link updated 9.3 years ago by Antonio R. Franco ★ 5.2k • written 11.2 years ago by HG ★ 1.2k

0

Why don't you just map the reads to a reference and call variants?

ADD REPLY • link 11.2 years ago by Adrian Pelin ★ 2.6k

score 1 · Answer 1 · 2014-02-12

1

11.2 years ago

5heikki 11k

Do multiple assemblies with different k-mer settings and merge them at the end into a final assembly. JGI has a decent pipeline for this, and it should be publicly available, though I couldn't locate a URL after a good 10 seconds on Google.
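The first half of this suggestion (several assemblies with different k-mer settings) could be scripted roughly as below. This is only a sketch: the read file names, output directories, and the particular k-mer sets are hypothetical, and the merging step is not covered. The `-1`, `-2`, `-k`, `-o`, and `--careful` options are standard `spades.py` options; SPAdes requires k values to be odd, below 128, and listed in ascending order, and each k must be shorter than the reads.

```python
import shlex
from typing import List, Sequence


def spades_command(reads_1: str, reads_2: str, out_dir: str,
                   kmers: Sequence[int], careful: bool = True) -> List[str]:
    """Build a spades.py command line for one paired-end library."""
    for k in kmers:
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"k={k} must be odd and less than 128")
    if list(kmers) != sorted(set(kmers)):
        raise ValueError("k values must be unique and in ascending order")
    cmd = [
        "spades.py",
        "-1", reads_1,
        "-2", reads_2,
        "-k", ",".join(str(k) for k in kmers),
        "-o", out_dir,
    ]
    if careful:
        cmd.append("--careful")  # enables the mismatch correction step
    return cmd


# Hypothetical k-mer sets to compare; pick values below your read length.
KMER_SETS = [
    [21, 33, 55],
    [21, 33, 55, 77],
    [21, 33, 55, 77, 99],
]

if __name__ == "__main__":
    for kmers in KMER_SETS:
        out_dir = "asm_k" + "_".join(str(k) for k in kmers)
        cmd = spades_command("strain1_R1.fastq.gz", "strain1_R2.fastq.gz",
                             out_dir, kmers)
        # Printed rather than executed; pass cmd to subprocess.run to launch.
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them before committing compute time to 50 genomes times several k-mer sets.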

ADD COMMENT • link 11.2 years ago by 5heikki 11k

0

SPAdes does this already, although "only" with three values of k by default.

ADD REPLY • link 11.2 years ago by Mikael Huss 4.8k

score 1 · Answer 2 · 2014-02-12

1

11.2 years ago

Stroehli ▴ 40

I don't know how the scaffolding step in SPAdes works, but an additional stand-alone scaffolder such as SSPACE (using paired-end information) or Scaffold_builder (using a completed genome as a reference) could help. The latter should be relatively straightforward for E. coli, as a good genomic reference is available. This way you can get a better genomic structure (longer scaffolds, correct order of scaffolds), which can also help in reducing gaps.

In addition to that, it is never a bad idea to analyze your data set with a bunch of different assemblers that are out there.

ADD COMMENT • link 11.2 years ago by Stroehli ▴ 40

0

Yes, I appreciate your suggestion. After assembly I used CONTIGuator to map all the contigs against a good closed reference genome, and I kept only the mapped contigs to build a final pseudogenome. Any comments on my approach, please?

ADD REPLY • link 11.2 years ago by HG ★ 1.2k

score 0 · Answer 3 · 2015-12-22

Did you perform any read quality trimming before the SPAdes assembly? For E. coli I do not know of a better assembler than SPAdes; we usually get quite decent results with it. Do you run SPAdes with the additional BWA step after the initial assembly?
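To make "read quality trimming" concrete, here is a deliberately simplified sketch that removes low-quality bases from the 3' end of a single read. It assumes Phred+33 quality encoding and a threshold of Q20, both of which are illustrative choices; dedicated trimming tools use more elaborate window-based schemes and also remove adapters, so treat this as an explanation of the idea rather than a replacement for them.

```python
from typing import List, Tuple


def phred33(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(char) - 33 for char in quality]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Drop trailing bases whose quality is below min_q.

    Trimming stops at the first base (scanning from the 3' end)
    that meets the threshold; bases before it are kept untouched.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = phred33(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

Reads that become very short after trimming are usually discarded as well, together with their mates, so that the paired files stay in sync.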

Also, depending on what you want to know, I would not advise mapping against a reference. If you want to detect virulence genes, resistance genes, or plasmids and you map against a reference, you will only detect those that are also present in the reference. E. coli has a very 'mobile' genome, with a lot of recombination, horizontal gene transfer, and exchange of plasmids going on. If you want a low-resolution phylogenetic relationship between your strains, mapping against a reference is a good approach, but for functional analysis it is a definite no-no.

Ram · Answer 4 · 2015-12-22

There are trusted E. coli genomes you can use to compare and reorder your assembled contigs. You can do this with a program like Mauve. There are tutorials showing how to do it.

In my hands, with 100X coverage of E. coli, I got about as many contigs as you did, even after careful quality trimming and removal of putative adapter sequences.

I think you need to test different k-mer values, compare each result with a trusted genome, and, if you are not satisfied, add different data such as mate-pair libraries, long Illumina reads, or even long reads obtained through PacBio. A colleague of mine tried to close a Pseudomonas genome for 7 years without full success, and eventually managed it using PacBio.