Question

How can I increase contig length in de novo assembly?

1

8.4 years ago

ebrahimiet ▴ 50

Hi all,

I am performing de novo assembly of NDV (a virus with a 15 kb negative-sense RNA genome) from 200 bp Illumina paired-end reads. How can I increase the length of the assembled contigs?

thanks

Assembly • 3.3k views

ADD COMMENT • link 8.4 years ago by ebrahimiet ▴ 50

1

A tiny genome like that should be easy to assemble with 200 bp PE reads. You may actually have too much data, in which case you would need to sub-sample. How much data do you have (and is it all for this virus)?
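The sub-sampling idea can be sketched in a few lines of Python. This is only an illustration, not a tool recommended in this thread: the function names, the file paths, and any target coverage you pass in are assumptions, and the reader expects plain (uncompressed) 4-line FASTQ with both files in the same read order.

```python
import random

def pair_fraction_for_coverage(n_pairs, read_len, genome_len, target_cov):
    """Fraction of read pairs to keep so the expected depth is about target_cov."""
    current_cov = n_pairs * 2 * read_len / genome_len
    return min(1.0, target_cov / current_cov)

def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples of lines without trailing newlines."""
    while True:
        record = tuple(handle.readline().rstrip("\n") for _ in range(4))
        if not record[0]:  # empty header line means end of file
            return
        yield record

def subsample_pairs(r1_path, r2_path, out1_path, out2_path, fraction, seed=0):
    """Keep each read pair with probability `fraction`; mates stay together."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
         open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            if rng.random() < fraction:
                o1.write("\n".join(rec1) + "\n")
                o2.write("\n".join(rec2) + "\n")
                kept += 1
    return kept
```

As a purely hypothetical example, one million 200 bp read pairs on a 15 kb genome is roughly 26,667x depth, so `pair_fraction_for_coverage(1_000_000, 200, 15_000, 100)` asks you to keep about 0.4% of the pairs to reach about 100x. The 100x figure is an arbitrary choice for the example; the thread does not state a target depth.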

You may want to give tadpole.sh from BBMap a try.

ADD REPLY • link 8.4 years ago by GenoMax 148k

1

I tried tadpole a few months ago on GAGE-B. It generated a highly fragmented assembly, apparently because it does not do any graph pruning. I am not sure it is a good choice for the OP.

ADD REPLY • link 8.4 years ago by lh3 33k

2

Tadpole produces much more fragmented assemblies than, say, SPAdes on most datasets, such as bacteria and more complex organisms. So I would never expect the current version to outperform SPAdes on a bacterial benchmark in terms of continuity. But for whatever reason, it has produced much better assemblies for some viruses, in situations where SPAdes produces a very poor assembly.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Have you tried tuning SPAdes or feeding tadpole-corrected reads to it? The lack of graph pruning still bugs me. If there is heterogeneity between strains, how will tadpole deal with that? It seems to me that the right combination would be an aggressive error corrector that is robust to ultra-high depth plus an assembler capable of sophisticated graph cleaning.

ADD REPLY • link 8.4 years ago by lh3 33k

0

I think the problems were a result of SPAdes making multiple duplicate copies of polymorphic regions of the viruses where it thought there were repeats. The assemblies ended up many times larger than expected. This persisted despite attempts at both error-correction (using Tadpole) and normalization, so I assume it is due to the interplay of graph-processing heuristics and a high viral polymorphism rate, rather than errors. I did not try tuning SPAdes' parameters, though, as I have not found in the past that I was able to achieve better assemblies by doing so. I agree that sophisticated graph operations should result in a better assembly, as they do for bacteria, but such operations are always based on assumptions, and it appears the assumptions did not fit these viruses very well.
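A quick way to spot the "assembly many times larger than expected" symptom is to compare the total contig length with the expected genome size. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the input is a plain list of contig lengths that you would have to collect from your own assembly output.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

def size_ratio(lengths, expected_genome_len):
    """Total assembled bases divided by the expected genome size."""
    return sum(lengths) / expected_genome_len

# Hypothetical contig lengths for a 15 kb genome.
contigs = [8000, 4000, 2000, 1000]
print(n50(contigs))                 # 8000
print(size_ratio(contigs, 15_000))  # 1.0
```

A ratio well above 1 would be consistent with duplicated polymorphic regions as described above, although contamination or host reads could inflate it as well.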

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

1

That is why I posted it as a comment: I was not sure it would work.

@ebrahimiet has another post that I guess is related to this one. One problem could be an excess of data, considering the small genome and the 200 bp PE Illumina reads.

ADD REPLY • link 8.4 years ago by GenoMax 148k

0

I am using CLC Genomics Workbench

ADD REPLY • link 8.4 years ago by ebrahimiet ▴ 50

score 0 · Answer 1 · 2016-08-01

0

8.4 years ago

Sej Modha 5.3k

We use SPAdes for virus genome assemblies with recommended k-mer values and it produces really good assemblies.

ADD COMMENT • link 8.4 years ago by Sej Modha 5.3k

score 0 · Answer 2 · 2016-08-01

Contigs can be short for one of two reasons: 1) a contig connects to too few contigs; 2) a contig connects to too many contigs. The first step is to check which is the case. For your data, 2) is more likely, with the assembler picking up strain differences. In that case you need an assembler that can aggressively prune error- or variant-containing subgraphs. What assembler are you using? SPAdes is always a good start for small genomes. Velvet and my fermi-lite might be worth trying.
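The two cases can be illustrated on a toy de Bruijn graph. This is a minimal sketch, not how any of the assemblers named here work internally: it ignores reverse complements and coverage, the function names are hypothetical, and the two sequences are made-up "strains" differing by a single base. Nodes with no predecessor or no successor stand in for case 1 (a contig end with nothing to connect to), and nodes with more than one predecessor or successor stand in for case 2 (a branch where the contig must stop).

```python
from collections import defaultdict

def build_debruijn(seqs, k):
    """Return successor and predecessor maps between (k-1)-mers, one edge per k-mer."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred

def classify_nodes(succ, pred):
    """Split nodes into dead ends (case 1) and branch points (case 2)."""
    nodes = set(succ) | set(pred)
    dead_ends = {n for n in nodes if not succ.get(n) or not pred.get(n)}
    branches = {n for n in nodes
                if len(succ.get(n, ())) > 1 or len(pred.get(n, ())) > 1}
    return dead_ends, branches

# Two made-up strains that differ at one position (C vs A).
strains = ["ATGCCGTAA", "ATGCAGTAA"]
dead_ends, branches = classify_nodes(*build_debruijn(strains, k=4))
print(sorted(dead_ends))  # ['ATG', 'TAA']
print(sorted(branches))   # ['GTA', 'TGC']
```

With a single strain the branch set is empty and the whole sequence forms one path; adding the second strain creates a bubble between `TGC` and `GTA`, which an assembler without pruning would report as several short contigs.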