Hi all,
I am performing de novo assembly of the NDV genome (a 15 kb negative-sense RNA genome) from Illumina 200 bp paired-end reads. How can I increase the length of the assembled contigs?
thanks
Contigs can be short for one of two reasons: 1) a contig connects to too few other contigs; 2) a contig connects to too many other contigs. The first thing is to check which is the case. For your data, it is more likely that 2) is happening, with the assembler picking up strain differences. Then you need an assembler that can aggressively prune error- or variant-containing subgraphs. What assembler are you using? SPAdes is always a good start for small genomes. Velvet and my fermi-lite might be worth trying.
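If you have not tried SPAdes yet, a minimal run looks something like this (a sketch; the read file names are placeholders for your own data):

# placeholder file names; --careful adds post-assembly mismatch correction
spades.py -1 reads_1.fastq -2 reads_2.fastq --careful -o spades_out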
A tiny genome like that should be easily assembled with PE 200 bp reads. You may actually have the problem of having too much data, in which case you would need to subsample. How much data do you have (and is it all from this virus)?
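If subsampling does turn out to be necessary, reformat.sh from BBMap can do it; a sketch, with placeholder file names and a sample rate you would tune to your actual depth:

# placeholder file names; samplerate=0.05 keeps ~5% of read pairs
reformat.sh in1=reads_1.fq in2=reads_2.fq out1=sub_1.fq out2=sub_2.fq samplerate=0.05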
You may want to give tadpole.sh from BBMap a try.

I tried Tadpole a few months ago on GAGE-B. It generated highly fragmented assemblies because it does not seem to do any graph pruning. I am not sure it is a good choice for the OP.
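For reference, a basic Tadpole assembly run looks something like this (placeholder file names; k is a tuning knob that depends on read length and depth):

# placeholder file names; Tadpole's default mode is contig assembly
tadpole.sh in1=reads_1.fq in2=reads_2.fq out=tadpole_contigs.fa k=93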
Tadpole produces much more fragmented assemblies than, say, SPAdes on most datasets, such as bacteria and more complex organisms, so I would never expect the current version to outperform SPAdes on a bacterial benchmark in terms of contiguity. But for whatever reason, it has produced much better assemblies for some viruses, in situations where SPAdes produces a very poor assembly.
Have you tried tuning SPAdes, or feeding Tadpole-corrected reads to it? The lack of graph pruning still bugs me. If there is heterogeneity between strains, how will Tadpole deal with that? It seems to me that the right combination would be an aggressive error corrector that is robust to ultra-high depth, plus an assembler capable of sophisticated graph cleaning.
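Concretely, I mean a two-step pipeline along these lines (a sketch with placeholder file names): Tadpole in correction mode, then SPAdes with its internal read correction disabled:

# step 1: placeholder file names; error-correct the reads only
tadpole.sh in1=reads_1.fq in2=reads_2.fq out1=corr_1.fq out2=corr_2.fq mode=correct
# step 2: assemble the corrected reads, skipping SPAdes' own corrector
spades.py -1 corr_1.fq -2 corr_2.fq --only-assembler -o spades_corr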
I think the problems were a result of SPAdes making multiple duplicate copies of polymorphic regions of the viruses where it thought there were repeats; the assemblies ended up many times larger than expected. This persisted despite attempts at both error correction (using Tadpole) and normalization, so I assume it is due to the interplay between graph-processing heuristics and a high viral polymorphism rate, rather than to errors. I did not try tuning SPAdes' parameters, though, as I have not found in the past that I could achieve better assemblies by doing so. I agree that sophisticated graph operations should result in a better assembly, as they do for bacteria, but such operations are always based on assumptions, and it appears those assumptions did not fit these viruses very well.
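For context, normalization here means depth normalization of the kind bbnorm.sh from BBMap performs; a sketch with placeholder file names and an assumed target depth of 100x:

# placeholder file names; downsamples high-depth regions toward ~100x coverage
bbnorm.sh in1=reads_1.fq in2=reads_2.fq out1=norm_1.fq out2=norm_2.fq target=100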
That was the reason I had put it in the comments: I was not sure it would work.
@ebrahimiet has another post that I guess is related to this one. One problem could be an excess of data in this case, considering the small genome and the 200 bp PE Illumina reads.
I am using CLC Genomics Workbench.