Hi Biostar,
I am looking into assembly of 454 reads from a metagenomic sample into contigs for protein prediction (homology, de-novo gene finding).
In the papers and data sets that I have looked at so far, people mostly focus on phylotyping and thus mostly rely on the raw reads. In cases where they do assemble the reads, the assemblies are mediocre (a huge percentage of singletons, only a few contigs >2000 bp), and the N50 barely exceeds the average read length.
Now my question is: why are the assemblies so bad? I assume that the coverage provided by a single 454 run (~1M reads) is too low and that, together with 454's error model, Newbler has a hard time finding enough overlaps. I also tried the MIRA assembler on one data set, but the result is more or less the same. Velvet didn't work any better on these reads either.
So, does somebody have a suggestion on how to improve the assembly? Another piece of software? More runs and thus higher coverage?
I am grateful for your suggestions. Thanks!
What's your estimated coverage of the target genome(s)? Generally, higher coverage yields better assemblies.
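As a rough back-of-the-envelope check, average coverage is roughly (number of reads × average read length) / estimated target size. A minimal sketch is below; the read length and especially the total community genome size are hypothetical placeholders, not numbers from the post (knowing the target size at all is the hard part for a metagenome):

```python
# Rough Lander-Waterman-style coverage estimate:
# coverage = (number of reads * average read length) / estimated target size
# All values below are hypothetical, for illustration only.

n_reads = 1_000_000        # ~1M reads from a single 454 run
avg_read_len = 400         # assumed average 454 read length (bp)
est_target_size = 200e6    # assumed combined size of community genomes (bp)

coverage = n_reads * avg_read_len / est_target_size
print(f"Estimated average coverage: {coverage:.1f}x")  # -> 2.0x with these numbers
```

With numbers like these you end up at only a few-fold coverage spread unevenly across many genomes, which is consistent with assemblies full of singletons.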
Once you consider the complexity of the problem - assembling reads from an unknown number of potentially similar genomes, sampled via short and increasingly noisy reads - classic-style assembly is bound not to work correctly.