Question

Assembly Strategy For Large(R) Genome

1

12.1 years ago

burkhart.joshua ▴ 30

My lab recently tasked me with assembling a genome estimated at 850,000,000 bp (850 Mbp). This is a de novo assembly, and my only experience is with Velvet. I'm thinking about moving to SGA or SOAPdenovo but would first like to know what the community would recommend. I have some sense of Velvet's computational limits (RAM and execution time) but not those of other assemblers. Any suggestions?

genome assembly velvet • 4.1k views

updated 12.1 years ago by rtliu ★ 2.2k • written 12.1 years ago by burkhart.joshua ▴ 30

0

This review might get you started

12.1 years ago by Irsan ★ 7.8k

score 6 · Answer 1 · 2012-11-05

6

12.1 years ago

Josh Herr 5.8k

The lab I work in is sequencing some very large plant genomes (mostly trees), ranging from 8.5 Gbp to 20 Gbp in length. It's not easy: with a genome that large you inevitably have a lot of duplications, repeats, and heterozygosity, and adding sequencing errors on top can make things pretty hellish. Good-quality sequencing data is absolutely key.

We've used Velvet on a large cluster, but we still have memory issues, and de novo runs can take weeks and still end up crashing. We have also tried both SOAPdenovo and CLC, but honestly I don't know if I would recommend one over the other at this point. We're far from solving the issue.

This issue is difficult (like another area I am working in, metagenomic shotgun sequencing assembly) and requires a ton of memory. We're just starting to use Titus Brown's diginorm script (his blog post with links, github) to try to reduce the memory load.

Don't know if any of this helps, but it's what we're up against.
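
To make the digital normalization idea mentioned above more concrete, here is a minimal sketch of the "keep a read only if its median k-mer count is still low" rule. It is not the actual diginorm script: the function names, the exact in-memory `Counter`, and the default values `k=20` and `cutoff=20` are assumptions chosen for illustration. A real tool would use a memory-bounded counting structure rather than a Python dictionary.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers,
    among the reads kept so far, is below `cutoff`.

    Illustrative only: counts are exact and held in memory.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
    return kept


if __name__ == "__main__":
    # Toy input: one sequence covered 100 times plus one rare read.
    reads = ["ACGTACGTAC"] * 100 + ["TTTTGGGGCC"]
    kept = normalize_by_median(reads, k=4, cutoff=5)
    print(len(reads), "reads in,", len(kept), "reads kept")
```

The point of the design is that redundant high-coverage reads are dropped before assembly, while reads from low-coverage regions are all retained, so the assembler sees far fewer reads without losing much information.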

12.1 years ago by Josh Herr 5.8k

2

Upvote for the diginorm script - we use a similar script in-house, and it greatly reduces the time needed for the assembly and vastly improves its quality. We mostly use Velvet (and VelvetOptimiser.pl) for wheat and Brassica and have gotten some good results in the past.

Addendum: Trimming low-quality bases from reads (e.g., cutting off the tail of a read once the quality drops below 20) gives you a faster and better analysis as well.
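
A rough sketch of that kind of tail trimming, assuming FASTQ quality strings in Phred+33 encoding. The function name `trim_tail` and its parameters are made up for this example; in practice a dedicated trimming tool would normally be used instead.

```python
def trim_tail(seq, qual, min_q=20, offset=33):
    """Strip bases from the 3' end while their Phred quality is below min_q.

    Assumes `qual` is a Phred+33 encoded string of the same length as `seq`.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2, '!' encodes Q0.
    seq, qual = trim_tail("ACGTACGT", "IIIII5#!")
    print(seq, qual)
```

This only removes the low-quality tail; dropping reads that end up too short after trimming would be a separate filtering step.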

12.1 years ago by Philipp Bayer 8.7k

0

Yes, upvote for the quality control on the sequences. Trimming and purging bad reads is a really important point.

12.1 years ago by Josh Herr 5.8k

score 1 · Answer 2 · 2012-11-06

We will be facing similar challenges in the near future. If I had time, I would like to try:

SGA - String Graph Assembler; it assembled a human genome in 54 GB of memory (Genome Research paper).
fermi - "Fermi is substantially influenced by SGA. It follows a similar workflow, including the idea of contrasting read sets. On the other hand, the internal implementation of fermi is distinct from that of SGA. Fermi is based on a novel data structure and uses different algorithms for almost every step. As to the end results, fermi has a similar performance to SGA for features shared between them, and is arguably easier to use" - Author Heng Li
MSR-CA - The MSR-CA assembler combines the benefits of de Bruijn graph and Overlap-Layout-Consensus assembly approaches (manual). The software is under active development. It is the main assembler for the Pinus taeda genome (~24 Gbp).
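
For readers less familiar with the de Bruijn graph approach mentioned in the list above (it is also what Velvet and SOAPdenovo are built on), here is a toy sketch of how such a graph is built from reads. The function name and the plain dictionary representation are assumptions for illustration and bear no relation to how any of these assemblers store the graph internally.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    for node, successors in de_bruijn(["ACGT", "CGTA"], k=3).items():
        print(node, "->", sorted(successors))
```

The graph grows with the number of distinct k-mers rather than the number of reads, so sequencing errors (which create new, spurious k-mers) inflate memory use. That is one reason read trimming and normalization help so much on large genomes.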