Question

Assembly Strategy

11.6 years ago

Panos ★ 1.8k

We're sequencing some large insect genomes (>2 Gbp), and as the first data come in, I'm trying to find a way to tackle them.

MaSuRCA crashed on just one lane of unfiltered sequences (~2 × 140 million reads), so I started looking for alternatives, since for some of our insects we have 10 lanes (paired-end and mate-pair).

One suggestion was to split our reads into smaller subsets and assemble each subset separately, then assemble the resulting assemblies with each other, and so on, until we reach a final assembly. One problem I see with this approach, however, is that you may end up merging contigs from different assemblies that have very different sequencing coverage (and hence different apparent copy number).

What are your thoughts on this approach, and on the assembly strategy I should follow in general? I know that some plant genomes are a lot bigger than our insects, so maybe there is already a solution!

Also, the machine I ran MaSuRCA on had 256 GB of RAM, which I don't think is small; maybe I can find a machine with 512 GB, but definitely not more than that. So please keep that in mind when suggesting solutions!

Lastly, I saw that there's a very similar question, but it was asked more than a year ago, so some things may have changed since then...

Thanks!

illumina • 2.4k views

updated 11.6 years ago by Charles Warden 8.3k • written 11.6 years ago by Panos ★ 1.8k

score 1 · Answer 1 · 2014-02-13

Splitting up the reads is likely a necessary strategy. I remember having to do this with some herpesvirus sequences. You can use a secondary aligner (such as the one in Staden) to assemble contigs produced with different parameters and/or from different subsets of the data (or even by different assembly programs). Here is an example workflow for that sort of strategy:

http://genomics-pubs.princeton.edu/prv/scripts.shtml
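As a rough illustration of the splitting step itself (this is not taken from the linked workflow), here is a minimal Python sketch that deals read pairs out to N subsets in round-robin order, so each subset gets roughly 1/N of the coverage and mates stay in sync. It assumes uncompressed, 4-line-per-record FASTQ files with R1 and R2 in the same order; the function name and output file names are made up for the example.

```python
import itertools
from contextlib import ExitStack
from pathlib import Path


def read_fastq(handle):
    """Yield FASTQ records as tuples of 4 raw lines (assumes 4-line records)."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_paired_fastq(r1_path, r2_path, n_subsets, out_dir):
    """Deal read pairs round-robin into n_subsets pairs of FASTQ files.

    Returns the number of read pairs written to each subset.
    Output names (subset0_R1.fastq, ...) are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = [0] * n_subsets
    with ExitStack() as stack:
        r1 = stack.enter_context(open(r1_path))
        r2 = stack.enter_context(open(r2_path))
        outs = [
            (
                stack.enter_context(open(out_dir / f"subset{i}_R1.fastq", "w")),
                stack.enter_context(open(out_dir / f"subset{i}_R2.fastq", "w")),
            )
            for i in range(n_subsets)
        ]
        pairs = itertools.zip_longest(read_fastq(r1), read_fastq(r2))
        for idx, (rec1, rec2) in enumerate(pairs):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 contain different numbers of reads")
            k = idx % n_subsets  # round-robin keeps subsets evenly sized
            outs[k][0].writelines(rec1)
            outs[k][1].writelines(rec2)
            counts[k] += 1
    return counts
```

Something like `split_paired_fastq("lane1_R1.fastq", "lane1_R2.fastq", 8, "subsets")` would then give eight smaller paired datasets to assemble separately. Because the pairs are dealt out evenly, the subsets should have similar coverage to each other, although each one is of course much shallower than the full lane.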

You can also see if velvetOptimiser can help:

http://bioinformatics.net.au/software.velvetoptimiser.shtml

I know that Oases automatically runs Velvet with different parameters and collects the results, so you could use Oases and just take the final Velvet contigs (and ignore the transcripts, which I would recommend even if you were working with RNA-Seq data; I've actually found the normal assembly tools, like CLC Bio de novo, to be more accurate).
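If you would rather run the parameter sweep by hand, a small Python sketch like the one below can build the Velvet command lines for several k-mer sizes. The `velveth` / `velvetg` flags shown are the ones I understand from the Velvet manual (`-separate` needs a reasonably recent Velvet), so please check them against your installed version; the helper name and output directory prefix are made up for the example. Rejecting even k values is my own choice here, since Velvet only works with odd k-mer lengths.

```python
import shlex


def velvet_commands(r1, r2, kmers, prefix="velvet_k"):
    """Build velveth/velvetg command lines for a list of k-mer sizes.

    Only constructs the commands; nothing is executed here.
    The directory prefix is a hypothetical naming scheme.
    """
    commands = []
    for k in kmers:
        if k % 2 == 0:
            raise ValueError(f"k must be odd for Velvet, got {k}")
        out_dir = f"{prefix}{k}"
        # Hash the paired reads (two separate FASTQ files) at this k.
        commands.append(
            ["velveth", out_dir, str(k), "-shortPaired", "-fastq", "-separate", r1, r2]
        )
        # Build the graph, letting Velvet estimate coverage settings.
        commands.append(
            ["velvetg", out_dir, "-exp_cov", "auto", "-cov_cutoff", "auto"]
        )
    return commands


if __name__ == "__main__":
    for cmd in velvet_commands("subset0_R1.fastq", "subset0_R2.fastq", [21, 25, 31]):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell script or passed to `subprocess.run`. Each run leaves its contigs in its own directory, which you could then feed to the secondary merging step.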

That said, I think you are going to have gaps and issues with repetitive / homologous sequences no matter what. So, I just wanted to make sure that you weren't expecting to get full chromosomes out of the assembler.