Question

De novo genome assembly strategy

0

9.4 years ago

joneill4x ▴ 160

I am assembling a genome de novo. I have:

10X coverage with PacBio reads
100X coverage with Illumina short reads (150 bp paired-end reads)
20X coverage with long MiSeq reads (max length 800 bp)

Given what I have to work with, what would be the best strategy to assemble the genome and why?

Thank you,
Joe

Edit: genome size is ~1 Gb
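
For scale, the coverage figures above can be turned into rough data volumes. This is only a back-of-the-envelope sketch: it assumes average coverage is simply total sequenced bases divided by genome size, takes the genome as exactly 1 Gb, and treats every Illumina read as a full 150 bp. The function and variable names are illustrative.

```python
def bases_for_coverage(genome_size_bp: int, coverage: float) -> float:
    """Total sequenced bases implied by a given average coverage."""
    return genome_size_bp * coverage


GENOME_SIZE_BP = 1_000_000_000  # ~1 Gb, as stated in the question (assumed exact here)

# Coverage per data set, taken from the question
datasets = {"PacBio": 10, "Illumina 150 bp PE": 100, "MiSeq long reads": 20}

# Sequence yield in gigabases for each data set
yield_gb = {
    name: bases_for_coverage(GENOME_SIZE_BP, cov) / 1e9
    for name, cov in datasets.items()
}
for name, gb in yield_gb.items():
    print(f"{name}: ~{gb:.0f} Gb of sequence")

# Approximate number of 150 bp Illumina reads behind 100X coverage
illumina_reads = bases_for_coverage(GENOME_SIZE_BP, 100) / 150
print(f"Illumina reads: ~{illumina_reads / 1e6:.0f} million")
```

Numbers of this size are why memory and run time, rather than read types alone, tend to decide which assembler is practical.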

Assembly sequencing genome • 5.4k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 9.4 years ago by joneill4x ▴ 160

2

You should specify the genome type. Some tools will not be able to work on big genomes.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by Juke34 9.2k

0

We have a similar data set, and I was wondering what you decided to use in the end. I would also appreciate hearing about your experience. Thanks

ADD REPLY • link 8.9 years ago by s-writes • 0

0

I ended up using DBG2OLC

What led me there: https://github.com/PacificBioscience...Bio-Long-Reads

The publication: http://arxiv.org/ftp/arxiv/papers/1410/1410.2801.pdf

The code: http://sourceforge.net/projects/dbg2olc/

I'm quite pleased with the results of DBG2OLC.

I corresponded with the authors, managed to closely replicate the results from their paper, and made some pretty decent draft assemblies of my own with minimal data. Fast performance and good results.

ADD REPLY • link 8.9 years ago by joneill4x ▴ 160

Ram · Answer 1 · 2015-11-27

3

9.4 years ago

Adrian Pelin ★ 2.7k

SPAdes should provide very nice results for your dataset. It will assemble your 100x Illumina data using a multi-k-mer approach, then resolve some repeats using your long MiSeq reads, and additionally scaffold using the PacBio reads.

http://bioinf.spbau.ru/spades

So you can use their suggested guidelines for 150bp reads:

spades.py -k 21,33,55,77 --careful <your reads> -o spades_output

You can specify the PacBio reads with: --pacbio

Your 100x paired-end reads (left and right files of library 1) with: --pe1-1 and --pe1-2

and your single-end MiSeq reads with: --s2
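
Putting those options together, a full invocation might be assembled as in the sketch below. SPAdes names the two files of paired-end library 1 `--pe1-1` and `--pe1-2`. All file names here are hypothetical placeholders, and the helper `build_spades_command` is illustrative, not part of SPAdes. The sketch only builds and prints the command line; it does not run the assembler.

```python
import shlex


def build_spades_command(pe_left, pe_right, miseq_single, pacbio, out_dir,
                         kmers=(21, 33, 55, 77)):
    """Return the spades.py argument list for a hybrid run (illustrative helper)."""
    return [
        "spades.py",
        "-k", ",".join(str(k) for k in kmers),  # multi k-mer assembly
        "--careful",                             # mismatch/indel correction
        "--pe1-1", pe_left,                      # paired-end library 1, left reads
        "--pe1-2", pe_right,                     # paired-end library 1, right reads
        "--s2", miseq_single,                    # single-read library 2 (long MiSeq reads)
        "--pacbio", pacbio,                      # PacBio reads for gap closing/repeat resolution
        "-o", out_dir,
    ]


# Hypothetical input file names
cmd = build_spades_command(
    "illumina_R1.fastq", "illumina_R2.fastq",
    "miseq_long.fastq", "pacbio.fastq", "spades_output",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files.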

ADD COMMENT • link 9.4 years ago by Adrian Pelin ★ 2.7k

0

A nice tool. But it will work only for smaller genomes.

ADD REPLY • link 9.4 years ago by Juke34 9.2k

0

I have used it on genomes up to 150 Mb. Then again, the OP did not mention what the genome size is.

ADD REPLY • link 9.4 years ago by Adrian Pelin ★ 2.7k

0

Thanks Adrian. Using SPAdes was my first thought too. However, my genome is large, ~1 Gb, so I don't think I can use it.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by joneill4x ▴ 160

0

I found SPAdes and dipSPAdes to run extremely slowly when using PacBio reads as input.

ADD REPLY • link 8.9 years ago by joneill4x ▴ 160

Ram · Answer 2 · 2015-11-25

1

9.4 years ago

Juke34 9.2k

ALLPATHS-LG can be a solution: it will perform the assembly from the Illumina short reads and then scaffold using the PacBio data.

For Illumina reads it needs high coverage (100x), so your data are fine in that respect. On the other hand, it needs very specific libraries (3 kbp mate-pair?). You should check.

ADD COMMENT • link 9.4 years ago by Juke34 9.2k

0

Thanks Juke.

ADD REPLY • link 9.4 years ago by joneill4x ▴ 160

0

IIRC ALLPATHS-LG requires overlapping PE and one short mate-pair library. So it may not work if the above libraries don't fit this specification.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by Chris Fields ★ 2.2k

Ram · Answer 3 · 2015-11-27

ALLPATHS‐LG requires a minimum of 2 paired‐end libraries - one short and one long. The short library average separation size must be slightly less than twice the read size, such that the reads from a pair will likely overlap - for example, for 100 base reads the insert size should be 180 bases. The distribution of sizes should be as small as possible, with a standard deviation of less than 20%. The long library insert size should be approximately 3000 bases long and can have a larger size distribution. Additional optional longer insert libraries can be used to help disambiguate larger repeat structures and may be generated at lower coverage

EDIT: Copied from the manual
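
The requirements quoted above can be turned into a quick sanity check on a set of libraries. This is only a sketch of one reading of that passage: "slightly less than twice the read size" is simplified to "strictly less than twice", the 20% limit is applied to the standard deviation relative to the mean insert size, and the function name and the example numbers other than the manual's 100/180 case are hypothetical.

```python
def check_allpaths_libraries(read_len, short_insert, short_sd, long_insert=None):
    """Return a list of problems found against the quoted library requirements."""
    problems = []
    # Reads of a pair can only overlap if the fragment is shorter than two reads
    if not short_insert < 2 * read_len:
        problems.append("short-library pairs are unlikely to overlap")
    # Insert-size spread should be under 20% of the mean insert size
    if short_sd >= 0.2 * short_insert:
        problems.append("short-library insert size distribution is too wide")
    # A second, long-insert (~3000 bp) library is required
    if long_insert is None:
        problems.append("no long-insert (~3000 bp) library")
    return problems


# The manual's example: 100-base reads, 180-base inserts (sd of 18 is assumed),
# plus a 3000-base long library
manual_example = check_allpaths_libraries(100, 180, 18, long_insert=3000)
print(manual_example)

# A data set like the one in the question: 150 bp paired-end reads only.
# The 500 bp insert size and sd of 50 are assumed, since the question does not state them.
question_like = check_allpaths_libraries(150, 500, 50)
print(question_like)
```

Under these assumptions the manual's example passes, while a single standard paired-end library with no mate-pair data fails on both overlap and the missing long library.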

Ram · Answer 4 · 2015-11-28

1

9.4 years ago

Juke34 9.2k

You can also use MaSuRCA mega-reads.

MaSuRCA in general gives relatively good results.

It is one of the few true hybrid assemblers (de Bruijn graph/OLC).

ADD COMMENT • link 9.4 years ago by Juke34 9.2k

1

Thanks Juke. However, I don't think I should use it for my task because "We note that the modified version of CABOG 6.1 used in MaSuRCA is not capable of supporting the long high-error-rate reads generated by the PacBio technology."

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by joneill4x ▴ 160

Ram · Answer 5 · 2015-12-07

0

9.4 years ago

joneill4x ▴ 160

*Deleted

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by joneill4x ▴ 160