Question

Looking for a recommendation to perform a de novo assembly with MiSeq data of lengths 2x250bp

4

12.2 years ago

Leszek 4.2k

I have overlapping MiSeq 2x250bp reads (after merging, single reads of 400-450bp). The genome size is ~20Mb. I suspect de Bruijn graph based assemblers are not the way to proceed with such a dataset, are they?
Do you have any experience assembling this kind of data? Maybe some 'good-old-times' (overlap-based) assembler can handle it better?

assembly miseq denovo • 7.7k views

ADD COMMENT • link updated 12.2 years ago by 14134125465346445 ★ 3.6k • written 12.2 years ago by Leszek 4.2k

1

What is the sequencing library insert size? If it is 500-600 or below, you may try to find overlaps within each pair of reads with Quake. You may also get better overlaps if you error-correct prior to Quake.

ADD REPLY • link 12.2 years ago by Darked89 4.7k

1

I did, so what I'm playing with is single reads of 350-450bp (100x) and paired reads (2x250bp) that didn't merge correctly (50x).

ADD REPLY • link 12.2 years ago by Leszek 4.2k

0

What organism do you have that is 20Mb? Small end of the eukaryotes? If you have good coverage (this is key) then any of the usual assemblers will do. I like Velvet for a genome this size. Someone else might like another assembler. Depending on your sequencing depth and gene space you'll probably have to do some post-assembly cleanup. I like PAGIT for that.

ADD REPLY • link 12.2 years ago by Josh Herr 5.8k

0

It's an average fungal genome. The thing is, the genome is quite heterozygous, so de Bruijn graph assemblers (Velvet, SOAP, ABySS) are having a hard time and shattering it a lot... I'm leaning toward an older-style assembler like Newbler or Celera. Has anyone tried one with MiSeq data?

ADD REPLY • link 12.2 years ago by Leszek 4.2k

0

I work with fungi too. Sounds like 20Mb is in yeast territory, so it's on the smaller side. I am working on assembly of a few in the 40 to 60 Mb range and they also have high heterozygosity. We still mainly use de Bruijn style assemblers, but I also use Newbler on occasion. I think coverage is key. My suggestion is to try Newbler and see how the assembly compares to a de Bruijn assembler like Velvet. Good luck and let me know if you want to commiserate with me about it!

ADD REPLY • link 12.2 years ago by Josh Herr 5.8k

1

Thanks Josh. I quickly tried SOAPdenovo some time ago and it performed below my expectations... this is why I want to try something old style. Maybe I will give ALLPATHS a try, as the Broad designed it with overlapping reads in mind... Anyway, I will keep you posted.

ADD REPLY • link 12.2 years ago by Leszek 4.2k

1

@Leszek, to my knowledge you can't use ALLPATHS this way (with just one library), unless there is some hack I don't know about. With a genome this size you should be able to benchmark numerous methods in a reasonable amount of time. I agree with Josh's approach: I'd run Newbler and VelvetOptimiser and see how they compare, given your read lengths.

ADD REPLY • link 12.2 years ago by SES 8.6k
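
When benchmarking several assemblers as suggested above, one commonly used (though far from sufficient) summary statistic is the contig N50. The thread does not say which metrics anyone used, so the following is only a minimal Python sketch of one way to compute it; the function names `fasta_lengths` and `n50` are made up for illustration.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = 0
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: store the length of the previous one.
            if seen_header:
                lengths.append(current)
            seen_header = True
            current = 0
        elif seen_header:
            current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")
```

For a real comparison you would call `n50(fasta_lengths(open(path)))` on each assembler's contig file (the path is whatever your assembler wrote). For a heterozygous genome it is worth also comparing the total assembly size, since separately assembled haplotypes can inflate it.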

score 5 · Answer 1 · 2013-02-01

5

12.2 years ago

Bach ▴ 550

For trying overlap based assemblers: absolutely do reduce the data set. Maybe to something like ~80x, but not really much more. Say, 40x from your merged reads, 40x from still paired reads. Then try out any of the usual suspects.

My first try would be with MIRA, but that is just because I wrote it. In case you use MIRA: make sure the merged reads have all adaptors clipped away. The unmerged reads should not be preprocessed at all, MIRA will clip them just right (adaptors, quality, simple sequencing errors, etc.)

ADD COMMENT • link 12.2 years ago by Bach ▴ 550
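
The down-sampling suggested in the answer above comes down to simple arithmetic. Here is a rough Python sketch, assuming the figures quoted in the thread (~20Mb genome, ~100x merged reads, ~50x unmerged pairs) plus an assumed mean merged read length of 400bp; the helper names are hypothetical and this is not a MIRA feature.

```python
import random


def reads_for_coverage(genome_size_bp, target_coverage, bases_per_record):
    """Number of records (single reads or read pairs) needed for a target coverage."""
    return round(genome_size_bp * target_coverage / bases_per_record)


def keep_fraction(current_coverage, target_coverage):
    """Fraction of records to keep; never more than everything."""
    return min(1.0, target_coverage / current_coverage)


def subsample(records, fraction, seed=42):
    """Keep each record with the given probability (fixed seed for repeatability).

    For paired data, pass one record per pair so that mates stay together.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


# Figures from the thread; the 400bp mean merged length is an assumption.
merged_fraction = keep_fraction(current_coverage=100, target_coverage=40)
paired_fraction = keep_fraction(current_coverage=50, target_coverage=40)
merged_needed = reads_for_coverage(20_000_000, 40, 400)      # single merged reads
pairs_needed = reads_for_coverage(20_000_000, 40, 2 * 250)   # read pairs
```

Under those assumptions you would keep about 40% of the merged reads and 80% of the unmerged pairs. Random per-record sampling gives only approximately the target depth, which is fine for a "~80x, but not really much more" goal.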

0

Hi Bastien. I'm trying MIRA. It has been running since yesterday; we'll see. Right now I'm running ~60x (the reads that merged successfully). Good point, I was also considering running the reads that didn't merge correctly as a second lib (these likely have too large an insert size or low quals toward the ends, so they didn't merge correctly).

ADD REPLY • link 12.2 years ago by Leszek 4.2k

0

MIRA has been running since Thursday (third pass) on 10 cores and is using 50GB of RAM. Is that fine? Reported coverage is 51x. Another point: how well does MIRA handle heterozygous regions (3-4% divergence)? It reports 24Mb, while I'm expecting ~13Mb...

ADD REPLY • link 12.2 years ago by Leszek 4.2k

score 2 · Answer 2 · 2013-02-02

2

12.2 years ago

14134125465346445 ★ 3.6k

SGA is an overlap-based assembler that works well with Illumina datasets: http://github.com/jts/sga

ADD COMMENT • link 12.2 years ago by 14134125465346445 ★ 3.6k