Question

plant genome assembler?!

1

8.5 years ago

Prasad ★ 1.6k

Hi all,

This might be a repeat question, but I couldn't find a better solution. I am working on an aromatic rice genome (~500 Mb). I have Illumina HiSeq data (~300 M reads, 2 × 150 bp). So far I have tried ABySS, IDBA-UD, Platanus, SOAP and MaSuRCA. Of these, IDBA-UD has given the best result (1.29 M scaffolds, N50 1,857 bp, 598 Mb). MaSuRCA, which performed well in its paper, is not working well in my case (not that it necessarily has to). My question: are there any other tools I could use? I have already tried a few from Assemblathon 2.

Any suggestions are appreciated.

Thanks
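
For anyone comparing numbers across assemblers, here is a minimal sketch of how the N50 and raw coverage figures quoted above can be computed. The scaffold lengths in the example are made up, and the coverage line assumes "~300 M reads" means 300 million individual 150 bp reads (if it means 300 M pairs, the figure doubles).

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def coverage(n_reads, read_len, genome_size):
    """Average raw depth: total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size


# Made-up scaffold lengths, for illustration only.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))

# Assumption: 300 M single reads of 150 bp on a ~500 Mb genome.
print(coverage(300_000_000, 150, 500_000_000))
```

With these inputs the script prints 400 and 90.0, i.e. roughly 90x raw short-read depth under the stated assumption.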

plant Assembly • 3.6k views

ADD COMMENT • link updated 8.5 years ago by colindaven 7.4k • written 8.5 years ago by Prasad ★ 1.6k

1

Not an answer to your question, just an idea about assembling "small" genomes: why don't you get yourself a MinION (Oxford Nanopore) and get a better assembly with some nice long reads? The initial investment is quite small, and one sequencing run (about 600 dollars) will give you about 15-20x coverage of this genome. It obviously depends on how often you would need to do this and what quality of assembled genome you require.

[Disclaimer: I'm a customer of Oxford Nanopore sequencing but have no other links to the company]

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

0

I think the obstacle may be the cost and/or the error rate.

ADD REPLY • link 8.5 years ago by Medhat 9.8k

0

Scaffolding a genome with long reads combined with short, high-quality reads is not an uncommon approach. Besides, read accuracy is ~95%, which is quite okay.

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

0

I know. Especially for highly repetitive or complex genomes, combining long reads (PacBio or Nanopore) with short reads gives the best results. But again, this all depends on the funding and the project scope (as you said, if I have enough money I will get 20x coverage with any long-read technology and everything will be OK).

ADD REPLY • link 8.5 years ago by Medhat 9.8k

0

@WouterDeCoster - thanks for the suggestion. In my current situation, nanopore is not an option for now. I have also read that the error rate is a bit high.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

0

You could try MIRA and w2rap-contigger.

ADD REPLY • link 8.5 years ago by Medhat 9.8k

0

If you need a decent assembly, you would need other sequencing strategies as Wouter mentioned. I have had some experience with PacBio and it gave some pretty decent assemblies. The actual problem would be in cleaning and finishing the final assembly. Usually you need mate-pair sequences with different insert sizes along with paired-end data for a start.

ADD REPLY • link 8.5 years ago by Rohit ★ 1.5k

0

I do have 60 M mate-pair reads (5-7 kb inserts, NextSeq data). I was hoping to get a better result at the contig level, since I have quite good coverage in short reads.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

0

The DISCOVAR assembler works well if you have overlapping paired-end libraries; give it a try. Some more pre-processing steps would also help to achieve a better N50, but probably not more than about 1 kb above where you already are.

Did you try to error-correct the sequences and merge the overlapping paired-end reads? This gives the assembler much more information for building contigs.
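
To illustrate what merging overlapping pairs means, here is a toy sketch. It only accepts exact overlaps and ignores base qualities, so it is not how real read mergers work (they tolerate mismatches and use quality scores); the function names, the `min_overlap` parameter and the example fragment are made up for this illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read with its reverse mate into one longer sequence.

    read2 is reverse-complemented, then the longest exact match between
    the end of read1 and the start of the flipped mate is used as the
    overlap. Returns None when no overlap of at least min_overlap exists.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for olap in range(longest, min_overlap - 1, -1):
        if read1[-olap:] == mate[:olap]:
            return read1 + mate[olap:]
    return None


# Hypothetical 30 bp fragment sequenced from both ends with 20 bp reads,
# so the two reads overlap by 10 bp in the middle.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
r1 = fragment[:20]
r2 = reverse_complement(fragment[10:])
print(merge_pair(r1, r2, min_overlap=5) == fragment)
```

The merged read spans the whole fragment, which is why merged pairs behave like longer, higher-confidence single reads during contig building.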

ADD REPLY • link 8.5 years ago by Rohit ★ 1.5k

0

Thanks, I will try DISCOVAR.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

0

If possible you should try MIRA too. It performs really well at genome sizes up to 400 Mb, but I have to say that its memory consumption is high.

ADD REPLY • link 8.5 years ago by Rohit ★ 1.5k

0

Memory was the reason I did not try it. I will give it a shot; hope it works.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

score 1 · Answer 1 · 2016-10-25

1

8.5 years ago

colindaven 7.4k

SOAPdenovo2 should work quite well if you have long range information, for example LJD or Mate pair libraries. If not, assembly will always be a major struggle with plant genomes.

Long reads are quite challenging to use in scaffolding plant genomes (e.g. SSPACE-LongRead is decent but very slow), but have a lot of potential. Hybrid assembly approaches are also challenging; one of the best in my experience is DBG2OLC + Racon.

I can't see you getting very much better assemblies with paired end data alone.

Best of luck, Colin

ADD COMMENT • link 8.5 years ago by colindaven 7.4k

0

I do have 60 M mate-pair reads (5-7 kb inserts, NextSeq data). I was hoping to get a better result at the contig level, since I have quite good coverage in short reads.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

0

Ah, great. Well, make sure the insert sizes of your mate-pair data are OK and that you are configuring the assemblers with the correct read orientation. This gets messed up a lot with mate pairs.

You can check the mate-pair insert size distribution by aligning the reads (e.g. with bwa or bowtie) to the reference genome, then checking the resulting BAM with bamtools:

    bamtools stats -in x.bam -insert
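
If you also want to see the orientation, here is a rough sketch that reads SAM text (for example the output of `samtools view x.bam`) and reports the pair count and median absolute template length per orientation. The function names are made up, the left/right decision uses only POS and PNEXT, and calling inward-facing pairs "FR" and outward-facing pairs "RF" is a naming assumption; which of the two your mate-pair library should show depends on the library prep.

```python
import statistics
from collections import defaultdict


def pair_orientation(flag, pos, pnext):
    """Classify a pair from the first read's FLAG, POS and PNEXT."""
    read_rev = bool(flag & 0x10)   # this read maps to the reverse strand
    mate_rev = bool(flag & 0x20)   # its mate maps to the reverse strand
    if read_rev == mate_rev:
        return "same-strand"
    # Strand of whichever read starts further left on the reference.
    left_rev = read_rev if pos <= pnext else mate_rev
    return "RF" if left_rev else "FR"


def insert_sizes_by_orientation(sam_lines):
    """Return {orientation: (pair_count, median_abs_tlen)} from SAM lines."""
    sizes = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        f = line.rstrip("\n").split("\t")
        flag, pos = int(f[1]), int(f[3])
        rnext, pnext, tlen = f[6], int(f[7]), int(f[8])
        # Count each pair once: paired reads, first in pair only.
        if not flag & 0x1 or not flag & 0x40:
            continue
        # Drop unmapped reads/mates and secondary/supplementary records.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        # Both mates must be on the same reference sequence.
        if rnext != "=" or tlen == 0:
            continue
        sizes[pair_orientation(flag, pos, pnext)].append(abs(tlen))
    return {k: (len(v), statistics.median(v)) for k, v in sizes.items()}


if __name__ == "__main__":
    import sys
    # Usage sketch: samtools view x.bam | python this_script.py
    for orient, (n, med) in insert_sizes_by_orientation(sys.stdin).items():
        print(orient, n, med)
```

If most pairs land in the orientation you did not expect, or the median is a few hundred bp instead of 5-7 kb, the library probably contains many paired-end contaminants or the orientation setting in the assembler config needs flipping.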

It makes sense to do read trimming with your favourite tool and duplicate removal (e.g. with BBMap's dedupe.sh) to carefully curate your reads. If the mate-pair library was poor, you will have many (>80%) duplicates.
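
For a quick feel of the duplicate level before running a proper tool, here is a minimal sketch that counts exact duplicate pairs. It is only an illustration, not what dedupe.sh does internally, and it assumes the read pairs are already available as (read1, read2) sequence tuples; the function name and the example reads are made up. For 60 M pairs, holding full sequences in a set needs a lot of memory, so run it on a subsample.

```python
def duplicate_fraction(pairs):
    """Fraction of read pairs that exactly repeat an earlier pair.

    pairs: iterable of (read1_sequence, read2_sequence) tuples.
    """
    seen = set()
    total = 0
    duplicates = 0
    for read1, read2 in pairs:
        total += 1
        key = (read1, read2)
        if key in seen:
            duplicates += 1      # same pair of sequences seen before
        else:
            seen.add(key)
    return duplicates / total if total else 0.0


# Made-up pairs: the first pair occurs three times.
example = [
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),
    ("GGCC", "AATT"),
    ("ACGT", "CCCC"),
]
print(duplicate_fraction(example))
```

Here two of the five pairs are repeats, so the fraction is 0.4. A value approaching 0.8 on real mate-pair data would match the "poor library" case described above.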

ADD REPLY • link 8.5 years ago by colindaven 7.4k