Question

Combined assembly analysis (short reads + long reads)

2

8.6 years ago

XC ▴ 20

Dear NGS Experts,

I have a question about combined genome assembly.

We have 75X HiSeq sequencing of an animal genome (about 3 Gb) together with 50X PacBio Sequel data. We would now like to do a combined assembly of these 350 Gb of data. Does anybody know of any tools for this kind of analysis?
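As a quick sanity check on these numbers, here is a minimal Python sketch that converts fold-coverage into total bases. The genome size is the approximate 3 Gb quoted above, and the helper name `bases_for_coverage` is made up for illustration.

```python
def bases_for_coverage(genome_size_bp: float, coverage: float) -> float:
    """Total sequenced bases implied by a given fold-coverage."""
    return genome_size_bp * coverage


GENOME_SIZE = 3e9  # approximate genome size from the question (assumption: exactly 3 Gb)

hiseq = bases_for_coverage(GENOME_SIZE, 75)   # Illumina HiSeq, 75X
pacbio = bases_for_coverage(GENOME_SIZE, 50)  # PacBio Sequel, 50X
total = hiseq + pacbio

print(f"HiSeq:  {hiseq / 1e9:.0f} Gb")
print(f"PacBio: {pacbio / 1e9:.0f} Gb")
print(f"Total:  {total / 1e9:.0f} Gb")

# Genome size at which 75X + 50X would add up to exactly 350 Gb
implied_genome_size = 350e9 / (75 + 50)
print(f"Implied genome size: {implied_genome_size / 1e9:.1f} Gb")
```

At exactly 3 Gb the two datasets add up to 375 Gb, so the quoted 350 Gb is consistent with a genome slightly under 3 Gb (about 2.8 Gb).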

Many thanks.

Assembly sequencing genome • 3.7k views

updated 8.6 years ago by shwethacm ▴ 240 • written 8.6 years ago by XC ▴ 20

0

With 50x PacBio data you should be able to assemble the genome from the long reads alone (provided the data is of good quality). According to PacBio's recommendations, that coverage should be enough for a good assembly. You can try assembling the HiSeq data independently and then see whether you can combine the two later.

Can you comment on what the Sequel data looks like? There is a dearth of real datasets for Sequel.

8.6 years ago by GenoMax 147k

0

Hi genomax2, thanks a lot for your quick answer. We have a similar workflow in mind. A tool that can assemble both data types at the same time would be great, because the short reads can correct errors in the long reads and make them more reliable.

We are still waiting for the Sequel data from the sequencer. Once we have it, we will try to comment.

Thank you again.

8.6 years ago by XC ▴ 20

1

FALCON is one option. I think it was used for the gorilla genome recently. There are plenty of other options on the Wiki page I linked in the previous post.
Since you are going to have plenty of PacBio data, you may not need to error-correct with Illumina reads (I can't find the post from Dr. Hall of PacBio right now, but will update if I do).
Is this a diploid genome?

8.6 years ago by GenoMax 147k

0

Is there any update on the Sequel data, in particular the quality and price?

7.9 years ago by pengchy ▴ 450

score 2 · Answer 1 · 2016-04-20

From my own experience with a diploid animal genome, error correction of PacBio reads with Illumina data takes time and resources. Proovread worked well for our data at lower coverage (15X PacBio, 20X HiSeq), but it demands many nodes: 1800 in our case, with a run time of 4 days per node. It is worth the wait, though, and the developer (Thomas Hackl) is very responsive.

CANU works well for a combined approach, from what I have heard.

At 50X PacBio you could go for self-error correction (PBcR) and then use Quiver to polish the genome with PacBio data alone. In the end, you could use the HiSeq data to finish the genome with the Pilon pipeline.
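To make the last step concrete, here is a rough Python sketch that only builds (and does not run) the shell commands for one round of Illumina polishing with Pilon: index the draft, map the HiSeq pairs with `bwa mem`, sort and index with `samtools`, then call Pilon. All file names, the thread count, the Java heap size and the helper name `pilon_polish_commands` are placeholders I made up. Check the flags against the versions you have installed.

```python
import shlex


def pilon_polish_commands(draft_fasta, reads_r1, reads_r2, threads=16,
                          pilon_jar="pilon.jar", java_heap="64G",
                          prefix="polished"):
    """Return shell command strings for one Pilon polishing round.

    Paths, thread count and heap size are illustrative defaults, not
    recommendations for a 3 Gb genome.
    """
    draft = shlex.quote(draft_fasta)
    r1, r2 = shlex.quote(reads_r1), shlex.quote(reads_r2)
    bam = shlex.quote(f"{prefix}.illumina.sorted.bam")
    return [
        # Build the BWA index for the draft assembly
        f"bwa index {draft}",
        # Map the paired HiSeq reads and coordinate-sort the alignments
        f"bwa mem -t {threads} {draft} {r1} {r2} | "
        f"samtools sort -@ {threads} -o {bam} -",
        # Index the sorted BAM so Pilon can read it
        f"samtools index {bam}",
        # Polish the draft using the paired-end alignments
        f"java -Xmx{java_heap} -jar {shlex.quote(pilon_jar)} "
        f"--genome {draft} --frags {bam} --output {shlex.quote(prefix)}",
    ]


if __name__ == "__main__":
    for cmd in pilon_polish_commands("draft.fasta", "R1.fastq.gz", "R2.fastq.gz"):
        print(cmd)
```

Printing the commands first makes it easy to review them or paste them into a cluster job script before committing compute time.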

score 1 · Answer 2 · 2016-04-19

8.6 years ago

Biomonika (Noolean) 3.2k

I have recently heard good reviews of CANU.

score 0 · Answer 3 · 2016-04-20

Falcon for de novo assembly + Quiver for base error correction is a good combination. I haven't tried the other approaches, but they constantly come up when we do assemblies.

You can also error-correct the PacBio reads with Illumina data and then assemble with Celera Assembler: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/pacBioToCA That said, I think the reverse (as Rohit mentioned: assemble first, then run Pilon) is the more popular choice.

Here's a question: do you have mate-pair data? Most de novo PacBio assemblers give you contigs, which you can place into scaffolds if you have mate pairs.