I have done a lot of de novo assembly with NGS data (Illumina NextSeq and MiSeq) and expect to only get a "pretty good" final assembly. With PacBio, however, I was under the impression that this improved greatly, yet I'm struggling to finalize assemblies.
So far I have tried the following assemblers:
- CANU
- HGAP4/Whatever the pbsmrtpipe de novo assembly pipeline is
- SOAPdenovo with hybrid mode (pacbio+illumina)
I generated my data from a multiplexed run on a PacBio Sequel machine and demultiplexed with lima.
Of the assemblies, the hybrid did the best. The overall assembly contained ~500 contigs and was twice the expected genome size. However, if I filtered out contigs <10,000 base pairs, I ended up with 80 contigs whose total length is extremely close to the expected genome size.
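For anyone wanting to reproduce the length filter described above, here is a minimal Python sketch. It assumes the assembly is a plain FASTA file; the function names and the 10,000 bp default are illustrative only and not taken from any of the assemblers mentioned.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=10_000):
    """Keep contigs that are at least min_len bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


def summarize(records):
    """Return (number of contigs, total assembled bases)."""
    lengths = [len(seq) for _, seq in records]
    return len(lengths), sum(lengths)
```

To use it, open the assembly (for example a hypothetical `assembly.fasta`), pass the file handle to `read_fasta`, and compare `summarize` before and after `filter_by_length` against the expected genome size.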
What do I do from here? I've tried circlator, which seems to only try to circularize the contigs themselves. My next step is to consider quickmerge to possibly finalize.
Has anyone else hit a similar stumbling block in trying to finish a genome using PacBio reads?
What is the organism? I suppose it is a bacterium, as you were trying circlator. It is really strange that the final assembly is twice the expected genome size; did you check for contaminants?
Not in depth, but I have performed Illumina sequencing on this same sample and there were no contaminants.
I think the problem here is that the genome is diploid or possibly polyploid. For diploid or polyploid genomes, the assembly can be larger than the haploid genome size (which is what OP wants), because divergent haplotypes get assembled as separate contigs.
OP, I think you can filter these out with genome-vs-genome alignments. The redundant haplotype contigs will show pretty high identity to each other, so you can identify and remove them.
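As a rough illustration of that filtering idea, here is a Python sketch that scans an all-vs-all contig alignment and flags shorter contigs that are mostly covered by a longer one at high identity. It assumes the alignment is in PAF format (the tab-separated format minimap2 can write); the thread does not name an aligner, so that is my assumption. The function name and both thresholds are hypothetical and would need tuning for a real genome, and the sketch only looks at single alignments rather than chaining several together.

```python
def candidate_redundant_contigs(paf_lines, min_identity=0.95, min_covered=0.8):
    """Map each flagged contig to (longer_contig, identity, covered_fraction).

    Thresholds are illustrative assumptions, not recommended values.
    """
    flagged = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        tname, tlen = fields[5], int(fields[6])
        matches, block_len = int(fields[9]), int(fields[10])
        if qname == tname or block_len == 0:
            continue
        # Only flag the shorter contig of a pair (ties broken by name).
        if (qlen, qname) > (tlen, tname):
            continue
        identity = matches / block_len
        covered = (qend - qstart) / qlen
        if identity >= min_identity and covered >= min_covered:
            best = flagged.get(qname)
            if best is None or identity > best[1]:
                flagged[qname] = (tname, identity, covered)
    return flagged
```

The flagged contigs are only candidates: I would check a few by eye (for example with a dot plot) before dropping them from the assembly.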
If you can post the parameters that you have used, we can probably suggest better settings for your assembly.