Hi,
I have been trying to assemble a genome that, based on flow cytometry, should be between 1.2 and 1.5 Gb. I have Illumina (more than 100x), PacBio (50x), 10x, and Hi-C reads. The best assembly we have so far is 450 Mb and 93 percent complete. The assembly seems to collapse despite having data from all the newest sequencing technologies. We have tried Canu, MaSuRCA, FALCON, Supernova, and miniasm. I should mention that flow cytometry suggests the genome is diploid, but bioinformatic analysis suggests both diploid and tetraploid. So my question is: does anyone have any idea why the genome assembly collapses, and is there any assembly software I might not be aware of that can handle all these different types of reads and produce a better assembly? Any suggestions and directions are greatly appreciated.
All the best, Pezhman Safdari
Genome sizing by flow cytometry is an approximation, as is ploidy. Also, the assembly depends on the genome's complexity (at the sequence level); even 100x coverage in all the technologies is definitely no guarantee of obtaining a complete genome sequence.
Polyploidy will certainly influence your assembly result. Have you analysed the data to see whether it might indeed be polyploid, and if so, how did you do it?
An interesting and quite straightforward approach to estimating genome size (and even polyploidy, to some extent) is to make k-mer frequency plots. A useful website for this is GenomeScope.
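To make the k-mer idea concrete, here is a minimal sketch of the back-of-the-envelope estimate behind such plots: genome size is roughly the total number of k-mers divided by the depth of the main peak. It assumes a two-column histogram (multiplicity, number of distinct k-mers), the layout that k-mer counters such as Jellyfish write; the function names and the `min_depth` cutoff are my own for illustration. This is not what GenomeScope does internally (it fits a model to the whole profile), so treat it as a sanity check only.

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into a {multiplicity: count} dict."""
    hist = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        hist[int(depth)] = int(count)
    return hist


def estimate_genome_size(hist, min_depth=2):
    """Rough estimate: total k-mers / depth of the tallest peak.

    min_depth is an assumed cutoff that drops low-multiplicity k-mers,
    which are mostly sequencing errors. Returns (size, peak_depth).
    """
    usable = {d: n for d, n in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the largest number of distinct k-mers is observed.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences (multiplicity x number of distinct k-mers).
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


if __name__ == "__main__":
    # Toy histogram, invented purely for illustration.
    toy = "1 500\n19 10\n20 100\n21 10\n"
    size, peak = estimate_genome_size(parse_histogram(toy))
    print(f"peak depth {peak}, estimated size {size:.0f} k-mers")
```

One caveat: in a highly heterozygous or polyploid genome the tallest peak may be the heterozygous one rather than the homozygous one, which would throw this simple estimate off by a factor of two or more. Seeing several peaks at multiples of a base depth is itself a hint about ploidy.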
If it turns out to be polyploid, have a look in the literature to see how other people have tackled this (e.g. the cotton, soybean, and wheat genomes). I know there is also software specifically for assembling highly heterozygous genomes; I think Platanus is one of them, and FALCON-Unzip (off the top of my head, to be honest).
What do you mean by 93 percent complete (BUSCO complete? BUSCO complete + fragmented)?
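In case it helps answer that, here is a small sketch that pulls the categories out of the one-line BUSCO summary (the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` string) so that "complete" and "complete + fragmented" can be reported separately. The function name is mine, and the numbers in the example line are made up for illustration, not taken from this assembly.

```python
import re

# Pattern for the one-line BUSCO summary string.
SUMMARY_RE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)


def parse_busco_summary(line):
    """Return the BUSCO percentages and the number of genes searched."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    complete, single, duplicated, fragmented, missing = map(
        float, match.groups()[:5]
    )
    return {
        "complete": complete,
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
        "n": int(match.group(6)),
    }


if __name__ == "__main__":
    # Hypothetical summary line, for illustration only.
    stats = parse_busco_summary("C:93.0%[S:88.0%,D:5.0%],F:3.0%,M:4.0%,n:1000")
    print("complete:", stats["complete"])
    print("complete + fragmented:", stats["complete"] + stats["fragmented"])
    print("duplicated:", stats["duplicated"])
```

The duplicated fraction is worth reporting too: a low value would be consistent with haplotypes or subgenomes having been collapsed into one copy, while a high value would suggest they were assembled separately.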
One of the best approaches nowadays is PacBio + Hi-C with FALCON-Phase. You can then correct the assembly with Illumina reads. Otherwise, did you try scaffolding your PacBio assembly with your 10x data?