Hello,
I have an insect genome to assemble (max size: 500 Mb) with Illumina data composed of paired-end and mate-pair libraries.
I'm thinking of using SOAPdenovo and SPAdes.
Do you have any recommendations for a better assembler for my data?
My suggestion would be to do some preliminary QC on the sequence data first, which may help dictate which assemblers you want to look into. Run a k-mer analysis to determine the actual level of coverage and the complexity of the data (you could use Jellyfish, khmer, or any of a whole slew of tools to generate this data). We also run preQC to get a more complete assessment.
This, plus what library types you have, normally helps dictate which assemblers may work best. If you have overlapping shotgun libraries and a genome with low heterozygosity, ALLPATHS-LG or DISCOVAR are great (with the latter you would need to scaffold with a separate tool). Which of the two to use depends on the length of the reads you have.
If the heterozygosity rate is pretty high you could give Platanus a go; we've had fairly reasonable luck with it on a few troublesome genomes. You can also use SOAPdenovo, though I believe it's now deprecated in favor of MEGAHIT (we haven't tried this one yet).
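To make the k-mer analysis a bit more concrete, here is a minimal sketch of how you might read a k-mer histogram and get a rough k-mer coverage and genome size estimate. It assumes a two-column "multiplicity count" text table, which is what `jellyfish histo` writes; the function names and the toy numbers are made up for illustration. The valley and peak detection is deliberately naive: on a highly heterozygous genome the tallest peak may be the heterozygous (half-depth) one, so always look at the plot yourself.

```python
def parse_histo(text):
    """Parse 'multiplicity count' lines into a dict {multiplicity: count}."""
    hist = {}
    for line in text.strip().splitlines():
        mult, count = line.split()
        hist[int(mult)] = int(count)
    return hist


def estimate_from_histo(hist):
    """Return (peak k-mer coverage, rough genome size in bp).

    Assumption: low-multiplicity k-mers before the first valley are
    sequencing errors and are ignored.
    """
    mults = sorted(hist)
    # Walk up from the lowest multiplicity until counts start rising again;
    # that point is taken as the error valley.
    valley = mults[0]
    for a, b in zip(mults, mults[1:]):
        if hist[b] > hist[a]:
            valley = a
            break
    solid = {m: c for m, c in hist.items() if m >= valley}
    # Peak = multiplicity with the most distinct k-mers past the valley.
    peak = max(solid, key=solid.get)
    # Total k-mer occurrences divided by the peak depth approximates genome size.
    total_kmers = sum(m * c for m, c in solid.items())
    return peak, total_kmers / peak


# Toy histogram, not real data.
toy = """
1 1000
2 200
3 50
4 80
5 150
6 200
7 150
8 80
9 30
"""
peak, size = estimate_from_histo(parse_histo(toy))
print(f"k-mer coverage peak: {peak}, estimated genome size: {size:.0f} bp")
```

Note that the peak is k-mer coverage, not base coverage; for reads of length L and k-mer size k, base coverage is roughly the peak multiplied by L / (L - k + 1).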
"Best assembler" is in the eye of the beholder. What are your requirements? Longest NG50? Most comprehensive gene coverage? Accurate resolution of heterozygosity? Best long range connectivity? Most reads remapping to your assembly?
There is no single best assembler, or single best metric for determining the best assembly. I recommend the Assemblathon 2 paper for its discussion of assembly evaluation, as well as challenges posed by heterozygosity, repetitive sequences, etc.
The vertebrate genomes assembled there were only 2X-3X larger than your insect genome (1.0-1.6 Gb vs 500 Mb), so the sizes are comparable. And no single assembler gives a consistently best NG50 across all data sets. That metric is strongly dependent upon the degree of heterozygosity and repetitive DNA, which varies by genome.
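In case the difference between N50 and NG50 is unclear, here is a small sketch. N50 is computed against the total assembly length, whereas NG50 is computed against an estimated genome size, which makes assemblies of different total length comparable. The function name `nx50` and the contig lengths are made up for illustration.

```python
def nx50(lengths, genome_size=None):
    """Return N50 (genome_size=None) or NG50 (genome_size given).

    The result is the length of the contig/scaffold at which the running
    sum, taken from longest to shortest, first reaches half of the
    reference total. Returns 0 if the assembly never reaches half of it.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


contigs = [100, 80, 60, 40, 20]  # toy lengths, not real data
print("N50: ", nx50(contigs))                    # relative to assembly length (300)
print("NG50:", nx50(contigs, genome_size=500))   # relative to estimated genome size
```

With these toy numbers the assembly is shorter than the assumed genome size, so NG50 comes out smaller than N50, which is exactly the case where N50 alone would flatter an incomplete assembly.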
Here is a recent paper discussing using DISCOVAR for insect assembly: http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2531-7
Might be helpful
There is no clear answer to that question, but you are advised to try several different assemblers; I would suggest ABySS and SOAPdenovo. After that, as suggested in the answer by @harold.smith.tarheel, you can compare N50 and/or align your reads back to each assembly to see how it behaves. If many reads do not align, you are probably missing some regions in your assembly. Since you have paired-end and mate-pair data, a low fraction of concordantly aligned reads suggests rearrangements in your assembly. You can also compare against a related species to see how your assembly looks.
You can also use tools such as REAPR (for de novo assemblies), misFinder (identifies mis-assemblies in an unbiased manner using a reference and paired-end reads), and QUAST.
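For the read remapping check, a rough sketch of what to count is below. It assumes you have already aligned the reads back to the assembly and pulled out the FLAG column of the SAM records (column 2) as integers; the function name and the example flags are made up. The bit values come from the SAM specification. Whether a pair is marked "proper" is decided by the aligner, and mate-pair libraries have a different expected orientation and insert size than paired-end ones, so the aligner settings need to match the library.

```python
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # aligner considers the pair concordant
UNMAPPED = 0x4        # read itself is unmapped
SECONDARY = 0x100     # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary (chimeric) alignment


def mapping_summary(flags):
    """Summarise mapped and properly-paired fractions from SAM FLAG values."""
    # Count each read once: skip secondary and supplementary records.
    primary = [f for f in flags if not f & (SECONDARY | SUPPLEMENTARY)]
    mapped = sum(1 for f in primary if not f & UNMAPPED)
    paired = [f for f in primary if f & PAIRED]
    proper = sum(1 for f in paired if f & PROPER_PAIR and not f & UNMAPPED)
    return {
        "reads": len(primary),
        "mapped_fraction": mapped / len(primary) if primary else 0.0,
        "proper_pair_fraction": proper / len(paired) if paired else 0.0,
    }


# Toy FLAG values, not real data.
example_flags = [99, 147, 99, 147, 73, 133, 77, 141, 355, 2147]
print(mapping_summary(example_flags))
```

A low mapped fraction points to missing sequence, while a decent mapped fraction with a low proper-pair fraction is more suggestive of misjoins; tools like REAPR do this kind of analysis far more carefully.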