Hello,
I need some advice. I have never done a genome assembly before. I have to do a de novo assembly of a large genome (2.5 Gb) from short Illumina paired-end reads of 150 bp.
I looked into the different assemblers, but none matches my needs; there is always one criterion that rules it out (for example, ABySS, ALLPATHS-LG and SOAPdenovo work with much shorter reads, while others like SPAdes do not work for genomes of this size).
Do you have an idea of which short-read de novo assembler I could use? Which would give the best results?
Best regards,
A. GUYOMARD
France, Lyon
Not Allpaths; it requires at least two libraries, one paired-end and one mate-pair (see B1 and B3 in https://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=215)
Actually, we use Allpaths routinely here with only one library. You can feed Allpaths the short library again instead of a long library, or you can assemble the short library with something like Velvet and generate synthetic LMP reads from the contigs, which is the approach we take. It seems silly, but it works and gives good results.
Thanks for your comment. Just curious: did you control for misassemblies?
When testing a new assembler or assembly method, we use data of known organisms and run the assembly through Quast, which counts misassemblies, to verify that the approach is valid.
QUAST with a reference genome is indeed a very good way to evaluate an assembly. If you had few or no misassemblies, then I'd be inclined to think it's fine.
I'd be curious to hear from Allpaths developers what they think of this usage of their tool.
Do you have a script to generate synthetic LMP reads from a contig file that you could share, please? :)
You can use the BBMap package for that:
They come out in "innie" orientation; you can use reformat.sh with the rcomp or rcompmate flag to transform them to a different orientation if you need to.
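For anyone who wants to see what "synthetic LMP reads from contigs" means in practice, here is a minimal Python sketch of the idea. It is not BBMap's implementation; the function names (`revcomp`, `synthetic_pairs`), the fixed insert size, and the error-free reads are all assumptions made for illustration. Each pair is cut from the two ends of a fragment sampled from a contig, in "innie" orientation by default; reverse-complementing both reads gives the "outie" orientation.

```python
import random

# Complement table for DNA, keeping N and lowercase bases intact.
_COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def synthetic_pairs(contigs, n_pairs, read_len=100, insert=3000,
                    seed=0, outie=False):
    """Yield (name, read1, read2) tuples sampled from contigs.

    contigs: dict mapping contig name -> sequence.
    insert:  outer distance spanned by each pair (fixed here; a real
             library would have a distribution of insert sizes).
    """
    if read_len > insert:
        raise ValueError("read_len must not exceed insert")
    rng = random.Random(seed)
    # Only contigs at least as long as the insert can host a pair.
    usable = [(n, s) for n, s in contigs.items() if len(s) >= insert]
    if not usable:
        raise ValueError("no contig is at least as long as the insert")
    # Weight each contig by its number of valid fragment start positions.
    weights = [len(s) - insert + 1 for _, s in usable]
    for i in range(n_pairs):
        name, seq = rng.choices(usable, weights=weights, k=1)[0]
        start = rng.randrange(len(seq) - insert + 1)
        frag = seq[start:start + insert]
        r1 = frag[:read_len]             # left end, forward strand
        r2 = revcomp(frag[-read_len:])   # right end, reverse strand: innie
        if outie:
            # Reverse-complementing both reads makes them point outward.
            r1, r2 = revcomp(r1), revcomp(r2)
        yield f"{name}_{start}_{i}", r1, r2
```

Note that pairs can never span a gap between contigs, so the synthetic library only restates linkage that the first assembly already contains, including any misassemblies in it.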
Is there a recommendation for which option works better: feeding the short library again vs. synthetic LMPs?
Or does it differ on a case-by-case basis, and if so, how does one determine which option might better serve one's genome assembly goals?
Also, I wonder whether ALLPATHS-LG, for a medium-sized, haploid eukaryotic genome (~50 Mb), has been empirically shown to be any better or worse than a5miseq, SPAdes, or ABySS. I'm comparing assemblers to pick one, but I've got to stop my comparative analyses and move on with the "chosen" one. Hence this question.
I hope you do not mind me tagging you two here: Rayan Chikhi, and Brian Bushnell. Thanks!
Actually, we don't do that anymore as far as I know :) I'm not sure if it's a good idea or not, or what the procedure was for validating that it did not lead to misassemblies (if any validation was performed). So if you do go that route, I suggest you validate it on genomes with finished references first.
We have extensively tested AllPaths versus other assemblers multiple times, but assembly results can be very version-specific and Spades especially has changed a lot since the last test.
Spades tends to be our best microbial assembler but I'm not sure how it does on fungi.