5.1 years ago
nsapoval
I am currently using abyss-pe unitigs to construct unitigs from a set of reads. I have validated that it performs well on a single genome and on reads simulated from a few genomes together. However, I am now trying to use it on a large metagenomic dataset, and I wonder whether anyone has validated ABySS on metagenomic data. Does the approach used by this software scale correctly to metagenomic data?
I'm an ABySS fan myself, but why do you intend to use it for this specific case? It wasn't really developed for it, and there is dedicated software for metagenome assembly.
I am looking for efficient ways to generate unitigs for further downstream processing. I am also considering metaSPAdes, which definitely supports metagenomes. However, when validating on synthetic data (1), ABySS seems to give the most reliable set of unitigs of all the tools I have tried. Hence, I was curious whether it retains this behavior on metagenomic data. Would you recommend any particular software that is known to be reliable for this task?
(1): I have created a few synthetic datasets, each a random backbone with standardized repeats inserted, and simulated reads from them. I then build unitigs and check whether every non-repeated region has been correctly resolved as a unitig, i.e. no unambiguous unitig has been split. ABySS tends to perform best on these tests. That's what I mean by reliability of unitig recovery. Since I clearly cannot do this kind of validation on a metagenomic read set, I want to use software that both behaves correctly on my synthetic data and has some guarantee of correct behavior on metagenomes.
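For anyone who wants to reproduce this kind of check, here is a minimal Python sketch of the idea, not the exact script used above. It assumes the non-repeated regions and the unitigs are already loaded as plain strings; the names split_regions, unique_regions and unitigs are hypothetical, and exact substring matching is a simplification (real unitigs may stop up to k-1 bases short of a repeat boundary, so a tolerance would likely be needed).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]


def split_regions(unique_regions, unitigs):
    """List the unique regions that no single unitig contains in full.

    A region counts as recovered if it occurs, on either strand,
    as an exact substring of at least one unitig (simplifying assumption).
    """
    broken = []
    for name, region in unique_regions.items():
        rc = reverse_complement(region)
        if not any(region in u or rc in u for u in unitigs):
            broken.append(name)
    return broken


# Toy data, made up for illustration only.
unique_regions = {"backbone_1": "ACGTTGCAAC", "backbone_2": "GGATCCTTAA"}
unitigs = ["TTACGTTGCAACGG", "GGATC", "CTTAA"]

# backbone_2 is reported because it was split across two unitigs.
print(split_regions(unique_regions, unitigs))
```

An empty list means every unambiguous region came back as part of one unitig, which is the behavior described above.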
Two good metagenome assemblers are metaSPAdes and MEGAHIT. metaSPAdes is better, but uses more resources and takes longer to run.
I would not recommend using ABySS for metagenomes. As lieven.sterck noted, it was developed specifically to assemble single genomes, and it probably makes some assumptions about the data that will not hold for metagenomes. Also, I would not trust simulation results as indicative of the best performers: biological data is a lot messier and noisier than simulated data.
I think you can use the data generated in https://www.nature.com/articles/s41467-018-05555-0 to test this. They spiked a metagenomic sample with the DNA of 86 known organisms (sequins). The N50 they got for the sequins was <3 kb. You can see whether ABySS does better or gets longer unitigs.
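If you want to compare N50 values yourself, a small Python sketch like the one below should do, assuming you have already extracted the unitig lengths (for example from the FASTA output). The function name n50 and the example lengths are made up for illustration; the definition used is the common one, the length of the shortest unitig among the longest ones that together cover at least half of the total assembled bases.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half
        # of all assembled bases is the N50.
        if running >= half:
            return length
    return 0


# Hypothetical unitig lengths in bp, not real results.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Keep in mind that N50 only measures contiguity, so a higher value does not by itself show that the unitigs are correct.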