7.3 years ago
anin.gregory
I have 7 billion paired-end reads from multiple microbiome studies that I want to cross-assemble using metaSPAdes.
Background:
- I need to use metaSPAdes
- I have access to a 1.5TB memory node, where it can run almost indefinitely, but I have a deadline of October for the assembly to be done
- All reads have been error-corrected using bbnorm.sh
- I started a cross-assembly on the 1.5TB node using the '--only-assembler' flag; it has been running for 3 weeks
For the last 1.5 weeks the assembly has been stuck on the 'post-simplification step' of 'Running Disconnecting edges with relatively low coverage'. I have looked on the SPAdes website and various forums to see whether this step is slow for others, but could not find any discussion of it. Has anyone else had this problem? Does anyone have any tips to speed up the assembly?
Thanks!
My recommendation would be to use Megahit in this case; it is much less resource-intensive than SPAdes.
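As a rough illustration of what a Megahit run over several paired-end libraries might look like, here is a minimal Python sketch that only builds and prints the command line; it does not run the assembler. The file names, output directory, and thread count are hypothetical placeholders, and the `meta-large` preset is an assumption about what suits a large, complex metagenome, not a tested recommendation for this dataset.

```python
import shlex


def megahit_command(r1_files, r2_files, out_dir, threads=32, preset="meta-large"):
    """Build a MEGAHIT argument list for one or more paired-end libraries.

    MEGAHIT takes comma-separated file lists for -1 and -2, so reads from
    several studies can be passed to a single co-assembly.
    """
    if len(r1_files) != len(r2_files):
        raise ValueError("each R1 file needs a matching R2 file")
    return [
        "megahit",
        "-1", ",".join(r1_files),   # forward reads, one file per library
        "-2", ",".join(r2_files),   # reverse reads, same order as -1
        "--presets", preset,        # assumed preset for large metagenomes
        "-t", str(threads),         # number of CPU threads (placeholder value)
        "-o", out_dir,              # output directory (must not already exist)
    ]


# Hypothetical input files from two studies.
cmd = megahit_command(
    ["studyA_R1.fq.gz", "studyB_R1.fq.gz"],
    ["studyA_R2.fq.gz", "studyB_R2.fq.gz"],
    "megahit_out",
)
print(shlex.join(cmd))
```

The printed line can be checked by eye and then pasted into a job script, which keeps mistakes in the file lists from costing a long queue wait.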
If you download the latest version of BBMap, there is now a file at:
That shows my suggested method of preprocessing data prior to assembly. It includes various trimming, filtering, and error-correction operations to minimize the number of erroneous kmers that increase the time and memory consumption of large metagenome assemblies, so it may be helpful in this case.
Did you also normalize the reads?
No. Some benchmarking we did in our lab showed that normalization reduces our contig lengths, because SPAdes uses differential coverage to resolve ambiguities.