I need to assemble a large metagenomics dataset from Illumina NextSeq reads. My read depth is approximately 20 million reads per sample (28 samples), and the concatenated R1 and R2 files are 130 GB each. I'm running with 64 threads and it's still not enough.
I've been using metaSPAdes, which has been doing a great job. This is the command I ran:
python /usr/local/packages/spades-3.9.0/bin/metaspades.py -t 64 -m 1000 -1 ./paired_1.fastq -2 ./paired_2.fastq -o . > spades.log
It crashed and here's the end of the output log:
==> spades.log <==
576G / 944G INFO General (distance_estimation.cpp : 226) Processing library #0
576G / 944G INFO General (distance_estimation.cpp : 132) Weight Filter Done
576G / 944G INFO DistanceEstimator (distance_estimation.hpp : 185) Using SIMPLE distance estimator
<jemalloc>: Error in malloc(): out of memory. Requested: 256, active: 933731762176
It's obviously a memory issue. Has anyone had any success with (1) another assembler, (2) a method to collapse the data beforehand, or (3) a data-processing step that could still give unbiased assemblies?
I do not want to assemble in stages because it is difficult to merge the staged results back into a single dataset.
We thought about randomly subsampling R1 and R2 read pairs, but is there another method?
This method seems interesting for doing unsupervised clustering of the reads beforehand, but I haven't seen any application-based implementations.
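On the random-selection idea above, one minimal sketch (not something we have actually tried) would be to subsample both files with seqtk sample using the same seed, so that R1/R2 pairing is preserved; the tool choice, seed, fraction, and output names here are just placeholders:

# keep ~25% of read pairs; the same -s seed on both files keeps R1 and R2 in sync
seqtk sample -s 42 paired_1.fastq 0.25 > sub_1.fastq
seqtk sample -s 42 paired_2.fastq 0.25 > sub_2.fastq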
The SPAdes manual says this about metaSPAdes:
There might be a chance to reduce the data by removing duplicate reads with FastUniq, but if you don't have much duplication, it won't help much. Pre-filtering low-quality reads might also help. Another option to reduce the data is normalisation by coverage (not recommended for metagenomes). I usually use BBNorm, but this handles only single-end reads.
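For example, a rough FastUniq invocation for one pair of files could look like the sketch below; the list-file and output names are placeholders, so check the exact flags against the FastUniq documentation:

# FastUniq reads its inputs from a list file, one FASTQ path per line
printf 'paired_1.fastq\npaired_2.fastq\n' > reads.list
# -t q should write deduplicated FASTQ to two output files (-o for R1, -p for R2)
fastuniq -i reads.list -t q -o dedup_1.fastq -p dedup_2.fastq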
Currently BBNorm (v38.84) seems to take paired reads as input and generate interleaved data as output.
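For example, a minimal sketch of a paired run, assuming the usual BBTools twin-file parameters (in2/out2); the file names and the target/min values are only placeholders:

bbnorm.sh in=paired_1.fastq in2=paired_2.fastq out=norm_1.fastq out2=norm_2.fastq target=40 min=2 threads=64
# if your BBNorm version only writes a single interleaved file, reformat.sh can split it back into two files:
reformat.sh in=norm_interleaved.fastq out=norm_1.fastq out2=norm_2.fastq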