I need to assemble a large metagenomics dataset from Illumina NextSeq reads. My read depth is approximately 20 million reads per sample (28 samples), and the concatenated R1 and R2 files are 130 GB each. I'm running with 64 threads and it's still not enough.
I've been using metaSPAdes, which has been doing a great job. This is the command I ran:
python /usr/local/packages/spades-3.9.0/bin/metaspades.py -t 64 -m 1000 -1 ./paired_1.fastq -2 ./paired_2.fastq -o . > spades.log
It crashed and here's the end of the output log:
==> spades.log <==
576G / 944G INFO General (distance_estimation.cpp : 226) Processing library #0
576G / 944G INFO General (distance_estimation.cpp : 132) Weight Filter Done
576G / 944G INFO DistanceEstimator (distance_estimation.hpp : 185) Using SIMPLE distance estimator
<jemalloc>: Error in malloc(): out of memory. Requested: 256, active: 933731762176
It's obviously a memory issue. Has anyone had any success with (1) another assembler, (2) a method to collapse the data beforehand, or (3) a data-processing step that could still give unbiased assemblies?
I do not want to assemble in stages because it is difficult to merge the staged results back into a single dataset.
We thought about randomly subsampling R1 and R2 read pairs, but is there another method?
This method seems interesting for doing unsupervised clustering of the reads beforehand, but I haven't seen any application-based implementations.
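On the random-selection idea above, one minimal sketch (not something we have actually tried) would be to subsample both files with seqtk sample using the same seed, so that R1/R2 pairing is preserved; the tool choice, seed, fraction, and output names here are just placeholders:

# keep ~25% of read pairs; the same -s seed on both files keeps R1 and R2 in sync
seqtk sample -s 42 paired_1.fastq 0.25 > sub_1.fastq
seqtk sample -s 42 paired_2.fastq 0.25 > sub_2.fastq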
The SPAdes manual says this about metaSPAdes:
There might be a chance to reduce the data by removing duplicate reads with FastUniq, but if you don't have much duplication, it won't help much. Pre-filtering low-quality reads might also help. Another option to reduce the data is normalisation by coverage (not recommended for metagenomes). I usually use BBNorm, but this handles only single-end reads.
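For example, a rough FastUniq invocation for one pair of files could look like the sketch below; the list-file and output names are placeholders, so check the exact flags against the FastUniq documentation:

# FastUniq reads its inputs from a list file, one FASTQ path per line
printf 'paired_1.fastq\npaired_2.fastq\n' > reads.list
# -t q should write deduplicated FASTQ to two output files (-o for R1, -p for R2)
fastuniq -i reads.list -t q -o dedup_1.fastq -p dedup_2.fastq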
Currently BBNorm (v38.84) seems to take paired reads as input and generate interleaved data as output.
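For example, a minimal sketch of a paired run, assuming the usual BBTools twin-file parameters (in2/out2); the file names and the target/min values are only placeholders:

bbnorm.sh in=paired_1.fastq in2=paired_2.fastq out=norm_1.fastq out2=norm_2.fastq target=40 min=2 threads=64
# if your BBNorm version only writes a single interleaved file, reformat.sh can split it back into two files:
reformat.sh in=norm_interleaved.fastq out=norm_1.fastq out2=norm_2.fastq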