Question

SPAdes is running out of memory


7.3 years ago

Lina F ▴ 200

Hi all,

I tried running a hybrid assembly for a fungal genome. I have one set of paired-end (PE) Illumina reads and a MinION dataset. The genome should be about 45 Mb when complete. I ran the assembly on a machine with 35 cores and 250 GB of RAM.

I am running SPAdes v3.10.1, and it errored out with error code "-6" and the message:

<jemalloc>: Error in malloc(): out of memory. Requested: 94287658224, active: 32942063616

In the manual it states: "SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. If you set memory limit manually, SPAdes will use smaller buffers and thus less RAM."

For this run, I specified 35 threads: 35 * 512 MB = 17.5 GB, or about 18 GB.

However, my machine has 250 GB of RAM, so this should be well within its limits?
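
The buffer estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic implied by the manual's "512 Mb per thread" statement; the function name is made up for illustration, and the result covers buffers only, not the total memory SPAdes needs (the log below reports a peak well above this figure).

```python
MB_PER_THREAD = 512  # per-thread buffer size quoted from the SPAdes manual


def buffer_estimate_gb(threads: int, mb_per_thread: int = MB_PER_THREAD) -> float:
    """Return the combined buffer size in GB, taking 1 GB = 1024 MB."""
    return threads * mb_per_thread / 1024


estimate = buffer_estimate_gb(35)
print(f"{estimate:.1f} GB of buffers for 35 threads")  # prints 17.5 GB
```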

Thanks for any suggestions on how to modify my approach!

The log is below:

Command line: 
/home/lina/SPAdes-3.10.1-Linux/bin/spades.py \
  -1 /lina/analysis/nanopore/t111680/illumina/Sample_3701022/3701022_S3_R1_001_paired.fastq.gz \
  -2 /lina/analysis/nanopore/t111680/illumina/Sample_3701022/3701022_S3_R2_001_paired.fastq.gz \
  --nanopore /lina/analysis/nanopore/t111680/1d2/1d2.fastq \
  --threads 35 -o /lina/analysis/nanopore/t111680/spades_out    

System information:
  SPAdes version: 3.10.1
  Python version: 2.7.12
  OS: Linux-4.4.0-83-generic-x86_64-with-Ubuntu-16.04-xenial

Output dir: /lina/analysis/nanopore/t111680/spades_out
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/lina/analysis/nanopore/t111680/illumina/Sample_3701022/3701022_S3_R1_001_paired.fastq.gz']
      right reads: ['/lina/analysis/nanopore/t111680/illumina/Sample_3701022/3701022_S3_R2_001_paired.fastq.gz']
      interlaced reads: not specified
      single reads: not specified
    Library number: 2, library type: nanopore
      left reads: not specified
      right reads: not specified
      interlaced reads: not specified
      single reads: ['/lina/analysis/nanopore/t111680/1d2/1d2.fastq']
Read error correction parameters:
  Iterations: 1
  PHRED offset will be auto-detected
  Corrected reads will be compressed (with gzip)
Assembly parameters:
  k: automatic selection based on read length
  Repeat resolution is enabled
  Mismatch careful mode is turned OFF
  MismatchCorrector will be SKIPPED
  Coverage cutoff is turned OFF
Other parameters:
  Dir for temp files: /lina/analysis/nanopore/t111680/spades_out/tmp
  Threads: 35
  Memory limit (in Gb): 250


======= SPAdes pipeline started. Log can be found here: /lina/analysis/nanopore/t111680/spades_out/spades.log


===== Read error correction started. 


== Running read error correction tool: /home/lina/SPAdes-3.10.1-Linux/bin/hammer /lina/analysis/nanopore/t111680/spades_out/corrected/configs/config.info

  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  83)   Starting BayesHammer, built from N/A, git revision N/A
  0:00:00.019     4M / 4M    INFO    General                 (main.cpp                  :  84)   Loading config from /lina/analysis/nanopore/t111680/spades_out/corrected/configs/config.info
  0:00:00.022     4M / 4M    INFO    General                 (memory_limit.hpp          :  47)   Memory limit set to 250 Gb
  0:00:00.022     4M / 4M    INFO    General                 (main.cpp                  :  93)   Trying to determine PHRED offset
  0:00:00.023     4M / 4M    INFO    General                 (main.cpp                  :  99)   Determined value is 33
  0:00:00.024     4M / 4M    INFO    General                 (hammer_tools.cpp          :  36)   Hamming graph threshold tau=1, k=21, subkmer positions = [ 0 10 ]
  0:00:00.024     4M / 4M    INFO    General                 (main.cpp                  : 120)   Size of aux. kmer data 24 bytes
     === ITERATION 0 begins ===
  0:00:00.026     4M / 4M    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 428)   Building kmer index
  0:00:00.026     4M / 4M    INFO   K-mer Splitting          (kmer_data.cpp             :  91)   Splitting kmer instances into 560 buckets. This might take a while.
  0:00:00.026     4M / 4M    INFO    General                 (file_limit.hpp            :  30)   Open file limit set to 1024
  0:00:00.026     4M / 4M    INFO    General                 (kmer_index_builder.hpp    : 108)   Memory available for splitting buffers: 2.38092 Gb
  0:00:00.026     4M / 4M    INFO    General                 (kmer_index_builder.hpp    : 116)   Using cell size of 119837
  0:01:15.991    19G / 19G   INFO   K-mer Splitting          (kmer_data.cpp             :  98)   Processing /lina/analysis/nanopore/t111680/illumina/Sample_3701022/3701022_S3_R1_001_paired.fastq.gz
  0:02:30.115    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 8671797 reads
  0:03:52.462    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 17757244 reads
  0:05:14.826    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 26847635 reads
  0:06:38.972    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 35937553 reads
  0:08:08.889    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 45073098 reads
  0:09:40.280    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 54072450 reads
  0:11:06.362    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 63129760 reads
  0:12:31.079    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 72145772 reads
  0:14:03.454    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 81288864 reads
  0:15:29.876    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 90331193 reads
  0:16:59.401    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 99462762 reads
  0:18:25.410    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 108476084 reads
  0:22:48.227    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 135849570 reads
  0:23:35.285    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             :  98)   Processing /lina/analysis/nanopore/t111680/illumina/Sample_3701022/3701022_S3_R2_001_paired.fastq.gz
  0:50:39.101    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 108)   Processed 271916466 reads
  0:54:06.993    19G / 20G   INFO   K-mer Splitting          (kmer_data.cpp             : 113)   Total 278480422 reads processed
  0:54:08.662   140M / 20G   INFO    General                 (kmer_index_builder.hpp    : 252)   Starting k-mer counting.
  1:04:54.515   140M / 20G   INFO    General                 (kmer_index_builder.hpp    : 258)   K-mer counting done. There are 3928652426 kmers in total.
  1:04:54.515   140M / 20G   INFO    General                 (kmer_index_builder.hpp    : 260)   Merging temporary buckets.
  1:11:48.688   140M / 20G   INFO   K-mer Index Building     (kmer_index_builder.hpp    : 437)   Building perfect hash indices
  1:11:48.688   140M / 20G   WARN   K-mer Index Building     (kmer_index_builder.hpp    : 451)   Number of threads was limited down to 24 in order to fit the memory limits during the index construction
  1:15:39.878     1G / 98G   INFO    General                 (kmer_index_builder.hpp    : 276)   Merging final buckets.
  1:20:35.719     1G / 98G   INFO   K-mer Index Building     (kmer_index_builder.hpp    : 483)   Index built. Total 1283564560 bytes occupied (2.61375 bits per kmer).
  1:20:35.724     1G / 98G   INFO   K-mer Counting           (kmer_data.cpp             : 359)   Arranging kmers in hash map order
  1:29:42.975    59G / 98G   INFO    General                 (main.cpp                  : 155)   Clustering Hamming graph.
  2:52:58.884    59G / 98G   INFO    General                 (main.cpp                  : 162)   Extracting clusters
  3:51:36.260    59G / 129G  INFO    General                 (main.cpp                  : 174)   Clustering done. Total clusters: 1071603427
  3:51:39.094    30G / 129G  INFO   K-mer Counting           (kmer_data.cpp             : 381)   Collecting K-mer information, this takes a while.
<jemalloc>: Error in malloc(): out of memory. Requested: 94287658224, active: 32942063616


== Error ==  system call for: "['/home/lina/SPAdes-3.10.1-Linux/bin/hammer', '/lina/analysis/nanopore/t111680/spades_out/corrected/configs/config.info']" finished abnormally, err code: -6

======= SPAdes pipeline finished abnormally and WITH WARNINGS!

=== Error correction and assembling warnings:
 * 1:11:48.688   140M / 20G   WARN   K-mer Index Building     (kmer_index_builder.hpp    : 451)   Number of threads was limited down to 24 in order to fit the memory limits during the index construction
======= Warnings saved to /lina/analysis/nanopore/t111680/spades_out/warnings.log

=== ERRORs:
 * system call for: "['/home/lina/SPAdes-3.10.1-Linux/bin/hammer', '/lina/analysis/nanopore/t111680/spades_out/corrected/configs/config.info']" finished abnormally, err code: -6

In case you have troubles running SPAdes, you can write to spades.support@cab.spbu.ru
Please provide us with params.txt and spades.log files from the output directory.

spades hybrid assembly fungi memory • 17k views

ADD COMMENT • link updated 4.3 years ago by carisdak • 0 • written 7.3 years ago by Lina F ▴ 200


Not sure what unit that 94287658224 is in, but even if it is kilobytes it is still a lot more RAM than you have available.

ADD REPLY • link 7.3 years ago by GenoMax 147k


Yes, that number is very large! I am not entirely sure what the units are, but the math doesn't seem to work out either way :-/

ADD REPLY • link 7.3 years ago by Lina F ▴ 200


It's actually not that large. It's probably in bytes, which would correspond to about 88 GiB (roughly 94 GB). High-memory cloud servers or HPC nodes can handle that.
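
For anyone who wants to check the conversion, here is a small sketch that parses the jemalloc message from the question and converts both numbers. It assumes the values are bytes, which the message itself does not state; the helper names are invented for this example.

```python
import re

# Error line copied from the log in the question.
LOG_LINE = (
    "<jemalloc>: Error in malloc(): out of memory. "
    "Requested: 94287658224, active: 32942063616"
)


def parse_jemalloc_oom(line: str) -> dict:
    """Extract the 'Requested' and 'active' counts from the message."""
    match = re.search(r"Requested: (\d+), active: (\d+)", line)
    if match is None:
        raise ValueError("not a jemalloc out-of-memory message")
    return {"requested": int(match.group(1)), "active": int(match.group(2))}


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes), assuming the input is in bytes."""
    return n_bytes / 2**30


sizes = parse_jemalloc_oom(LOG_LINE)
for name, n_bytes in sizes.items():
    print(f"{name}: {to_gib(n_bytes):.1f} GiB ({n_bytes / 1e9:.1f} GB)")
```

Under the bytes assumption, the "active" value comes out at roughly 30.7 GiB, which is at least consistent with the "30G" shown in the last log line before the crash.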

ADD REPLY • link 4.3 years ago by carisdak • 0

score 1 · Answer 1 · 2017-08-11


7.3 years ago

h.mon 35k

From my experience, I don't think SPAdes scales efficiently beyond about 4-8 threads, so I cap --threads at 8.

What are the sizes (and predicted coverage) of your Illumina fastq files? Maybe you have too much sequencing data; try digital normalization or simple down-sampling.
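
To illustrate what "simple down-sampling" means for paired reads, here is a rough Python sketch. It keeps each read pair with a fixed probability so that R1 and R2 stay in sync. Note that this is uniform random sampling, not digital normalization, which selects reads based on k-mer coverage. The function names are made up for this example, and for real data a dedicated tool is likely the better choice.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, ...]  # the four lines of one FASTQ record


def fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group an iterable of lines into 4-line FASTQ records."""
    it = iter(lines)
    while True:
        chunk = tuple(islice(it, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        yield chunk


def subsample_pairs(
    r1_lines: Iterable[str],
    r2_lines: Iterable[str],
    fraction: float,
    seed: int = 0,
) -> Iterator[Tuple[Record, Record]]:
    """Yield (R1, R2) pairs, keeping each pair with probability `fraction`.

    Both files are read in lockstep, so mates are kept or dropped together.
    zip() stops at the shorter input, so check beforehand that both files
    hold the same number of records.
    """
    rng = random.Random(seed)
    for rec1, rec2 in zip(fastq_records(r1_lines), fastq_records(r2_lines)):
        if rng.random() < fraction:
            yield rec1, rec2


# Tiny in-memory demonstration with synthetic reads.
demo_r1 = [l for i in range(1000) for l in (f"@read{i}/1", "ACGT", "+", "IIII")]
demo_r2 = [l for i in range(1000) for l in (f"@read{i}/2", "TGCA", "+", "IIII")]
kept = list(subsample_pairs(demo_r1, demo_r2, fraction=0.1, seed=42))
print(f"kept {len(kept)} of 1000 pairs")
```

With gzipped input, the same functions should work on file handles opened with `gzip.open(path, "rt")`, provided the lines are stripped or written back out unchanged.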

edit: The SPAdes authors are really helpful and reply quickly when contacted for help.

ADD COMMENT • link 7.3 years ago by h.mon 35k


Thanks for the suggestions!

My Illumina data consists of 139,240,211 read pairs, and I have 100,661 reads from the MinION. I am not sure about predicted coverage -- do you have a suggestion on how to calculate that?

I will try capping SPAdes at 8 threads and will also take a look at downsampling my input data.

I have reached out to the SPAdes authors but am still waiting on a reply.

ADD REPLY • link 7.3 years ago by Lina F ▴ 200


Off the top of my head, coverage is calculated as (read length * number of reads) / genome size. In your case (assuming 2x100bp HiSeq 2500 and a 45 Mb genome): (200 * 139240211) / 45000000 ≈ 619x coverage. Use BBNorm (guide and examples) from BBTools to target 100x coverage.
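
The same calculation as a small Python sketch, using the read-pair count and the ~45 Mb genome size given in the question. Read length is a parameter, so other chemistries can be tried as well; the function names are just for illustration.

```python
def coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Expected coverage = total sequenced bases / genome size."""
    return read_pairs * 2 * read_length / genome_size


def pairs_for_target(target_cov: float, read_length: int, genome_size: int) -> int:
    """Number of read pairs needed to reach `target_cov` on average."""
    return round(target_cov * genome_size / (2 * read_length))


READ_PAIRS = 139_240_211   # from the question
GENOME_SIZE = 45_000_000   # approximate expected genome size, from the question

for length in (100, 150):
    cov = coverage(READ_PAIRS, length, GENOME_SIZE)
    print(f"2x{length} bp: {cov:.0f}x")

# Read pairs corresponding to 100x with 2x150 bp reads.
print(pairs_for_target(100, 150, GENOME_SIZE))
```

With 2x100 bp reads this gives about 619x, and with 2x150 bp reads about 928x; 100x at 2x150 bp corresponds to roughly 15 million read pairs.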

ADD REPLY • link 7.3 years ago by h.mon 35k


2x150bp from a NextSeq... I just started running bbnorm ;-)

ADD REPLY • link 7.3 years ago by Lina F ▴ 200


You should probably clean out edge duplicates too (see C: Duplicates on Illumina).

ADD REPLY • link 7.3 years ago by GenoMax 147k


Had not seen this -- thanks for sharing!

ADD REPLY • link 7.3 years ago by Lina F ▴ 200


I checked my Illumina data, and this does not seem to be a problem with this sample. However, I will spot-check samples from some other runs and from our other NextSeq machine as well.

ADD REPLY • link 7.3 years ago by Lina F ▴ 200


FYI: I was able to run BBNorm using the example you linked to and reduce the input Illumina dataset to about 25 million read pairs. SPAdes was then able to assemble the data (using eight threads).

Thanks again for the feedback and suggestions, it was very helpful!

ADD REPLY • link 7.3 years ago by Lina F ▴ 200