Hi guys, I'm having some problems assembling a 2 x 250 bp, 76x coverage data set using Minia 2.0.3:
[DSK: Collecting stats on read sample ] 100 % elapsed: 5 min 59 sec estimated remaining: 0 min 0 sec cpu: 297.9 % mem: [ 844, 844, 844] MB
[DSK: Pass 1/1, Step 2: counting kmers ] 50.3 % elapsed: 115 min 10 sec estimated remaining: 113 min 36 sec cpu: 404.8 % mem: [6337, 6430, 6430] MB Warning: forced to allocate extra memory: 14650 MB
EXCEPTION: Pool allocation failed for 1682 bytes (bank ids alloc). Current usage is 15362380148 and capacity is 15362381814
Or, sometimes it fails with this exception:
EXCEPTION: Pool allocation failed for 2808456 bytes (kmers alloc)
I ran KmerGenie (1.7016) and was surprised that it recommended a coverage cut-off of 1 at its best k of 64. As your manual recommends, I instead used a cut-off of 2, and also tried higher thresholds (3, 4, 10, even 100 and above). Unfortunately I kept getting this error. The machines I've been using have > 1.5 TB RAM, so I wouldn't expect to be running out.
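For reference, this is roughly how I invoked KmerGenie, on the same list file I pass to minia below (option names are taken from the KmerGenie help and may differ between versions):
# scan k from 21 to 121 in steps of 10, using 16 threads
kmergenie read-files.txt -l 21 -k 121 -s 10 -t 16 -o kmergenie_report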
I'm running minia like so:
minia -in read-files.txt -abundance-min 4 -kmer-size 64 -nb-cores 32 -max-memory 0
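where read-files.txt just lists the two gzipped read files, one path per line (placeholder names; as far as I know GATB tools accept either such a list file or a comma-separated list after -in):
cat > read-files.txt <<'EOF'
reads_R1.fastq.gz
reads_R2.fastq.gz
EOF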
Thanks in advance!
Hi, I'm not sure about this "-max-memory 0". Could you perhaps try a higher memory setting, e.g. "-max-memory 20000"?
Hi, I've tried values up to 2200000 for -max-memory, and at the top end I still get a similar error and a pool allocation exception.
Is it possible that it's not utilizing all of the memory specified in -max-memory? What is the significance of the three values noted after "mem:" in the log?
The three values are:
1. current memory usage as measured by the system,
2. the maximum of the values ever seen in field 1,
3. the maximum memory usage as measured by the system (ru_maxrss).
It's possible that not all memory is used.
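If you want to cross-check minia's figures independently, GNU time reports the same peak value; a rough sketch, reusing your command from above with a higher memory cap:
# "Maximum resident set size (kbytes)" in the -v output is the same ru_maxrss value
/usr/bin/time -v minia -in read-files.txt -abundance-min 4 -kmer-size 64 -nb-cores 32 -max-memory 20000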
Until we release a new official minia version, could you try this unreleased beta version of Minia 3.0? It's a Linux 64-bit binary. https://github.com/GATB/gatb-pipeline/raw/master/minia/minia
Thanks for that beta, Rayan. It was able to assemble my data at k=64, and reported a max memory usage of 872 GB. Are there any parameters I could tweak in 2.0.3 to get around this memory issue?
Hi, it is quite unusual to see such a large memory usage; I wonder what is special about your data. Can you please tell me the number of files and the total size of the read dataset files, as well as the number of distinct kmers reported by Minia 3 (ideally with the full log of output stats at the end)?
Hi Rayan, I have one set of PET reads, so two gzipped read files, 59 GB and 63 GB. Minia 3 reports 3045237911 solid kmers.
Full output stats from the run:
Thanks for your ongoing help.
When I set -max-memory to 0 and -abundance-min to "auto", minia reports a peak memory of 24.9 GB.
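For completeness, that run was essentially the following (same list file as before; I'm quoting the options from memory):
minia -in read-files.txt -kmer-size 64 -abundance-min auto -max-memory 0 -nb-cores 32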
Hi, thanks for those details. 24.9 GB seems more in line with what Minia 3 typically uses.
Here is what I think was going on:
Minia version 2 has the "Pool allocation failed" bug, which will be fixed in version 3. I told you to try a higher memory limit, but that didn't turn out to be a valid workaround.
Minia version 3 seems to have completed the assembly just fine (in 11 hours) with default parameters. (Were you happy with the resulting contig quality, by the way?)
When you set a high memory limit in Minia 3 (or even 2), the k-mer counting step uses all this memory just because it thinks it can. But it's not necessary to specify -max-memory in Minia 3, except perhaps for very large genomes (> 5 Gbp).
Hi, the contigs were a bit shorter than I was hoping for: my N50 was about 3 kb. I tried several k up to 128, where my N50 reached 4.2 kb. Anything larger than k=128 failed with this error:
I ran into this error with Minia 2 as well, even when I compiled it to support higher k according to the instructions in the manual (and this post). Is there a detail I'm missing?
About the contig quality, would you recommend a particular way to assess this?
Also, is it alright for me to use these results in a conference?
Thanks again, Austin
You have a point here: for kmer sizes >= 128, the default algorithm for one part of the de Bruijn graph construction (cascading Bloom filters) can't work.
In such a case, a specific option has to be used, i.e. you should add "-debloom original" to your minia command line (the consequence is a bigger memory peak). Could you confirm whether it works on your example?
As a matter of fact, we need to fix this so that Minia automatically falls back to this alternative algorithm as soon as the kmer size is >= 128.
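For example, the command line would look roughly like this (the kmer size and abundance values here are just placeholders for a k >= 128 run):
# "-debloom original" switches off the cascading Bloom filter debloom step
minia -in read-files.txt -kmer-size 160 -abundance-min 2 -debloom original -nb-cores 32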
Hi Erwan, Austin,
I've implemented "-debloom original" in Minia for k > 128. The change is now effective if you compile the source from GitHub, and it will be included in the next release of Minia 3.
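A sketch of the build, assuming the Minia sources from GitHub with the usual CMake procedure; I believe the bundled gatb-core takes a KSIZE_LIST variable, and the values below are just one way to enable k up to 256:
git clone --recursive https://github.com/GATB/minia
cd minia && mkdir build && cd build
# compile kmer-size specializations up to 256 so that k > 128 is available
cmake -DKSIZE_LIST="32 64 96 128 160 192 224 256" ..
make -j8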
Austin, sure, you can use those results in a conference. Thanks for checking with us.
Regarding assessment of contig quality, I recommend the QUAST software, and taking NG50s instead of N50s. In the absence of a reference it isn't easy to evaluate an assembly; one approach is FRCbam.
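Something along these lines, with placeholder file names (the -r reference is what gives you NG50 rather than plain N50):
# with -r, QUAST reports NG50 and misassembly counts; without it you still get N50/L50
quast.py contigs.fa -r reference.fa -o quast_results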
Hi guys, using "-debloom original" with the binary you provided did the trick, and I'll check out those analysis tools. Thanks for your help! Austin