vg autoindex memory issues: Signal 6 has occurred
l.dammer • 5 months ago

Hi everyone,

I'm trying to run vg autoindex in order to later map my data to the genome graph provided by the HPRC. After some initial issues (see here), the command now crashes after roughly 4 h. The error messages/stack traces are similar but not identical (even when I reran some commands to reproduce the errors), but every run crashes after around 4 h with either a segmentation fault, Signal 6, or Signal 7, and a note to contact the vg team for help. Could you help with this here on Biostars, or would you prefer that I file a separate issue on GitHub detailing every step I tried plus the error messages I got, since this appears to be vg related? An example of such a message/stack trace is shown below (I took the shortest example for simplicity, but I could provide others that are longer, or the error logs as files, if that would fit better on GitHub):

Crash report for vg v1.56.0 'Collalto'
Stack trace (most recent call last) in thread 164370:
#14   Object '', at 0xffffffffffffffff, in
Crash report for vg v1.56.0 'Collalto'
Stack trace (most recent call last) in thread 164361:
#14   Object '', at 0xffffffffffffffff, in
/var/tmp/slurmd.spool/job2174553/slurm_script: line 13: 163354 Segmentation fault

I have tried running the script with a VCF (before I was told that the VCF would not be necessary) and without the VCF, which slightly changes the error messages. I am using vg 1.56 via conda, but I also tried the dockerized 1.57 version, to no avail.

Given the error messages, which were either a segmentation fault, "terminate called after throwing an instance of 'std::bad_alloc'", or a stack trace from vg mentioning allocation problems, my best guess is that this is memory related. But I am starting these jobs with 400 GB + 50 CPUs via Slurm, because attempts with less RAM just caused Slurm to kill the job for exceeding the allowed memory limit. I also made sure to run a job over the weekend when the cluster was otherwise unused, so there should not have been any overlap in memory usage with other jobs, but that run failed as well. Here is an example of a complete Slurm script:

#!/bin/bash
#SBATCH -J 'vg_autoindexing_cpus50_mem400_-t50_3'
#SBATCH -p allNodes
#SBATCH --cpus-per-task 50
#SBATCH --mem 400G
#SBATCH --time=84:00:00
#SBATCH -e mem_logs/vg_autoindexing_cpus50_mem400_-t50_3_err.log
#SBATCH -o mem_logs/vg_autoindexing_cpus50_mem400_-t50_3_out.log
source bfx/user_folders/leon.dammer/mambaforge/bin/activate test
vg autoindex --workflow mpmap --prefix hprc-v1.1-mc-grch38_new -g hprc-v1.1-mc-grch38.gfa --tx-gff gencode.v46.annotation.gtf -t 50 

I’m using the following files:

Graph: hprc-v1.1-mc-grch38.gfa, provided on GitHub by the HPRC

Annotation: the most current GENCODE version (at the time of posting, v46). This was modified using the code mentioned here to prevent the 'Chromosome path not found in graph or haplotypes' error (roughly along the lines sketched below).
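For context, the kind of modification I mean is renaming the chromosome names in the GTF so that they match the graph's reference path names. A rough sketch of that idea, assuming the graph's paths use a PanSN-style prefix such as GRCh38#0#chr1 (that prefix is only a placeholder here; the real names have to be taken from the path lines of the GFA):

# hypothetical rename: prefix the GTF chromosome column so it matches
# the graph's reference path names (adjust the prefix to your GFA)
awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$1="GRCh38#0#"$1; print}' \
    gencode.v46.annotation.gtf > gencode.v46.annotation.renamed.gtf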

How can I fix this error, and what is causing it? And if my guess that it is RAM related is correct, roughly how much RAM would I have to allocate for the job to work?

Thank you in advance for your help

That's already a somewhat unusually large amount of memory. If it takes more than that, my guess is that it wouldn't be by much. If you want to rein in memory use a bit, you could try running with fewer threads; there are a few points where memory use is roughly proportional to the number of active threads. Another thing to check is the available disk space: the final step of that indexing pipeline can generate pretty large temporary files.
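For example, something along these lines (a sketch only; the temp directory path is a placeholder, and -T and -t are the relevant options):

# make sure the filesystem holding the temporary files has room
df -h /path/with/lots/of/space

# fewer threads plus a temp directory on a large filesystem
vg autoindex --workflow mpmap --prefix hprc-v1.1-mc-grch38_new \
    -g hprc-v1.1-mc-grch38.gfa --tx-gff gencode.v46.annotation.gtf \
    -t 16 -T /path/with/lots/of/space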

The final part appears to have been the problem. I stumbled upon that when my sysadmin asked me whether I could delete the large files in the tmp folder. After pointing the tmp folder somewhere with enough storage space using -T, the job has now been running for 22 h instead of crashing after 4 h, so I'd say that is a good sign. If I get any more errors, I'll try reducing the threads.

Thank you for your help

Sorry to disturb again. While that error cleared up, I now have a different one that still appears to be RAM related. The xg and dist indices are created, but the gcsa and gcsa.lcp files are empty. I tried using -M set to 400 and got:

error:[IndexRegistry] Child process 127199 signaled with status 9 representing signal 9

I then tried reducing the threads (to 30) and increasing the RAM limit (to 450 GB) and got the following error:

PathGraphBuilder::write(): Memory use of file 0 of kmer paths (451.642 GB) exceeds memory limit (450 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (451.526 GB) exceeds memory limit (450 GB)

I have also tried the fix suggested here: Indexing the human pangenome draft (removing the lines starting with W, roughly as sketched below), but that gave me an "Insufficient input" error.
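For completeness, by "removing the lines starting with W" I mean a plain filter on the GFA, roughly like this (the output filename is arbitrary):

# drop the W (walk) lines, keep everything else
grep -v '^W' hprc-v1.1-mc-grch38.gfa > hprc-v1.1-mc-grch38.noW.gfa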

It seems the issue is similar to vg autoindex - write_gcsa_kmers() size limit exceeded, but I'm using v1.56, so that bug should no longer exist.

My output files also look similar to what's described here, but the RAM issues described in that post appear to have been resolved without further explanation?

The tmp dir is on a server with 1.4 PB of storage so that should not be the issue.

Is there anything else I can do (reduce threads even further for example)? I have almost reached the amount of RAM I have available.

An example command is below; in further attempts to fix it, I only adapted the -M or -t flags:

 vg autoindex --workflow mpmap --prefix hprc-v1.1-mc-grch38_new_2 -g hprc-v1.1-mc-grch38.gfa --tx-gff gencode.v46.annotation.gtf -t 30 -V 2 -T /mc_graph/tmp -M 450G

Sorry for the long delay in responding to this. I don't think the error is directly related to the warning from PathGraphBuilder::write(). When that particular error occurs (in one of vg's library dependencies), vg handles it without raising a signal. A signal 9 is generally raised by the OS, not by the program itself, typically because you hit some resource limit. It could very well be RAM. Did you try running under /usr/bin/time -v to measure the memory use?
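For example, wrapping your exact command from above in the Slurm script like this (GNU time writes its report, including the "Maximum resident set size" line that shows peak memory, to the file given with -o):

/usr/bin/time -v -o vg_autoindex_time.log \
    vg autoindex --workflow mpmap --prefix hprc-v1.1-mc-grch38_new_2 \
        -g hprc-v1.1-mc-grch38.gfa --tx-gff gencode.v46.annotation.gtf \
        -t 30 -V 2 -T /mc_graph/tmp -M 450G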
