STRUCTURE runs failing: core dumping and slurm exit code 10?
4 days ago

I'm running STRUCTURE v 2.3.4 on an HPC cluster. I've run it successfully many times before, but I've run into a recurring problem that I can't fix.

I used StrAuto to set up command lists in order to run replicates and send the outputs to specific directories, and I have run scripts that look like this:

#!/bin/sh
## 2024-08-21
## using: Humber_PGDstr_fixed, under admixture model where alpha is allowed to vary
## burnin = 1,000,000
## iterations = 500,000

#SBATCH --account=<name>
#SBATCH --time=28-00:00:00
#SBATCH --nodes=1
#SBATCH --mem=8000
#SBATCH --ntasks-per-node=1

module load StdEnv/2020
module load gcc/9.3.0
module load nixpkgs/16.09
module load python/2.7.14
module load intel/2018.3
module load structure/2.3.4

set -eu

cat commands_run01 | parallel -j 8

mv k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 results_f/
mkdir harvester_input
cp results_f/k*/*_f harvester_input
echo 'Your structure run has finished.'
# Run structureHarvester
./structureHarvester.py --dir harvester_input --out harvester --evanno --clumpp
echo 'structureHarvester run has finished.'
#Clean up harvester input files.
zip Humber_PGDstr_fixed_Harvester_Upload.zip harvester_input/*
mv Humber_PGDstr_fixed_Harvester_Upload.zip harvester/
rm -rf harvester_input

Most of the time (but not always) the runs fail about 15 days in, with the Slurm output reporting: Segmentation fault (core dumped). When I check with seff, it returns:

§ Checking seff for this job:
Job ID: 44471783
Cluster: <name>
User/Group: <name>
State: FAILED (exit code 10)
Cores: 1
CPU Utilized: 15-11:51:11
CPU Efficiency: 99.49% of 15-13:46:12 core-walltime
Job Wall-clock time: 15-13:46:12
Memory Utilized: 136.53 MB
Memory Efficiency: 1.71% of 7.81 GB

I've looked around for what "exit code 10" means for slurm, but I can't find anything beyond "some error".

I'm guessing that this is a memory problem at the writing step, but I can't figure out which part of my instructions is incorrect, or why it fails sometimes but not always (given that all my run scripts are functionally the same).

Any ideas on what is going wrong here?

slurm linux STRUCTURE • 264 views

"Most of the time (but not always) the runs fail at about ~15 days"

Have you looked at the logs for STRUCTURE? It appears that the jobs are failing because of an error there.

"why it fails sometimes but not always"

Since you are using a job scheduler, why are you using parallel? It looks like you are only using a single node but not specifying how many CPU cores. How many are you allowed to use by default?
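The mismatch raised here can be sketched in script form. The posted script requests one task with no --cpus-per-task, while parallel -j 8 starts eight STRUCTURE processes at once, and the seff report in the question lists Cores: 1. Below is a minimal sketch, assuming the cluster allows eight cores for a single job, that ties the number of simultaneous commands to the allocation instead of hard-coding it. The function names njobs and run_command_list and the .joblog file name are hypothetical; --cpus-per-task, SLURM_CPUS_PER_TASK, and parallel's --joblog are existing Slurm and GNU parallel features.

```shell
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# Assumption: eight cores per job are permitted on this cluster.
#SBATCH --cpus-per-task=8

# Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested;
# default to 1 so the cores are not oversubscribed if it is missing.
njobs() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Run one command list with as many simultaneous commands as allocated cores.
# --joblog writes each command's exit value and signal to a file.
run_command_list() {
    parallel -j "$(njobs)" --joblog "$1.joblog" < "$1"
}

# In the batch script this call would stand in for the
# `cat commands_run01 | parallel -j 8` line:
# run_command_list commands_run01
```

Whether sharing one core among eight processes is related to the segmentation fault is not established in this thread; the sketch only removes the mismatch between the request and what parallel launches.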


Thanks for your reply. Yes, I checked the logs for STRUCTURE. There are no errors there; it just stops writing at some point.

I'm using parallel because we have so many replicates to run, and it is generally more efficient. I also set up the scripts using StrAuto, a helper tool that automatically sets things to run in parallel.

I tried specifying the number of cores before, but I must have been doing it incorrectly, because it failed within a couple of days. I'm not sure if there is a default, but I know I can request a 48-core node for big jobs. It's a bit of a dance because asking for a lot (e.g., a whole node) can tank my research group's scheduling priority.

Any ideas on how to more efficiently ask for the memory in the scheduling? I'm guessing that's the issue ...
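On the memory guess: the seff report in the question lists 136.53 MB used out of the 7.81 GB requested, so the memory request does not look like the limit. One possible reading of the exit code, offered as an inference and not something confirmed in this thread: GNU parallel documents its exit status as the number of jobs that failed (1 to 100), and with set -eu the batch script stops at the parallel line and passes that status on, so "exit code 10" may simply mean that ten of the STRUCTURE commands failed. If parallel is run with --joblog, the failing commands can be listed afterwards. A minimal sketch, assuming a joblog exists; the function name failed_jobs and the file name run01.joblog are hypothetical:

```shell
#!/bin/sh
# List the commands that did not finish cleanly, given a GNU parallel joblog.
# Joblog columns (tab-separated):
#   Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
failed_jobs() {
    # Skip the header line; report rows with a nonzero exit value or signal.
    awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) {
        printf "seq=%s exitval=%s signal=%s cmd=%s\n", $1, $7, $8, $9
    }' "$1"
}

# Example call (hypothetical file name):
# failed_jobs run01.joblog
```

A command killed by a segmentation fault would be expected to show signal 11 in that listing, which would identify the K value and replicate that crashed.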
