Question

STRUCTURE runs failing: core dumping and slurm exit code 10?

10 months ago

katherinedrotos • 0

I'm running STRUCTURE v 2.3.4 on an HPC cluster. I've run it successfully many times before, but I've run into a recurring problem that I can't fix.

I used StrAuto to set up command lists in order to run replicates and send the outputs to specific directories, and I have run scripts that look like this:

#!/bin/sh
## 2024-08-21
## using: Humber_PGDstr_fixed, under admixture model where alpha is allowed to vary
## burnin = 1,000,000
## iterations = 500,000

#SBATCH --account=<name>
#SBATCH --time=28-00:00:00
#SBATCH --nodes=1
#SBATCH --mem=8000
#SBATCH --ntasks-per-node=1

module load StdEnv/2020
module load gcc/9.3.0
module load nixpkgs/16.09
module load python/2.7.14
module load intel/2018.3
module load structure/2.3.4

set -eu

cat commands_run01 | parallel -j 8

mv k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 results_f/
mkdir harvester_input
cp results_f/k*/*_f harvester_input
echo 'Your structure run has finished.'
# Run structureHarvester
./structureHarvester.py --dir harvester_input --out harvester --evanno --clumpp
echo 'structureHarvester run has finished.'
#Clean up harvester input files.
zip Humber_PGDstr_fixed_Harvester_Upload.zip harvester_input/*
mv Humber_PGDstr_fixed_Harvester_Upload.zip harvester/
rm -rf harvester_input

Most of the time (but not always) the runs fail about 15 days in, with the slurm output reporting: Segmentation fault (core dumped). When I check with seff, it returns:

§ Checking seff for this job:
Job ID: 44471783
Cluster: <name>
User/Group: <name>
State: FAILED (exit code 10)
Cores: 1
CPU Utilized: 15-11:51:11
CPU Efficiency: 99.49% of 15-13:46:12 core-walltime
Job Wall-clock time: 15-13:46:12
Memory Utilized: 136.53 MB
Memory Efficiency: 1.71% of 7.81 GB

I've looked around for what "exit code 10" means for slurm, but I can't find anything beyond "some error".

I'm guessing that this is a memory problem at the writing step, but I can't figure out which part of my instructions is incorrect, or why it fails sometimes but not always (given that all my run scripts are functionally the same).

Any ideas on what is going wrong here?

slurm linux STRUCTURE • 954 views

updated 10 months ago by GenoMax 153k • written 10 months ago by katherinedrotos • 0

Most of the time (but not always) the runs fail at about ~15 days

Have you looked at the logs for STRUCTURE? It appears that the jobs are failing because of an error there.

why it fails sometimes but not always

Since you are using a job scheduler, why are you using parallel? It looks like you are requesting a single node without specifying how many CPU cores. How many are you allowed to use by default?
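A minimal sketch of what matching the core request to parallel might look like. This is an assumption about the setup, not a tested fix: the seff output in the question reports Cores: 1 while the script asks parallel for 8 simultaneous jobs, so the 8 STRUCTURE processes would be sharing one core. The --cpus-per-task value of 8 is only illustrative, and the fallback to 1 is there so the snippet also runs outside a Slurm job.

```shell
#!/bin/sh
# Illustrative request: one core per simultaneous STRUCTURE run.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8000

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 when the variable is unset (e.g. outside a Slurm job).
JOBS="${SLURM_CPUS_PER_TASK:-1}"
echo "parallel will run up to ${JOBS} commands at a time"

# Run the StrAuto command list only if it and GNU parallel are available.
if [ -f commands_run01 ] && command -v parallel >/dev/null 2>&1; then
    parallel -j "$JOBS" < commands_run01
fi
```

Tying -j to the allocation means the number of simultaneous runs can never exceed the cores Slurm actually granted, whatever value ends up in the header.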

10 months ago by GenoMax 153k

Thanks for your reply. Yes, I checked the logs for STRUCTURE. There are no errors there; it just stops writing at some point.

I'm using parallel because we have so many replicates to run, and it is generally more efficient. (I also set up the scripts using StrAuto, a helper tool that automatically sets things to run in parallel.)

I tried specifying the number of cores before, but I must have been doing it incorrectly, because it failed within a couple of days. I'm not sure if there is a default, but I know I can request a 48-core node for big jobs. It's a bit of a dance, because asking for a lot (e.g., a whole node) can tank my research group's scheduling priority.

Any ideas on how to request memory more efficiently from the scheduler? I'm guessing that's the issue ...

10 months ago by katherinedrotos • 0

I'm using parallel because we have so many replicates to run, it is generally more efficient

I doubt that including parallel in the mix is going to be more efficient at running jobs than a proper job scheduler on its own.

I tried specifying the number of cores before but I must've been doing it incorrectly, it failed within a couple days

It feels like there is some interaction between SLURM/parallel/STRUCTURE that is causing the problems you are having.

Memory Efficiency: 1.71% of 7.81 GB

Based on this, it would appear that memory is not the issue, but again, I am not sure how parallel figures into this mix.
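On how parallel might figure in: GNU parallel's own exit status is the number of jobs that failed (1 to 100), and with set -e the batch script stops at the failing parallel line and passes that status on to Slurm. So "exit code 10" may simply mean that 10 of the STRUCTURE commands failed. That reading is a guess and has not been confirmed for this job. If parallel is run with its --joblog option, the log records an exit value and a signal for every command. The helper below is a sketch for reading such a log; failed_jobs is a made-up name, and it assumes the standard tab-separated joblog layout (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command).

```shell
# List the commands in a GNU parallel --joblog file that exited non-zero
# or were terminated by a signal.
# Argument 1: path to the joblog file (tab-separated, one header line).
failed_jobs() {
    awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) {
        printf "seq=%s exit=%s signal=%s cmd=%s\n", $1, $7, $8, $9
    }' "$1"
}
```

Usage, assuming the log was written to a file called parallel_run01.log (a placeholder name): failed_jobs parallel_run01.log. Signal 11 is SIGSEGV on Linux, so a segmentation fault in one replicate should show up in the signal field, which would narrow down which K value and replicate are crashing.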

10 months ago by GenoMax 153k