snakemake on slurm cluster - jobs not updating/submitting after checkpoints? (Error submitting jobscript (exit code 1):)
Berghopper ▴ 20 · 5.7 years ago

Dear Biostars,

I have a fairly complicated pipeline that I need to run on a SLURM cluster, but I am not able to get it to work.

For some reason the pipeline works for smaller jobs, but as soon as I add more input files for more rigorous testing, it no longer finishes correctly.

I don't have a minimal example (yet) as it's the end of my workday, I will add one if this question isn't easily resolved.

So, what happens is the following:

I submit my main Snakemake "daemon" job with sbatch ../slurm_eating_snakemake.sh, i.e. the following script:

#!/usr/bin/env bash

# Jobname
#SBATCH --job-name=SNEKHEAD
#
# Project
#SBATCH --account=nn3556k
#
# Wall clock limit
#SBATCH --time=24:00:00
#
# Max memory usage:
#SBATCH --mem-per-cpu=16G

## set up job environment
source /usit/abel/u1/caspercp/Software/snek/bin/activate
module purge   # clear any inherited modules
#set -o errexit # exit on errors (turned off, so all jobs are cancelled in event of crash)

## copy input files
cp -R /usit/abel/u1/caspercp/nobackup/DATA/ $SCRATCH
cp -R /usit/abel/u1/caspercp/lncrna_thesis_prj/src/snakemake_pipeline/ $SCRATCH
#cp -R $SUBMITDIR/OUTPUTS/ $SCRATCH

## Do some work:
cd $SCRATCH/snakemake_pipeline
echo $(date) >> ../bash_tims.txt
# run pipeline
snakemake --snakefile start.snakefile -pr --runtime-profile ../timings.txt --cluster "sbatch -A nn3556k --time=24:00:00 --mem-per-cpu=4G -d after:$SLURM_JOB_ID" -j 349 --restart-times 1
echo $(date) >> ../bash_tims.txt

## Make sure the results are copied back to the submit directory:
cp -R $SCRATCH/OUTPUTS/ $SUBMITDIR
cp -R $SCRATCH/snakemake_pipeline/.snakemake/ $SUBMITDIR
mkdir $SUBMITDIR/child_logs/
cp $SCRATCH/snakemake_pipeline/slurm-*.out $SUBMITDIR/child_logs/
cp $SCRATCH/OUTPUTS/output.zip $SUBMITDIR
cp $SCRATCH/timings.txt $SUBMITDIR
cp $SCRATCH/bash_tims.txt $SUBMITDIR

# CANCEL ALL JOBS IN EVENT OF CRASH (or on exit, but it should not matter at that point.)
scancel -u caspercp

I am using the Abel cluster, in case you want specifics: https://www.uio.no/english/services/it/research/hpc/abel/

This is where the whole thing falls apart: for some reason, once the checkpoints are finished, Snakemake can no longer submit new jobs. I get the following error (a subset of the Snakemake output):

[Thu Mar 14 17:46:27 2019]
checkpoint split_up_genes_each_sample_lnc:
    input: ../OUTPUTS/prepped_datasets/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING.txt
    output: ../OUTPUTS/control_txts/custom_anno/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING-human-BP/
    jobid: 835
    reason: Missing output files: ../OUTPUTS/control_txts/custom_anno/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING-human-BP/; Input files updated by another job: ../OUTPUTS/prepped_datasets/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING.txt
    wildcards: expset=expression_table_GSEA_Stopsack, geneset=HALLMARK_IL6_JAK_STAT3_SIGNALING, organism=human, ontology=BP
Downstream jobs will be updated after completion.

Error submitting jobscript (exit code 1):

Updating job 655.
[Thu Mar 14 17:46:43 2019]
Finished job 896.
95 of 1018 steps (9%) done
Updating job 539.
[Thu Mar 14 17:47:24 2019]
Finished job 780.
96 of 1022 steps (9%) done
Updating job 643.
.......
[Thu Mar 14 17:51:35 2019]
Finished job 964.
203 of 1451 steps (14%) done
Updating job 677.
[Thu Mar 14 17:51:46 2019]
Finished job 918.
204 of 1455 steps (14%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /work/jobs/26276509.d/snakemake_pipeline/.snakemake/log/2019-03-14T172923.764021.snakemake.log

Roughly speaking, the checkpoint splits an output txt with genes (a geneset file) into separate files named {gene}.txt for each sample, so I can feed them to my analysis algorithms.
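
For context, here is a minimal sketch of what such a checkpoint plus its gather function can look like (rule names, paths and the .result suffix are illustrative placeholders, not my actual code):

import os

# Sketch only: split one geneset txt into one {gene}.txt per line.
checkpoint split_up_genes:
    input:
        "../OUTPUTS/prepped_datasets/{expset}-{geneset}.txt"
    output:
        directory("../OUTPUTS/control_txts/custom_anno/{expset}-{geneset}/")
    run:
        os.makedirs(output[0], exist_ok=True)
        with open(input[0]) as fh:
            for line in fh:
                gene = line.split("\t")[0].strip()
                with open(os.path.join(output[0], gene + ".txt"), "w") as out:
                    out.write(line)

def gather_gene_results(wildcards):
    # Only callable after the checkpoint has finished; Snakemake then
    # re-evaluates the DAG with the per-gene files that now exist.
    ckpt_dir = checkpoints.split_up_genes.get(**wildcards).output[0]
    genes = glob_wildcards(os.path.join(ckpt_dir, "{gene}.txt")).gene
    return expand("../OUTPUTS/results/{expset}-{geneset}/{gene}.result",
                  expset=wildcards.expset, geneset=wildcards.geneset,
                  gene=genes)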

But I am really confused by the error "Error submitting jobscript (exit code 1):"; it gives no clear direction for troubleshooting.
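
One thing I will try next: as far as I can tell, "Error submitting jobscript (exit code 1)" only means that the sbatch call itself exited with 1; the actual reason goes to sbatch's stderr, which I never get to see. A tiny wrapper script (a sketch; the name sbatch_logged.sh is made up) passed to --cluster should capture it:

#!/usr/bin/env bash
# sbatch_logged.sh -- forward all arguments (including the jobscript
# that Snakemake appends) to sbatch, but keep sbatch's stderr in a log
# so the real submission error is not lost.
sbatch "$@" 2>> sbatch_errors.log

and then use --cluster "./sbatch_logged.sh -A nn3556k --time=24:00:00 --mem-per-cpu=4G" in place of the direct sbatch call above.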

Thanks in advance for any input!

extra info:

  • The pipeline runs fine outside of the cluster.
  • I suspect I have to group my jobs in a specific way, although I am not sure.

I am using the following snakemake setup:

(snek) -bash-4.1$ pip freeze --local
appdirs==1.4.3
attrs==19.1.0
certifi==2019.3.9
chardet==3.0.4
ConfigArgParse==0.14.0
Cython==0.29.6
datrie==0.7.1
docutils==0.14
gitdb2==2.0.5
GitPython==2.1.11
idna==2.8
jsonschema==3.0.1
numpy==1.16.2
pandas==0.24.1
pyrsistent==0.14.11
python-dateutil==2.8.0
pytz==2018.9
PyYAML==3.13
ratelimiter==1.2.0.post0
requests==2.21.0
six==1.12.0
smmap2==2.0.5
snakemake==5.4.3
urllib3==1.24.1
wrapt==1.11.1
yappi==1.0
snakemake slurm software error

What happens if you remove -d after:$SLURM_JOB_ID? Usually it's most convenient to let Snakemake handle starting jobs.
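
i.e. something like:

snakemake --snakefile start.snakefile -pr --runtime-profile ../timings.txt \
    --cluster "sbatch -A nn3556k --time=24:00:00 --mem-per-cpu=4G" \
    -j 349 --restart-times 1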


I don't know yet; I added it more as a safety feature in case the main "daemon" job terminates while the child jobs are still pending. I'll try removing it and report back.

I also added another comment; you may actually be on the right track.


I think I resolved my issue, see my latest comment.


Bump,

It was actually not a mistake in the documentation...

Berghopper ▴ 20 · 5.7 years ago

The reason this was happening was Slurm's job scheduler... Sadly, UiO's documentation listed that you can utilize at most ~400 jobs (https://web.archive.org/web/20190314224204/https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/queue-system.html#General_Job_Limitations). When I went out and measured it, though, the limit was only 40!

This creates a bit of a problem for pipelines that rely on running many rules in parallel; I might have to tweak things a bit...

Edit: This was not a mistake in the documentation after all. I contacted the Slurm admins and they verified that you actually CAN run 400 jobs on the cluster. This makes me wonder whether this is a Slurm or a Snakemake bug...


As OP, can you mark your own answer as accepted? Your question, though very well written and with many appreciated details, is quite a bit of text to read, and it took me a moment to realize it is solved.


Actually, sadly it is still not solved: I contacted the Slurm cluster admins and they state that I CAN use 400 jobs after all. So either this is a Slurm bug, or Snakemake is failing to handle this many jobs.

Either way, as a workaround for now I'll probably make a separate intermediary script that handles detailed multithreading per node, as sketched below.
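
For example (a sketch; run_analysis.py stands in for my real per-gene step), a single rule could consume a whole checkpoint directory and fan the genes out over the cores of one node with xargs, so only one cluster job gets submitted per geneset:

rule analyse_geneset_batch:
    input:
        genes_dir="../OUTPUTS/control_txts/custom_anno/{expset}-{geneset}/"
    output:
        touch("../OUTPUTS/results/{expset}-{geneset}.done")
    threads: 16
    shell:
        # run at most {threads} per-gene analyses in parallel inside one job
        "ls {input.genes_dir}*.txt | xargs -P {threads} -n 1 run_analysis.py"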

Berghopper ▴ 20 · 5.7 years ago

Ok, update time:

What I've noticed is that if I dial down the -j option (the maximum number of jobs submitted at the same time), the pipeline is a lot more stable for some reason. I don't know why this is, but I imagine it is something on the Slurm side rather than in Snakemake...
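
For example, the same submission as in my original script, minus the dependency flag discussed above, capped at the 40 concurrent jobs I measured earlier:

snakemake --snakefile start.snakefile -pr --runtime-profile ../timings.txt \
    --cluster "sbatch -A nn3556k --time=24:00:00 --mem-per-cpu=4G" \
    -j 40 --restart-times 1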
