2.3 years ago
Matteo
Hi everyone,
I have been running ELAI on an HPC, which I have successfully done in the past, but now I am getting failed SLURM reports (Exit code = 1). However, the log file of the submitted job seems to be ok and no error is printed (see below). The output file seems ok as well and has an appropriate size, comparable to those obtained from past analyses. Does anyone know what might create such discordance between SLURM reports and log files? Is it safe to rely on the output files that have been generated? Thanks in advance for the help!!
Matteo
Log file output:
## COMMAND: /home/vonholdt/VONHOLDT/BIN/elai/elai-lin -g ref_extoni_for_elai_chr05.recode.geno.txt -p 10 -g ref_pusillus_for_elai_chr05.recode.geno.txt -p 11 -g chrysopus_to_infer_for_elai_chr05.recode.geno.txt -p 1 -pos chrysopus_to_infer_for_elai_chr05.recode.pos.txt -s 30 -o chr05_run3_mg15 -C 2 -c 10 -mg 15
## randseed = 1661469556
## warning: number of position files = 1
## warning: position files contain 1157298 records.
## warning: File 0 has 26 ind's and 1157298 SNPs
## warning: File 1 has 7 ind's and 1157298 SNPs
## warning: File 2 has 103 ind's and 1157298 SNPs
### m_morgan = 0.482452
### total number of individuals 136
## number of panel individuals = 33
## number of cohort = 103
## number genotype files = 3
## number phenotype files = 3
## number of diploid = 136
## number of haploid = 0
## number of individuals = 136
## number of snp = 1157298
### estimated total genetic distance 0.482452
### constrained upper layer switches 7.23678
### constrained lower layer switches 482.452
### constrained ancillary switches 1
### 0 0 -80954696.237 -80954696.237 500.507 7.397 0.935
### 0 1 -42135163.282 38819532.955 766.024 8.044 0.750
### 0 2 -29799695.683 12335467.599 2875.696 34.144 197.866
### ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
trk is nan in rk upate
...
0 28 -109855861.541 8652.939 36316.739 24.034 672.992
### ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
ta is nan in beta update
trk is nan in rk upate
0 29 -110010505.507 -154643.966 36622.642 24.250 673.884
## EM seconds used = 65041
## ELAI generate following files in the output directory.
## chr05_run3_mg15.snpdata.txt
## random seed = 1661469556
To me, it looks like you may be losing that output: you are getting logs from each slave script, but not necessarily from the master script. I could be wrong, though; it is hard to tell.
How do the relevant portions of the master script look with respect to error handling, stderr, and stdout? Are you using the master script to kick off many slave processes? It may be that the log files are generated correctly for each of these, but not for the master. Can you comment on this?
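To make the question about error handling concrete, here is a minimal sketch of how a submission script could keep stdout and stderr apart and record the exit status of the command it launches. The function name `run_and_report` and the log file names are my own assumptions for illustration, not something taken from your script or from ELAI.

```shell
#!/bin/bash
# Hypothetical helper (name and file layout are assumptions):
# run a command, write stdout and stderr to separate files, and
# record the exit status so it can be compared with the scheduler report.
run_and_report() {
    local prefix status
    prefix="$1"   # hypothetical prefix for the three log files
    shift
    "$@" >"${prefix}.stdout.log" 2>"${prefix}.stderr.log"
    status=$?
    echo "command: $*" >"${prefix}.status.log"
    echo "exit status: ${status}" >>"${prefix}.status.log"
    # Pass the status on so the batch job fails when the command fails.
    return "${status}"
}
```

In a batch script you would call it as `run_and_report some_prefix your_command its_arguments` and then `exit $?`, so the status the scheduler reports is the one written to the status file. On the SLURM side, `sacct -j <jobid> --format=JobID,State,ExitCode` lists the state and exit code of each job step, which may help show which step returned 1.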
See also the ideas below.
Example:
Consider also this snippet, which I used for an old Sun Grid Engine parallel submission script