kylakell, 16 months ago:
I am running OMA to find orthologs between two species of cyanobacteria (Croco and Tricho). I'm using the cheat sheet to run oma-part2.sh on an HPC, but I'm confused about whether and when the job has finished. Only one array task is still running. Some of the log files from this job say the following:
only_run_allall := true
Starting database conversion and checks...
Process 54664 on b01-10: job nr 90 of 500
*** All all-vs-all jobs successfully terminated. ***
*** terminating after AllAll phase due to "-s" flag. ***
*** if you see this message at the end of one job, ***
*** this means that all jobs successfully finished. ***
But the log for the task that is still running (and some others) says the following:
only_run_allall := true
Starting database conversion and checks...
Process 34548 on b01-10: job nr 496 of 500
1689104692.380848 - 1 - [pid 34548]: Computing croco vs croco (Part 1 of 1) Mem: 0.016GB
I'm not sure if this means the job has finished or not. Any advice would be greatly appreciated!
For reference, here is my code from the part2 script (I have adjusted some of the time and CPU parameters):
#!/bin/bash
#SBATCH --array=1-500
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2GB
cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
# number of parallel jobs the all-vs-all phase is split across
export NR_PROCESSES=500
# -s: stop after the all-vs-all phase; -W: wall-clock budget in seconds,
# after which OMA checkpoints and exits with code 99
./bin/oma -s -W 7000
# on exit code 99 (time limit reached mid-phase), requeue this array task
# so it can resume from its checkpoint
if [[ "$?" == "99" ]] ; then
    scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0
Hi, I would say it is not finished yet, since, as you mention, some Slurm jobs are still running. You could run oma-status to make sure. Best, Sina.
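A minimal sketch of that check (assuming the standard standalone layout, with oma-status sitting next to the oma binary in bin/ and run from the OMA working directory):

cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
./bin/oma-status    # reports how far the all-vs-all phase has progressed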
Hi, thank you for your response. My understanding is that if the job runs out of time, it keeps re-submitting itself. Is this correct?
You're welcome. That's true. It keeps re-submitting because of this part of the Slurm script, I believe: "if [[ "$?" == "99" ]] ; then scontrol requeue ...".
An update: the program keeps getting stuck and never finishes the croco vs croco all-vs-all comparison. All the other files in the cache are gzipped, but the croco vs croco file "part_1-1" never finishes, no matter how many times I run the program. I've tried clearing the cache and re-running, and I've tried re-running without clearing the cache, and it never finishes.
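(As a rough way to inspect that stuck chunk, something like the following could help; the path is an assumption based on the default Cache/AllAll layout, where finished chunks end in .gz and an unfinished one remains a plain part file:)

# assumed cache path for the croco vs croco chunk; adjust genome names/paths to your setup
ls -lh Cache/AllAll/croco/croco/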
The log for the task that is trying to run this comparison says:
bin/oma-status always reports the following (the percentages never change):
Any suggestions would be much appreciated!
Hi!
What do you mean you cleared the cache? Did you use oma-cleanup for this?
I had a similar issue due to very long proteins. I doubt this is your problem since you are dealing with bacteria, but it might be worth checking how long your proteins are to see whether this could be the cause.
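A minimal sketch of that length check (assuming your input proteomes are FASTA files under DB/, as in a standard standalone setup; the DB/*.fa path and extension are assumptions):

# print the longest protein (in amino acids) found in each input FASTA file
for f in DB/*.fa; do
  awk -v file="$f" '
    /^>/ { if (len > max) max = len; len = 0; next }
          { len += length($0) }
    END   { if (len > max) max = len; print file ": longest protein =", max, "aa" }
  ' "$f"
done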
What I did was change some of the part 2 script's parameters to give it more time to compute these long proteins.
With your provided script, it would look something like this:
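(A sketch reconstructing your part 2 script with the three changes explained below applied; the exact values are suggestions.)

#!/bin/bash
#SBATCH --array=1-500
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=3GB    # one extra GB of RAM per CPU
cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
export NR_PROCESSES=499      # no longer equal to the array size of 500
./bin/oma -s -W 10000        # more seconds per chunk before checkpointing
if [[ "$?" == "99" ]] ; then
    scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0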
What I did here is:
Increased the time OMA has to compute the alignments in each chunk (controlled by -W) to 10000 seconds. This helps if it is stuck on some long protein, because to checkpoint for the next run it needs to be between alignments. I believe that when the 10000 seconds are almost up, OMA writes a ckpt (checkpoint) file so a re-submitted job knows where to continue the chunk's alignments. It sometimes uses the extra time allowed by #SBATCH --time for this (here 4 hours minus roughly 2.8 hours, so about an hour of buffer), so it's usually good to leave some headroom.
Gave it an extra GB of RAM per CPU (just in case).
Changed NR_PROCESSES to a number different from the job-array size (for example 499), so that any remaining unprocessed chunks get reassigned and finish faster.
Hope it helps