oma orthology part 2 script
Asked 16 months ago by kylakell

I am running OMA to find orthologs between two species of cyanobacteria (Croco and Tricho). I'm using the cheat sheet to run oma-part2.sh on an HPC, but I am confused about whether and when the job has finished. Only one job in the array is still running. Some of the log files from this job say the following:

only_run_allall := true
Starting database conversion and checks...
Process 54664 on b01-10: job nr 90 of 500
*** All all-vs-all jobs successfully terminated.     ***
*** terminating after AllAll phase due to "-s" flag. ***
*** if you see this message at the end of one job,   ***
*** this means that all jobs successfully finished.  ***

But the file that is still running (and some others), say the following:

only_run_allall := true
Starting database conversion and checks...
Process 34548 on b01-10: job nr 496 of 500
1689104692.380848 - 1 - [pid 34548]: Computing croco vs croco (Part 1 of 1) Mem: 0.016GB

I'm not sure if this means the job has finished or not. Any advice would be greatly appreciated!

For reference, here is my code from the part 2 script (I have adjusted some of the time and CPU parameters):

#!/bin/bash
#SBATCH --array=1-500
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2GB
cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
export NR_PROCESSES=500
./bin/oma -s -W 7000
if [[ "$?" == "99" ]] ; then
    scontrol requeue \
        ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0
Hi, I would say it is not yet finished, as you mentioned that some Slurm jobs are still running. You could run oma-status to make sure. Best, Sina.


Hi, thank you for your response. My understanding is that if the job's time limit runs out, it keeps re-submitting itself. Is this correct?


You're welcome. That's true: it keeps re-submitting because of this part of the Slurm script, I believe: if [[ "$?" == "99" ]] ; then scontrol requeue
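The requeue pattern can be sketched standalone (simulate_oma is a hypothetical stand-in for ./bin/oma; exit code 99 is the status the part-2 script checks for before requeuing):

```shell
#!/bin/sh
# Minimal sketch of the requeue pattern from the part-2 script.
# simulate_oma is a hypothetical stand-in for ./bin/oma -s -W ...;
# the script checks for exit code 99 and requeues the array task.
simulate_oma() { return 99; }

simulate_oma
rc=$?
if [ "$rc" = "99" ]; then
    # In the real script: scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
    echo "would requeue this array task"
fi
```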


An update: the program keeps getting stuck and never finishes the all-vs-all comparison for croco vs croco. All the other files in the cache are gzipped, but the croco vs croco file "part_1-1" never finishes, no matter how many times I run the program. I've tried clearing the cache and re-running, and I've tried re-running without clearing the cache; it never finishes.
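For reference, finished chunks show up gzipped, so pending ones can be listed by looking for files without a .gz suffix. A minimal sketch, assuming the default Cache/AllAll layout (the mock directory here just stands in for a real run):

```shell
#!/bin/sh
# Sketch: finished all-vs-all chunks are gzipped, so any file without
# a .gz suffix is still pending. /tmp/oma_demo mocks a real Cache/AllAll
# directory from an OMA standalone run.
mkdir -p /tmp/oma_demo/Cache/AllAll/croco/croco \
         /tmp/oma_demo/Cache/AllAll/croco/tricho
touch /tmp/oma_demo/Cache/AllAll/croco/croco/part_1-1      # still running
touch /tmp/oma_demo/Cache/AllAll/croco/tricho/part_1-1.gz  # finished

find /tmp/oma_demo/Cache/AllAll -type f ! -name '*.gz'
```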

The log for the job that is trying to run this comparison says:

only_run_allall := true
Starting database conversion and checks...
Process 34548 on b01-10: job nr 496 of 500
1689104692.380848 - 1 - [pid 34548]: Computing croco vs croco (Part 1 of 1) Mem: 0.016GB

bin/oma-status always reports the following (the percentages never change):

Summary of OMA standalone All-vs-All computations:
--------------------------------------------------
Nr chunks started: 1 (6.25%)
Nr chunks finished: 15 (93.75%)
Nr chunks finished w/o exported genomes: 15 (93.75%)

Any suggestions would be much appreciated!


Hi!

What do you mean you cleared the cache? Did you use oma-cleanup for this?

I had a similar issue caused by very long proteins. I doubt this is your problem, as you are dealing with bacteria, but it might be worth a try (you could check how long your proteins are to see whether this is a potential issue).
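To check sequence lengths, an awk one-liner like the following works (sketched here on a tiny hypothetical sample.fa; point it at your own FASTA input instead):

```shell
#!/bin/sh
# Sketch: print each sequence's length in a FASTA file, longest first.
# /tmp/sample.fa is a hypothetical stand-in for your real input.
cat > /tmp/sample.fa <<'EOF'
>short_protein
MKV
>long_protein
MKVAAAAAAAAAAAAAAAA
EOF
awk '/^>/ {if (id) print len, id; id = substr($1, 2); len = 0; next}
     {len += length($0)}
     END {if (id) print len, id}' /tmp/sample.fa | sort -rn
```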

What I did was change some of the part 2 script's parameters to give OMA more time to compute these long proteins.

With your script, it would look like this:

#!/bin/bash
#SBATCH --array=1-500
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=3GB
cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
export NR_PROCESSES=499
./bin/oma -s -W 10000
if [[ "$?" == "99" ]] ; then
    scontrol requeue \
        ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0

What I did here is:

  1. I increased the time OMA has to compute alignments in this chunk (controlled by -W) to 10000 seconds, in case it is stuck on some long protein: to checkpoint and move on, it needs to be between alignments. I believe that when the 10000 seconds are almost over, OMA writes a ckpt (checkpoint) file so a re-submitted job knows where to continue the chunk's alignments. I think it sometimes uses the extra time allowed by #SBATCH --time (here, 4 hours minus ~2.78 hours, i.e. roughly an hour) to do this, so it's usually good to leave some headroom.

  2. I gave it an extra GB of RAM per CPU (just in case).

  3. I changed NR_PROCESSES to a number different from the job-array size (for example 499). If there are more unprocessed chunks, they get reassigned and finish faster.
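The time arithmetic from point 1 can be checked directly (values taken from the script above):

```shell
#!/bin/sh
# Sketch: how much slack does --time leave beyond the -W soft limit?
W=10000                   # -W 10000 (seconds), ~2.78 h
TIME_LIMIT=$((4 * 3600))  # --time=4:00:00 in seconds
echo "headroom: $((TIME_LIMIT - W)) s"   # prints "headroom: 4400 s"
```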

Hope it helps
