kylakell, 16 months ago:
I am running OMA to find orthologs between two species of cyanobacteria (Croco and Tricho). I'm using the cheat sheet to run oma-part2.sh on an HPC, but I'm confused about whether and when the job has finished. Only one array task is still running. Some of the log files from this job say the following:
only_run_allall := true
Starting database conversion and checks...
Process 54664 on b01-10: job nr 90 of 500
*** All all-vs-all jobs successfully terminated. ***
*** terminating after AllAll phase due to "-s" flag. ***
*** if you see this message at the end of one job, ***
*** this means that all jobs successfully finished. ***
But the log for the task that is still running (and some others) says the following:
only_run_allall := true
Starting database conversion and checks...
Process 34548 on b01-10: job nr 496 of 500
1689104692.380848 - 1 - [pid 34548]: Computing croco vs croco (Part 1 of 1) Mem: 0.016GB
I'm not sure if this means the job has finished or not. Any advice would be greatly appreciated!
For reference, here is my code from the part2 script (I have adjusted some of the time and CPU parameters):
#!/bin/bash
#SBATCH --array=1-500
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2GB
cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
# number of parallel jobs the all-vs-all phase is split across
export NR_PROCESSES=500
# -s: stop after the all-vs-all phase; -W: wall-clock budget in seconds,
# after which OMA checkpoints and exits with code 99
./bin/oma -s -W 7000
# on exit code 99 (time limit reached mid-phase), requeue this array task
# so it can resume from its checkpoint
if [[ "$?" == "99" ]] ; then
    scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0
Hi, I would say it is not finished yet, since, as you mention, some Slurm jobs are still running. You could run oma-status to make sure. Best, Sina.
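A minimal sketch of that check (assuming the standard standalone layout, with oma-status sitting next to the oma binary in bin/ and run from the OMA working directory):

cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
./bin/oma-status    # reports how far the all-vs-all phase has progressed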
Hi, thank you for your response. My understanding is that if the job runs out of time, it keeps re-submitting itself. Is this correct?
You're welcome. That's true. It keeps re-submitting because of this part of the Slurm script, I believe: "if [[ "$?" == "99" ]] ; then scontrol requeue ...".
An update: the program keeps getting stuck and never finishes the croco vs croco all-vs-all comparison. All the other files in the cache are gzipped, but the croco vs croco file "part_1-1" never finishes, no matter how many times I run the program. I've tried clearing the cache and re-running, and I've tried re-running without clearing the cache, and it never finishes.
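(As a rough way to inspect that stuck chunk, something like the following could help; the path is an assumption based on the default Cache/AllAll layout, where finished chunks end in .gz and an unfinished one remains a plain part file:)

# assumed cache path for the croco vs croco chunk; adjust genome names/paths to your setup
ls -lh Cache/AllAll/croco/croco/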
The log for the task that is trying to run this comparison says:
bin/oma-status always reports the following (the percentages never change):
Any suggestions would be much appreciated!
Hi!
What do you mean you cleared the cache? Did you use oma-cleanup for this?
I had a similar issue due to very long proteins. I doubt this is your problem since you are dealing with bacteria, but it might be worth checking how long your proteins are to see whether this could be the cause.
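A minimal sketch of that length check (assuming your input proteomes are FASTA files under DB/, as in a standard standalone setup; the DB/*.fa path and extension are assumptions):

# print the longest protein (in amino acids) found in each input FASTA file
for f in DB/*.fa; do
  awk -v file="$f" '
    /^>/ { if (len > max) max = len; len = 0; next }
          { len += length($0) }
    END   { if (len > max) max = len; print file ": longest protein =", max, "aa" }
  ' "$f"
done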
What I did was change some of the part 2 script's parameters to give it more time to compute these long proteins.
With your provided script, it would look something like this:
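(A sketch reconstructing your part 2 script with the three changes explained below applied; the exact values are suggestions.)

#!/bin/bash
#SBATCH --array=1-500
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=3GB    # one extra GB of RAM per CPU
cd /project/dahutch_441/kylakell/oma/OMA.2.5.0
export NR_PROCESSES=499      # no longer equal to the array size of 500
./bin/oma -s -W 10000        # more seconds per chunk before checkpointing
if [[ "$?" == "99" ]] ; then
    scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0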
What I did here is:
Increased the time OMA has to compute the alignments in each chunk (controlled by -W) to 10000 seconds. This helps if it is stuck on some long protein, because to checkpoint for the next run it needs to be between alignments. I believe that when the 10000 seconds are almost up, OMA writes a ckpt (checkpoint) file so a re-submitted job knows where to continue the chunk's alignments. It sometimes uses the extra time allowed by #SBATCH --time for this (here 4 hours minus roughly 2.8 hours, so about an hour of buffer), so it's usually good to leave some headroom.
Gave it an extra GB of RAM per CPU (just in case).
Changed NR_PROCESSES to a number different from the job-array size (for example 499), so that any remaining unprocessed chunks get reassigned and finish faster.
Hope it helps