Question

Job fails after all-vs-all: Error, 'unable to ReadProgram' on SLURM cluster

7.1 years ago

gusa_10 • 0

While running OMA 1.1.2 on a SLURM cluster, I got the following error in my single job, which I started after my parallelized all-vs-all job:

Error, 'unable to ReadProgram(Cache/AllAll/id01/id04/part_3032-3127)

The single bash job is still running on the cluster, but it has been almost two days and there has been no update in my job.out file.

Meanwhile, the job.err file shows:

rm: cannot remove ‘Cache/conversion.running’

Could you suggest what to do? Should I let it run anyway?

Thank you

OMA • 2.2k views

ADD COMMENT • link 7.1 years ago by gusa_10 • 0

This post is lacking sufficient detail. Please include information about the program being used, exact command line with options and the kind of analysis that is being done with type of data.

ADD REPLY • link 7.1 years ago by GenoMax 147k

Tagging: adrian.altenhoff

ADD REPLY • link 7.1 years ago by GenoMax 147k

score 0 · Answer 1 · 2017-10-23

7.1 years ago

Adrian Altenhoff ★ 1.1k

Hi Gusa

I'm one of the OMA maintainers. The version you are using is already a bit outdated. The parallel processing of jobs has since been improved quite a bit. If possible, I suggest upgrading OMA to the latest version.

Most likely the chunk named in the error message is corrupted, something that could happen on slow filesystems with older versions of OMA. The best option is to abort the run, remove this chunk and restart the job. OMA should only need to redo this single chunk, so it should not take long, and then it should continue with the inference of the orthologs.
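The cleanup described above could be scripted roughly as follows. This is a minimal sketch, not part of OMA: the helper names `find_corrupt_chunks` and `remove_chunks` are made up for illustration, the regular expression is based only on the error message quoted in this thread, and the idea that a chunk might be stored with an extra extension (such as `.gz`) is an assumption. Abort the OMA run before deleting anything.

```python
import re
from pathlib import Path

# Pattern based on the error text quoted in this thread:
#   Error, 'unable to ReadProgram(Cache/AllAll/id01/id04/part_3032-3127)
CHUNK_RE = re.compile(r"unable to ReadProgram\(([^)\s]+)\)")


def find_corrupt_chunks(log_text):
    """Return the unique chunk paths named in ReadProgram errors, in order."""
    seen = []
    for match in CHUNK_RE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def remove_chunks(chunk_paths, root=".", dry_run=True):
    """Delete each chunk file, plus same-named files with an extension.

    With dry_run=True nothing is deleted; matching files are only returned.
    """
    affected = []
    for rel in chunk_paths:
        target = Path(root) / rel
        # Assumption: a chunk may exist as 'part_X-Y' or 'part_X-Y.<ext>'.
        candidates = [target] + sorted(target.parent.glob(target.name + ".*"))
        for candidate in candidates:
            if candidate.is_file():
                affected.append(candidate)
                if not dry_run:
                    candidate.unlink()
    return affected
```

A possible way to use it from the OMA working directory: read the job output with `Path("job.out").read_text()`, pass it to `find_corrupt_chunks`, inspect what `remove_chunks(...)` returns in the default dry-run mode, and only then call it again with `dry_run=False` before restarting the job.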

About the conversion.running problem: I assume that this file has already been removed by another job. If not, remove it before restarting OMA.

Good luck with the run! Best wishes, Adrian

ADD COMMENT • link 7.1 years ago by Adrian Altenhoff ★ 1.1k

Hi Adrian,

Thank you very much. I am encountering new problems with the latest version of OMA. After getting several messages of this type in my SLURM job.out (on every line except the last):

You specified to stop after the database conversion step (i.e. you set the "-c" flag). Database conversion successfully finished.

I got an error message in the job.err:

OMA.2.1.1/bin/../darwinlib/../data/GOdata.drw-20171023: 76.7% -- replaced wit ../data/GOdata.drw-20171023

While the last line of the job.out says:

: waiting for too long. abort. It seems that your parallelisation ...

I started my job with the options: ..oma -n 20 -c

Any suggestions?

Thank you so much in advance!

ADD REPLY • link 7.1 years ago by gusa_10 • 0

Problem solved! I just needed more memory.

ADD REPLY • link 7.1 years ago by gusa_10 • 0

Wrong thread, sorry.

ADD REPLY • link 7.0 years ago by andrespara ▴ 30

score 0 · Answer 2 · 2017-11-01

7.1 years ago

gusa_10 • 0

Hi Adrian,

I re-ran the analysis with OMA 2.1.1 and still get the same error message: "Error, 'unable to ReadProgram(Cache/AllAll/sp1/sp2/part_1042-1106)". Should I keep deleting these corrupted files and re-running?

Thank you!

ADD COMMENT • link 7.1 years ago by gusa_10 • 0

Yes. It might be useful to check in the scheduler's log why they are failing (e.g. too little memory allocated to the process, or too little runtime reserved). Cheers, Adrian
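To follow up on the scheduler-log suggestion, one option on SLURM is to ask `sacct` for the final state of each job step and flag the ones that ran out of memory or time. This is a rough sketch only: the function names `failed_steps` and `query_sacct` are placeholders, and it assumes that job accounting is enabled on the cluster and that `sacct` is called with `--parsable2`, so that the output is pipe-separated with a header line.

```python
import subprocess

FIELDS = "JobID,State,ExitCode,MaxRSS,ReqMem,Elapsed,Timelimit"
# States that usually point to a resource limit or a crash.
BAD_STATES = ("OUT_OF_MEMORY", "TIMEOUT", "FAILED", "NODE_FAIL", "CANCELLED")


def failed_steps(sacct_text):
    """Parse `sacct --parsable2` output and return rows with a bad state."""
    lines = [ln for ln in sacct_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    rows = [dict(zip(header, ln.split("|"))) for ln in lines[1:]]
    # State can carry a suffix such as "CANCELLED by 1234", so match on prefix.
    return [r for r in rows if r.get("State", "").startswith(BAD_STATES)]


def query_sacct(job_id):
    """Run sacct for one job (or job array) and return its raw output."""
    cmd = ["sacct", "-j", str(job_id), "--format=" + FIELDS, "--parsable2"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

For example, `failed_steps(query_sacct(job_id))` with the ID of the all-vs-all job lists the steps worth a closer look. Comparing `MaxRSS` with `ReqMem`, and `Elapsed` with `Timelimit`, for those rows gives a hint as to whether more memory or more runtime should be requested.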

ADD REPLY • link 7.1 years ago by Adrian Altenhoff ★ 1.1k