Trying to hack OMA to do specific All-v-All searches
6.3 years ago
jason.kwan • 0

Hi Everyone,

I'm trying to run OMA 2.2.0 on my university's HPC environment. They have a shared gluster file system and have told me that they don't want users to run jobs that write directly to it. Instead, they want jobs to write their output to each compute node's local filesystem and copy it elsewhere at the end of the job. Their reasoning is that if all users wrote directly to the gluster system, everyone's reads and writes would slow down. This means I can't run parallel OMA jobs on many CPUs that all point to the same directory. I have 70 genomes to run through OMA, so I want to use hundreds of CPUs if possible.

With a bit of tinkering, I was able to figure out that if I create zero-sized files in the Cache/AllAll/<genome1>/<genome2> directory, OMA will skip that particular genome1-genome2 comparison. I notice that the gz files have names like part_X-Y.gz. My question is, how does OMA determine how many parts/gz files there should be in a particular directory? It doesn't seem to be constant.

My plan is to start many different jobs with specific "gaps" in the AllAll directory structure, to coerce each job to do a particular part of the All-v-All stage. However, I'd like to have some way of knowing which filenames are expected ahead of time (i.e. the values of X and Y in part_X-Y.gz). After deleting the zero-sized files, the directories will be combined with rsync. Convoluted, I know, but those are the constraints I am under.
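
For the clean-up step before merging, something like the following might do. It is only a sketch: the function name `remove_placeholders` is made up for illustration, and it assumes that every zero-sized file under Cache/AllAll is one of my placeholders rather than something OMA itself needs.

```shell
#!/bin/bash
# Hypothetical helper: delete zero-sized placeholder files below a
# Cache/AllAll tree so that only real result chunks remain.
# Assumption: any empty file in this tree is a placeholder.
remove_placeholders() {
    local root=${1:-Cache/AllAll}
    # -size 0 matches only empty files; non-empty .gz chunks are kept.
    find "$root" -type f -size 0 -exec rm -f {} +
}
```

After calling `remove_placeholders` on each job's copy of the tree, the per-job directories could then be combined with rsync as described above.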

I would be interested to hear your thoughts,

Jason

OMA orthologs • 1.9k views

Tagging: adrian.altenhoff

jason.kwan : I have tagged Adrian (OMA dev). He does not regularly check in to Biostars but he eventually will.

6.3 years ago

Hi Jason,

You found a workaround, but for the use case you describe, I think that you don't need the hack. If you run OMA standalone as a job array (see https://omabrowser.org/standalone/#schedulers), each job will get a distinct subset of chunks to process, even if there is no shared filesystem. So you could run each process in its own local filesystem and then manually copy back all the chunks at the end. This will work as long as you keep the number of parallel jobs declared in NR_PROCESSES constant (else the division of labour will be different!)

Note that the command to initiate a job array varies from scheduler to scheduler. If the one used in your HPC environment is not currently supported, please get in touch with us and we will help. It's also possible to manually set the environment variables NR_PROCESSES to the total number of jobs and THIS_PROC_NR to the job number of any given process.

Best wishes,

Christophe


Hi Christophe,

Thanks for your reply. The HPC environment at my university uses HTCondor (https://research.cs.wisc.edu/htcondor/), so I'm not sure if the job array functionality will work with OMA. From reading your link, I'm not sure I understand how the different OMA jobs in arrays submitted by supported schedulers know which chunk to do. In HTCondor, you do arrays by changing the "queue 1" line in the submit file to something like "queue 100". Each job then has a different value for $(Process) (in this case 0-99), that can be passed onto the executable. Is there some way to pass $(Process) to OMA?

Thanks again,

Jason


Hi Jason, you could simply use a wrapper shell script that sets the environment variables NR_PROCESSES=100 and THIS_PROC_NR=$Process + 1:

#!/bin/bash
# Total number of jobs in the array; must match the "queue 100" line.
export NR_PROCESSES=100
# HTCondor counts from 0, THIS_PROC_NR from 1; assumes $Process is set in the job's environment.
export THIS_PROC_NR=$(($Process+1))
oma

You would then submit this wrapper script.

This would be a simple hand-made solution. I will check whether we could natively integrate HTCondor in the future.
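
In HTCondor, $(Process) is a submit-file macro, so if it is not already exported into the job's environment, one way to get it to the wrapper is as a command-line argument. The sketch below assumes a submit file containing `arguments = $(Process) 100` followed by `queue 100`; the names `to_oma_proc_nr` and `run_oma_job` are made up for illustration and are not part of OMA.

```shell
#!/bin/bash
# Hypothetical wrapper, assuming the submit file passes
#   arguments = $(Process) 100
# i.e. the zero-based job index and the total number of jobs.

# Convert HTCondor's zero-based index to a one-based THIS_PROC_NR.
to_oma_proc_nr() {
    echo $(( $1 + 1 ))
}

# Export the two variables OMA reads, then start OMA.
run_oma_job() {
    export THIS_PROC_NR="$(to_oma_proc_nr "$1")"
    export NR_PROCESSES="${2:-100}"   # default mirrors "queue 100"
    oma
}

# Only launch when a job index was actually supplied.
if [ "$#" -ge 1 ]; then
    run_oma_job "$@"
fi
```

Keeping the total in the submit file's `arguments` line means the job count is written in one place next to the `queue` statement, which helps keep NR_PROCESSES constant across the array.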


Hi Adrian,

Thanks - that worked. For whatever reason, I couldn't understand how this works from the documentation.

Jason

6.3 years ago
jason.kwan • 0

OK, I managed to figure it out. In case this information is useful to anyone else, I'll leave it here. If you set "AlignBatchSize" large enough, then only one file will be created per directory, called "part_1-1.gz". To achieve this, AlignBatchSize should be set to the number of possible combinations of 2 for any two genomes being compared. If you count the number of proteins in the largest fasta file in your DB directory (n), then the number of combinations will be n!/((n-2)! * 2!), which simplifies to n(n-1)/2. I then made a script so that each job in an array has different gaps in the directory structure and therefore does different searches.
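
As a rough sketch, that value can be computed from the fasta files like this. The function names are made up for illustration, the script assumes every regular file in the DB directory is a fasta file with one ">" header line per protein, and it simply follows the formula above rather than anything confirmed about how OMA splits its batches.

```shell
#!/bin/bash
# Number of unordered pairs among n proteins: n! / ((n-2)! * 2!) = n*(n-1)/2.
pair_count() {
    local n=$1
    echo $(( n * (n - 1) / 2 ))
}

# Count fasta records (header lines starting with '>') in one file.
count_proteins() {
    grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Largest protein count over all files in a directory (DB by default).
largest_genome_size() {
    local dir=${1:-DB} max=0 n f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        n=$(count_proteins "$f")
        if [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    echo "$max"
}
```

For example, `pair_count "$(largest_genome_size DB)"` prints the candidate AlignBatchSize; with n = 20000 proteins it gives 199990000.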

Apologies for answering my own question - I guess I should have thought about it harder!

Jason
