Hello,
I have a cluster of 7 nodes (20 threads in total) running SLURM on Ubuntu 20.04. I am trying to run a diamond blastx job on the cluster, but it just runs N-times the same process ...
Please, can someone tell me what I am missing?
Thank you
That does not make sense. Most modern CPUs have multiple cores, and each core in turn often supports 2 threads (Intel Xeon CPUs, for example). Each server node commonly has 2 sockets/CPUs (4-socket servers are available but significantly more expensive).
So if you actually have 7 physical nodes (servers), they are unlikely to be limited to 20 threads.
"but it just runs N-times the same process"
What does that mean? The DIAMOND job above should run on the number of threads/cores given with -p.
Let me try to be clearer. On my cluster, 4 machines are dual-core (together they can run 8 processes) and 3 machines are quad-core (together they can run 12 processes), so the cluster has 20 CPUs in total. This is why I set ntasks=20 (SBATCH section) and -p 20 (diamond blastx parameter). I was expecting srun to "distribute" the workload of the process over the available resources (the 20 CPUs mentioned above), but it does not. I monitored the job node by node, and the same process seems to be running on every CPU. Also, if I run the process on one node (nodes=1) with four threads (-p 4), it works (slowly, but it works). If I run it over more than one node, it eventually fails, killed by srun with OUT OF MEMORY, even though the -b parameter is unchanged (-b 1.0).
I hope you can help me.
Thank you
To be more precise:
On 4 machines I have an AMD G-T56N processor, where
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
This means only 2 threads can run simultaneously on each of these machines.
On 3 machines I have an AMD Embedded G-Series GX-420GI Radeon R7E, where
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
This means only 4 threads can run simultaneously on each of these machines.
The total number of threads that can run simultaneously on my cluster is 4 × 2 + 3 × 4 = 20.
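The arithmetic behind that total can be checked with a short shell sketch. The helper name threads_per_node is made up for illustration; the numbers are the lscpu values quoted above.

```shell
#!/bin/sh
# Logical CPUs on one node = sockets * cores per socket * threads per core.
# threads_per_node is a hypothetical helper, not a SLURM or lscpu command.
threads_per_node() {
    echo $(( $1 * $2 * $3 ))
}

small=$(threads_per_node 1 2 1)   # AMD G-T56N: 1 socket, 2 cores, 1 thread/core
large=$(threads_per_node 1 2 2)   # AMD GX-420GI: 1 socket, 2 cores, 2 threads/core

# 4 machines of the first kind, 3 of the second.
total=$(( 4 * small + 3 * large ))
echo "per node: ${small} and ${large}; cluster total: ${total}"
```

Note that the 20 threads are spread over 7 separate machines, so no single node offers more than 4 of them to one process.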
The number of threads is one thing, but DIAMOND requires a significant amount of RAM as well. I am not sure how large your database is, but this sounds like a difficult task for the setup you describe. I am also not sure what kind of interconnect links these machines (are you using gigabit Ethernet?), but that would be a major bottleneck too.
Note that ntasks=20 will start the search 20 times, which is not what you want.
Note also that if you have tiny machines with 4 or 8 cores, you should reduce the threads to something like 4 to see whether the jobs start running. -p 20 means 20 threads on one server!
You seem to have the misconception that SLURM will let you split a single big diamond job across many PCs. This is not the case. SLURM will let you run, say, 50 jobs on your architecture. Those that cannot run right away (no threads available) will wait until a machine becomes available.
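One way to follow this advice is to split the query file into chunks beforehand and submit one small DIAMOND job per chunk. The sketch below is only an illustration under assumptions: the file names (reference.dmnd, chunk_N.fasta), the array range and the choice of 2 CPUs per task are hypothetical, and no memory request is shown because the right value depends on your nodes and database. The diamond options used (-d, -q, -o, -p, -b) are the ordinary blastx options.

```shell
#!/bin/bash
#SBATCH --job-name=diamond-blastx
#SBATCH --nodes=1              # each job stays on one machine
#SBATCH --ntasks=1             # start DIAMOND once per job, not 20 times
#SBATCH --cpus-per-task=2      # assumption: fits even the 2-thread nodes
#SBATCH --array=0-6            # hypothetical: 7 pre-split query chunks

# Hypothetical file names; replace with your own database and chunks.
db="reference.dmnd"
chunk="${SLURM_ARRAY_TASK_ID:-0}"
query="chunk_${chunk}.fasta"
out="chunk_${chunk}.tsv"

# Use exactly the CPUs SLURM granted (falls back to 1 outside SLURM).
threads="${SLURM_CPUS_PER_TASK:-1}"

cmd="diamond blastx -d ${db} -q ${query} -o ${out} -p ${threads} -b 1.0"

# Run only if diamond and the chunk exist; otherwise just print the command.
if command -v diamond >/dev/null 2>&1 && [ -f "${query}" ]; then
    ${cmd}    # unquoted on purpose so the string splits into arguments
else
    echo "dry run: ${cmd}"
fi
```

Each array element is an independent job, so SLURM can place them on whichever nodes are free and queue the rest. Raising --cpus-per-task to 4 would restrict the jobs to the nodes that actually have 4 threads.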
Also note the RAM requirements, and experiment with sleep jobs to get a feel for how SLURM works.
Put this in your script and play with it to test SLURM.
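A minimal sleep job for this kind of experiment might look like the following sketch. It is not the script from the original reply; the job name, the time limit and the SLEEP_SECONDS variable are assumptions chosen for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=sleep-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00

# SLURM sets these variables inside a job; the fallbacks let the
# script also run in a plain shell for a quick check.
job_id="${SLURM_JOB_ID:-none}"
cpus="${SLURM_CPUS_PER_TASK:-1}"
seconds="${SLEEP_SECONDS:-3}"   # hypothetical knob, not a SLURM variable

summary="job ${job_id} on $(uname -n): ${cpus} CPU(s), sleeping ${seconds}s"
echo "${summary}"
sleep "${seconds}"
```

Submitting it several times with sbatch (for example with --export=ALL,SLEEP_SECONDS=60 so the jobs live long enough to watch) and checking squeue should show which jobs run and which wait. Raising --ntasks and launching the echo through srun should print one line per task, which is the same effect that started the DIAMOND search 20 times.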