I'd like to use Galaxy for my cluster pipelines. It should make it easier for less tech-savvy team members to run pipelines.
It looks like Galaxy starts ALL processes inside a Python wrapper (when running top I see python instead of bwa). Will this be a speed issue? Speed is important for me, and I need to use all (many) threads effectively.
Why oh why does Galaxy start things in python wrappers? Will this hurt my speed?
Additional data:
I'm currently doing tests myself and have searched for this question. I apologize if I missed the answer. I also know that Galaxy duplicates intermediate data, but HDD reads aren't a bottleneck for me so this is no problem and I'll automate the deletions later. This question is CPU targeted.
Galaxy does not run everything in Python wrappers. Most of the wrappers are bash-like scripts.
A few of them are Python, but this is not a speed limitation. All these wrappers do is abstract the inputs and outputs (temp files etc.).
In that case the program is usually invoked through subprocess, so it runs as its own process and there are no speed issues.
Btw, deletion of intermediate data can also be handled by Galaxy, so you do not need to take care of it yourself.
Thanks, that's good to know and saves me a lot of time. I'm glad that this is the case. It makes much more sense! What also confused me is the "load balancing" documentation, which makes it sound like an issue. They must be talking about 100+ users at a time.
Correct. The Galaxy application itself is subject to the Python Global Interpreter Lock. You can bypass this by specifying multiple instances. It's a little tricky but doable, and you won't need to do it until you regularly have multiple simultaneous users.
The main Galaxy server gets a lot of use, so I would consider it slow because of the number of users.
This is why some institutions set up their own Galaxy mirror (where user access can be limited, decreasing the total number of users). If you had a local mirror, you could benchmark NGS tasks and definitely see a difference. I wouldn't consider speed a problem for a mirror installation.
I've deployed a local installation of Galaxy on a cluster. If you examine the Python wrappers carefully, you'll see that they're constructing and then executing a command line. Thus the tools they're wrapping are not subject to the Python Global Interpreter Lock. Galaxy won't run tools any slower than they would run on a pure command-line execution if you're submitting jobs to a cluster.
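To illustrate the point about wrappers building and then executing a command line, here is a minimal sketch. It is not Galaxy's actual wrapper code: `build_command` and `run_tool` are hypothetical helper names, and a child Python interpreter stands in for a real tool such as bwa so the example runs anywhere. The child reports its own PID, which shows that it runs as a separate OS process and is therefore not held back by the wrapper's interpreter lock.

```python
import os
import subprocess
import sys


def build_command(tool, args):
    """Assemble the argument list a wrapper would hand to the OS (hypothetical helper)."""
    return [tool, *args]


def run_tool(cmd):
    """Run the tool as a child process and return its stripped stdout (hypothetical helper)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Stand-in for a real tool such as bwa: a child interpreter that prints its own PID.
cmd = build_command(sys.executable, ["-c", "import os; print(os.getpid())"])
child_pid = int(run_tool(cmd))
wrapper_pid = os.getpid()

# The two PIDs differ: the tool is its own process with its own threads.
print("wrapper PID:", wrapper_pid)
print("tool PID:   ", child_pid)
```

While the tool runs, the wrapper simply waits for it to exit, so a multithreaded tool can use as many cores as it is given.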
Galaxy also includes scripts to automatically delete datasets according to parameters you specify. More information here.