6.4 years ago
caggtaagtat
Hi there,
The cluster at my university often makes me wait for days before my jobs move from the queue to execution. I was therefore wondering whether you have experience with AWS or other clouds for scientific purposes, and whether it's a financially reasonable alternative.
I don't need more than 10 TB of storage and only do medium-sized RNA-seq data processing, which doesn't require too much computational power.
Or would you stay with university-owned environments?
That depends on whether your lab is willing to pay an additional cost when there are free resources provided by the university, and whether the urgency of the analyses justifies that cost. Dynamic monthly costs can also complicate billing. In addition to price, there's potentially a really steep learning curve to refactor existing code to work in the cloud environment. Then there are security concerns. Do you have the resources to manage the cloud infrastructure yourself, or is your university's IT team supportive of the idea of going cloud? On paper, everything might look extremely similar: launch some machines with a scheduler, schedule jobs, and voila, the job is done. In practice, things can be quite different and demanding.
My experience with shared HPC is that jobs with 1 CPU/2 GB and a short wall time should be scheduled quicker. If your HPC is maintained correctly, most issues come from users asking for far more resources than their jobs need. You should raise your concerns with the appropriate parties and hopefully something can be done to improve things.
If you have the opportunity to explore cloud computing, I'd highly recommend doing it. It isn't going anywhere and that skill set can be helpful for future opportunities.
Maybe I could ask the IT team whether it would be possible in general. It's probably more practical to stay with the university's HPC. The HPC team sent an email a few weeks ago saying that the cluster is full because some users asked for more resources than they needed.
I guess this could also still be a consequence of the damage to the HPC's cooling system a while back. Nevertheless, I'm curious about working with something like AWS and might also try it out if the waiting periods get shorter again.
Did you check with the HPC facility about the delays?
Maybe the choice of queue, or the number of cores and amount of memory requested, is causing the HPC to schedule the job with such a delay.
It's generally long waiting times due to high demand, I guess.
These delays frequently happen with jobs that need 1 CPU and 2 GB RAM.
Maybe check the priority queue?
If your cluster uses "fair share" principles you should not need to wait for days so I will assume it does not. What scheduler does your cluster use?
The cluster uses PBSPro.
I'm no informatician, but when I worked on another university's HPC, I didn't have to execute scripts with qsub; I could also just log in to a free node and execute my scripts directly in the terminal, if that makes sense.
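For reference, the kind of small request discussed in this thread (1 CPU, 2 GB RAM, short wall time) could be written for PBS Pro roughly as below. This is only a sketch: the job name, the one-hour wall time, and the `samtools flagstat sample.bam` command are placeholders, and your site may require extra directives such as a queue or project name.

```python
def pbs_script(command, ncpus=1, mem_gb=2, walltime="01:00:00", name="small_job"):
    """Return the text of a PBS Pro job script with an explicit, small resource request."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",                                   # job name (placeholder)
        f"#PBS -l select=1:ncpus={ncpus}:mem={mem_gb}gb",    # one chunk: CPUs and memory
        f"#PBS -l walltime={walltime}",                      # keep this realistic and short
        'cd "$PBS_O_WORKDIR"',                               # start in the submission directory
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical command and file name, for illustration only.
    print(pbs_script("samtools flagstat sample.bam"))
```

Saving the printed text to a file and passing it to `qsub` would submit it. Requesting only what the job needs is what gives the scheduler a chance to fit it into gaps.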
You should ask the IT admins why your jobs pend that long. Perhaps something is set up incorrectly and your account has been given low priority. In general, on shared compute infrastructure all users should have the same basic priority. So a user starting 5 jobs should have them start reasonably soon compared to someone who submits 1000 at one time.
OK, thank you. I will wait and see if the situation improves by itself any time soon and then talk with my university's HPC team. Since you mention it, it may be that I have lower priority: the IT admins told me that people from the university's medical department get needlessly throttled in download/upload speed, apparently because some other department decided this. There is already a collective complaint on its way, but formal matters at the university tend to take forever.
I've interacted with a few HPC teams and they usually are sympathetic to users. Most HPC teams are being constrained by university policy/funding shortages as well, and building a relationship with the team will always work out in your favor.
Yes, they are great and helped me a lot! They also arranged the collective complaint to change the restrictions for medical institutions on the HPC.
I really hate it when people do that. Just because it's free right now doesn't mean that it's free five minutes from now. SGE wouldn't see your work though, so our jobs would end up competing for resources. It's OK for very small stuff. Everything else, absolutely NO!
Agreed. The line to be drawn involves a bit of trial and error for a few tasks though. For example, I've run tar jobs on both a login node + screen as well as a job. One has to estimate the amount of time and resources required and take a call based on that.
Is it possible to change a job after it has been submitted to a PBS queue? If yes, you could set up e.g. crontab to submit an echo "hello" job every few hours. Then whenever you need to run something you could modify the submitted job that is next up. A nice (pun sort of) admin wouldn't bother you about it :)
Usually, HPC systems allow you to change most operational parameters except the actual job script and, in some cases, the wall time. Even if they do allow you to change wall time after submission, in all probability you cannot change it once the job starts.
Admins can add/change the wall time. I have had to do that a few times with SLURM.
True, admins can do most stuff - I'm referring to user level permissions :-)
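On PBS, the user-level command for changing attributes of a queued job is `qalter`. A minimal sketch of building such a call is below; the job ID is made up, and whether your site lets ordinary users change wall time (or anything else) this way is a local policy question.

```python
import shlex


def qalter_walltime(job_id, walltime):
    """Build the argument list for changing the requested wall time of a queued PBS job."""
    return ["qalter", "-l", f"walltime={walltime}", job_id]


if __name__ == "__main__":
    # "12345.pbsserver" is a hypothetical job ID; use the one qsub printed for your job.
    cmd = qalter_walltime("12345.pbsserver", "02:00:00")
    print(shlex.join(cmd))
    # To actually run it on a login node: subprocess.run(cmd, check=True)
```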
AWS usually does $100/TB/month I think, so storage would end up costing you a lot of money.
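To put that in perspective for the 10 TB mentioned in the question, here is the arithmetic as a small sketch. The $100/TB/month rate is just the figure quoted above and is not verified; real prices depend on the provider, storage class, and region, and data transfer is billed separately.

```python
def storage_cost(tb, usd_per_tb_month, months=12):
    """Return (monthly, total) storage cost in USD for a flat per-TB monthly rate."""
    monthly = tb * usd_per_tb_month
    return monthly, monthly * months


if __name__ == "__main__":
    # 10 TB at the assumed rate of $100/TB/month, over one year.
    monthly, yearly = storage_cost(10, 100)
    print(f"${monthly:,.0f} per month, ${yearly:,.0f} per year")
    # prints: $1,000 per month, $12,000 per year
```

Swapping in the rate from your provider's current price list gives a quick first estimate before compute and transfer costs are added.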
OK, it's probably not very wise to switch then, just for occasionally faster job execution.
Yes, unless you've thought out all the details. Cloud AFAIK has a ton of hidden costs and needs an expert to manage infrastructure allocations/requests.
OK, definitely staying with the university HPC then :)
Did you look into interactive mode? This is what you did in the past when you logged in to the HPC and then onto a single node. You can use interactive mode to log in to a node and run commands there.
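With PBS Pro, an interactive session is requested with `qsub -I` plus the usual resource options. A minimal sketch of assembling that command follows; the resource values are assumptions matching the small jobs discussed in this thread, and your site may need a queue or project option as well.

```python
import shlex


def interactive_qsub(ncpus=1, mem_gb=2, walltime="01:00:00"):
    """Build the argument list for requesting an interactive PBS Pro session."""
    return [
        "qsub", "-I",                                     # -I asks for an interactive job
        "-l", f"select=1:ncpus={ncpus}:mem={mem_gb}gb",   # one chunk with CPUs and memory
        "-l", f"walltime={walltime}",                     # session length limit
    ]


if __name__ == "__main__":
    # Prints the command to type on a login node.
    print(shlex.join(interactive_qsub()))
```

The session still goes through the queue, so a small request and a short wall time usually help it start sooner.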
Yes, I sometimes work in interactive mode, but it usually takes some time to be able to log in, so I just submit jobs instead.
In that case, talking to the HPC staff about your scheduling problem might help.
Yeah, maybe this is connected to the throttling of access from medical facilities.
Might very well be possible. The HPC staff can clarify this.