We have a handful of large servers that are shared by a number of users for NGS data analysis. Currently there is no scheduling system set up on these machines. Users are generally expected to 'play nice' with others on a particular system.
We are looking for ways to better monitor the usage of these machines. Specifically, we would like to know both short-term usage:
- UserA is consuming 50% of the CPU power on BoxB right now
And also (perhaps more importantly) long-term usage trends:
- In the past 6 months, on average we are using 20% of BoxB's CPU power and 40% of its RAM. UserC is the top CPU user at 50% total usage.
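To make the short-term case concrete, here is a minimal sketch of the kind of per-user snapshot I have in mind, assuming a Linux box where `ps -eo user,pcpu,pmem` is available. The function names `parse_ps` and `snapshot` are made up for illustration. Note that `ps` reports %CPU averaged over each process's lifetime and per core, so the sums are only a rough indicator and can exceed 100 on a multi-core machine.

```python
import subprocess
from collections import defaultdict


def parse_ps(text):
    """Sum %CPU and %MEM per user from `ps -eo user,pcpu,pmem` output."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for line in text.strip().splitlines()[1:]:  # first line is the header
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "user cpu mem"
        user, cpu, mem = parts
        totals[user][0] += float(cpu)
        totals[user][1] += float(mem)
    return {u: (round(c, 1), round(m, 1)) for u, (c, m) in totals.items()}


def snapshot():
    """Run ps once and return {user: (total %CPU, total %MEM)}."""
    out = subprocess.run(
        ["ps", "-eo", "user,pcpu,pmem"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

Calling `snapshot()` every few minutes (for example from cron) and appending the result to a log would give the raw data for the long-term question as well.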
How are other groups answering these monitoring questions?
I saw a few related BioStar questions:
- BAS costs: http://biostar.stackexchange.com/questions/16129/big-ass-servers-storage
- Blast Machine specs: http://biostar.stackexchange.com/questions/9782/machine-spec-for-running-a-blast-service-for-50-users/9798
- NGS workstations: http://biostar.stackexchange.com/questions/8246/workstations-for-ngs-analysis/8250
But so far, nothing on how these resources are monitored.
We have Ganglia and Nagios set up for these servers - but nobody has yet taken the time to configure them much past their defaults.
Are other groups using Graphite? Is it worth investigating over Ganglia? (The Etsy programmers sure make it sound good).
These tools also don't seem to be very good for long-term trends (please correct me if I'm wrong). What are people using to monitor resources to know when processing power is being saturated?
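For the long-term numbers, something as simple as the following sketch might be enough. It assumes a hypothetical log with one `timestamp,user,cpu` row per user per sampling instant (the format and the function name `summarize` are my own invention, not something Ganglia or Nagios produces), and it reports the mean total %CPU per sample plus each user's share of all CPU consumed.

```python
import csv
import io
from collections import defaultdict


def summarize(csv_text):
    """Return (mean total %CPU per sample, {user: share of all CPU used})."""
    per_user = defaultdict(float)
    timestamps = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # ignore blank or malformed rows
        ts, user, cpu = row
        timestamps.add(ts)
        per_user[user] += float(cpu)
    total = sum(per_user.values())
    # Average over sampling instants, so idle users count as zero.
    mean_total = total / len(timestamps) if timestamps else 0.0
    shares = {u: c / total for u, c in per_user.items()} if total else {}
    return mean_total, shares
```

The mean here is in per-core percent, so it would need dividing by the number of cores to be read as a fraction of the whole machine.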
Thanks!
Do you use a batch job submission system like PBS (Maui)?
We do not use any batch job submission. Each user logs in and runs code in an unscheduled and unrestricted manner.
Also, just to clarify, I'm not talking about a cluster - but individual servers (the BAS concept mentioned in one of the related posts and discussed here: http://jermdemo.blogspot.com/2011/06/big-ass-servers-and-myths-of-clusters.html).
There is a clear need for apps that sit in the spectrum between "top" and "valgrind". Unix is much better at dealing with CPU contention than memory contention, and yet memory stats are often totally inaccurate.
This question is somewhat borderline for BioStar; more sysadmin than bioinformatics. You might want to ask at http://serverfault.com/.