Cluster Monitoring Tools

shmget · March 15, 2021, 2:04am

We’ve decided to re-evaluate our cluster monitoring tools and are wondering what current (and new) tools are out there. Some have been around for a long time, like Nagios. Others, such as Ganglia, monitor a cluster so users can see how nodes are performing. What other tools do you use in your infrastructure to monitor HPC cluster health? Are there better tools than the ones I mentioned?

Does anyone have experience with newer tools? If so, do they offer a big enough improvement to warrant a change, or is it best to wait and implement them on a new machine? Are there new approaches to the whole monitoring challenge?

Thank you!

dchin · May 11, 2021, 3:48pm

Re Ganglia: the project does not have any active maintainers, as of late October 2019.

jkhilmer · March 29, 2021, 7:03pm

Nagios and Ganglia both have their place, but I really like InfluxDB. So many components of HPC systems or workflows are specific to your environment, and it takes very little effort to get data into a time-series DB and start using it.
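To give a rough idea of how little is involved, here is a minimal sketch that formats a sample as InfluxDB line protocol (`measurement,tag=value field=value timestamp`). The measurement, tag, and field names (`node_load`, `host`, `partition`, `load1`, `jobs`) are made up for illustration, and the sketch only builds the text line; actually sending it to a server is left out because that depends on your InfluxDB version and setup.

```python
import time


def _escape(text, chars):
    """Backslash-escape each character in `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix in line protocol
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; escape backslashes first, then quotes.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line-protocol record (hypothetical helper, not a library API)."""
    if not fields:
        raise ValueError("line protocol needs at least one field")
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()  # nanosecond precision timestamp
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(val)
        for key, val in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line_protocol(
        "node_load",
        {"host": "node001", "partition": "gpu"},
        {"load1": 0.75, "jobs": 3},
        1616000000000000000,
    ))
```

Run as a script, this prints `node_load,host=node001,partition=gpu load1=0.75,jobs=3i 1616000000000000000`. In practice an agent such as Telegraf or a client library would usually handle the formatting and batching for you.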

BobFreemanMA · March 29, 2021, 9:31pm

There are multiple ways to answer this…

It’s not clear that Nagios/Ganglia will help you at the application level or above, so scheduler and application health will be absent. Do users care about this? Or only infrastructure health?

I’ll add that this is a great question, and something that we are also considering. Since our site uses LSF, RTM (real-time monitoring), also an IBM product, integrates with the scheduler with minimal extra work. We also get some of the same hardware info that Ganglia/Nagios would report (e.g. disk space). When we transition to Slurm, we’ll need to re-evaluate this, so I’m very interested in what people are using.

Best,
Bob

Mike_Renfro · March 29, 2021, 9:53pm

Depends on the level you want to monitor at. On our current list:

XDMoD (historical performance of jobs/queues; invaluable for ROI measurements and for demonstrating who got which resources and when)
Slurm-web (current state of jobs/queues, reservations, node consumption)
Ganglia (node-level performance and history, not real-time)
Netdata (real-time node-level performance, not much history)

Later, for the next cluster, and/or a rebuild of the current one on OpenHPC:

XDMoD SUPReMM (for job-level performance)
Probably a useful selection of Prometheus, Telegraf, Influx, and Grafana, not sure of the scope just yet. I seem to recall some plugins that could do some of what XDMoD or Slurm-web have been doing for us.