Is anyone using Prometheus and Grafana to collect and display Slurm statistics on their HPC cluster?
We have a Graphite + Grafana setup where we collect Slurm data. We gather the Slurm metrics with slurm_exporter.py, which we adapted from a slurm.py script made by stanford-rc.
We looked into a Prometheus + Grafana setup last year, but we preferred Graphite's ability to downsample data, which suits our two-year retention period.
Edit: included screenshots
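For anyone curious what that kind of collector looks like, here is a minimal sketch of the general approach: shell out to squeue for a couple of job counts and push them to Graphite's Carbon plaintext interface. The hostname, port, metric prefix, and the specific metrics are my own placeholder assumptions, not necessarily what the adapted slurm_exporter.py does.

```python
#!/usr/bin/env python3
"""Sketch: push a few Slurm job counts to Graphite via the Carbon
plaintext protocol. Host, port, and metric prefix are assumptions."""
import socket
import subprocess
import time

CARBON_HOST = "graphite.example.com"  # placeholder hostname
CARBON_PORT = 2003                    # Carbon plaintext protocol default port
PREFIX = "hpc.slurm"                  # arbitrary metric namespace


def count_jobs(state):
    """Count jobs in a given state using squeue."""
    out = subprocess.run(
        ["squeue", "--noheader", "--states", state, "--format", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())


def main():
    now = int(time.time())
    lines = [
        f"{PREFIX}.jobs.running {count_jobs('running')} {now}",
        f"{PREFIX}.jobs.pending {count_jobs('pending')} {now}",
    ]
    # Carbon plaintext protocol: one "<path> <value> <timestamp>" per line.
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(("\n".join(lines) + "\n").encode())


if __name__ == "__main__":
    main()
```

A cron job or systemd timer running something like this every minute is usually enough for cluster-level job counts.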
I have set up a proof-of-concept system but haven't gotten much buy-in from my own co-workers.
It is an internal-only web service; I could post some screenshots.
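For reference, a proof of concept on the Prometheus side can be as small as a Python exporter built on the prometheus_client library, which Prometheus then scrapes and Grafana graphs. The port, metric names, and the idea of shelling out to squeue below are all my own illustrative assumptions, not a description of the poster's actual setup.

```python
#!/usr/bin/env python3
"""Sketch of a tiny Slurm exporter for Prometheus using prometheus_client.
Port and metric names are illustrative assumptions."""
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Gauge labelled by job state; the name and labels are placeholders.
SLURM_JOBS = Gauge("slurm_jobs", "Number of Slurm jobs by state", ["state"])


def collect():
    """Refresh job counts from squeue."""
    for state in ("running", "pending"):
        out = subprocess.run(
            ["squeue", "--noheader", "--states", state, "--format", "%i"],
            capture_output=True, text=True, check=True,
        ).stdout
        SLURM_JOBS.labels(state=state).set(len(out.splitlines()))


if __name__ == "__main__":
    start_http_server(9341)  # arbitrary port; point a Prometheus scrape job here
    while True:
        collect()
        time.sleep(30)       # refresh interval between scrapes
```

Pointing Prometheus at that endpoint is then just a standard static scrape job, and Grafana uses Prometheus as a data source.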
Do you have a link so I can see what it looks like?
Thanks, Ed
Sure, that would be useful.