I know of individual research support staff who have experience optimizing code, or who have proactively reached out to specific users to do so, and it seems like something that should be easy to provide as a service - e.g., the user adds a flag to some job, the job runs as normal but with some recording of overall usage metrics, the metrics get sent to support staff to analyze, and a report (and possibly suggestions, assuming the user is willing to share code after some back and forth) is returned.
Is anyone doing this? If not, why not, and how could we automate it? Assume we have a large group of users, and some cluster resource with support staff and a job manager. What tools could be used, and how might it work?
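Purely to make the "add a flag" part concrete, I'm imagining something like the wrapper below (the drop directory, the sampling interval, and the script name are all made up for illustration - this isn't an existing tool):

```bash
#!/usr/bin/env bash
# profile_and_run.sh - hypothetical sketch of the opt-in profiling idea:
# wrap the user's normal job command, record system-level metrics with sar
# while it runs, and drop the log somewhere support staff can read it.

DROPBOX=/shared/profiling/$USER              # assumed staff-readable location
mkdir -p "$DROPBOX"
LOG="$DROPBOX/${SLURM_JOB_ID:-$$}.sar"

# Sample CPU, memory, disk and network activity every 10 seconds (binary sar file).
sar -o "$LOG" 10 >/dev/null 2>&1 &
SAR_PID=$!

# Run the user's actual command exactly as they would have without profiling.
"$@"
STATUS=$?

kill "$SAR_PID" 2>/dev/null || true

# Leave a human-readable summary (CPU, memory, block IO) next to the raw file.
sar -f "$LOG" -u -r -b > "${LOG%.sar}.txt" || true

exit "$STATUS"
```

Usage would be something like wrapping the job's command line (e.g. `./profile_and_run.sh ./my_app input.dat` inside the batch script); a scheduler-integrated version could hang the same recording off a submit flag or an epilog instead of a wrapper.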
Just brainstorming here, but maybe sysadmins could create a post-execution script that returns some data on usage on all nodes, such as the output of “sar” in Linux. I’ve used that before to determine what is happening with CPU, RAM, etc. during different phases of a code run, though it is a bit tedious to stitch all the information together. Your post has me thinking, as I’ve wanted to profile some of what gets run to determine whether we can optimize how the nodes are running.
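Something like this could be a rough sketch of that post-execution idea, run as a Slurm Epilog on each node (that placement, the spool directory, and the reliance on sacct accounting are my assumptions):

```bash
#!/usr/bin/env bash
# Rough sketch: run as a Slurm Epilog on each node of a finished job, pull the
# sar records covering the job's wall-clock window out of the daily sysstat
# file, and stash them per job for later analysis.

OUTDIR=/var/spool/job_profiles            # assumed collection directory
mkdir -p "$OUTDIR"

# Slurm exports SLURM_JOB_ID in the epilog environment; get the job's start and
# end times from sacct (assumes accounting is enabled).
read -r START END < <(sacct -j "$SLURM_JOB_ID" -X -n -P -o Start,End | tr '|' ' ')

# sar's -s/-e options take HH:MM:SS; this simple version assumes the job
# started and finished on the same day.
sar -u -r -b -n DEV \
    -s "${START#*T}" -e "${END#*T}" \
    > "$OUTDIR/${SLURM_JOB_ID}.$(hostname -s).txt" 2>/dev/null || true
```

Stitching the per-node files together into one report would be the tedious part, as you say.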
I’m frustrated about this topic because, unlike a lot of things, I’m not actually empowered to do or try anything. It comes down to a game of trying to convince those who are, and I always lose. Does anyone have a cluster resource that is open to experimenting / trying new things that I can help with?
I had a quick discussion with folks from my team, and while we don’t have the bandwidth to develop this, I wanted to share what we know in the hope that someone with the interest and time could run with the idea!
Technically, if nodes are shared between jobs, profiling becomes a bit more complicated, but not impossible. We also have to take the different filesystems into account. For the major filesystems, the collection should be done on the Lustre server side, using the job statistics (jobstats) that Lustre provides.
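For reference, the server-side pieces look roughly like this (the filesystem name is a placeholder, and exact parameter names can vary between Lustre versions, so treat this as a pointer rather than a recipe):

```bash
# On the MGS: tag every request with the Slurm job ID so the servers can
# aggregate statistics per job ("myfs" is a placeholder filesystem name).
lctl conf_param myfs.sys.jobid_var=SLURM_JOB_ID

# On an OSS: per-job read/write statistics for each OST.
lctl get_param obdfilter.*.job_stats

# On an MDS: per-job metadata operation counts.
lctl get_param mdt.*.job_stats
```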
On the backend, we would need (1) a scalable way to gather these metrics from all Lustre servers (a few tens of servers max, so no big deal), and then (2) feed a database with the data.
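A back-of-the-envelope sketch of step (1) might look like the following; the server names, the ssh access, and the spool path are placeholders, and step (2) would be a separate parser that walks the spool and inserts the per-job counters into whatever database is chosen:

```bash
#!/usr/bin/env bash
# Poll each Lustre server for its job_stats and drop the raw snapshots into a
# spool directory for a later parsing/ingest stage. Names and paths are
# illustrative only.

SERVERS=(oss01 oss02 mds01)               # hypothetical server names
SPOOL=/srv/jobstats-spool
mkdir -p "$SPOOL"

stamp=$(date +%s)
for host in "${SERVERS[@]}"; do
  # job_stats comes back as YAML-ish text; keep it verbatim per host.
  ssh "$host" "lctl get_param obdfilter.*.job_stats mdt.*.job_stats 2>/dev/null" \
      > "$SPOOL/$stamp.$host.yaml" &
done
wait
```

At a few tens of servers, running something like this from cron every few minutes would probably be enough.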
On the frontend, we need an SSO-enabled web interface to display the data per group/user, likely controlled by Stanford Workgroup (or, in your case, whatever identity management / groups your institution uses).
A user could then see their job execution time and all I/O stats performed during the job. That’s the general idea. We likely need a small (virtualized) Slurm+Lustre test platform to develop this.
There is a variety of profiling and optimization tools that could be brought to bear on this kind of problem, and I’ve used a lot of them, but what I mostly see is not application optimization bottlenecks such as in the link you provided, but application bottlenecks due to simple IO - often caused by specifying the wrong device for temp files or not pre-staging data to an appropriate device (running tiny but heavy IO back to the main cluster filesystem instead of using /dev/shm or a local NVMe filesystem). The other problem I see quite often is interfering resource usage, where one user’s app hoses a multi-CPU server, or indeed the entire cluster’s filesystem, by unthinkingly launching their app with default options (which is not surprising, since there’s no official support for teaching any of this to incoming grad students).
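To make the first point concrete, the fix is usually as small as this (paths and the app name are placeholders - the point is node-local temp space and a one-time staging copy):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=io_example
# (1) temp files on a node-local device and (2) input staged once, instead of
# hammering the shared filesystem with tiny IO for the whole run.

# 1) Temp files go to node-local NVMe (or /dev/shm for small, RAM-sized scratch).
export TMPDIR=/scratch/local/$SLURM_JOB_ID
mkdir -p "$TMPDIR"

# 2) Pre-stage the heavily-read input once and work on the local copy.
cp /cluster/project/mydata/input.dat "$TMPDIR/"

./my_app --input "$TMPDIR/input.dat" --tmp "$TMPDIR"

# Copy results back in one pass, then clean up the local scratch.
cp "$TMPDIR"/results.* /cluster/project/mydata/ 2>/dev/null || true
rm -rf "$TMPDIR"
```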
I’ve come up with profilemyjobs, an unholy goulash of bash and gnuplot that allows users to record long-running processes and visualize them at the same time. It records about 30 performance parameters (not all visualized at once) and plots them on the same screen, so you can see which ones interact without switching screens back and forth - a failing of some of the more sophisticated tools available like PCP, XDMoD, etc. It also logs those variables and autogenerates the gnuplot file so you can review the plot afterwards. There’s a separate gnuplot pane for detailed disk IO if wanted, and a wrapper called pmj to submit the whole thing as a batch job (obviously not for same-time viz).
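If anyone wants the flavor of the approach without reading the real thing, here is a stripped-down illustration of the same log-and-autogenerate-gnuplot pattern - this is not profilemyjobs itself, and the counters, interval, and file names are arbitrary:

```bash
#!/usr/bin/env bash
# Sample a few counters at a coarse interval, log them to a TSV, and emit a
# gnuplot script so the run can be replotted after the fact.

LOG=profile.$(hostname -s).$$.tsv
printf '#epoch\tcpu_idle\tmem_avail_kb\tload1\n' > "$LOG"

( while sleep 5; do
    idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')          # %idle over 1s
    memavail=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)  # kB
    load1=$(cut -d' ' -f1 /proc/loadavg)
    printf '%s\t%s\t%s\t%s\n' "$(date +%s)" "$idle" "$memavail" "$load1" >> "$LOG"
  done ) &
SAMPLER=$!
trap 'kill $SAMPLER 2>/dev/null' EXIT

# Autogenerate a gnuplot file so the log can be reviewed afterwards.
cat > "${LOG%.tsv}.gp" <<EOF
set xdata time
set timefmt "%s"
set format x "%H:%M"
plot "$LOG" using 1:2 with lines title "cpu idle %", "$LOG" using 1:4 with lines title "load1"
EOF

"$@"    # the long-running command being observed
```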
It was an experiment to see how easy it would be to use bash instead of a real programming language to do something useful. I won’t do that again. But it is straight bash, so it ought to be quite portable. Not meant for nanosecond profiling, but for minutes/hours/days profiling.
I came to depend on it for cluster admin as well as debugging overloaded apps, slow IO, memory overflows/constrictions, bad SGE params, etc.