It occurred to me, as I was working on a task the other day, that there must be many tools in the HPC ecosystem whose sole purpose is to make HPC life easier. I think Ask.CI is a good place for community members to share this inventory; even 10 posts could result in a useful reference catalogue of tips and tricks to boost our efficiency! If we stick to open-source software, ideally hosted in a GitHub repository, we'd keep everything as accessible as possible.
This is a really good question! There are several tools that I’ve found useful over the years, specifically working on a SLURM cluster.
job-maker is a small interface for generating submission scripts that a center can deploy, customized to its slurm.conf. A user can actually generate it too, because slurm.conf has to be readable by all nodes and is therefore readable by the user.
smanage.sh, created by Eric Surface from Harvard, is a small bash script that helps with managing job arrays (checking status, submitting, etc.).
sampler is a really fun way to build a terminal dashboard for monitoring the output of shell commands.
watchme offers a command-line client, Python decorators, and general functions for monitoring or running tasks.
singularity-compose is a way to run container orchestration (more for services) if your center doesn’t have Open OnDemand or similar, and of course Singularity containers are a big part of that!
I wouldn’t tend to use “HPC” and “easy” in the same sentence, but maybe the tools above can make it “less painful.”
We wrote some great user tools for sites running Slurm:
jobstats - makes it easy for users to see the status of their jobs and how much of the requested resources was actually used. We use this to help users create more efficient jobs.
– https://github.com/nauhpc/jobstats
doppler - a complementary web app to jobstats that shows job efficiency and resource wastage by user and by account
– https://github.com/nauhpc/doppler
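To illustrate the kind of "requested vs. actually used" comparison these tools report, here is a minimal Python sketch. It is not code from jobstats or doppler; the `JobUsage` record, its field names, and the sample numbers are all hypothetical, and a real site would fill them in from its scheduler's accounting data.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Requested vs. used resources for one finished job (hypothetical record)."""
    job_id: str
    cores: int                # CPU cores requested
    elapsed_seconds: float    # wall-clock run time
    cpu_seconds: float        # total CPU time consumed across all cores
    mem_requested_mb: float   # memory requested
    mem_peak_mb: float        # peak memory actually used


def cpu_efficiency(job: JobUsage) -> float:
    """Fraction of the reserved core-seconds that were actually used."""
    reserved = job.cores * job.elapsed_seconds
    return job.cpu_seconds / reserved if reserved > 0 else 0.0


def memory_efficiency(job: JobUsage) -> float:
    """Fraction of the requested memory that was actually used at peak."""
    if job.mem_requested_mb <= 0:
        return 0.0
    return job.mem_peak_mb / job.mem_requested_mb


# Made-up example: 8 cores for one hour, but only 2 core-hours of CPU time used.
job = JobUsage(
    job_id="12345",
    cores=8,
    elapsed_seconds=3600,
    cpu_seconds=7200,
    mem_requested_mb=16000,
    mem_peak_mb=4000,
)

print(f"job {job.job_id}: CPU {cpu_efficiency(job):.0%}, memory {memory_efficiency(job):.0%}")
```

Low numbers like these suggest the job could have asked for fewer cores and less memory, which is the sort of feedback that helps users write more efficient submissions.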
We use several custom in-house bash and Python scripts to make administration easier. For job management and reporting we use XDMoD. And we are about to put the Open OnDemand portal into production.
I would be very interested in working together to compile these results into a list organized by tool “class” or “tags” somehow. I’m not sure if that would be best as a spreadsheet, bullet list, or something else.
Although I’m new to HPC, here are some tools I use that might be relevant:
I agree that compiling the results listed here by tag would be a very useful way to represent the information. We can add a framework for this in the “resources” section of the Northeast Cyberteam site, which is in the early stages of development. I will update here when it is in place.