HPC job schedulers: Community needs & wishes

Hi folks,

I want to stir up a potentially opinionated discussion here. In mid-February there were two related threads on Twitter about popular HPC schedulers:

It looks like many people are using SLURM (the most popular scheduler today) either by choice or by coercion (i.e., no choice). What prompted me to post here is this comment: “Despite all of these shortcomings, it’s our current HPC scheduler of choice and we are working to try to remedy some of these defects by selective bug reporting and building our own interfaces where needed (along with others for preference). Community work encouraged here!” (ref).

I am not too plugged in to systems-facing communities [like the CaRCC systems-facing network, HPCSYSPROS, …], so bear with me if job schedulers have already been discussed here in the past. This looks like an interesting discussion topic for this network. In particular: (1) what do people in the community [users and syspros] commonly want from a job scheduler, and (2) how can the community [syspros] band together to provide useful tools for our common good in a sustainable way?

(Cross-posting: originally posted on the CaRCC systems-facing forum – https://carcc.slack.com/archives/CFNABPDFZ/p1613400781011900)

I engaged very briefly with one of the major job schedulers, asking whether a more user-friendly API for customization and development would be possible, but the response was fairly discouraging. It’s similar to how things went with containers / Linux virtualization - the technology didn’t really empower people until Docker came along. We don’t have a Docker for schedulers yet. I’d say it’s a bit of a monopoly, but there are other schedulers out there with a more open-source, developer-friendly attitude - check out Flux (flux-framework on GitHub) and Kraken (hpc/kraken on GitHub, a distributed state engine for scalable system boot and automation), though for the latter I think the use case is not exactly the same.

I have experience with Torque/PBS Pro/HTCondor and Slurm.

We have standardized on Slurm in our department. It has performed well: fast to schedule and very robust.

I do wish Slurm had better directives for submitting jobs, particularly for high-throughput/embarrassingly parallel workloads (it’s understandably very MPI-focused). For example:

  • Variables in the #SBATCH directives for more flexible sbatch scripts

  • Better array semantics (similar to HTCondor). Currently users have to map $SLURM_ARRAY_TASK_ID to something useful themselves (see the sketch after this list). It would be fantastic if arrays could be automatically generated from directives like:

#SBATCH --array=arguments.csv
#SBATCH --array=$DATADIR/*.dat

  • Clearer method(s) of running (non-MPI) subjobs across multiple cores/nodes within a single jobid than srun and job steps.

  • Tighter Singularity integration
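
For context on the array point above, here is a minimal sketch of today’s typical workaround (note that #SBATCH lines are parsed by sbatch, not the shell, so variables and globs can’t be expanded in them). The input file arguments.csv and the program ./my_analysis are placeholder names:

#!/bin/bash
#SBATCH --array=1-100        # one task per line of arguments.csv
#SBATCH --cpus-per-task=1

# Map this array task's index to the corresponding line of the input file.
ARG=$(sed -n "${SLURM_ARRAY_TASK_ID}p" arguments.csv)

# Run the per-task work with that argument.
./my_analysis "$ARG"

This works, but every group ends up reinventing the same indexing glue, which is the pain point behind the wished-for --array=arguments.csv directive.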

I think the question is very broad, perhaps too broad for a useful answer. Most of the ‘HPC’ clusters I know of are now doing far less traditional HPC and much more what would be considered ‘HTC’.

A scheduler/job manager that is good for tightly coupled MPI code running on thousands of nodes might not be such a great choice for a cluster that largely runs one-core jobs lasting less than four hours.

So perhaps a better approach would be to describe what kinds of jobs your users are bringing to your systems, and then ask what would be a good setup to meet their needs?

We use Slurm here, but it has its drawbacks. As Ben pointed out, it is MPI-focused, but many of its most notable features go largely unused and unneeded here. We’ve had some issues where, for example, its affinity plugin may be causing more problems than it solves.

Most of our users do not want advanced features; they want simple operation. Those who want advanced features are more likely to try to get them from mpirun. If the configuration can be kept simple, Slurm can certainly start a lot of jobs fast.

About half our jobs are one core on one node. More than 3/4 of jobs are single node, or should be. Only a few of our jobs require 64 cores or more. Many people blindly ask for core counts that don’t map evenly onto whole nodes (some want process counts that are powers of two, but most don’t, I think).

We have people who use launcher from TACC as a kind of scheduler-within-a-scheduler, though I think launcher may be a bit of an orphan now. There are two PRs open, one from a year ago and one from half a year ago, that don’t seem to have been reviewed. It seems to Just Work [TM], and in many ways it is easier for people to get started with than, say, GNU parallel.
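
For comparison, here is a minimal single-node sketch of the GNU parallel route inside a Slurm job; process_one.sh and $DATADIR are placeholder names, and spanning multiple nodes would need srun or parallel’s --sshloginfile, which is where launcher tends to feel simpler:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Run one input file per allocated core, keeping all cores busy until the list is done.
parallel --jobs "$SLURM_CPUS_PER_TASK" ./process_one.sh {} ::: "$DATADIR"/*.dat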

Like Ben, I’ve worked with HTCondor, and I think that for high-throughput jobs it might be a good choice. Many places use HTCondor and give it access to submit jobs to their Slurm clusters. My experience has been mostly with OSG, and many of our HPC users would be impatient with the lag between submitting a job to HTCondor and its start. That may be somewhat tunable in the configuration, but it’s really designed for cranking thousands of jobs through, not for providing immediate gratification from the first job to start. :wink:
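
As a point of reference for the array-semantics wish above, HTCondor’s submit files handle this kind of fan-out natively. A minimal sketch, where process_one and inputs.txt are placeholder names:

universe     = vanilla
executable   = process_one
arguments    = $(infile)
request_cpus = 1
output       = out/$(Process).out
error        = out/$(Process).err
log          = jobs.log

queue infile from inputs.txt

Each line of inputs.txt becomes one job, with no index-to-argument bookkeeping in the job script itself.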

I don’t see HTCondor and Slurm as mutually exclusive, either. They seem designed for different things, and some places are able to use both in different contexts (Wisconsin is a good example). There is additional overhead in maintaining and monitoring two systems, but it’s perhaps worth considering. Your engineers might prefer Slurm, whereas your sequencing groups might prefer HTCondor.

I would not want to be on a framing crew with a cooper’s hammer, and I wouldn’t want to be setting barrel staves with a framer’s hammer. Maybe evaluating the tool can only be done when it is clear to which purpose it will be put?
