Confusion on QOSMaxWallDurationPerJobLimit

mtrahan · April 22, 2022, 7:54pm

I’ve been trying to get an interactive job running on my local cluster, and I keep getting a (QOSMaxWallDurationPerJobLimit) error while waiting for my job allocation. Does anyone know what this means and how I can fix it?

bryank · April 25, 2022, 12:40pm

Slurm has many ways of enforcing usage limits, and QOS (Quality of Service) is one of them. This particular reason means the job's time limit is larger than the maximum wall time per job that the QOS allows, so you most likely just need to request a time limit (--time, or -t) that fits within it. To find the maximum allowed time, you first need to know which QOS you are using. In a default setup this will be normal, but you can check with the command:

sacctmgr show association where users=$USER

Then check the ‘QOS’ column. It lists the QOS values you can pass to srun or sbatch with the -q (--qos) option. (Note that the partition you are submitting to may also set a QOS; you can check this in the output of scontrol show partition.)
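
For the partition check, here is a minimal sketch that pulls just the partition name and its QOS out of the scontrol output. It assumes the output uses PartitionName= and QoS= keys, as recent Slurm versions print them; partition_qos is a hypothetical helper name, not a Slurm command.

```shell
# Read `scontrol show partition` output on stdin and print
# "<partition> <qos>" for each partition (QoS is N/A when none is set).
partition_qos() {
    tr ' ' '\n' | awk -F= '
        $1 == "PartitionName" { name = $2 }
        $1 == "QoS"           { print name, $2 }
    '
}

# Only query the cluster when scontrol is actually available.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show partition | partition_qos
fi
```

If a partition lists a QOS here, its limits apply to jobs in that partition as well, so it is worth inspecting that QOS too.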

To see the limits for this QOS, you can use the command:

sacctmgr show qos normal
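
Since the error is about wall time, the column to look at is MaxWall. As a rough sketch, assuming your QOS is called normal and that one hour fits under its limit (both are placeholders, adjust for your site), you could narrow the output and then build the srun command like this:

```shell
QOS_NAME=normal       # assumption: use a QOS listed in your association
TIME_LIMIT=01:00:00   # assumption: placeholder, must not exceed MaxWall

# Show only the per-job wall-time limit of the QOS, if Slurm is available.
if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr show qos "$QOS_NAME" format=Name,MaxWall
fi

# Interactive job with an explicit QOS and time limit.
# The command is printed for review rather than executed.
SRUN_CMD="srun --qos=$QOS_NAME --time=$TIME_LIMIT --pty bash"
echo "$SRUN_CMD"
```

An empty MaxWall column would suggest the limit comes from somewhere else, such as a QOS attached to the partition.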

If the output for a field is truncated (as the TRES fields often are), ask for parsable output with -P and name the field you want, for example:

sacctmgr -P show qos normal format=MaxTRESPA
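
The parsable form is also handy in scripts. The sketch below reads MaxWall without a header (-n) and converts it to seconds so it can be compared with the time you plan to request. It assumes MaxWall is printed as [days-]hours:minutes:seconds; slurm_time_to_seconds is a hypothetical helper, and other time formats are not handled.

```shell
# Convert "[days-]HH:MM:SS" to seconds; prints nothing for other formats.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
    '
}

# Only query the cluster when sacctmgr is actually available.
if command -v sacctmgr >/dev/null 2>&1; then
    max_wall=$(sacctmgr -nP show qos normal format=MaxWall)
    if [ -n "$max_wall" ]; then
        echo "MaxWall in seconds: $(slurm_time_to_seconds "$max_wall")"
    fi
fi
```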

See the Slurm documentation for more information:
https://slurm.schedmd.com/resource_limits.html
https://slurm.schedmd.com/qos.html