jma
November 23, 2018, 10:42am
I have a job that I want to run on the “bigmem” node, and so I am trying to submit it like this:
sbatch --partition=bigmem parseEncoding.sh
But I immediately get this error:
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
In the old Sherlock we had to set --qos, but setting it here is incorrect and also results in an error (I found this in the Sherlock 2.0 documentation).
vsoch
November 23, 2018, 10:45am
To make sure that jobs that don’t need the bigmem nodes aren’t submitted there, we require the requested memory to be at least 128 GB. Since you don’t request any memory, the default is assumed and the submission is denied (your job would be able to run on a regular node).
To fix this and get the job allocated on a bigmem node, add a memory specification with --mem:
sbatch --partition=bigmem --mem 130000 parseEncoding.sh
Note that the default unit for --mem is megabytes (MB), so 130000 means roughly 130 GB.
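If you'd rather not retype the options on every submission, the same request can go into the job script itself as #SBATCH directives. This is a minimal sketch, assuming parseEncoding.sh is a bash script; its real contents aren't shown in the thread, so the body below is only a placeholder.

```shell
#!/bin/bash
# Sketch of parseEncoding.sh with the partition and memory set as #SBATCH
# directives. sbatch reads these lines at submission time; bash treats them as comments.
#SBATCH --partition=bigmem
# --mem defaults to megabytes, so 130000 is roughly 130 GB
#SBATCH --mem=130000

# Slurm exports SLURM_MEM_PER_NODE inside a job that requested --mem;
# outside a job (e.g. when testing the script with plain bash) it is unset.
mem_requested="${SLURM_MEM_PER_NODE:-not running under Slurm}"
echo "SLURM_MEM_PER_NODE=${mem_requested}"

# ... the actual work of parseEncoding.sh goes here (not shown in the post) ...
```

With this header, a plain sbatch parseEncoding.sh should be sent to bigmem. Options given on the sbatch command line take precedence over #SBATCH lines, so you can still override --mem for a single submission.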
How to see the QOS limits
The plain command sacctmgr show qos prints a very wide table; for a formatted version, try:
sacctmgr show qos format=Name,MaxTRESPerUser,MaxSubmitJobsPerUser,MaxJobsPerUser,MaxWall,MaxTRESPA,MaxSubmitJobsPerAccount -r normal,owners,gpu,dev,bigmem,long,owner
      Name     MaxTRESPU MaxSubmitPU MaxJobsPU     MaxWall     MaxTRESPA MaxSubmitPA
---------- ------------- ----------- --------- ----------- ------------- -----------
    normal       cpu=512        1000            2-00:00:00      cpu=1024        2000
       dev  cpu=2,mem=8G           2              02:00:00     cpu=99999          32
      long        cpu=32          20        16  7-00:00:00                        40
    bigmem        mem=3T          10            1-00:00:00        mem=6T          20
       gpu    gres/gpu=8          50            2-00:00:00   gres/gpu=24         100
     owner     cpu=99999        3000            7-00:00:00     cpu=99999        5000
    owners      cpu=2048        3000            2-00:00:00      cpu=4096        5000
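For scripting against these limits, sacctmgr's -n (--noheader) and -P (--parsable2) flags print '|'-separated fields without padding, so empty cells stay unambiguous. Below is a small sketch of a reader for that output. The helper name qos_summary is made up for this example, and the invocation in the comment, sacctmgr -nP show qos name=bigmem format=Name,MaxTRESPerUser,MaxSubmitJobsPerUser,MaxJobsPerUser,MaxWall, is an assumption I haven't tested on Sherlock. The sample records are copied from the table above.

```shell
#!/bin/bash
# qos_summary (hypothetical helper): read '|'-separated QOS records on stdin in the
# field order Name|MaxTRESPU|MaxSubmitPU|MaxJobsPU|MaxWall, e.g. from
#   sacctmgr -nP show qos name=bigmem format=Name,MaxTRESPerUser,MaxSubmitJobsPerUser,MaxJobsPerUser,MaxWall
# An empty field means no limit is set at the QOS level, so it is shown as "none".
qos_summary() {
  local name tres submit jobs wall
  while IFS='|' read -r name tres submit jobs wall; do
    printf '%s: per-user TRES=%s, max submitted=%s, max running=%s, max wall=%s\n' \
      "$name" "${tres:-none}" "${submit:-none}" "${jobs:-none}" "${wall:-none}"
  done
}

# Example input copied from the bigmem and long rows of the table above.
printf '%s\n' 'bigmem|mem=3T|10||1-00:00:00' 'long|cpu=32|20|16|7-00:00:00' | qos_summary
```

Splitting on a non-whitespace IFS character keeps the empty MaxJobsPU cell as an empty field instead of shifting the remaining columns left, which is exactly what went wrong when the padded table above was flattened.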
Information using sinfo
Another way to inspect the nodes in a partition is sinfo. For example:
$ sinfo -N -p bigmem --long
Fri Nov 23 03:31:27 2018
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
sh-02-13 1 bigmem idle 32 4:8:1 153600 0 108491 CPU_GEN: none
sh-02-14 1 bigmem idle 32 4:8:1 153600 0 108491 CPU_GEN: none
sh-112-01 1 bigmem idle 56 4:14:1 307200 0 109661 CPU_GEN: none
sh-112-02 1 bigmem idle 32 2:16:1 512000 0 106461 CPU_GEN: none
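To see which bigmem nodes could take a given --mem request, sinfo can print just the node name and memory with sinfo -N -h -p bigmem -o "%N %m" (%N is the node name, %m the node's memory in MB). A small sketch of filtering that output follows. The helper name nodes_fitting is hypothetical, and the sample lines are copied from the output above.

```shell
#!/bin/bash
# nodes_fitting (hypothetical helper): read "NODE MEMORY_MB" lines on stdin, e.g. from
#   sinfo -N -h -p bigmem -o "%N %m"
# and print the nodes whose memory is at least REQUEST_MB (the first argument),
# converted to GB with 1 GB = 1024 MB (the factor Slurm uses for its G suffix).
nodes_fitting() {
  local want_mb=$1 node mem
  while read -r node mem; do
    if (( mem >= want_mb )); then
      printf '%s %d GB\n' "$node" $(( mem / 1024 ))
    fi
  done
}

# Example input copied from the sinfo output above.
sample='sh-02-13 153600
sh-02-14 153600
sh-112-01 307200
sh-112-02 512000'
printf '%s\n' "$sample" | nodes_fitting 200000
```

On this sample, a 200000 MB request leaves only sh-112-01 (300 GB) and sh-112-02 (500 GB). Treat the reported memory as an upper bound, since some of it may be reserved for the system and unavailable to jobs.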