I submitted a job to the cluster and it has not started yet. Why is that? What can I do about it?
Will be self-answered.
There are two main reasons why your job may not be running on the cluster: either (A) there are no resources available on the cluster right now, or (B) there are resources, but your job does not have a high enough priority to run.
If all of the information below is a bit too much, or if you just want a quick dashboard to look at, you can visit:
https://dmatthe1.w3.uvm.edu
which displays currently available cluster resources and cluster queue activity, as well as recent usage by each user.
This dashboard gives a quick look at whether resources are available and where your priority stands; however, the raw command-line output described below will give a more complete picture.
Resources:
Your job may not be running because the cluster is busy and there are no resources available right now. To determine if this is the cause, start by figuring out how many CPU cores, how much memory, and how many GPUs your job needs.
Get a list of all your current jobs by running:
squeue -u <username>
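The rightmost column of the default squeue output, NODELIST(REASON), already gives a first hint: a pending job usually shows a reason such as (Resources) or (Priority), matching the two cases above. If you only want the reason column, a format string along these lines should work (%r is squeue's pending-reason field; the column widths are arbitrary):
squeue -u <username> -o "%.10i %.9P %.8T %r"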
Then run:
scontrol show job <jobid>
to get specific information about your job. In the output you should see a few lines like:
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=56000M,node=1,billing=21,gres/gpu=1
This tells us that our job needs a node with 4 CPU cores, 56GB of memory, and 1 GPU.
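For reference, a job with these requirements would typically have been submitted with sbatch directives along the following lines. This is only a sketch: the partition, time limit, and script name are placeholders, not values taken from the job above.
#!/bin/bash
#SBATCH --partition=dggpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=56000M
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
python my_script.py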
Now, let's determine whether these resources are available on the cluster right now:
Start by running
sinfo
This will give us basic information about each partition and what resources are available.
We know our job needs a node with a free GPU, so if any GPU nodes are idle then the resources are available (at least on the VACC, where every GPU node has more than 56GB of memory).
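A quick way to check for completely idle nodes in a particular partition is to filter sinfo by state (dggpu here is just the partition name from the example node later in this post; substitute the GPU partition you actually submit to):
sinfo -p dggpu -t idle
If this prints no node lines, nothing in that partition is fully free.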
However, if each partition lists all nodes as mixed, allocated, or reserved, then we need to probe deeper.
Run the command:
sinfo -N -O partition,nodelist,statelong,gresused,cpusstate,cpusload,freemem,time
This command will tell us the specific hardware in use on each node in the cluster.
By scrolling through this, you may observe that although no nodes have 4 CPU cores and 1 GPU available, some nodes have 2 CPU cores and 1 GPU available.
If you want your job to run now, then consider submitting a new job with fewer resources.
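For example, if the job above were resubmitted asking for only 2 cores, the relevant sbatch directives would change to something like this (memory and GPU requests left as they were):
#SBATCH --cpus-per-task=2
#SBATCH --mem=56000M
#SBATCH --gres=gpu:1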
The previous command tells us how much free memory each node reports, but not how much memory Slurm has already allocated, which is what determines how much it is still willing to allocate to a new job.
If you want to check whether any node has enough memory left to allocate, run the following command:
scontrol show node <nodename>
For example running:
[dmatthe1@vacc-user2 ~]$ scontrol show node dg-gpunode02
NodeName=dg-gpunode02 Arch=x86_64 CoresPerSocket=16
CPUAlloc=24 CPUTot=32 CPULoad=23.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:8
NodeAddr=dg-gpunode02 NodeHostName=dg-gpunode02 Version=20.11.8
OS=Linux 3.10.0-1160.36.2.el7.x86_64 #1 SMP Thu Jul 8 02:53:40 UTC 2021
RealMemory=772643 AllocMem=309952 FreeMem=367624 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A MCS_label=N/A
Partitions=dggpu
BootTime=2021-08-05T15:04:11 SlurmdStartTime=2021-08-11T12:19:00
CfgTRES=cpu=32,mem=772643M,billing=252,gres/gpu=8
AllocTRES=cpu=24,mem=309952M,gres/gpu=5
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment=(null)
tells us both the total memory we can request on the node (CfgTRES), which is over 700GB, and the amount of memory currently allocated (AllocTRES), which right now is about 300GB.
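As a quick sanity check using the node above: 772643M configured minus 309952M allocated leaves roughly 462000M (about 460GB) that Slurm could still hand out, so a job requesting 56000M of memory would fit on this node as far as memory is concerned (it still needs free CPUs and a free GPU, of course).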
If your job does not need to run for more than 3 hours and does not utilize GPUs (consider using checkpointing!), then consider submitting it to the short queue. This queue has more nodes assigned to it than the default bluemoon queue, so your job may start sooner if resources are the limiting factor.
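A minimal sketch of such a submission, assuming the partition is literally named short (check sinfo for the exact name at your site):
#SBATCH --partition=short
#SBATCH --time=03:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=56000M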
Priority:
At the VACC, SLURM calculates job priority using a multi-factor priority calculator.
Even if the cluster is empty, if your priority is low enough, your job may not start immediately.
Your job's priority is computed from a few different parameters. We can run the following command to see how they are configured:
[dmatthe1@vacc-user2 ~]$ scontrol show config | grep "Priority"
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 10-12:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = ACCRUE_ALWAYS,CALCULATE_RUNNING
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 1000
PriorityWeightAssoc = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 0
PriorityWeightQOS = 0
PriorityWeightTRES = CPU=1000,Mem=1000,GRES/gpu=10000
This tells us that the primary factor in computing job priority is FairShare (its weight of 10000 is 10x higher than the Age and JobSize weights), which reflects how much compute you have used recently.
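Roughly speaking, the multifactor plugin combines these weights as a weighted sum, where each factor is a number between 0.0 and 1.0. Plugging in the weights above (and dropping the partition and QOS terms, whose weights are 0):
priority ≈ 1000 * age_factor + 10000 * fairshare_factor + 1000 * jobsize_factor + (TRES terms)
so a low fairshare_factor can easily outweigh anything the age and job-size terms contribute.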
To determine your own fairshare, run sshare. If you want to compare your fairshare to other users, run sshare --all.
For specific information about jobs currently in the queue, run sprio -l.
This will tell us the priority of each job as well as the component parts of the priority.
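If the queue is long, you can limit the output to your own jobs (the -u option takes a comma-separated list of usernames):
sprio -l -u <username>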
If the cluster is busy and your fairshare is low, there are two things you can do: (a) wait. After 2 weeks, your Age priority will max out, raising your job priority to at least 1000 regardless of your fairshare priority.
(b) If, on the other hand, you want your job to run now, consider allocating a full node. JobSize is part of the job priority calculation, so larger jobs will start slightly sooner.
If you are going to allocate a full node, make sure you are actually using it! We don’t want to waste cluster resources!
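One common way to request a whole node is the --exclusive flag. A sketch of the relevant directives follows; the CPU count matches the 32-core example node above and should be adjusted to the nodes you are actually targeting:
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32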