I requested a GPU from SLURM, but how can I see what's happening while my job runs?
You can connect to the compute node your job is running on and run nvidia-smi to see processes using GPUs on that node. For example, if I have a job running on the GPU node gpu04:
# from a login node
ssh gpu04
# now on the GPU node
nvidia-smi
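If you're not sure which node your job landed on, squeue will show the node list for your running jobs. A minimal sketch (the custom output format is just one convenient option):

# from a login node: list your jobs with their ID, name, state, and assigned node(s)
squeue -u $USER -o "%.10i %.20j %.8T %N"
# the last column shows the node(s) (e.g. gpu04) you can ssh into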
If you want more granular and complete statistics on GPU performance, wrap your GPU-enabled command in your job script with something like this:
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,temperature.gpu,power.draw,clocks.sm,clocks.mem,clocks.gr,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used \
           --format=csv,nounits --loop-ms=100 -f gpu_usage.csv &
gpu_watch_pid=$!
# run your GPU-enabled command here
kill $gpu_watch_pid
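For context, here is a rough sketch of how that pattern fits into a complete batch script. The job name, partition, GPU count, and srun command are placeholders to adapt to your cluster:

#!/bin/bash
#SBATCH --job-name=gpu_job        # hypothetical job name
#SBATCH --partition=gpu           # replace with your cluster's GPU partition
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --time=01:00:00

# start the nvidia-smi sampling loop in the background (query fields trimmed for brevity)
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
           --format=csv,nounits --loop-ms=100 -f gpu_usage.csv &
gpu_watch_pid=$!

srun python train.py              # placeholder for your GPU-enabled command

kill $gpu_watch_pid               # stop sampling once the work finishes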
You can then examine the file gpu_usage.csv for trends in GPU utilization over the lifetime of the job.
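For a quick sanity check without leaving the shell, you can summarize the utilization column with awk. A minimal sketch, assuming the exact --query-gpu field order above (so utilization.gpu is the 11th comma-separated field) and the ", " separator that csv,nounits output uses:

# skip the CSV header, then report sample count, average, and peak GPU utilization
awk -F', ' 'NR > 1 { sum += $11; if ($11+0 > max) max = $11+0; n++ }
            END { if (n) printf "samples: %d  avg util: %.1f%%  peak util: %d%%\n", n, sum/n, max }' gpu_usage.csv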