I requested a GPU from SLURM, but how can I see what's happening while my job runs?
You can connect to the compute node your job is running on and run nvidia-smi to see processes using GPUs on that node. For example, if I have a job running on the GPU node gpu04:
# from a login node
ssh gpu04
# now on the GPU node
nvidia-smi
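If you're not sure which node your job landed on, squeue will show the node list for your running jobs. A minimal sketch (the custom output format is just one convenient option):

# from a login node: list your jobs with their ID, name, state, and assigned node(s)
squeue -u $USER -o "%.10i %.20j %.8T %N"
# the last column shows the node(s) (e.g. gpu04) you can ssh into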
If you want more granular and complete statistics on GPU performance, wrap your GPU-enabled command in your job script with something like this:
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,temperature.gpu,power.draw,clocks.sm,clocks.mem,clocks.gr,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used \
           --format=csv,nounits --loop-ms=100 -f gpu_usage.csv &
gpu_watch_pid=$!
# run your GPU-enabled command here
kill $gpu_watch_pid
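For context, here is a rough sketch of how that pattern fits into a complete batch script. The job name, partition, GPU count, and srun command are placeholders to adapt to your cluster:

#!/bin/bash
#SBATCH --job-name=gpu_job        # hypothetical job name
#SBATCH --partition=gpu           # replace with your cluster's GPU partition
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --time=01:00:00

# start the nvidia-smi sampling loop in the background (query fields trimmed for brevity)
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
           --format=csv,nounits --loop-ms=100 -f gpu_usage.csv &
gpu_watch_pid=$!

srun python train.py              # placeholder for your GPU-enabled command

kill $gpu_watch_pid               # stop sampling once the work finishes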
You can then examine the file gpu_usage.csv for trends in GPU utilization over the lifetime of the job.
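For a quick sanity check without leaving the shell, you can summarize the utilization column with awk. A minimal sketch, assuming the exact --query-gpu field order above (so utilization.gpu is the 11th comma-separated field) and the ", " separator that csv,nounits output uses:

# skip the CSV header, then report sample count, average, and peak GPU utilization
awk -F', ' 'NR > 1 { sum += $11; if ($11+0 > max) max = $11+0; n++ }
            END { if (n) printf "samples: %d  avg util: %.1f%%  peak util: %d%%\n", n, sum/n, max }' gpu_usage.csv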