We are running Slurm 20.02.6 (via Bright Cluster Manager 9.0) on RHEL 8.1, and this setup works correctly for us.
GRES is defined for the GPUs. In /etc/slurm/slurm.conf we have:
GresTypes=gpu
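For completeness, the GPU nodes' NodeName entries in slurm.conf also declare the GRES count. A minimal sketch, assuming four V100s per node (the remaining node parameters depend on your hardware):

NodeName=gpu[001-012] Gres=gpu:v100:4 ...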
In /etc/slurm/gres.conf we have:
NodeName=gpu[001-012] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,1,2,3,4,5,6,7,8,9,10,11
NodeName=gpu[001-012] Name=gpu Type=v100 File=/dev/nvidia1 Cores=12,13,14,15,16,17,18,19,20,21,22,23
NodeName=gpu[001-012] Name=gpu Type=v100 File=/dev/nvidia2 Cores=24,25,26,27,28,29,30,31,32,33,34,35
NodeName=gpu[001-012] Name=gpu Type=v100 File=/dev/nvidia3 Cores=36,37,38,39,40,41,42,43,44,45,46,47
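(The Cores lists above can equivalently be written as core ranges, which gres.conf also accepts and which are easier to read; for example, the first line could be:)

NodeName=gpu[001-012] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-11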
In /etc/slurm/cgroup.conf we have:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
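For these cgroup constraints to take effect, slurm.conf also has to select the cgroup plugins. We assume something along these lines (exact values may differ on your site):

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity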
In a simple test, a job requesting two GPUs sees exactly two devices:
$ srun --time=1:00:00 --partition=gpu --gres=gpu:2 --pty /bin/bash
gpu007$ nvidia-smi
Tue May 11 17:40:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:18:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 37C P0 71W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
gpu007$ env |grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1
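As a further check that ConstrainDevices=yes really hides unallocated GPUs, a sketch mirroring the test above but requesting a single GPU:

$ srun --time=1:00:00 --partition=gpu --gres=gpu:1 --pty /bin/bash
gpu007$ nvidia-smi -L                 # should list exactly one GPU
gpu007$ echo $CUDA_VISIBLE_DEVICES    # should print 0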