How can one determine the amount of RAM a node has in an HPC environment?
The best way is to consult your site’s documentation.
On SLURM, one can do this in two steps:
- invoke sinfo to see the list of nodes and their states
- invoke srun to run the free command on the desired compute node. For example, for a node named r001, invoke: srun -w r001 free
Here is a real example from the PSC Bridges supercomputer:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
RM* up 2-00:00:00 1 drain* r242
RM* up 2-00:00:00 1 comp r400
RM* up 2-00:00:00 1 drain r668
RM* up 2-00:00:00 9 resv r[405-412,670]
RM* up 2-00:00:00 702 alloc r[006-241,243-399,401-404,413-667,669,671-719]
RM-shared up 2-00:00:00 21 mix r[720-721,733-747,749-752]
RM-shared up 2-00:00:00 3 alloc r[723-724,748]
RM-shared up 2-00:00:00 9 idle r[722,725-732]
...
Nodes in the "alloc", "drain", and "resv" states cannot be reached, so let's check the free memory on r720 (it is in the non-default partition RM-shared, so we have to specify that):
$ srun -p RM-shared -w r720 free
total used free shared buff/cache available
Mem: 131734464 13434632 107366248 560604 10933584 116069696
Swap: 17591292 2375984 15215308
Looks like we have 128 GB of total RAM. That matches the system configuration page (r720 is one of the regular-memory nodes): https://www.psc.edu/bridges/user-guide/system-configuration
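Note that free reports sizes in kibibytes by default, so 131734464 KiB is roughly 125.6 GiB, which is consistent with a 128 GiB node once memory reserved by the kernel and firmware is accounted for. If the free on the node supports it (most current procps versions do), the -h flag prints human-readable units:
$ srun -p RM-shared -w r720 free -h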
On Grid Engine family systems, you can pull up what the queue scheduler sees with qhost.
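For example (a sketch; the exact columns vary between Grid Engine versions), you can restrict the output to a single execution host with -h, and the MEMTOT column shows that host's total physical memory (replace r001 with one of your own host names):
$ qhost -h r001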
If you are inspecting directly and need more detail than free gives, most systems will also let you run other tools, such as lshw, on the job node.
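For instance, assuming lshw is installed on the compute node, a quick way to summarize the memory devices (full per-DIMM details usually require root) is something like this, reusing the r720 node from the Slurm example above:
$ srun -p RM-shared -w r720 lshw -short -C memory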
The method depends on the resource manager or scheduler in use at your site. With Slurm you can quickly see this in the output of
scontrol show node $NODENAME
Replace $NODENAME with the actual name of the node you're interested in. If you leave the node name off, you get a listing that includes all nodes.
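For example, on the r720 node from above, you can pick out the memory-related fields; RealMemory is the memory configured for the node in megabytes, and FreeMem is the free memory Slurm last recorded:
$ scontrol show node r720 | grep -i mem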