Hi,
Has anyone run “stress-tests” on Slurm scheduler?
If so, what parameters did you test?
Thanks!
Sarvani Chadalapaka
Hi Sarvani,
I like to throw this job and variants of it at our scheduler:
#!/bin/bash
#SBATCH --job-name=myjob # the name of your job
#SBATCH --output=/some/area/output_%A_%a.txt # this is the file your output and errors go to
#SBATCH --time=2 # 2 min
#SBATCH --chdir=/some/area # your work directory
#SBATCH --mem=500 # 500MB of memory
#SBATCH --array=1-2000 # 2000 member array
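# (The original post shows only the directives; as a purely hypothetical
# payload, something trivial is enough here, since the goal is to stress
# scheduling and job cleanup rather than the compute itself.)
sleep 30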
Depending on the core count of your cluster, increase the array size accordingly. This is a great job for testing the speed of scheduling, as well as the speed of job cleanup in parallel, because it creates a storm of network connections hitting the scheduler, the nodes, and of course the network itself. Our cluster is 3k cores and this job stresses the scheduler nicely; it has been helpful for tuning both the scheduler and the network.
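For reference, here is a minimal sketch of how a run like this might be submitted and watched from the command line. It assumes the script above is saved as stress_array.sh (a made-up name); sbatch, squeue, and sdiag are standard Slurm commands:

# Submit the array job (the script name is just an assumption)
sbatch stress_array.sh

# Count how many array tasks are still pending or running
squeue -u $USER -h -t pending,running | wc -l

# Watch scheduler statistics (main/backfill cycle times, RPC counts)
# while the storm of submissions and completions is in flight
sdiag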
Hope that helps, let me know if you have any questions.
Best,
Chris
I wrote a utility for Slurm testing called the Slurm Test Deck Generator. Here's the GitHub repo for it: https://github.com/fasrc/stdg
This looks interesting, Paul - will check it out!
Best,
Chris
We’re not using SLURM, but we have used ReFrame to write a set of scheduler tests to basically confirm that our scheduler acts the way we expect. For example, we submit test jobs to every combination of queue and resource type we support, test that jobs that we expect to be unschedulable are rejected, etc.
We are considering switching to SLURM in 2020, and would likely re-write these tests to test our SLURM configuration.
We currently run these when we come out of a downtime to confirm that nothing changed unexpectedly.
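For anyone who wants to run that kind of check directly against Slurm, here is a rough, hypothetical shell sketch of one such test (a job that should be unschedulable is rejected at submit time). Whether an over-sized request is refused at submission depends on your configuration (e.g. EnforcePartLimits and the node definitions), so treat it as an illustration rather than a recipe:

# Ask for far more memory than any node offers; with limits enforced at
# submit time (an assumption about the site config) sbatch should refuse it.
out=$(sbatch --wrap='true' --time=1 --mem=100000G 2>&1)
if echo "$out" | grep -qi 'error'; then
    echo "PASS: over-sized job was rejected at submit time"
else
    echo "FAIL: over-sized job was accepted: $out"
    scancel "${out##* }"   # clean up the stray job if one was created
fi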