HTC using SLURM Job Arrays

As mentioned in the Europa kickoff meeting, Europa is a High Throughput Computing environment, or HTC for short. This is in contrast to Ganymede, which is an HPC system. The major difference between HTC and HPC is the networking (and sometimes the storage).

Ganymede has a high-performance interconnect and a parallel scratch storage system, while Europa (at this time) does not. So, while Ganymede is great for running large, multi-node, single simulations, Europa is not. Europa is for running many smaller jobs concurrently; this is high throughput computing.

In future posts, we will look at other tools we can use for managing our HTC workloads, but the first tool we're going to look at is a SLURM feature. SLURM is the scheduling/queue system we use, and it has support for the notion of job arrays.

Let’s start with a simple SLURM job array example:

#!/bin/bash

#SBATCH -J test-arrays        # Job name
#SBATCH -o job.%j.out         # Name of stdout output file (%j expands to jobId)
#SBATCH -N 1                  # Total number of nodes requested
#SBATCH -n 1                  # Total number of tasks requested
#SBATCH --array=1-5           # array ranks to run

#SBATCH -t 01:30:00           # Run time (hh:mm:ss) - 1.5 hours

echo My SLURM Job Array Task ID is: "$SLURM_ARRAY_TASK_ID"
echo I ran on host: `hostname`
sleep 10

Here, we have a simple example job array script that will submit 5 jobs with this one sbatch command, on 1 node and 1 core each. Each task gets a different value of $SLURM_ARRAY_TASK_ID.

[css180002@europa arrays]$ sbatch job-array.sh 
Submitted batch job 225
[css180002@europa arrays]$ squeue -u$USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
             225_1    normal test-arr css18000  R       0:07      1 compute-1-1-8 
             225_2    normal test-arr css18000  R       0:07      1 compute-1-1-9 
             225_3    normal test-arr css18000  R       0:07      1 compute-1-1-11 
             225_4    normal test-arr css18000  R       0:07      1 compute-1-1-18 
             225_5    normal test-arr css18000  R       0:07      1 compute-1-1-19 

[css180002@europa arrays]$ cat job.22*.out
My SLURM Job Array Task ID is: 5
I ran on host: compute-1-1-19
My SLURM Job Array Task ID is: 1
I ran on host: compute-1-1-8
My SLURM Job Array Task ID is: 2
I ran on host: compute-1-1-9
My SLURM Job Array Task ID is: 3
I ran on host: compute-1-1-11
My SLURM Job Array Task ID is: 4
I ran on host: compute-1-1-18

If you abstract your workflow such that “all” it needs is an input task ID “rank”, you now have a very easy-to-use infrastructure for generating HTC workflows.
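For example, here is a minimal sketch of that pattern, assuming a hypothetical params.txt with one set of arguments per line and a placeholder ./my_program executable; each task pulls out its own line and runs on it:

#!/bin/bash

#SBATCH -J param-sweep        # Job name
#SBATCH -o job.%A_%a.out      # %A = array job ID, %a = array task ID
#SBATCH -N 1                  # One node per task
#SBATCH -n 1                  # One core per task
#SBATCH --array=1-100         # One task per line of params.txt
#SBATCH -t 01:00:00           # Run time (hh:mm:ss)

# Grab the Nth line of the (hypothetical) parameter file,
# where N is this task's array rank
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

./my_program $PARAMS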

Note, a SLURM job is run for each individual array rank. If you'd like to run many small jobs as part of one larger job, that requires a different tool such as the TACC Launcher.

By default, SLURM will try to run all of your array tasks at the same time (if the nodes are available). If you'd like to change this behavior, you can include a percent sign to set a maximum number of concurrently running tasks, with --array=1-100%4 for instance.

You can also request custom task IDs with --array=1,3,5,7 and a custom step size with --array=1-7:2.
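To illustrate (the ranges here are arbitrary), these variants look like the following in a job script; use only one --array line per script:

#SBATCH --array=1-100%4       # 100 tasks, but at most 4 running at once
#SBATCH --array=1,3,5,7       # only these specific task IDs
#SBATCH --array=1-7:2         # task IDs 1,3,5,7 via a step size of 2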

In your job scripts, several environment variables are available. Each array task runs as its own job with its own unique $SLURM_JOB_ID; the underscore form you see in squeue (e.g. 225_1) combines the array job ID and the task index. The $SLURM_ARRAY_JOB_ID is the same for all tasks in the array, while $SLURM_ARRAY_TASK_ID is equal to the index supplied by the --array option.
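As a quick sketch, printing these inside one task of the array submitted above would look something like this (the exact job ID values are illustrative):

echo "Array job ID:  $SLURM_ARRAY_JOB_ID"   # e.g. 225 for every task
echo "Task index:    $SLURM_ARRAY_TASK_ID"  # e.g. 1 through 5
echo "This job's ID: $SLURM_JOB_ID"         # unique per task, e.g. 225, 226, ...

These also have matching filename patterns, so an output option like #SBATCH -o job.%A_%a.out writes one file per task, named after the array job ID and task index.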


Slurm Script Update.

The job array worked very well for the task I have at hand. Following is the code I used.
utdNodesAllRun.sh

#!/bin/bash

#SBATCH -J utdNodesAll        # Job name
#SBATCH -o utdNodesAll.%j.out # Name of stdout output file (%j expands to jobId)
#SBATCH -e utdNodesAll.%j.err # Error File Name 
#SBATCH -N 1                  # Total number of nodes requested
#SBATCH -n 16                 # Total number of mpi tasks requested
#SBATCH --array=1-15          # Array ranks to run
#SBATCH -t 24:00:00           # Run time (hh:mm:ss) - 24 hours

ml load matlab
echo Running calibration scripts for UTD Node: "$SLURM_ARRAY_TASK_ID"
echo Running on host: `hostname`
matlab -nodesktop -nodisplay -nosplash -r "try utdNodesOptSolo2("$SLURM_ARRAY_TASK_ID"); catch; end; quit"

And it's currently running on Europa:

[lhw150030@europa calibration]$ squeue -u$USER | grep 249
             249_1    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-1 
             249_2    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-3 
             249_4    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-8 
             249_5    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-9 
             249_6    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-11 
             249_7    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-15 
             249_8    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-18 
            249_10    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-20 
            249_11    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-21 
            249_12    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-22 
            249_13    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-23 
            249_14    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-24 
            249_15    normal utdNodes lhw15003  R    1:11:40      1 compute-1-1-25

Good progress Lakitha! Keep it up!!!


Hello everyone!

I was trying to submit a similar job to run a large parametric simulation on 100 cores with the launcher module.
I haven’t succeeded in launching it correctly. Is there a max number of processes a user can launch? Am I not understanding the node architecture correctly?

This is one of the attempts. The LAUNCHER_JOB_FILE has 100 lines (a sketch of what that file looks like follows the script below).

I was also considering starting a separate thread for the launcher module if anyone is also using it.

    #!/bin/bash
    # Simple SLURM script for submitting multiple serial
    # jobs (e.g. parametric studies) using a script wrapper
    # to launch the jobs.
    #
    # To use, build the launcher executable and your
    # serial application(s) and place them in your WORKDIR
    # directory. Then, edit the LAUNCHER_JOB_FILE to specify
    # each executable per process.
    #-------------------------------------------------------
    #
    # <------ Setup Parameters ------>
    #
    #SBATCH -J launcher
    #SBATCH -N 7
    #SBATCH -n 16
    #SBATCH -p normal
    #SBATCH -o Parametric.%j.out
    #SBATCH -e Parametric.%j.err
    #SBATCH -t 48:00:00
    #------------------------------------------------------
    module load launcher
    export LAUNCHER_SCHED=interleaved
    export LAUNCHER_WORKDIR=~/monte
    export LAUNCHER_JOB_FILE=script/montecmd1
    $LAUNCHER_DIR/paramrun
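
For context, the LAUNCHER_JOB_FILE is just a plain text file with one shell command per line, one per parametric case. A hypothetical sketch of such a file (the monte_sim executable and its arguments are placeholders, not the actual contents of script/montecmd1):

./monte_sim --seed 1 --config case_001.cfg
./monte_sim --seed 2 --config case_002.cfg
./monte_sim --seed 3 --config case_003.cfg
# ... one line per case, 100 lines in total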

Hello everyone @csim, @solj, @sasmitam,
Another question about parametric runs. I am using the following script to run multiple parametric jobs using launcher.

#!/bin/bash
#SBATCH -J launcher
#SBATCH -N 3
#SBATCH -n 48
#SBATCH -p normal
#SBATCH -o out/slurm-out/Parametric.%j.out
#SBATCH -e out/slurm-out/Parametric.%j.err
#SBATCH -t 36:00:00
#------------------------------------------------------
module load launcher
export LAUNCHER_SCHED=interleaved
export LAUNCHER_WORKDIR=~/monte
export LAUNCHER_JOB_FILE=script/montecmd1
$LAUNCHER_DIR/paramrun

The jobs start normally and seem to run on multiple nodes as expected.

[rxz074000@europa ~]$ squeue -u$USER
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
     1089    normal launcher rxz07400  R 1-06:50:09      3 compute-1-1-[1-3] 

However, when I check the output, the jobs seem to only run on the first selected node and never make it onto the remaining nodes. The SLURM error output shows the following message.

[rxz074000@europa monte]$ vi out/slurm-out/Parametric.1089.err    

using /tmp/launcher.1089.hostlist.anax6Zvm to get hosts
starting job on compute-1-1-1
starting job on compute-1-1-2
starting job on compute-1-1-3
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-2: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-2: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-2: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
...

I cannot figure out whether it’s something I am doing or if there’s another issue. The executable is the same for all jobs with different parameters.

Just wanted to note that this was due to a misapplied configuration change. This has been fixed and the job should now launch normally. Anyone experiencing this behavior should email our ticketing system.


Hey @rzutd,

I see you are running some launcher jobs now and have been for a few days. Are you still having issues? Is there anything unresolved we can help with?

Best,
Chris

@csim

Yes, I started running a batch yesterday and I haven't seen any issues. The issue in post #4, where I was only able to submit 48 jobs at a time, is gone (I submitted 96 and it worked), so I can now submit a longer $LAUNCHER_JOB_FILE. The issue in post #5 was solved by @solj's configuration fix.

One more interesting behavior I noticed in this and other runs: some nodes run drastically slower than others (in this run it was compute-1-1-35). The jobs on all the other nodes seem to proceed at a relatively similar speed. Is this expected?

Thank you for checking in!

Ok, I “think” I know what's causing the performance problem. There are some BIOS settings on the chassis related to power delivery, and ours are set to a non-default configuration. When a node's CMOS battery goes out, the custom BIOS settings are lost on power-down; when the node is rebooted, the power delivery is wrong, so some nodes are power starved, leading to a lower clock speed.

Tomorrow, I'll do a pass on the entire cluster, benchmark the nodes, and find the ones that are set incorrectly. I'll then ask the Ops team to take care of it. It should be resolved in the next 2-3 days, depending on when we can get into the data center next.

Best,
Chris
