I need to submit many hundreds of jobs to the cluster. What is the best way to submit these jobs to ensure that the jobs are scheduled and complete as fast as possible?
CURATOR: Katia
ANSWER:
This question is not very clear, so it is hard to answer, but here are some thoughts.
If we are talking about a single job that does a lot of I/O communication, then there are a couple of suggestions:
The other way this question could be interpreted is: how should I arrange and submit my (many) jobs so that they start running (and finish) as soon as possible? There are a couple of suggestions in this case as well:
Jack’s COMMENT: Asking Google for “slurm options for best throughput” gives as its first result “Tuning Slurm Scheduling for Optimal
Responsiveness and Utilization” -> https://slurm.schedmd.com/SUG14/sched_tutorial.pdf
ANSWER:
In general, you should tell the scheduler all of your job’s actual requirements, and not add constraints which are not requirements. Unnecessary constraints often cause a job to spend extra time waiting in the queue, which degrades overall throughput.
(NOTE: The Slurm and Moab options below can be affected by how a specific cluster is configured, so while they are valid for many such clusters, they may not be valid for all.)
Some common requirements to consider:
Walltime for the job.
Typically every job has a walltime (either specified by the user or defaulted), and the scheduler will terminate the job when the walltime is exceeded (whether the job has completed or not). So you usually want to specify the smallest walltime within which you are sure your job (if running properly) will finish. I.e., if you expect the job will usually finish in 8 hours, you might want to pad this to 9 or 10 hours just to be certain, but 24 hours is probably excessive. Shorter jobs may get higher priority and can typically take better advantage of backfilling, both of which will shorten the time spent waiting in the queue.
On Slurm, this is done with the --time=TIME or -t TIME flags, where TIME can be a number of minutes, or something like HOURS:MINS:SECS or DAYS-HOURS:MINS.
For Moab systems, this is typically done with something like “-l walltime=TIME”.
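For example, a minimal Slurm batch script asking for a 10 hour walltime might look like the following sketch (the executable name is just a placeholder, and any partition or account flags your cluster requires are omitted):

#!/bin/bash
#SBATCH --time=10:00:00    # walltime of 10 hours, in HOURS:MINS:SECS
#SBATCH --ntasks=1         # a single task (adjust as discussed below)

./my_program               # placeholder for your actual executable

The same limit can also be given on the command line, e.g. “sbatch --time=10:00:00 myscript.sh”.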
The number of nodes/CPU cores to be allocated.
The scheduler allocates CPU cores on nodes to jobs, so it needs to know what you want. I am assuming you know what is best for your code/problem; the important thing is that you tell the scheduler what you need/want. In particular, some codes can use multiple CPUs, but only if they are on the same node, in which case you need to tell the scheduler that (e.g. in Slurm, to get 8 cores on a single node, use something like “--nodes=1 --ntasks=8” or, even more specifically, “--ntasks=1 --cpus-per-task=8”. On Moab systems, this is usually specified with something like “-l nodes=1:ppn=8”).
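As a sketch of this threaded (single node, multiple cores) case, assuming an OpenMP-style code (the executable name is a placeholder):

#!/bin/bash
#SBATCH --ntasks=1            # one process ...
#SBATCH --cpus-per-task=8     # ... with 8 CPU cores, all on the same node
#SBATCH --time=4:00:00        # walltime; adjust to your job

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # common convention for OpenMP codes
./my_threaded_program         # placeholder for your actual executable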
For pure MPI jobs, however, it often does not matter much how the tasks are distributed among nodes (at least when all the nodes can talk to each other at full bandwidth over the high-speed interconnect). In such cases, it is usually best to tell the scheduler only how many tasks you wish to run, and not to constrain how it distributes the tasks among nodes. Specifying a node count can cause the job to spend more time in the queue, and is not advised unless doing so will significantly improve runtime.
To request 8 single-core MPI tasks, letting the scheduler distribute them over nodes as it sees fit, on Slurm systems one would use “--ntasks=8” without the --nodes flag. On Moab systems, it would be something like “-l nodes=8” (the “nodes” in this case refer to “virtual nodes”, i.e. processor cores).
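A corresponding Slurm sketch for a pure MPI job (the executable name is a placeholder, and launching with srun assumes the MPI library on your cluster is integrated with Slurm, which is common but not universal; some sites use mpirun instead):

#!/bin/bash
#SBATCH --ntasks=8        # 8 MPI tasks; no --nodes flag, so the scheduler places them freely
#SBATCH --time=2:00:00    # walltime; adjust to your job

srun ./my_mpi_program     # placeholder for your actual MPI executable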
Sometimes users will try to “trick” the scheduler. E.g., if the compute nodes have 64 GB of RAM and 16 cores per node (so about 4 GB per core by default), and your MPI job needs 8 cores with 8 GB of RAM per core, you might request 16 cores over 2 nodes and leave out the memory specification, counting on the default memory of the extra cores to cover what your 8 tasks actually need. But this can in some cases delay the scheduling of your job; it is better to tell the scheduler what you really want, e.g. “--ntasks=8 --mem-per-cpu=8G” in Slurm. There could be 8 nodes which are not fully utilized, each able to run a single-core task needing 8 GB of RAM, on which your job could start immediately, instead of waiting for 8 free cores (and their memory) on each of 2 nodes.
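Continuing that example, a sketch of the “say what you really want” version in Slurm (the executable name is a placeholder; note that --mem-per-cpu is in megabytes unless you add a unit suffix such as G):

#!/bin/bash
#SBATCH --ntasks=8           # 8 single-core MPI tasks
#SBATCH --mem-per-cpu=8G     # 8 GB of RAM per task
#SBATCH --time=2:00:00       # walltime; adjust to your job

srun ./my_mpi_program        # placeholder for your actual MPI executable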
There are potential negatives that should be considered as well: running on the same node as other jobs increases the potential for those jobs to interfere with yours. While the operating system and cluster configuration make some effort to mitigate this, such efforts are not perfect. In particular, I/O and network bandwidth are typically shared among all jobs on a node, and so are vulnerable to such interference.
Typically, if you are running a very parallel code on (and using most of the cores of) many nodes, it is best not to share nodes with other jobs. The width of such jobs increases their vulnerability to interference, and sharing provides little scheduling benefit for them anyway. But if your job is not using all the cores available on a typical node, allowing it to run on the same nodes as other jobs will usually improve its scheduling enough to more than offset the risk of having to rerun it due to interference from another job.
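For the wide, many-node case, a hedged Slurm sketch of requesting whole nodes so no other jobs share them (the task count and executable are placeholders; whether --exclusive is needed, or even meaningful, depends on how the cluster is configured):

#!/bin/bash
#SBATCH --ntasks=128       # a wide MPI job spanning several nodes
#SBATCH --exclusive        # request whole nodes, not shared with other jobs
#SBATCH --time=12:00:00    # walltime; adjust to your job

srun ./my_mpi_program      # placeholder for your actual MPI executable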