How do I get my jobs to be scheduled as fast as possible?

I need to submit many hundreds of jobs to the cluster. What is the best way to submit these jobs to ensure that the jobs are scheduled and complete as fast as possible?

CURATOR: Katia

ANSWER:

This question is not very clear, so it is hard to answer precisely, but here are some thoughts.
If we are talking about a single job that does a lot of I/O, there are a couple of suggestions:

  • If the job reads or writes one line at a time, the most efficient approach is to copy (or create) the file in the local scratch/tmp directory (which one depends on the cluster) and read from/write to this local file. At the end, the file can be moved back to the project space (see the sketch after this list)
  • Depending on the cluster, some nodes might have faster Ethernet connections than others. Requesting a node with a faster Ethernet connection might help jobs with high I/O demands
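
For example, here is a sketch of a single-task Slurm job script that stages a file through node-local scratch. The file names, the program name, and the local scratch path are placeholders; the actual local-scratch location and environment variables depend on the cluster:

    #!/bin/bash
    #SBATCH --time=01:00:00
    #SBATCH --ntasks=1

    # Stage the input to node-local scratch; the local path is cluster-specific,
    # $TMPDIR (when the cluster sets it) or /tmp being common conventions.
    LOCAL_DIR="${TMPDIR:-/tmp}/$SLURM_JOB_ID"
    mkdir -p "$LOCAL_DIR"
    cp "$SLURM_SUBMIT_DIR/input.dat" "$LOCAL_DIR/"

    # Run the (hypothetical) program against the local copy.
    cd "$LOCAL_DIR"
    "$SLURM_SUBMIT_DIR/my_program" input.dat > output.dat

    # Move the result back to the submit/project directory and clean up.
    cp output.dat "$SLURM_SUBMIT_DIR/"
    rm -rf "$LOCAL_DIR"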

The other way this question could be interpreted is: how should I arrange and submit my (many) jobs so that they start running (and finish) as soon as possible? There are a couple of suggestions in this case:

  • For many similar jobs, use an array job instead of submitting many individual jobs (see the sketch after this list)
  • For many very short tasks, combine them into a single job instead of submitting many very short jobs
  • Make sure you request appropriate resources for your jobs: do not request resources that your job does not need (a very long time limit, too many cores, etc.)
  • Study carefully the resources provided with the cluster. Some clusters have specific queues dedicated to jobs that would otherwise have a long waiting time in the general queue.
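
As an illustration of the first point, a sketch of a Slurm array job that replaces 100 separate submissions with one (the script name, input naming scheme, and array range are hypothetical):

    #!/bin/bash
    #SBATCH --job-name=many-tasks
    #SBATCH --time=00:30:00
    #SBATCH --ntasks=1
    #SBATCH --array=1-100        # 100 array tasks in a single submission

    # Each array task processes its own input, selected by the task index.
    # "process_one.sh" and the input naming scheme are placeholders.
    ./process_one.sh "input_${SLURM_ARRAY_TASK_ID}.dat"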

Jack’s COMMENT: Asking Google “slurm options for best throughput” gives as the first result “Tuning Slurm Scheduling for Optimal Responsiveness and Utilization” -> https://slurm.schedmd.com/SUG14/sched_tutorial.pdf

ANSWER:

In general, you should specify to the scheduler all the requirements for your job, and not add constraints which are not requirements. Adding extra
constraints can often cause the job to spend additional time waiting in the queue,
which will degrade overall throughput.

(NOTE: The Slurm and Moab options below can be affected by how the
specific cluster is configured, so while valid for many such clusters they may
not be valid for all clusters).

Some common requirements to consider:

  1. Walltime for the job.
    Typically every job has a walltime limit (either specified by the user or a default), and the scheduler will terminate jobs when the walltime is exceeded (whether the job has completed or not). So you usually want to specify the smallest walltime within which you are sure your job (if running properly) will finish. I.e., if you expect the job will usually finish in 8 hours, you might want to pad this to 9 or 10 hours just to be certain, but 24 hours is probably excessive. Shorter jobs might get greater priority, and typically can better take advantage of backfilling, both of which will shorten the time spent waiting in the queue.
    On Slurm, this is done with the --time=TIME or -t TIME flags, where TIME can
    be the number of minutes, or something like HOURS:MINS:SECS or DAYS-HOURS:MINS
    For Moab systems this is typically done with something like “-l walltime=TIME”
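
    For example, a job expected to finish in about 8 hours might be padded to 10 (the script name is a placeholder):

        # Request a 10-hour walltime at submission time (equivalent to "-t 10:00:00")
        sbatch --time=10:00:00 my_job.sh

        # or, equivalently, inside the batch script itself
        #SBATCH --time=10:00:00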

  2. The number of nodes/CPU cores to be allocated
    The scheduler allocates CPU cores on nodes to jobs, so it needs to know what
    you want. I am assuming you know what is best for your code/problem. The
    important thing is that you tell the scheduler what you need/want. In particular,
    some codes can use multiple CPUs, but only if they are on the same node, in which
    case you need to tell the scheduler that (e.g. in Slurm, to get 8 cores on a
    single node, something like “--nodes=1 --ntasks=8” or, more specifically,
    “--ntasks=1 --cpus-per-task=8”; on Moab systems, this is usually specified
    with something like “-l nodes=1:ppn=8”).
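
    A sketch of the corresponding batch-script directives (which form is most appropriate depends on how your code uses the cores):

        # Slurm: 8 cores on a single node, e.g. for a multithreaded (OpenMP) code
        #SBATCH --nodes=1
        #SBATCH --ntasks=1
        #SBATCH --cpus-per-task=8

        # Rough Moab/Torque equivalent
        #PBS -l nodes=1:ppn=8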

For pure MPI jobs, however, it often does not matter much how the tasks are
distributed among nodes (at least when all the nodes can talk to each other at
full bandwidth over the high speed interconnect). In such cases, it is usually
best to only tell the scheduler how many tasks you wish to run, and not constrain
the scheduler in how to distribute the tasks among nodes. Specifying a node
count can cause the job to spend more time in the queue, and is not advised
unless doing so will significantly improve runtime.

To request 8 single-core MPI tasks, letting the scheduler distribute them
over nodes as it sees fit, on Slurm systems one would use “--ntasks=8” without
the --nodes flag. On Moab systems, it would be something like “-l nodes=8”
(the “nodes” in this case refer to “virtual nodes”, i.e. processor cores).
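
A minimal sketch of such a Slurm submission (the program name is a placeholder):

    #!/bin/bash
    #SBATCH --ntasks=8    # 8 single-core MPI tasks; no --nodes, so the scheduler places them freely

    # srun launches one copy of the (hypothetical) MPI program per allocated task.
    srun ./my_mpi_program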

  3. Specifying the amount of memory needed
    You typically should specify how much memory the job needs. Different schedulers
    give somewhat different options in how this can be specified (total for the
    job, for each node in the job, for each task in the job, for each allocated
    CPU core, etc). It is often most useful to specify per processor core or task.
    On Slurm, to specify the amount of memory per CPU core allocated, one could
    use “--mem-per-cpu=MEMORY”. Moab users can use something
    like “-l pmem=MEMORY”. In both cases, MEMORY is in MB.

Sometimes users will try to “trick” the scheduler. For example, if the compute nodes
have 64 GB of RAM and 16 cores per node, and your MPI job needs 16 cores and
8 GB of RAM per core, you might request 16 cores over 2 nodes and leave out the
memory specification. But this can in some cases delay the scheduling of your job;
it is better to tell the scheduler what you really want, e.g.
“--ntasks=16 --mem-per-cpu=8192” in Slurm. There could be partially utilized nodes
with free cores and 8 GB of RAM per core available on which your tasks could start
immediately, instead of waiting for 2 complete nodes to become free.
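
A sketch of that recommended request in Slurm (the program name is a placeholder; the memory is given in MB here, matching the note above):

    #!/bin/bash
    #SBATCH --ntasks=16           # 16 single-core MPI tasks
    #SBATCH --mem-per-cpu=8192    # 8 GB per allocated core, in MB

    srun ./my_mpi_program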

  4. Allowing your job to share nodes with other jobs
    Most schedulers support several policies with respect to whether more than one
    job can run on the same node (assuming sufficient resources). Depending on the
    cluster, this might or might not be selectable by the user submitting the job.
    If it is an option available to you, in general allowing your job to run on the
    same nodes as other jobs might reduce the amount of time your job waits in the
    queue.

There are potential negatives that should be considered as well: running on the
same node as other jobs increases the potential for the other
jobs to interfere with your job. While the operating system and cluster
configuration make some effort to mitigate this, such efforts are not perfect.
I/O and network bandwidth are typically shared among all jobs on a node, and
so are usually vulnerable to such interference.

Typically, if you are running a very parallel code, running on (and using most
of the cores of) many nodes, it is best not to share nodes with other jobs.
The width of these jobs generally increases their vulnerability to
such interference, and also significantly reduces any scheduling benefit from sharing.

But if your job is not using all the cores available on a typical node, allowing
it to run on the same nodes as other jobs will usually give enough benefit
in the scheduling of the job to more than offset the risk of a job needing
to be rerun due to negative interference from another job.
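
Where the cluster exposes this choice to users, the relevant Slurm options are sketched below; whether they are honored, and what the default is, depends on how the partitions are configured:

    # Request whole nodes, with no other jobs running alongside
    #SBATCH --exclusive

    # Or, where permitted, allow the allocation to share resources with other jobs
    #SBATCH --oversubscribe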
