What tools are available on HPC clusters to help manage a large number of very small independent jobs (a few seconds each) of varying length and with some order dependencies as a single workflow?
CURATOR: Jack Smith
ANSWER: What you are looking for is something called a “Pilot Job”. It’s like a scheduler within a scheduler: you submit a single batch job that requests (reserves) a fixed set of resources (max cores, max CPU time, etc.) like any other batch job, but that job then spawns smaller jobs from within to make the best use of the resources it has reserved. It’s also sometimes called a “Big Job”.
One tool for managing such pilot jobs is RADICAL Pilot (http://radical-cybertools.github.io/radical-pilot/index.html) from the RADICAL group at Rutgers. It’s based on SAGA (Simple API for Grid Applications), also developed by the RADICAL group, which is a Python framework capable of spawning and managing multiple tasks (from single-threaded to MPI) with conditional workflows and staging of data between tasks.
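As a rough illustration, a pilot-job run in RADICAL Pilot looks something like the sketch below. The resource label, core count, and class names (PilotDescription/TaskDescription here; older releases used ComputePilotDescription/ComputeUnitDescription) are assumptions taken from the getting-started examples, so adapt them to your installation:

    import radical.pilot as rp

    session = rp.Session()
    try:
        pmgr = rp.PilotManager(session=session)
        tmgr = rp.TaskManager(session=session)

        # Reserve one block of resources (the "pilot"); values are placeholders.
        pd = rp.PilotDescription({'resource': 'local.localhost',  # your cluster's label
                                  'cores'   : 32,
                                  'runtime' : 60})                # minutes
        pilot = pmgr.submit_pilots(pd)
        tmgr.add_pilots(pilot)

        # Describe many small tasks and let the pilot schedule them internally.
        tds = []
        for i in range(1000):
            td = rp.TaskDescription()
            td.executable = '/bin/echo'
            td.arguments  = ['task %d' % i]
            tds.append(td)

        tmgr.submit_tasks(tds)
        tmgr.wait_tasks()
    finally:
        session.close()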
ANSWER:
There are several ways to manage inter-job dependencies. There are many toolkits for this, such as RADICAL (mentioned by Jacks9) or DRMAA.
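For example, with the drmaa Python bindings (assuming your site provides a DRMAA library for its scheduler and DRMAA_LIBRARY_PATH points at it; the script paths and array size below are placeholders), two dependent stages can be chained roughly like this:

    import drmaa

    with drmaa.Session() as s:
        # Stage 1: a bulk job (job array) of small tasks.
        jt = s.createJobTemplate()
        jt.remoteCommand = '/path/to/stage1.sh'    # placeholder
        stage1_ids = s.runBulkJobs(jt, 1, 100, 1)  # array indices 1..100
        s.synchronize(stage1_ids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
        s.deleteJobTemplate(jt)

        # Stage 2 starts only after every stage-1 task has finished.
        jt = s.createJobTemplate()
        jt.remoteCommand = '/path/to/stage2.sh'    # placeholder
        jobid = s.runJob(jt)
        s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        s.deleteJobTemplate(jt)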
Sometimes the answer is just to launch a batch job from within a job. The top job can run a script that submits a job array, waits for it to finish, and then runs the next job array, as sketched below.
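A minimal sketch of that pattern, assuming Slurm (the stage scripts and array size are placeholders): the outer batch job runs a small driver that submits each stage as a job array with sbatch --wait, which blocks until every array element has finished before the next stage is submitted.

    import subprocess

    # (script run by each array task, number of array tasks) per stage
    stages = [('stage1.sh', 500),
              ('stage2.sh', 500)]

    for script, ntasks in stages:
        # "sbatch --wait" returns only once the whole array has completed.
        subprocess.run(['sbatch', '--wait', '--array=1-%d' % ntasks, script],
                       check=True)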