SLURM: If my job fails, how can I ensure that temporary data are cleaned up?

If a SLURM job fails or terminates unexpectedly, what mechanisms are available for making sure that temporary data, especially that on compute nodes, is cleaned up?

If you store temporary data on the node's local storage (such as $TMPDIR), it is automatically purged when the job exits. The location and size limits of $TMPDIR vary between HPC clusters, so check with your local HPC provider.
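As a rough sketch of that pattern (the file names, the program, and the SBATCH options below are placeholders, and $TMPDIR may be named differently on your cluster), a job can stage its data into $TMPDIR, work there, and copy results back before exiting:

#!/bin/bash
#SBATCH --job-name=tmpdir-example
#SBATCH --time=00:10:00

# Stage input into node-local scratch, run there, and copy results back.
# $SLURM_SUBMIT_DIR is set by SLURM to the directory sbatch was run from.
cd "$TMPDIR"
cp "$SLURM_SUBMIT_DIR/input.dat" .                        # hypothetical input file
"$SLURM_SUBMIT_DIR/my_program" input.dat > output.dat     # hypothetical program
cp output.dat "$SLURM_SUBMIT_DIR/"

# Anything left behind in $TMPDIR is purged automatically when the job ends.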

If there is other cleanup you need to do, you can use bash's error handling. The following is based on a Stack Overflow answer: interrupt handling - How to trap ERR when using 'set -e' in Bash - Stack Overflow

#!/bin/bash

set -eE  # same as: `set -o errexit -o errtrace`
trap 'cleanup' ERR   # run cleanup whenever a command fails

function cleanup(){
  echo "FAILED! Cleaning up..."
  # rm stuff ...
}

function func(){
  ls /root/   # fails for a non-root user (permission denied)
}

func

The call to func will fail, which causes cleanup to be called. Note that any command failure that is not part of a conditional will cause this cleanup function to be called and the whole script to exit. See the relevant section of the bash man page: bash(1): GNU Bourne-Again SHell - Linux man page
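For example (assuming the same script as above), a failure whose result is tested in a conditional does not trigger the ERR trap, while a bare failing command does:

# Failure tested in a conditional: does NOT trigger the ERR trap,
# and does not stop the script under `set -e`.
if ! ls /root/; then
  echo "handled the failure explicitly"
fi

# Bare failing command: triggers the ERR trap and exits the script.
ls /root/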

You can also trap EXIT instead if you want the cleanup to run when the script completes as well; an EXIT trap runs whether or not there was an error. If both ERR and EXIT are trapped, the ERR trap runs first and the EXIT trap second.
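As a minimal sketch of that variant (reusing the cleanup function from above, with the same hypothetical failing command):

#!/bin/bash

set -eE  # exit on error; errtrace is not needed for the EXIT trap itself

function cleanup(){
  echo "Job ended with status $?. Cleaning up..."
  # rm stuff ...
}

# EXIT fires on normal completion, on `exit`, and after a fatal error,
# so cleanup always runs.
trap 'cleanup' EXIT

ls /root/   # fails for a non-root user; cleanup still runs on exit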