If a SLURM job fails or terminates unexpectedly, what mechanisms are available for making sure that temporary data, especially that on compute nodes, is cleaned up?
If you use local storage on the node (such as $TMPDIR) for temporary data, it is automatically purged when the job exits. The location of $TMPDIR and any size limits differ between HPC clusters, so check with your local HPC provider.
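As an illustration only, here is a minimal sketch of a SLURM batch script that stages data through node-local $TMPDIR; the paths, the program name, and the assumption that your cluster defines and purges $TMPDIR are hypothetical, so adapt them to your site.

#!/bin/bash
#SBATCH --job-name=tmpdir-example
#SBATCH --time=00:30:00

# Work in node-local scratch so large temporaries never touch shared storage.
# On many clusters this directory is purged automatically when the job ends.
workdir="${TMPDIR:-/tmp}/myjob_${SLURM_JOB_ID}"
mkdir -p "$workdir"

cp /shared/project/input.dat "$workdir/"      # stage input onto the node
cd "$workdir"
my_program input.dat > output.dat             # hypothetical application

cp output.dat /shared/project/results/        # copy results back before exit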
If there is other cleanup you need to do, you can use bash's error handling. The following is based on a Stack Overflow answer: How to trap ERR when using 'set -e' in Bash - Stack Overflow
#!/bin/bash
set -eE  # same as: `set -o errexit -o errtrace`
trap 'cleanup' ERR

function cleanup(){
    echo "FAILED! Cleaning up..."
    # rm stuff ...
}

function func(){
    # Listing /root/ fails for a non-root user, which triggers the ERR trap.
    ls /root/
}

func
The call to func will fail, which will cause cleanup to be called. Note that any command failure that isn't part of a conditional will cause this cleanup function to be called and the whole script to exit. See the relevant section of the man page: bash(1): GNU Bourne-Again SHell - Linux man page
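For instance (a minimal sketch), a failing command whose result is tested in a conditional does not trigger the ERR trap, so the script keeps going:

#!/bin/bash
set -eE
trap 'echo "ERR trap fired"' ERR

# A failure that is part of a conditional does not fire the ERR trap or exit:
if ! ls /root/ ; then
    echo "ls failed, but the script continues"
fi
ls /root/ || echo "handled inline, still no ERR trap"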
You can also trap EXIT, instead of or in addition to ERR, if you want the cleanup to run when the script completes normally as well. An EXIT trap runs whether or not there was an error, and it runs after the ERR trap if both are set.
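Here is a minimal sketch combining both ideas, assuming a node-local scratch directory that you want removed whether the job succeeds or fails; the directory layout and the program name are hypothetical.

#!/bin/bash
set -eE  # exit on error, with ERR traps inherited by functions

# Hypothetical scratch directory on node-local storage.
scratch="${TMPDIR:-/tmp}/myjob_${SLURM_JOB_ID:-$$}"
mkdir -p "$scratch"

function cleanup(){
    echo "Cleaning up $scratch"
    rm -rf "$scratch"
}
trap 'cleanup' EXIT   # runs on normal completion and on errors, after any ERR trap

my_program --workdir "$scratch"   # hypothetical application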