I have a set of jobs that die without an error message. How can I tell whether the cause is the job itself, the scheduler (I’m on an SGE cluster), or both?
If it is the job, how can I get more information for troubleshooting?
This is a partial answer, covering the SLURM job scheduler. When a job is killed because it exceeded its time limit, an error message like the following appears at (or near) the very end of the output file (or the stderr file, if you use the -e option):
slurmstepd: error: *** JOB 8841014 ON coreV2-22-017 CANCELLED AT 2019-03-08T11:30:03 DUE TO TIME LIMIT ***
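If you want that message in a separate stderr file, a matching -e directive could look like the sketch below; the %x.e%j name (job name followed by .e and the job ID, mirroring the %x.o%j pattern used in the script further down) is just one choice:
#SBATCH -e %x.e%j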
Here is an example job that will time out:
#!/bin/bash
# 20190308
# Test SLURM
# Demo for a job that will time out
# For 1 task
#SBATCH -n 1
# Job name
#SBATCH -J Timeout
# Wall-clock time limit of one minute
#SBATCH -t 00:01:00
# Output file name: <job name>.o<job ID>
#SBATCH -o %x.o%j
## Additional switches may need to be specified on your system
echo "Start date: $(date)"
echo
echo "Sleeping"
set -x
sleep 10m
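Submit it with sbatch; the script file name here is just an example:
sbatch timeout_demo.sh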
The output is:
Start date: Fri Mar 8 11:28:49 EST 2019
Sleeping
+ sleep 10m
slurmstepd: error: *** JOB 8841014 ON coreV2-22-017 CANCELLED AT 2019-03-08T11:30:03 DUE TO TIME LIMIT ***
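The scheduler-side accounting tells the same story. A quick check with sacct, reusing the job ID from the output above (the columns available depend on your site's accounting setup):
sacct -j 8841014 --format=JobID,JobName,State,ExitCode,Elapsed,Timelimit
A State of TIMEOUT confirms that the scheduler enforced the limit. If instead the job's own code had crashed, you would typically see FAILED with a non-zero ExitCode and no slurmstepd message in the output file.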