Recently I submitted four high-throughput parameter-screening jobs (four separate sbatch commands) on Europa using the launcher framework. Three of the four ran the parameter-screening tasks as I intended, but the fourth did not run them at all. At the time of writing, that job is still in the RUNNING state but is not running any of the executables I intended it to run. When I ssh into the compute node and run top, the node appears to be doing nothing, which seems very odd to me.
The batch script:
#!/bin/bash
#SBATCH -J Launcher
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -p normal
#SBATCH -t 48:00:00
#SBATCH -o ExecutionList3.%j.out
#SBATCH -e ExecutionList3.%j.err
module load launcher
export LAUNCHER_WORKDIR=$PWD
export LAUNCHER_SCHED=interleaved
export LAUNCHER_JOB_FILE=ExecutionList3
export PATH=$PATH:$HOME/usr/bin/python3.6
alias python="/usr/bin/python3.6"
$LAUNCHER_DIR/paramrun
End of batch script.
Running scontrol on the job shows this:
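A side note on the alias line in the script above, which may or may not be related to the idle job: in bash, an alias defined in a script is not passed on to child processes, so anything started by another program would not see it. The following is only a minimal sketch to illustrate that behaviour. The alias name demo_py_alias is hypothetical and is not part of my actual setup.

```shell
#!/bin/bash
# Sketch: an alias defined in a script is local to that shell.
# "demo_py_alias" is a made-up name used only for this illustration.
alias demo_py_alias='echo hello-from-alias'

# Start a child shell, roughly the way a task runner would start a task,
# and ask whether it knows the name at all.
child_result=$(bash -c 'type demo_py_alias >/dev/null 2>&1 && echo found || echo missing')

# The child shell does not inherit the alias, so this reports "missing".
echo "child shell sees alias: $child_result"
```

If the tasks in the job file call python by name, they would resolve it through PATH rather than through the alias, so it may be worth checking which interpreter they actually pick up.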
[aak152030@europa output]$ scontrol show job=63809
JobId=63809 JobName=Launcher
UserId=aak152030(533757) GroupId=matsci(382) MCS_label=N/A
Priority=4294885886 Nice=0 Account=utdallas QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=17:25:53 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2021-07-08T14:35:58 EligibleTime=2021-07-08T14:35:58
AccrueTime=2021-07-08T14:35:58
StartTime=2021-07-08T17:09:30 EndTime=2021-07-10T17:09:30 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-07-08T17:09:30
Partition=normal AllocNode:Sid=europa:3715867
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-2-8-28
BatchHost=compute-2-8-28
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=16,node=1,billing=16
Socks/Node=* NtasksPerN:B:S:C=0:0:: CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/aak152030/kmc_sets/Set16/Set_f/MyLauncher3.slurm
WorkDir=/scratch/aak152030/kmc_sets/Set16/Set_f
StdErr=/scratch/aak152030/kmc_sets/Set16/Set_f/ExecutionList3.63809.err
StdIn=/dev/null
StdOut=/scratch/aak152030/kmc_sets/Set16/Set_f/ExecutionList3.63809.out
Power=
MailUser=(null) MailType=NON
End of scontrol output.
Part of the top output on the node:
top - 10:43:44 up 78 days, 20:16, 1 user, load average: 1.14, 1.56, 1.64
Tasks: 235 total, 1 running, 234 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32110.5 total, 30538.1 free, 458.6 used, 1113.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30347.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1178358 aak1520+ 20 0 54356 4168 3280 R 0.3 0.0 0:00.80 top
1 root 20 0 242236 11136 8680 S 0.0 0.0 0:32.91 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:01.57 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp