Running a job on Europa, but the job does nothing?

I recently submitted four high-throughput parameter-screening jobs (four separate sbatch commands) on Europa using the launcher framework. Three of the four ran the parameter-screening executables as intended, but one of them did not run them at all. As of this writing, that job is still running, yet it is not executing any of the programs I intended it to run. Sshing into the compute node and running top showed the node doing nothing, which is very odd to me.
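
For reference, here is roughly how I checked the node (a minimal sketch; compute-2-8-28 is the node from the scontrol output further down):

squeue -u $USER       # find which node the job is on (NODELIST column)
ssh compute-2-8-28    # ssh into the assigned compute node
top -u $USER          # expect paramrun and worker processes; I saw only top itself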

The batch script:

#!/bin/bash
#SBATCH -J Launcher
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -p normal
#SBATCH -t 48:00:00
#SBATCH -o ExecutionList3.%j.out
#SBATCH -e ExecutionList3.%j.err

module load launcher                         # sets LAUNCHER_DIR
export LAUNCHER_WORKDIR=$PWD                 # run tasks from the submission directory
export LAUNCHER_SCHED=interleaved            # interleave job-file lines across the 16 slots
export LAUNCHER_JOB_FILE=ExecutionList3      # file with one command per line
export PATH=$PATH:$HOME/usr/bin/python3.6    # note: this appends what looks like a file, not a directory
alias python="/usr/bin/python3.6"            # note: aliases are not expanded in non-interactive scripts
$LAUNCHER_DIR/paramrun                       # launcher entry point

end of batch script.
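
One thing worth flagging in the script above, though it is unrelated to this node issue (the same script ran fine on three other nodes): bash does not expand aliases in non-interactive scripts, so the alias line has no effect on the commands launcher runs, and the PATH export appends what looks like a file rather than a directory. If the commands in ExecutionList3 need python to resolve to python3.6, something like this would be more reliable (a sketch, assuming the commands invoke python by name):

mkdir -p $HOME/bin
ln -sf /usr/bin/python3.6 $HOME/bin/python   # make "python" point at python3.6
export PATH=$HOME/bin:$PATH                  # prepend so it wins over any system python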

Running scontrol on the job shows this:

[aak152030@europa output]$ scontrol show job=63809
JobId=63809 JobName=Launcher
UserId=aak152030(533757) GroupId=matsci(382) MCS_label=N/A
Priority=4294885886 Nice=0 Account=utdallas QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=17:25:53 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2021-07-08T14:35:58 EligibleTime=2021-07-08T14:35:58
AccrueTime=2021-07-08T14:35:58
StartTime=2021-07-08T17:09:30 EndTime=2021-07-10T17:09:30 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-07-08T17:09:30
Partition=normal AllocNode:Sid=europa:3715867
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-2-8-28
BatchHost=compute-2-8-28
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,node=1,billing=16
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/aak152030/kmc_sets/Set16/Set_f/MyLauncher3.slurm
WorkDir=/scratch/aak152030/kmc_sets/Set16/Set_f
StdErr=/scratch/aak152030/kmc_sets/Set16/Set_f/ExecutionList3.63809.err
StdIn=/dev/null
StdOut=/scratch/aak152030/kmc_sets/Set16/Set_f/ExecutionList3.63809.out
Power=
MailUser=(null) MailType=NONE

end of scontrol output.

Part of the output from running top on the node:

top - 10:43:44 up 78 days, 20:16, 1 user, load average: 1.14, 1.56, 1.64
Tasks: 235 total, 1 running, 234 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32110.5 total, 30538.1 free, 458.6 used, 1113.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30347.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1178358 aak1520+  20   0   54356   4168   3280 R   0.3   0.0   0:00.80 top
      1 root      20   0  242236  11136   8680 S   0.0   0.0   0:32.91 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:01.57 kthreadd
      3 root       0 -20      0      0      0 I   0.0   0.0   0:00.00 rcu_gp
Looks like there are a few nodes in the queue that are not properly mounting scratch. This could be the cause of your issue.
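
For anyone hitting something similar, a quick way to check whether /scratch is actually mounted on a suspect node (a sketch; /scratch is the filesystem holding the job's WorkDir above):

ssh compute-2-8-28 'mount | grep scratch; df -h /scratch'
# no scratch line from mount means the filesystem is not mounted;
# df will then report the parent filesystem instead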

Is it job ID 63809 that’s causing the problem?

I’ve removed that problem node from the queue system. I’d say cancel that job and resubmit, and let us know if the issue happens again.
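
For completeness, cancelling and resubmitting would look like this (a sketch, using the paths from the scontrol output above):

scancel 63809                                # cancel the stuck job
cd /scratch/aak152030/kmc_sets/Set16/Set_f   # the job's WorkDir
sbatch MyLauncher3.slurm                     # resubmit the same batch script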

@solj is currently reviewing the cluster to see what (if any) other nodes are causing issues.

Yes, it was job ID 63809 that seemed problematic to me. It seems that when you removed node compute-2-8-28 from the queue, the same job ID 63809 automatically restarted on a different compute node, presumably because the job has Requeue=1 in the scontrol output above. The restarted job is working as I intended. If you restarted job 63809 on a different node for me, thank you so much!
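
A quick way to confirm that the job was requeued rather than newly submitted (a sketch; after a requeue, the Restarts counter increments from the 0 shown above and NodeList changes):

scontrol show job 63809 | grep -E 'JobState|Restarts|NodeList'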