Sometimes I get an error on Monsoon when I use an interpreted language called davinci. The error is below:
/var/spool/slurm/slurmd/job33248702/slurm_script: ./Paleobedforms_load_files_run_krc.dv: /packages/davinci/2.22/bin/davinci: bad interpreter: Permission denied
I just don't get why, if I send 700 arrays through Monsoon, some fail with a permission error but the rest work. Even more frustrating, when I rerun the same thing, different arrays fail. I have no idea what the deal is.
It's been suggested that maybe it's a node thing? Is there a way to see which nodes your SLURM scripts ran on within Monsoon (e.g., Sandy Bridge Xeon, Haswell Xeon, etc.)?
Just a quick followup – it turns out that there was a misconfiguration on the Slurm-node-partition that Aaron’s array-job was running on. So, occasionally, member-jobs of the array-job were assigned to a node that was actually restricted to a subset of users.
This was verified by scanning through an sacct listing of all members of the array job and observing the specifics of the failed jobs. Basically, there are two key steps that users may want to follow when investigating their own array runs:
First, if you only have the JobID of a “child” job, begin by finding the JobID of the “parent” array job. Here’s one way to do this:
# the parent job ID (& task-ID) are listed in the first column
$ sacct -j 99025
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
99000_117    myjob      core       slurm99             1 FAILED     126:0
99000_11+    batch                 slurm99             1 FAILED     126:0
99000_11+    extern                slurm99             1 COMPLETED  0:0
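If you want to script that lookup, here is a minimal sketch. It assumes array members always show up in the `<parent>_<task>` form seen in the first column above. `parent_of` is a made-up helper name, not a Slurm command.

```shell
#!/bin/sh
# parent_of: hypothetical helper. Strips the "_<task>" suffix from an
# array-style JobID such as 99000_117. A plain JobID passes through unchanged.
parent_of() {
    printf '%s\n' "${1%%_*}"
}

# Example using the JobID string from the listing above:
parent_of 99000_117    # prints 99000

# On a live cluster you could feed it from sacct
# (-n: no header, -X: allocation line only, -o: choose columns):
#   parent_of "$(sacct -n -X -o JobID -j 99025 | tr -d ' ')"
```

The live-cluster line is left as a comment because it only works where `sacct` can see job 99025.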
Then use sacct again to list details of its “children”:
# 'sacct' limits column width by default, so my workaround below
# is to use "parsable" output and run that through 'column'
$ sacct -p -o NodeList,State,JobID,JobIDRaw,ExitCode -j 99000 | grep FAIL | column -s'|' -t
cn789  FAILED  99000_111        99019        126:0
cn789  FAILED  99000_111.batch  99019.batch  126:0
cn789  FAILED  99000_112        99020        126:0
cn789  FAILED  99000_112.batch  99020.batch  126:0
cn789  FAILED  99000_113        99021        126:0
cn789  FAILED  99000_113.batch  99021.batch  126:0
cn789  FAILED  99000_117        99025        126:0
cn789  FAILED  99000_117.batch  99025.batch  126:0
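When an array has hundreds of members, it can be quicker to count failures per node than to read the list by eye. The sketch below does that on the same parsable (`-p`) output. `failures_by_node` is a hypothetical helper, and the `cn123` row in the sample input is invented for illustration.

```shell
#!/bin/sh
# failures_by_node: hypothetical helper. Reads 'sacct -p' style lines
# (NodeList|State|JobID|...) on stdin and prints "<count> <node>" for each
# node that hosted a FAILED task. Step lines such as ".batch" are skipped
# so each task is counted once.
failures_by_node() {
    awk -F'|' '$2 == "FAILED" && $3 !~ /\./ { n[$1]++ }
               END { for (node in n) print n[node], node }' | sort -rn
}

# Sample input in the format shown above (the cn123 row is made up):
failures_by_node <<'EOF'
NodeList|State|JobID|JobIDRaw|ExitCode|
cn789|FAILED|99000_111|99019|126:0|
cn789|FAILED|99000_111.batch|99019.batch|126:0|
cn123|COMPLETED|99000_110|99018|0:0|
cn789|FAILED|99000_112|99020|126:0|
EOF
```

On the cluster, the same function could be fed directly: `sacct -p -o NodeList,State,JobID,JobIDRaw,ExitCode -j 99000 | failures_by_node`. If every failure lands on one node, as it did here, that points at the node rather than the script.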
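Note that the ExitCode column may help separate script failures from scheduler/cluster failures. Here, 126 is the shell's status for a command that was found but could not be executed, which matches the "bad interpreter: Permission denied" message.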
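As a rough aid for reading that column, here is a small sketch that turns sacct's `exit:signal` pair into a plain description. It relies only on the usual shell conventions (126 for "cannot execute", 127 for "not found") and is not an official Slurm mapping. `explain_exit` is a hypothetical name.

```shell
#!/bin/sh
# explain_exit: hypothetical helper. Takes an "exit:signal" pair as shown
# in sacct's ExitCode column and prints a rough interpretation.
explain_exit() {
    code=${1%%:*}   # part before the colon: exit status
    sig=${1##*:}    # part after the colon: terminating signal, 0 if none
    if [ "$sig" -ne 0 ]; then
        echo "killed by signal $sig"
    elif [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -eq 126 ]; then
        echo "command found but could not be executed"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    else
        echo "script exited with status $code"
    fi
}

# The code seen on every failed task above:
explain_exit 126:0
```

A 126 or 127 on only some members of an otherwise identical array suggests the environment on particular nodes, while an ordinary non-zero status more often comes from the script itself.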