Addressing a permission error and checking which nodes SLURM uses (e.g., Sandy Bridge Xeon, Haswell Xeon, etc.)

aaronweintraub · September 22, 2020, 4:48pm

Sometimes I get an error on Monsoon when I use a particular interpreter called davinci. The error is below:

/var/spool/slurm/slurmd/job33248702/slurm_script: ./Paleobedforms_load_files_run_krc.dv: /packages/davinci/2.22/bin/davinci: bad interpreter: Permission denied

I just don't get why, if I send 700 array tasks through Monsoon, some fail with a permission error while the rest work. Even more frustrating, when I rerun the same thing, different array tasks fail. I have no idea what the deal is.

It's been suggested that maybe it's a node thing? Is there a way to see which nodes your SLURM scripts ran on within Monsoon (e.g., Sandy Bridge Xeon, Haswell Xeon, etc.)?

jasonbuechler · September 23, 2020, 11:12pm

Just a quick follow-up: it turns out that there was a misconfiguration on the Slurm partition that Aaron’s array job was running on. As a result, member jobs of the array job were occasionally assigned to a node that was actually restricted to a subset of users. Since node assignment can differ from run to run, that would also explain why different array tasks failed on each rerun.

This was verified by scanning through an sacct listing of all members of the array job and looking at the specifics of the failed jobs. Basically, there are two key steps that users may want to follow when investigating their own array runs:

First, if you only have the JobID of a “child” job, begin by finding the JobID of the “parent” array job. Here’s one way to do this:

# the parent job ID (& task-ID) are listed in the first column

$ sacct -j 99025
    JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
--------- ---------- ---------- ---------- ---------- ---------- --------
99000_117      myjob       core    slurm99          1     FAILED    126:0
99000_11+      batch               slurm99          1     FAILED    126:0
99000_11+     extern               slurm99          1  COMPLETED      0:0
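If you want to script this step, the parent ID is simply the part of the JobID before the underscore. Below is a minimal sketch, assuming the JobID has the `<parent>_<task>` form shown above; `parent_job_id` is a hypothetical helper name, not a Slurm command.

```shell
# Hypothetical helper: derive the parent array job ID from a JobID
# as printed by sacct, e.g. "99000_117" or "99000_117.batch" -> "99000".
parent_job_id() {
    id=${1%%.*}    # drop a step suffix such as ".batch", if present
    id=${id%%_*}   # drop the array task index, if present
    printf '%s\n' "$id"
}

# Possible usage (assumes sacct is available on the cluster):
#   -n drops the header, -X lists only the job allocation (no steps),
#   -P gives untruncated, pipe-delimited output.
# parent_job_id "$(sacct -n -X -P -o JobID -j 99025)"
```

A JobID with no underscore (a non-array job) is returned unchanged.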

Then use sacct again to list details of its “children”:

# 'sacct' limits column width by default, so my workaround below
# is to use "parsable" output and run that through 'column'

$ sacct -p -o NodeList,State,JobID,JobIDRaw,ExitCode -j 99000 | grep FAIL | column -s'|' -t 
cn789  FAILED  99000_111        99019        126:0
cn789  FAILED  99000_111.batch  99019.batch  126:0
cn789  FAILED  99000_112        99020        126:0
cn789  FAILED  99000_112.batch  99020.batch  126:0
cn789  FAILED  99000_113        99021        126:0
cn789  FAILED  99000_113.batch  99021.batch  126:0
cn789  FAILED  99000_117        99025        126:0
cn789  FAILED  99000_117.batch  99025.batch  126:0

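Note that the ExitCode column may help separate script failures from scheduler/cluster failures. Here, 126 is the shell’s conventional exit status for a command that was found but could not be executed, which matches the “Permission denied” error above.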
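The ExitCode field has the form `<exit status>:<signal>`. As a small sketch of how it could be read in a script, the hypothetical helper below (`describe_exit`, not part of Slurm) splits the field and labels a few common shell conventions. A script can also return these values for its own reasons, so treat the labels as hints only.

```shell
# Hypothetical helper: interpret an sacct ExitCode field such as "126:0".
describe_exit() {
    code=${1%%:*}   # part before the colon: exit status
    sig=${1##*:}    # part after the colon: terminating signal, 0 if none
    if [ "$sig" -ne 0 ]; then
        echo "terminated by signal $sig"
    elif [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -eq 126 ]; then
        echo "command found but not executable"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    else
        echo "exited with status $code"
    fi
}

# Example: describe_exit 126:0
```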