Hi, While reviewing the records of the many jobs (~100-200) I submitted on Europa, I noticed that a few timed out for no apparent reason, even though I expected them to finish within the time I allotted. I don't know whether those jobs timed out because of a problem in the code I chose to compile on Europa or because the compute node had trouble mounting the scratch file systems. Luckily, I managed to recover the job number and the compute node name for one example of a job that never finished.
Job number: 102214
Node Name: compute-1-10-36
Here are the .err and .out files I have for that job:
The .err file:
using /tmp/launcher.102214.hostlist.WOEddYdH to get hosts
starting job on compute-1-10-36
slurmstepd: error: *** JOB 102214 ON compute-1-10-36 CANCELLED AT 2021-08-07T06:34:01 DUE TO TIME LIMIT ***
%%%%%%%%%%%%%%%%%%%%%%%%%%%% END OF ERR FILE
The .out file:
Launcher: Setup complete.
------------- SUMMARY ---------------
Number of hosts: 1
Working directory: /scratch/aak152030/kmc_sets/Set17/Set_e
Processes per host: 16
Total processes: 16
Total jobs: 16
Scheduling method: interleaved
Launcher: Starting parallel tasks…
Launcher: Task 2 running job 3 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.700_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 0 running job 1 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.633_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 13 running job 14 on compute-1-10-36 (cd G1/run2_Aw100/Bias_-0.067_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 15 running job 16 on compute-1-10-36 (cd G1/run2_Aw100/Bias_0.000_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 3 running job 4 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.733_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 9 running job 10 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.933_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 10 running job 11 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.967_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 7 running job 8 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.867_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 4 running job 5 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.767_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 11 running job 12 on compute-1-10-36 (cd G1/run1_Aw50/Bias_1.000_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 12 running job 13 on compute-1-10-36 (cd G1/run2_Aw100/Bias_-0.100_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 1 running job 2 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.667_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 8 running job 9 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.900_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 5 running job 6 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.800_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 14 running job 15 on compute-1-10-36 (cd G1/run2_Aw100/Bias_-0.033_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 6 running job 7 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.833_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Job 3 completed in 411 seconds.
Launcher: Task 2 done. Exiting.
Launcher: Job 1 completed in 476 seconds.
Launcher: Task 0 done. Exiting.
Launcher: Job 2 completed in 467 seconds.
Launcher: Task 1 done. Exiting.
Launcher: Job 6 completed in 505 seconds.
Launcher: Task 5 done. Exiting.
Launcher: Job 8 completed in 579 seconds.
Launcher: Task 7 done. Exiting.
Launcher: Job 9 completed in 606 seconds.
Launcher: Task 8 done. Exiting.
Launcher: Job 10 completed in 615 seconds.
Launcher: Task 9 done. Exiting.
Launcher: Job 11 completed in 623 seconds.
Launcher: Task 10 done. Exiting.
Launcher: Job 12 completed in 639 seconds.
Launcher: Task 11 done. Exiting.
Launcher: Job 16 completed in 23798 seconds.
Launcher: Task 15 done. Exiting.
Launcher: Job 15 completed in 24567 seconds.
Launcher: Task 14 done. Exiting.
Launcher: Job 14 completed in 28112 seconds.
Launcher: Task 13 done. Exiting.
Launcher: Job 13 completed in 29120 seconds.
Launcher: Task 12 done. Exiting.
%%%%%%%%%%%%%%%%%%%%%%%%%%% END OF .OUT FILE
Looking at the .out file, the longest run time among the jobs that completed is just under 30000 seconds (job 13 took 29120 seconds), while jobs 4, 5, and 7 never report completion. I expected those jobs to finish within that time frame as well, but they did not.
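In case it helps with triage, here is a minimal sketch of how the started and completed jobs in a Launcher .out file could be compared to list the ones that never reported completion. It assumes the line formats are exactly those shown in the .out file above; the function name `unfinished_jobs` and the `SAMPLE` excerpt are only illustrative and are not part of Launcher.

```python
import re

# Line formats assumed from the .out file above.
STARTED = re.compile(r"Launcher: Task (\d+) running job (\d+) on (\S+) \((.*)\)\s*$")
COMPLETED = re.compile(r"Launcher: Job (\d+) completed in (\d+) seconds\.")


def unfinished_jobs(log_text):
    """Return (missing, finished) for a Launcher .out log.

    missing:  {job number: command} for jobs that started but never completed
    finished: {job number: run time in seconds} for jobs that completed
    """
    started = {}
    finished = {}
    for line in log_text.splitlines():
        match = STARTED.search(line)
        if match:
            started[int(match.group(2))] = match.group(4)
            continue
        match = COMPLETED.search(line)
        if match:
            finished[int(match.group(1))] = int(match.group(2))
    missing = {job: cmd for job, cmd in started.items() if job not in finished}
    return missing, finished


# Short excerpt copied from the .out file above (illustrative only).
SAMPLE = """\
Launcher: Task 2 running job 3 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.700_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Task 3 running job 4 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.733_V/output/; python3 executor3-2.py; cd …/…/…/…/)
Launcher: Job 3 completed in 411 seconds.
Launcher: Task 2 done. Exiting.
"""

missing, finished = unfinished_jobs(SAMPLE)
for job, cmd in sorted(missing.items()):
    print(f"job {job} never completed: {cmd}")
```

To run it on a whole log, the contents of the .out file can be read into a string and passed to `unfinished_jobs` in place of `SAMPLE`.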
I hope I did not provide too much needless information here.