Running a Job on Europa, but the Job Never Finishes

Hi, when I was looking at the records of many jobs (~100-200) I submitted on Europa, I noticed that a few of them timed out for unknown reasons when I expected them to finish within the time I allotted. I don't know whether those jobs timed out because of a problem in the code I chose to compile on Europa or because the compute node had problems mounting the scratch file system. Luckily, I managed to recover the job number and the compute node name for one example of a job that never finished.

Job number: 102214
Node Name: compute-1-10-36

Here are the .err and .out files I have for that node.

The .err file:

using /tmp/launcher.102214.hostlist.WOEddYdH to get hosts
starting job on compute-1-10-36
slurmstepd: error: *** JOB 102214 ON compute-1-10-36 CANCELLED AT 2021-08-07T06:34:01 DUE TO TIME LIMIT ***
%%%%%%%%%%%%%%%%%%%%%%%%%%%% END OF ERR FILE
The .out file:

Launcher: Setup complete.

------------- SUMMARY ---------------
Number of hosts: 1
Working directory: /scratch/aak152030/kmc_sets/Set17/Set_e
Processes per host: 16
Total processes: 16
Total jobs: 16
Scheduling method: interleaved


Launcher: Starting parallel tasks...
Launcher: Task 2 running job 3 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.700_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 0 running job 1 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.633_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 13 running job 14 on compute-1-10-36 (cd G1/run2_Aw100/Bias_-0.067_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 15 running job 16 on compute-1-10-36 (cd G1/run2_Aw100/Bias_0.000_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 3 running job 4 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.733_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 9 running job 10 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.933_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 10 running job 11 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.967_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 7 running job 8 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.867_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 4 running job 5 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.767_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 11 running job 12 on compute-1-10-36 (cd G1/run1_Aw50/Bias_1.000_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 12 running job 13 on compute-1-10-36 (cd G1/run2_Aw100/Bias_-0.100_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 1 running job 2 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.667_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 8 running job 9 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.900_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 5 running job 6 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.800_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 14 running job 15 on compute-1-10-36 (cd G1/run2_Aw100/Bias_-0.033_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Task 6 running job 7 on compute-1-10-36 (cd G1/run1_Aw50/Bias_0.833_V/output/; python3 executor3-2.py; cd ../../../../)
Launcher: Job 3 completed in 411 seconds.
Launcher: Task 2 done. Exiting.
Launcher: Job 1 completed in 476 seconds.
Launcher: Task 0 done. Exiting.
Launcher: Job 2 completed in 467 seconds.
Launcher: Task 1 done. Exiting.
Launcher: Job 6 completed in 505 seconds.
Launcher: Task 5 done. Exiting.
Launcher: Job 8 completed in 579 seconds.
Launcher: Task 7 done. Exiting.
Launcher: Job 9 completed in 606 seconds.
Launcher: Task 8 done. Exiting.
Launcher: Job 10 completed in 615 seconds.
Launcher: Task 9 done. Exiting.
Launcher: Job 11 completed in 623 seconds.
Launcher: Task 10 done. Exiting.
Launcher: Job 12 completed in 639 seconds.
Launcher: Task 11 done. Exiting.
Launcher: Job 16 completed in 23798 seconds.
Launcher: Task 15 done. Exiting.
Launcher: Job 15 completed in 24567 seconds.
Launcher: Task 14 done. Exiting.
Launcher: Job 14 completed in 28112 seconds.
Launcher: Task 13 done. Exiting.
Launcher: Job 13 completed in 29120 seconds.
Launcher: Task 12 done. Exiting.

%%%%%%%%%%%%%%%%%%%%%%%%%%% END OF .OUT FILE
Looking at the .out file, the longest run time among my jobs is just under 30000 seconds, which makes me expect that the jobs that timed out should have finished within that time frame as well, but they did not.

I hope I did not provide too much needless information here.

No you didn’t. Diagnostic information like this is great!

So you're running on one host with 16 cores and 16 total jobs. That means the launcher runs one job per core, with all 16 running at the same time.

Tell me about the 16 jobs. What makes them different? Different data sets? Different input values? Do you expect them to take about the same time, or should they have significantly different run times?

For instance, what is the difference between Job 3 (411 seconds) and Job 13 (~30k seconds)?

$SCRATCH availability shouldn't be causing this particular issue, since some jobs were able to write and they are all on the same node; however, $SCRATCH performance certainly could be an issue. We do not have a parallel file system on Europa yet, and there are some users who are currently abusing it.

Looking at the logs, it does appear the job ran the entire 48 hours (the maximum queue time) without finishing a few of those tasks. So either their workloads are VERY different or the I/O was very slow. How much I/O do these jobs do?

Best,
Chris

The 16 jobs perform kinetic Monte Carlo simulations of solar cells. All of the jobs here perform a bias parameter sweep across two solar cell morphologies. In the diagnostics I present here there are only two morphologies (run1_Aw50 and run2_Aw100), and the rest is a bias parameter sweep. Job 3 took a shorter time (411 seconds) because it had a bias input of 0.7 V. Typically, higher bias inputs decrease the number of particles present in the simulation, causing it to run faster. Job 13 took a longer time (~30k seconds) because it had a reverse bias input of -0.1 V, which allows more particles to be present in the simulation, making the run significantly longer than Job 3.

Some of the jobs that did not finish are jobs 4 and 5, which struck me as very strange because they were in the morphology batch run1_Aw50. Jobs under run1_Aw50 ran significantly faster than those under run2_Aw100 because of the different bias values. I expected jobs 4 and 5 to have runtimes between those of jobs 3 and 6, because those four jobs sweep the bias over (0.700, 0.733, 0.767, 0.800) V. The 0.700 V and 0.800 V runs finished quickly, but the 0.733 V and 0.767 V runs did not finish at all.

I don't think these jobs are I/O intensive, because individually they don't produce much data on scratch. Running du -sh * on the directories for morphology run1_Aw50 shows this:

4.9M Bias_0.633_V
4.8M Bias_0.667_V
4.9M Bias_0.700_V
1.7M Bias_0.733_V
1.7M Bias_0.767_V
4.6M Bias_0.800_V
1.5M Bias_0.833_V
4.8M Bias_0.867_V
4.8M Bias_0.900_V
4.9M Bias_0.933_V
4.9M Bias_0.967_V
5.0M Bias_1.000_V

So the total amount of data these jobs write to disk typically peaks at around 5 MB each.

Although it is possible that they open and close the output file quite frequently while writing ~1 kB each time, I always assumed that the compiler optimized that out. I am using python3 in launcher, but all python3 does is run compiled C++ code. The C++ lines that I believe have the highest I/O are:

#include <fstream>  // for std::ofstream

std::ofstream statcurrout;
char* str_value = new char[528];            // buffer for the output file path (filled elsewhere in the code)
statcurrout.open(str_value, std::ios::app); // highest I/O line, I think

I believe this file-open line should run at most ~5000 times for each of the jobs listed here.
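
To make the pattern concrete, here is a rough, self-contained sketch of what I understand each job's write loop to look like (this is only an illustration, not my actual code; the file name and record size are made up):

#include <fstream>
#include <string>

// Illustrative write pattern: reopen the output file in append mode,
// write on the order of 1 kB, and close it again, up to ~5000 times per job.
int main() {
    const std::string path = "stat_current.dat";  // placeholder name, not the real output file
    const std::string record(1024, 'x');          // stands in for ~1 kB of simulation output

    for (int i = 0; i < 5000; ++i) {
        std::ofstream statcurrout;
        statcurrout.open(path, std::ios::app);    // a fresh open/close for every write
        statcurrout << record << '\n';
        statcurrout.close();
    }
    return 0;
}

If those open/append/close cycles really do hit $SCRATCH every time rather than being optimized away, that would be roughly 5000 file-open operations per job, which is why I mention it.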
I compiled my C++ code using:

-Ofast -fira-region=all -fvect-cost-model=unlimited

In small-scale tests I found that -Ofast, -O3, and -O2 all produced the same particle trajectories in my Monte Carlo runs.

I hope this makes the nature of my jobs clearer. I had never actually discussed with anyone how I run my jobs before.

Excellent info!

Are your jobs 100% serial, or are you using some third-party library that might be doing some parallelization/threading? One reason for seeing drastically different run times is "oversubscribing" a node, that is, running more tasks/threads than there are cores, so that some processes get starved for CPU time.
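
One quick way to check, as a rough sketch (Linux-specific and only an illustration): have the code print its own thread count at runtime by reading /proc/self/status.

#include <fstream>
#include <iostream>
#include <string>

// Rough check for hidden threading: on Linux, /proc/self/status contains a
// "Threads:" line with the current number of threads in the process.
// A fully serial job should report 1.
int main() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("Threads:", 0) == 0) {  // line looks like "Threads:   1"
            std::cout << line << "\n";
            break;
        }
    }
    return 0;
}

Watching the process in top/htop while a job is running would tell you the same thing.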

Finally, how much memory does a single task use and how much total data does it write? That is, if you have a single job running in a single directory and used “du -skh /path/to/dirname”, how big would it be?
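
For the memory side, a similarly rough sketch (again Linux-specific and only an illustration) of how the code could report its own peak memory usage:

#include <sys/resource.h>
#include <iostream>

// getrusage() reports resource usage for the calling process; on Linux,
// ru_maxrss is the peak resident set size in kilobytes.
int main() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        std::cout << "Peak RSS: " << usage.ru_maxrss << " kB\n";
    }
    return 0;
}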

Based on these answers, I’ll suggest a few experiments to tease out the cause of the problems.