My job recently failed and the error log says “Bus Error”. What happened and how do I fix the issue?
On the Yale clusters, a bus error usually means your job ran out of memory (RAM). If you cannot reduce the memory usage of your code, you can request additional memory for your job using the --mem-per-cpu or --mem Slurm flags:
https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/#directives
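As a sketch, the header of a batch script that requests more memory might look like the following. The job name, CPU count, time limit and memory sizes are placeholder values, and ./my_program is a hypothetical command. Pick values that match your own workload and the limits of your partition. --mem-per-cpu and --mem are mutually exclusive, so request only one of them.

```shell
#!/bin/bash
#SBATCH --job-name=mem_example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
# Option A: memory per allocated CPU (here 4 CPUs x 8G = 32G in total).
#SBATCH --mem-per-cpu=8G
# Option B: total memory per node. The extra '#' disables this directive;
# enable it only if you remove Option A, since the two flags are mutually exclusive.
##SBATCH --mem=32G

# Print what Slurm granted (in MB); these variables are unset outside a Slurm job.
echo "Memory per CPU (MB): ${SLURM_MEM_PER_CPU:-unset}"
echo "Memory per node (MB): ${SLURM_MEM_PER_NODE:-unset}"

# Hypothetical workload: replace with your own command.
# srun ./my_program
```

Submitting this with sbatch (the script's file name is arbitrary) and checking the two echo lines in the job's output is a quick way to confirm how much memory the job actually received before the real program runs.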
More details: Your program can run into this fault because of the way we manage memory with cgroups, which allow many jobs to run on the same physical machine without interfering with one another. If a process inside a job tries to access memory "outside" what was allocated to that job (for example, more than you requested), the operating system tells your program that the address is invalid by raising a Bus Error, aka SIGBUS (signal 7 on x86-64 Linux, which a shell reports as exit status 128 + 7 = 135). A similar fault you might be more familiar with is a Segmentation Fault, aka SIGSEGV (signal 11, exit status 139), which usually results from a program incorrectly trying to access a memory address it is not allowed to use, for example through a null or dangling pointer.
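To tell the two faults apart from an exit status captured in your own scripts, a small bash function can decode it. This is a sketch; decode_status is a hypothetical helper, not a cluster tool. It relies on the shell convention that a process killed by signal N has exit status 128 + N, and on bash's builtin kill -l to turn that status into a signal name, so it does not hard-code platform-specific signal numbers. The demo triggers each signal in a throwaway child shell, with ulimit -c 0 so the children do not write core files.

```shell
#!/bin/bash
# Decode a shell exit status: values above 128 mean "killed by signal (status - 128)".
# bash's builtin `kill -l` accepts such a status directly and prints the signal
# name without the SIG prefix (e.g. BUS or SEGV).
decode_status() {
    local status=$1
    if (( status > 128 )); then
        echo "killed by SIG$(kill -l "$status")"
    else
        echo "exited with code $status"
    fi
}

# Demo: each child shell disables core dumps, then sends a fatal signal to itself.
# The child always fails, so `||` runs decode_status with the child's status in $?.
bash -c 'ulimit -c 0; kill -BUS $$' || decode_status $?    # killed by SIGBUS
bash -c 'ulimit -c 0; kill -SEGV $$' || decode_status $?   # killed by SIGSEGV
```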