Binary Memory Corruption

Hello,

I am trying to run VASP+Wannier90 on Ganymede. This does not use the VASP module, but instead a separate executable I compiled for the express purpose of interfacing with Wannier90. I have used this setup successfully in the past, but lately any jobs I try to run fail with the following error:
Error in `/home/prc160030/bin/vasp.5.4.4/binaries/vasp_std_w90v2.1': malloc(): memory corruption: 0x000000000bd94590
This error message repeats multiple times, each ending with a different hex value.

The error seems to indicate that the binary itself is the problem. It's kept in my home directory and has not been copied or moved since I initially compiled it, so I don't know what could be causing the problem.
For technical reasons, I can only run these jobs on the MSL-SKX node, so the configuration is always 1 node with 24 cores. Other jobs run fine on this node.

I can try recompiling VASP, but I was wondering if anyone else had other ideas first.

Thanks in advance,
Patrick

Odds are this is a problem with the code and not a hardware problem. Is there anything different about this run from other runs you've done with the same executable? Some new input option you've twiddled, or a change to the input that would cause a different execution path?

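Usually this error means your code is writing outside the bounds of memory it has allocated (or into memory it has already freed), which corrupts the bookkeeping that malloc relies on. glibc often only notices the damage at a later malloc or free call, so the crash is not necessarily where the bug is. It is time to do some debugging and memory profiling to determine the cause. A lecture on debugging that I taught this spring is available online.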

Go to http://utd.link/prc and download lecture 12-debugging.pdf.

I'd start by running it through valgrind and seeing whether it finds anything. Valgrind is available on Ganymede. Additionally, you might want to have it generate a core file and do a post-mortem with gdb.

A few quick tests confirmed that the executable was working as intended, so the problem had something to do with this particular job. By watching the output files as the job ran, it became apparent that the VASP part would complete fine, and the error would occur when the Wannier90 part tried to start. I varied the input parameters until I found that the problem had to do with parallelization. Basically, Wannier90 will fail if any of VASP's internal parallelization routines are used. Symmetry must also be disabled. This, of course, is not in the documentation for either code.

Nice find! So am I understanding correctly that you are good now and able to run?

Yes. As long as I configure the VASP part correctly, the Wannier90 part runs without error.