What is a checkpoint?

I was watching a video in which an HPC facility in Canada talked about their new storage solution. They mentioned that checkpoints used to take hours, but with the new solution they take half an hour. In the HPC world, what is a checkpoint? Just curious…

Can we get a link to the video? Normally, the most relevant checkpoint definition would be application checkpointing, but those wouldn’t typically take that much time (unless there was a truly massive amount of data required to restart the application from an intermediate state).

Maybe they were talking about something more storage-specific, like snapshots or replication. Snapshots should be practically instantaneous; replication time depends on the amount of data to send and on the throughput of the storage and any connecting networks.
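To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function names and the sizes and throughputs in the usage note are made up for illustration; they are not figures from the video or from any particular site.

```python
def bottleneck_throughput(*stage_gb_per_s: float) -> float:
    """The slowest stage (storage read, network, storage write) limits the transfer."""
    if not stage_gb_per_s:
        raise ValueError("need at least one stage")
    return min(stage_gb_per_s)


def transfer_time_hours(data_tb: float, throughput_gb_per_s: float) -> float:
    """Estimate the hours needed to move data_tb terabytes at a sustained GB/s rate."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_tb * 1000) / throughput_gb_per_s  # 1 TB = 1000 GB (decimal units)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical: 100 TB of state, with the storage as the slowest stage.
    old = bottleneck_throughput(10.0, 25.0)   # old storage 10 GB/s, network 25 GB/s
    new = bottleneck_throughput(50.0, 100.0)  # new storage 50 GB/s, network 100 GB/s
    print(f"old: {transfer_time_hours(100, old):.2f} h")
    print(f"new: {transfer_time_hours(100, new):.2f} h")
```

With those assumed numbers the same 100 TB goes from a few hours to roughly half an hour, which is the kind of change a faster storage tier alone could explain. Real transfers rarely sustain peak throughput, so treat this as a lower bound.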

An application checkpoint allows a crashed application (think utility power outage) to restart and continue its computational task from the latest checkpoint file. That file is a binary file with all the “instructions” on how to proceed. Basically, the application is halted, and all open files and the contents of memory are saved in a wrapper shell that can be restarted. It should be fast, but the time depends on memory size. Checkpoint files should be saved on a common file system that automatically removes old files.
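As a rough illustration of the idea, here is a minimal application-level sketch in Python. It is not how DMTCP or BLCR work internally (those save the whole process transparently); here the program saves its own state explicitly. The file naming, the `keep` parameter, and the toy computation are all assumptions made for the example.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, ckpt_dir: Path, step: int, keep: int = 2) -> Path:
    """Write state to ckpt_dir atomically, then prune all but the newest `keep` files."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"ckpt_{step:08d}.pkl"
    # Write to a temp file first so a crash mid-write never leaves a half-written checkpoint.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)  # atomic rename within the same file system
    # Zero-padded names sort chronologically, so everything before the last `keep` is old.
    for old in sorted(ckpt_dir.glob("ckpt_*.pkl"))[:-keep]:
        old.unlink()
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return the state from the newest checkpoint, or None if there is none."""
    checkpoints = sorted(ckpt_dir.glob("ckpt_*.pkl"))
    if not checkpoints:
        return None
    with open(checkpoints[-1], "rb") as f:
        return pickle.load(f)


def run(ckpt_dir: Path, total_steps: int, every: int = 10, crash_at=None) -> int:
    """Toy job: sum the integers 0..total_steps-1, checkpointing every `every` steps."""
    state = load_latest_checkpoint(ckpt_dir) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated power outage")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, ckpt_dir, state["step"])
    return state["total"]
```

If `run` dies partway through, calling it again with the same directory picks up from the last saved step instead of starting at zero; only the work done since that checkpoint is repeated.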

We have migrated from BLCR to DMTCP because DMTCP is not kernel-version dependent and is much easier to use with parallel jobs. Some more information and a sample of how to use it can be found at https://dokuwiki.wesleyan.edu/doku.php?id=cluster:190

Our compute nodes are on utility power only, hence every user learns DMTCP. It also allows the sysadmin to take nodes down on an emergency basis with little impact, if coordinated with users.

-Henk