Rolling out OS upgrades

What are best practices for rolling out OS upgrades across a cluster?

The group I work with has found that for updates within a minor revision of the OS release, it is easier to update with the package manager and reinstall the Mellanox drivers if needed. If the minor revision itself changes, we just reimage the cluster.
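
A minimal sketch of that decision rule, assuming release strings of the form "major.minor[.patch]". The function name and the version format are illustrative assumptions, not something from a specific tool.

```python
def upgrade_strategy(current: str, target: str) -> str:
    """Pick an upgrade path from two hypothetical 'major.minor[.patch]' strings."""
    # Compare only the major and minor components; the patch level is ignored.
    cur = tuple(int(part) for part in current.split(".")[:2])
    tgt = tuple(int(part) for part in target.split(".")[:2])
    if cur == tgt:
        # Same minor revision: update in place with the package manager
        # (and reinstall out-of-tree drivers afterwards if needed).
        return "package-manager"
    # The minor (or major) revision changed: reimage instead.
    return "reimage"
```

For example, `upgrade_strategy("8.6.1", "8.6.3")` selects the package-manager path, while `upgrade_strategy("8.6", "8.7")` selects a reimage.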

With automated tools, my process is to close one node, work on the new image, re-image that one node, and run some system tests. If that node is working, I close groups of nodes and re-image them. If a node fails to re-image properly, we make a note of it and usually revisit it after all the other nodes are updated.
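
The canary-then-batches flow described above could be sketched roughly as follows. The `close_node`, `reimage`, and `health_check` callables are hypothetical placeholders for whatever your scheduler and provisioning tools provide, and `batch_size` is an arbitrary assumption.

```python
from typing import Callable, Iterable, List, Tuple


def rolling_reimage(
    nodes: Iterable[str],
    close_node: Callable[[str], None],    # hypothetical: stop scheduling jobs on the node
    reimage: Callable[[str], bool],       # hypothetical: True if the re-image succeeded
    health_check: Callable[[str], bool],  # hypothetical: True if system tests pass
    batch_size: int = 4,                  # assumed group size; tune for your cluster
) -> Tuple[List[str], List[str]]:
    """Re-image one canary node first, then the rest in groups.

    Returns (updated, failed). Failed nodes are only recorded, so they can be
    revisited after every other node has been updated.
    """
    nodes = list(nodes)
    if not nodes:
        return [], []

    # Step 1: a single canary node validates the new image.
    canary, rest = nodes[0], nodes[1:]
    close_node(canary)
    if not (reimage(canary) and health_check(canary)):
        raise RuntimeError(f"canary {canary} failed; fix the image before continuing")

    updated, failed = [canary], []

    # Step 2: close and re-image the remaining nodes one group at a time.
    for start in range(0, len(rest), batch_size):
        batch = rest[start:start + batch_size]
        for node in batch:
            close_node(node)
        for node in batch:
            if reimage(node) and health_check(node):
                updated.append(node)
            else:
                failed.append(node)  # note it and move on

    return updated, failed
```

Aborting when the canary fails keeps a bad image from reaching the rest of the cluster, while failures in later groups are collected rather than stopping the rollout.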

Thanks for getting the discussion started!