We're re-evaluating the battery backup policies for our data center. Currently we have all nodes on battery backup: head nodes, storage nodes, and compute nodes. We are thinking of no longer backing up compute nodes, and we're wondering what hardware/nodes other institutions consider critical to back up (via battery). If other HPC centers have experience and/or advice they would be willing to share, I'd very much appreciate the wisdom!
The batteries keep our systems up with constant power while the generator gets up to the point where it produces the energy needed. This eliminates the equipment failures sometimes experienced with a rapid loss of power. It has been my experience that turning equipment on and off increases the risk of equipment failure. A sudden loss of power can also leave software applications in unstable states, and the time required to clean up after power failures more than pays for the cost of the batteries.
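To put that last claim in concrete terms, here's a rough back-of-envelope comparison in Python; every figure in it is a made-up placeholder, so substitute your own outage frequency, labor rate, and battery costs:

```python
# Back-of-envelope sketch of battery cost vs. post-outage cleanup cost.
# All numbers below are hypothetical placeholders, not real costs.

battery_cost_per_year = 20_000.0   # amortized battery purchase/replacement (hypothetical)
outages_per_year = 3               # utility events the UPS rides through (hypothetical)
cleanup_hours_per_outage = 24      # staff time fixing filesystems, requeueing jobs (hypothetical)
staff_cost_per_hour = 80.0         # loaded labor rate (hypothetical)
lost_compute_per_outage = 8_000.0  # value of killed jobs and idle nodes (hypothetical)

cleanup_cost_per_year = outages_per_year * (
    cleanup_hours_per_outage * staff_cost_per_hour + lost_compute_per_outage
)

print(f"Annual cleanup cost without batteries: ${cleanup_cost_per_year:,.0f}")
print(f"Annual battery cost:                   ${battery_cost_per_year:,.0f}")
if cleanup_cost_per_year > battery_cost_per_year:
    print("With these (made-up) numbers, the batteries pay for themselves.")
else:
    print("With these (made-up) numbers, the batteries cost more than the cleanup.")
```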
We have our entire data center protected by double-conversion flywheel UPSes, which carry the load just long enough for the backup generators to kick on in the event of a power outage. Personally, I'd suggest against segregating your data center in such a way.
That being said, our data center houses more than just our HPC clusters. A fair amount of the space is used for data storage, and a lot of it is used for things that aren't clusters at all. That's intentional, though; if you are building a data center around a cluster, and aren't going to use it for anything else, then of course things will be different.
But it's also nice having the peace of mind. We don't have to take the risk that someone plugged a storage controller into the non-UPS rack. We don't have to deal with a transient power issue knocking out all of the compute nodes, with the sudden workload of bringing everything back up and the hit to our reputation that follows. It's hard to tell users that they've lost a day of compute because we decided to save $X in capital costs. And of course the outage would happen right when there's a critical grant deadline.
All that being said, there are still ways you can save money. We use flywheel UPSes, so there are no batteries to replace. We do fresh-air cooling whenever possible, which saves on power costs. Our PDUs run at 240 VAC line-to-neutral (415 VAC line-to-line), which means we only need a single set of transformers (ahead of the UPS), plus a small 120/240 transformer for office spaces. And our data center is single-fed: we have two sets of cables coming into the building, but we get power from just one substation. Each rack has only one PDU, each row has only one set of busbars, and everything traces back to a single UPS. We have to shut everything down once every 5 years for several days of maintenance, but that was deemed an acceptable trade-off for a research data center.
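For anyone checking the electrical math above: in a balanced three-phase system, line-to-line voltage is line-to-neutral voltage times √3, which is where the 240/415 pairing comes from. A quick Python sanity check (only the 240 V figure comes from our setup; the rest is generic):

```python
import math

def line_to_line(v_line_to_neutral: float) -> float:
    """Line-to-line voltage of a balanced three-phase system."""
    return v_line_to_neutral * math.sqrt(3)

print(f"{line_to_line(240.0):.1f} V")  # 415.7 V, nominally quoted as 415 VAC
print(f"{line_to_line(120.0):.1f} V")  # 207.8 V, the familiar US 120/208 pairing
```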