I saw this recent post on Twitter about the end of life of the Titan cluster, and it made me a little sad:
But it also prompts a lot of questions that would be interesting to discuss. For example, it’s logical that servers and disks only have a certain life, and often are replaced well before the infrastructure that they serve is taken down. It’s also the case that hardware changes so quickly that at some point, given sufficient funding, it might even be easier to build a new cluster than to try to restore an older one.

So with this in mind, I want to ask my question. What is the typical lifecycle of a research cluster? What are the factors that determine which clusters have shorter lives and which have longer ones? What are the challenges in maintaining an older cluster? A newer one? It occurs to me that documentation bases might be really hard to maintain simply because they need to be totally redone for a new system every 5 to 10 years. It also seems likely that there is some balance between providing the newest and trendiest things that users might want and maintaining something that is stable and reliable. Now, given whatever common lifecycle you might have in mind, is there any potential future that would allow change to be less frequent and clusters to be more stable? Will clusters ever be able to have longer lives?
It really does depend on the type of research and how well the OS and research code are supported. Assuming the machine can still receive software and security updates, its lifespan can be extended as long as the scale of the data does not change. One could theoretically run mathematical simulations that output kilobyte files on nearly any machine, but accommodating 5 petabytes of radio telescope data would take some upgrades for most machines. For most scientific purposes, the amount of data to process is ever increasing, which takes more disk space, memory, and network capacity. There is also a bias in development towards writing code that is easier for the developer to understand than it is efficient for the machine to execute. In practice that means people writing horribly inefficient code in MATLAB rather than better code in a more efficient language like C, and the solution for most institutions just seems to be to buy faster computers instead of fixing their code so it runs more efficiently. There are things that can be done to extend clusters’ lives, but it takes knowing how to use them within the framework of the research you’re doing.
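To make that last point concrete, here is a minimal sketch (in Python/NumPy, standing in for the MATLAB-versus-C comparison, since I don’t have any real research code to hand): the same sum-of-squares written as an explicit interpreted loop and as a single call into optimized, compiled library code. The array size and the use of np.dot are arbitrary choices for the illustration.

```python
import time
import numpy as np

# Synthetic data standing in for a modest simulation output (illustrative size).
rng = np.random.default_rng(0)
data = rng.standard_normal(2_000_000)

# "Horribly inefficient" style: an explicit interpreted loop over every element.
start = time.perf_counter()
total = 0.0
for x in data:
    total += x * x
loop_seconds = time.perf_counter() - start

# Same computation pushed down into optimized, compiled library code.
start = time.perf_counter()
vectorized = float(np.dot(data, data))
vec_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.3f} s -> {total:.3f}")
print(f"vectorized: {vec_seconds:.3f} s -> {vectorized:.3f}")
```

On a typical machine the vectorized version is orders of magnitude faster for the same answer, which is exactly the kind of headroom that can postpone “just buy faster computers.”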
Just to note the historical event, they took a video of “Titan’s Last Gasp” - as if they needed to make it any more sad than it already was!
In case the post goes away, here is the content:
Today, August 2, 2019 at 1:03 p.m. Eastern Daylight Time, the Titan supercomputer at ORNL breathed its last gasp as it was powered off. Titan first appeared on the TOP500 list as #1 in November 2012 and remained in the top 5 of the list for 5 years and the top 10 for 6 years. Over the course of its 7 years of service, Titan provided nearly 27 billion processor-hours of compute time to 896 research projects using 18,688 AMD 16-core Opteron processors and NVIDIA K20x GPUs.
Titan was reincarnated from its predecessor, Jaguar, by replacing the node boards but reusing the cabinets, cooling system, backplane, interconnect cables, and power supplies. It is survived by its successor system, Summit, which has 4,608 nodes with IBM Power9 CPUs and 27,648 NVIDIA V100 GPUs and is currently the #1 system on the TOP500 list.
From a high-level perspective, I’d say that (node) warranty lifetimes have a major impact on cluster lifecycle. For example, if I am trying to recoup the cost of a cluster by promoting node purchases to researchers, I don’t want to include support (from our center) on out-of-warranty nodes, nor do I want to encourage the purchaser to run them past warranty (because of the effect on the entire cluster). The rapid pace of hardware advancement also argues against running nodes into the ground or patching them together. The state of scientific applications has influence as well: if code features keep up with hardware evolution, it’s almost mandatory to offer researchers the ability to use those features, and if the existing hardware can’t be tweaked to perform, it’s difficult to defend keeping it.
A shout-out to Phillip_Benoit for lamenting the infiltration of ‘horribly inefficient code in MATLAB’!
In addition to the considerations mentioned above, you could add power, space, and cooling. The increase in density and power efficiency over the past two decades has been relentless, leading to a set of related trade-offs that look simple at first but turn out to be fairly complex. First, obviously, you can get a lot more computing power in a lot less physical space these days - for example, our most recent cluster purchase will comfortably fit over 30,000 cores in just four racks, compared to the previous cluster (now only 3 years old) that took 10 racks to fit just over half as many cores. Accelerators (GPUs, etc.) also provide large amounts of computing power in a comparatively small amount of space, beyond the compression already achieved with conventional cores. It’s also the case that power efficiency has been increasing at a faster rate than computational power, even for conventional cores, so you can get more computation for the same amount of power.
Counteracting these gains are the consequences of dealing with the physical effects of this increase in density. As racks push up into the several tens of kW each, direct liquid cooling, immersion cooling, or other similar methods become necessary, which adds complication, space requirements, engineering considerations, and expense. Overall it is usually a big win in spite of these complications: as long as you can handle the consequences, moving to new hardware just for the better power efficiency, increased reliability (including warranty coverage), and reduced space requirements is usually a good idea. Oh, also, you’ll need money to buy the new hardware, which is of course probably the major consideration for most!
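To put rough numbers on the density point above, here is a back-of-the-envelope sketch using the figures from my post. The old cluster’s exact core count and the per-rack power values are illustrative assumptions, not measurements from either system.

```python
# Back-of-the-envelope rack density comparison using the figures quoted above.
# The old-cluster core count and the per-rack power budgets are assumptions
# for illustration, not measurements from either system.

new_cores, new_racks = 30_000, 4    # "over 30,000 cores in just four racks"
old_cores, old_racks = 16_000, 10   # "just over half as many cores" in 10 racks (assumed ~16k)

new_density = new_cores / new_racks  # cores per rack, new cluster
old_density = old_cores / old_racks  # cores per rack, previous cluster
print(f"new: {new_density:,.0f} cores/rack, old: {old_density:,.0f} cores/rack, "
      f"~{new_density / old_density:.1f}x denser")

# Why cooling gets harder: assumed per-rack draws in the tens of kW.
assumed_rack_kw = 40        # hypothetical dense rack, liquid-cooling territory
air_cooled_limit_kw = 15    # rough assumed comfort zone for plain air cooling
print(f"a {assumed_rack_kw} kW rack exceeds an assumed ~{air_cooled_limit_kw} kW "
      f"air-cooling comfort zone by {assumed_rack_kw / air_cooled_limit_kw:.1f}x")
```

The exact numbers will vary by site, but the shape of the result is the same: several times the cores per rack, and per-rack power that has outgrown what simple air cooling handles comfortably.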