Combined Compute/Storage Cluster?

Ben · February 26, 2021, 1:53am

Hi,

Has anyone built a converged cluster using the storage on the compute nodes to reduce costs? If so:

how was it set up?
how well did it perform in practice?
did it actually save money in the end?
is there a sweet spot for performance/cost?

I have a requirement to build a roughly 30-node cluster with 1 PB of scratch storage. The workload will be non-MPI/pleasingly parallel, so there are no synchronization/jitter concerns. The results will be sent to an archive, so there are no stringent requirements on storage reliability/redundancy. The cluster is intended for a specific project, so there is no need to scale compute and storage separately.
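For a rough sense of what that implies per node, here is a minimal sizing sketch. It assumes decimal units (1 PB = 1000 TB) and capacity spread evenly across all nodes. The 16 TB drive size and the function names are hypothetical, chosen only for illustration, and are not part of the requirement.

```python
import math


def per_node_capacity_tb(total_pb, nodes, usable_fraction=1.0):
    """Raw TB each node must hold to reach total_pb of usable space.

    usable_fraction < 1.0 models capacity lost to RAID/filesystem
    overhead (an assumption; 1.0 means no overhead at all).
    """
    total_tb = total_pb * 1000  # decimal units: 1 PB = 1000 TB
    return total_tb / nodes / usable_fraction


def drives_per_node(raw_tb_per_node, drive_tb):
    """Whole drives needed per node, rounded up."""
    return math.ceil(raw_tb_per_node / drive_tb)


if __name__ == "__main__":
    raw = per_node_capacity_tb(total_pb=1.0, nodes=30)
    # drive_tb=16 is a hypothetical drive size, not a recommendation
    print(f"{raw:.1f} TB per node, {drives_per_node(raw, drive_tb=16)} drives per node")
```

Lowering `usable_fraction` shows how much extra raw capacity any redundancy would cost, which is the main lever in the performance/cost question.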

Previously we’ve always had a separate storage system, but I’m curious whether it would be cost-effective to add storage to the compute nodes and build a filesystem using BeeGFS. However, would InfiniBand/RDMA be essential to reduce CPU load, or would RoCE be sufficient? If InfiniBand is required, could I fit everything, including the head node, subnet manager, and metadata server, under a single switch? Is it possible to build this in house, or would it really require a vendor?

Thanks in advance for your opinions, recommendations, and experiences,

Ben

hjmangalam · March 8, 2021, 6:13pm

While there are lots of solutions that might apply here, the one that I’m familiar with is a variant of the BeeGFS system called BeeOND (for BeeGFS On Demand), which spins up a user-dedicated parallel FS on demand (appropriate for a long-running but high-performance storage system dedicated to that user), using any nodes that can contribute storage, similar to the converged system you describe. (And apologies for the repeated BeeGFS-centric posts - it’s the one with which I’m most familiar, and I suffer from ‘when you’re good with a hammer, everything looks like a nail’ syndrome.)

You could use this approach to put together a longer-running converged instance for multiple users - BeeGFS allows you to use anything you want as a storage node. The downside is the reliability of such a system: user code could blow up or lock up a contributing node, endangering the reliability of the whole. BeeGFS allows some workarounds by setting timeouts and redirecting writes, though of course reads of data on the affected node will hang. That’s the main reason we did not go ahead with implementing it for our own cluster.

Interested in hearing about other solutions.

Harry