Storage solutions - Weka/Vast/BeeGFS vs lustre/gpfs?

Is anyone using, evaluating, or doing a POC on newer storage solutions from Weka, VAST, or BeeGFS as an alternative to parallel filesystems such as Lustre or GPFS?

Of particular interest: do these work to keep GPU systems fed?


We’ve been using BeeGFS in 4 largish (PB+) systems for about 5 yrs. 2 use LSI hardware controllers with XFS as the FS underlay; 2 use ZFS as the underlay. 3 of the FSs use single metadata servers (the 2 XFS and 1 ZFS) and have performed very well. 1 ZFS system uses mirrored MD servers, and that has been problematic, tho some of that is related to initial config mistakes, mostly having the MD filesystem config’ed too small. However, we will not use mirrored MD going forward.

All of these were built on generic Supermicro chassis, using QDR/FDR IB nets, mostly using RDMA comms.

There are pros and cons for both XFS and ZFS. The XFS systems are both faster and (somewhat surprisingly) more reliable under load. The ZFS systems can be config’ed to use compression and thereby regain the capacity lost to parity in the RAIDZ2 we use. The ZFS underlay tends to be much slower with ZOTfiles (Zillions Of Tiny files, coming from offenders like Trinity).
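The compression-recovers-parity trade described above is a one-liner in ZFS. This is a minimal sketch, assuming a hypothetical 8-disk RAIDZ2 pool; pool name and device names are placeholders, and actual savings depend entirely on how compressible the data is:

```shell
# Hypothetical layout: 8 data disks in RAIDZ2 (two disks' worth of parity).
# "tank" and sdb..sdi are placeholders for your pool/devices.
zpool create tank raidz2 sdb sdc sdd sde sdf sdg sdh sdi

# lz4 is cheap on CPU and often claws back a good chunk of the capacity
# lost to parity; results vary wildly with the data.
zfs set compression=lz4 tank

# After data lands, check what compression is actually buying you.
zfs get compressratio tank
```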

We had a near catastrophe with the LSI Nytro controllers, but we were able to swap them out to more ‘normal’ controllers with LSI’s direct help (and a nod to Advanced HPC for mediating that assistance).

Both the ZFS and XFS versions would benefit from an NVMe layer as part of a true HPC filesystem, either as a same-namespace addon, perhaps using Excelero (compatible with BeeGFS), or as a separate layer like VAST Data. There’s just no excuse for not having a decent flash layer for hot data.

We have had problems with fine-grained UID/GID slicing/dicing relative to a GPFS system - BeeGFS just doesn’t support the bazillion ways to do permissions that GPFS does. On the other hand, it keeps things cleaner using the std Linux permissions. YMMV depending on what your users need.

There is no automatic per-server rebalancing with BeeGFS, altho that is apparently coming RSN. However, at least on the hardware RAID/XFS systems, we have (inadvertently) driven heavy IO to servers at 95%+ capacity with little loss of performance. ZFS seems to be more sensitive to capacity limits. Also, ZFS seems to take much longer to resilver after a disk replacement.
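Absent automatic rebalancing, the usual workaround is to watch per-target fill levels and migrate data by hand. A sketch using `beegfs-ctl` (exact flags may differ between BeeGFS versions, and the target ID and path below are placeholders):

```shell
# Show storage targets with their free-space info, so lopsided fill
# levels are visible before a target hits the capacity wall.
beegfs-ctl --listtargets --nodetype=storage --spaceinfo

# Manually drain a nearly-full target by migrating a directory's file
# chunks off it (placeholder target ID and path).
beegfs-ctl --migrate --targetid=42 /mnt/beegfs/some/busy/dir
```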

So while ZFS has much to offer in terms of checksumming, compression, etc, it doesn’t really seem like a good fit for HPC IO loads, at least without a flash layer to take the heavy IO hit. You can mess around with the SLOG, cache sizes, intent logs, etc, but nothing really seems to address the basic sluggishness of ZFS.

All that said, HPC filesystems are almost all about edge cases and BeeGFS seems to handle 95% of what we throw at it with pretty good grace, even with the ZFS choppiness.

hjm


Harry, thanks for all of that information. Can you mention where/how you are using BeeGFS? Are you using it everywhere? For instance, for both scratch (short-term) and project (long-term) purposes? I’m not too surprised ZFS doesn’t perform as well as XFS for HPC loads in general. XFS is fast! But ZFS has so much to offer that it’s still worthwhile, of course, though possibly not for a scratch/primary-I/O type filesystem.

We have a non-BeeGFS ZFS implementation running on NexentaStor, and it’s not fast, though I didn’t really expect it to be. We use it for long-term project storage. This installation is only a single I/O node. I have a feeling that if we used BeeGFS/ZFS with many backend I/O servers, it would nicely serve our long-term project needs and possibly our short-term scratch.

You mentioned reliability being better for XFS; that is a bit surprising. While we run XFS as our de facto backend for NFS filesystems with zero issues, we haven’t had issues with our ZFS either. Are the ZFS issues just due to misconfiguration in the deployment/architecture? Or something else?

Have you considered a lustre/ZFS combo?

I’ve always been very happy with Lustre reliability/performance in my 10 years supporting it. We’ve only run it with an ext4 backend. I’ve wished Lustre supported multiple namespaces/mounts, however; I don’t believe this is supported as of yet. I believe GPFS does support that. Does BeeGFS support multiple namespaces?

Thanks Harry.

I would be curious where people put their home directories. Many sites started with 4-5GB home directories, i.e. just enough to log in and do simple work (before moving on to project space), but even that isn’t enough with everyone building their own Python, R, etc. stacks, all of which default to and fill up home directories (even if the bulk of the work is then done in project space).

I know there is good reason to split home directories off onto, say, NetApp NFS and project space onto GPFS (DDN, or whomever), but at the same time, if the parallel filesystem isn’t working, nothing else is going to get accomplished, so I’m also thinking it makes sense to simply combine them.
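One partial mitigation for the Python/R home-directory bloat described above is to point the per-user caches and libraries at project space via environment variables. A sketch, assuming a hypothetical `/project/$USER` layout (all paths are placeholders; the variables themselves are the documented ones for pip, conda, and R):

```shell
# Keep pip's download/wheel cache out of $HOME.
export PIP_CACHE_DIR=/project/$USER/.cache/pip

# Conda package cache and environments on project space instead of ~/.conda.
export CONDA_PKGS_DIRS=/project/$USER/conda/pkgs
export CONDA_ENVS_PATH=/project/$USER/conda/envs

# R user library relocated from ~/R.
export R_LIBS_USER=/project/$USER/R/library
```

Dropping these into a site-wide profile script keeps small home quotas livable without users having to know where each tool hides its cache.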

> Harry, thanks for all of that information. Can you mention where/how you are using beegfs? Are you using it everywhere? For instance scratch (short term), and project (long-term) purposes? I’m not too surprised ZFS doesn’t perform as well for HPC loads in general compared to XFS. XFS is fast! But ZFS has so much to offer it’s still worthwhile of course, but possibly not for a scratch/primary I/O type filesystem.

I left UCI last year, but we were using BeeGFS for nearly everything, tho in slightly different configurations. We used a single head with multiple JBODs (ZFS) for our major backup system, which received an overnight parallel rsync from the other filesystems. So you could think of that as a long-term storage system, and ZFS is pretty good for such things (compression and parity checking on reads).

We also were using BeeGFS for multiple scratch systems - 2 on XFS (single metadata/management heads with 5/6 intelligent XFS servers behind them) - and those JUST WORKED. The most recent scratch BeeGFS was a mirrored metadata server with multiple JBODs running on ZFS, and that one has been a pain, tho more due to problems with ZFS, metadata mirroring, and some misconfiguration in the initial setup (ignoring half the supplied metadata storage, for example). The upshot is KISS. Allow LOTS of flash metadata space (and put that on a RAID10) and watch the servers carefully. Those XFS servers were generic Supermicros and worked well for 8(!) years, the only problem being our near-fatal catastrophe with those early LSI Nytro controllers. The replacements (bog-standard LSI SAS hardware controllers) have been solid.

Disks are so cheap now that recovering the parity loss thru compression isn’t nearly as important as it used to be, and most of the ZFS goodies make sense only if you have a dedicated ZFS geek/tuner in house.

And nowadays I’d have the primary scratch on flash, with a more standard FS behind it. No 2 ways about it (Trinity runs sped up 7-20x on generic flash). I haven’t experimented with the different types of flash-optimized FSs, but we were close to trying Excelero (whose current CTO was BeeGFS’s primary architect) when I left.

We do use NFS for a few critical systems (against my advice - we’ve had more problems with NFS than with BeeGFS), but there’s a valid argument for keeping some eggs in different baskets. My issues with NFS have to do with keeping $HOMEs on NFS (with the resultant IO bottlenecks, since many users run large jobs out of their $HOME) and putting our checkpoint server on an NFS machine (an obvious IO mismatch for NFS).

> We have a non-beegfs ZFS implementation running on nexentastor, and it’s not fast, I didn’t expect it to really be. We use it for long-term project storage. This installation is only a single I/O node. I have a feeling that if we used beegfs/zfs with many backend I/O servers it would suffice nicely our long-term project needs and possibly our short-term scratch.

ZFS is especially bad with ZOTfiles (Zillions Of Tiny files) like those that come out of some simulations, the Trinity app, and many self-writ apps that use zero-length files for bookkeeping or indexing, and unlike XFS, it seems to slow down considerably when storage reaches ~80%. It’s great for data integrity, but, possibly related, it takes a bazillion years to resilver large disks, much longer than XFS on hardware controllers.
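The ZOTfiles pain is almost entirely metadata operations (create + close), so it is easy to reproduce with a toy benchmark before committing to an underlay. A sketch; the function name and file counts are made up for illustration, and real numbers only mean something when the target directory actually sits on the filesystem under test:

```python
import os
import tempfile
import time


def zot_write(root: str, n: int) -> float:
    """Create n zero-length files in root and return elapsed seconds.

    Mimics the bookkeeping/indexing pattern of apps like Trinity:
    the cost is file creates, not data, which is exactly where a
    slow metadata path shows up.
    """
    start = time.perf_counter()
    for i in range(n):
        # Zero-length files, like the bookkeeping files described above.
        open(os.path.join(root, f"zot_{i}.tmp"), "w").close()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Point the directory at the filesystem you want to compare;
    # a tempdir here just demonstrates the mechanics.
    with tempfile.TemporaryDirectory() as d:
        elapsed = zot_write(d, 10_000)
        print(f"10,000 creates in {elapsed:.2f}s "
              f"({10_000 / elapsed:,.0f} creates/s)")
```

Running the same script against an XFS-backed and a ZFS-backed mount makes the ZOTfile gap concrete without needing a full application run.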

> You mentioned reliability being better for XFS, that is a bit surprising. While we run XFS as our defacto backend for NFS filesystems with zero issues, we haven’t had issues with our ZFS either. Are the ZFS issues just due to misconfig on the deployment/architecture? Or?

IIRC, the problems with ZFS were not so much reliability as complexity and opacity. You can do lots of things with ZFS, but figuring out how they impact various other things can take time and effort, and since most RC groups are understaffed, that can lead to problems. Especially with performance tuning, there are lots of things that can have minor to major effects, and those don’t show up until your BeeGFS is well into use and it’s too late. And BeeGFS can’t make use of some useful ZFS niceties like snapshots, altho ThinkParQ has that on their roadmap (I think).

The other thing is that while ZFS is fairly efficient in CPU usage, it can peg the cores of an underpowered server, since it’s an unaccelerated software system. I have some long-term screenshots of our backup server showing this - loadavg goes over 20 and IOwait shows significant numbers during the nightly processing. (The PMJ output image below is a composite of 3 nights of data logs of the same backup command.)
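The loadavg/IOwait logging behind those screenshots can be reduced to a few lines. A minimal sketch (the function name is made up; the fifth field of the `cpu` line in `/proc/stat` being cumulative iowait ticks is Linux-specific):

```python
import os


def load_and_iowait():
    """Snapshot the 1/5/15-minute load averages and cumulative iowait ticks.

    loadavg over the core count plus rising iowait is the signature
    described above: the server is stalled on IO, not short of CPU.
    """
    load1, load5, load15 = os.getloadavg()
    with open("/proc/stat") as f:
        # First line: "cpu user nice system idle iowait irq softirq ..."
        cpu = f.readline().split()
    iowait_ticks = int(cpu[5])
    return (load1, load5, load15), iowait_ticks
```

Sampling this once a minute from cron during the nightly backup window is enough to build the same kind of multi-night composite.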

> Have you considered a lustre/ZFS combo?

No. Complexity atop complexity isn’t a good fit for us. We tried it 3x over a decade, with bad results each time. One UCI cluster IS using it in house and claims good perf on large files, but every FS you use is another one you have to become expert in, taking away time from the others, and BeeGFS addressed all the pain points that Lustre was supposed to, so there’s no reason to try it again.

> I’ve always felt very happy with lustre reliability/performance in my 10 years supporting. We’ve only run it with ext4 backend.

Really…? We were on the user list for SDSC’s Lustre system, and it seemed to me that it was fairly regularly being bounced up and down for various reasons. Happy to hear that you had a good experience with it.

> I’ve wished lustre supported multiple name spaces/mounts however. I don’t believe this is supported as of yet. I believe GPFS does support that however. Does beegfs support multiple namespaces?

Not in the way that GPFS does - but then you run into the complexity of THAT bowl of cats. We do use GPFS for archived data, but it’s a VERY expensive way to do it, even at academic prices. And the way it’s implemented at UCI makes it difficult to access and exploit, so after a year of giving away the GPFS space for free, it’s still only 12% used, whereas the BeeGFS systems are 80-90% used. This is almost entirely due to how it’s accessed (via a single-pipe NFS export to the cluster), not the inherent perf of GPFS, which I think is pretty good.

I’d also look at Quobyte. Qumulo is also pretty good (there’s no better storage monitoring interface), but the last time we looked, the academic prices were way out of reach. You can do similar things with Starfish over multiple vendors’ filesystems. You really do need this kind of data management system if you’re going to be dealing with a PB+ data back end.

The academic price for Starfish was still reasonable when I was trying to get it installed at UCI. We used it in beta for several months and it was VERY useful - one of the few commercial products I’ve used that really did seem to justify its price.

Best of luck,

hjm