How is storage performance for high I/O HPC jobs affected by running in the cloud?

toreliza · February 18, 2019, 7:19pm

We have some projects that run high-I/O calculations on our university HPC cluster, and we are considering moving them to the cloud, probably AWS. I assume the data would reside in a cloud data center, so latency and bandwidth could be affected by geographical distance. The properties and capacity of both the client and the cloud storage will presumably also affect the performance seen by the user running the job.

How does the degree of parallelism in the storage itself affect performance? Most likely this depends on the size of the chunks of data (objects) being requested, as well as on the request frequency. The client configuration (processing speed, memory, and its own storage properties) must also influence I/O performance in the cloud. Does anyone have recent numbers (and impressions) they could share?
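To make the chunk-size and parallelism question concrete, here is a rough back-of-the-envelope sketch, not a benchmark. It assumes each request pays a fixed latency and then transfers at a per-stream rate, and that concurrent streams scale linearly until a shared link cap is reached. The function name, parameters, and all numbers are hypothetical placeholders, not measurements of AWS or any other service.

```python
def effective_throughput_mb_s(chunk_mb, latency_s, stream_mb_s, streams, link_mb_s):
    """Estimate aggregate throughput (MB/s) under a simple latency + bandwidth model.

    Assumptions (illustrative only):
      - every request costs `latency_s` seconds before data flows,
      - each stream then transfers at `stream_mb_s`,
      - `streams` concurrent requests scale linearly,
      - the total can never exceed the shared link cap `link_mb_s`.
    """
    per_request_s = latency_s + chunk_mb / stream_mb_s
    aggregate = streams * chunk_mb / per_request_s
    return min(aggregate, link_mb_s)


if __name__ == "__main__":
    # Placeholder values: 50 ms latency, 100 MB/s per stream, 1000 MB/s link cap.
    for chunk_mb in (1, 10, 100):
        for streams in (1, 4, 16):
            rate = effective_throughput_mb_s(chunk_mb, 0.05, 100.0, streams, 1000.0)
            print(f"chunk={chunk_mb:>3} MB  streams={streams:>2}  ~{rate:7.1f} MB/s")
```

In this model, small chunks are dominated by per-request latency, so more parallel streams help a lot, while large chunks approach the per-stream rate and then hit the link cap. Real storage systems add effects the model ignores (throttling, metadata operations, caching), so measured numbers would still be needed.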

Thank you!

lllowe · December 8, 2021, 5:33pm

Most people who mention cloud storage are referring to bucket storage (AWS S3). But you can’t run calculations from bucket storage, so you need to move all the data from bucket storage to the block storage (AWS EBS) of the instance (AWS EC2) running the calculation.

If you are doing parallel programming, you need to create a shared filesystem for the instances or use a managed one (AWS EFS back when I last looked into this, a long time ago; now they also have FSx for Lustre).

So, if you were to look up just the cost of maintaining that persistent parallel file system in the cloud, you would give up (or at least I would) and buy more local HPC storage, or use XSEDE, or use DOE machines.

Amazon, prove me wrong!