I tried to run a large job array using /scratch storage to save the jobs' output. My jobs are usually CPU-intensive, but they don't use a large amount of RAM or disk storage. For this reason, I thought it would be better to split the workload so that smaller jobs are distributed over more nodes. I was warned that the lack of parallel I/O on the /scratch partition on Europa would create a bottleneck when there are too many random disk writes during job execution.
To improve the implementation, I have a few questions:
Are /home and /scratch set up differently? All of the output I need should fit comfortably in my home folder, so I could use that.
Do I/O reads and writes have a similar effect on performance?
How many I/O calls are too many? My jobs usually append a ~100-character line to the output file every few minutes of execution.
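As a rough back-of-envelope (assuming one append every five minutes as a round number): each job writes 100 bytes per 300 s, so even 500 concurrent jobs would add up to fewer than 2 small writes per second across the cluster. I don't know whether that already counts as "too many" for this setup.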
/home is over the Ethernet fabric and uses NVMe drives. These drives are very good at random I/O. /scratch is over the InfiniBand fabric and is, right now, simply 12 spinning SATA disks. Those are good at streaming I/O but aren't very good at small, random reads/writes.
Reads will be faster than writes; sequential I/O is much, much faster than random I/O.
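If you want to put numbers on that yourself, a tool like fio can measure it directly. Here's a minimal sketch, assuming fio is installed and the test directories are placeholders you'd adjust:

```
# Hypothetical fio runs comparing small random writes on the two file systems.
# Directories and sizes are placeholders; keep test files small on shared storage.
mkdir -p /scratch/$USER/fio-test "$HOME/fio-test"

# Random 4 KiB writes on /scratch (spinning SATA: expect low IOPS)
fio --name=randwrite-scratch --directory=/scratch/$USER/fio-test \
    --rw=randwrite --bs=4k --size=256m --direct=1

# Same test against /home (NVMe: expect far higher IOPS)
fio --name=randwrite-home --directory=$HOME/fio-test \
    --rw=randwrite --bs=4k --size=256m --direct=1
```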
We are still determining the limitations of our file systems. We also hope to expand or replace /scratch before going into production in the fall.
For your workloads, I feel /home might work better than /scratch, so you could try that. But one thing you could try first is /dev/shm.
/dev/shm is a RAM disk. On most systems, half the RAM of the machine is available via a POSIX file system mounted at /dev/shm. That is, you can write to /dev/shm using standard file system calls and commands (cp, mv, cat, etc.) and it'll work and be VERY fast, because it's actually stored in RAM.
In order to make use of this, you'll have to alter your workflows. At the start of each launcher task, you'll need to copy your input over to /dev/shm, and afterwards you'll need to copy the output back to $HOME or $SCRATCH BEFORE the job ends. You won't be able to access /dev/shm after the job is done, as it is local to each compute node. A rough sketch of what that could look like is below.
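This is only a sketch; my_program, the paths, and the $SLURM_PROCID naming are placeholders for whatever your workflow actually uses:

```
#!/bin/bash
# Hypothetical launcher task: stage files in /dev/shm, copy results out before exiting.
WORKDIR=/dev/shm/$USER/task_$SLURM_PROCID
mkdir -p "$WORKDIR"

# Stage input onto the node-local RAM disk
cp "$SCRATCH/input/task_$SLURM_PROCID.dat" "$WORKDIR/input.dat"

# Run from RAM: every append lands in memory instead of on the network file system
cd "$WORKDIR"
./my_program input.dat > output.log

# Copy results back BEFORE the task ends; /dev/shm does not survive the job
cp output.log "$HOME/results/task_$SLURM_PROCID.log"
rm -rf "$WORKDIR"   # free the RAM for the next task on this node
```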
Play around with it and let’s slowly scale your launcher nodes up from maybe 100 to 200 before we go back up to 500.
Thank you for the explanation; this definitely paints a clearer picture. I doubt the RAM disk will help much, since we are already trying to keep writes to a minimum, unless its random I/O response is so much faster that it avoids the bottleneck of all nodes writing at the same time.
For now, I will try running from $HOME. I submitted a 50-node batch and it seems to be running at a good speed. This is the smallest job/unit size I can reasonably run, and it seems to take 20-30 hours. If there were a way to see how much time is spent writing to disk, that would help. I have been trying to use sstat, but it doesn't give me much information.
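For reference, the closest I have found so far are the disk counters in sstat's accounting fields, which show how much was written but not how long it took (the job ID is a placeholder):

```
# Disk I/O accumulated by a running job step; <jobid> is a placeholder.
# These report bytes moved, not time spent in I/O.
sstat --jobs=<jobid>.batch --format=JobID,AveDiskRead,AveDiskWrite,MaxDiskRead,MaxDiskWrite
```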