Performance tuning on InfiniBand

I’m new to working with InfiniBand, including debugging and tuning strategies. What tools and resources are out there for learning how to optimize and debug? For instance, I know of ibdiagnet for Mellanox cards; are there other programs that show throughput and other helpful statistics?

Hi,

I recommend having a few solid tests you can run quickly, then get into debugging if you have to, and tuning if you can.

If you’re new to InfiniBand, learn to use the low-level point-to-point bandwidth and latency tools (ib_read_bw, ib_write_bw, ib_read_lat, ib_write_lat). These confirm that the drivers and hardware are working correctly. That’s your first sanity check.
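A typical first run looks something like this, with the hostname and HCA device name as placeholders for whatever matches your nodes:

    # bandwidth: start the receiving side on one node...
    node01$ ib_write_bw -d mlx5_0 -F
    # ...then point the sending side at it from a second node
    node02$ ib_write_bw -d mlx5_0 -F node01
    # latency tests follow the same server/client pattern
    node01$ ib_read_lat -d mlx5_0 -F
    node02$ ib_read_lat -d mlx5_0 -F node01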

Next up, run the OSU Micro-Benchmarks. That exercises your MPI installation.
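A minimal sketch of a two-node point-to-point run, assuming Open MPI and that the suite is built under ./osu-micro-benchmarks (binary paths and hostnames are placeholders and vary between OSU versions):

    mpirun -np 2 --map-by node -host node01,node02 \
        ./osu-micro-benchmarks/mpi/pt2pt/osu_latency
    mpirun -np 2 --map-by node -host node01,node02 \
        ./osu-micro-benchmarks/mpi/pt2pt/osu_bw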

Avoid full-blown scientific applications until those things are working, or you’ll be wasting your time. If they are working and an application still isn’t performing as expected, then you get into tuning the application.

For people who might not be experts on the hardware but can run test programs like ib_read_bw, is there something that would help them interpret the numbers they get from those tests?

For example, it appears that ib_read_bw can be run by a normal user. If a regular user wanted to know about performance, how would they know whether

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             11552.68            11534.44             0.184551
---------------------------------------------------------------------------------------

was good, bad, or indifferent?

The same would be true for the OSU benchmarks, I think. They produce lots of figures, but for someone who is new to this, is there something that can guide them in judging whether their output is indicative of good performance for the cluster they happen to be running on?

Hi @shmget: I did a similar exercise at the end of last year, using the approach suggested by @rpwagner, i.e. running the OSU micro-benchmarks. As for @bennet’s question on interpreting the results: I compared the results of a few well-known benchmarks (out of the large number in the OSU suite) against expected numbers, i.e. the documented bandwidth and latency for the IB interconnect at hand. I looked mainly at point-to-point latency and bandwidth, with one MPI rank per node and no threading (similar to this study for Omni-Path).
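To make the comparison concrete, this is the kind of back-of-the-envelope check I mean, assuming a 4X EDR (100 Gb/s) link; substitute your own link speed:

    # usable data rate ~ 100 Gb/s * 64/66 (line encoding) / 8 bits per byte
    echo 'scale=2; 100 * 64 / 66 / 8' | bc    # ~12.12 GB/s
    # so a measured bandwidth in the 11,000-12,000 MB/sec range is close to
    # line rate, while a result of only a few GB/s suggests a config problem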

What I saw was that newer MPI implementations using modern transport layers like UCX perform the best (and if you’re using Slurm on the cluster, make sure to build with PMIx support as well).
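For what it’s worth, an Open MPI build along those lines looks roughly like this (prefixes and library paths are placeholders, not the ones from my cluster):

    ./configure --prefix=$HOME/sw/openmpi \
        --with-ucx=/opt/ucx --with-pmix=/opt/pmix --with-slurm
    make -j && make install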

If you’re interested, all the code I used to run the tests and analyze the results is available here. I tested the mpi versions that were available using this submit script and analyzed the data here. For the new mpi installations, I used spack to build multiple versions of osu-benchmarks and performed the same test (results). If you’re interested in reusing the code, I can try cleaning up the parts that look confusing to you.