Anyone else have failures with Intel 2019-mpi with Mellanox IB for # of cores > 300?

Just curious if anyone has any comments regarding a problem we are seeing on our new Skylake cluster. All MPI/compiler combinations we have tried (intel-2018, intel-2019, and gcc-9.1.0 with OpenMPI or MVAPICH2, as well as intel-2018-mpi) run fine on all 960 cores; however, intel-2019-mpi fails on more than ~300 cores. We can switch FI_PROVIDER from ofi_rxm to verbs, but then all codes slow down significantly (though they no longer crash).

I have posted a message on Mellanox’s forum and was told I should contact Intel. I have also tried to post on Intel’s HPC forum and the libfabric forum, but these messages are stuck in moderation.

Anyway, I would appreciate any pointers on this; I am really interested in using intel-2019 MPI.

Hi, we are currently standing up a new cluster with Mellanox ConnectX-5 adapters. I have found that using OpenMPI, MVAPICH2, and intel2018-mpi, we can run MPI jobs on all 960 cores in the cluster; however, using intel2019-mpi we can’t get beyond ~300 MPI ranks. If we do, we get the following error for every rank:

Abort(273768207) on node 650 (rank 650 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)…: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=650, new_comm=0x7911e8) failed
PMPI_Comm_split(489)…:
MPIR_Comm_split_impl(167)…:
MPIR_Allgather_intra_auto(145)…: Failure during collective
MPIR_Allgather_intra_auto(141)…:
MPIR_Allgather_intra_brucks(115)…:
MPIC_Sendrecv(344)…:
MPID_Isend(662)…:
MPID_isend_unsafe(282)…:
MPIDI_OFI_send_lightweight_request(106):
(unknown)(): Other MPI error

This is using the default FI_PROVIDER of ofi_rxm. If we switch to using “verbs”, we can run all 960 cores, but tests show an order of magnitude increase in latency and much longer run times.
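For reference, this is roughly what we are doing (the application name below is just a placeholder; I_MPI_DEBUG is only set so that Intel MPI prints which libfabric provider it actually picked):

# default provider (RxM over verbs) -- crashes above ~300 ranks for us
srun -n 960 ./my_mpi_app

# plain verbs provider -- runs on all 960 cores, but latency is much worse
export FI_PROVIDER=verbs
srun -n 960 ./my_mpi_app

# have Intel MPI report the libfabric version/provider at startup
export I_MPI_DEBUG=5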

We have tried installing our own libfabric (built from the git repo; we also verified with verbose debugging that we are actually using this libfabric), and this behavior does not change.
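Roughly what we did for the libfabric build, in case anyone wants to compare (the install prefix is just an example; ours lives under /usr/local/skylake):

git clone https://github.com/ofiwg/libfabric.git
cd libfabric
git checkout v1.8.0
./autogen.sh
./configure --prefix=/usr/local/libfabric-1.8.0 --enable-verbs
make -j && make install

# verbose logging to confirm which libfabric/provider is actually loaded
export FI_LOG_LEVEL=debug
fi_info    # the verbs and verbs;ofi_rxm providers should show up here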

Is there anything I can change to allow all 960 cores using the default ofi_rxm provider? Or, is there a way to improve performance using the verbs provider?

For completeness:
Using MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.6-x86_64 ofed
CentOS 7.6.1810 (kernel = 3.10.0-957.21.3.el7.x86_64)
Intel Parallel studio version 19.0.4.243
Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]

Thanks!

Eric

Eric, we have a very similar system, and have Intel 19 in the queue to install, and can move it up.

It seems counterintuitive that MPI over libfabric over verbs would be a lot faster than MPI directly over verbs, but a lot of counterintuitive things are true. If you could show some low-level benchmarks, like osu_latency with Intel 18/19 and verbs/libfabric, that would help.

Thanks,
David Chaffin
UArk

Hi David,

Thanks for reaching out. According to Intel support, this is a bug and the next update will fix this.

Anyway, I was able to compile my own libfabric v1.8.0, and I believe things are working OK now when I swap out the libfabric shipped with intel-2019-mpi and provide my own.
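Concretely, the swap is just a few environment variables; the install path below is our local one (the same one that appears in the FI_ output later in this post), and depending on how Intel MPI is sourced you may also need to tell it not to use its bundled libfabric:

# local libfabric 1.8.0 install (site-specific path)
LIBFABRIC=/usr/local/skylake/gcc-4.8.5/libfabric-1.8.0

# make sure our libfabric.so is found ahead of the one bundled with Intel MPI
export LD_LIBRARY_PATH=$LIBFABRIC/lib:$LD_LIBRARY_PATH
export FI_PROVIDER_PATH=$LIBFABRIC/lib/libfabric:$LIBFABRIC/lib

# with Intel MPI 2019, this disables the bundled libfabric (if it was sourced)
export I_MPI_OFI_LIBRARY_INTERNAL=0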

As far as the latency results, I think that Intel-2019-mpi just has some problems with the verbs provider. Here are some results I just generated using Intel-2019 and Intel-2018 with/without my own libfabric:

Note that the verbs provider from the Intel2019-supplied libfabric is much worse than my own libfabric build, and it also crashes when the number of cores exceeds ~300; this doesn’t happen with my own libfabric. Also note that both libfabric verbs results are worse than intel2018, intel2019 with FI_PROVIDER=ofi_rxm, and intel2019 with OpenMPI.

All of these tests were done on two nodes, one core each:

  1. intel/2019 + intel/2019-mpi:
    [slurm> [7 ewalter@fm03 ~/osu_intel2019/mpi/pt2pt ]$srun ./osu_latency

OSU MPI Latency Test v5.4.1

Size Latency (us)

0 1.34
1 1.34
2 1.34
4 1.35
8 1.34
16 1.34
32 1.34
64 1.37
128 1.40
256 1.97
512 2.06
1024 2.29
2048 2.75
4096 3.67
8192 5.18
16384 7.86
32768 10.52
65536 15.27
131072 23.25
262144 85.20
524288 126.66
1048576 208.21
2097152 367.19
4194304 691.05

  2. Same, but with export FI_PROVIDER=^ofi_rxm (the Intel docs say this is how to choose “verbs” instead of “ofi_rxm”):

[slurm> [8 ewalter@fm03 ~/osu_intel2019/mpi/pt2pt ]$export FI_PROVIDER=^ofi_rxm
[slurm> [9 ewalter@fm03 ~/osu_intel2019/mpi/pt2pt ]$srun ./osu_latency

OSU MPI Latency Test v5.4.1

Size Latency (us)

0 48.79
1 48.86
2 49.05
4 49.04
8 48.97
16 49.02
32 49.49
64 51.34
128 52.55
256 58.17
512 99.53
1024 156.44
2048 91.36
4096 111.51
8192 143.85
16384 215.19
32768 356.01
65536 634.66
131072 1191.38
262144 2313.45
524288 4548.35
1048576 9012.70

  3. Same as 2, but using my own libfabric verbs:

[slurm> [2 ewalter@fm03 ~/osu_intel2019/mpi/pt2pt ]$env|grep FI_
FI_PROVIDER_PATH=/usr/local/skylake/gcc-4.8.5/libfabric-1.8.0/lib/libfabric:/usr/local/skylake/gcc-4.8.5/libfabric-1.8.0/lib
FI_PROVIDER=^ofi_rxm
[slurm> [3 ewalter@fm03 ~/osu_intel2019/mpi/pt2pt ]$srun -n 2 ./osu_latency

OSU MPI Latency Test v5.4.1

Size Latency (us)

0 1.92
1 1.91
2 1.91
4 1.91
8 1.91
16 1.91
32 1.93
64 1.94
128 2.01
256 2.55
512 3.20
1024 2.76
2048 3.18
4096 4.52
8192 5.74
16384 7.87
32768 10.44
65536 16.10
131072 27.73
262144 52.73
524288 104.75
1048576 219.32
2097152 444.31
4194304 895.42

  4. Now with intel/2018 + intel/2018-mpi:

OSU MPI Latency Test v5.4.1

Size Latency (us)

0 1.30
1 1.30
2 1.30
4 1.30
8 1.30
16 1.30
32 1.79
64 1.80
128 1.86
256 1.96
512 2.06
1024 2.21
2048 2.63
4096 3.38
8192 4.86
16384 6.67
32768 9.31
65536 13.78
131072 21.16
262144 86.97
524288 127.97
1048576 208.06
2097152 365.83
4194304 672.90

  5. Finally, intel2019 with openmpi-3.1.4:

[slurm> [1 ewalter@fm03 ~/osu_intel2019_ompi/mpi/pt2pt ]$srun ./osu_latency

OSU MPI Latency Test v5.4.1

Size Latency (us)

0 1.21
1 1.24
2 1.24
4 1.24
8 1.27
16 1.27
32 1.30
64 1.39
128 1.94
256 2.05
512 2.18
1024 2.42
2048 2.78
4096 3.84
8192 5.18
16384 7.00
32768 9.12
65536 12.20
131072 21.61
262144 32.46
524288 56.39
1048576 97.14
2097152 180.75
4194304 353.95

Let me know if you have any questions or comments.

Thanks!

Eric

Eric,

ofi_rxm doesn’t look very good, does it?

I did the OSU latency and bw tests with both single-node shared memory and two-node over IB.
Centos and OFED are about 1 version older.

CentOS kernel 3.10.0-957.10.1
MOFED 4.5.1.0.1.1
Intel MPI 18.0.2 and 19.0.4
ConnectX-5
unmodified intel libfabric
19.0.4:
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
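Something like the following srun placements distinguishes the two cases (a sketch, not necessarily the exact commands used; binaries are from the OSU pt2pt directory):

# "SHM" case: both ranks on the same node, so intra-node shm is used
srun -N 1 -n 2 ./osu_latency
srun -N 1 -n 2 ./osu_bw

# "IB" case: one rank per node, so traffic goes over the verbs/OFI path
srun -N 2 --ntasks-per-node=1 ./osu_latency
srun -N 2 --ntasks-per-node=1 ./osu_bw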

I haven’t tried a large run yet, but these data seem to show that IB performance and zero-size latency are slightly improved in 19.0.4, while there is a pretty significant regression in shared-memory performance for large transfers.

Latency (us): first four columns; Bandwidth (MB/s): last four columns
Size  18.0.2 SHM  18.0.2 IB  19.0.4 SHM  19.0.4 IB  18.0.2 SHM  18.0.2 IB  19.0.4 SHM  19.0.4 IB
0 0.66 1.21 0.4 1.16
1 0.94 1.22 0.4 1.16 1.78 2.73 3.59 5.42
2 0.94 1.22 0.4 1.17 3.63 5.6 7.09 10.93
4 0.94 1.22 0.4 1.17 7.38 11.16 14.42 22.11
8 0.94 1.22 0.4 1.17 14.68 22.43 29.2 44.22
16 0.94 1.22 0.4 1.18 29.25 44.83 58.73 88.36
32 0.97 1.28 0.41 1.15 60.93 87.79 111.7 179.1
64 0.98 1.68 0.41 1.19 134.3 157.79 226.13 355.12
128 1.03 1.71 0.54 1.24 266.58 312.79 416.61 665.17
256 0.99 1.74 0.58 1.65 523.78 617.98 1122.71 1251.76
512 1.14 1.82 0.8 1.73 872.19 1182.07 872.98 2495.59
1024 1.42 1.97 0.86 1.91 1465.47 2209.63 1209.55 4316.68
2048 1.51 2.34 1 2.29 2344.37 3856.31 2051.44 6367.19
4096 2.1 2.92 1.44 3.07 3185.61 6148.02 3797.18 8060.09
8192 3.86 4.13 1.9 4.32 4219.96 8352.73 5494.97 9239.97
16384 5.79 5.38 3.31 6.74 4710.59 8521.69 6699.2 10219.29
32768 11.59 7.62 5.18 9.3 5198.02 10567.78 7501.25 10752.05
65536 8.16 11.1 9.3 12.67 12671.75 10856.02 8053.1 11419.71
131072 10.76 18.63 18.35 19.47 14723.34 11740.45 8345.94 10722.96
262144 18.37 32.14 35.29 71.74 16153.7 11837.02 8353.96 10132.35
524288 35.13 126.22 84.07 112.44 16253.88 11628.41 8331.19 11583.93
1048576 81.57 208.41 169.7 193.16 13875.37 11720.75 7590.23 11606.28
2097152 204.52 377.61 337.33 349.34 10451.14 11771.32 7348.55 11600.84
4194304 439.36 740.37 670.89 678.62 9806.69 11767.09 7591.09 11631.22

Hi David,

Thanks for running these tests and the info.

It looks like we get similar results for the IB tests. I repeated my tests using SHM (2 cores on 1 node) and also get similar results (with the nosedive at the larger message sizes).

So it turns out that this seems to be wrong (or, at least I am misunderstanding it):

From: https://software.intel.com/en-us/mpi-developer-guide-linux-ofi-providers-support

“The verbs provider uses RxM utility provider to emulate FI_EP_RDM endpoint over verbs FI_EP_MSG endpoint by default. The verbs provider with FI_EP_RDM endpoint can be used instead of RxM by setting the FI_PROVIDER=^ofi_rxm runtime parameter,”

I find that this results in the provider becoming sockets for me. Using FI_PROVIDER=verbs gives similar results to the default ofi_rxm provider with intel-2019, but both result in segfaults at larger numbers of cores (~400). When I use my own libfabric build, I get the results I posted before (set #3), which are somewhat worse in latency but much worse in the osu_bw test (a quick way to double-check which provider actually gets selected is sketched after the numbers below):

OSU MPI Bandwidth Test v5.4.1

FI_PROVIDER_PATH=/usr/local/skylake/gcc-4.8.5/libfabric-1.8.0/lib/libfabric:/usr/local/skylake/gcc-4.8.5/libfabric-1.8.0/lib
FI_PROVIDER=^ofi_rxm

Size Bandwidth (MB/s)

1                       1.39
2                       2.82
4                       5.66
8                      11.31
16                     22.86
32                     45.56
64                     89.51
128                   176.62
256                   349.46
512                   564.31
1024                 1317.93
2048                 2318.49
4096                 2800.52
8192                 4030.74
16384                4743.87
32768                5668.91
65536                6082.44
131072               6236.04
262144               6293.15
524288               6138.85
1048576              5645.81
2097152              5597.79
4194304              5589.22
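For what it’s worth, here is a quick way to confirm which provider is actually in use at runtime (only a sketch; the test binary is just a placeholder):

# Intel MPI prints the libfabric version and selected provider at startup
export I_MPI_DEBUG=5
srun -n 2 ./osu_latency 2>&1 | grep -i libfabric

# libfabric's own logging shows which provider and verbs device get opened
export FI_LOG_LEVEL=info
srun -n 2 ./osu_latency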

I would love to see if you also get crashes for intel-2019 with 400 or so cores. For instance, osu_gather crashes at size=32768 on 400 cores for me. When I try to use 896 cores, it crashes before any test is run.
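In case it helps anyone reproduce the crash wall, a simple sweep over rank counts shows roughly where things fall over (the binary, rank counts, and the -m message-size option are just examples from the OSU collective tests):

# sweep rank counts to find where the collective starts failing
for n in 256 320 384 448 512; do
    echo "=== $n ranks ==="
    srun -n $n ./osu_gather -m 32768:32768 || echo "FAILED at $n ranks"
done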

I am going to steer users away from Intel-2019 mpi (and instead use OpenMPI) until the next update comes out (in a few weeks).

Thanks again,

Eric

Eric, did you ever figure this out? I’ve asked on the Intel thread, but I think the discussion here is more complete. Even with the 2020.1 release, the same issue occurs. I have a machine with 400 cores where Intel MPI works fine, but on another one with 1280 cores it’s impossible to run. I found that the new “crash wall” is at 640 cores.

So, has anyone figured this out?

Hmm, but you were running with ~300 cores, right? I found a new crash wall at 640. Have you tried with these numbers? And were you running the libfabric provided by Intel?

Thanks!

Hi,

Yes, upgrading from 2019.4 to 2019.5 fixed the issue for us. I haven’t tried 2020.1 yet.

Hope this helps!

Regards,

Eric

Hi,

Yes, I have gone up to ~1000 cores with 2019.5, but it failed miserably with 2019.4.

Regards,

Eric