MPI on M3#

In order to run an MPI program on M3 you need to load the correct library and potentially set the correct environment variables. Ideally the library you load is the same one your MPI program was compiled with. It is possible under some circumstances to mix and match libraries.
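
If you are not sure which MPI library an existing binary was built against, one quick check is to look at the shared libraries it links to. This is only a sketch; the program name below is a placeholder for your own binary.

ldd ./my_mpi_program | grep -i -E 'mpi|ucx'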

On CentOS 7#

Most nodes on M3 are currently running CentOS 7; however, we have begun the migration process to Rocky 9. Because Rocky 9 is more recent and gives us access to updated software, we use different libraries on CentOS 7 and Rocky 9.

Software#

First we need an MPI program to run.

The following command retrieves one of the examples provided with the Open MPI library.

curl https://raw.githubusercontent.com/open-mpi/ompi/main/examples/connectivity_c.c -o connectivity_c.c

Load the MPI compiler#

Use the command

module load hpcx/2.5.0-redhat7.6

You should now find that the mpicc command is available to you.
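
You can confirm this, and see which underlying compiler the wrapper will invoke, with the commands below. HPC-X is based on Open MPI, so the wrapper should accept Open MPI's --showme flag.

which mpicc
mpicc --showme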

Compile the software#

The command

mpicc connectivity_c.c -o connectivity_c

will produce a program called connectivity_c.

Other software will have a different command to compile. The software should include a README or INSTALL file with instructions. A common pattern is to use a configure script followed by the make command. You should check the logs produced by the configure script to ensure that it found the mpicc command.
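
As a rough sketch of that pattern (the configure options and log file name here are examples only; consult the package's own instructions):

./configure CC=mpicc 2>&1 | tee configure.log
grep -i mpicc configure.log   # confirm configure picked up the MPI compiler wrapper
make -j 4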

Running the software#

Running MPI software involves a complicated interplay between the scheduler finding available nodes to run your program, the MPI library linking up all the running copies of your program, and the two cooperating to actually launch your program. On top of that there is another complicated interplay in finding the right network interfaces for your program to use (a choice which might vary depending on which generation of M3 hardware you are using).

The simplest command to run an MPI job is

echo -e '#!/bin/bash\n srun hostname\n srun connectivity_c' | sbatch --ntasks=2 --nodes=2 --cpus-per-task=1 --tasks-per-node=1

This command uses a number of Slurm options to make the MPI tasks run on different nodes for demonstration purposes. Under normal circumstances, forcing an MPI job to run across multiple nodes when it could run on a single node will decrease performance.

This should produce output like

m3i008
m3i009
Connectivity test on 2 processes PASSED.
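
Note that sbatch returns as soon as the job is submitted; the output shown above is written to a file, which by default Slurm names slurm-<jobid>.out in the directory you submitted from. For example (the job ID here is made up):

squeue -u $USER        # wait until the job no longer appears in the queue
cat slurm-12345.out    # replace 12345 with the job ID that sbatch printed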

99% of the time the above command will work correctly; however, under some circumstances you might get warning messages like

libibverbs: ibv_create_ah failed to get output IF

or error messages like

[1701048534.229996] [m3k008:211954:0]      ib_device.c:990  UCX  ERROR ibv_create_ah(dlid=0 sl=0 port=1 src_path_bits=0 dgid=fe80::ba3f:d2ff:fe2a:c488 sgid_index=3 traffic_class=106) failed: Invalid argument

(usually with a lot of other lines, but this is the key indicator)

These usually occur if you are using additional Slurm flags to select a particular node or partition. They indicate that MPI has been unable to auto-detect the correct network interface to use and needs some hints. In these circumstances you can use

echo -e '#!/bin/bash\n srun hostname\n srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh connectivity_c' | sbatch --ntasks=2 --nodes=2 --cpus-per-task=1 --tasks-per-node=1

The task-prolog script is a simple helper that finds the correct devices.
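
In practice you will usually put all of this into a batch script rather than piping it to sbatch. A minimal sketch, using the CentOS 7 module from above and example resource values that you should adjust for your own job:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1

module load hpcx/2.5.0-redhat7.6
srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh ./connectivity_c

Save this as, say, connectivity.slurm and submit it with sbatch connectivity.slurm.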

On Rocky 9#

At the time of writing, most nodes in M3 are running CentOS 7 but we are beginning to move to Rocky 9. If this is no longer true, congratulations, you have found some old, outdated documentation.

We expect a very similar process on Rocky 9, with the exception that there will be a new hpcx module. We expect this to be called

module load hpcx/2.14

The name is not currently finalised and the module file is not yet created. In the meantime you can load hpcx with

. /apps/hpcx/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/hpcx-init.sh
hpcx_load
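
After hpcx_load the HPC-X tools should be on your PATH. A quick sanity check (the expected path is indicative only):

which mpicc     # should resolve to a path under the hpcx-v2.14 directory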

UCX_NET_DEVICES#

If the task prolog doesn't work for you either, you may need to set the environment variable UCX_NET_DEVICES directly. This is an ordinary environment variable that you can set with export UCX_NET_DEVICES=... Some possible values include

export UCX_NET_DEVICES=mlx5_0:1,mlx5_bond_0:1

This is a list of the two most common device names on M3 nodes. If a device doesn't exist, it will be ignored with a warning.

export UCX_NET_DEVICES=bond0.113

This is the TCP/IP interface. It is expected to be slower (higher latency) than the default network interfaces, but potentially more compatible.
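
If you are unsure which device names actually exist on the nodes your job lands on, UCX can list them for you. A sketch, assuming the hpcx module is loaded so that ucx_info is on your PATH, and with an example device name in the export:

srun --nodes=1 --ntasks=1 ucx_info -d | grep Device
export UCX_NET_DEVICES=mlx5_0:1   # substitute the device name reported above
srun ./connectivity_c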