MPI on M3#
In order to run an MPI program on M3 you need to load the correct MPI library and potentially set some environment variables. Ideally the library you load is the same one your MPI program was compiled with. It is possible under some circumstances to mix and match libraries, but it is safest to keep them consistent.
On CentOS 7#
Most nodes on M3 are currently running CentOS 7; however, we have begun the migration to Rocky 9. Because Rocky 9 is more recent and gives us access to updated software, we use different libraries on CentOS 7 and Rocky 9.
Software#
First, we need an MPI program to run. The following command retrieves one of the examples provided with the Open MPI library.
curl https://raw.githubusercontent.com/open-mpi/ompi/main/examples/connectivity_c.c -o connectivity_c.c
Load the MPI compiler#
Use the command
module load hpcx/2.5.0-redhat7.6
You should now find that the mpicc command is available to you.
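For example, you can confirm that the wrapper compiler is now on your PATH (the exact path printed depends on where the hpcx installation lives):

which mpicc        # should resolve to the hpcx installation
mpicc --version    # reports the underlying (backend) compiler version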
Compile the software#
The command
mpicc connectivity_c.c -o connectivity_c
will produce a program called connectivity_c.
Other software will have a different command to compile. The software should include a README or INSTALL file with instructions. A common pattern is to use a configure script followed by the make command. You should check the logs produced by the configure script to ensure that it found the mpicc command.
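As a rough sketch only (the CC=mpicc setting and the install prefix below are example values that will vary between packages), the configure-and-make pattern usually looks something like:

./configure CC=mpicc --prefix=$HOME/my-software   # tell configure to use the MPI wrapper compiler
grep -i mpicc config.log                          # confirm configure actually found mpicc
make
make install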
Running the software#
Running MPI software involves a complicated interplay between the scheduler finding available nodes to run your program, the MPI library linking up all the running copies of your program, and the two cooperating to actually launch your program. On top of that, there is another complicated interplay in finding the right network interfaces to use in your program (a value which might vary depending on which generation of M3 hardware you are using).
The simplest command to run an MPI job is
echo -e '#!/bin/bash\n srun hostname\n srun connectivity_c' | sbatch --ntasks=2 --nodes=2 --cpus-per-task=1 --tasks-per-node=1
This command uses a number of Slurm options to make the MPI tasks run on different nodes for demonstration purposes. Under normal circumstances, forcing an MPI job to run across multiple nodes when it could run on a single node will decrease performance.
This should produce output like
m3i008
m3i009
Connectivity test on 2 processes PASSED.
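If you prefer a job script to the echo one-liner, the same submission can be written as a batch script. This is a sketch only: the file name connectivity.sbatch is arbitrary and the Slurm options simply mirror the demonstration values above.

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=1
# Load the same MPI library the program was compiled with
module load hpcx/2.5.0-redhat7.6
srun hostname
srun connectivity_c

Submit it with

sbatch connectivity.sbatch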
99% of the time, the commands above will work correctly; however, under some circumstances you might get warning messages like
libibverbs: ibv_create_ah failed to get output IF
or error messages like
[1701048534.229996] [m3k008:211954:0] ib_device.c:990 UCX ERROR ibv_create_ah(dlid=0 sl=0 port=1 src_path_bits=0 dgid=fe80::ba3f:d2ff:fe2a:c488 sgid_index=3 traffic_class=106) failed: Invalid argument
(usually with a lot of other lines, but this is the key indicator)
These usually occur if you are using additional Slurm flags to select a particular node or partition. They indicate that MPI has been unable to auto-detect the correct network interface to use and needs some hints. In these circumstances you can use
echo -e '#!/bin/bash\n srun hostname\n srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh connectivity_c' | sbatch --ntasks=2 --nodes=2 --cpus-per-task=1 --tasks-per-node=1
The task-prolog script is a simple helper to find the correct devices.
On Rocky 9#
At the time of writing, most nodes in M3 are running CentOS 7, but we are beginning to move to Rocky 9. If this is no longer true, congratulations, you have found some old, outdated documentation.
We expect a very similar process on Rocky 9, with the exception that there will be a new hpcx module, which we expect to be called
module load hpcx/2.14
The name is not currently finalised and the module file is not yet created. In the meantime, you can load hpcx with
. /apps/hpcx/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/hpcx-init.sh
hpcx_load
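After hpcx_load has run, the rest of the process should mirror the CentOS 7 steps above. For example, you can check where mpicc now resolves to and recompile the example program (the path printed will depend on the install location):

which mpicc
mpicc connectivity_c.c -o connectivity_c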
UCX_NET_DEVICES#
If the task prolog doesn't work for you either, you may need to set the environment variable UCX_NET_DEVICES directly. This is an ordinary environment variable that you can set with export UCX_NET_DEVICES=...
Some possible values include
export UCX_NET_DEVICES=mlx5_0:1,mlx5_bond_0:1
This is a list of the two most common device names on M3 nodes. If a device doesn't exist, it will be ignored with a warning.
export UCX_NET_DEVICES=bond0.113
This is the TCP/IP interface. It is expected to be slower (higher latency) than the default network interfaces, but potentially more compatible.
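As an example of how this fits into a job, you could export the variable in your batch script before the srun lines. This is a sketch only; the device list shown is just one of the possibilities above and may not match your nodes:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
# Tell UCX which network devices to try; adjust or remove if auto-detection works
export UCX_NET_DEVICES=mlx5_0:1,mlx5_bond_0:1
srun connectivity_c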