NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment.
For details on the container environment (such as the operating system and available modules), please see the README.md in this repo.
Allocate at least two GPUs and set up your environment
module load openmpi rocm
export OMPI_CXX=hipcc
Find the code and compile
cd HPCTrainingExamples/MPI-examples
mpicxx -o ./pt2pt ./pt2pt.cpp
Select the UCX PML to enable GPU-aware communication and run the code
mpirun -n 2 -mca pml ucx ./pt2pt
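The pt2pt.cpp example exercises GPU-aware point-to-point communication: device pointers obtained from hipMalloc are passed directly to the MPI calls, so UCX can move the data without staging it through host memory. A minimal sketch of that pattern (illustrative only, not the actual contents of pt2pt.cpp) is shown below; it can be compiled the same way as above.

// Minimal sketch of GPU-aware MPI point-to-point communication
// (illustrative only; not the actual pt2pt.cpp source).
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    double *d_buf;
    hipMalloc(&d_buf, N * sizeof(double));   // buffer lives in GPU memory

    if (rank == 0) {
        hipMemset(d_buf, 0, N * sizeof(double));
        // The device pointer is handed straight to MPI; with GPU-aware UCX
        // the transfer happens directly between GPUs.
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d doubles into device memory\n", rank, N);
    }

    hipFree(d_buf);
    MPI_Finalize();
    return 0;
}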
Get the OSU micro-benchmark tarball and extract it
mkdir OMB
cd OMB
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
tar -xvf osu-micro-benchmarks-7.3.tar.gz
Create a build directory and cd to osu-micro-benchmarks-7.3
mkdir build
cd osu-micro-benchmarks-7.3
module load rocm openmpi
Build and install OSU micro-benchmarks
./configure --prefix=`pwd`/../build/ \
CC=`which mpicc` \
CXX=`which mpicxx` \
CPPFLAGS=-D__HIP_PLATFORM_AMD__=1 \
--enable-rocm \
--with-rocm=${ROCM_PATH}
make -j12
make install
If you get the error "cannot include hip/hip_runtime_api.h", grep for HIP_PLATFORM_HCC in the configure.ac and configure files and replace it with HIP_PLATFORM_AMD.
Check that the OSU micro-benchmarks were actually built
ls -l ../build/libexec/osu-micro-benchmarks/mpi/
If you see the directories collective, one-sided, pt2pt, and startup, your build was successful.
Allocate 2 GPUs and make them visible
export HIP_VISIBLE_DEVICES=0,1
Make sure GPU-aware communication is enabled and run the benchmark; the trailing "D D" arguments place both the send and receive buffers in device (GPU) memory
mpirun -n 2 -mca pml ucx ../build/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
-m $((16*1024*1024)) D D
Notes:
- Try different pairs of GPUs, for example by changing HIP_VISIBLE_DEVICES.
- Run "rocm-smi --showtopo" to see the link type between each pair of GPUs.
- How does the bandwidth vary for xGMI-connected GPUs vs. PCIe-connected GPUs?
This example takes an MPI ghost (halo) exchange code that runs on the CPU and ports it, step by step, to the GPU and to GPU-aware MPI.
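Before building the code, it helps to recall the communication pattern being ported: each rank owns a block of the mesh and, every iteration, exchanges one layer of ghost (halo) cells with its neighbors. A stripped-down sketch of a 1D decomposition of that pattern on the CPU (illustrative only; not the actual GhostExchange source, which uses a 2D decomposition and different variable names):

// Illustrative halo-exchange pattern (not the actual GhostExchange source):
// each rank swaps one ghost row with its upper and lower neighbors.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int nx = 1000, ny = 1000;               // local domain size (made up)
    std::vector<double> x((ny + 2) * (nx + 2));   // one layer of ghost cells all around

    int up   = (rank + 1 < nprocs) ? rank + 1 : MPI_PROC_NULL;
    int down = (rank - 1 >= 0)     ? rank - 1 : MPI_PROC_NULL;

    MPI_Request req[4];
    // Receive into the ghost rows, send the adjacent interior rows.
    MPI_Irecv(&x[0],                   nx + 2, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&x[(ny + 1) * (nx + 2)], nx + 2, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&x[1 * (nx + 2)],        nx + 2, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&x[ny * (nx + 2)],       nx + 2, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}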
module load amdclang openmpi
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/Orig
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
We can improve this performance with explicit process placement so that all the memory channels are used.
On MI210 nodes, there are 2 NUMA domains per node, so we can assign 4 ranks per NUMA domain when running with 8 ranks:
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa --report-bindings ./GhostExchange \
-x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
On MI300A nodes, there are 4 NUMA domains per node, so we can assign 2 ranks per NUMA domain when running with 8 ranks:
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa --report-bindings ./GhostExchange \
-x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
For the port to the GPU, we take advantage of managed memory (or the single unified memory space on MI300A).
export HSA_XNACK=1
cd ../../Ver1
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
Alternatively, on MI300A, we can run with:
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
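Conceptually, Ver1 relies on the GPU being able to access ordinary host allocations once HSA_XNACK=1 is set (or natively through the MI300A unified memory), so the compute loops can be offloaded without adding explicit data copies or map clauses. A minimal sketch of that idea, assuming an OpenMP target offload port as the amdclang module suggests (illustrative only; not the actual Ver1 source):

// Minimal sketch of the managed-memory idea, assuming an OpenMP target
// offload port (illustrative only; not the actual Ver1 source).
// With HSA_XNACK=1, ordinary host allocations are accessible from the GPU,
// so no map clauses or explicit device copies are needed.
#include <cstdio>
#include <vector>

#pragma omp requires unified_shared_memory

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0);   // plain host allocation
    double *xp = x.data();

    // The loop body runs on the GPU; pages migrate on demand
    // (or are accessed in place on MI300A).
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
        xp[i] = 2.0 * xp[i];

    printf("x[0] = %f\n", xp[0]);
    return 0;
}

Such a file would be compiled with something like amdclang++ -fopenmp --offload-arch=gfx90a (use gfx942 for MI300A).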
In this next version, the MPI buffers are only used on the GPU, so we allocate them there directly and save memory on the CPU.
export HSA_XNACK=1
cd ../../Ver2
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
Alternatively, on MI300A, we can run with:
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
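The change in this version is that the halo buffers are created with hipMalloc and handed directly to MPI, rather than being mirrored in host arrays and copied back and forth around each communication step. A rough sketch of the pattern (illustrative only; buffer names, sizes, and the ring neighbors are made up, not taken from the actual source):

// Sketch of device-resident MPI communication buffers (illustrative only).
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 20000;                       // one ghost row, for example
    double *d_send, *d_recv;
    hipMalloc(&d_send, count * sizeof(double));    // buffers exist only in GPU memory;
    hipMalloc(&d_recv, count * sizeof(double));    // no host mirrors, no hipMemcpy staging
    hipMemset(d_send, 0, count * sizeof(double));

    int up   = (rank + 1) % nprocs;                // simple ring neighbors for illustration
    int down = (rank - 1 + nprocs) % nprocs;

    MPI_Request req[2];
    MPI_Irecv(d_recv, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(d_send, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    hipFree(d_send);
    hipFree(d_recv);
    MPI_Finalize();
    return 0;
}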
Memory allocations can be expensive on the GPU. This next version allocates the MPI buffers just once, in the main routine.
export HSA_XNACK=1
cd ../../Ver3
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
Alternatively, on MI300A, we can run with:
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
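Schematically, the difference from the previous version is that hipMalloc and hipFree are hoisted out of the iteration loop, so the same communication buffers are reused for every ghost exchange. A sketch of that structure (illustrative only; not the actual source):

// Sketch of the allocate-once pattern (illustrative only; not the actual source).
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 20000, niter = 1000;         // made-up sizes
    int up   = (rank + 1) % nprocs;
    int down = (rank - 1 + nprocs) % nprocs;

    // Allocated once, before the timestep loop, instead of every iteration.
    double *d_send, *d_recv;
    hipMalloc(&d_send, count * sizeof(double));
    hipMalloc(&d_recv, count * sizeof(double));
    hipMemset(d_send, 0, count * sizeof(double));

    for (int it = 0; it < niter; it++) {
        // ... GPU computation would go here ...
        MPI_Request req[2];
        MPI_Irecv(d_recv, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(d_send, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  // same buffers reused every iteration
    }

    // Freed once, at the end, rather than after each exchange.
    hipFree(d_send);
    hipFree(d_recv);
    MPI_Finalize();
    return 0;
}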