# GPU-Aware MPI

## Point-to-point and collective

**NOTE**: These exercises have been tested on MI210 and MI300A accelerators in a container environment. For details on the container environment (such as the operating system and available modules), see the README.md in this repo.

Allocate at least two GPUs and set up your environment

```bash
module load openmpi rocm
export OMPI_CXX=hipcc
```

Find the code and compile

```bash
cd HPCTrainingExamples/MPI-examples
mpicxx -o ./pt2pt ./pt2pt.cpp
```

Run the code, selecting the UCX point-to-point messaging layer (PML):

```bash
mpirun -n 2 -mca pml ucx ./pt2pt
```
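
The point of GPU-aware MPI is that device pointers can be passed straight to MPI calls. As a rough illustration (a minimal hypothetical sketch, not the actual contents of pt2pt.cpp), the pattern looks like this:

```cpp
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    hipSetDevice(rank);                    // one GPU per rank (two GPUs visible)
    double *buf;
    hipMalloc(&buf, n * sizeof(double));   // device memory, no host staging

    if (rank == 0) {
        hipMemset(buf, 0, n * sizeof(double));   // fill the device buffer
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // GPU-aware MPI accepts the device pointer directly
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles into device memory\n", n);
    }

    hipFree(buf);
    MPI_Finalize();
    return 0;
}
```

Without GPU-aware support, the same exchange would require staging the buffer through host memory with hipMemcpy before and after each MPI call.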

## OSU Benchmark

Get the OSU micro-benchmark tarball and extract it

```bash
mkdir OMB
cd OMB
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
tar -xvf osu-micro-benchmarks-7.3.tar.gz
```

Create a build directory, cd into osu-micro-benchmarks-7.3, and load the required modules:

```bash
mkdir build
cd osu-micro-benchmarks-7.3
module load rocm openmpi
```

Build and install the OSU micro-benchmarks:

```bash
./configure --prefix=`pwd`/../build/ \
            CC=`which mpicc` \
            CXX=`which mpicxx` \
            CPPFLAGS=-D__HIP_PLATFORM_AMD__=1 \
            --enable-rocm \
            --with-rocm=${ROCM_PATH}
make -j12
make install
```

If you get the error `cannot include hip/hip_runtime_api.h`, grep for `HIP_PLATFORM_HCC` and replace it with `HIP_PLATFORM_AMD` in the `configure.ac` and `configure` files.
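One way to make the replacement in both files at once is `sed -i 's/HIP_PLATFORM_HCC/HIP_PLATFORM_AMD/g' configure.ac configure`; then re-run the configure step above.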

Check that the OSU micro-benchmarks were actually built:

```bash
ls -l ../build/libexec/osu-micro-benchmarks/mpi/
```

If you see the directories `collective`, `one-sided`, `pt2pt`, and `startup`, your build was successful.

Allocate two GPUs and make them visible:

```bash
export HIP_VISIBLE_DEVICES=0,1
```

Make sure GPU-aware communication is enabled and run the benchmark:

```bash
mpirun -n 2 -mca pml ucx ../build/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
    -m $((16*1024*1024)) D D
```
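
The trailing `D D` arguments tell `osu_bw` to allocate the send and receive buffers in device (GPU) memory; replacing them with `H H` uses host buffers instead, which makes a handy baseline for comparison.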

Notes:

- Try different pairs of GPUs.
- Run `rocm-smi --showtopo` to see the link type between the pairs of GPUs.
- How does the bandwidth vary for xGMI-connected GPUs vs. PCIe-connected GPUs?

## Ghost Exchange example

This example takes an MPI Ghost Exchange code that runs on the CPU and ports it to the GPU and GPU-aware MPI.
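
In a ghost (halo) exchange, each rank swaps the cells along its subdomain boundary with its neighbors every iteration, so that stencil updates near the edges see current data. The communication step boils down to something like the following minimal 1D sketch (hypothetical names, far simpler than the example code):

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Exchange one ghost cell with the left/right neighbors of a 1D
// decomposition. x holds nlocal interior cells plus one ghost cell
// at each end: x[0] (left ghost) and x[nlocal+1] (right ghost).
void ghost_exchange(std::vector<double> &x, int nlocal,
                    int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];
    // receive into the ghost cells
    MPI_Irecv(&x[0],          1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(&x[nlocal + 1], 1, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    // send the first and last interior cells
    MPI_Isend(&x[1],          1, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(&x[nlocal],     1, MPI_DOUBLE, right, 0, comm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int nlocal = 4;
    std::vector<double> x(nlocal + 2, (double)rank);   // interior = rank id
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

    ghost_exchange(x, nlocal, left, right, MPI_COMM_WORLD);
    printf("rank %d ghosts: left=%g right=%g\n", rank, x[0], x[nlocal + 1]);

    MPI_Finalize();
    return 0;
}
```

At the physical domain boundary a neighbor is `MPI_PROC_NULL`, which turns the corresponding send and receive into no-ops. First, build and run the original CPU version: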

```bash
module load amdclang openmpi
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/Orig
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

We can improve this performance with process placement, so that all of the memory channels are used.

On MI210 nodes, there are 2 NUMA domains per node, so we can assign 4 ranks per NUMA domain when running with 8 ranks:

```bash
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa --report-bindings ./GhostExchange \
    -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

On MI300A nodes, there are 4 NUMA domains per node, so we can assign 2 ranks per NUMA domain when running with 8 ranks:

```bash
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa --report-bindings ./GhostExchange \
    -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```
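
`--report-bindings` makes each rank print its core binding at startup so you can confirm the mapping. If you are unsure how many NUMA domains a node has, `lscpu | grep -i numa` will tell you.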

For the port to the GPU, we take advantage of managed memory (or the single memory space on MI300A).
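
Setting `HSA_XNACK=1` lets ordinary host allocations be paged into the GPU on demand (on MI300A, CPU and GPU already share one physical memory). A minimal standalone sketch of the idea, separate from the example code and compiled with `hipcc`:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main() {
    const int n = 1 << 20;
    double *x = new double[n];            // plain host allocation, no hipMalloc
    for (int i = 0; i < n; i++) x[i] = 1.0;

    // With HSA_XNACK=1 the GPU can fault these host pages in directly
    scale<<<(n + 255) / 256, 256>>>(x, n);
    hipDeviceSynchronize();

    printf("x[0] = %.1f (expect 2.0)\n", x[0]);
    delete[] x;
    return 0;
}
```

With that in mind, enable XNACK, then build and run Ver1: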

```bash
export HSA_XNACK=1
cd ../Ver1
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

Alternatively, on MI300A, we can run with:

```bash
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

The MPI buffers are only used on the GPU, so we can allocate them there and save memory on the CPU.
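In an OpenMP offload port this is typically done with `omp_target_alloc` (in HIP it would be `hipMalloc`); either way, with GPU-aware MPI the resulting device pointer goes straight into the MPI calls, just as in the point-to-point sketch above.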

```bash
export HSA_XNACK=1
cd ../Ver2
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

Alternatively, on MI300A, we can run with:

```bash
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

Memory allocations can be expensive on the GPU. This next version allocates the MPI buffers just once, in the main routine.
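Device allocations cost far more than host `malloc` calls (`hipMalloc`, for example, can synchronize the device), so hoisting the buffer allocation out of the exchange loop removes a per-iteration overhead.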

```bash
export HSA_XNACK=1
cd ../Ver3
mkdir build && cd build
cmake ..
make
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:4:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```

Alternatively, on MI300A, we can run with:

```bash
mpirun -n 8 --mca pml ucx --bind-to core --map-by ppr:2:numa -x HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 1000
```