From HPCTrainingExamples/ManagedMemory/README.md
in the training exercises repository
NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment.
To see details on the container environment (such as operating system and modules available), please see the README.md
in this repo.
The source code for these exercises is based on the code in the presentation, but with details filled in so that there is a working program. You may want to examine the code in these exercises and compare it with the code in the presentation and with the code in the other exercises.
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/ManagedMemory
First, run the standard CPU version. This is a working version of the original CPU code from the programming model presentation.
The example will work with any C compiler and run on any CPU. To set up the environment, we need to set the CC environment
variable to the C compiler executable. We do this by loading the amdclang module, which sets CC=/opt/rocm-<version>/llvm/bin/amdclang.
The makefile uses the CC environment variable that we have set. In our modules, we set the module "family" to compiler so that only one compiler
can be loaded at a time.
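As a reference point, a minimal makefile for this step might look like the sketch below. This is a hypothetical sketch, not necessarily the makefile in the repository; the idea is just that it picks up CC from the environment set by the module and uses the same flags shown in the compile line further down.

```makefile
# Hypothetical minimal makefile sketch: CC comes from the amdclang module
# (recipe lines must be indented with a tab)
CFLAGS = -g -O3

cpu_code: cpu_code.c
	$(CC) $(CFLAGS) $< -o $@

clean:
	rm -f cpu_code
```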
cd HPCTrainingExamples/ManagedMemory/CPU_Code
module load amdclang
make
will compile with /opt/rocm-<version>/llvm/bin/amdclang -g -O3 cpu_code.c -o cpu_code
Then run the code with
./cpu_code
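For reference, the CPU version has roughly the following shape. This is a simplified sketch, not the exact code from the repository: the array names match those that appear later in the GPU diff, but the loop body and problem size are assumptions for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){
   int M = 100000;                        /* assumed problem size */
   size_t Msize = M * sizeof(double);
   double *in_h  = (double *)malloc(Msize);
   double *out_h = (double *)malloc(Msize);

   for (int i = 0; i < M; i++) in_h[i] = (double)i;

   /* the loop that later versions offload to the GPU (assumed operation) */
   for (int i = 0; i < M; i++) out_h[i] = in_h[i] * 2.0;

   printf("out_h[0] = %lf\n", out_h[0]);
   free(in_h); free(out_h);
   return 0;
}
```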
This example shows standard explicit GPU memory management. For this case, we must move the memory ourselves. This example will run on any AMD Instinct GPU (data center GPUs) and on most workstation or desktop discrete GPUs and APUs. The AMD GPU driver and ROCm software need to be installed.
For the environment setup, we need the ROCm bin directory added to the path. We do this by loading the ROCm module with
module load rocm
which sets the path to the ROCm bin directory. We could also do this with export PATH=${PATH}:/opt/rocm-<version>/bin
or by supplying the full path /opt/rocm-<version>/bin/hipcc
on the compile line. Note that even this may not be
necessary, as the ROCm install may have placed a link to hipcc
in /usr/bin/hipcc.
We also supply a --offload-arch=${AMDGPU_GFXMODEL}
option to the compile line. While not necessarily required, it helps
in cases where the architecture is not autodetected properly. We use the following line to query what the
architecture string AMDGPU_GFXMODEL
should be. We can also set our own AMDGPU_GFXMODEL
variable in cases where
we want to cross-compile or compile for more than one architecture.
AMDGPU_GFXMODEL ?= $(strip $(shell rocminfo |grep -m 1 -E gfx[^0]{1} | sed -e 's/ *Name: *//'))
The AMDGPU_GFXMODEL
architecture string is gfx90a
for the MI200 series and gfx942
for MI300A and MI300X. We can
also compile for more than one architecture with export AMDGPU_GFXMODEL="gfx90a;gfx942".
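If AMDGPU_GFXMODEL contains a semicolon-separated list, the makefile has to turn it into one --offload-arch flag per architecture, since hipcc accepts the flag multiple times. Below is a hedged sketch of how that could be done in GNU make; it is an illustration, not necessarily what the repository's makefile actually does.

```makefile
# Hypothetical sketch: expand "gfx90a;gfx942" into
# "--offload-arch=gfx90a --offload-arch=gfx942"
OFFLOAD_FLAGS = $(foreach arch,$(subst ;, ,$(AMDGPU_GFXMODEL)),--offload-arch=$(arch))

gpu_code: gpu_code.hip
	hipcc -g -O3 $(OFFLOAD_FLAGS) $< -o $@
```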
cd ../GPU_Code
make
This will compile with hipcc -g -O3 --offload-arch=${AMDGPU_GFXMODEL} gpu_code.hip -o gpu_code.
Then run the code with
./gpu_code
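Putting the pieces together, the explicit version follows the pattern sketched below. This is a simplified reconstruction based on the lines shown in the diff and grep output further down, not the exact source; the kernel body, problem size, and launch configuration are assumptions for illustration.

```c++
#include <stdio.h>
#include <stdlib.h>
#include "hip/hip_runtime.h"

// error checking macro of the kind used in the exercises (reconstruction)
#define HIP_CHECK(call)                                               \
do {                                                                  \
   hipError_t gpuErr = call;                                          \
   if(hipSuccess != gpuErr){                                          \
      printf("HIP error at %s:%d: %s\n",                              \
             __FILE__, __LINE__, hipGetErrorString(gpuErr));          \
      exit(1);                                                        \
   }                                                                  \
} while(0)

__global__ void gpu_func(double *in, double *out, int M){
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < M) out[i] = in[i] * 2.0;   // assumed kernel body
}

int main(int argc, char *argv[]){
   int M = 100000;                     // assumed problem size
   size_t Msize = M * sizeof(double);
   double *in_h  = (double *)malloc(Msize);
   double *out_h = (double *)malloc(Msize);
   for (int i = 0; i < M; i++) in_h[i] = (double)i;

   // explicit memory management: allocate device copies and move the data ourselves
   double *in_d, *out_d;
   HIP_CHECK(hipMalloc((void **)&in_d, Msize));
   HIP_CHECK(hipMalloc((void **)&out_d, Msize));
   HIP_CHECK(hipMemcpy(in_d, in_h, Msize, hipMemcpyHostToDevice));

   dim3 block(256), grid((M + block.x - 1) / block.x);
   gpu_func<<<grid,block,0,0>>>(in_d, out_d, M);
   HIP_CHECK(hipDeviceSynchronize());

   HIP_CHECK(hipMemcpy(out_h, out_d, Msize, hipMemcpyDeviceToHost));
   printf("out_h[0] = %lf\n", out_h[0]);

   HIP_CHECK(hipFree(in_d)); HIP_CHECK(hipFree(out_d));
   free(in_h); free(out_h);
   return 0;
}
```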
In this example, we will set the HSA_XNACK
environment variable to 1 and let the operating system move the memory for us.
This will run on AMD Instinct data center GPUs, including the MI300X, MI300A, and MI200 series. To set up the environment and build and run the code:
export HSA_XNACK=1
module load rocm
cd ../Managed_Memory_Code
make
./gpu_code
To understand the difference between the explicit memory management programming and the managed memory, let's compare the two codes.
diff gpu_code.hip ../GPU_Code/
You should see the following:
34a35,37
> double *in_d, *out_d;
> HIP_CHECK(hipMalloc((void **)&in_d, Msize));
> HIP_CHECK(hipMalloc((void **)&out_d, Msize));
38a42,43
> HIP_CHECK(hipMemcpy(in_d, in_h, Msize, hipMemcpyHostToDevice));
>
41c46
< gpu_func<<<grid,block,0,0>>>(in_h, out_h, M);
---
> gpu_func<<<grid,block,0,0>>>(in_d, out_d, M);
43a49
> HIP_CHECK(hipMemcpy(out_h, out_d, Msize, hipMemcpyDeviceToHost));
It may be more instructive to compare the lines of HIP code that are required in each version, starting with the explicit memory management GPU code.
grep hip ../GPU_Code/gpu_code.hip
which gives the following output:
#include "hip/hip_runtime.h"
hipError_t gpuErr = call; \
if(hipSuccess != gpuErr){ \
__FILE__, __LINE__, hipGetErrorString(gpuErr)); \
HIP_CHECK(hipMalloc((void **)&in_d, Msize));
HIP_CHECK(hipMalloc((void **)&out_d, Msize));
HIP_CHECK(hipMemcpy(in_d, in_h, Msize, hipMemcpyHostToDevice));
HIP_CHECK(hipDeviceSynchronize());
HIP_CHECK(hipMemcpy(out_h, out_d, Msize, hipMemcpyDeviceToHost));
grep hip gpu_code.hip
For the managed memory program, we essentially get just the
addition of the hipDeviceSynchronize
call, plus the inclusion of the
HIP runtime header and the error checking macro.
#include "hip/hip_runtime.h"
hipError_t gpuErr = call; \
if(hipSuccess != gpuErr){ \
__FILE__, __LINE__, hipGetErrorString(gpuErr)); \
HIP_CHECK(hipDeviceSynchronize());
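Putting that together, the compute portion of the managed memory version reduces to roughly the fragment below. This is a sketch under the same assumptions as the explicit sketch above (it reuses gpu_func, HIP_CHECK, grid, and block from that sketch); with HSA_XNACK=1 the driver migrates the malloc'd pages to the GPU on demand.

```c++
   // same host allocations as before: plain malloc, no hipMalloc or hipMemcpy
   double *in_h  = (double *)malloc(Msize);
   double *out_h = (double *)malloc(Msize);
   for (int i = 0; i < M; i++) in_h[i] = (double)i;

   // with HSA_XNACK=1 the host pointers can be passed straight to the kernel
   gpu_func<<<grid,block,0,0>>>(in_h, out_h, M);
   HIP_CHECK(hipDeviceSynchronize());

   printf("out_h[0] = %lf\n", out_h[0]);
```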
We'll run the same code as we used in the managed memory example. Because the memory pointers are addressable on both the CPU and the GPU, no memory management is necessary. First, log onto an MI300A node. Then compile and run the code as follows.
export HSA_XNACK=1
module load rocm
cd ../APU_Code
make
./gpu_code
It may be confusing why we still need HSA_XNACK=1.
Even with the APU, the pointers need to be mapped into the
GPU page table, though the memory itself does not need to be copied.
For this example, we have a simple code with the loop offloading in the main program, openmp_code,
and a second version, openmp_code1, with the offloaded loop in a subroutine where the compiler
cannot tell the size of the array. Running this on the MI200 series, it passes even though the MI200
does not have a single address space. We set export LIBOMPTARGET_INFO=-1
or, for less output,
export LIBOMPTARGET_INFO=$((0x1 | 0x10))
to verify that it is running on the GPU.
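The offloaded loop in these codes has roughly the following form. This is a hedged sketch, not the exact source; the array names, problem size, and loop body are assumptions. The requires directive appears as the first line of the file, matching the line that a later experiment asks you to comment out.

```c
#pragma omp requires unified_shared_memory

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){
   int M = 100000;
   double *in  = (double *)malloc(M * sizeof(double));
   double *out = (double *)malloc(M * sizeof(double));
   for (int i = 0; i < M; i++) in[i] = (double)i;

   /* with unified_shared_memory (and HSA_XNACK=1), no map clauses are needed */
   #pragma omp target teams distribute parallel for simd
   for (int i = 0; i < M; i++)
      out[i] = in[i] * 2.0;

   printf("out[0] = %lf\n", out[0]);
   free(in); free(out);
   return 0;
}
```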
export HSA_XNACK=1
module load amdclang
cd ../OpenMP_Code
make
You should see some warnings telling you that the AMD clang compiler is ignoring the simd
clause. You can remove the simd
clause from the OpenMP pragmas, but at the expense of portability to some
other OpenMP compilers. Now run the code.
./openmp_code
./openmp_code1
export LIBOMPTARGET_INFO=$((0x1 | 0x10)) # or export LIBOMPTARGET_INFO=-1
./openmp_code
./openmp_code1
If the executable is running on the GPU you will see some output as a result of the LIBOMPTARGET_INFO
environment
variable being set. If it is not running on the GPU, you will not see anything.
For more experimentation with this example, comment out the first line of the two source codes.
//#pragma omp requires unified_shared_memory
make
export LIBOMPTARGET_INFO=-1
./openmp_code
./openmp_code1
Now with the LIBOMPTARGET_INFO
variable set, we get a report that memory is being copied to the device
and back. The OpenMP compiler is helping out a lot more than might be expected even without an APU.
First, set up the environment
module load amdclang
module load rocm
For the RAJA example, we need to build the RAJA library first
cd ~/HPCTrainingExamples/ManagedMemory/Raja_Code
PWDir=`pwd`
git clone --recursive https://github.com/LLNL/RAJA.git Raja_build
cd Raja_build
mkdir build_hip && cd build_hip
cmake -DCMAKE_INSTALL_PREFIX=${PWDir}/Raja_HIP \
-DROCM_ROOT_DIR=/opt/rocm \
-DHIP_ROOT_DIR=/opt/rocm \
-DHIP_PATH=/opt/rocm/bin \
-DENABLE_TESTS=Off \
-DENABLE_EXAMPLES=Off \
-DRAJA_ENABLE_EXERCISES=Off \
-DENABLE_HIP=On \
..
make -j 8
make install
cd ../..
rm -rf Raja_build
export Raja_DIR=${PWDir}/Raja_HIP
Now we build the example. Note that the arrays are simply allocated on the
host with malloc. To run it on the MI200 series, we need to set the
HSA_XNACK variable.
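The kernel in the RAJA example has roughly this shape. This is a sketch with assumed names and loop body, not the exact source: the arrays are plain malloc'd host pointers, and a RAJA::forall with a HIP execution policy launches the loop on the GPU.

```c++
#include <cstdio>
#include <cstdlib>
#include "RAJA/RAJA.hpp"

int main(int argc, char *argv[]){
   int M = 100000;
   double *in  = (double *)malloc(M * sizeof(double));
   double *out = (double *)malloc(M * sizeof(double));
   for (int i = 0; i < M; i++) in[i] = (double)i;

   // host pointers are used directly in the device loop;
   // HSA_XNACK=1 lets the pages migrate on demand
   RAJA::forall<RAJA::hip_exec<256>>(RAJA::RangeSegment(0, M),
      [=] RAJA_DEVICE (int i) {
         out[i] = in[i] * 2.0;
      });

   printf("out[0] = %lf\n", out[0]);
   free(in); free(out);
   return 0;
}
```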
# To run with managed memory
export HSA_XNACK=1
mkdir build && cd build
CXX=hipcc cmake ..
make
./raja_code
cd ..
rm -rf build
cd ${PWDir}
rm -rf Raja_HIP
cd ..
rm -rf ${PROB_NAME}
First, set up the environment
module load amdclang
module load rocm
For the Kokkos example, we also need to build Kokkos first
cd ~/HPCTrainingExamples/ManagedMemory/Kokkos_Code
PWDir=`pwd`
git clone https://github.com/kokkos/kokkos Kokkos_build
cd Kokkos_build
mkdir build_hip && cd build_hip
cmake -DCMAKE_INSTALL_PREFIX=${PWDir}/Kokkos_HIP -DKokkos_ENABLE_SERIAL=ON \
-DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_ZEN=ON -DKokkos_ARCH_VEGA90A=ON \
-DCMAKE_CXX_COMPILER=hipcc ..
make -j 8
make install
cd ../..
rm -rf Kokkos_build
export Kokkos_DIR=${PWDir}/Kokkos_HIP
Now we build the example. Note that we have not had to declare the arrays as Kokkos Views.
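The Kokkos example is roughly of this form. This is a sketch with assumed names and loop body, not the exact source: the arrays are raw malloc'd host pointers rather than Kokkos Views, and the parallel_for runs on the default HIP execution space.

```c++
#include <cstdio>
#include <cstdlib>
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]){
   Kokkos::initialize(argc, argv);
   {
      int M = 100000;
      double *in  = (double *)malloc(M * sizeof(double));
      double *out = (double *)malloc(M * sizeof(double));
      for (int i = 0; i < M; i++) in[i] = (double)i;

      // raw host pointers captured by the device lambda; with HSA_XNACK=1
      // the pages migrate to the GPU when the kernel touches them
      Kokkos::parallel_for("scale loop", M, KOKKOS_LAMBDA(const int i) {
         out[i] = in[i] * 2.0;
      });
      Kokkos::fence();

      printf("out[0] = %lf\n", out[0]);
      free(in); free(out);
   }
   Kokkos::finalize();
   return 0;
}
```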
# To run with managed memory
export HSA_XNACK=1
mkdir build && cd build
CXX=hipcc cmake ..
make
./kokkos_code
cd ${PWDir}
rm -rf Kokkos_HIP
cd ..
rm -rf ${PROB_NAME}
With recent versions of Kokkos, there is support for a single copy of memory on the MI300A.
The option -DKokkos_ENABLE_IMPL_HIP_UNIFIED_MEMORY=ON in Kokkos 4.4+
makes it easy to switch from duplicate host/device arrays to a single copy of memory on the MI300A.
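For example, a Kokkos 4.4+ build for the MI300A could be configured with something like the line below. This is an illustrative variant of the configure line used above; the architecture option shown here for gfx942 is an assumption and should be checked against the Kokkos documentation for the version you are building.

```
cmake -DCMAKE_INSTALL_PREFIX=${PWDir}/Kokkos_HIP -DKokkos_ENABLE_SERIAL=ON \
      -DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_AMD_GFX942=ON \
      -DKokkos_ENABLE_IMPL_HIP_UNIFIED_MEMORY=ON \
      -DCMAKE_CXX_COMPILER=hipcc ..
```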