
Memory Leak using Cuda-Aware MPI_Send and MPI_Recv for large packets of data #9051

Open
geohussain opened this issue Jun 8, 2021 · 13 comments


@geohussain

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source (v4.0.5) with CUDA-aware support enabled

Please describe the system on which you are running

  • Operating system/version: RHEL 7.7
  • Computer hardware: 8 Tesla V100 GPUs
  • Network type: NVLink

Details of the problem

When I send large packets of data (~1 GB) between GPUs using MPI_Send and MPI_Recv and free the CUDA buffers afterwards, the memory does not get freed on the GPU and keeps growing in subsequent iterations. The expected behavior is that GPU memory should be released after sending and receiving large packets of data. The following is the code that produces this behavior.

main.cpp

#include <iostream>
#include <cuda_runtime.h>
#include <mpi.h>

#define CUCHK(error, msg)                                                      \
    if (error != cudaSuccess) {                                                \
        throw std::runtime_error(                                              \
            std::string(msg) + " with " +                                      \
            std::string(cudaGetErrorName(error)) +                             \
            std::string(" -> ") +                                              \
            std::string(cudaGetErrorString(error)) +                           \
            " @" + std::string(__FILE__) + ":" + std::to_string(__LINE__));    \
    }

int main(int argc, char** argv)
{
    /*
     * Initialize MPI
     */
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat;

    if (size !=2) {
        if (rank == 0) {
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }
    cudaError_t ier;
    cudaSetDevice(rank);
    ier = cudaGetLastError();
    CUCHK(ier, "failed to set device")

    /*
     * Loop 1 GB
     */
    for (int i=0; i<=100; i++) {
        long int N;
        N = 1 << 27;


        // Allocate memory for A on CPU
        auto *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to 0.0
        for (int j=0; j<N; j++) {
            A[j] = 0.0;
        }

        double *d_A;
        cudaMalloc(&d_A, N*sizeof(double));
        ier = cudaGetLastError();
        CUCHK(ier, "could not allocate to device")

        cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice);
        ier = cudaGetLastError();
        CUCHK(ier, "could not copy from host to device")

        int tag1 = 10;
        int tag2 = 20;

        int loop_count = 50;

        double start_time, stop_time, elapsed_time;
        start_time = MPI_Wtime();

        for (int j=1; j<=loop_count; j++) {
            if(rank == 0) {
                MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
            }
            else if(rank == 1) {
                MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat);
                MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

        stop_time = MPI_Wtime();
        elapsed_time = stop_time - start_time;

        long int num_B = 8*N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B /(double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) printf("Transfer size (B): %10li, Transfer Time (s): %15.9f, Bandwidth (GB/s): %15.9f\n", num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer);

        cudaFree(d_A);
        ier = cudaGetLastError();
        CUCHK(ier, "could not free device")

        free(A);
    }


    std::cout << "Hello, World!" << std::endl;
    MPI_Finalize();

    return 0;
}
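
One way to make the growth visible from inside the reproducer (a suggested instrumentation, not part of the original program) is to query the free device memory at the end of each outer iteration, e.g. right after the cudaFree(d_A) call:

// Suggested instrumentation; reuses i, rank, ier and CUCHK from main.cpp above.
size_t free_b = 0, total_b = 0;
ier = cudaMemGetInfo(&free_b, &total_b);   // standard CUDA runtime query
CUCHK(ier, "could not query device memory info")
if (rank == 0) {
    printf("iter %d: free device memory %.3f GB of %.3f GB\n",
           i, free_b / 1e9, total_b / 1e9);
}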

CMakeLists.txt

cmake_minimum_required(VERSION 3.18)

# set the project name
project(mpi_gpu_buffer LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED true)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

find_package(MPI REQUIRED)
find_package(OpenMP REQUIRED)
find_package(Threads REQUIRED)
add_executable(mpi_gpu_buffer main.cpp)

#-----------------------------------------------------------------------------------------------------------------------
#|                                                       CUDA                                                          |
#-----------------------------------------------------------------------------------------------------------------------

enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)
set(CMAKE_CUDA_STANDARD 14)
set(CMAKE_CUDA_STANDARD_REQUIRED true)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-code arch=compute_70,code=sm_70 -lineinfo")
#set(CMAKE_CUDA_FLAGS_DEBUG "${CMAKE_CUDA_FLAGS_DEBUG} -G -Xcompiler -rdynamic -lineinfo")
set(CUDA_PROPAGATE_HOST_FLAGS OFF)
set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER})
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
#set(CMAKE_CUDA_ARCHITECTURES 52 61 70)
set(CMAKE_CUDA_ARCHITECTURES 61 70 75)
set(CUDA_LIBRARY CUDA::cudart)

set_property(TARGET mpi_gpu_buffer PROPERTY CUDA_ARCHITECTURES 61 70 75)
target_include_directories(mpi_gpu_buffer PRIVATE
        ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
target_link_libraries(mpi_gpu_buffer
        ${CUDA_LIBRARY}
        ${MPI_CXX_LIBRARIES}
        MPI::MPI_CXX
        OpenMP::OpenMP_CXX)
@jsquyres
Member

jsquyres commented Jun 8, 2021

FYI @open-mpi/ucx

@yosefe
Contributor

yosefe commented Jun 8, 2021

adding @Akshay-Venkatesh
@geohussain

  1. can you please post the mpirun command line?
  2. is it a single node job?
  3. do you know if UCX is being used?

@geohussain
Author

  1. mpirun -n 2 ./mpi_gpu_buffer
  2. Yes, it is a single node job
  3. UCX is being used

@Akshay-Venkatesh
Contributor

>   1. mpirun -n 2 ./mpi_gpu_buffer
>   2. Yes, it is a single node job
>   3. UCX is being used

@geohussain which version of UCX is being used here?

@geohussain
Author

>   1. mpirun -n 2 ./mpi_gpu_buffer
>   2. Yes, it is a single node job
>   3. UCX is being used
>
> @geohussain which version of UCX is being used here?

$ ucx_info -v
# UCT version=1.10.0 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --disable-optimizations --disable-logging --disable-debug --disable-assertions --with-mlx5-dv --enable-mt --disable-params-check --enable-cma --disable-numa --with-cuda=/cm/shared/apps/cuda11.0/toolkit/11.0.3 --prefix=/cm/shared/apps/ucx/intel-compiler/1.10-with-mlx5

@Akshay-Venkatesh
Contributor

@geohussain

I'm able to reproduce the issue. The cuda-ipc transport in UCX caches peer mappings, and a free call on peer-mapped memory is not guaranteed to release it. These mappings are freed at finalize (or if VA recycling is detected, which does not appear to be the case here); the workaround is to disable caching with UCX_CUDA_IPC_CACHE=n. For the sample program you've provided this has no performance impact, because the transfer sizes are large and there is no communication-buffer reuse, but for programs different from this one there would be a performance penalty. UCX could intercept cudaFree calls, but it would have to notify each peer that maps this memory out of band, and that logic is somewhat complex. Would the current workaround suffice?
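
For the reproducer in this issue, that would be something along the lines of the command below (using mpirun's -x option to export the environment variable to the ranks; adjust for your launcher):

mpirun -n 2 -x UCX_CUDA_IPC_CACHE=n ./mpi_gpu_buffer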

The modified test and run command is here: https://gist.github.com/Akshay-Venkatesh/d44e51aea6e980a06f75991bed57c90b

FYI @bureddy

@tdavidcl

tdavidcl commented Sep 6, 2024

Hi,
I think I'm encountering this exact issue on a workstation.
Basically, using MPI communications on CUDA-allocated memory results in memory leaks.
What is the current status of this issue? Is it fixed in more recent versions (I'm using 4.1.4)?

@Akshay-Venkatesh
Contributor

Akshay-Venkatesh commented Sep 27, 2024

When using UCX, this issue is addressed by openucx/ucx#10104, which uses library-internal buffers by default and doesn't directly map user buffers (the direct mapping of user buffers being the root cause of the leaks).

@dylanjude

I just pulled UCX master (4234ca0cd), compiled it, and then built Open MPI 5.0.5 against that UCX, but I'm still seeing the memory growth on the GPU with the original code example. Is there an environment variable or a different configure flag I need to get this fix working?

UCX configure command:
/p/home/djude/downloads/ucx/contrib/../configure --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/p/home/djude/local/ucx-master --with-cuda=/p/app/cuda/cuda-12.4 --with-verbs --enable-mt --enable-cma --without-go

OpenMPI configure command:
../configure --prefix=/p/home/djude/local/openmpi-5.0.5 --with-ucx=/p/home/djude/local/ucx-master --without-verbs --enable-mpi1-compatibility --with-cuda=/p/app/cuda/cuda-12.4 --enable-orterun-prefix-by-default --with-slurm --with-platform=../contrib/platform/mellanox/optimized --with-libevent=internal --without-xpmem

@tdavidcl

tdavidcl commented Sep 28, 2024

> ...
> The cuda-ipc transport in UCX caches peer mappings, and a free call on peer-mapped memory is not guaranteed to release it. These mappings are freed at finalize (or if VA recycling is detected, which does not appear to be the case here); the workaround is to disable caching with UCX_CUDA_IPC_CACHE=n.
> ...

I'm not testing with a clean reproducer but with a hydro code (sadly not public yet), and I get the same behavior with or without UCX_CUDA_IPC_CACHE=n.


  • mpirun -n 4: [attached plot "default"]
  • mpirun -n 4 -x UCX_CUDA_IPC_CACHE=n: [attached plot "nocache"]
  • mpirun -n 4, but falling back on host<->host communication instead of direct GPU transfers: [attached plot "host_fallback"] (a sketch of what this fallback amounts to follows this list)
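
For reference, the host<->host fallback means staging the data through a host buffer so that MPI only ever sees host memory. Applied to the reproducer at the top of this issue, rank 0's side of the inner loop would look roughly like the sketch below (an illustration under that assumption, not the hydro code's actual implementation):

// Illustrative host-staging variant of the reproducer's inner loop (rank 0 side).
// Variable names (A, d_A, N, tag1, tag2, stat) follow main.cpp above.
cudaMemcpy(A, d_A, N*sizeof(double), cudaMemcpyDeviceToHost);   // stage device data in the host buffer
MPI_Send(A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);            // MPI only touches host memory
MPI_Recv(A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat);
cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice);   // copy the received data back to the GPU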

For context:

> ompi_info
                 Package: Debian OpenMPI
                Open MPI: 4.1.4
  Open MPI repo revision: v4.1.4
   Open MPI release date: May 26, 2022
...
  Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
                          '--includedir=${prefix}/include'
                          '--mandir=${prefix}/share/man'
                          '--infodir=${prefix}/share/info'
                          '--sysconfdir=/etc' '--localstatedir=/var'
                          '--disable-option-checking'
                          '--disable-silent-rules'
                          '--libdir=${prefix}/lib/x86_64-linux-gnu'
                          '--runstatedir=/run' '--disable-maintainer-mode'
                          '--disable-dependency-tracking'
                          '--disable-silent-rules'
                          '--disable-wrapper-runpath'
                          '--with-package-string=Debian OpenMPI'
                          '--with-verbs' '--with-libfabric' '--with-psm'
                          '--with-psm2' '--with-ucx'
                          '--with-pmix=/usr/lib/x86_64-linux-gnu/pmix2'
                          '--with-jdk-dir=/usr/lib/jvm/default-java'
                          '--enable-mpi-java'
                          '--enable-opal-btl-usnic-unit-tests'
                          '--with-libevent=external' '--with-hwloc=external'
                          '--disable-silent-rules' '--enable-mpi-cxx'
                          '--enable-ipv6' '--with-devel-headers'
                          '--with-slurm' '--with-cuda=/usr/lib/cuda'
                          '--with-sge' '--without-tm'
                          '--sysconfdir=/etc/openmpi'
                          '--libdir=${prefix}/lib/x86_64-linux-gnu/openmpi/lib'
                          '--includedir=${prefix}/lib/x86_64-linux-gnu/openmpi/include'
...

@tdavidcl

OK, a small update: I've checked that the issue still occurs with

ucx-1.15.0.tar.gz + openmpi-4.1.6.tar.gz 
ucx-1.16.0.tar.gz + openmpi-4.1.6.tar.gz 
ucx-1.17.0.tar.gz + openmpi-4.1.6.tar.gz 

@tdavidcl

After further investigation, my issue led to the discovery of problems related to IPC handling (detailed in #12849), although I don't know whether this issue shares the same root cause.

@Akshay-Venkatesh
Contributor

> When using UCX, this issue is addressed by openucx/ucx#10104, which uses library-internal buffers by default and doesn't directly map user buffers (the direct mapping of user buffers being the root cause of the leaks)

@tdavidcl What I said above is wrong. It turns out that openucx/ucx#10104 doesn't actually address the memory leak. My apologies for the wrong claim. We plan to address this memory leak in UCX 1.19 after the upcoming release at the end of October.
