- Deploy a MONAI pipeline with Triton and run the full pipeline on GPU step by step
This example implements a 3D medical imaging AI inference pipeline using MONAI models and transforms and deploys the pipeline with Triton. Its goal is to measure how different features of MONAI and Triton affect medical imaging AI inference performance.
In this repository, I will try the following features:
- Python backend BLS (Triton), which allows you to execute inference requests on other models being served by Triton as part of executing your Python model.
- Transforms on GPU (MONAI), with which you can compose GPU-accelerated pre/post-processing chains.
Before starting, I highly recommend reading the following two links to get familiar with the basic features of the Triton Python backend and MONAI:
- https://github.com/triton-inference-server/python_backend
- Tutorial fast_model_training_guide
The full pipeline is as follows: the client sends the image data to the spleen_seg Python backend model; inside it, MONAI pre-processing transforms run on GPU, the segmentation_3d TorchScript model is invoked through BLS for AI inference, post-processing runs on GPU, and the segmentation result is returned to the client.
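To make this concrete, here is a minimal, hypothetical sketch of such a Python backend model.py combining the two features above: MONAI transforms executed on GPU tensors plus a BLS call to the segmentation_3d model. The tensor names (INPUT0, OUTPUT0, INPUT__0, OUTPUT__0), the transform parameters, and the shape handling are assumptions for illustration only; the real model.py shipped with this example is more elaborate.

```python
# Hypothetical, simplified model.py for a Python backend model like spleen_seg.
# It runs MONAI pre/post-processing on GPU tensors and calls the TorchScript
# segmentation model served by Triton through a BLS request.
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack
import triton_python_backend_utils as pb_utils
from monai.transforms import AsDiscrete, Compose, ScaleIntensityRange


class TritonPythonModel:
    def initialize(self, args):
        # GPU-capable pre/post chains; the parameters here are placeholders.
        self.pre = Compose([
            ScaleIntensityRange(a_min=-57, a_max=164, b_min=0.0, b_max=1.0, clip=True),
        ])
        self.post = Compose([AsDiscrete(argmax=True)])

    def execute(self, requests):
        responses = []
        for request in requests:
            # The input may arrive in CPU or GPU memory (FORCE_CPU_ONLY_INPUT_TENSORS=no).
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            image = from_dlpack(in_tensor.to_dlpack()).cuda().float()   # (H, W, D)

            # Pre-processing on GPU, then add batch and channel dims for the network.
            image = self.pre(image)[None, None]                         # (1, 1, H, W, D)

            # BLS call to the segmentation_3d model (its tensor names are assumptions).
            infer_request = pb_utils.InferenceRequest(
                model_name="segmentation_3d",
                requested_output_names=["OUTPUT__0"],
                inputs=[pb_utils.Tensor.from_dlpack("INPUT__0", to_dlpack(image))],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())

            logits = from_dlpack(
                pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0").to_dlpack()
            )

            # Post-processing on GPU (argmax over the channel dim), then return the mask.
            mask = self.post(logits[0])
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(mask))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```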
The Triton model repository for the experiment can be set up quickly by:
git clone https://github.com/Project-MONAI/tutorials.git
cd tutorials/full_gpu_inference_pipeline
bash download_model_repo.sh
The model repository is in the triton_models folder. The file structure of the model repository should be:
triton_models/
├── segmentation_3d
│ ├── 1
│ │ └── model.pt
│ └── config.pbtxt
└── spleen_seg
├── 1
│ └── model.py
└── config.pbtxt
The Triton environment can be quickly set up by running a Triton docker container:
docker run --gpus=1 -it --name='triton_monai' --ipc=host -p18100:8000 -p18101:8001 -p18102:8002 --shm-size=1g -v /yourfolderpath:/triton_monai nvcr.io/nvidia/tritonserver:21.12-py3
Please note that --ipc=host should be set when starting the docker container, so that shared memory can be used for data transmission between server and client. You should also allocate a relatively large shared memory segment with the --shm-size option, because starting from the 21.04 release, the Python backend uses shared memory to connect the user's code to Triton.
Since we will use MONAI transforms in the Triton Python backend, we need to set up the Python execution environment in the Triton container by following the instructions in the Triton Python backend repository. For the MONAI installation steps, you can refer to the MONAI installation guide. Below are the steps used to set up the proper environment for this experiment:
Install the software packages below:
- conda
- cmake
- rapidjson and libarchive (instructions for installing these packages in Ubuntu or Debian are included in Building from Source Section)
- conda-pack
Create and activate a conda environment.
conda create -n monai python=3.8
conda activate monai
Since the Triton 21.12 NGC docker image is used, in which the Python version is 3.8, we create a conda env with Python 3.8 for convenience. You can also specify other Python versions; if the Python version you use does not match the Triton container's, please make sure you go through these extra steps. Before installing the packages in your conda environment, make sure that you have exported the PYTHONNOUSERSITE environment variable:
export PYTHONNOUSERSITE=True
If this variable is not exported and similar packages are installed outside your conda environment, your tar file may not contain all the dependencies required for an isolated Python environment. Install MONAI and the recommended dependencies (you can also refer to the installation guide of MONAI):
pip install 'monai[all]'
pip install cupy
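Optionally, you can run a quick sanity check from the activated conda environment to confirm that MONAI, CuPy and the GPU are visible before packaging it:

```python
# Optional sanity check inside the "monai" conda environment.
import torch
import cupy
import monai

monai.config.print_config()                    # MONAI and dependency versions
print("torch sees CUDA:", torch.cuda.is_available())
print("cupy sees", cupy.cuda.runtime.getDeviceCount(), "GPU(s)")
```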
Next, package the conda environment using the conda-pack command, which will produce a monai.tar.gz archive. This file contains everything the Python backend model needs to run and is portable. Then put the created monai.tar.gz under the spleen_seg folder, and set the following in config.pbtxt:
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "$$TRITON_MODEL_DIRECTORY/monai.tar.gz"}
}
Also, please note that in the config.pbtxt the parameter FORCE_CPU_ONLY_INPUT_TENSORS is set to "no", so that Triton will not move input tensors to CPU for the Python model. Instead, Triton will provide the input tensors to the Python model in either CPU or GPU memory, depending on how those tensors were last used.
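This means the Python model should be prepared to receive either a CPU or a GPU tensor. A small sketch of how this can be handled (the tensor name INPUT0 and the helper name are assumptions for illustration):

```python
# Sketch: consume an input tensor that may already live in GPU memory,
# which is possible because FORCE_CPU_ONLY_INPUT_TENSORS is "no".
import torch
from torch.utils.dlpack import from_dlpack
import triton_python_backend_utils as pb_utils


def input_as_torch(request, name="INPUT0"):        # "INPUT0" is an assumed tensor name
    pb_tensor = pb_utils.get_input_tensor_by_name(request, name)
    if pb_tensor.is_cpu():
        # CPU path: one host-to-device copy
        return torch.from_numpy(pb_tensor.as_numpy()).cuda()
    # GPU path: zero-copy handoff via DLPack
    return from_dlpack(pb_tensor.to_dlpack())
```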
And now the file structure of the model repository should be:
triton_models/
├── segmentation_3d
│ ├── 1
│ │ └── model.pt
│ └── config.pbtxt
└── spleen_seg
├── 1
│ └── model.py
├── config.pbtxt
└── monai.tar.gz
Then you can start the Triton server with the command:
tritonserver --model-repository=/ROOT_PATH_OF_YOUR_MODEL_REPOSITORY
We assume that the server and the client are on the same machine. Open a new bash terminal and run the commands below to set up the client environment.
nvidia-docker run -it --ipc=host --shm-size=1g --name=triton_client --net=host nvcr.io/nvidia/tritonserver:21.12-py3-sdk
pip install monai
pip install nibabel
pip install jupyter
Then you can run the Jupyter notebook in the client folder of this example. Please note that --ipc=host should be set when starting the docker container, so that shared memory can be used for data transmission between server and client.
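As a rough illustration of what the client does, the sketch below sends an input volume to the spleen_seg model through system shared memory. The input/output names (INPUT0/OUTPUT0), the FP32 datatype, and the dummy data are assumptions made for illustration; check the notebook and config.pbtxt for the actual ones.

```python
# Hypothetical client sketch: send the input volume via system shared memory.
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:18100")
client.unregister_system_shared_memory()   # clean up any leftover regions

# Dummy volume standing in for a loaded NIfTI image; dtype/shape are assumptions.
volume = np.zeros((512, 512, 114), dtype=np.float32)
byte_size = volume.nbytes

# Create a system shared-memory region, copy the data in, and register it with Triton.
shm_handle = shm.create_shared_memory_region("input_data", "/input_data", byte_size)
shm.set_shared_memory_region(shm_handle, [volume])
client.register_system_shared_memory("input_data", "/input_data", byte_size)

# Point the inference input at the shared-memory region instead of sending bytes over HTTP.
infer_input = httpclient.InferInput("INPUT0", list(volume.shape), "FP32")
infer_input.set_shared_memory("input_data", byte_size)

result = client.infer("spleen_seg", inputs=[infer_input])
mask = result.as_numpy("OUTPUT0")           # "OUTPUT0" is an assumed output name
print(mask.shape)

# Clean up the shared-memory region.
client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)
```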
The benchmark was run on an RTX 8000 and measured with perf_analyzer.
perf_analyzer -m spleen_seg -u localhost:18100 --input-data zero --shape "INPUT0":512,512,114 --shared-memory system
The relevant fields in the perf_analyzer latency report are:
- HTTP: send/recv indicates the time the client spent sending the request and receiving the response; response wait indicates the time spent waiting for the response from the server.
- GRPC: (un)marshal request/response indicates the time spent marshalling the request data into the GRPC protobuf and unmarshalling the response data from the GRPC protobuf; response wait indicates the time spent writing the GRPC request to the network, waiting for the response, and reading the GRPC response from the network.
- compute_input : The count and cumulative duration to prepare input tensor data as required by the model framework / backend. For example, this duration should include the time to copy input tensor data to the GPU.
- compute_infer : The count and cumulative duration to execute the model.
- compute_output : The count and cumulative duration to extract output tensor data produced by the model framework / backend. For example, this duration should include the time to copy output tensor data from the GPU.
Since 3D medical images are generally large, the overhead introduced by the communication protocol cannot be ignored. In the most common medical imaging AI deployments, the client runs on the same machine as the server, so shared memory is a practical way to reduce the send/receive overhead. In this experiment, perf_analyzer is used to compare the different ways of communicating between client and server. Note that all the processing (pre/post and AI inference) is on GPU. From the results, we can conclude that using shared memory greatly reduces latency when the amount of data transferred is large.
After moving pre- and post-processing to the GPU, we get a 12x speedup for the full pipeline.