- Deploy a MONAI pipeline with Triton and run the full pipeline on GPU step by step
This example implements a 3D medical imaging AI inference pipeline using MONAI models and transforms and deploys the pipeline with Triton. Its goal is to measure how different features of MONAI and Triton affect medical imaging AI inference performance.
In this repository, I will try the following features:
- Python backend BLS (Triton), which allows you to execute inference requests on other models being served by Triton as part of executing your Python model.
- Transforms on GPU (MONAI), with which you can compose GPU-accelerated pre/post-processing chains.
Before starting, I highly recommend reading the following two links to get familiar with the basic features of the Triton Python backend and MONAI:
- https://github.com/triton-inference-server/python_backend
- Tutorial fast_model_training_guide
The full pipeline is as follows: the client sends the image data to the spleen_seg Python backend model; inside it, MONAI pre-processing transforms run on GPU, the segmentation_3d TorchScript model is invoked through BLS for AI inference, post-processing runs on GPU, and the segmentation result is returned to the client.
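To make this concrete, here is a minimal, hypothetical sketch of such a Python backend model.py combining the two features above: MONAI transforms executed on GPU tensors plus a BLS call to the segmentation_3d model. The tensor names (INPUT0, OUTPUT0, INPUT__0, OUTPUT__0), the transform parameters, and the shape handling are assumptions for illustration only; the real model.py shipped with this example is more elaborate.

```python
# Hypothetical, simplified model.py for a Python backend model like spleen_seg.
# It runs MONAI pre/post-processing on GPU tensors and calls the TorchScript
# segmentation model served by Triton through a BLS request.
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack
import triton_python_backend_utils as pb_utils
from monai.transforms import AsDiscrete, Compose, ScaleIntensityRange


class TritonPythonModel:
    def initialize(self, args):
        # GPU-capable pre/post chains; the parameters here are placeholders.
        self.pre = Compose([
            ScaleIntensityRange(a_min=-57, a_max=164, b_min=0.0, b_max=1.0, clip=True),
        ])
        self.post = Compose([AsDiscrete(argmax=True)])

    def execute(self, requests):
        responses = []
        for request in requests:
            # The input may arrive in CPU or GPU memory (FORCE_CPU_ONLY_INPUT_TENSORS=no).
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            image = from_dlpack(in_tensor.to_dlpack()).cuda().float()   # (H, W, D)

            # Pre-processing on GPU, then add batch and channel dims for the network.
            image = self.pre(image)[None, None]                         # (1, 1, H, W, D)

            # BLS call to the segmentation_3d model (its tensor names are assumptions).
            infer_request = pb_utils.InferenceRequest(
                model_name="segmentation_3d",
                requested_output_names=["OUTPUT__0"],
                inputs=[pb_utils.Tensor.from_dlpack("INPUT__0", to_dlpack(image))],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())

            logits = from_dlpack(
                pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0").to_dlpack()
            )

            # Post-processing on GPU (argmax over the channel dim), then return the mask.
            mask = self.post(logits[0])
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(mask))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```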
The Triton model repository for the experiment can be set up quickly by:
git clone https://github.com/Project-MONAI/tutorials.git
cd tutorials/full_gpu_inference_pipeline
bash download_model_repo.sh
The model repository is in the triton_models folder. The file structure of the model repository should be:
triton_models/
├── segmentation_3d
│ ├── 1
│ │ └── model.pt
│ └── config.pbtxt
└── spleen_seg
├── 1
│ └── model.py
└── config.pbtxt
The Triton environment can be quickly set up by running a Triton docker container:
docker run --gpus=1 -it --name='triton_monai' --ipc=host -p18100:8000 -p18101:8001 -p18102:8002 --shm-size=1g -v /yourfolderpath:/triton_monai nvcr.io/nvidia/tritonserver:21.12-py3
Please note that --ipc=host should be set when starting the docker container, so that shared memory can be used for data transmission between server and client. You should also allocate a relatively large shared memory segment with the --shm-size option, because starting from the 21.04 release, the Python backend uses shared memory to connect the user's code to Triton.
Since we will use MONAI transforms in the Triton Python backend, we need to set up the Python execution environment in the Triton container by following the instructions in the Triton Python backend repository. For the MONAI installation steps, you can refer to the MONAI installation guide. Below are the steps used to set up the proper environment for this experiment:
Install the software packages below:
- conda
- cmake
- rapidjson and libarchive (instructions for installing these packages in Ubuntu or Debian are included in Building from Source Section)
- conda-pack
Create and activate a conda environment.
conda create -n monai python=3.8
conda activate monai
Since the Triton 21.12 NGC docker image is used, in which the Python version is 3.8, we create a conda env with Python 3.8 for convenience. You can also specify other Python versions; if the Python version you use does not match the Triton container's, please make sure you go through these extra steps. Before installing the packages in your conda environment, make sure that you have exported the PYTHONNOUSERSITE environment variable:
export PYTHONNOUSERSITE=True
If this variable is not exported and similar packages are installed outside your conda environment, your tar file may not contain all the dependencies required for an isolated Python environment. Install MONAI and the recommended dependencies (you can also refer to the installation guide of MONAI):
pip install 'monai[all]'
pip install cupy
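Optionally, you can run a quick sanity check from the activated conda environment to confirm that MONAI, CuPy and the GPU are visible before packaging it:

```python
# Optional sanity check inside the "monai" conda environment.
import torch
import cupy
import monai

monai.config.print_config()                    # MONAI and dependency versions
print("torch sees CUDA:", torch.cuda.is_available())
print("cupy sees", cupy.cuda.runtime.getDeviceCount(), "GPU(s)")
```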
Next, package the conda environment using the conda-pack command, which will produce a monai.tar.gz archive. This file contains everything the Python backend model needs to run and is portable. Then put the created monai.tar.gz under the spleen_seg folder, and set the following in config.pbtxt:
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "$$TRITON_MODEL_DIRECTORY/monai.tar.gz"}
}
Also, please note that in the config.pbtxt the parameter FORCE_CPU_ONLY_INPUT_TENSORS is set to "no", so that Triton will not move input tensors to CPU for the Python model. Instead, Triton will provide the input tensors to the Python model in either CPU or GPU memory, depending on how those tensors were last used.
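This means the Python model should be prepared to receive either a CPU or a GPU tensor. A small sketch of how this can be handled (the tensor name INPUT0 and the helper name are assumptions for illustration):

```python
# Sketch: consume an input tensor that may already live in GPU memory,
# which is possible because FORCE_CPU_ONLY_INPUT_TENSORS is "no".
import torch
from torch.utils.dlpack import from_dlpack
import triton_python_backend_utils as pb_utils


def input_as_torch(request, name="INPUT0"):        # "INPUT0" is an assumed tensor name
    pb_tensor = pb_utils.get_input_tensor_by_name(request, name)
    if pb_tensor.is_cpu():
        # CPU path: one host-to-device copy
        return torch.from_numpy(pb_tensor.as_numpy()).cuda()
    # GPU path: zero-copy handoff via DLPack
    return from_dlpack(pb_tensor.to_dlpack())
```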
And now the file structure of the model repository should be:
triton_models/
├── segmentation_3d
│ ├── 1
│ │ └── model.pt
│ └── config.pbtxt
└── spleen_seg
├── 1
│ └── model.py
├── config.pbtxt
└── monai.tar.gz
Then you can start the Triton server with the command:
tritonserver --model-repository=/ROOT_PATH_OF_YOUR_MODEL_REPOSITORY
We assume that the server and the client are on the same machine. Open a new bash terminal and run the commands below to set up the client environment.
nvidia-docker run -it --ipc=host --shm-size=1g --name=triton_client --net=host nvcr.io/nvidia/tritonserver:21.12-py3-sdk
pip install monai
pip install nibabel
pip install jupyter
Then you can run the Jupyter notebook in the client folder of this example. Please note that --ipc=host should be set when starting the docker container, so that shared memory can be used for data transmission between server and client.
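As a rough illustration of what the client does, the sketch below sends an input volume to the spleen_seg model through system shared memory. The input/output names (INPUT0/OUTPUT0), the FP32 datatype, and the dummy data are assumptions made for illustration; check the notebook and config.pbtxt for the actual ones.

```python
# Hypothetical client sketch: send the input volume via system shared memory.
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:18100")
client.unregister_system_shared_memory()   # clean up any leftover regions

# Dummy volume standing in for a loaded NIfTI image; dtype/shape are assumptions.
volume = np.zeros((512, 512, 114), dtype=np.float32)
byte_size = volume.nbytes

# Create a system shared-memory region, copy the data in, and register it with Triton.
shm_handle = shm.create_shared_memory_region("input_data", "/input_data", byte_size)
shm.set_shared_memory_region(shm_handle, [volume])
client.register_system_shared_memory("input_data", "/input_data", byte_size)

# Point the inference input at the shared-memory region instead of sending bytes over HTTP.
infer_input = httpclient.InferInput("INPUT0", list(volume.shape), "FP32")
infer_input.set_shared_memory("input_data", byte_size)

result = client.infer("spleen_seg", inputs=[infer_input])
mask = result.as_numpy("OUTPUT0")           # "OUTPUT0" is an assumed output name
print(mask.shape)

# Clean up the shared-memory region.
client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)
```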
The benchmark was run on an RTX 8000 and measured with perf_analyzer.
perf_analyzer -m spleen_seg -u localhost:18100 --input-data zero --shape "INPUT0":512,512,114 --shared-memory system
The relevant fields in the perf_analyzer latency report are:
- HTTP: send/recv indicates the time the client spent sending the request and receiving the response; response wait indicates the time spent waiting for the response from the server.
- GRPC: (un)marshal request/response indicates the time spent marshalling the request data into the GRPC protobuf and unmarshalling the response data from the GRPC protobuf; response wait indicates the time spent writing the GRPC request to the network, waiting for the response, and reading the GRPC response from the network.
- compute_input : The count and cumulative duration to prepare input tensor data as required by the model framework / backend. For example, this duration should include the time to copy input tensor data to the GPU.
- compute_infer : The count and cumulative duration to execute the model.
- compute_output : The count and cumulative duration to extract output tensor data produced by the model framework / backend. For example, this duration should include the time to copy output tensor data from the GPU.
Since 3D medical images are generally large, the overhead introduced by the communication protocol cannot be ignored. In the most common medical imaging AI deployments, the client runs on the same machine as the server, so shared memory is a practical way to reduce the send/receive overhead. In this experiment, perf_analyzer is used to compare the different ways of communicating between client and server. Note that all the processing (pre/post and AI inference) is on GPU. From the results, we can conclude that using shared memory greatly reduces latency when the amount of data transferred is large.
After moving pre- and post-processing to the GPU, we get a 12x speedup for the full pipeline.