# Triton Inference Server OpenAI compatible API proxy
[This project](https://github.com/visitsb/triton-inference-server-openai-api) provides an OpenAI API compatible proxy for NVIDIA [Triton Inference Server](https://www.nvidia.com/en-us/ai-data-science/products/triton-inference-server/). More specifically, LLMs on NVIDIA GPUs can benefit from high performance inference with the [TensorRT-LLM](https://developer.nvidia.com/tensorrt#inference) backend running on [Triton Inference Server compared to using llama.cpp](https://jan.ai/post/benchmarking-nvidia-tensorrt-llm#key-findings).

Triton Inference Server supports [HTTP/REST and GRPC inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md) based on the community-developed [KServe protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2), but that is not usable with existing OpenAI API clients.

This proxy bridges that gap. It currently supports only the **text** generation [OpenAI API](https://platform.openai.com/docs/api-reference/introduction) endpoints, which are suitable for use with [Open WebUI](https://docs.openwebui.com/) or similar OpenAI clients-
```text
GET|POST /v1/models (or /models)
GET /v1/models/{model} (or /models/{model})
POST /v1/chat/completions (or /v1/completions) streaming supported
```
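
For example, once the proxy is running (see **Usage** below), any OpenAI-style HTTP client can call these endpoints. The snippet below is a minimal sketch, assuming the proxy listens on port `11434` (as published in the Docker examples that follow); the model name `mymodel` is a placeholder-
```bash
# Minimal sketch: "mymodel" is a placeholder - list the deployed models via
# GET /v1/models and substitute the name reported there.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mymodel",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
        "stream": false
      }'
# Set "stream": true to receive a streamed response instead of a single JSON body.
```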

## Usage
**Recommended** Use a pre-published [Docker image](https://hub.docker.com/repository/docker/visitsb/tritonserver)
```bash
docker image pull visitsb/tritonserver:24.05-trtllm-python-py3
```
Alternatively, use the `Dockerfile` to build a local image. The proxy is built on top of the existing [Triton Inference Server](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) Docker image, which already includes the TensorRT-LLM backend.

```bash
# Pull upstream NVIDIA docker image
docker image pull nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
# Clone this repository
git clone <this repository>
cd triton-inference-server-openai-api
# Build your custom docker image with proxy bundled
docker buildx build --no-cache --tag myimages/tritonserver:24.05-trtllm-python-py3 .
```

Once your image is pulled (or built locally), you can run it directly using Docker-
```bash
# Run Triton Inference Server along with the proxy, as shown in the `sh -c` command
docker run --rm --tty --interactive \
  --gpus all --shm-size 4g --memory 32g \
  --cpuset-cpus 0-3 --publish 11434:11434/tcp \
  --volume <your Triton models folder>:/models:rw \
  --name triton \
  visitsb/tritonserver:24.05-trtllm-python-py3 \
  sh -c '/opt/tritonserver/bin/tritonserver \
    --model-store /models/mymodel/model \
    & /opt/tritonserver/bin/tritonopenaiserver \
    --tokenizer_dir /models/mymodel/tokenizer'
```

Alternatively, using `docker-compose.yml`-
```yaml
triton:
  image: visitsb/tritonserver:24.05-trtllm-python-py3
  command: >
    sh -c '/opt/tritonserver/bin/tritonserver --model-store /models/mymodel/model & /opt/tritonserver/bin/tritonopenaiserver --tokenizer_dir /models/mymodel/tokenizer'
  ports:
    - "11434:11434/tcp" # OpenAI API Proxy
    - "8000:8000/tcp"   # HTTP
    - "8001:8001/tcp"   # GRPC
    - "8080:8080/tcp"   # Sagemaker, Vertex
    - "8002:8002/tcp"   # Prometheus metrics
  volumes:
    - <your Triton models folder>:/models:rw
  shm_size: "4G"
  deploy:
    resources:
      limits:
        memory: 32G
      reservations:
        memory: 8G
        devices:
          - driver: nvidia
            count: all
            capabilities: [compute,video,utility]
  ulimits:
    stack: 67108864
    memlock:
      soft: -1
      hard: -1
```
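
Once the container is up, a quick sanity check is to query the proxy and Triton health endpoints; this is a minimal sketch assuming the default port mappings above-
```bash
# The proxy should list the models served by Triton Inference Server.
curl -s http://localhost:11434/v1/models
# Triton's own KServe readiness endpoint on the HTTP port (returns 200 when ready).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```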

## Performance
Using [GenAI-Perf](https://github.com/triton-inference-server/client/tree/main/src/c%2B%2B/perf_analyzer/genai-perf) to measure performance for [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on an [NVIDIA RTX 4090 GPU](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/), the following was observed-

Test: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) evaluated using NVIDIA [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/docs/tutorial.html#openai-chat-completions-api). For the llama.cpp evaluation, [QuantFactory/Meta-Llama-3-8B-GGUF](https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF) (`Meta-Llama-3-8B.Q8_0.gguf`) was used.

```text
Backend            Loaded model size      GPU Util   Tokens/sec
-------            -----------------      --------   ----------
TensorRT (gRPC)    15879MiB / 24564MiB    91%        97.04
TensorRT (HTTP)    15879MiB / 24564MiB    91%        56.73
llama.cpp          9491MiB / 24564MiB     74%        70.23
```

In summary, TensorRT (gRPC) inference clearly outperforms llama.cpp, while TensorRT (HTTP) gives performance comparable to llama.cpp.

The raw performance numbers are below-
#### TensorRT (gRPC)
```text
[INFO] genai_perf.wrapper:135 - Running Perf Analyzer : 'perf_analyzer -m llama3 --async --service-kind triton -u triton:8001 --measurement-interval 4000 --stability-percentage 999 -i grpc --streaming --shape max_tokens:1 --shape text_input:1 --concurrency-range 1'
                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Statistic            ┃    avg ┃    min ┃     max ┃    p99 ┃     p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ Request latency (ns) │ 1,081… │ 1,048… │ 1,311,… │ 1,284… │ 1,083,… │ 1,064… │
│ Num output token     │    105 │    100 │     110 │    110 │     109 │    107 │
│ Num input token      │    200 │    200 │     200 │    200 │     200 │    200 │
└──────────────────────┴────────┴────────┴─────────┴────────┴─────────┴────────┘
Output token throughput (per sec): 97.04
Request throughput (per sec): 0.92
```

#### TensorRT (HTTP) via this OpenAI API Proxy
```text
[INFO] genai_perf.wrapper:135 - Running Perf Analyzer : 'perf_analyzer -m llama3 --async --endpoint v1/chat/completions --service-kind openai -u triton:11434 --measurement-interval 4000 --stability-percentage 999 -i http --concurrency-range 1'
                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Statistic            ┃    avg ┃    min ┃     max ┃    p99 ┃     p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ Request latency (ns) │ 2,033… │ 1,732… │ 3,856,… │ 3,723… │ 2,525,… │ 1,802… │
│ Num output token     │    115 │    110 │     121 │    121 │     120 │    119 │
│ Num input token      │    200 │    200 │     200 │    200 │     200 │    200 │
└──────────────────────┴────────┴────────┴─────────┴────────┴─────────┴────────┘
Output token throughput (per sec): 56.73
Request throughput (per sec): 0.49
```

#### llama.cpp
```text
[INFO] genai_perf.wrapper:135 - Running Perf Analyzer : 'perf_analyzer -m llama3 --async --endpoint v1/chat/completions --service-kind openai -u llama:11434 --measurement-interval 4000 --stability-percentage 999 -i http --concurrency-range 1'
                                  LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Statistic            ┃    avg ┃    min ┃     max ┃    p99 ┃     p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ Request latency (ns) │ 1,656… │ 1,596… │ 1,822,… │ 1,810… │ 1,701,… │ 1,649… │
│ Num output token     │    116 │    104 │     149 │    147 │     132 │    118 │
│ Num input token      │    200 │    200 │     200 │    200 │     200 │    200 │
└──────────────────────┴────────┴────────┴─────────┴────────┴─────────┴────────┘
Output token throughput (per sec): 70.23
Request throughput (per sec): 0.60
```

**Note** This proxy currently talks to Triton Inference Server over HTTP, so the numbers above should be read as relative rather than absolute. Performance will also vary for TensorRT-LLM models based on the [build and deployment options](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#using-the-tensorrt-llm-backend) used.

Additional optimizations like speculative sampling and FP8 quantization can further improve throughput. For more on the throughput levels that are possible with TensorRT-LLM for different combinations of model, hardware, and workload, see the [official benchmarks](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md).

## Build and deploy your own models
The image includes the [TensorRT-LLM toolbox](https://github.com/NVIDIA/TensorRT-LLM.git) and [backend](https://github.com/triton-inference-server/tensorrtllm_backend.git) for building your own TensorRT-LLM models. Both can be found under `/opt/tritonserver/third-party-src/` inside your Docker image.

The basic steps to build a TensorRT-LLM model are outlined [here](https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#using-the-tensorrt-llm-backend) and essentially involve (a rough command sketch follows the list)-
1. Downloading a [Hugging Face model](https://huggingface.co/models) of your choice,
2. Converting it to the TensorRT-LLM checkpoint format, and
3. Building a compiled engine that can be deployed on Triton Inference Server.
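
The following is a rough sketch of those steps for a LLaMA-style model. The sub-paths under `/opt/tritonserver/third-party-src/`, the script flags, and the Triton model-repository layout shown here are illustrative assumptions; they vary by model and TensorRT-LLM version, so consult the linked guides for the exact commands-
```bash
# 1. Download a Hugging Face model (assumes the huggingface_hub CLI is available;
#    gated models additionally require a Hugging Face access token).
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /models/hf/llama3

# 2. Convert the weights to a TensorRT-LLM checkpoint, then compile an engine.
#    Assumed example layout; adjust the path to wherever the TensorRT-LLM
#    examples live in your image.
cd /opt/tritonserver/third-party-src/TensorRT-LLM/examples/llama
python3 convert_checkpoint.py --model_dir /models/hf/llama3 \
    --output_dir /models/trt/llama3-ckpt --dtype float16
trtllm-build --checkpoint_dir /models/trt/llama3-ckpt \
    --output_dir /models/trt/llama3-engine --gemm_plugin float16

# 3. Place the engine in a Triton model repository prepared per the
#    tensorrtllm_backend docs, then launch Triton and the proxy as in Usage:
/opt/tritonserver/bin/tritonserver --model-store /models/mymodel/model &
/opt/tritonserver/bin/tritonopenaiserver --tokenizer_dir /models/hf/llama3
```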

Alternatively, you can follow the steps in the TensorRT-LLM [quick start guide](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#retrieve-the-model-weights) to build your TensorRT model. Once your model is built, you can [deploy](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#deploy-with-triton-inference-server) it and use it through the OpenAI API proxy.

## Further references
 - [Benchmarking NVIDIA TensorRT-LLM](https://jan.ai/post/benchmarking-nvidia-tensorrt-llm) - TensorRT-LLM was 30-70% faster than [llama.cpp](https://github.com/ggerganov/llama.cpp) on the same hardware, consumed less memory on consecutive runs (with marginally higher GPU VRAM utilization), and produced compiled models that are 20%+ smaller than llama.cpp's.
 - [Use Llama 3 with NVIDIA TensorRT-LLM and Triton Inference Server](https://docs.lxp.lu/howto/llama3-triton/) - a 30-minute tutorial showing how to use TensorRT-LLM to build TensorRT engines, which contain state-of-the-art optimizations, to perform inference efficiently on NVIDIA GPUs, using the Llama 3 model as an example.
 - [Serverless TensorRT-LLM (LLaMA 3 8B)](https://modal.com/docs/examples/trtllm_llama) - a similar guide on how to use the TensorRT-LLM framework to serve Meta’s LLaMA 3 8B model at a total throughput of roughly 4,500 output tokens per second on a single NVIDIA A100 40GB GPU.