
Commit 728cc00

Authored by kaiyux, mfuntowicz, and Shixiaowei02
Update TensorRT-LLM (NVIDIA#1233)
* Update TensorRT-LLM

Co-authored-by: Morgan Funtowicz <[email protected]>
Co-authored-by: Shixiaowei02 <[email protected]>
1 parent b7c309d commit 728cc00

File tree: 163 files changed, +4088 -3915 lines


3rdparty/cutlass

Submodule cutlass updated 1826 files

CHANGELOG.md

+36 lines

@@ -1,5 +1,41 @@
 # Change Log

+## Versions 0.7.0 / 0.7.1
+
+* Models
+  - BART and mBART support in encoder-decoder models
+  - FairSeq Neural Machine Translation (NMT) family
+  - Mixtral-8x7B model
+  - Support weight loading for HuggingFace Mixtral model
+  - OpenAI Whisper
+  - Mixture of Experts support
+  - MPT - Int4 AWQ / SmoothQuant support
+  - Baichuan FP8 quantization support
+* Features
+  - [Preview] Speculative decoding
+  - Add Python binding for `GptManager`
+  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
+  - System prompt caching
+  - Enable split-k for weight-only cutlass kernels
+  - FP8 KV cache support for XQA kernel
+  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
+  - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
+  - fMHA support for chunked attention and paged kv cache
+* Bug fixes
+  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
+  - Fix LLaMa with LoRA error #637
+  - Fix LLaMA GPTQ failure #580
+  - Fix Python binding for InferenceRequest issue #528
+  - Fix CodeLlama SQ accuracy issue #453
+* Performance
+  - MMHA optimization for MQA and GQA
+  - LoRA optimization: cutlass grouped gemm
+  - Optimize Hopper warp specialized kernels
+  - Optimize AllReduce for parallel attention on Falcon and GPT-J
+  - Enable split-k for weight-only cutlass kernel when SM>=75
+* Documentation
+  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
+
 ## Versions 0.6.0 / 0.6.1

 * Models

README.md

+100 -54 lines

@@ -8,7 +8,7 @@ TensorRT-LLM
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-12.2-green)](https://developer.nvidia.com/cuda-downloads)
 [![trt](https://img.shields.io/badge/TRT-9.2-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.7.1-green)](./setup.py)
+[![version](https://img.shields.io/badge/release-0.9.0.dev-green)](./setup.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

 [Architecture](./docs/source/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/performance.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](./docs/source/)
@@ -38,26 +38,31 @@ TensorRT-LLM

 ## Table of Contents

-- [TensorRT-LLM Overview](#tensorrt-llm-overview)
-- [Installation](#installation)
-- [Quick Start](#quick-start)
-- [Support Matrix](#support-matrix)
-- [Devices](#devices)
-- [Precision](#precision)
-- [Key Features](#key-features)
-- [Models](#models)
-- [Performance](#performance)
-- [Advanced Topics](#advanced-topics)
-- [Quantization](#quantization)
-- [In-flight Batching](#in-flight-batching)
-- [Attention](#attention)
-- [Graph Rewriting](#graph-rewriting)
-- [Benchmark](#benchmark)
-- [Troubleshooting](#troubleshooting)
-- [Release notes](#release-notes)
-- [Change Log](#change-log)
-- [Known Issues](#known-issues)
-- [Report Issues](#report-issues)
+- [TensorRT-LLM](#tensorrt-llm)
+- [Latest News](#latest-news)
+- [Table of Contents](#table-of-contents)
+- [TensorRT-LLM Overview](#tensorrt-llm-overview)
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Support Matrix](#support-matrix)
+- [Devices](#devices)
+- [Precision](#precision)
+- [Key Features](#key-features)
+- [Models](#models)
+- [Performance](#performance)
+- [Advanced Topics](#advanced-topics)
+- [Quantization](#quantization)
+- [In-flight Batching](#in-flight-batching)
+- [Attention](#attention)
+- [Graph Rewriting](#graph-rewriting)
+- [Benchmark](#benchmark)
+- [Troubleshooting](#troubleshooting)
+- [Release notes](#release-notes)
+- [Change Log](#change-log)
+- [Versions 0.8.0](#versions-080)
+- [For history change log, please see CHANGELOG.md.](#for-history-change-log-please-see-changelogmd)
+- [Known Issues](#known-issues)
+- [Report Issues](#report-issues)

 ## TensorRT-LLM Overview
@@ -288,7 +293,7 @@ The list of supported models is:
 * [Replit Code](examples/mpt)
 * [RoBERTa](examples/bert)
 * [SantaCoder](examples/gpt)
-* [StarCoder](examples/gpt)
+* [StarCoder1/StarCoder2](examples/gpt)
 * [T5](examples/enc_dec)
 * [Whisper](examples/whisper)

@@ -402,50 +407,91 @@ For example: `mpirun -n 1 python3 examples/gpt/build.py ...`

 ## Release notes

-* TensorRT-LLM requires TensorRT 9.2 and 23.10 containers.
+* TensorRT-LLM requires TensorRT 9.2 and 23.12 containers.

 ### Change Log

-#### Versions 0.7.0 / 0.7.1
-
-* Models
-  - BART and mBART support in encoder-decoder models
-  - FairSeq Neural Machine Translation (NMT) family
-  - Mixtral-8x7B model
-  - Support weight loading for HuggingFace Mixtral model
-  - OpenAI Whisper
-  - Mixture of Experts support
-  - MPT - Int4 AWQ / SmoothQuant support
-  - Baichuan FP8 quantization support
+#### Versions 0.8.0
+
+* Model Support
+  - Phi-1.5/2.0
+  - Mamba support (see examples/mamba/README.md)
+    - The support is limited to beam width = 1 and single-node single-GPU
+  - Nougat support (see examples/multimodal/README.md#nougat)
+  - Qwen-VL support (see examples/qwenvl/README.md)
+  - RoBERTa support, thanks to the contribution from @erenup
+  - Skywork model support
+  - Add example for multimodal models (BLIP with OPT or T5, LlaVA)
 * Features
-  - [Preview] Speculative decoding
-  - Add Python binding for `GptManager`
-  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
-  - System prompt caching
-  - Enable split-k for weight-only cutlass kernels
-  - FP8 KV cache support for XQA kernel
-  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
-  - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
-  - fMHA support for chunked attention and paged kv cache
+  - Chunked context support (see docs/source/gpt_attention.md#chunked-context)
+  - LoRA support for C++ runtime (see docs/source/lora.md)
+  - Medusa decoding support (see examples/medusa/README.md)
+    - The support is limited to Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of sampling configuration should be 0
+  - StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
+  - Support for batch manager to return logits from context and/or generation phases
+    - Include support in the Triton backend
+  - Support AWQ and GPTQ for QWEN
+  - Support ReduceScatter plugin
+  - Support for combining `repetition_penalty` and `presence_penalty` #274
+  - Support for `frequency_penalty` #275
+  - OOTB functionality support:
+    - Baichuan
+    - InternLM
+    - Qwen
+    - BART
+  - LLaMA
+    - Support enabling INT4-AWQ along with FP8 KV Cache
+    - Support BF16 for weight-only plugin
+  - Baichuan
+    - P-tuning support
+    - INT4-AWQ and INT4-GPTQ support
+  - Decoder iteration-level profiling improvements
+  - Add `masked_select` and `cumsum` function for modeling
+  - Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
+  - Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
+  - Support FP16 fMHA on NVIDIA V100 GPU
+* API
+  - Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
+  - **[BREAKING CHANGES]** Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
+  - **[BREAKING CHANGES]** Deprecate `LayerNorm` and `RMSNorm` plugins and remove corresponding build parameters
+  - **[BREAKING CHANGES]** Remove optional parameter `maxNumSequences` for GPT manager
 * Bug fixes
-  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
-  - Fix LLaMa with LoRA error #637
-  - Fix LLaMA GPTQ failure #580
-  - Fix Python binding for InferenceRequest issue #528
-  - Fix CodeLlama SQ accuracy issue #453
+  - Fix the first token being abnormal issue when `--gather_all_token_logits` is enabled #639
+  - Fix LLaMA with LoRA enabled build failure #673
+  - Fix InternLM SmoothQuant build failure #705
+  - Fix Bloom int8_kv_cache functionality #741
+  - Fix crash in `gptManagerBenchmark` #649
+  - Fix Blip2 build error #695
+  - Add pickle support for `InferenceRequest` #701
+  - Fix Mixtral-8x7b build failure with custom_all_reduce #825
+  - Fix INT8 GEMM shape #935
+  - Minor bug fixes
 * Performance
-  - MMHA optimization for MQA and GQA
-  - LoRA optimization: cutlass grouped gemm
-  - Optimize Hopper warp specialized kernels
-  - Optimize AllReduce for parallel attention on Falcon and GPT-J
-  - Enable split-k for weight-only cutlass kernel when SM>=75
+  - **[BREAKING CHANGES]** Increase default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
+  - **[BREAKING CHANGES]** Disable `enable_trt_overlap` argument for GPT manager by default
+  - Performance optimization of beam search kernel
+  - Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels
+  - Custom AllReduce plugins performance optimization
+  - Top-P sampling performance optimization
+  - LoRA performance optimization
+  - Custom allreduce performance optimization by introducing a ping-pong buffer to avoid an extra synchronization cost
+  - Integrate XQA kernels for GPT-J (beamWidth=4)
 * Documentation
-  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
+  - Batch manager arguments documentation updates
+  - Add documentation for best practices for tuning the performance of TensorRT-LLM (see docs/source/perf_best_practices.md)
+  - Add documentation for Falcon AWQ support (see examples/falcon/README.md)
+  - Update to the `docs/source/new_workflow.md` documentation
+  - Update AWQ INT4 weight only quantization documentation for GPT-J
+  - Add blog: Speed up inference with SOTA quantization techniques in TRT-LLM
+  - Refine TensorRT-LLM backend README structure #133
+  - Typo fix #739

 #### For history change log, please see [CHANGELOG.md](./CHANGELOG.md).

 ### Known Issues

+* On Windows, running the context FMHA plugin with FP16 accumulation on LLaMA, Mistral and Phi models suffers from poor accuracy, and the resulting inference output may be garbled. The suggested workaround is to enable FP32 accumulation when building the models, i.e. pass `--context_fmha disable --context_fmha_fp32_acc enable` to the `trtllm-build` command. This should be fixed in the next version.
+
 * The hang reported in issue
   [#149](https://github.com/triton-inference-server/tensorrtllm_backend/issues/149)
   has not been reproduced by the TensorRT-LLM team. If it is caused by a bug

benchmarks/cpp/README.md

+22 -6 lines

@@ -103,7 +103,8 @@ For example, setting mean=100 and std dev=10 would generate requests where 95.4%
 --tokenizer <path/to/tokenizer> \
 token-norm-dist \
 --num-requests 100 \
---input-mean 100 --input-stdev 10 --output-mean 15 --output-stdev 0 --num-requests 100
+--input-mean 100 --input-stdev 10 \
+--output-mean 15 --output-stdev 0
 ```

 For `tokenizer`, specifying the path to a local tokenizer that has already been downloaded, or simply the name of the tokenizer from HuggingFace like `meta-llama/Llama-2-7b`, will both work. In the latter case, the tokenizer is downloaded automatically.
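A note on the distribution above: `token-norm-dist` draws per-request token counts from a normal distribution with the given mean and standard deviation, which is why roughly 95% of input lengths land within two standard deviations of the mean ([80, 120] for mean 100, stdev 10). The following is a minimal C++ sketch of that sampling idea only, not the actual Python implementation in `prepare_dataset.py`; the rounding and clamp-to-1 details are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

// Illustrative only: sample a token count from N(mean, stdev), round it,
// and clamp it to at least 1 token.
int sampleTokenLength(std::mt19937& rng, double mean, double stdev)
{
    if (stdev <= 0.0)
    {
        return std::max(static_cast<int>(std::lround(mean)), 1);
    }
    std::normal_distribution<double> dist(mean, stdev);
    return std::max(static_cast<int>(std::lround(dist(rng))), 1);
}

int main()
{
    std::mt19937 rng(42);
    // Mirrors the example above: input lengths ~ N(100, 10), output lengths fixed at 15.
    for (int i = 0; i < 5; ++i)
    {
        std::printf("request %d: input=%d output=%d\n", i, sampleTokenLength(rng, 100.0, 10.0),
            sampleTokenLength(rng, 15.0, 0.0));
    }
    return 0;
}
```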
@@ -141,8 +142,25 @@ mpirun -n 2 ./benchmarks/gptManagerBenchmark \
 --max_num_samples 500
 ```

-To emulate `gptSessionBenchmark` static batching, you can use the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
-Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated-timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count.
+`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
+
+#### Emulated static batching
+
+To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
+Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.
+
+`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
+```
+python prepare_dataset.py \
+ --output tokens-fixed-lengths.json \
+ --request-rate -1 \
+ --time-delay-dist constant \
+ --tokenizer <path/to/tokenizer> \
+ token-norm-dist \
+ --num-requests 128 \
+ --input-mean 60 --input-stdev 0 \
+ --output-mean 20 --output-stdev 0
+```

 Take GPT-350M as an example for single GPU with static batching
 ```
@@ -152,7 +170,5 @@ Take GPT-350M as an example for single GPU with static batching
 --type IFB \
 --static_emulated_batch_size 32 \
 --static_emulated_timeout 100 \
---dataset ../../benchmarks/cpp/preprocessed_dataset.json
+--dataset ../../benchmarks/cpp/tokens-fixed-lengths.json
 ```
-
-`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
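The emulated static batching described in this README change boils down to a simple gate: collect requests until either `static_emulated_batch_size` of them are queued or `static_emulated_timeout` milliseconds have elapsed with at least one request pending, submit the group to the batch manager at once, and only start collecting the next batch after the previous one has been processed. Below is a minimal C++ sketch of that gating logic under those assumptions; the `Request` type and driver are hypothetical, it busy-waits for simplicity, and it is not the benchmark's actual code.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a queued benchmark request.
struct Request
{
    std::string prompt;
};

// Wait until `batchSize` requests are queued, or until `timeout` elapses with at
// least one request pending, then return the whole group for a single submission.
// The caller only asks for the next batch once the previous one has been fully
// processed, matching the emulated static batching behavior described above.
// Busy-waits for simplicity; a real implementation would block on a condition variable.
std::vector<Request> gatherEmulatedStaticBatch(
    std::deque<Request>& queue, std::size_t batchSize, std::chrono::milliseconds timeout)
{
    auto const windowStart = std::chrono::steady_clock::now();
    std::vector<Request> batch;
    while (batch.size() < batchSize)
    {
        if (!queue.empty())
        {
            batch.push_back(std::move(queue.front()));
            queue.pop_front();
        }
        else if (!batch.empty() && std::chrono::steady_clock::now() - windowStart >= timeout)
        {
            break; // timeout reached: submit a partially filled batch
        }
    }
    return batch;
}

int main()
{
    std::deque<Request> queue{{"a"}, {"b"}, {"c"}};
    // Mirrors --static_emulated_batch_size 32 and --static_emulated_timeout 100
    // from the README example: only 3 requests arrive, so the timeout triggers.
    auto const batch = gatherEmulatedStaticBatch(queue, 32, std::chrono::milliseconds(100));
    std::printf("submitting %zu request(s) to the batch manager\n", batch.size());
    return 0;
}
```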

benchmarks/cpp/bertBenchmark.cpp

+4 -4 lines

@@ -57,12 +57,12 @@ std::string engineFilename(
     std::filesystem::path const& dataPath, WorldConfig const& worldConfig, std::string const& model)
 {
     auto constexpr allowExceptions = true;
-    auto constexpr ingoreComments = true;
+    auto constexpr ignoreComments = true;
     auto const jsonFilePath = dataPath / "config.json";
     TLLM_CHECK_WITH_INFO(
         std::filesystem::exists(jsonFilePath), std::string("File does not exist: ") + jsonFilePath.string());
     std::ifstream jsonStream(jsonFilePath);
-    auto const json = nlohmann::json::parse(jsonStream, nullptr, allowExceptions, ingoreComments);
+    auto const json = nlohmann::json::parse(jsonStream, nullptr, allowExceptions, ignoreComments);
     auto const& builderConfig = json.at("builder_config");
     auto const precision = builderConfig.at("precision").template get<std::string>();
     auto const worldSize = builderConfig.at("tensor_parallel").template get<SizeType>();

@@ -97,9 +97,9 @@ void benchmarkBert(std::string const& modelName, std::filesystem::path const& da
     allocator.setZero(*inputIdsBuffer);
     tensorMap.insert(std::make_pair("input_ids", inputIdsBuffer));
     // input_lengths
-    std::vector<SizeType> inputLenghtsHost(batchSize);
+    std::vector<SizeType> inputLengthsHost(batchSize);
    auto inLensBuffer = std::shared_ptr<ITensor>{
-        allocator.copyFrom(inputLenghtsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU)};
+        allocator.copyFrom(inputLengthsHost, ITensor::makeShape({batchSize}), MemoryType::kGPU)};
     allocator.setZero(*inLensBuffer);
     tensorMap.insert(std::make_pair("input_lengths", inLensBuffer));
benchmarks/cpp/gptManagerBenchmark.cpp

+1 -5 lines

@@ -1049,12 +1049,8 @@ int main(int argc, char* argv[])
         padId = result["pad_id"].as<int>();
     }

-    std::optional<int32_t> eosId;
     // Argument: End-of-sentence token id
-    if (result.count("eos_id"))
-    {
-        eosId = result["eos_id"].as<int>();
-    }
+    std::optional<int32_t> eosId = result["eos_id"].as<int>();

    std::optional<int> staticEmulatedBatchSize;
    // Argument: Static emulated batch size
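The simplification in this hunk reads `eos_id` unconditionally, which presumably relies on the option having a default value registered with the cxxopts parser; with a default, the `result.count()` guard is unnecessary, though the optional is then always populated rather than left empty when the flag is omitted. Below is a small stand-alone sketch of the pattern; the option description and the -1 default are illustrative, not the benchmark's actual definitions.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

#include <cxxopts.hpp>

int main(int argc, char* argv[])
{
    cxxopts::Options options("toy", "Illustrates reading an option into std::optional");
    // Illustrative definition: the registered default means as<int>() always succeeds.
    options.add_options()(
        "eos_id", "End-of-sentence token id", cxxopts::value<int>()->default_value("-1"));

    auto const result = options.parse(argc, argv);

    // Old pattern: only populate the optional when the flag appears on the command line.
    std::optional<int32_t> eosIdOld;
    if (result.count("eos_id"))
    {
        eosIdOld = result["eos_id"].as<int>();
    }

    // New pattern (as in the diff): read unconditionally and let the default apply.
    std::optional<int32_t> eosIdNew = result["eos_id"].as<int>();

    std::printf("old: %s, new: %d\n", eosIdOld ? "set" : "unset", *eosIdNew);
    return 0;
}
```

Note the behavioral difference when the flag is absent: the old pattern leaves the optional empty, while the new one stores the registered default, which appears to be the intent of the change.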
