# Change Log

## Versions 0.7.0 / 0.7.1

* Models
  - BART and mBART support in encoder-decoder models
  - FairSeq Neural Machine Translation (NMT) family
  - Mixtral-8x7B model
  - Support weight loading for HuggingFace Mixtral model
  - OpenAI Whisper
  - Mixture of Experts support
  - MPT - Int4 AWQ / SmoothQuant support
  - Baichuan FP8 quantization support
* Features
  - [Preview] Speculative decoding
  - Add Python binding for `GptManager`
  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
  - System prompt caching
  - Enable split-k for weight-only cutlass kernels
  - FP8 KV cache support for XQA kernel
  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
  - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
  - fMHA support for chunked attention and paged kv cache
* Bug fixes
  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
  - Fix LLaMa with LoRA error #637
  - Fix LLaMA GPTQ failure #580
  - Fix Python binding for InferenceRequest issue #528
  - Fix CodeLlama SQ accuracy issue #453
* Performance
  - MMHA optimization for MQA and GQA
  - LoRA optimization: cutlass grouped gemm
  - Optimize Hopper warp specialized kernels
  - Optimize AllReduce for parallel attention on Falcon and GPT-J
  - Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)
* TensorRT-LLM requires TensorRT 9.2 and 23.12 containers.

### Change Log

#### Versions 0.8.0

* Model Support
  - Phi-1.5/2.0
  - Mamba support (see examples/mamba/README.md)
    - The support is limited to beam width = 1 and single-node single-GPU
  - Nougat support (see examples/multimodal/README.md#nougat)
  - Qwen-VL support (see examples/qwenvl/README.md)
  - RoBERTa support, thanks to the contribution from @erenup
  - Skywork model support
  - Add example for multimodal models (BLIP with OPT or T5, LlaVA)
* Features
  - Chunked context support (see docs/source/gpt_attention.md#chunked-context)
  - LoRA support for C++ runtime (see docs/source/lora.md)
  - Medusa decoding support (see examples/medusa/README.md)
    - The support is limited to Python runtime for Ampere or newer GPUs with fp16 and bf16 accuracy, and the `temperature` parameter of sampling configuration should be 0
  - StreamingLLM support for LLaMA (see docs/source/gpt_attention.md#streamingllm)
  - Support for batch manager to return logits from context and/or generation phases
    - Include support in the Triton backend
  - Support AWQ and GPTQ for QWEN
  - Support ReduceScatter plugin
  - Support for combining `repetition_penalty` and `presence_penalty` #274
  - Support for `frequency_penalty` #275
  - OOTB functionality support:
    - Baichuan
    - InternLM
    - Qwen
    - BART
  - LLaMA
    - Support enabling INT4-AWQ along with FP8 KV Cache
    - Support BF16 for weight-only plugin
  - Baichuan
    - P-tuning support
    - INT4-AWQ and INT4-GPTQ support
  - Decoder iteration-level profiling improvements
  - Add `masked_select` and `cumsum` function for modeling
  - Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
  - Add Weight-Only Support To Whisper #794, thanks to the contribution from @Eddie-Wang1120
  - Support FP16 fMHA on NVIDIA V100 GPU
* API
  - Add a set of High-level APIs for end-to-end generation tasks (see examples/high-level-api/README.md)
  - **[BREAKING CHANGES]** Migrate models to the new build workflow, including LLaMA, Mistral, Mixtral, InternLM, ChatGLM, Falcon, GPT-J, GPT-NeoX, Medusa, MPT, Baichuan and Phi (see docs/source/new_workflow.md)
  - **[BREAKING CHANGES]** Deprecate `LayerNorm` and `RMSNorm` plugins and remove corresponding build parameters
  - **[BREAKING CHANGES]** Remove optional parameter `maxNumSequences` for GPT manager
* Bug fixes
  - Fix the first token being abnormal issue when `--gather_all_token_logits` is enabled #639
  - Fix LLaMA with LoRA enabled build failure #673
  - Fix InternLM SmoothQuant build failure #705
  - Fix Bloom int8_kv_cache functionality #741
  - Fix crash in `gptManagerBenchmark` #649
  - Fix Blip2 build error #695
  - Add pickle support for `InferenceRequest` #701
  - Fix Mixtral-8x7b build failure with custom_all_reduce #825
  - Fix INT8 GEMM shape #935
  - Minor bug fixes
* Performance
  - **[BREAKING CHANGES]** Increase default `freeGpuMemoryFraction` parameter from 0.85 to 0.9 for higher throughput
  - **[BREAKING CHANGES]** Disable `enable_trt_overlap` argument for GPT manager by default
  - Performance optimization of beam search kernel
  - Add bfloat16 and paged kv cache support for optimized generation MQA/GQA kernels

#### For history change log, please see [CHANGELOG.md](./CHANGELOG.md).
### Known Issues

* On Windows, running the context FMHA plugin with FP16 accumulation on LLaMA, Mistral, and Phi models suffers from poor accuracy, and the resulting inference output may be garbled. As a workaround, enable FP32 accumulation when building the models, i.e. pass the options `--context_fmha disable --context_fmha_fp32_acc enable` to the `trtllm-build` command. This should be fixed in the next version.
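For example, a minimal sketch of a build command applying this workaround, assuming a converted checkpoint in `./model_ckpt` (the directories are placeholders; only the two context-FMHA options come from the workaround above):

```
# Placeholder paths; the two context-FMHA options apply the FP32-accumulation workaround.
trtllm-build --checkpoint_dir ./model_ckpt \
             --output_dir ./model_engine \
             --context_fmha disable \
             --context_fmha_fp32_acc enable
```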
For `tokenizer`, specifying either the path to a local tokenizer that has already been downloaded, or simply the name of a tokenizer on HuggingFace, such as `meta-llama/Llama-2-7b`, will work. In the latter case, the tokenizer is downloaded automatically.

`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
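For illustration, a minimal sketch of such a run, assuming an engine directory and a dataset file as inputs (the binary location and all paths below are placeholders):

```
# Placeholder paths; --api executor selects the executor::Executor-based code path.
./gptManagerBenchmark \
 --engine_dir <path/to/engine_dir> \
 --dataset <path/to/dataset.json> \
 --api executor
```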
#### Emulated static batching

To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments. Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.

`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.

```
python prepare_dataset.py \
 --output tokens-fixed-lengths.json \
 --request-rate -1 \
 --time-delay-dist constant \
 --tokenizer <path/to/tokenizer> \
 token-norm-dist \
 --num-requests 128 \
 --input-mean 60 --input-stdev 0 \
 --output-mean 20 --output-stdev 0
```
Take GPT-350M as an example for single GPU with static batching
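A minimal sketch of such a run with emulated static batching, assuming the `tokens-fixed-lengths.json` dataset generated above and an engine directory for the GPT-350M model (the binary location, paths, and the batch size / timeout values are illustrative placeholders):

```
# Placeholder paths and values; batches of up to 32 requests are collected for at most 100 ms.
./gptManagerBenchmark \
 --engine_dir <path/to/gpt_350m_engine_dir> \
 --dataset tokens-fixed-lengths.json \
 --static_emulated_batch_size 32 \
 --static_emulated_timeout 100
```

With these settings, each batch is submitted as soon as 32 requests have arrived, or after 100 ms if fewer have been collected by then.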