<!--
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Speculative Decoding

- [About Speculative Decoding](#about-speculative-decoding)
- [Performance Improvements](#performance-improvements)
- [Speculative Decoding with Triton Inference Server](#speculative-decoding-with-triton-inference-server)

## About Speculative Decoding

Speculative Decoding (also referred to as Speculative Sampling) is a set of techniques designed to allow generation of more than one token per forward pass iteration. This can lead to a reduction in the average per-token latency **in situations where the GPU is underutilized due to small batch sizes.**

Speculative decoding involves predicting a sequence of future tokens, referred to as draft tokens, using a method that is substantially more efficient than repeatedly executing the target Large Language Model (LLM).
These draft tokens are then collectively validated by processing them through the target LLM in a single forward pass. The underlying assumptions are twofold:

1. processing multiple draft tokens concurrently will be as rapid as processing a single token
2. multiple draft tokens will be validated successfully over the course of the full generation

If the first assumption holds true, the latency of speculative decoding will be no worse than that of the standard approach. If the second holds, output token generation advances by statistically more than one token per forward pass.
The combination of these two effects allows speculative decoding to reduce latency.
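
To make the draft-and-verify loop concrete, below is a minimal, framework-agnostic sketch in Python. The `draft_model` and `target_model` callables are hypothetical placeholders for a small draft model and the target LLM (a real deployment uses the frameworks linked below), and the sketch uses simple greedy token matching rather than the rejection-sampling acceptance rule of probabilistic variants.

```python
def speculative_decode(prompt_ids, draft_model, target_model,
                       num_draft_tokens=4, max_new_tokens=64):
    """Greedy draft-and-verify loop (illustrative only).

    Assumed interfaces (hypothetical):
      - draft_model(seq) -> the draft model's next token for `seq`
      - target_model(seq, draft) -> the target model's next-token prediction at
        every position of seq + draft, i.e. len(draft) + 1 predictions from a
        single forward pass
    """
    tokens = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens:
        # 1. Draft: cheaply propose the next few tokens autoregressively.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_model(tokens + draft))

        # 2. Verify: one target forward pass scores all draft positions at once.
        target_preds = target_model(tokens, draft)

        # 3. Accept the longest draft prefix the target agrees with, then append
        #    the target's own prediction at the first mismatch, so every pass
        #    produces at least one token.
        accepted = 0
        while accepted < len(draft) and draft[accepted] == target_preds[accepted]:
            accepted += 1
        tokens.extend(draft[:accepted] + [target_preds[accepted]])
        generated += accepted + 1

    return tokens
```

If, for example, the target accepts on average 3 of 4 draft tokens per pass, each target forward pass yields roughly 4 tokens instead of 1, so per-token latency can drop substantially as long as drafting stays cheap relative to the target model.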

## Performance Improvements

It's important to note that the effectiveness of speculative decoding techniques is highly dependent
on the specific task at hand. For instance, forecasting subsequent tokens in a code-completion scenario
may prove simpler than generating a summary for an article. [Spec-Bench](https://sites.google.com/view/spec-bench)
shows the performance of different speculative decoding approaches on different tasks.
## Speculative Decoding with Triton Inference Server

Follow the [TensorRT-LLM guide](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).