<!--
# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Speculative Decoding

- [About Speculative Decoding](#about-speculative-decoding)
- [Performance Improvements](#performance-improvements)
- [Speculative Decoding with Triton Inference Server](#speculative-decoding-with-triton-inference-server)

## About Speculative Decoding

Speculative Decoding (also referred to as Speculative Sampling) is a set of techniques designed to allow generation of more than one token per forward pass iteration. This can lead to a reduction in the average per-token latency **in situations where the GPU is underutilized due to small batch sizes.**

Speculative decoding involves predicting a sequence of future tokens, referred to as draft tokens, using a method that is substantially more efficient than repeatedly executing the target Large Language Model (LLM).
These draft tokens are then collectively validated by processing them through the target LLM in a single forward pass. The underlying assumptions are twofold:

1. Processing multiple draft tokens concurrently will be as rapid as processing a single token.
2. Multiple draft tokens will be validated successfully over the course of the full generation.

If the first assumption holds true, the latency of speculative decoding will be no worse than that of the standard approach. If the second holds, output token generation advances by statistically more than one token per forward pass.
The combination of the two allows speculative decoding to reduce latency.
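
To make the draft-and-verify loop concrete, below is a minimal sketch of vanilla speculative decoding under greedy decoding. It is illustrative only: the model names are placeholders, `DRAFT_K` and `speculative_generate` are invented for this sketch, and it omits the KV-cache reuse, batching, and sampling-based acceptance that real implementations use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_K = 4  # draft tokens proposed per iteration (illustrative choice)

# Placeholder model names: any pair of compatible causal LMs that share a
# tokenizer, with the draft model much smaller than the target.
tokenizer = AutoTokenizer.from_pretrained("your-target-model")
target = AutoModelForCausalLM.from_pretrained("your-target-model").eval()
draft = AutoModelForCausalLM.from_pretrained("your-draft-model").eval()

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        prompt_len = ids.shape[1]

        # 1. Draft: the small model proposes DRAFT_K tokens autoregressively.
        draft_ids = ids
        for _ in range(DRAFT_K):
            next_tok = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        proposed = draft_ids[0, prompt_len:]                  # shape: [DRAFT_K]

        # 2. Verify: one target forward pass scores every proposal at once.
        logits = target(draft_ids).logits[0]
        # Target's greedy choice at each position from the prompt end onward:
        # DRAFT_K verification positions plus one trailing "bonus" position.
        verify = logits[prompt_len - 1 :].argmax(-1)          # [DRAFT_K + 1]

        # 3. Accept the longest draft prefix the target agrees with, then take
        #    the target's token at the first mismatch (or after the full draft)
        #    for free, so every iteration yields at least one new token.
        n_accept = int((verify[:-1] == proposed).long().cumprod(dim=0).sum())
        new = torch.cat([proposed[:n_accept], verify[n_accept : n_accept + 1]])
        ids = torch.cat([ids, new.unsqueeze(0)], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Production implementations such as TensorRT-LLM additionally reuse KV caches across iterations, support sampling-based acceptance, and can use tree-structured drafts (e.g., EAGLE, Medusa), all of which this sketch omits.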
## Performance Improvements

It's important to note that the effectiveness of speculative decoding techniques is highly dependent
on the specific task at hand. For instance, forecasting subsequent tokens in a code-completion scenario
may prove simpler than generating a summary for an article. [Spec-Bench](https://sites.google.com/view/spec-bench)
shows the performance of different speculative decoding approaches on different tasks.
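
As a rough way to quantify this task dependence (a simplified model in the spirit of [Leviathan et al., 2023](https://arxiv.org/abs/2211.17192), assuming each draft token is accepted independently with probability $\alpha$): with $\gamma$ draft tokens per iteration, the expected number of tokens produced per target forward pass is

$$
\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
$$

For $\gamma = 4$, an easy task with $\alpha = 0.8$ yields about $(1 - 0.8^5)/(1 - 0.8) \approx 3.4$ tokens per pass, while a harder task with $\alpha = 0.4$ yields only about $1.6$, which is why measured speedups vary widely across workloads.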
## Speculative Decoding with Triton Inference Server

Follow the guide [here](TRT-LLM/README.md) to learn how Triton Inference Server supports speculative decoding with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
