1 parent fb1cfcd commit 8f54332
06_gpu_and_ml/llm-serving/trtllm_latency.py
@@ -15,9 +15,14 @@
 # With the out-of-the-box defaults we observe an unacceptable median time
 # to last token of over a second, but with careful configuration,
 # we'll bring that down to under 250ms -- over a 4x speed up!
-# These latencies were measured on a single NVIDIA H100 GPU with prompts and generations
-# of a few dozen to a few hundred tokens.
+# These latencies were measured on a single NVIDIA H100 GPU
+# running LLaMA 3 8B on prompts and generations of a few dozen to a few hundred tokens.
+# Here's what that looks like in a terminal chat interface:
+
+# <video controls autoplay loop muted>
+# <source src="https://modal-cdn.com/example-trtllm-latency.mp4" type="video/mp4">
+# </video>

 # ## Overview