
cache quantization #4031


Open · wants to merge 1 commit into main
Conversation

Aya-ZIbra (Contributor)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1115

Context

Enabling FP8 attention involves quantizing Q, K, and V to FP8 and passing the FP8 inputs, together with their descales, to FAv3.

Based on our study, FP8 quantization per KV head provides reasonable accuracy while maintaining the performance gains in prefill TFLOPS and decode latency.
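
For orientation, a rough sketch of the arithmetic (not the exact FAv3 interface): if each FP8 tensor is produced as $X_{\mathrm{fp8}} \approx s_X X$ with a per-head scale $s_X$ and descale $d_X = 1/s_X$, the attention kernel recovers the original magnitudes via

$$
\mathrm{softmax}\!\left(\frac{\left(Q_{\mathrm{fp8}} K_{\mathrm{fp8}}^{\top}\right) d_Q\, d_K}{\sqrt{d_h}}\right) V_{\mathrm{fp8}}\, d_V \;\approx\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_h}}\right) V .
$$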

TODO
Benchmarks to estimate performance gains. For PD, num_generations = 4 and success rate = 0.5. Decode latency improves for large batch sizes and sequence lengths.
| Batch Size | Sequence Position | bf16 | fp8_KV | fp8_perhead + fp8_attn | fp8 attn TTIT estimated difference (ms) |
| --- | --- | --- | --- | --- | --- |
| 64 | 2048 | 41.43437743 | 89.68185633 | 35.24733707 | -0.1237408072 |
| 64 | 4096 | 72.47199118 | 170.5418229 | 56.11991137 | -0.3270415962 |
| 64 | 16384 | 252.3710728 | 521.9820738 | 182.0285916 | -1.406849623 |
| 64 | 32767 | 487.8227115 | 800.8700609 | 356.9591939 | -2.61727035 |

This Diff

Introduces a kernel that quantizes the KV cache per head.
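
As a rough reference for what per-KV-head quantization computes (a minimal PyTorch sketch only; the actual diff adds a fused kernel, and the real function names, tensor layout, and scale handling in FBGEMM may differ):

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn


def quantize_kv_per_head(kv: torch.Tensor):
    """kv: [B, S, H_kv, D] in bf16/fp16. Returns the FP8 tensor plus one
    descale per (batch, KV head), as an FP8 attention kernel would consume."""
    # One scale per KV head: max |value| over the sequence and head_dim axes.
    amax = kv.abs().amax(dim=(1, 3)).float()           # [B, H_kv]
    scale = FP8_MAX / amax.clamp(min=1e-12)             # map amax onto the FP8 range
    kv_fp8 = (kv * scale[:, None, :, None]).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    descale = scale.reciprocal()                        # passed alongside kv_fp8 to undo scaling
    return kv_fp8, descale
```

The intuition for the per-head granularity is that each scale only has to cover a single head's dynamic range, which is where the accuracy benefit noted in the Context section comes from.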

This Stack

  1. < This step >
  2. < Step 2 >
  3. < Step 3 >

Differential Revision: D73386673

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D73386673

netlify bot commented Apr 26, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | a584cc1 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/680d03d73d5f840008a8ce2a |
| 😎 Deploy Preview | https://deploy-preview-4031--pytorch-fbgemm-docs.netlify.app |
