
cache quantization #4031


Open · wants to merge 1 commit into main
Conversation

Aya-ZIbra (Contributor)

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1115

Context

Enabling FP8 attention involves quantizing Q, K, and V to FP8 and passing the FP8 inputs, together with their descales, to FAv3.

Based on our study, FP8 quantization per KV head provides reasonable accuracy while maintaining the performance gains in prefill TFLOPS and decode latency.
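
For orientation, a rough sketch of the arithmetic (not the exact FAv3 interface): if each FP8 tensor is produced as $X_{\mathrm{fp8}} \approx s_X X$ with a per-head scale $s_X$ and descale $d_X = 1/s_X$, the attention kernel recovers the original magnitudes via

$$
\mathrm{softmax}\!\left(\frac{\left(Q_{\mathrm{fp8}} K_{\mathrm{fp8}}^{\top}\right) d_Q\, d_K}{\sqrt{d_h}}\right) V_{\mathrm{fp8}}\, d_V \;\approx\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_h}}\right) V .
$$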

TODO
Benchmarks to estimate performance gains. For PD, num_generations = 4 and success rate = 0.5. Decode latency improves for large batch sizes and sequence lengths.
| Batch Size | Sequence Position | bf16 | fp8_KV | fp8_perhead + fp8_attn | fp8 attn TTIT estimated difference (ms) |
| --- | --- | --- | --- | --- | --- |
| 64 | 2048 | 41.43437743 | 89.68185633 | 35.24733707 | -0.1237408072 |
| 64 | 4096 | 72.47199118 | 170.5418229 | 56.11991137 | -0.3270415962 |
| 64 | 16384 | 252.3710728 | 521.9820738 | 182.0285916 | -1.406849623 |
| 64 | 32767 | 487.8227115 | 800.8700609 | 356.9591939 | -2.61727035 |

This Diff

Introduces a kernel that quantizes the KV cache per head.
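
As a rough reference for what per-KV-head quantization computes (a minimal PyTorch sketch only; the actual diff adds a fused kernel, and the real function names, tensor layout, and scale handling in FBGEMM may differ):

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn


def quantize_kv_per_head(kv: torch.Tensor):
    """kv: [B, S, H_kv, D] in bf16/fp16. Returns the FP8 tensor plus one
    descale per (batch, KV head), as an FP8 attention kernel would consume."""
    # One scale per KV head: max |value| over the sequence and head_dim axes.
    amax = kv.abs().amax(dim=(1, 3)).float()           # [B, H_kv]
    scale = FP8_MAX / amax.clamp(min=1e-12)             # map amax onto the FP8 range
    kv_fp8 = (kv * scale[:, None, :, None]).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    descale = scale.reciprocal()                        # passed alongside kv_fp8 to undo scaling
    return kv_fp8, descale
```

The intuition for the per-head granularity is that each scale only has to cover a single head's dynamic range, which is where the accuracy benefit noted in the Context section comes from.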

This Stack

  1. < This step >
  2. < Step 2 >
  3. < Step 3 >

Differential Revision: D73386673

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D73386673

netlify bot commented Apr 26, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | a584cc1 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/680d03d73d5f840008a8ce2a |
| 😎 Deploy Preview | https://deploy-preview-4031--pytorch-fbgemm-docs.netlify.app |
