Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1115
Context
Enabling FP8 attention involves quantizing Q, K, and V to FP8 and passing the FP8 inputs, together with their descale factors, to FAv3.
Based on our study, per-KV-head FP8 quantization provides reasonable accuracy while preserving the performance gains in prefill TFLOPS and decode latency.
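For reference, a "descale" is just the inverse of the quantization scale: the inputs are multiplied by a per-head scale before being cast to FP8, and the attention kernel multiplies by the descale to recover the original magnitudes. A minimal PyTorch sketch of that round trip (the e4m3 format and the toy shapes are assumptions, not taken from this diff):

```python
import torch

FP8 = torch.float8_e4m3fn        # assumed FP8 format (e5m2 is the other common choice)
FP8_MAX = torch.finfo(FP8).max   # 448.0 for e4m3

x = torch.randn(4, 128) * 3.0              # stand-in for the Q/K/V slice of one head
scale = FP8_MAX / x.abs().max()            # quantization scale for this head
descale = 1.0 / scale                      # passed to FAv3 alongside the FP8 tensor

x_fp8 = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(FP8)   # FP8 input handed to attention
x_back = x_fp8.float() * descale                       # how the kernel recovers magnitudes
print((x - x_back).abs().max())                        # small quantization error
```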
TODO
Benchmarks to estimate performance gains. For PD, the benchmarks use num_generations = 4 and a success rate of 0.5. Decode latency improves for large batch sizes and sequence lengths:
| Batch Size | Sequence Position | bf16 | fp8_KV | fp8_perhead+fp8attn | fp8 attn TTIT estimated difference (ms) |
| --- | --- | --- | --- | --- | --- |
| 64 | 2048 | 41.43437743 | 89.68185633 | 35.24733707 | -0.1237408072 |
| 64 | 4096 | 72.47199118 | 170.5418229 | 56.11991137 | -0.3270415962 |
| 64 | 16384 | 252.3710728 | 521.9820738 | 182.0285916 | -1.406849623 |
| 64 | 32767 | 487.8227115 | 800.8700609 | 356.9591939 | -2.61727035 |
This Diff
Introduces a kernel that performs FP8 quantization per KV head.
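A rough sketch of what per-KV-head quantization looks like in eager PyTorch (illustrative only: the function name, the [B, S, H_kv, D] layout, and the e4m3 format are assumptions, and this diff implements the step as a dedicated kernel rather than eager ops):

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max


def quantize_per_kv_head(x: torch.Tensor, eps: float = 1e-12):
    """Quantize a [B, S, H_kv, D] bf16/fp16 tensor to FP8 with one scale per KV head.

    Returns the FP8 tensor and a [H_kv] descale vector that an FP8 attention
    kernel such as FAv3 can use to dequantize its inputs.
    """
    # Per-head absolute maximum, reduced over batch, sequence, and head_dim.
    amax = x.abs().amax(dim=(0, 1, 3)).float().clamp(min=eps)      # [H_kv]
    scale = FP8_MAX / amax                                          # [H_kv]
    # Broadcast the per-head scale over the other dims and cast to FP8.
    x_fp8 = (x.float() * scale.view(1, 1, -1, 1)).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    descale = 1.0 / scale                                           # handed to the attention kernel
    return x_fp8, descale


# Illustrative usage with toy shapes:
k = torch.randn(2, 1024, 8, 128, dtype=torch.bfloat16)
k_fp8, k_descale = quantize_per_kv_head(k)   # k_fp8: FP8 [2, 1024, 8, 128], k_descale: [8]
```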
This Stack
Differential Revision: D73386673