
Performance Optimization: Optimized TileShape Configuration for f8 (#3617) #3735


Closed

Conversation

jiawenliu64
Member

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/816

## Performance Issue with Current F8 TileShape Configuration

The current FBGEMM f8 kernel uses a TileShape configuration of 128x128x128, while the optimal shape for the dense f8 tensor core on H100 is m64n256k32. The current configuration therefore leads to suboptimal tensor core and bandwidth utilization.
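
As a rough illustration of why the tile width matters (a sketch under the assumption that a cooperative kernel runs two consumer warp groups per CTA; this is not FBGEMM code), the relationship between the CTA tile and the m64n256k32 atom can be checked with simple arithmetic:

```cpp
// Illustrative arithmetic only, not FBGEMM code: how a CTA TileShape relates to the
// H100 dense f8 WGMMA atom m64n256k32. A 128-wide N tile can use at most an n128
// atom, while a 256-wide N tile fits the full n256 shape; M is covered by two
// consumer warp groups of 64 rows each, and K by four k32 MMA steps.
constexpr int wgmma_m = 64, wgmma_n = 256, wgmma_k = 32;             // peak dense f8 MMA shape
constexpr int old_tile_n = 128;                                      // current 128x128x128 tile
constexpr int new_tile_m = 128, new_tile_n = 256, new_tile_k = 128;  // proposed tile

static_assert(old_tile_n < wgmma_n, "a 128-wide tile cannot use the full n256 atom");
static_assert(new_tile_m == 2 * wgmma_m, "M tile = 2 consumer warp groups x m64");
static_assert(new_tile_n == wgmma_n, "N tile matches the atom width exactly");
static_assert(new_tile_k % wgmma_k == 0, "K tile covered in new_tile_k / k32 = 4 MMA steps");
```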

## Optimized TileShape (128x256x128) Implementation

This change modifies the TileShape configuration from 128x128x128 to 128x256x128 for large GEMM operations and uses a cooperative kernel, enabling optimal bandwidth and tensor core utilization. The same configuration is notably used in Flash Attention V3 for f8.
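
For illustration, a CUTLASS 3.x mainloop configured in this spirit might look like the sketch below. The element types, layouts, cluster shape, and stage count here are assumptions for the example, not FBGEMM's exact template parameters:

```cpp
// Sketch only: an SM90 CUTLASS 3.x collective mainloop with the 128x256x128 tile
// and the cooperative (warp-specialized) schedule. Not the actual FBGEMM kernel code.
#include <cutlass/cutlass.h>
#include <cutlass/gemm/collective/collective_builder.hpp>
#include <cutlass/gemm/dispatch_policy.hpp>
#include <cute/tensor.hpp>

using TileShape    = cute::Shape<cute::_128, cute::_256, cute::_128>;  // proposed tile
using ClusterShape = cute::Shape<cute::_2, cute::_1, cute::_1>;        // assumed cluster

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::float_e4m3_t, cutlass::layout::RowMajor, 16,     // A operand (f8), 128-bit aligned
    cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, 16,  // B operand (f8), 128-bit aligned
    float,                                                     // accumulator type
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::KernelTmaWarpSpecializedCooperative         // cooperative schedule
>::CollectiveOp;
```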

## Benchmark Results on H100 GPU

### Benchmark configuration

- PyTorch 2.6
- CUDA 12.4
- CPU: AMD EPYC
- GPU: NVIDIA H100
- Each measurement launches the kernel 30 times, and results are averaged over 25 benchmark runs (sketched below).
- GEMM sizes match those used in the Colfax benchmarks.
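
The sketch below is a hypothetical harness, not FBGEMM's benchmark code: it only illustrates the stated methodology (30 launches per measurement, averaged over 25 measurements) and the TFLOPS conversion, with `launch` standing in for whichever quantized GEMM op is under test:

```cpp
#include <cuda_runtime.h>

// Hypothetical timing harness: 30 kernel launches per measurement, averaged over
// 25 measurements, converted to TFLOPS (2*M*N*K flops per GEMM, times G for the
// grouped case). Warmup and error checking are omitted for brevity.
template <typename LaunchFn>
double bench_tflops(LaunchFn launch, long long m, long long n, long long k,
                    long long g = 1, int iters = 30, int reps = 25) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  double total_ms = 0.0;
  for (int r = 0; r < reps; ++r) {
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
      launch();  // e.g. the f8f8bf16_rowwise or f8f8bf16_grouped op under test
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms / iters;  // average time of one launch in this measurement
  }
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  const double avg_s = (total_ms / reps) * 1e-3;  // mean launch time in seconds
  const double flops = 2.0 * g * m * n * k;       // multiply-adds counted as 2 flops
  return flops / avg_s / 1e12;                    // TFLOPS
}
```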

### Benchmark

#### f8f8bf16_grouped (G = 4, M = 2,048, N = 8,192, K = 8,192)

| TileShape   | TFLOPS |
|-------------|--------|
| 128-128-128 | 1244   |
| 128-256-128 | 1374   |

#### f8f8bf16_rowwise (M = N = K = 8,192)

| TileShape   | TFLOPS |
|-------------|--------|
| 128-128-128 | 1300   |
| 128-256-128 | 1480   |

#### f8f8bf16_tensorwise (M = N = K = 8,192)

| TileShape   | TFLOPS |
|-------------|--------|
| 128-128-128 | 1271   |
| 128-256-128 | 1463   |
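
On these shapes, 128-256-128 improves throughput over 128-128-128 by roughly 10% (grouped), 14% (rowwise), and 15% (tensorwise).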

## Technical Implementation

Modified TileShape from 128-128-128 to 128-256-128 for:

- f8f8bf16_grouped
- f8f8bf16_rowwise
- f8f8bf16_tensorwise

Added cooperative kernel by default for:

- f8f8bf16_rowwise
- f8f8bf16_tensorwise

f8f8f16.cu was not modified because it is deprecated in favor of f8f8bf16_tensorwise.

The modifications only affect large GEMMs where M > 128, N > 128, and M or N > 2,048. The matrices are divided into tiles twice as large, but the kernels use 3 SMs instead of 2. Smaller shapes handled by the large-kernel heuristics may see slightly reduced efficiency compared to the previous configuration. An empirical study of f8 kernel configurations across GEMM sizes could further benefit FBGEMM.
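
A minimal sketch of this size gating (the helper name is illustrative; FBGEMM's actual dispatch heuristics live in the kernel source) could look like:

```cpp
#include <cstdint>

// Hypothetical gating helper, not FBGEMM's dispatch code verbatim: only problems
// that are large in both dimensions, and very large in at least one, take the
// 128x256x128 cooperative path; smaller shapes keep the existing configurations.
inline bool use_128x256x128_cooperative(int64_t M, int64_t N) {
  return M > 128 && N > 128 && (M > 2048 || N > 2048);
}
```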

These changes were made by modifying the minimum necessary code while respecting
existing coding practices in FBGEMM.

## Test Coverage

### Unit Test Results

The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize were run to verify the modified kernels.

@jiawenliu64 @jwfromm Thank you!

Differential Revision: D68719476

Pulled By: jiawenliu64

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D68719476


netlify bot commented Feb 25, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 66cff30 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67be3dc023b0000008b24184 |
| 😎 Deploy Preview | https://deploy-preview-3735--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot
Contributor

@jiawenliu64 merged this pull request in b9acfeb.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
…ytorch#816)

Summary:
X-link: pytorch#3735

Pull Request resolved: facebookresearch/FBGEMM#816

X-link: pytorch#3617

Reviewed By: sunfish2010

Differential Revision: D68719476

Pulled By: jiawenliu64

fbshipit-source-id: 60705574aa1779e0171fea01addf8f20788c4749