Add Preshuffled FP8 x INT4 Grouped Gemm Kernel #3800
Conversation
This pull request was exported from Phabricator. Differential Revision: D70870933
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Force-pushed from 8c4ef51 to 071a184.
Summary: Working on adding support for stacked mixed dtype grouped gemm with preshuffling. Differential Revision: D70870933
Force-pushed from 071a184 to e86a553.
Summary: Efficient FP8xINT4 grouped gemm with preshuffling and scale packing. This implementation uses the "stacked" API where inputs and outputs are single contiguous tensors and the group boundaries are indicated with an `M_sizes` tensor that contains the number of rows in each group. Differential Revision: D70870933
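For readers unfamiliar with the stacked layout, here is a minimal PyTorch sketch of the convention the summary describes. The group counts, shapes, and the reference loop are illustrative assumptions, not the FBGEMM kernel or its API.

```python
import torch

# Illustrative sizes: G groups sharing K and N, with per-group row counts M_i.
G, K, N = 3, 128, 256
M_sizes = torch.tensor([4, 0, 7], dtype=torch.int64)  # rows per group (empty groups allowed)

# "Stacked" layout: all groups' activation rows live in one contiguous tensor.
x = torch.randn(int(M_sizes.sum()), K)   # [sum(M_i), K]
w = torch.randn(G, N, K)                 # one weight matrix per group

# Reference computation the fused kernel performs in a single call:
# each group's row block is multiplied by that group's weight matrix.
out = torch.empty(x.shape[0], N)
start = 0
for g in range(G):
    m = int(M_sizes[g])
    out[start:start + m] = x[start:start + m] @ w[g].t()
    start += m
# In the real kernel, x is FP8, w is preshuffled INT4 with packed scales, and the
# output is again a single contiguous [sum(M_i), N] tensor.
```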
Force-pushed from e86a553 to 9391323.
Force-pushed from 9391323 to ad12177.
Force-pushed from ad12177 to d39247d.
Force-pushed from d39247d to d9c8ad4.
Force-pushed from d9c8ad4 to 3fa97dc.
Force-pushed from 3fa97dc to 7e88324.
Summary: X-link: facebookresearch/FBGEMM#847 Pull Request resolved: pytorch#3766 One of the new interesting changes in the preshuffled F8I4 kernel is that group scales are downcast to FP8. This has the risk of running into dynamic range issues and impacting accuracy. We can mitigate this risk by adding FP32 columnwise scaling to the output. Fortunately, we can do this using EVT so the performance impact is negligible. Differential Revision: D70587477
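To make the tradeoff concrete, here is a small, self-contained PyTorch sketch. The values and the per-column normalization scheme are illustrative assumptions; per the summary above, the actual kernel applies its FP32 columnwise factor to the output in the EVT epilogue.

```python
import torch

# Per-column, per-group weight scales for an INT4 group-quantized matrix,
# shape [num_k_groups, N]. Columns deliberately span a wide dynamic range.
num_k_groups, N = 4, 8
group_scales = torch.rand(num_k_groups, N) * torch.logspace(-4, 0, N)

# Casting group scales straight to FP8 (e4m3) loses the smallest columns to
# underflow. One mitigation: factor out a per-column FP32 scale so the FP8
# values stay near 1, and apply the FP32 factor to the GEMM output instead.
col_scale = group_scales.abs().amax(dim=0)                    # FP32, one per output column
scales_fp8 = (group_scales / col_scale).to(torch.float8_e4m3fn)

recon = scales_fp8.to(torch.float32) * col_scale              # columnwise-corrected
direct = group_scales.to(torch.float8_e4m3fn).to(torch.float32)
rel = lambda a: ((a - group_scales).abs() / group_scales).max()
print(f"direct FP8 cast max rel err:     {rel(direct):.3f}")
print(f"FP8 + FP32 columnwise rel err:   {rel(recon):.3f}")
```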
Summary: X-link: facebookresearch/FBGEMM#855 Pull Request resolved: pytorch#3775 This diff introduces a set of quantization helper functions to fbgemm_gpu/experimental/gen_ai to make it easier to apply the new Int4 packing and preshuffling to weights. Differential Revision: D70643388 Reviewed By: summerdengfb
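As a rough sketch of what such helpers make convenient, int4 packing stores two 4-bit weight values per byte. The function below is hypothetical and ignores the scale packing and preshuffled layout that the real fbgemm_gpu utilities handle.

```python
import torch

def pack_int4_rowwise(w_int4: torch.Tensor) -> torch.Tensor:
    """Pack pairs of int4 values (stored in int8, range [-8, 7]) into single bytes.

    Hypothetical helper for illustration only: the real fbgemm_gpu utilities also
    pack the group scales and rearrange weights into the preshuffled layout the
    kernel expects.
    """
    assert w_int4.shape[-1] % 2 == 0
    lo = (w_int4[..., 0::2] & 0xF).to(torch.uint8)        # low nibble
    hi = (w_int4[..., 1::2] & 0xF).to(torch.uint8) << 4   # high nibble
    return lo | hi                                        # half the bytes of w_int4

# Example: symmetric rowwise int4 quantization of a small weight tensor, then packing.
w = torch.randn(16, 32)
scales = w.abs().amax(dim=1, keepdim=True) / 7.0
w_int4 = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
w_packed = pack_int4_rowwise(w_int4)   # shape [16, 16], dtype uint8
```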
Force-pushed from 7e88324 to 107e9f6.
Force-pushed from 107e9f6 to 864fb11.
Force-pushed from 864fb11 to 46acfbc.
This pull request has been merged in a39d2cc.
Summary: X-link: https://github.com/facebookresearch/FBGEMM/pull/897 Pull Request resolved: pytorch#3800 Efficient FP8xINT4 grouped gemm with preshuffling and scale packing. This implementation uses the "stacked" API where inputs and outputs are single contiguous tensors and the group boundaries are indicated with an `M_sizes` tensor that contains the number of rows in each group. Reviewed By: jiawenliu64 Differential Revision: D70870933 fbshipit-source-id: 195fb9feb993ffa7efe27b038173bd70a1db57ed
hey @jwfromm - thanks for the PR! What exactly is the intended use of …