Add torch compliant grouped gemm API for CK FP8 rowwise #4486
Conversation
This pull request was exported from Phabricator. Differential Revision: D78119166
✅ Deploy Preview for pytorch-fbgemm-docs ready!
Summary: X-link: facebookresearch/FBGEMM#1543

Pull Request resolved: pytorch#4486

For PyTorch integration we will need to support several additional cases and leverage a slightly different API. This is best observed through the torch test cases, e.g. [test_scaled_grouped_gemm_2d_3d](https://www.internalfb.com/code/fbsource/[fbdb0063f1c1ecca30f5eab8b5341643f680ed51]/fbcode/caffe2/test/test_matmul_cuda.py?lines=1793) and [test_scaled_grouped_gemm_3d_2d](https://www.internalfb.com/code/fbsource/[fbdb0063f1c1ecca30f5eab8b5341643f680ed51]/fbcode/caffe2/test/test_matmul_cuda.py?lines=1854), and through ngimel's [grouped gemm API doc](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9).

**In summary, we need these cases:**

| **Input Type** | Notes |
| --- | --- |
| 2D-3D | Same as the FBGEMM stacked case for MoE |
| 3D-2D | Not sure of the use case for this yet |
| 2D-2D | I think this is for the backward pass? |
| 3D-3D (BMM) | [Could alternatively leverage the FBGEMM BMM kernel](https://www.internalfb.com/code/fbsource/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/) |

The PyTorch API uses offsets instead of sizes, so we update the kernel's grouped GEMM parameter setup to also take offsets and to support the cases above.

- For BMM we could alternatively leverage the AMD FP8 BMM kernel in FBGEMM, but we get some "free" support by handling it in the grouped kernel.
- I have not yet updated the heuristics to properly account for the new cases; that will come later with a re-tune for generic shapes rather than Llama-specific ones.

Differential Revision: D78119166
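As a rough illustration of the offsets-based layout (a sketch only, not the kernel or the final API; the tensor names and shapes here are assumptions, and the FP8 rowwise quantization/scaling is omitted), the snippet below derives PyTorch-style offsets from FBGEMM-style group sizes and runs a plain reference of the 2D-3D stacked MoE case:

```python
import torch

# Hypothetical example shapes: G groups, stacked activations totalling M rows.
G, K, N = 4, 64, 32
m_sizes = torch.tensor([3, 5, 0, 7])          # FBGEMM-style per-group row counts
offs = torch.cumsum(m_sizes, dim=0)           # PyTorch-style offsets: [3, 8, 8, 15]

x = torch.randn(int(m_sizes.sum()), K)        # 2D stacked input (M_total, K)
w = torch.randn(G, N, K)                      # 3D weights (G, N, K)

# Reference for the 2D-3D case: each row segment [start, end) is multiplied by
# its group's weight; an empty group (size 0) simply contributes no rows.
out = torch.empty(int(m_sizes.sum()), N)
start = 0
for g in range(G):
    end = int(offs[g])
    out[start:end] = x[start:end] @ w[g].t()
    start = end
```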
89cb88e to a920a74 (Compare)
Summary: X-link: facebookresearch/FBGEMM#1543

For PyTorch integration we will need to support several additional cases and leverage a slightly different API. This is best observed through the torch test cases, e.g. [test_scaled_grouped_gemm_2d_3d](https://www.internalfb.com/code/fbsource/[fbdb0063f1c1ecca30f5eab8b5341643f680ed51]/fbcode/caffe2/test/test_matmul_cuda.py?lines=1793) and [test_scaled_grouped_gemm_3d_2d](https://www.internalfb.com/code/fbsource/[fbdb0063f1c1ecca30f5eab8b5341643f680ed51]/fbcode/caffe2/test/test_matmul_cuda.py?lines=1854), and through ngimel's [grouped gemm API doc](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9).

**In summary, we need these cases:**

| **Input Type** | Notes |
| --- | --- |
| 2D-3D | Same as the FBGEMM stacked case for MoE |
| 3D-2D | Needed for backwards |
| 2D-2D | Needed for backwards |
| 3D-3D (BMM) | [Could alternatively leverage the FBGEMM BMM kernel](https://www.internalfb.com/code/fbsource/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/) |

The PyTorch API uses offsets instead of sizes, so we update the kernel's grouped GEMM parameter setup to also take offsets and to support the cases above.

- For BMM we could alternatively leverage the AMD FP8 BMM kernel in FBGEMM, but we get some "free" support by handling it in the grouped kernel.
- I have not yet updated the heuristics to properly account for the new cases; that will come later with a re-tune for generic shapes rather than Llama-specific ones.

Differential Revision: D78119166
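A plausible reading of the 2D-2D backward case in the table above (again only a sketch with assumed names and shapes, FP8 scaling omitted): the offsets segment the shared M dimension of two stacked 2D operands, and each segment produces one slice of a 3D output, e.g. a per-expert weight gradient.

```python
import torch

# Hypothetical MoE backward shapes: G experts, stacked tokens totalling M rows.
G, K, N = 4, 64, 32
m_sizes = torch.tensor([3, 5, 0, 7])
offs = torch.cumsum(m_sizes, dim=0)      # offsets over the shared M dimension

x = torch.randn(int(m_sizes.sum()), K)   # stacked activations (M_total, K)
dy = torch.randn(int(m_sizes.sum()), N)  # stacked output gradients (M_total, N)

# 2D-2D grouped GEMM reference: each M segment yields one (N, K) slice,
# i.e. that expert's weight gradient, so the output is 3D (G, N, K).
dw = torch.zeros(G, N, K)
start = 0
for g in range(G):
    end = int(offs[g])
    dw[g] = dy[start:end].t() @ x[start:end]
    start = end
```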
2b51947 to 4758f24 (Compare)
This pull request has been merged in 53cde4a.
Summary:
For PyTorch integration we will need to support several additional cases and leverage a slightly different API. This is best observed through the torch test cases, e.g. test_scaled_grouped_gemm_2d_3d and test_scaled_grouped_gemm_3d_2d.
In summary, we need these cases:
| Input Type | Notes |
| --- | --- |
| 2D-3D | Same as the FBGEMM stacked case for MoE |
| 3D-2D | Not sure of the use case for this yet |
| 2D-2D | I think this is for the backward pass? |
| 3D-3D (BMM) | Could alternatively leverage the FBGEMM BMM kernel |
The PyTorch API uses offsets instead of sizes, so we update the kernel's grouped GEMM parameter setup to also take offsets and to support the cases above.
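For the 3D-3D (BMM) row of the table, a minimal reference of the intended semantics (shapes and names are assumptions; FP8 rowwise scaling is again omitted), which also shows why the FBGEMM FP8 BMM kernel is a viable alternative for this case:

```python
import torch

# Hypothetical shapes: G independent GEMMs of identical size, i.e. a batched matmul.
G, M, N, K = 4, 8, 16, 32
a = torch.randn(G, M, K)
b = torch.randn(G, N, K)

# Grouped-GEMM view: group g computes a[g] (M, K) @ b[g].t() (K, N).
out_grouped = torch.stack([a[g] @ b[g].transpose(0, 1) for g in range(G)])

# Equivalent batched matmul.
out_bmm = torch.bmm(a, b.transpose(1, 2))
assert torch.allclose(out_grouped, out_bmm, atol=1e-5)
```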
Differential Revision: D78119166