FP8 Grouped Gemm Optimization #3655

Open · wants to merge 3 commits into base: main

Conversation

@jwfromm (Contributor) commented Feb 4, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/731

While optimizing MoE, we found that small overheads were a major bottleneck for grouped GEMM performance. This diff tackles a few of them, specifically the overhead of torch.dynamo wrapping `quantize_fp8_row` and of having to slice input tensors before calling `f8f8bf16_rowwise_grouped`.

To fix the former, we allow `triton_quantize_fp8_row` to be called directly, skipping the dynamo compatibility layer. In cases where AOTI isn't needed, this removes a bit of overhead.
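
As a rough sketch of the eager-mode path (assuming both functions live in `fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm`; check the import path and signatures against your FBGEMM build):

```python
import torch

# Assumed import path; verify against the FBGEMM version in use.
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    quantize_fp8_row,         # custom-op wrapper (dynamo/AOTI friendly)
    triton_quantize_fp8_row,  # raw Triton kernel, no op-dispatch overhead
)

x = torch.randn(16, 5120, device="cuda", dtype=torch.bfloat16)

# Eager path: call the Triton kernel directly when AOTI/torch.compile export
# is not needed, avoiding the custom-op wrapping overhead.
xq, x_scale = triton_quantize_fp8_row(x)

# Export path: keep using the wrapped op so dynamo/AOTI can trace it.
xq_wrapped, x_scale_wrapped = quantize_fp8_row(x)
```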

To fix the latter, we templatize `f8f8bf16_rowwise_grouped_dynamic` to accept `at::Tensor` inputs instead of lists of tensors. We introduce a new wrapper called `f8f8bf16_rowwise_grouped_stacked` to preserve the behavior where `zero_start_index_M` isn't provided but the user wants a single contiguous output tensor.
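
A minimal sketch of the intended calling pattern. The op name comes from this diff, but the tensor shapes and argument order below are assumptions for illustration only; the registered op schema is the source of truth:

```python
import torch

# Hypothetical shapes/arguments; check the registered schema for the real
# signature of this op before relying on it.
G, M, N, K = 4, 64, 256, 512  # groups, max rows per group, out/in features

xq = torch.randn(G, M, K, device="cuda").to(torch.float8_e4m3fn)  # stacked activations
wq = torch.randn(G, N, K, device="cuda").to(torch.float8_e4m3fn)  # stacked weights
x_scale = torch.rand(G, M, device="cuda")
w_scale = torch.rand(G, N, device="cuda")
# Number of valid rows per group; rows beyond this are treated as padding.
zero_start_index_M = torch.full((G,), M, dtype=torch.int64, device="cuda")

# Dynamic variant: whole tensors in, no per-group slicing into lists.
y = torch.ops.fbgemm.f8f8bf16_rowwise_grouped_dynamic(
    xq, wq, x_scale, w_scale, zero_start_index_M
)
```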

In microbenchmarks, we've found that these seemingly small changes can improve achieved TFLOP/s by 2X for small workloads.

Reviewed By: jiawenliu64

Differential Revision: D69072529

Josh Fromm and others added 3 commits February 2, 2025 17:11
Summary:
When benchmarking quantize functions, we'd like the measured overhead to mimic e2e behavior as closely as possible. For example, weights should be quantized ahead of time. The current design of quantize_bench does not allow this.

To accommodate this, I've added a new optional preprocess phase that allows some transformations to be applied independently of the timed benchmark. Here we use it to prepare data for grouped GEMM benchmarks so they more accurately capture e2e behavior (see the sketch below).

Differential Revision: D68964950
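
A minimal, illustrative sketch of the idea (the class and function names here are hypothetical, not the actual quantize_bench API): weight quantization moves into an untimed preprocess step, so only the work that runs per request ends up in the timed region.

```python
import time
import torch

def reference_quantize_fp8_row(w: torch.Tensor):
    # Simple reference rowwise FP8 quantization (not the FBGEMM kernel):
    # scale each row so its absolute max maps to the FP8 e4m3 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    return (w / scale).to(torch.float8_e4m3fn), scale.squeeze(-1)

class GroupedGemmBenchSketch:
    """Illustrative benchmark shape: preprocess() runs once, untimed."""

    def preprocess(self, weights):
        # Quantize weights ahead of time, as a real e2e serving path would.
        return [reference_quantize_fp8_row(w) for w in weights]

    def benchmark(self, activations, quantized_weights):
        # Only activation quantization + the GEMM itself belong in the
        # timed region; weight quantization overhead is excluded.
        start = time.perf_counter()
        for x, (wq, w_scale) in zip(activations, quantized_weights):
            xq, x_scale = reference_quantize_fp8_row(x)
            # ... call the grouped GEMM kernel here ...
        torch.cuda.synchronize()
        return time.perf_counter() - start
```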
Summary: Adds support for a `--trace` option that produces GPU traces for each benchmarked operator. This only works internally, so if it is used in OSS we fall back to `nullcontext` (sketched below).

Differential Revision: D68980020
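
A hedged sketch of the fallback pattern (the internal tracing hook is represented by a hypothetical placeholder import; in OSS the import fails and a no-op `nullcontext` is used):

```python
import contextlib

def trace_context(enabled: bool):
    """Return a GPU-trace context if internal tooling is available,
    otherwise a no-op nullcontext (the OSS fallback)."""
    if not enabled:
        return contextlib.nullcontext()
    try:
        # Placeholder for the internal-only tracing utility; not part of
        # the OSS FBGEMM distribution.
        from internal_profiling import gpu_trace  # hypothetical module
        return gpu_trace()
    except ImportError:
        return contextlib.nullcontext()

# Usage: wrap each benchmarked operator invocation.
# with trace_context(args.trace):
#     run_op()
```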
@facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D69072529

netlify bot commented Feb 4, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 606449f |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67a176bd3db5b30008e811b2 |
| 😎 Deploy Preview | https://deploy-preview-3655--pytorch-fbgemm-docs.netlify.app |
