[PyTorch] Let GroupedLinear accept MXFP8 input and gradient #2099
base: main
Conversation
Force-pushed from 0163f2f to 35a25f1
/te-ci pytorch
@yaox12 What do you mean by the "padding scaling factor kernels"? Is that swizzle? If so, then we definitely need to optimize them; they should take at most a few percent of the quantize time.
No, it's not swizzle. What I'm doing here is quantizing (or applying the activation and then quantizing) multiple input tensors as a whole, versus splitting them into chunks and quantizing them one by one. Both methods produce exactly the same quantized data, but the scaling factors may differ due to padding, so for the "quantize as a whole" path we need to pad the scaling factors manually. For example, suppose we have two input tensors (one per expert), both of shape [64, 128]:
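To make the padding concrete, here is a hedged Python sketch of how the scaling-factor shapes can differ between the two paths. It assumes MXFP8 produces one scale per 1x32 block of data and that the GEMM expects each tensor's scale matrix rounded up to multiples of (128, 4); these alignment values are an assumption for illustration, not taken from this PR.

```python
import math

def padded_sf_shape(rows, cols, block=32, row_align=128, col_align=4):
    """Rowwise scaling-factor shape after alignment padding.

    Assumptions (illustrative): one scale per 1x32 block of data,
    and the GEMM wants the scale matrix rounded up to multiples of
    (row_align, col_align).
    """
    sf_rows = math.ceil(rows / row_align) * row_align
    sf_cols = math.ceil((cols // block) / col_align) * col_align
    return sf_rows, sf_cols

# Quantizing the two [64, 128] expert tensors one by one pads each
# expert's scales to its own aligned region...
per_expert = [padded_sf_shape(64, 128) for _ in range(2)]
# ...while quantizing the concatenated [128, 128] tensor as a whole
# yields one contiguous scale region with no per-expert padding, so
# padding must be inserted between the experts afterwards.
as_whole = padded_sf_shape(128, 128)
```

Under these assumptions, each per-expert scale tensor occupies a full (128, 4) region, while the fused path produces a single (128, 4) region covering both experts, which is why the "quantize as a whole" path needs extra padding kernels to match the per-expert layout.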
Force-pushed from ef68f89 to fcd52fc
@@ -28,7 +30,7 @@ __device__ inline OType dgelu(const IType val, const Empty&) {
 template <typename OType, typename IType>
 __device__ inline OType sigmoid(const IType val, const Empty&) {
   const float cval = val;
-  return 1.f / (1.f + expf(-cval));
+  return sigmoidf(cval);
Unify the implementation with that in cast_gated_kernels.cuh.
  ComputeType after_dgate = grad_val * Activation(gelu_in, p);
  ComputeType act_in, dact_in;
  if constexpr ((Activation == &silu<fp32, fp32>) && (Dactivation == &dsilu<fp32, fp32>)) {
    const float s = sigmoidf(gelu_in);
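The fast path in this hunk reuses a single sigmoid evaluation for both the activation and its derivative. A minimal Python sketch of the identity it relies on (function name is illustrative, not from the PR):

```python
import math

def silu_and_dsilu(x):
    # SiLU(x) = x * sigmoid(x); its derivative is
    # s + x * s * (1 - s) = s * (1 + x * (1 - s)), with s = sigmoid(x),
    # so one sigmoid evaluation serves both the forward value and the grad.
    s = 1.0 / (1.0 + math.exp(-x))
    return x * s, s * (1.0 + x * (1.0 - s))
```

The derivative can be checked against a central finite difference of the forward value.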
Unify the implementation with that in cast_gated_kernels.cuh.
Title changed from "GroupedLinear accept MXFP8 input" to "GroupedLinear accept MXFP8 input and gradient"
/te-ci
…inear Signed-off-by: Xin Yao <[email protected]>
Force-pushed from c1b8230 to 9461308
/te-ci
Description

The functionality is ready, but we're not seeing a perf gain due to a performance regression in the fused activation and quantization kernels. Take an input of shape (8*4000, 4096) for example:

SwiGLU + MXFP8 Quantization
- SwiGLU + 8x MXFP8 Quantization: ~256 us
- SwiGLU + MXFP8 fusion: ~343 us + 2 padding scaling factor kernels (84 us)

SReLU + MXFP8 Quantization
- SReLU + 8x MXFP8 Quantization: ~187 us
- SReLU + MXFP8 fusion: ~142 us + 2 padding scaling factor kernels (75 us)

For the SReLU case, the CPU overhead of the fused version is lower, so overall we can get a slight speedup.

Type of change
Changes
Please list the changes introduced in this PR:
Checklist: