
Conversation

@phu0ngng
Collaborator

Description

Performance improvement over the current main branch for E2E training of Llama 3.1 8B on a GB200.

Recipe                   Speedup
te_fp8_delayedscaling    0.23%
te_fp8_currentscaling    0.06%
te_mxfp8                 0.45%
te_nvfp4                 0.52%

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring
  • Performance improvement

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <[email protected]>
@phu0ngng phu0ngng marked this pull request as ready for review October 24, 2025 17:09

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR refactors the JAX normalization layer (layernorm and rmsnorm) to decouple quantization from the fused normalization kernels by switching from the internal _quantize_dbias_impl to the public quantize function. When TE's fused normalization primitives are disabled (fallback path), the code now performs normalization first using _jax_layernorm / _jax_rmsnorm and then applies quantization via the standalone quantize helper. The quantizer parameter is made optional (defaults to None) to support cases where quantization is not needed. This change unifies the quantization API surface across the codebase and delivers a 0.06% to 0.52% E2E training speedup on GB200 hardware for LLaMA 3.1 8B across four quantization recipes (FP8 delayed/current scaling, MXFP8, NVFP4), by enabling optimized quantization (e.g., flatten_axis=-1) that was previously missing in the fallback path.
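
As a rough illustration of the decoupled fallback path described above: the names _jax_layernorm, quantize, and quantizer follow the summary, but the signatures and the trivial quantize stand-in below are assumptions for illustration, not the actual TransformerEngine implementation.

```python
import jax
import jax.numpy as jnp


def _jax_layernorm(x, gamma, beta, eps=1e-6):
    # Native-JAX layernorm used when TE's fused norm primitive is disabled.
    mean = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mean) * jax.lax.rsqrt(var + eps) * gamma + beta


def quantize(x, quantizer):
    # Stand-in for the public TE quantize helper this PR switches to;
    # the real helper dispatches to TE's optimized quantization kernels.
    return quantizer(x)


def layernorm_fwd_fallback(x, gamma, beta, quantizer=None, eps=1e-6):
    # Normalization and quantization are decoupled: normalize first with
    # native JAX, then quantize only if a quantizer was provided (optional).
    out = _jax_layernorm(x, gamma, beta, eps)
    if quantizer is None:
        return out
    return quantize(out, quantizer)
```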

Important Files Changed

Filename: transformer_engine/jax/cpp_extensions/normalization.py
Score: 3/5
Overview: Refactored layernorm_fwd and rmsnorm_fwd to use public quantize function instead of _quantize_dbias_impl; made quantizer optional; decoupled normalization from quantization when fused ops are disabled

Confidence score: 3/5

  • This PR introduces a performance optimization with a clear refactoring goal but contains subtle API inconsistencies that need verification before merge.
  • Score reduced primarily due to inconsistent handling of quantize return values (some call sites unpack a tuple on line 1039 while others treat it as a single value on line 1060; see the sketch after this list) and the removal of the is_dbias parameter without clear documentation of whether this affects backward compatibility with existing quantizer implementations.
  • Pay close attention to lines 1039 and 1060 in transformer_engine/jax/cpp_extensions/normalization.py to verify that the quantize function's return signature is handled correctly in all code paths, and confirm that removing the is_dbias and dq_dtype parameters doesn't break any existing quantizer implementations.
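
To make the return-value concern flagged above concrete, here is a contrived sketch; neither helper below is the real TE code, and the assumed return shapes are illustrative only.

```python
# Hypothetical helpers showing the two return conventions in question.

def quantize_single(x):
    # Assumed shape of the public quantize helper: returns only the
    # quantized tensor.
    return x


def quantize_pair(x):
    # Assumed shape of the old internal _quantize_dbias_impl: returns an
    # (output, dbias) tuple.
    return x, None


x = [1.0, 2.0, 3.0]

out = quantize_single(x)       # correct for a single-value return
out, dbias = quantize_pair(x)  # correct for a tuple return

# Mixing the conventions is the bug class flagged above:
#   out, dbias = quantize_single(x)  -> ValueError at unpack time
#   out = quantize_pair(x)           -> silently carries a tuple downstream
```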

1 file reviewed, no comments



@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR refactors the JAX normalization layer to use the public quantize() API instead of the internal _quantize_dbias_impl() function when TransformerEngine's fused normalization is disabled. The change affects both LayerNorm and RMSNorm code paths, routing all quantization operations through a unified entry point. When NormFwdPrimitive.enabled() returns False, the code now performs normalization using JAX's native implementation, then applies TE's optimized quantization kernel separately. This decouples normalization from quantization, enabling TE's performant quantization even when the fused norm kernel is unavailable (e.g., due to cuDNN version constraints or environment flags). The refactor also simplifies the API by removing the is_dbias and dq_dtype parameters from quantize calls. The change maintains all existing fallback paths for MXFP8, current tensor scaling, and NVFP4 formats, and delivers measurable end-to-end speedups (0.06% to 0.52%) for LLaMA 3.1 8B training on GB200 hardware.
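
A minimal sketch of the dispatch described above, with stand-in definitions so it runs on its own: NormFwdPrimitive.enabled(), _jax_rmsnorm, and quantize are named in the summary, but the gate, the fused-path placeholder, and all signatures here are illustrative assumptions rather than the actual TE code.

```python
import jax
import jax.numpy as jnp


def norm_fwd_primitive_enabled():
    # Stand-in for NormFwdPrimitive.enabled(): False when the fused kernel
    # is unavailable (e.g., cuDNN version constraints or environment flags).
    return False


def fused_rmsnorm_fwd(x, gamma, eps, quantizer):
    # Placeholder for the fused TE norm(+quantize) primitive.
    raise NotImplementedError


def quantize(x, quantizer):
    # Stand-in for the public quantize entry point that the fallback path
    # now routes through.
    return quantizer(x)


def _jax_rmsnorm(x, gamma, eps=1e-6):
    # Native-JAX RMSNorm used on the fallback path.
    rms = jax.lax.rsqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
    return x * rms * gamma


def rmsnorm_fwd(x, gamma, quantizer=None, eps=1e-6):
    if norm_fwd_primitive_enabled():
        # Fused path: normalization (plus optional quantization) in one primitive.
        return fused_rmsnorm_fwd(x, gamma, eps, quantizer)
    # Fallback path: normalize in JAX, then quantize separately through the
    # single public entry point; quantization is skipped when quantizer is None.
    out = _jax_rmsnorm(x, gamma, eps)
    return out if quantizer is None else quantize(out, quantizer)
```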

Important Files Changed

Filename: transformer_engine/jax/cpp_extensions/normalization.py
Score: 4/5
Overview: Replaces internal _quantize_dbias_impl() calls with public quantize() API across LayerNorm and RMSNorm fallback paths, simplifying quantization interface while routing all quantization through unified TE kernel

Confidence score: 4/5

  • This PR is safe to merge with minimal risk, as it's primarily a refactoring that consolidates quantization logic through a well-tested public API
  • Score reflects that the change is well-structured and maintains backward compatibility across all fallback paths (MXFP8, current scaling, NVFP4), though the PR checklist indicates tests were not added to specifically validate the refactored quantization paths
  • Pay close attention to transformer_engine/jax/cpp_extensions/normalization.py to ensure the quantize() API behaves identically to the previous _quantize_dbias_impl() implementation across all quantization formats, especially for edge cases involving cuDNN version constraints and transpose_batch_sequence flags

1 file reviewed, no comments


Collaborator

@jberchtold-nvidia jberchtold-nvidia left a comment


Good idea! LGTM

@phu0ngng
Collaborator Author

/te-ci JAX L0
