[Common] Split cast/gated kernels by scaling mode #2248

Oleg-Goncharov · 2025-10-08T13:43:58Z

Description

Breaks up the large cast_kernels.cuh and cast_gated_kernels.cuh into smaller headers organized by scaling mode.
No functional or behavior changes: code is moved, not modified. This improves structure, readability, and maintainability (easier to navigate/extend specific scaling paths). Build includes/exports updated accordingly; tests unaffected.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Broke up the large cast_kernels.cuh and cast_gated_kernels.cuh into smaller headers organized by scaling mode.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Copilot

Pull Request Overview

This pull request refactors the large cast_kernels.cuh and cast_gated_kernels.cuh files into smaller, more organized header files structured by scaling mode. This improves code maintainability, readability, and navigation by creating specialized headers for different quantization and scaling implementations.

Breaks down monolithic headers into focused, scaling-mode-specific files
Reorganizes code structure without modifying functionality or behavior
Creates dispatcher files to coordinate between different scaling implementations

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

File	Description
transformer_engine/common/util/cast_kernels.cuh	Removed all content - entire file deleted as part of refactoring
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh	NVFP4 quantize with transpose functionality, updated file path and namespacing
transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh	New file containing NVFP4-specific quantization kernels
transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh	New file containing NVFP4 dequantization functionality
transformer_engine/common/cast/nvfp4/core_nvfp4.cuh	New file with core NVFP4 utility functions and device operations
transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh	New file containing MXFP8 quantization kernels
transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh	MXFP8 gated operations, significantly reduced from original gated kernels file
transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh	New file containing MXFP8 dequantization functionality
transformer_engine/common/cast/fp8/quantize_fp8.cuh	New file containing FP8 quantization kernels
transformer_engine/common/cast/fp8/gated_fp8.cuh	New file containing FP8 gated operations
transformer_engine/common/cast/fp8/dequantize_fp8.cuh	New file containing FP8 dequantization functionality
transformer_engine/common/cast/dispatch/quantize.cuh	New dispatcher file coordinating quantization across scaling modes
transformer_engine/common/cast/dispatch/gated.cuh	New dispatcher file coordinating gated operations across scaling modes
transformer_engine/common/cast/dispatch/dequantize.cuh	New dispatcher file coordinating dequantization across scaling modes
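
To illustrate the layout in the table above, here is a minimal host-side sketch of a dispatcher that routes a call to a per-scaling-mode implementation. The `ScalingMode` enum, the namespaces, and the `quantize` functions are hypothetical stand-ins that return a tag string instead of launching a kernel; they are not the actual Transformer Engine API.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the library's scaling-mode enum.
enum class ScalingMode { kTensorScalingFp8, kMxfp8, kNvfp4 };

// Each namespace mimics one per-mode header (e.g. cast/fp8/quantize_fp8.cuh).
// The functions return a tag so the routing can be checked without a GPU.
namespace fp8 {
inline std::string quantize() { return "fp8::quantize"; }
}  // namespace fp8

namespace mxfp8 {
inline std::string quantize() { return "mxfp8::quantize"; }
}  // namespace mxfp8

namespace nvfp4 {
inline std::string quantize() { return "nvfp4::quantize"; }
}  // namespace nvfp4

// Mimics a dispatcher header (e.g. cast/dispatch/quantize.cuh): it only
// selects the implementation and holds no format-specific logic itself.
namespace dispatch {
inline std::string quantize(ScalingMode mode) {
  switch (mode) {
    case ScalingMode::kTensorScalingFp8:
      return fp8::quantize();
    case ScalingMode::kMxfp8:
      return mxfp8::quantize();
    case ScalingMode::kNvfp4:
      return nvfp4::quantize();
  }
  // Reached only if an out-of-range value is cast to ScalingMode.
  throw std::invalid_argument("unsupported scaling mode");
}
}  // namespace dispatch
```

With this shape, adding a scaling mode would mean adding one header and one `case`, which is the maintainability benefit the PR description points to.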

timmoon10 · 2025-10-10T21:08:38Z

transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh

+// This kernel supports only two scaling cases:
+// 1. r16c0  - Rowwise NVFP4
+// 2. r16c32 - Rowwise NVFP4 AND Colwise MXFP8
+template <bool COMPUTE_ACTIVATIONS, typename ParamOP, float (*OP)(float, const ParamOP &)>


Do we actually support fused activation-cast kernels for NVFP4? If not, we should remove these template arguments so that we don't compile unnecessary kernels and so we prevent users from accidentally calling them. We should also remove them from the kernel, and modify quantize_helper so it errors out if you attempt something invalid.

Suggested change

template <bool COMPUTE_ACTIVATIONS, typename ParamOP, float (*OP)(float, const ParamOP &)>

Oleg-Goncharov · 2025-10-10

I intentionally left the activation template arguments and all the activation-related logic untouched, so that we can easily enable it if and when it becomes part of the FP4 recipe.
@ptrendx, should we keep it, or should I go ahead and clean up the kernel?
I also didn't want to add any functionality-related modifications to this PR, to avoid overloading it. I would rather do that separately in follow-up PRs, since some parts of the NVFP4 code need to be reviewed or changed anyway.

timmoon10 · 2025-10-11

If we don't support them, we should at least error out if you attempt to run them. Avoiding unnecessary compilations would also be useful so we don't blow up compile time and binary size.

I'm fine deferring this if we want this PR to minimize functional changes, but we should aim to catch more of these errors.

ptrendx · 2025-10-14

@timmoon10 @Oleg-Goncharov Let's minimize changes in this PR and just do the code movement here. Otherwise it will be very hard to properly verify that the functionality was not altered.
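
For reference, the guard discussed in this thread could look roughly like the sketch below: the unsupported combination (fused activation plus NVFP4) is routed to a specialization that throws, so no kernel body is compiled for it. All names here (`Format`, `Nvfp4Launcher`, `launch_fp8`, `quantize_helper_sketch`) are hypothetical, and the integer return values stand in for kernel launches; this is not the actual `quantize_helper` code.

```cpp
#include <stdexcept>

// Hypothetical output formats a quantize helper might dispatch over.
enum class Format { kFp8, kNvfp4 };

// Stand-in for an FP8 kernel launch that supports fused activations.
template <bool kComputeActivations>
int launch_fp8() {
  return kComputeActivations ? 9 : 8;
}

// Primary template is declared only; each case is an explicit specialization.
template <bool kComputeActivations>
struct Nvfp4Launcher;

// Supported case: plain cast to NVFP4 (stand-in for the real kernel launch).
template <>
struct Nvfp4Launcher<false> {
  static int run() { return 4; }
};

// Unsupported case: no kernel is instantiated; the call fails loudly instead.
template <>
struct Nvfp4Launcher<true> {
  static int run() {
    throw std::invalid_argument("fused activation + NVFP4 quantization is not supported");
  }
};

template <bool kComputeActivations>
int quantize_helper_sketch(Format format) {
  switch (format) {
    case Format::kFp8:
      return launch_fp8<kComputeActivations>();
    case Format::kNvfp4:
      return Nvfp4Launcher<kComputeActivations>::run();
  }
  throw std::invalid_argument("unknown format");
}
```

Under these assumptions, `quantize_helper_sketch<true>(Format::kNvfp4)` throws at run time, while the supported paths keep their existing call shape.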

timmoon10

Overall LGTM once we iron out the test failures.

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest update applies formatting-only changes to transformer_engine/common/cast/nvfp4/core_nvfp4.cuh, aligning the file with the project's clang-format configuration (Google-based style, 100-char column limit, 2-space indentation). No functional or behavioral modifications were made—function signatures and error macros were reformatted to improve consistency and readability. This change ensures that the NVFP4 core utilities, which handle FP4 quantization and conversion operations via inline PTX assembly, adhere to the repository's established formatting standards.

Important Files Changed

Filename	Score	Overview
transformer_engine/common/cast/nvfp4/core_nvfp4.cuh	5/5	Formatting-only changes: function signatures and error macros reformatted to match clang-format style; no functional modifications.

Confidence score: 5/5

This PR update is safe to merge with minimal risk, as it contains only formatting changes with no functional modifications.
Score reflects that the changes are purely cosmetic (clang-format enforcement) and cannot introduce bugs, regressions, or behavioral changes; all function logic remains identical.
No files require special attention; this is a straightforward formatting pass to ensure style consistency across the NVFP4 core utilities.

1 file reviewed, no comments

greptile-apps

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. This update addresses formatting inconsistencies in core_nvfp4.cuh by reformatting function signatures to comply with the project's .clang-format style guide (Google-based, 100-character column limit). The changes are purely cosmetic—multi-line function signatures like compute_decoding_scaling_factor and mul_cvt_bf16_to_fp4_4x_with_stochastic_rounding are now consistently split across lines with proper indentation, while compute_global_encode_scaling_factor_FP4 is collapsed to a single line. Additionally, the #else branches that threw errors when FP4_TYPE_SUPPORTED is undefined have been removed, simplifying the code structure. This refactoring is part of the broader PR goal to split large cast kernel headers into smaller, more maintainable files organized by scaling mode. The reformatting improves readability and navigation within device code, aligning with the project's style enforcement strategy (cpplint, clang-format, pre-commit hooks).

Important Files Changed

Filename	Score	Overview
transformer_engine/common/cast/nvfp4/core_nvfp4.cuh	5/5	Formatting-only changes: function signatures reformatted for readability and `#else` error branches removed.

Confidence score: 5/5

This PR is safe to merge with minimal risk as the changes are purely cosmetic and do not modify any logic.
Score reflects formatting-only changes with no impact on compiled code or behavior; all modifications align with project style guidelines.
No files require special attention; this is a straightforward formatting cleanup.

1 file reviewed, no comments

Oleg-Goncharov · 2025-10-24T22:27:41Z

/te-ci

Oleg-Goncharov requested a review from ptrendx October 8, 2025 13:43

Oleg-Goncharov changed the title ~~[common] Refactor: split cast/gated kernels by scaling mode~~ [common] Split cast/gated kernels by scaling mode Oct 8, 2025

ptrendx requested a review from Copilot October 9, 2025 16:03

Copilot AI reviewed Oct 9, 2025

ptrendx requested a review from timmoon10 October 9, 2025 17:20

timmoon10 reviewed Oct 10, 2025

timmoon10 reviewed Oct 15, 2025

Oleg-Goncharov changed the title ~~[common] Split cast/gated kernels by scaling mode~~ [Common] Split cast/gated kernels by scaling mode Oct 16, 2025

Oleg-Goncharov and others added 22 commits October 24, 2025 19:39

Separated gated and dequantize kernels

85f6adc

Signed-off-by: Oleg Goncharov <[email protected]>

Separated quantize, dequantize and gated functions

1d6d713

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

c5687b5

for more information, see https://pre-commit.ci

Fixed lint issues

0a69b0d

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e4126eb

for more information, see https://pre-commit.ci

Fixed persistent lint issues

ddfdd59

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

8820eb5

for more information, see https://pre-commit.ci

Added missing compute capability 10.0 check for Quantize FP8 TMA kernels

43e13e3

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

eb758f8

for more information, see https://pre-commit.ci

Fixed the issue which was added again by autofix

37b135d

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

0ef698f

for more information, see https://pre-commit.ci

Changed files description. Completely removed non-identity activations from the NVFP4 transpose test suite

2308be8

Signed-off-by: Oleg Goncharov <[email protected]>

Removed unsupported template arguments in NVFP4 quantize

8e4d0af

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1024f2a

for more information, see https://pre-commit.ci

Fixed undefined symbol error

64622c2

Signed-off-by: Oleg Goncharov <[email protected]>

Fixed condition

4997bb9

Signed-off-by: Oleg Goncharov <[email protected]>

Fixed CUDA version check

a24f342

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d01870b

for more information, see https://pre-commit.ci

Changed arch conditions order

918308f

Signed-off-by: Oleg Goncharov <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

314dd17

for more information, see https://pre-commit.ci

Fix

7e3412d

Signed-off-by: Oleg Goncharov <[email protected]>

Clean up

8bad907

Signed-off-by: Oleg Goncharov <[email protected]>

Small fix

86dd987

Signed-off-by: Oleg Goncharov <[email protected]>

Oleg-Goncharov force-pushed the pr_cast_kernels_cleanup branch from 9202e6d to 86dd987 Compare October 24, 2025 20:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

2688064

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Oct 24, 2025

Small fix

c0f4a1e

Signed-off-by: Oleg Goncharov <[email protected]>

greptile-apps bot reviewed Oct 24, 2025
