-
Couldn't load subscription status.
- Fork 532
[Common] Split cast/gated kernels by scaling mode #2248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Common] Split cast/gated kernels by scaling mode #2248
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request refactors the large cast_kernels.cuh and cast_gated_kernels.cuh files into smaller, more organized header files structured by scaling mode. This improves code maintainability, readability, and navigation by creating specialized headers for different quantization and scaling implementations.
- Breaks down monolithic headers into focused, scaling-mode-specific files
- Reorganizes code structure without modifying functionality or behavior
- Creates dispatcher files to coordinate between different scaling implementations
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| transformer_engine/common/util/cast_kernels.cuh | Removed all content - entire file deleted as part of refactoring |
| transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh | NVFP4 quantize with transpose functionality, updated file path and namespacing |
| transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh | New file containing NVFP4-specific quantization kernels |
| transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh | New file containing NVFP4 dequantization functionality |
| transformer_engine/common/cast/nvfp4/core_nvfp4.cuh | New file with core NVFP4 utility functions and device operations |
| transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh | New file containing MXFP8 quantization kernels |
| transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh | MXFP8 gated operations, significantly reduced from original gated kernels file |
| transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh | New file containing MXFP8 dequantization functionality |
| transformer_engine/common/cast/fp8/quantize_fp8.cuh | New file containing FP8 quantization kernels |
| transformer_engine/common/cast/fp8/gated_fp8.cuh | New file containing FP8 gated operations |
| transformer_engine/common/cast/fp8/dequantize_fp8.cuh | New file containing FP8 dequantization functionality |
| transformer_engine/common/cast/dispatch/quantize.cuh | New dispatcher file coordinating quantization across scaling modes |
| transformer_engine/common/cast/dispatch/gated.cuh | New dispatcher file coordinating gated operations across scaling modes |
| transformer_engine/common/cast/dispatch/dequantize.cuh | New dispatcher file coordinating dequantization across scaling modes |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh
Outdated
Show resolved
Hide resolved
| // This kernel supports only two scaling cases: | ||
| // 1. r16c0 - Rowwise NVFP4 | ||
| // 2. r16c32 - Rowwise NVFP4 AND Colwise MXFP8 | ||
| template <bool COMPUTE_ACTIVATIONS, typename ParamOP, float (*OP)(float, const ParamOP &)> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually support fused activation-cast kernels for NVFP4? If not, we should remove these template arguments so that we don't compile unnecessary kernels and so we prevent users from accidentally calling them. We should also remove them from the kernel, and modify quantize_helper so it errors out if you attempt something invalid.
| template <bool COMPUTE_ACTIVATIONS, typename ParamOP, float (*OP)(float, const ParamOP &)> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally left activation template arguments and all the activation related logic untouched, so we can easily enable it when/if it becomes the part of the FP4 recipe.
@ptrendx, should we keep it, or I just go ahead and clean up the kernel?
I also didn't want to add any functionality related modifications to this PR to not overwhelm it, and to do it separately in a following PRs. Since there are some parts of the NVFP4 code that need to be reviewed/changed anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't support them, we should at least error out if you attempt to run them. Avoiding unnecessary compilations would also be useful so we don't blow up compile time and binary size.
I'm fine deferring this if we want this PR to minimize functional changes, but we should aim to catch more of these errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timmoon10 @Oleg-Goncharov Let's minimize changes in this PR and just do the code movement here. Otherwise it will be very hard to properly review if the functionality was not altered.
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh
Outdated
Show resolved
Hide resolved
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM once we iron out the test failures.
Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
…s from the NVFP4 transpose test suite Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
9202e6d to
86dd987
Compare
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest update applies formatting-only changes to transformer_engine/common/cast/nvfp4/core_nvfp4.cuh, aligning the file with the project's clang-format configuration (Google-based style, 100-char column limit, 2-space indentation). No functional or behavioral modifications were made—function signatures and error macros were reformatted to improve consistency and readability. This change ensures that the NVFP4 core utilities, which handle FP4 quantization and conversion operations via inline PTX assembly, adhere to the repository's established formatting standards.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/cast/nvfp4/core_nvfp4.cuh | 5/5 | Formatting-only changes: function signatures and error macros reformatted to match clang-format style; no functional modifications. |
Confidence score: 5/5
- This PR update is safe to merge with minimal risk, as it contains only formatting changes with no functional modifications.
- Score reflects that the changes are purely cosmetic (clang-format enforcement) and cannot introduce bugs, regressions, or behavioral changes; all function logic remains identical.
- No files require special attention; this is a straightforward formatting pass to ensure style consistency across the NVFP4 core utilities.
1 file reviewed, no comments
Signed-off-by: Oleg Goncharov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. This update addresses formatting inconsistencies in core_nvfp4.cuh by reformatting function signatures to comply with the project's .clang-format style guide (Google-based, 100-character column limit). The changes are purely cosmetic—multi-line function signatures like compute_decoding_scaling_factor and mul_cvt_bf16_to_fp4_4x_with_stochastic_rounding are now consistently split across lines with proper indentation, while compute_global_encode_scaling_factor_FP4 is collapsed to a single line. Additionally, the #else branches that threw errors when FP4_TYPE_SUPPORTED is undefined have been removed, simplifying the code structure. This refactoring is part of the broader PR goal to split large cast kernel headers into smaller, more maintainable files organized by scaling mode. The reformatting improves readability and navigation within device code, aligning with the project's style enforcement strategy (cpplint, clang-format, pre-commit hooks).
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/cast/nvfp4/core_nvfp4.cuh | 5/5 | Formatting-only changes: function signatures reformatted for readability and #else error branches removed. |
Confidence score: 5/5
- This PR is safe to merge with minimal risk as the changes are purely cosmetic and do not modify any logic.
- Score reflects formatting-only changes with no impact on compiled code or behavior; all modifications align with project style guidelines.
- No files require special attention; this is a straightforward formatting cleanup.
1 file reviewed, no comments
|
/te-ci |
Description
Breaks up the large
cast_kernels.cuhandcast_gated_kernels.cuhinto smaller headers organized by scaling mode.No functional or behavior changes: code is moved, not modified. This improves structure, readability, and maintainability (easier to navigate/extend specific scaling paths). Build includes/exports updated accordingly; tests unaffected.
Fixes # (issue)
Type of change
Changes
cast_kernels.cuhandcast_gated_kernels.cuhinto smaller headers organized by scaling mode.Checklist: