
[PyTorch] Enabling Per-Tensor Current Scaling Recipe #1471

Open · wants to merge 17 commits into main

Conversation


@zhongbozhu zhongbozhu commented Feb 11, 2025

Description

[WIP] Enable per-tensor current scaling recipe, as an alternative to delayed scaling.
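
For context, a minimal pure-PyTorch sketch of the difference between the two recipes (the function name and FP8_MAX constant are illustrative, not the PR's actual kernels):

import torch

FP8_MAX = 448.0  # largest representable magnitude in float8_e4m3fn

def current_scaling_cast(x: torch.Tensor):
    # Per-tensor current scaling: the amax comes from the tensor being cast
    # right now, so each cast needs an extra amax-reduction pass, but the
    # scale is never stale. (Delayed scaling instead derives the scale from
    # an amax history recorded in previous iterations, keeping the cast
    # single-pass at the cost of possibly outdated scales.)
    amax = x.abs().amax().float()
    scale = FP8_MAX / amax.clamp(min=torch.finfo(torch.float32).tiny)
    x_fp8 = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, 1.0 / scale  # scale_inv is kept for dequantization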

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Check in kernels and C++ wrappers needed for FP8 current scaling recipe
  • Wire the C++ interface to python level quantizer API
  • Current scaling recipe setup
  • Expose detailed numerical configs in quantizer API
  • Layer-level tests finished, both single-GPU and multi-GPU
  • Run E2E training

Optional:

  • Zero-tolerance comparison against golden values
  • Support current scaling in layernorm_mlp & the transformer module as well

Unit Tests

C++ Unit Tests

# build C++ tests
TE_PATH=<your_path> ./qa/L0_cppunittest/test.sh
# list all test cases
ctest --test-dir tests/cpp/build --show-only
# test cast
ctest --test-dir tests/cpp/build -R OperatorTest/CastTestSuite.TestCast --output-on-failure
# test cast transpose
ctest --test-dir tests/cpp/build -R OperatorTest/CTTestSuite.TestCastTranspose --output-on-failure

Python Unit Tests

# test the quantizer
pytest tests/pytorch/test_float8tensor.py::TestCurrentScalingFloat8Tensor::test_quantize -s -v

# test the layer forward backward
pytest tests/pytorch/test_float8_current_scaling_exact.py::TestFP8CurrentScalingRecipeLinear::test_fp8_current_scaling_with_linear_module -s -v

pytest tests/pytorch/test_float8_current_scaling_exact.py::TestFP8CurrentScalingRecipeLayerNormLinear::test_fp8_current_scaling_with_layernorm_linear_module -s -v

# distributed layer test
pytest tests/pytorch/distributed/test_numerics.py -s -v

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zhongbozhu zhongbozhu changed the title Enabling Per-Tensor Current Scaling Recipe [PyTorch] Enabling Per-Tensor Current Scaling Recipe Feb 11, 2025
@timmoon10 (Collaborator) left a comment

We should also include FP8 current scaling in our existing numerics tests. Even if they're not as strict as the golden-value tests, they cover more use cases. Try adding the current-scaling recipe in test_numerics.py.
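
A minimal sketch of what that could look like, assuming test_numerics.py parametrizes tests over recipe objects (the recipe class name Float8CurrentScaling and the test shape here are assumptions, not the file's actual structure):

import pytest
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hypothetical parametrization; the real fixtures in test_numerics.py may differ.
@pytest.mark.parametrize(
    "fp8_recipe",
    [recipe.DelayedScaling(), recipe.Float8CurrentScaling()],
)
def test_linear_numerics(fp8_recipe):
    layer = te.Linear(128, 128)
    x = torch.randn(16, 128, device="cuda", requires_grad=True)
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)
    y.sum().backward()
    # a loose-tolerance comparison against a BF16 reference would go here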

Comment on lines +187 to +190
"""Scaling factor to multiply when quantizing to FP8"""
scale: torch.Tensor
"""Max-abs value from last FP8 cast"""
amax: torch.Tensor
Collaborator:

These are not actually necessary, since we can allocate temporary buffers when doing the FP8 cast. That said, I would expect persistent buffers help avoid some CPU overhead from dealing with PyTorch's memory pool. This design trades (possibly) reduced CPU overhead for increased complexity and a larger surface area for bugs.
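
A sketch of the tradeoff being weighed: a quantizer that owns persistent scale/amax buffers reused by every cast, instead of allocating temporaries per call (hypothetical class, pure PyTorch, not the PR's implementation):

import torch

class CurrentScalingQuantizer:
    # Hypothetical sketch: persistent one-element scale/amax buffers that every
    # cast reuses, avoiding a round-trip through PyTorch's caching allocator.
    def __init__(self, device: str = "cuda") -> None:
        self.scale = torch.ones(1, dtype=torch.float32, device=device)
        self.amax = torch.zeros(1, dtype=torch.float32, device=device)

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        # A real kernel would reduce directly into self.amax; copy_ stands in here.
        self.amax.copy_(x.abs().amax())
        self.scale.copy_(448.0 / self.amax.clamp(min=1e-12))  # 448 = e4m3 max
        return (x * self.scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)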

Author:

I did that in the early stage of development for simplicity. It was also useful for debugging.

@@ -21,7 +21,7 @@ using namespace transformer_engine;

namespace {

-template <typename InputType, typename OutputType>
+template <typename InputType, typename OutputType, bool UPDATE_AMAX>
Collaborator:

The template arg probably isn't giving us much benefit compared to passing in a bool. It doesn't affect anything within the inner loop.

Comment on lines +229 to +230
// current tensor scaling
performTestCurrentScaling<InputType, OutputType>(size);
Collaborator:

It would be better to have a separate test suite (perhaps in a different file) to make it easier to debug test failures.



@pytest.mark.skipif(not fp8_available, reason=reason_for_no_fp8)
class TestCurrentScalingFloat8Tensor:
Collaborator:

I don't think it's necessary to duplicate these tests. TestFloat8Tensor is more about testing the basic tensor infrastructure than the quantization itself.

Author:

I removed the test cases about Float8Tensor itself (set_data, etc.) from this class and only kept the quantize/dequantize tests, to avoid repetitive code.

from recipe_numerics_base import TestFP8RecipeLinearBase
from recipe_numerics_base import GetRecipes

TENSOR_DUMP_DIR = pathlib.Path(__file__).resolve().parent.parent.parent / "tensor_dumps"
Collaborator:

We should make this configurable so that we can keep a copy of these tensors in our testing systems. Also, we'll need to make corresponding changes in our CI infrastructure.

Author:

How about using an env variable? What name would you like?
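
For instance (NVTE_TENSOR_DUMP_DIR is a hypothetical name, pending the maintainers' preference):

import os
import pathlib

# Hypothetical env var; falls back to the current in-repo default path.
_DEFAULT_DUMP_DIR = pathlib.Path(__file__).resolve().parent.parent.parent / "tensor_dumps"
TENSOR_DUMP_DIR = pathlib.Path(os.environ.get("NVTE_TENSOR_DUMP_DIR", _DEFAULT_DUMP_DIR))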

float scale = 1.f;
float scale_inv = 1.f;

if (isinf(clamp_amax) || clamp_amax == 0.f) {

Comment:

I have two different cases:

qscale inf -> qscale = finfo(input_dtype).max
qscale nan or amax == 0 -> qscale = 1.0

Author:

@timmoon10 What do you think? When we have scale=inf, should we return max(input_type) or just 1?

Ideally, every upcoming recipe would be able to share the same compute_scale_from_amax function.

I haven't actually triggered this case in the tests, though.
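
For reference, a pure-Python sketch of the semantics under discussion, following the two cases listed above (the special-case policy is exactly the open question, so this is one possible reading, not the PR's final kernel logic):

import math
import torch

def compute_scale_from_amax(amax: float,
                            fp8_dtype=torch.float8_e4m3fn,
                            input_dtype=torch.float32) -> float:
    # Nominal case: scale the amax up to the top of the FP8 range.
    fp8_max = torch.finfo(fp8_dtype).max
    scale = fp8_max / amax if amax != 0.0 else math.inf
    if math.isnan(scale) or amax == 0.0:
        return 1.0  # degenerate input: fall back to the identity scale
    if math.isinf(scale):
        return torch.finfo(input_dtype).max  # clamp, per one option in the thread
    return scale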

@kwyss-nvidia left a comment

It is really helpful to see all the recipe setup in this MR! Thanks Zhongbo.

@@ -743,6 +756,8 @@ def create(
cls = DelayedScalingRecipeState
elif recipe.mxfp8():
cls = MXFP8BlockScalingRecipeState
elif recipe.current_scaled():
Member:

Suggested change
elif recipe.current_scaled():
elif recipe.current():

@timmoon10 (Collaborator), Feb 26, 2025:

I think we should go in the other direction and be more explicit: #1471 (comment)
current is very unclear to me and makes me think it's a proxy class or something asynchronous.

Author:

Sounds good, replaced.

Comment on lines +206 to +211
fp8_gemm_fprop: MMParams, default MMParams.use_split_accumulator=False
used for calculating output y in forward pass
fp8_gemm_dgrad: MMParams, default MMParams.use_split_accumulator=True
used for calculating dgrad in backward pass
fp8_gemm_wgrad: MMParams, default MMParams.use_split_accumulator=True
used for calculating wgrad in backward pass
Member:

Not too fond of the name MMParams (assuming it's matmul params).

For Delayed Scaling, it was a deliberate choice that these knobs lived on the Python side for easy toggling but were not fully exposed in the recipe APIs (as they are here), because these are low-level GEMM details. Is this something that is required as part of the recipe? If these need to be modified for studies, could an env var be a better option?

@zhongbozhu (Author), Feb 27, 2025:

These GEMM configs were indeed hand-picked and tested by the research folks (I set the split accumulator to True everywhere here for simplicity to pass the tests, but in training we can afford to set it to False in the forward pass). Since the research folks want to control the split-accumulator config for each GEMM, env vars would multiply quickly.

I agree that we can later expose some of these knobs at the Python level through some Megatron params, but for now I just write

Comment:

Researchers want control over even low-level details if they affect numerics substantially. I agree with Zhongbo that environment variables don't scale well versus an abstraction like Recipe, which already offers per-layer and per-GEMM control granularity.
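
For reference, a sketch of the per-GEMM granularity being argued for, using the field names from the docstring above (the dataclass layout is illustrative, not the PR's exact definition):

from dataclasses import dataclass, field

@dataclass
class MMParams:
    # Whether the FP8 GEMM periodically promotes partial sums to higher precision.
    use_split_accumulator: bool = True

@dataclass
class Float8CurrentScaling:
    # One knob per GEMM, matching the fprop/dgrad/wgrad split documented above.
    fp8_gemm_fprop: MMParams = field(default_factory=lambda: MMParams(use_split_accumulator=False))
    fp8_gemm_dgrad: MMParams = field(default_factory=MMParams)
    fp8_gemm_wgrad: MMParams = field(default_factory=MMParams)

A recipe object like this can be constructed per layer, which gives the per-layer and per-GEMM granularity that a flat set of environment variables cannot express cleanly.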
