
Conversation

Collaborator

@pggPL pggPL commented Aug 25, 2025

Description

A negative underflow percentage was observed. The root cause is that we counted only the byte value 0 in the fp8_tensor._data tensor, but FP8 also represents -0, encoded as 128 (10000000 in binary, with only the sign bit set), which must be counted as well.

This PR also fixes the computation of the underflow percentage on a single device, where a - (x == 0).sum() term was missing.
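As a hedged, CPU-runnable sketch (the byte values 0 and 128 come from the FP8 encoding; the tensor and variable names are illustrative, not the actual Transformer Engine internals), the difference between the buggy and fixed counts looks like this:

```python
import torch

# Hypothetical FP8 payload bytes: 0 encodes +0.0, 128 (0b10000000) encodes -0.0.
data = torch.tensor([0, 128, 128, 7], dtype=torch.uint8)

# Buggy count: only +0 is treated as zero, so -0 bytes inflate the
# apparent number of nonzero (non-underflowed) elements.
buggy_nonzero = data.numel() - (data == 0).sum().item()

# Fixed count: both encodings of zero are excluded via torch.isin.
zero_vals = torch.tensor([0, 128], dtype=torch.uint8)
fixed_nonzero = data.numel() - torch.isin(data, zero_vals).sum().item()

print(buggy_nonzero, fixed_nonzero)  # 3 1
```

With three of the four bytes encoding zero, the buggy count reports three nonzero elements while the fixed count correctly reports one; subtracting an overcounted nonzero total elsewhere is what produced the negative percentages.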

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits August 25, 2025 08:11
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Collaborator Author

pggPL commented Aug 25, 2025

/te-ci pytorch

Signed-off-by: Pawel Gadzinski <[email protected]>
Collaborator Author

pggPL commented Aug 25, 2025

/te-ci pytorch

@ptrendx ptrendx requested a review from Copilot August 25, 2025 18:44

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes an issue with negative underflow percentage calculations in PyTorch FP8 debugging functionality. The problem was that the code only counted 0 values but missed -0 values (represented as 128 in FP8 format). Additionally, it corrects a missing subtraction in the underflow percentage computation.

Key changes:

  • Updated underflow detection to include both 0 and -0 values using torch.isin with [0, 128]
  • Fixed missing - (x == 0).sum() in percentage calculation
  • Updated tests to use random tensors and adjusted tolerance values

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Reviewed files:

  • transformer_engine/debug/features/utils/stats_computation.py: Updated underflow detection logic to handle both positive and negative zero values
  • tests/pytorch/debug/test_log.py: Modified test to use random tensors and updated MSE tolerance
  • tests/pytorch/debug/test_api_features.py: Fixed test to use dequantized tensor for underflow calculation



Copilot AI Aug 25, 2025


Hard-coding the device as 'cuda' assumes CUDA availability. Consider using the device of the input tensor or making the device configurable to improve portability.


Member


I agree here (although not quite for the stated reason). Consider the non-distributed data-parallel case, where tensors in the same process could come from different devices; we would then need to move zero_values to the device of the first input to torch.isin.
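A minimal sketch of that suggestion, with the function name `count_fp8_zeros` being hypothetical (not an existing Transformer Engine API): build the lookup tensor on the input's own device instead of hard-coding one.

```python
import torch

FP8_NEGATIVE_ZERO = 128  # 0b10000000, the FP8 encoding of -0.0

def count_fp8_zeros(fp8_data: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP8 payload as raw bytes.
    data = fp8_data.view(dtype=torch.uint8)
    # Create the lookup tensor on the input's device, so tensors coming
    # from different devices within one process are handled correctly.
    zero_values = torch.tensor(
        [0, FP8_NEGATIVE_ZERO], device=data.device, dtype=torch.uint8
    )
    return torch.isin(data, zero_values).sum()
```

Because `data.device` is read per call, the same function works for CPU tensors and for tensors on any CUDA device without any up-front per-device setup.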

Collaborator Author


Fixed.

Comment on lines +202 to +206
FP8_NEGATIVE_ZERO = 128  # represents -0.0 in fp8
zero_values = {
device: torch.tensor([0, FP8_NEGATIVE_ZERO], device=device)
for device in [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that it could be a bad idea for every rank to touch every local GPU at import time. At the very least this will incur some unnecessary initialization.

How about we add a helper function that resembles torch.count_nonzero:

def count_nonzero_fp8(fp8_data: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP8 payload as raw bytes.
    fp8_data = fp8_data.view(dtype=torch.uint8)
    # Both 0 (+0.0) and 128 (-0.0) encode zero in FP8.
    zero_vals = torch.tensor([0, 128], device=fp8_data.device, dtype=torch.uint8)
    return fp8_data.numel() - torch.isin(fp8_data, zero_vals).sum()

Calling it would look something like:

lambda x, aux_dict: (
    x.count_nonzero()
    - count_nonzero_fp8(aux_dict...)
)
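As a quick sanity check, the suggested helper can be exercised standalone (the body below repeats the reviewer's sketch; `count_nonzero_fp8` is the proposed name, not an existing API):

```python
import torch

def count_nonzero_fp8(fp8_data: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP8 payload as raw bytes and exclude both
    # zero encodings: 0 (+0.0) and 128 (-0.0).
    fp8_data = fp8_data.view(dtype=torch.uint8)
    zero_vals = torch.tensor([0, 128], device=fp8_data.device, dtype=torch.uint8)
    return fp8_data.numel() - torch.isin(fp8_data, zero_vals).sum()

data = torch.tensor([0, 128, 5, 9], dtype=torch.uint8)
print(count_nonzero_fp8(data))  # tensor(2)
```

Of the four bytes, 0 and 128 both encode zero, so exactly two elements are counted as nonzero.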
