
Conversation

Collaborator

@pggPL pggPL commented Aug 25, 2025

Description

A negative underflow percentage was observed. The root cause is that we counted only the byte value 0 in the fp8_tensor._data tensor, but FP8 also represents -0, encoded as 128 (10000000 in binary, with only the sign bit set), which must be counted as well.

This PR also fixes the computation of the underflow percentage on a single device, where a - (x == 0).sum() term was missing.
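As a hedged, CPU-runnable sketch (the byte values 0 and 128 come from the FP8 encoding; the tensor and variable names are illustrative, not the actual Transformer Engine internals), the difference between the buggy and fixed counts looks like this:

```python
import torch

# Hypothetical FP8 payload bytes: 0 encodes +0.0, 128 (0b10000000) encodes -0.0.
data = torch.tensor([0, 128, 128, 7], dtype=torch.uint8)

# Buggy count: only +0 is treated as zero, so -0 bytes inflate the
# apparent number of nonzero (non-underflowed) elements.
buggy_nonzero = data.numel() - (data == 0).sum().item()

# Fixed count: both encodings of zero are excluded via torch.isin.
zero_vals = torch.tensor([0, 128], dtype=torch.uint8)
fixed_nonzero = data.numel() - torch.isin(data, zero_vals).sum().item()

print(buggy_nonzero, fixed_nonzero)  # 3 1
```

With three of the four bytes encoding zero, the buggy count reports three nonzero elements while the fixed count correctly reports one; subtracting an overcounted nonzero total elsewhere is what produced the negative percentages.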

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits August 25, 2025 08:11
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Collaborator Author

pggPL commented Aug 25, 2025

/te-ci pytorch

Signed-off-by: Pawel Gadzinski <[email protected]>
Collaborator Author

pggPL commented Aug 25, 2025

/te-ci pytorch

@ptrendx ptrendx requested a review from Copilot August 25, 2025 18:44

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes an issue with negative underflow percentage calculations in PyTorch FP8 debugging functionality. The problem was that the code only counted 0 values but missed -0 values (represented as 128 in FP8 format). Additionally, it corrects a missing subtraction in the underflow percentage computation.

Key changes:

  • Updated underflow detection to include both 0 and -0 values using torch.isin with [0, 128]
  • Fixed missing - (x == 0).sum() in percentage calculation
  • Updated tests to use random tensors and adjusted tolerance values

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Reviewed files:

  • transformer_engine/debug/features/utils/stats_computation.py: Updated underflow detection logic to handle both positive and negative zero values
  • tests/pytorch/debug/test_log.py: Modified test to use random tensors and updated MSE tolerance
  • tests/pytorch/debug/test_api_features.py: Fixed test to use dequantized tensor for underflow calculation



Copilot AI Aug 25, 2025


Hard-coding the device as 'cuda' assumes CUDA availability. Consider using the device of the input tensor or making the device configurable to improve portability.


Member


I agree here (although not quite for the stated reason). Consider the non-distributed data-parallel case, where tensors in the same process could come from different devices; we would then need to move zero_values to the device of the first input to torch.isin.
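A minimal sketch of that suggestion, with the function name `count_fp8_zeros` being hypothetical (not an existing Transformer Engine API): build the lookup tensor on the input's own device instead of hard-coding one.

```python
import torch

FP8_NEGATIVE_ZERO = 128  # 0b10000000, the FP8 encoding of -0.0

def count_fp8_zeros(fp8_data: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP8 payload as raw bytes.
    data = fp8_data.view(dtype=torch.uint8)
    # Create the lookup tensor on the input's device, so tensors coming
    # from different devices within one process are handled correctly.
    zero_values = torch.tensor(
        [0, FP8_NEGATIVE_ZERO], device=data.device, dtype=torch.uint8
    )
    return torch.isin(data, zero_values).sum()
```

Because `data.device` is read per call, the same function works for CPU tensors and for tensors on any CUDA device without any up-front per-device setup.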

Collaborator Author


Fixed.

Comment on lines +202 to +206
FP8_NEGATIVE_ZERO = 128  # represents -0.0 in fp8
zero_values = {
device: torch.tensor([0, FP8_NEGATIVE_ZERO], device=device)
for device in [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that it could be a bad idea for every rank to touch every local GPU at import time. At the very least this will incur some unnecessary initialization.

How about we add a helper function that resembles torch.count_nonzero:

def count_nonzero_fp8(fp8_data: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP8 payload as raw bytes.
    fp8_data = fp8_data.view(dtype=torch.uint8)
    # Both 0 (+0.0) and 128 (-0.0) encode zero in FP8.
    zero_vals = torch.tensor([0, 128], device=fp8_data.device, dtype=torch.uint8)
    return fp8_data.numel() - torch.isin(fp8_data, zero_vals).sum()

Calling it would look something like:

lambda x, aux_dict: (
    x.count_nonzero()
    - count_nonzero_fp8(aux_dict...)
)
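As a quick sanity check, the suggested helper can be exercised standalone (the body below repeats the reviewer's sketch; `count_nonzero_fp8` is the proposed name, not an existing API):

```python
import torch

def count_nonzero_fp8(fp8_data: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP8 payload as raw bytes and exclude both
    # zero encodings: 0 (+0.0) and 128 (-0.0).
    fp8_data = fp8_data.view(dtype=torch.uint8)
    zero_vals = torch.tensor([0, 128], device=fp8_data.device, dtype=torch.uint8)
    return fp8_data.numel() - torch.isin(fp8_data, zero_vals).sum()

data = torch.tensor([0, 128, 5, 9], dtype=torch.uint8)
print(count_nonzero_fp8(data))  # tensor(2)
```

Of the four bytes, 0 and 128 both encode zero, so exactly two elements are counted as nonzero.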
