[PyTorch Debug] Fix issue with negative underflow. #2107
base: main
Conversation
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
/te-ci pytorch
Signed-off-by: Pawel Gadzinski <[email protected]>
/te-ci pytorch
Pull Request Overview
This PR fixes an issue with negative underflow percentage calculations in PyTorch FP8 debugging functionality. The problem was that the code only counted `0` values but missed `-0` values (represented as `128` in FP8 format). Additionally, it corrects a missing subtraction in the underflow percentage computation.
Key changes:
- Updated underflow detection to include both `0` and `-0` values using `torch.isin` with `[0, 128]`
- Fixed missing `- (x == 0).sum()` in percentage calculation
- Updated tests to use random tensors and adjusted tolerance values
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `transformer_engine/debug/features/utils/stats_computation.py` | Updated underflow detection logic to handle both positive and negative zero values |
| `tests/pytorch/debug/test_log.py` | Modified test to use random tensors and updated MSE tolerance |
| `tests/pytorch/debug/test_api_features.py` | Fixed test to use dequantized tensor for underflow calculation |
```python
            )
            == 0
        ),
        zero_values,
```
Hard-coding the device as 'cuda' assumes CUDA availability. Consider using the device of the input tensor or making the device configurable to improve portability.
I agree here (although not quite for the stated reason) - consider the non-distributed data-parallel case, where tensors in the same process could come from different devices. Then we would need to move `zero_values` to the device of the first input to `torch.isin`.
Fixed.
Signed-off-by: Pawel Gadzinski <[email protected]>
for more information, see https://pre-commit.ci
```python
FP8_NEGATIVE_ZERO = 128  # represents -0.0 in fp8
zero_values = {
    device: torch.tensor([0, FP8_NEGATIVE_ZERO], device=device)
    for device in [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
}
```
I'm worried that it could be a bad idea for every rank to touch every local GPU at import time. At the very least this will incur some unnecessary initialization.
How about we add a helper function that resembles `torch.count_nonzero`:

```python
def count_nonzero_fp8(fp8_data: torch.Tensor) -> torch.Tensor:
    fp8_data = fp8_data.view(dtype=torch.uint8)
    zero_vals = torch.tensor([0, 128], device=fp8_data.device, dtype=torch.uint8)
    return fp8_data.numel() - torch.isin(fp8_data, zero_vals).sum()
```
Calling it would look something like:

```python
lambda x, aux_dict: (
    x.count_nonzero()
    - count_nonzero_fp8(aux_dict...)
)
```
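To make the motivation for such a helper concrete, here is a small illustrative check (my own sketch, with plain `uint8` data standing in for FP8 bytes) showing why a raw-byte `count_nonzero` overcounts:

```python
import torch

# Byte 128 (0b10000000) encodes -0.0 in FP8, but is nonzero as a raw byte.
raw = torch.tensor([0, 128, 3], dtype=torch.uint8)

naive = raw.count_nonzero().item()  # counts 128 as nonzero

# FP8-aware count: treat both byte 0 and byte 128 as zero.
zero_vals = torch.tensor([0, 128], dtype=torch.uint8, device=raw.device)
fp8_aware = raw.numel() - torch.isin(raw, zero_vals).sum().item()

print(naive, fp8_aware)  # 2 1
```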
Description
An issue with a negative percentage of underflows was observed. The reason is that we count only `0` in the `fp8_tensor._data` tensor, but we also need to take `-0` into account, represented by `128` (`10000000` in binary: only the sign bit is 1).

This PR also fixes an issue with the computation of the percentage of underflows on one device - I forgot to add `- (x == 0).sum()`.
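A quick illustration of why the raw-byte path needs special handling while float comparisons do not (a minimal sketch, not code from this PR):

```python
import torch

# In float semantics, -0.0 compares equal to 0.0, so (x == 0) catches both zeros...
x = torch.tensor([0.0, -0.0, 1.0])
print((x == 0).sum().item())  # 2

# ...but at the FP8 byte level, +0 is 0b00000000 while -0 is 0b10000000 (128),
# so checking the raw bytes against 0 alone misses negative zero.
raw = torch.tensor([0b00000000, 0b10000000], dtype=torch.uint8)
print((raw == 0).sum().item())  # 1
```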
Type of change
Checklist: