Fix a crash in NeMo 2.0 during module._apply(lambda t: t.cpu()) #1502

guyueh1 · 2025-02-22T17:57:34Z

Description

In NeMo 2.0, during job exit, Lightning calls module._apply(lambda t: t.cpu()) on the GPT model, which triggers an illegal memory access (IMA) error in the TE dequantize kernel. This PR fixes the issue.
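For context, here is a minimal sketch of the call in question on an ordinary CPU module, assuming only stock PyTorch (no Transformer Engine and no GPU). The `nn.Linear` model is a stand-in chosen for illustration; in the failing job the parameters are TE FP8 tensors. As far as I can tell from the PyTorch source, `_apply` stores each result back through the parameter's `data` attribute, which would be how a CPU tensor ends up in a tensor subclass's data setter.

```python
import torch
from torch import nn

# Stand-in model (assumption): a plain CPU module. In the failing NeMo job
# the parameters would be Transformer Engine FP8 tensors instead.
model = nn.Linear(4, 2)

# nn.Module._apply is the private helper behind .cpu(), .cuda(), and .to().
# It calls the function on every parameter and buffer, keeps the result,
# and returns the module itself.
returned = model._apply(lambda t: t.cpu())

devices = {name: p.device.type for name, p in model.named_parameters()}
print(devices)
```

On a plain module this is harmless. The crash discussed in this PR only appears when the parameter type reacts to the write-back by launching a GPU kernel.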

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Guyue Huang <[email protected]>

timmoon10

Can you explain the race condition that this is fixing? From what I can tell, Float8Tensor.cpu should already synchronize the GPU:

TransformerEngine/transformer_engine/pytorch/tensor/quantized_tensor.py

Lines 321 to 323 in e70f913

    def cpu(self, memory_format=torch.preserve_format) -> torch.Tensor:
        # pylint: disable=missing-function-docstring
        return self.dequantize().cpu(memory_format=memory_format)

The actual problem is later in Float8Tensor._set_data:

TransformerEngine/transformer_engine/pytorch/tensor/float8_tensor.py

Line 528 in e70f913

self.data = self._quantizer.quantize(tensor)

We are passing a CPU tensor into the quantize kernel, and I don't think we ever move it to GPU. This doesn't explain why this PR fixes the IMA, so I could have missed something.

If my interpretation is the actual root cause, the quickest fix is to modify Float8Tensor._set_data with:

self.data = self._quantizer.quantize(tensor.to(device=self.device))

More long-term fixes are to handle CPU tensors in the quantize function or to support CPU Float8Tensors.
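The quick fix can be illustrated with a toy model in pure Python. Everything here is hypothetical and written only for illustration: `ToyTensor`, `toy_quantize`, and `ToyFloat8Tensor` are not Transformer Engine classes, and the toy kernel raises an exception where the real kernel would hit an illegal memory access. The sketch only shows the ordering: move the input to the owning tensor's device, then quantize.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyTensor:
    """Hypothetical stand-in for torch.Tensor: values plus a device tag."""
    values: tuple
    device: str

    def to(self, device: str) -> "ToyTensor":
        # Like Tensor.to(device=...): a no-op when already on that device.
        return self if self.device == device else replace(self, device=device)


def toy_quantize(tensor: ToyTensor) -> ToyTensor:
    """Hypothetical GPU-only kernel. It raises on host memory; the real
    kernel would read an invalid address instead."""
    if tensor.device != "cuda":
        raise RuntimeError(f"quantize kernel got a {tensor.device} tensor")
    return ToyTensor(tuple(round(v) for v in tensor.values), tensor.device)


class ToyFloat8Tensor:
    """Hypothetical holder that re-quantizes whatever is assigned to it."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self.data = None

    def set_data_unguarded(self, tensor: ToyTensor) -> None:
        # Passes the input straight to the kernel, wherever it lives.
        self.data = toy_quantize(tensor)

    def set_data_guarded(self, tensor: ToyTensor) -> None:
        # Move the input to this tensor's device first, then quantize.
        self.data = toy_quantize(tensor.to(self.device))


cpu_copy = ToyTensor((0.4, 1.6), device="cpu")  # what t.cpu() hands back
holder = ToyFloat8Tensor()
holder.set_data_guarded(cpu_copy)
print(holder.data)
```

Calling `set_data_unguarded(cpu_copy)` on the same holder raises in this toy, which corresponds to the unguarded path in the discussion above.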

guyueh1 · 2025-02-24T19:54:49Z

@timmoon10
Re: "We are passing a CPU tensor into the quantize kernel, and I don't think we ever move it to GPU."
I checked the device of 'tensor' in debug mode: after torch.cuda.synchronize() it is on the CPU, yet self._quantizer.quantize(tensor) worked fine. So there is presumably some code somewhere that handles CPU tensors or moves them to the GPU.

I also don't fully understand the race condition; the fix just happened to work. I will dig into it and reply here with my findings.

guyueh1 · 2025-02-24T20:06:24Z

Findings:

Adding torch.cuda.synchronize() at the beginning of _set_data works (the original proposal of this PR)
@timmoon10's suggestion (self.data = self._quantizer.quantize(tensor.to(device=self.device))) also works
Still digging into why the IMA happens

Signed-off-by: Guyue Huang <[email protected]>

guyueh1 · 2025-02-24T21:33:17Z

I decided to revert the torch.cuda.synchronize() change because I can't explain why it would work. I applied a different fix that makes sure the tensor is moved to self.device if it is on the CPU, and I confirmed that it fixes the IMA.
@timmoon10 what do you think about the current version?

timmoon10 · 2025-02-25T19:11:09Z

/te-ci pytorch

timmoon10

LGTM

* Fix a crash with module._apply(lambda t: t.cpu())

Signed-off-by: Guyue Huang <[email protected]>

* Add comments

Signed-off-by: Guyue Huang <[email protected]>

* Make sure tensor is moved to dst device before quantizer quantizes

Signed-off-by: Guyue Huang <[email protected]>

---------

Signed-off-by: Guyue Huang <[email protected]>
Co-authored-by: Tim Moon <[email protected]>

guyueh1 and others added 3 commits February 21, 2025 16:27

Fix a crash with module._apply(lambda t: t.cpu())

fd76d53

Signed-off-by: Guyue Huang <[email protected]>

Add comments

f6ccf80

Signed-off-by: Guyue Huang <[email protected]>

Merge branch 'main' into fix_crash_with_nemo_lightning

e70f913

timmoon10 reviewed Feb 24, 2025

Make sure tensor is moved to dst device before quantizer quantizes

2e35ba6

Signed-off-by: Guyue Huang <[email protected]>

Merge branch 'main' into fix_crash_with_nemo_lightning

05415c1

guyueh1 requested a review from timmoon10 February 25, 2025 15:58

Merge branch 'main' into fix_crash_with_nemo_lightning

5df1dfe

timmoon10 approved these changes Feb 25, 2025

timmoon10 merged commit 9351a17 into NVIDIA:main Feb 25, 2025
11 of 12 checks passed
