
E2E accuracy "RuntimeError: Eager run failed" with PyTorch 2.5 #1997

Open
pbchekin opened this issue Aug 23, 2024 · 2 comments
@pbchekin
Contributor

The following E2E tests fail:

  • GPTJForCausalLM
  • GPTJForQuestionAnswering

for at least the following scenarios:

  • E2E accuracy huggingface, training, float32, LTS, PyTorch 2.5
  • E2E accuracy huggingface, training, float32, Rolling, PyTorch 2.5

Error:

RuntimeError: XPU out of memory, please use `empty_cache` to release all unoccupied cached memory.
...
RuntimeError: Eager run failed

Note that with IPEX the error is slightly different:

RuntimeError: XPU out of memory. Tried to allocate 256.00 MiB (GPU 0; 64.00 GiB total capacity; 63.22 GiB already allocated; 63.70 GiB reserved in total by PyTorch)
...
NotImplementedError: Eager model failed to run
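
For reference, a minimal sketch of the mitigation the first error message points to: calling the XPU allocator's `empty_cache` between runs. This assumes the upstream `torch.xpu` API in PyTorch 2.5; it only releases unoccupied cached blocks, so it will not help if the models genuinely exceed the 64 GiB device capacity.

```python
import torch

# Sketch only: release unoccupied cached memory on the XPU allocator,
# as suggested by the "XPU out of memory" error above (upstream torch.xpu API).
if torch.xpu.is_available():
    torch.xpu.synchronize()   # make sure pending kernels have finished
    torch.xpu.empty_cache()   # return cached, unoccupied blocks to the driver
```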
@vlad-penkin added the "tests: e2e", "bug" (Something isn't working), and "ci" labels on Aug 23, 2024
@vlad-penkin
Contributor

@riverliuintel, @Stonepia do you observe the same error?

@Stonepia
Contributor

Hi @vlad-penkin, yes, we see the same error, and we have a tracker at intel/torch-xpu-ops#701.

Also, these tests likely did not hit OOM before because some XPU backend kernels were not implemented yet, so those ops fell back to CPU kernels. Now that more kernels run natively on the XPU, device memory may no longer be sufficient, hence the OOM.
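
One way to check that hypothesis (a sketch, not from this issue): log XPU device-memory usage around a single training step and compare runs before and after the new kernels landed; a jump in allocated device memory would match the CPU-fallback explanation. The `torch.xpu` memory helpers below mirror `torch.cuda` and are guarded with `getattr`, since their availability may differ between PyTorch builds.

```python
import torch

# Hypothetical helper (not from the issue): report XPU device-memory usage.
def report_xpu_memory(tag: str) -> None:
    if not torch.xpu.is_available():
        return
    allocated = getattr(torch.xpu, "memory_allocated", lambda: float("nan"))()
    reserved = getattr(torch.xpu, "memory_reserved", lambda: float("nan"))()
    print(f"[{tag}] allocated={allocated / 2**20:.1f} MiB, reserved={reserved / 2**20:.1f} MiB")

report_xpu_memory("before step")
# ... run one training step of GPTJForCausalLM here ...
report_xpu_memory("after step")
```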
