
E2E accuracy "RuntimeError: Eager run failed" with PyTorch 2.5 #1997

Open
pbchekin opened this issue Aug 23, 2024 · 2 comments
@pbchekin
Contributor

The following E2E tests fail:

  • GPTJForCausalLM
  • GPTJForQuestionAnswering

for at least the following scenarios:

  • E2E accuracy huggingface, training, float32, LTS, PyTorch 2.5
  • E2E accuracy huggingface, training, float32, Rolling, PyTorch 2.5

Error:

RuntimeError: XPU out of memory, please use `empty_cache` to release all unoccupied cached memory.
...
RuntimeError: Eager run failed

Note that with IPEX the error is slightly different:

RuntimeError: XPU out of memory. Tried to allocate 256.00 MiB (GPU 0; 64.00 GiB total capacity; 63.22 GiB already allocated; 63.70 GiB reserved in total by PyTorch)
...
NotImplementedError: Eager model failed to run
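
For reference, a minimal sketch of the mitigation the first error message points to: calling the XPU allocator's `empty_cache` between runs. This assumes the upstream `torch.xpu` API in PyTorch 2.5; it only releases unoccupied cached blocks, so it will not help if the models genuinely exceed the 64 GiB device capacity.

```python
import torch

# Sketch only: release unoccupied cached memory on the XPU allocator,
# as suggested by the "XPU out of memory" error above (upstream torch.xpu API).
if torch.xpu.is_available():
    torch.xpu.synchronize()   # make sure pending kernels have finished
    torch.xpu.empty_cache()   # return cached, unoccupied blocks to the driver
```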
@vlad-penkin added the "tests: e2e", "bug" (Something isn't working), and "ci" labels on Aug 23, 2024
@vlad-penkin
Contributor

@riverliuintel, @Stonepia do you observe the same error?

@Stonepia
Contributor

Hi @vlad-penkin, yes, we see the same error, and we have a tracker at intel/torch-xpu-ops#701.

Also, these tests likely did not hit OOM before because some XPU backend kernels were not implemented yet, so those ops fell back to CPU kernels. Now that more kernels run natively on the XPU, device memory may no longer be sufficient, hence the OOM.
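
One way to check that hypothesis (a sketch, not from this issue): log XPU device-memory usage around a single training step and compare runs before and after the new kernels landed; a jump in allocated device memory would match the CPU-fallback explanation. The `torch.xpu` memory helpers below mirror `torch.cuda` and are guarded with `getattr`, since their availability may differ between PyTorch builds.

```python
import torch

# Hypothetical helper (not from the issue): report XPU device-memory usage.
def report_xpu_memory(tag: str) -> None:
    if not torch.xpu.is_available():
        return
    allocated = getattr(torch.xpu, "memory_allocated", lambda: float("nan"))()
    reserved = getattr(torch.xpu, "memory_reserved", lambda: float("nan"))()
    print(f"[{tag}] allocated={allocated / 2**20:.1f} MiB, reserved={reserved / 2**20:.1f} MiB")

report_xpu_memory("before step")
# ... run one training step of GPTJForCausalLM here ...
report_xpu_memory("after step")
```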
