MegaScale discovery is run twice again #8954

Open
tengyifei opened this issue Apr 9, 2025 · 2 comments
Labels: bug (Something isn't working), xla:tpu (TPU specific issues and PRs)

Comments

@tengyifei (Collaborator)

One of the PyTorch/XLA operations used by the Llama model in https://github.com/AI-Hypercomputer/torchprime is triggering JAX MegaScale discovery again. This issue tracks identifying that operation and removing the jax_env_context workaround.
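One way to identify the triggering operation is to wrap whatever entry point performs backend discovery so that its first invocation records the Python call stack. A minimal sketch of that debugging pattern, using a hypothetical `init_backend` stand-in rather than the actual JAX or PyTorch/XLA internals:

```python
import functools
import traceback

def trace_first_call(fn):
    """Wrap fn so its first invocation records the Python call stack.

    Patching a (hypothetical) backend-init entry point with this
    decorator reveals which caller triggered initialization.
    """
    state = {"stack": None}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if state["stack"] is None:
            state["stack"] = traceback.format_stack()
        return fn(*args, **kwargs)

    wrapper.first_call_stack = state
    return wrapper

# Stand-ins for the real initializer and the op under suspicion:
@trace_first_call
def init_backend():
    return "initialized"

def run_pallas_kernel():
    # In the real bug, an op like this implicitly initializes JAX.
    return init_backend()

run_pallas_kernel()
stack = init_backend.first_call_stack["stack"]
assert stack is not None
assert any("run_pallas_kernel" in line for line in stack)
```

In practice one would monkeypatch the actual discovery entry point at process start and inspect the recorded stack to find the offending operation.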

@ysiraichi (Collaborator)

Could you either provide more context or label this issue accordingly?

@tengyifei tengyifei added the bug Something isn't working label Apr 16, 2025
@tengyifei (Collaborator, Author)

PyTorch/XLA uses JAX for Pallas kernels and similar features, so it is easy to accidentally trigger JAX backend initialization. When PyTorch/XLA has already initialized its own TPU backend and JAX then tries to initialize a second one, that second initialization hangs in multi-slice (DCN network) environments. We have seen this bug repeatedly; this is the latest incarnation.

@ysiraichi ysiraichi added the xla:tpu TPU specific issues and PRs label Apr 17, 2025
2 participants