PR #23347: [gpu] Allow explicitly setting slice_index in se_gpu_pjrt_client #24503

Merged 1 commit on Apr 2, 2025

copybara-service[bot]

PR #23347: [gpu] Allow explicitly setting slice_index in se_gpu_pjrt_client

Imported from GitHub PR #23347

Allows overriding the slice index used by se_gpu_pjrt_client.

More explicit control over which slice a device ends up in is desirable:

  • Various parts of the ecosystem equate slices with "devices communicating via fast interconnect". With the arrival of NVL72 we want devices managed by multiple hosts to form a single slice.
  • For debugging purposes it can be useful to treat devices on the same host (managed in separate processes) as belonging to different slices. For example, Orbax's local checkpointing presumes the existence of at least two slices, so overriding the slice index will allow us to test local checkpointing on a single host.
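
To make the two use cases concrete, here is a minimal Python sketch of how slice assignment could behave with and without an explicit override. It is an illustration only, not the code in se_gpu_pjrt_client: `Process`, `assign_slice_indices`, and the `boot_id` strings are hypothetical names, and the default rule (one slice per distinct boot id) is an assumption about the fallback behaviour.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Process:
    """Hypothetical stand-in for one participating process."""
    process_id: int
    boot_id: str                       # identifies the host the process runs on
    slice_index: Optional[int] = None  # explicit override, if provided


def assign_slice_indices(processes: List[Process]) -> Dict[int, int]:
    """Map each process_id to a slice index.

    An explicit slice_index wins. Otherwise processes are grouped by boot_id
    (assumed default), with indices handed out in order of first appearance.
    Mixing explicit and derived indices is not reconciled in this sketch.
    """
    boot_id_to_slice: Dict[str, int] = {}
    result: Dict[int, int] = {}
    for p in processes:
        if p.slice_index is not None:
            result[p.process_id] = p.slice_index
            continue
        if p.boot_id not in boot_id_to_slice:
            boot_id_to_slice[p.boot_id] = len(boot_id_to_slice)
        result[p.process_id] = boot_id_to_slice[p.boot_id]
    return result


# Default: two processes on host-a share a slice, host-b gets its own.
default = assign_slice_indices(
    [Process(0, "host-a"), Process(1, "host-a"), Process(2, "host-b")]
)

# NVL72-style: processes on different hosts are forced into one slice.
multi_host = assign_slice_indices(
    [Process(0, "host-a", slice_index=0), Process(1, "host-b", slice_index=0)]
)

# Debugging: two processes on the same host are split into two slices.
single_host = assign_slice_indices(
    [Process(0, "host-a", slice_index=0), Process(1, "host-a", slice_index=1)]
)
```

The last case corresponds to the single-host setup described for testing Orbax's local checkpointing, which needs at least two slices.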

(Companion PR in JAX: jax-ml/jax#26906)
Copybara import of the project:

--
8d16790 by Georg Stefan Schmid [email protected]:

[gpu] Allow overriding XLA slice_index

Merging this change closes #23347

FUTURE_COPYBARA_INTEGRATE_REVIEW=#23347 from gspschmid:gschmid/xla-override-boot-id 8d16790

PiperOrigin-RevId: 743085978
copybara-service bot merged commit c3087e0 into main on Apr 2, 2025
1 check passed
copybara-service bot deleted the test_743055829 branch on April 2, 2025 12:01