PR #23619: Fix the EmitSort checking after enabling NVLS and user buffer #23935

copybara-service · 2025-03-19T16:06:54Z

PR #23619: Fix the EmitSort checking after enabling NVLS and user buffer

Imported from GitHub PR #23619

NVIDIA reported a bug: running the Midjourney model triggers an XLA error after NVLS and user buffers are enabled by setting NCCL_NVLS_ENABLE=1 and --xla_gpu_enable_nccl_user_buffers=true:

jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure (external/xla/xla/service/gpu/ir_emitter_unnested.cc:1676) LayoutUtil::LayoutsInShapesEqual(keys_shape, sort->operand(i)->shape()).
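
For anyone trying to reproduce the setup, the two settings above could be applied from Python roughly as follows. This is a minimal sketch, assuming the XLA flag is passed through the `XLA_FLAGS` environment variable and that both variables are set before JAX initializes its GPU backend. The model code that triggers the failure is not part of this PR and is not shown.

```python
import os

# Assumption: these must be set before JAX initializes the GPU backend,
# so this snippet should run before the first `import jax` in the process.
os.environ["NCCL_NVLS_ENABLE"] = "1"

# Append to any XLA flags already present instead of overwriting them.
existing_flags = os.environ.get("XLA_FLAGS", "")
os.environ["XLA_FLAGS"] = (
    existing_flags + " --xla_gpu_enable_nccl_user_buffers=true"
).strip()
```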

This is because, after enabling NVLS and user buffers, one of the operands of the sort operation comes from a different memory space (the user buffer), and the previous LayoutsInShapesEqual check is too strict to pass because it also requires the operands to be in the same memory space.

This MR weakens the sort layout check: operands no longer have to be in the same memory space, as long as they are all on the device.
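
The difference between the strict and the relaxed comparison can be illustrated with a toy model. This is only a sketch of the idea, not XLA code: `ToyLayout`, `layouts_equal`, and the memory-space numbering are hypothetical names invented for illustration, and the "all on the device" condition is not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyLayout:
    """Hypothetical stand-in for a layout: dimension order plus memory space."""
    minor_to_major: tuple
    memory_space: int = 0  # 0 = default device memory in this toy model


def layouts_equal(a, b, ignore_memory_space=False):
    """Compare two toy layouts, optionally ignoring the memory space field."""
    if ignore_memory_space:
        a = replace(a, memory_space=0)
        b = replace(b, memory_space=0)
    return a == b


# Same dimension order, but the second operand lives in another memory
# space (standing in for a user buffer).
keys = ToyLayout(minor_to_major=(1, 0), memory_space=0)
values = ToyLayout(minor_to_major=(1, 0), memory_space=1)

strict = layouts_equal(keys, values)                             # old behavior
relaxed = layouts_equal(keys, values, ignore_memory_space=True)  # new behavior
```

In this model the strict comparison rejects the operand pair solely because of the memory space, while the relaxed comparison accepts it and still rejects operands whose dimension order differs.
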
Copybara import of the project:

--
8265352 by Chenhao Jiang [email protected]:

Making the sort layout check weaker as operands do not have to be in the same memory space as long as they are all on the device.

Merging this change closes #23619

FUTURE_COPYBARA_INTEGRATE_REVIEW=#23619 from serach24:chenhao/fix_nvsl_check_failed 8265352

PiperOrigin-RevId: 738419324

copybara-service bot force-pushed the test_738419324 branch 2 times, most recently from d24947d to f5741c4 Compare March 19, 2025 17:31

copybara-service bot force-pushed the test_738419324 branch from f5741c4 to 6d4f1d5 Compare March 19, 2025 18:03
