
Conversation

casteryh (Contributor) commented Oct 25, 2025

Summary:
This change adds fallback support for RDMA operations when mlx5dv (the Mellanox device-specific extensions) is not available. It modifies the queue pair creation logic to conditionally use either extended mlx5dv-based queue pairs (when supported) or standard ibverbs queue pairs (as a fallback). The pt_cuda_alloc flag is updated to require mlx5dv support, since mlx5dv is necessary for merging memory segments when using PyTorch's CUDA allocator. The change adds a new is_extended parameter to control whether to create extended or standard queue pairs at runtime.

Also adds an env variable, MONARCH_RDMA_MLX5DV_DISABLED, to exercise the new code path on a dev machine (superseded by the configuration-based approach described below).
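
To make the fallback concrete, here is a minimal, self-contained Rust sketch of the selection logic described above; all names (QueuePair, DeviceCaps, create_queue_pair) are illustrative stand-ins, not monarch's actual types or API.

```rust
// Minimal sketch of the mlx5dv-or-standard fallback; names are illustrative.
#[derive(Debug, PartialEq)]
enum QueuePair {
    Mlx5dvExtended,  // extended QP created via mlx5dv verbs
    StandardIbverbs, // plain ibverbs QP, available on any RDMA device
}

struct DeviceCaps {
    // In the real code this would come from a capability probe such as
    // mlx5dv_is_supported(); hard-coded here for illustration.
    mlx5dv_supported: bool,
}

fn create_queue_pair(caps: &DeviceCaps) -> QueuePair {
    if caps.mlx5dv_supported {
        QueuePair::Mlx5dvExtended // preferred: enables memory segment merging
    } else {
        QueuePair::StandardIbverbs // fallback: works without Mellanox extensions
    }
}

fn main() {
    let caps = DeviceCaps { mlx5dv_supported: false };
    assert_eq!(create_queue_pair(&caps), QueuePair::StandardIbverbs);
}
```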

Changes in Latest Revision

Based on reviewer feedback, the implementation has been updated with a cleaner, configuration-based approach:

API Changes:

  • Replaced the uint8_t is_extended parameter with an rdma_qp_type_t enum in the C API
  • Added an RdmaQpType enum to Rust with three variants (sketched after this list):
    • Auto: Auto-detect based on device capabilities (default)
    • Standard: Force standard ibverbs queue pairs
    • Mlx5dv: Force mlx5dv extended queue pairs
  • Added a qp_type field to IbverbsConfig for explicit QP type control
  • The C code uses a switch statement with a proper default case for unknown types
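
A rough sketch of what the Rust side of this API could look like, assuming derive-based defaults; the real IbverbsConfig has more fields than shown here:

```rust
// Sketch of the Rust-side QP type selection API described above.
// Only the parts relevant to this change are shown.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
enum RdmaQpType {
    #[default]
    Auto,     // auto-detect based on device capabilities
    Standard, // force standard ibverbs queue pairs
    Mlx5dv,   // force mlx5dv extended queue pairs
}

#[derive(Debug, Default)]
struct IbverbsConfig {
    qp_type: RdmaQpType,
    // ...other ibverbs settings elided
}
```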

Architecture:

  • Rust resolves Auto mode before calling C, making it the single source of truth for detection (see the sketch after this list)
  • The C function becomes a pure executor with no capability detection logic
  • Removed the environment variable approach in favor of configuration
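
Under that division of labor, the Auto resolution on the Rust side might look like the following sketch, reusing the RdmaQpType sketch above; resolve_qp_type and the device_supports_mlx5dv flag are hypothetical names, with the flag standing in for whatever capability probe the crate performs (e.g. mlx5dv_is_supported via FFI):

```rust
// Hypothetical sketch: collapse Auto to a concrete QP type before the FFI
// call, so the C layer only ever receives Standard or Mlx5dv.
fn resolve_qp_type(requested: RdmaQpType, device_supports_mlx5dv: bool) -> RdmaQpType {
    match requested {
        RdmaQpType::Auto if device_supports_mlx5dv => RdmaQpType::Mlx5dv,
        RdmaQpType::Auto => RdmaQpType::Standard,
        explicit => explicit, // honor an explicit choice from IbverbsConfig
    }
}
```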

Testing:

  • Added a setup_with_qp_type() helper function to the test utilities
  • Added 4 new unit tests to verify the standard QP fallback path (an illustrative shape is sketched after this list):
    • test_rdma_read_into_standard_qp (CPU-to-CPU)
    • test_rdma_write_from_standard_qp (CPU-to-CPU)
    • test_rdma_read_into_standard_qp_cuda (GPU-to-GPU)
    • test_rdma_write_from_standard_qp_cuda (GPU-to-GPU)
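
For illustration only, one of these tests could be shaped roughly as below; the stubbed setup_with_qp_type body and buffer types are invented here (reusing the RdmaQpType sketch above), not monarch's actual test code:

```rust
// Illustrative stub: the real helper builds two RDMA endpoints with the
// requested QP type; plain byte buffers stand in for them here.
fn setup_with_qp_type(_qp_type: RdmaQpType) -> (Vec<u8>, Vec<u8>) {
    (vec![0u8; 64], vec![42u8; 64])
}

#[test]
fn test_rdma_read_into_standard_qp() {
    // Force the standard-ibverbs path rather than relying on Auto detection.
    let (mut local, remote) = setup_with_qp_type(RdmaQpType::Standard);
    local.copy_from_slice(&remote); // stand-in for the RDMA read
    assert_eq!(local, remote);
}
```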

Differential Revision: D85504061

meta-cla bot added the CLA Signed label on Oct 25, 2025

meta-codesync bot commented Oct 25, 2025

@casteryh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85504061.

casteryh added a commit to casteryh/monarch that referenced this pull request Oct 27, 2025
casteryh added a commit to casteryh/monarch that referenced this pull request Oct 28, 2025
casteryh added a commit to casteryh/monarch that referenced this pull request Oct 29, 2025
casteryh added a commit to casteryh/monarch that referenced this pull request Oct 29, 2025
casteryh added a commit to casteryh/monarch that referenced this pull request Oct 30, 2025

meta-codesync bot commented Oct 30, 2025

This pull request has been merged in 434e447.

AlirezaShamsoshoara pushed a commit to AlirezaShamsoshoara/monarch that referenced this pull request Oct 30, 2025
Summary:
Pull Request resolved: meta-pytorch#1665

Reviewed By: dstaay-fb

Differential Revision: D85504061

fbshipit-source-id: a54466a309ff086eae96a63f7edf994655664826

Labels

CLA Signed, fb-exported, Merged, meta-exported
