Summary:
This change adds fallback support for RDMA operations when mlx5dv (Mellanox device-specific extensions) is not available. It modifies the queue-pair creation logic to conditionally use either extended mlx5dv-based queue pairs (when supported) or standard ibverbs queue pairs (as a fallback). The `pt_cuda_alloc` flag is updated to require mlx5dv support, since mlx5dv is needed to merge memory segments when using PyTorch's CUDA allocator. The change adds a new `is_extended` parameter to control whether extended or standard queue pairs are created at runtime.
Adds an environment variable, `MONARCH_RDMA_MLX5DV_DISABLED`, to exercise the new code path on a dev machine.
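The fallback decision and the tightened `pt_cuda_alloc` requirement can be sketched as follows. This is illustrative only: `QpKind`, `choose_qp_kind`, and `effective_pt_cuda_alloc` are hypothetical names, and the `mlx5dv_supported` boolean stands in for the real device-capability probe.

```rust
// Illustrative sketch; names below are hypothetical, not the actual API.

#[derive(Debug, Clone, Copy, PartialEq)]
enum QpKind {
    Mlx5dvExtended,  // extended queue pair via mlx5dv
    StandardIbverbs, // standard ibverbs queue pair (fallback)
}

// Pick the queue-pair flavor based on whether mlx5dv is available.
fn choose_qp_kind(mlx5dv_supported: bool) -> QpKind {
    if mlx5dv_supported {
        QpKind::Mlx5dvExtended
    } else {
        QpKind::StandardIbverbs
    }
}

// pt_cuda_alloc requires mlx5dv, since mlx5dv is needed to merge
// memory segments when using PyTorch's CUDA allocator.
fn effective_pt_cuda_alloc(requested: bool, mlx5dv_supported: bool) -> bool {
    requested && mlx5dv_supported
}
```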
## Changes in Latest Revision
Based on reviewer feedback, the implementation has been updated with a cleaner, configuration-based approach:
**API Changes:**
- Replaced `uint8_t is_extended` parameter with `rdma_qp_type_t` enum in C API
- Added `RdmaQpType` enum to Rust with three variants:
  - `Auto`: auto-detect based on device capabilities (default)
  - `Standard`: force standard ibverbs queue pairs
  - `Mlx5dv`: force mlx5dv extended queue pairs
- Added `qp_type` field to `IbverbsConfig` for explicit QP type control
- The C code uses a switch statement with a proper default case for unknown types
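A minimal sketch of the Rust enum and its mapping to the C-side `rdma_qp_type_t` values. Only the variant names come from this change; the discriminant values, the `to_c_qp_type` helper, and the error handling are assumptions.

```rust
// Sketch: variant names are from this change; everything else is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum RdmaQpType {
    #[default]
    Auto,     // auto-detect based on device capabilities
    Standard, // force standard ibverbs queue pairs
    Mlx5dv,   // force mlx5dv extended queue pairs
}

// Hypothetical mapping to the C-side rdma_qp_type_t. Auto never crosses
// the FFI boundary, because Rust resolves it before calling C.
fn to_c_qp_type(t: RdmaQpType) -> Result<u32, &'static str> {
    match t {
        RdmaQpType::Standard => Ok(0),
        RdmaQpType::Mlx5dv => Ok(1),
        RdmaQpType::Auto => Err("Auto must be resolved in Rust before calling C"),
    }
}
```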
**Architecture:**
- Rust resolves `Auto` mode before calling into C (single source of truth for capability detection)
- The C function becomes a pure executor with no capability-detection logic
- Removed the environment-variable approach in favor of configuration
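The "single source of truth" resolution step might look like the sketch below; `resolve_qp_type` and the `device_supports_mlx5dv` flag are stand-ins for the real names and the real capability probe.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RdmaQpType { Auto, Standard, Mlx5dv }

// Rust resolves Auto before calling into C, so the C function only ever
// sees an explicit Standard or Mlx5dv request and stays a pure executor.
fn resolve_qp_type(requested: RdmaQpType, device_supports_mlx5dv: bool) -> RdmaQpType {
    match requested {
        RdmaQpType::Auto if device_supports_mlx5dv => RdmaQpType::Mlx5dv,
        RdmaQpType::Auto => RdmaQpType::Standard,
        explicit => explicit, // Standard / Mlx5dv pass through unchanged
    }
}
```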
**Testing:**
- Added `setup_with_qp_type()` helper function in test utilities
- Added 4 new unit tests to verify the standard QP fallback path:
  - `test_rdma_read_into_standard_qp` (CPU-to-CPU)
  - `test_rdma_write_from_standard_qp` (CPU-to-CPU)
  - `test_rdma_read_into_standard_qp_cuda` (GPU-to-GPU)
  - `test_rdma_write_from_standard_qp_cuda` (GPU-to-GPU)
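As a rough sketch of how these tests pin the fallback path: the real `IbverbsConfig` has many more fields, and this `setup_with_qp_type` is only a stand-in for the actual helper.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RdmaQpType { Auto, Standard, Mlx5dv }

// Stand-in for IbverbsConfig; only the qp_type field is from this change.
struct IbverbsConfig {
    qp_type: RdmaQpType,
}

// Sketch of the test helper: builds a config that forces the given QP type,
// so tests can exercise the standard-QP fallback deterministically.
fn setup_with_qp_type(qp_type: RdmaQpType) -> IbverbsConfig {
    IbverbsConfig { qp_type }
}
```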
Reviewed By: dstaay-fb
Differential Revision: D85504061