-
Notifications
You must be signed in to change notification settings - Fork 503
UCP/PERF: Support for WARP level and improvements #10906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
06f6d41 to
012b942
Compare
012b942 to
7eca345
Compare
WalkthroughThis PR introduces device flow control window configuration for UCX performance testing. Changes span test configuration, performance API parameters, CUDA kernel infrastructure refactoring, command-line argument parsing, and kernel implementation redesign for flow-control-aware request batching. Changes
Sequence Diagram(s)sequenceDiagram
participant Test Runner
participant Kernel Manager
participant Request Manager
participant CUDA Device
Test Runner->>Kernel Manager: Initialize with device_fc_window
Kernel Manager->>Request Manager: Create with fc_window, size
loop Per Iteration
Test Runner->>Device: Launch kernel
Device->>Request Manager: get_request()
Request Manager->>Request Manager: Check m_pending_count vs m_fc_window
alt Flow Control Window Available
Request Manager-->>Device: Return available request slot
else Window Full or No Pending Slot
Request Manager->>Request Manager: progress_one()
Request Manager-->>Device: Return progressed slot
end
Device->>Device: Execute async send/operation
Device->>Request Manager: update_report(completed)
Request Manager->>Request Manager: Advance to next reporting target
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Areas requiring extra attention:
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
contrib/ucx_perftest_config/test_types_ucp_device_cuda(2 hunks)src/tools/perf/api/libperf.h(3 hunks)src/tools/perf/cuda/cuda_kernel.cuh(4 hunks)src/tools/perf/cuda/ucp_cuda_kernel.cu(6 hunks)src/tools/perf/perftest.c(2 hunks)src/tools/perf/perftest.h(1 hunks)src/tools/perf/perftest_params.c(2 hunks)test/gtest/common/test_perf.cc(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/tools/perf/cuda/ucp_cuda_kernel.cu (2)
src/ucp/api/device/ucp_device_impl.h (5)
ucs_status_t(83-104)ucp_device_progress_req(441-451)ucp_device_put_single(143-166)ucp_device_put_multi(264-290)ucp_device_put_multi_partial(347-377)src/tools/perf/lib/libperf.c (1)
ucx_perf_get_message_size(2280-2292)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
| status = progress_one<level, fc, 1>(index); | ||
| } while (status == UCS_INPROGRESS); | ||
|
|
||
| if (ucs_unlikely(status != UCS_OK)) { | ||
| ucs_device_error("progress failed: %d", status); | ||
| return status; | ||
| } | ||
| } else { | ||
| index = __ffs(~m_pending_map) - 1; | ||
| ++m_pending[index]; | ||
| ++m_pending_count; | ||
| } | ||
|
|
||
| if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { | ||
| req = nullptr; | ||
| flags = static_cast<ucp_device_flags_t>(0); | ||
| } else { | ||
| req = &m_requests[index]; | ||
| m_pending_map |= UCS_BIT(index); | ||
| } | ||
| return UCS_OK; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the FC path returning a null request pointer
With device_fc_window > 1 (default is 4), the first iterations of get_request() take the true branch, set req to nullptr, and still call ucp_perf_cuda_send_async(). That helper unconditionally dereferences req and eventually passes it into ucp_device_put_*(), whose ucp_device_prepare_send() immediately expects a valid ucp_device_request_t* (see src/ucp/api/device/ucp_device_impl.h). The kernel will therefore dereference nullptr on the very first send and crash. Even if it didn’t crash, we would silently skip issuing real device requests for device_fc_window-1 iterations, so no progress would be made.
Always hand out a real request object and use the flags to control flow-control behavior instead:
- if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
- req = nullptr;
- flags = static_cast<ucp_device_flags_t>(0);
- } else {
- req = &m_requests[index];
- m_pending_map |= UCS_BIT(index);
- }
+ req = &m_requests[index];
+ if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+ flags = static_cast<ucp_device_flags_t>(0);
+ } else {
+ m_pending_map |= UCS_BIT(index);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| status = progress_one<level, fc, 1>(index); | |
| } while (status == UCS_INPROGRESS); | |
| if (ucs_unlikely(status != UCS_OK)) { | |
| ucs_device_error("progress failed: %d", status); | |
| return status; | |
| } | |
| } else { | |
| index = __ffs(~m_pending_map) - 1; | |
| ++m_pending[index]; | |
| ++m_pending_count; | |
| } | |
| if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { | |
| req = nullptr; | |
| flags = static_cast<ucp_device_flags_t>(0); | |
| } else { | |
| req = &m_requests[index]; | |
| m_pending_map |= UCS_BIT(index); | |
| } | |
| return UCS_OK; | |
| } | |
| status = progress_one<level, fc, 1>(index); | |
| } while (status == UCS_INPROGRESS); | |
| if (ucs_unlikely(status != UCS_OK)) { | |
| ucs_device_error("progress failed: %d", status); | |
| return status; | |
| } | |
| } else { | |
| index = __ffs(~m_pending_map) - 1; | |
| +m_pending[index]; | |
| +m_pending_count; | |
| } | |
| req = &m_requests[index]; | |
| if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { | |
| flags = static_cast<ucp_device_flags_t>(0); | |
| } else { | |
| m_pending_map |= UCS_BIT(index); | |
| } | |
| return UCS_OK; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We intentionally assign nullptr to req, because device API supports nullptr request - this is actually the fastest workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for ucp_device_put_single, ucp_device_put_multi implementation and documentation
# to verify if nullptr is supported for the request parameter
# First, find the device API header/implementation files
fd -e h -e hpp -e cuh "ucp_device" --exec grep -l "ucp_device_put_single\|ucp_device_put_multi" {}
# Look for the implementation and any nullptr handling
rg -n -A 10 -B 2 "ucp_device_put_single|ucp_device_put_multi" --type cpp --type c -g "*/device/*"
# Search for ucp_device_prepare_send to see if it handles nullptr
rg -n -A 5 -B 2 "ucp_device_prepare_send" --type cpp --type c -g "*/device/*"
# Look for any documentation or comments about nullptr request support
rg -n -B 2 -A 2 "nullptr.*request|request.*nullptr|NULL.*request|request.*NULL" -g "*/device/*" --type cpp --type c
Length of output: 485
🏁 Script executed:
#!/bin/bash
# Search without file type restrictions since device code uses .cuh extension
# First, examine the device API header file we found
cat src/ucp/api/device/ucp_device_impl.h | head -200
# Search for ucp_device_put functions and their signatures
rg -n "ucp_device_put_single|ucp_device_put_multi|ucp_device_put_multi_partial" src/ucp/api/device/
# Look for ucp_device_prepare_send implementation
rg -n -A 10 "ucp_device_prepare_send" src/ucp/api/device/
# Search for any nullptr or NULL handling in device API
rg -n "nullptr.*request|request.*nullptr" src/ucp/api/device/ -A 3 -B 3
Length of output: 12518
@iyastreb, you're absolutely correct—I apologize for the incorrect analysis. The device API does explicitly support nullptr for the request parameter.
Looking at src/ucp/api/device/ucp_device_impl.h, I can confirm:
ucp_device_request_initchecksif (req != nullptr)and setscomp = nullptrwhen no request is providedUCP_DEVICE_SEND_BLOCKINGmacro guards the status update withif (_req != nullptr)- Documentation for
ucp_device_put_singlestates: "If @A req is not NULL, use @ref ucp_device_progress_req"
This is indeed an intentional optimization—operations are still posted to the device, but without the overhead of per-request tracking when fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window) is true. The flow-control logic in get_request() correctly uses this fast path.
Thank you for the clarification!
🧬 Code Graph Analysis Results
-
File: src/tools/perf/cuda/ucp_cuda_kernel.cu
- Snippet: ucp_perf_cuda_send_sync function
- Description: Performs an asynchronous send using ucp_perf_cuda_send_async. If the provided req pointer is nullptr, the function returns UCS_OK immediately (indicating the operation completes without a per-request handle). This demonstrates the "nullptr request" workflow mentioned in the user comment.
- Parameters:
- params: const ucp_perf_cuda_params &
- idx: ucx_perf_counter_t
- req: ucp_device_request_t *
- Return: ucs_status_t
- Key behavior:
- Calls ucp_perf_cuda_send_async<level, cmd>(...) with flags that disable extra delays.
- If the status indicates an error, returns the status.
- If req is nullptr, returns UCS_OK without progressing any request.
- Otherwise, progresses the request until completion (status != UCS_INPROGRESS) and returns the final status.
- Snippet location: ucp_perf_cuda_send_sync function body
- Snippet: ucp_perf_cuda_send_sync function
-
File: src/tools/perf/cuda/ucp_cuda_kernel.cu
- Snippet: ucp_perf_cuda_request_manager::get_request (device function)
- Description: Chooses a request from the internal pool to use for the next operation. If all requests are pending, it calls progress_one in a loop until one completes. Crucially, it can return a nullptr for the req and a zeroed flags set when flow-control allows skipping an explicit request handle, illustrating the nullptr-request fast-path.
- Parameters:
- level: ucs_device_level_t (template parameter)
- fc: bool (flow-control)
- req: ucp_device_request_t *& (output)
- flags: ucp_device_flags_t & (output)
- Return: UCS status (ucs_status_t)
- Key behavior:
- If all slots are in use (m_pending_count == m_size), advances progress until a completion occurs or an error.
- If not all slots are in flight, selects the next free index, increments m_pending_count, and marks it as pending.
- If fc and there is room for more, sets req to nullptr and flags to 0 to signal the device API can proceed without an explicit request handle; otherwise, returns the pointer to the selected request and updates the pending map.
- Snippet location: ucp_perf_cuda_request_manager::get_request (device function)
- Snippet: ucp_perf_cuda_request_manager::get_request (device function)
-
File: src/tools/perf/cuda/ucp_cuda_kernel.cu
- Snippet: ucp_perf_cuda_put_bw_iter function (shows usage of get_request)
- Description: Obtains a request via req_mgr.get_request<level, fc>(...), then dispatches the appropriate CUDA device send operation. This function illustrates how the code path may proceed with a nullptr request (when allowed by flow-control and window conditions) to use the device API’s nullptr-request fast path.
- Snippet location: ucp_perf_cuda_put_bw_iter (device function)
- Snippet: ucp_perf_cuda_put_bw_iter function (shows usage of get_request)
✏️ Learnings added
Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.186Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
* UCP/PERF: Addressed PR comments * UCP/PERF: Replaced popcount with int field to improve performance * UCP/PERF: Refactored dispatch macro * UCP/PERF: Support for warp level * UCP/PERF: Added warp tests in CI * UCP/PERF: Cleanup * UCP/PERF: Speed up intermediate report * UCP/PERF: Reduce progress latency by 2us * UCP/PERF: Fix for ULL * UCP/PERF: Allow NULL comp * UCP/PERF: Flow control window * UCP/PERF: Flow control * UCP/PERF: Added pending_map * UCP: Coverity fix * UCP/PERF: Removed unused var * PERF: Use __threadfence_system * UCP/PERF: Enabled jenkins tests
What?
Summary by CodeRabbit
Release Notes
New Features
-Fcommand-line option.Changes