Skip to content

Conversation

@iyastreb
Copy link
Contributor

@iyastreb iyastreb commented Sep 24, 2025

What?

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable flow-control window size for device tests via new -F command-line option.
    • Enabled 32-thread test variants for bandwidth and latency benchmarking at 1k message size.
    • Added 8-byte single-thread latency test configuration.
    • Introduced 1-warp test variants with optimized settings.
  • Changes

    • Updated default latency parameters for UCP PUT multi and partial latency tests.

@iyastreb iyastreb force-pushed the ucp-perf-warp-fixes branch 2 times, most recently from 06f6d41 to 012b942 Compare October 1, 2025 06:56
@iyastreb iyastreb force-pushed the ucp-perf-warp-fixes branch from 012b942 to 7eca345 Compare October 2, 2025 06:46
@iyastreb iyastreb marked this pull request as ready for review October 15, 2025 15:02
@coderabbitai
Copy link

coderabbitai bot commented Nov 6, 2025

Walkthrough

This PR introduces device flow control window configuration for UCX performance testing. Changes span test configuration, performance API parameters, CUDA kernel infrastructure refactoring, command-line argument parsing, and kernel implementation redesign for flow-control-aware request batching.

Changes

Cohort / File(s) Summary
Test Configuration
contrib/ucx_perftest_config/test_types_ucp_device_cuda
Enabled 32-thread variants, added 8-byte 1-thread latency test, and expanded 1k-size warp-level test set with explicit configurations.
Performance API Parameters
src/tools/perf/api/libperf.h
Added UCP_PERF_FC_WINDOW_DEFAULT macro (value 4) and device_fc_window field to ucx_perf_params_t struct; updated field description comments.
CUDA Kernel Infrastructure
src/tools/perf/cuda/cuda_kernel.cuh
Replaced ucx_perf_cuda_update_report function with ucx_perf_cuda_reporter class; added device_fc_window field to context; introduced ucx_perf_cuda_thread_index template helper; replaced macro set for kernel dispatch and indexing; updated wait_for_kernel reporting logic and thread-level accounting.
CUDA Kernel Implementation
src/tools/perf/cuda/ucp_cuda_kernel.cu
Redesigned ucp_perf_cuda_request_manager with per-request batching, flow-control windowing, and pending tracking; replaced device_clone with device_vector; renamed and refactored send/kernel functions (ucp_perf_cuda_send_nbxucp_perf_cuda_send_async); updated kernel dispatch to use new macro set.
Parameter Initialization & Defaults
src/tools/perf/perftest.c, test/gtest/common/test_perf.cc
Initialized device_fc_window to UCP_PERF_FC_WINDOW_DEFAULT; changed latency test defaults from 32 to 1 for ucp_put_multi_lat and ucp_put_partial_lat.
Command-Line Interface
src/tools/perf/perftest.h, src/tools/perf/perftest_params.c
Added -F command-line option to configure device flow control window size; updated option parsing in parse_test_params and usage output.

Sequence Diagram(s)

sequenceDiagram
    participant Test Runner
    participant Kernel Manager
    participant Request Manager
    participant CUDA Device
    
    Test Runner->>Kernel Manager: Initialize with device_fc_window
    Kernel Manager->>Request Manager: Create with fc_window, size
    
    loop Per Iteration
        Test Runner->>Device: Launch kernel
        Device->>Request Manager: get_request()
        Request Manager->>Request Manager: Check m_pending_count vs m_fc_window
        alt Flow Control Window Available
            Request Manager-->>Device: Return available request slot
        else Window Full or No Pending Slot
            Request Manager->>Request Manager: progress_one()
            Request Manager-->>Device: Return progressed slot
        end
        Device->>Device: Execute async send/operation
        Device->>Request Manager: update_report(completed)
        Request Manager->>Request Manager: Advance to next reporting target
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:

  • Request manager redesign (ucp_cuda_kernel.cu): Complete refactoring of batching logic, pending bitmap tracking, and flow-control windowing—verify correctness of progress_one template logic and get_request state management
  • Reporter class transition (cuda_kernel.cuh): Verify stateful reporter correctly advances iterations and respects thread-0 guarding; check memory fencing semantics
  • Macro set replacement: Cross-reference old dispatch logic (UCX_KERNEL_CMD, etc.) against new macros (UCX_PERF_KERNEL_DISPATCH_CMD, etc.) to ensure no behavioral drift
  • report_interval_ns calculation change: Removed division by 100—confirm this is intentional and doesn't break timing calibration
  • Latency default change (32 → 1): Determine rationale for changing test defaults; verify test coverage expectations
  • Device memory management: Validate device_vector allocation, copy, and deallocation paths match prior device_clone semantics

Poem

🐰 A whisker-twitch of flow control delight,
Windows aligned, requests in flight,
Reporters now class-bound with grace,
Kernels dispatched at warp-level pace—
Performance blooming, byte by byte! 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'UCP/PERF: Support for WARP level and improvements' accurately reflects the main changes in the changeset, which include adding WARP level support and various performance optimizations including flow-control improvements.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e88f7 and fe5a631.

📒 Files selected for processing (8)
  • contrib/ucx_perftest_config/test_types_ucp_device_cuda (2 hunks)
  • src/tools/perf/api/libperf.h (3 hunks)
  • src/tools/perf/cuda/cuda_kernel.cuh (4 hunks)
  • src/tools/perf/cuda/ucp_cuda_kernel.cu (6 hunks)
  • src/tools/perf/perftest.c (2 hunks)
  • src/tools/perf/perftest.h (1 hunks)
  • src/tools/perf/perftest_params.c (2 hunks)
  • test/gtest/common/test_perf.cc (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/tools/perf/cuda/ucp_cuda_kernel.cu (2)
src/ucp/api/device/ucp_device_impl.h (5)
  • ucs_status_t (83-104)
  • ucp_device_progress_req (441-451)
  • ucp_device_put_single (143-166)
  • ucp_device_put_multi (264-290)
  • ucp_device_put_multi_partial (347-377)
src/tools/perf/lib/libperf.c (1)
  • ucx_perf_get_message_size (2280-2292)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)

Comment on lines +70 to 91
status = progress_one<level, fc, 1>(index);
} while (status == UCS_INPROGRESS);

if (ucs_unlikely(status != UCS_OK)) {
ucs_device_error("progress failed: %d", status);
return status;
}
} else {
index = __ffs(~m_pending_map) - 1;
++m_pending[index];
++m_pending_count;
}

if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
req = nullptr;
flags = static_cast<ucp_device_flags_t>(0);
} else {
req = &m_requests[index];
m_pending_map |= UCS_BIT(index);
}
return UCS_OK;
}
Copy link

@coderabbitai coderabbitai bot Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix the FC path returning a null request pointer

With device_fc_window > 1 (default is 4), the first iterations of get_request() take the true branch, set req to nullptr, and still call ucp_perf_cuda_send_async(). That helper unconditionally dereferences req and eventually passes it into ucp_device_put_*(), whose ucp_device_prepare_send() immediately expects a valid ucp_device_request_t* (see src/ucp/api/device/ucp_device_impl.h). The kernel will therefore dereference nullptr on the very first send and crash. Even if it didn’t crash, we would silently skip issuing real device requests for device_fc_window-1 iterations, so no progress would be made.

Always hand out a real request object and use the flags to control flow-control behavior instead:

-        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
-            req   = nullptr;
-            flags = static_cast<ucp_device_flags_t>(0);
-        } else {
-            req            = &m_requests[index];
-            m_pending_map |= UCS_BIT(index);
-        }
+        req = &m_requests[index];
+        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+            flags = static_cast<ucp_device_flags_t>(0);
+        } else {
+            m_pending_map |= UCS_BIT(index);
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
status = progress_one<level, fc, 1>(index);
} while (status == UCS_INPROGRESS);
if (ucs_unlikely(status != UCS_OK)) {
ucs_device_error("progress failed: %d", status);
return status;
}
} else {
index = __ffs(~m_pending_map) - 1;
++m_pending[index];
++m_pending_count;
}
if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
req = nullptr;
flags = static_cast<ucp_device_flags_t>(0);
} else {
req = &m_requests[index];
m_pending_map |= UCS_BIT(index);
}
return UCS_OK;
}
status = progress_one<level, fc, 1>(index);
} while (status == UCS_INPROGRESS);
if (ucs_unlikely(status != UCS_OK)) {
ucs_device_error("progress failed: %d", status);
return status;
}
} else {
index = __ffs(~m_pending_map) - 1;
+m_pending[index];
+m_pending_count;
}
req = &m_requests[index];
if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
flags = static_cast<ucp_device_flags_t>(0);
} else {
m_pending_map |= UCS_BIT(index);
}
return UCS_OK;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intentionally assign nullptr to req, because device API supports nullptr request - this is actually the fastest workflow.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for ucp_device_put_single, ucp_device_put_multi implementation and documentation
# to verify if nullptr is supported for the request parameter

# First, find the device API header/implementation files
fd -e h -e hpp -e cuh "ucp_device" --exec grep -l "ucp_device_put_single\|ucp_device_put_multi" {}

# Look for the implementation and any nullptr handling
rg -n -A 10 -B 2 "ucp_device_put_single|ucp_device_put_multi" --type cpp --type c -g "*/device/*"

# Search for ucp_device_prepare_send to see if it handles nullptr
rg -n -A 5 -B 2 "ucp_device_prepare_send" --type cpp --type c -g "*/device/*"

# Look for any documentation or comments about nullptr request support
rg -n -B 2 -A 2 "nullptr.*request|request.*nullptr|NULL.*request|request.*NULL" -g "*/device/*" --type cpp --type c

Length of output: 485


🏁 Script executed:

#!/bin/bash
# Search without file type restrictions since device code uses .cuh extension

# First, examine the device API header file we found
cat src/ucp/api/device/ucp_device_impl.h | head -200

# Search for ucp_device_put functions and their signatures
rg -n "ucp_device_put_single|ucp_device_put_multi|ucp_device_put_multi_partial" src/ucp/api/device/

# Look for ucp_device_prepare_send implementation
rg -n -A 10 "ucp_device_prepare_send" src/ucp/api/device/

# Search for any nullptr or NULL handling in device API
rg -n "nullptr.*request|request.*nullptr" src/ucp/api/device/ -A 3 -B 3

Length of output: 12518


@iyastreb, you're absolutely correct—I apologize for the incorrect analysis. The device API does explicitly support nullptr for the request parameter.

Looking at src/ucp/api/device/ucp_device_impl.h, I can confirm:

  1. ucp_device_request_init checks if (req != nullptr) and sets comp = nullptr when no request is provided
  2. UCP_DEVICE_SEND_BLOCKING macro guards the status update with if (_req != nullptr)
  3. Documentation for ucp_device_put_single states: "If @A req is not NULL, use @ref ucp_device_progress_req"

This is indeed an intentional optimization—operations are still posted to the device, but without the overhead of per-request tracking when fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window) is true. The flow-control logic in get_request() correctly uses this fast path.

Thank you for the clarification!


🧬 Code Graph Analysis Results
  • File: src/tools/perf/cuda/ucp_cuda_kernel.cu

    • Snippet: ucp_perf_cuda_send_sync function
      • Description: Performs an asynchronous send using ucp_perf_cuda_send_async. If the provided req pointer is nullptr, the function returns UCS_OK immediately (indicating the operation completes without a per-request handle). This demonstrates the "nullptr request" workflow mentioned in the user comment.
      • Parameters:
        • params: const ucp_perf_cuda_params &
        • idx: ucx_perf_counter_t
        • req: ucp_device_request_t *
      • Return: ucs_status_t
      • Key behavior:
        • Calls ucp_perf_cuda_send_async<level, cmd>(...) with flags that disable extra delays.
        • If the status indicates an error, returns the status.
        • If req is nullptr, returns UCS_OK without progressing any request.
        • Otherwise, progresses the request until completion (status != UCS_INPROGRESS) and returns the final status.
    • Snippet location: ucp_perf_cuda_send_sync function body
  • File: src/tools/perf/cuda/ucp_cuda_kernel.cu

    • Snippet: ucp_perf_cuda_request_manager::get_request (device function)
      • Description: Chooses a request from the internal pool to use for the next operation. If all requests are pending, it calls progress_one in a loop until one completes. Crucially, it can return a nullptr for the req and a zeroed flags set when flow-control allows skipping an explicit request handle, illustrating the nullptr-request fast-path.
      • Parameters:
        • level: ucs_device_level_t (template parameter)
        • fc: bool (flow-control)
        • req: ucp_device_request_t *& (output)
        • flags: ucp_device_flags_t & (output)
      • Return: UCS status (ucs_status_t)
      • Key behavior:
        • If all slots are in use (m_pending_count == m_size), advances progress until a completion occurs or an error.
        • If not all slots are in flight, selects the next free index, increments m_pending_count, and marks it as pending.
        • If fc and there is room for more, sets req to nullptr and flags to 0 to signal the device API can proceed without an explicit request handle; otherwise, returns the pointer to the selected request and updates the pending map.
    • Snippet location: ucp_perf_cuda_request_manager::get_request (device function)
  • File: src/tools/perf/cuda/ucp_cuda_kernel.cu

    • Snippet: ucp_perf_cuda_put_bw_iter function (shows usage of get_request)
      • Description: Obtains a request via req_mgr.get_request<level, fc>(...), then dispatches the appropriate CUDA device send operation. This function illustrates how the code path may proceed with a nullptr request (when allowed by flow-control and window conditions) to use the device API’s nullptr-request fast path.
    • Snippet location: ucp_perf_cuda_put_bw_iter (device function)

✏️ Learnings added
Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.186Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@ofirfarjun7 ofirfarjun7 merged commit e29e5f5 into openucx:master Nov 6, 2025
148 checks passed
@iyastreb iyastreb deleted the ucp-perf-warp-fixes branch November 6, 2025 13:40
zzhang37 pushed a commit to intel-staging/ucx that referenced this pull request Nov 14, 2025
* UCP/PERF: Addressed PR comments

* UCP/PERF: Replaced popcount with int field to improve performance

* UCP/PERF: Refactored dispatch macro

* UCP/PERF: Support for warp level

* UCP/PERF: Added warp tests in CI

* UCP/PERF: Cleanup

* UCP/PERF: Speed up intermediate report

* UCP/PERF: Reduce progress latency by 2us

* UCP/PERF: Fix for ULL

* UCP/PERF: Allow NULL comp

* UCP/PERF: Flow control window

* UCP/PERF: Flow control

* UCP/PERF: Added pending_map

* UCP: Coverity fix

* UCP/PERF: Removed unused var

* PERF: Use __threadfence_system

* UCP/PERF: Enabled jenkins tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants