UCP/PERF: Support for WARP level and improvements #10906

iyastreb · 2025-09-24T08:33:10Z

What?

Fixes from previous PR UCP/PERF: UCP tests with configurable level/api/batch #10893
Support for WARP level
Many performance optimizations (100x faster report time and others)
Added flow control optimization: send request + NODELAY every N iterations (4 by default)

Summary by CodeRabbit

Release Notes

New Features
- Added configurable flow-control window size for device tests via new -F command-line option.
- Enabled 32-thread test variants for bandwidth and latency benchmarking at 1k message size.
- Added 8-byte single-thread latency test configuration.
- Introduced 1-warp test variants with optimized settings.
Changes
- Updated default latency parameters for UCP PUT multi and partial latency tests.

contrib/ucx_perftest_config/test_types_ucp_device_cuda

src/tools/perf/cuda/cuda_kernel.cuh

src/tools/perf/cuda/ucp_cuda_kernel.cu

contrib/test_jenkins.sh

coderabbitai · 2025-11-06T08:23:29Z

Walkthrough

This PR introduces device flow control window configuration for UCX performance testing. Changes span test configuration, performance API parameters, CUDA kernel infrastructure refactoring, command-line argument parsing, and kernel implementation redesign for flow-control-aware request batching.

Changes

Cohort / File(s)	Summary
Test Configuration `contrib/ucx_perftest_config/test_types_ucp_device_cuda`	Enabled 32-thread variants, added 8-byte 1-thread latency test, and expanded 1k-size warp-level test set with explicit configurations.
Performance API Parameters `src/tools/perf/api/libperf.h`	Added `UCP_PERF_FC_WINDOW_DEFAULT` macro (value 4) and `device_fc_window` field to `ucx_perf_params_t` struct; updated field description comments.
CUDA Kernel Infrastructure `src/tools/perf/cuda/cuda_kernel.cuh`	Replaced `ucx_perf_cuda_update_report` function with `ucx_perf_cuda_reporter` class; added `device_fc_window` field to context; introduced `ucx_perf_cuda_thread_index` template helper; replaced macro set for kernel dispatch and indexing; updated `wait_for_kernel` reporting logic and thread-level accounting.
CUDA Kernel Implementation `src/tools/perf/cuda/ucp_cuda_kernel.cu`	Redesigned `ucp_perf_cuda_request_manager` with per-request batching, flow-control windowing, and pending tracking; replaced `device_clone` with `device_vector`; renamed and refactored send/kernel functions (`ucp_perf_cuda_send_nbx` → `ucp_perf_cuda_send_async`); updated kernel dispatch to use new macro set.
Parameter Initialization & Defaults `src/tools/perf/perftest.c`, `test/gtest/common/test_perf.cc`	Initialized `device_fc_window` to `UCP_PERF_FC_WINDOW_DEFAULT`; changed latency test defaults from 32 to 1 for `ucp_put_multi_lat` and `ucp_put_partial_lat`.
Command-Line Interface `src/tools/perf/perftest.h`, `src/tools/perf/perftest_params.c`	Added `-F` command-line option to configure device flow control window size; updated option parsing in `parse_test_params` and usage output.

Sequence Diagram(s)

sequenceDiagram
    participant Test Runner
    participant Kernel Manager
    participant Request Manager
    participant CUDA Device
    
    Test Runner->>Kernel Manager: Initialize with device_fc_window
    Kernel Manager->>Request Manager: Create with fc_window, size
    
    loop Per Iteration
        Test Runner->>Device: Launch kernel
        Device->>Request Manager: get_request()
        Request Manager->>Request Manager: Check m_pending_count vs m_fc_window
        alt Flow Control Window Available
            Request Manager-->>Device: Return available request slot
        else Window Full or No Pending Slot
            Request Manager->>Request Manager: progress_one()
            Request Manager-->>Device: Return progressed slot
        end
        Device->>Device: Execute async send/operation
        Device->>Request Manager: update_report(completed)
        Request Manager->>Request Manager: Advance to next reporting target
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:

Request manager redesign (ucp_cuda_kernel.cu): Complete refactoring of batching logic, pending bitmap tracking, and flow-control windowing—verify correctness of progress_one template logic and get_request state management
Reporter class transition (cuda_kernel.cuh): Verify stateful reporter correctly advances iterations and respects thread-0 guarding; check memory fencing semantics
Macro set replacement: Cross-reference old dispatch logic (UCX_KERNEL_CMD, etc.) against new macros (UCX_PERF_KERNEL_DISPATCH_CMD, etc.) to ensure no behavioral drift
report_interval_ns calculation change: Removed division by 100—confirm this is intentional and doesn't break timing calibration
Latency default change (32 → 1): Determine rationale for changing test defaults; verify test coverage expectations
Device memory management: Validate device_vector allocation, copy, and deallocation paths match prior device_clone semantics

Poem

🐰 A whisker-twitch of flow control delight,
Windows aligned, requests in flight,
Reporters now class-bound with grace,
Kernels dispatched at warp-level pace—
Performance blooming, byte by byte! 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'UCP/PERF: Support for WARP level and improvements' accurately reflects the main changes in the changeset, which include adding WARP level support and various performance optimizations including flow-control improvements.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e88f7 and fe5a631.

📒 Files selected for processing (8)

contrib/ucx_perftest_config/test_types_ucp_device_cuda (2 hunks)
src/tools/perf/api/libperf.h (3 hunks)
src/tools/perf/cuda/cuda_kernel.cuh (4 hunks)
src/tools/perf/cuda/ucp_cuda_kernel.cu (6 hunks)
src/tools/perf/perftest.c (2 hunks)
src/tools/perf/perftest.h (1 hunks)
src/tools/perf/perftest_params.c (2 hunks)
test/gtest/common/test_perf.cc (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/tools/perf/cuda/ucp_cuda_kernel.cu (2)

src/ucp/api/device/ucp_device_impl.h (5)

ucs_status_t (83-104)

ucp_device_progress_req (441-451)

ucp_device_put_single (143-166)

ucp_device_put_multi (264-290)

ucp_device_put_multi_partial (347-377)

src/tools/perf/lib/libperf.c (1)

ucx_perf_get_message_size (2280-2292)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: UCX PR (Static_check Static checks)
GitHub Check: UCX PR (Codestyle ctags check)
GitHub Check: UCX PR (Codestyle commit title)
GitHub Check: UCX PR (Codestyle codespell check)
GitHub Check: UCX PR (Codestyle AUTHORS file update check)
GitHub Check: UCX PR (Codestyle format code)
GitHub Check: UCX release DRP (Prepare CheckRelease)
GitHub Check: UCX release (Prepare CheckRelease)
GitHub Check: UCX snapshot (Prepare Check)

coderabbitai · 2025-11-06T08:33:30Z

src/tools/perf/cuda/ucp_cuda_kernel.cu

+                status = progress_one<level, fc, 1>(index);
+            } while (status == UCS_INPROGRESS);
+
+            if (ucs_unlikely(status != UCS_OK)) {
+                ucs_device_error("progress failed: %d", status);
+                return status;
+            }
+        } else {
+            index = __ffs(~m_pending_map) - 1;
+            ++m_pending[index];
+            ++m_pending_count;
+        }
+
+        if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+            req   = nullptr;
+            flags = static_cast<ucp_device_flags_t>(0);
+        } else {
+            req            = &m_requests[index];
+            m_pending_map |= UCS_BIT(index);
+        }
+        return UCS_OK;
    }


⚠️ Potential issue | 🔴 Critical

Fix the FC path returning a null request pointer

With device_fc_window > 1 (default is 4), the first iterations of get_request() take the true branch, set req to nullptr, and still call ucp_perf_cuda_send_async(). That helper unconditionally dereferences req and eventually passes it into ucp_device_put_*(), whose ucp_device_prepare_send() immediately expects a valid ucp_device_request_t* (see src/ucp/api/device/ucp_device_impl.h). The kernel will therefore dereference nullptr on the very first send and crash. Even if it didn’t crash, we would silently skip issuing real device requests for device_fc_window-1 iterations, so no progress would be made.

Always hand out a real request object and use the flags to control flow-control behavior instead:

- if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { - req = nullptr; - flags = static_cast<ucp_device_flags_t>(0); - } else { - req = &m_requests[index]; - m_pending_map |= UCS_BIT(index); - } + req = &m_requests[index]; + if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) { + flags = static_cast<ucp_device_flags_t>(0); + } else { + m_pending_map |= UCS_BIT(index); + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

status = progress_one<level, fc, 1>(index);

} while (status == UCS_INPROGRESS);

if (ucs_unlikely(status != UCS_OK)) {

ucs_device_error("progress failed: %d", status);

return status;

}

} else {

index = __ffs(~m_pending_map) - 1;

++m_pending[index];

++m_pending_count;

}

if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {

req = nullptr;

flags = static_cast<ucp_device_flags_t>(0);

} else {

req = &m_requests[index];

m_pending_map |= UCS_BIT(index);

}

return UCS_OK;

}

status = progress_one<level, fc, 1>(index);

} while (status == UCS_INPROGRESS);

if (ucs_unlikely(status != UCS_OK)) {

ucs_device_error("progress failed: %d", status);

return status;

}

} else {

index = __ffs(~m_pending_map) - 1;

+m_pending[index];

+m_pending_count;

}

req = &m_requests[index];

if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {

flags = static_cast<ucp_device_flags_t>(0);

} else {

m_pending_map |= UCS_BIT(index);

}

return UCS_OK;

}

We intentionally assign nullptr to req, because device API supports nullptr request - this is actually the fastest workflow.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Search for ucp_device_put_single, ucp_device_put_multi implementation and documentation # to verify if nullptr is supported for the request parameter # First, find the device API header/implementation files fd -e h -e hpp -e cuh "ucp_device" --exec grep -l "ucp_device_put_single\|ucp_device_put_multi" {} # Look for the implementation and any nullptr handling rg -n -A 10 -B 2 "ucp_device_put_single|ucp_device_put_multi" --type cpp --type c -g "*/device/*" # Search for ucp_device_prepare_send to see if it handles nullptr rg -n -A 5 -B 2 "ucp_device_prepare_send" --type cpp --type c -g "*/device/*" # Look for any documentation or comments about nullptr request support rg -n -B 2 -A 2 "nullptr.*request|request.*nullptr|NULL.*request|request.*NULL" -g "*/device/*" --type cpp --type c

Length of output: 485

🏁 Script executed:

#!/bin/bash # Search without file type restrictions since device code uses .cuh extension # First, examine the device API header file we found cat src/ucp/api/device/ucp_device_impl.h | head -200 # Search for ucp_device_put functions and their signatures rg -n "ucp_device_put_single|ucp_device_put_multi|ucp_device_put_multi_partial" src/ucp/api/device/ # Look for ucp_device_prepare_send implementation rg -n -A 10 "ucp_device_prepare_send" src/ucp/api/device/ # Search for any nullptr or NULL handling in device API rg -n "nullptr.*request|request.*nullptr" src/ucp/api/device/ -A 3 -B 3

Length of output: 12518

@iyastreb, you're absolutely correct—I apologize for the incorrect analysis. The device API does explicitly support nullptr for the request parameter.

Looking at src/ucp/api/device/ucp_device_impl.h, I can confirm:

ucp_device_request_init checks if (req != nullptr) and sets comp = nullptr when no request is provided

UCP_DEVICE_SEND_BLOCKING macro guards the status update with if (_req != nullptr)

Documentation for ucp_device_put_single states: "If @A req is not NULL, use @ref ucp_device_progress_req"

This is indeed an intentional optimization—operations are still posted to the device, but without the overhead of per-request tracking when fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window) is true. The flow-control logic in get_request() correctly uses this fast path.

Thank you for the clarification!

🧬 Code Graph Analysis Results

File: src/tools/perf/cuda/ucp_cuda_kernel.cu

Snippet: ucp_perf_cuda_send_sync function

Description: Performs an asynchronous send using ucp_perf_cuda_send_async. If the provided req pointer is nullptr, the function returns UCS_OK immediately (indicating the operation completes without a per-request handle). This demonstrates the "nullptr request" workflow mentioned in the user comment.

Parameters:

params: const ucp_perf_cuda_params &

idx: ucx_perf_counter_t

req: ucp_device_request_t *

Return: ucs_status_t

Key behavior:

Calls ucp_perf_cuda_send_async<level, cmd>(...) with flags that disable extra delays.

If the status indicates an error, returns the status.

If req is nullptr, returns UCS_OK without progressing any request.

Otherwise, progresses the request until completion (status != UCS_INPROGRESS) and returns the final status.

Snippet location: ucp_perf_cuda_send_sync function body

File: src/tools/perf/cuda/ucp_cuda_kernel.cu

Snippet: ucp_perf_cuda_request_manager::get_request (device function)

Description: Chooses a request from the internal pool to use for the next operation. If all requests are pending, it calls progress_one in a loop until one completes. Crucially, it can return a nullptr for the req and a zeroed flags set when flow-control allows skipping an explicit request handle, illustrating the nullptr-request fast-path.

Parameters:

level: ucs_device_level_t (template parameter)

fc: bool (flow-control)

req: ucp_device_request_t *& (output)

flags: ucp_device_flags_t & (output)

Return: UCS status (ucs_status_t)

Key behavior:

If all slots are in use (m_pending_count == m_size), advances progress until a completion occurs or an error.

If not all slots are in flight, selects the next free index, increments m_pending_count, and marks it as pending.

If fc and there is room for more, sets req to nullptr and flags to 0 to signal the device API can proceed without an explicit request handle; otherwise, returns the pointer to the selected request and updates the pending map.

Snippet location: ucp_perf_cuda_request_manager::get_request (device function)

File: src/tools/perf/cuda/ucp_cuda_kernel.cu

Snippet: ucp_perf_cuda_put_bw_iter function (shows usage of get_request)

Description: Obtains a request via req_mgr.get_request<level, fc>(...), then dispatches the appropriate CUDA device send operation. This function illustrates how the code path may proceed with a nullptr request (when allowed by flow-control and window conditions) to use the device API’s nullptr-request fast path.

Snippet location: ucp_perf_cuda_put_bw_iter (device function)

✏️ Learnings added

Learnt from: iyastreb Repo: openucx/ucx PR: 10906 File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91 Timestamp: 2025-11-06T09:04:19.186Z Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

* UCP/PERF: Addressed PR comments * UCP/PERF: Replaced popcount with int field to improve performance * UCP/PERF: Refactored dispatch macro * UCP/PERF: Support for warp level * UCP/PERF: Added warp tests in CI * UCP/PERF: Cleanup * UCP/PERF: Speed up intermediate report * UCP/PERF: Reduce progress latency by 2us * UCP/PERF: Fix for ULL * UCP/PERF: Allow NULL comp * UCP/PERF: Flow control window * UCP/PERF: Flow control * UCP/PERF: Added pending_map * UCP: Coverity fix * UCP/PERF: Removed unused var * PERF: Use __threadfence_system * UCP/PERF: Enabled jenkins tests

iyastreb mentioned this pull request Sep 24, 2025

UCP/PERF: UCP tests with configurable level/api/batch #10893

Merged

iyastreb force-pushed the ucp-perf-warp-fixes branch 2 times, most recently from 06f6d41 to 012b942 Compare October 1, 2025 06:56

iyastreb added 10 commits October 2, 2025 09:45

UCP/PERF: Addressed PR comments

ece8612

UCP/PERF: Replaced popcount with int field to improve performance

19d307e

UCP/PERF: Refactored dispatch macro

8f1367e

UCP/PERF: Support for warp level

7009472

UCP/PERF: Added warp tests in CI

9f92aa6

UCP/PERF: Cleanup

3bab527

UCP/PERF: Speed up intermediate report

825f7de

UCP/PERF: Reduce progress latency by 2us

2816236

UCP/PERF: Fix for ULL

05f7b20

UCP/PERF: Allow NULL comp

7eca345

iyastreb force-pushed the ucp-perf-warp-fixes branch from 012b942 to 7eca345 Compare October 2, 2025 06:46

iyastreb added 6 commits October 6, 2025 10:45

UCP/PERF: Merged with master

36fe9e3

UCP/PERF: Merged with master

f3e93e2

UCP/PERF: Flow control window

13c2d8b

UCP/PERF: Flow control

df5ba8f

UCP/PERF: Fix

3928d6e

UCP/PERF: Added pending_map

6a393a0

iyastreb marked this pull request as ready for review October 15, 2025 15:02

UCP: Coverity fix

03959ee

iyastreb mentioned this pull request Oct 16, 2025

UCP/DEVICE: Make memh and local_addr optional for counter elements #10945

Merged

iyastreb added 4 commits October 16, 2025 11:00

UCP/PERF: Removed unused var

46633a4

UCP/PERF: Disable CI perftest

03b7da4

UCP: Merge branch 'master' into ucp-perf-warp-fixes

464887d

UCP: Merged with master

1f59fbd

iyastreb requested review from brminich and ofirfarjun7 October 29, 2025 05:55

PERF: Use __threadfence_system

59ea068

ofirfarjun7 reviewed Oct 29, 2025

View reviewed changes

contrib/ucx_perftest_config/test_types_ucp_device_cuda Show resolved Hide resolved

src/tools/perf/cuda/cuda_kernel.cuh Show resolved Hide resolved

src/tools/perf/cuda/cuda_kernel.cuh Outdated Show resolved Hide resolved

src/tools/perf/cuda/cuda_kernel.cuh Outdated Show resolved Hide resolved

iyastreb added 3 commits October 30, 2025 09:45

UCP/PERF: CR fixes

8d28eb6

UCP/PERF: Merge branch 'master' into ucp-perf-warp-fixes

d675046

UCP/PERF: Enabled jenkins tests

9313ce3

ofirfarjun7 reviewed Nov 5, 2025

View reviewed changes

src/tools/perf/cuda/cuda_kernel.cuh Outdated Show resolved Hide resolved

src/tools/perf/cuda/cuda_kernel.cuh Show resolved Hide resolved

src/tools/perf/cuda/ucp_cuda_kernel.cu Outdated Show resolved Hide resolved

ofirfarjun7 reviewed Nov 5, 2025

View reviewed changes

contrib/test_jenkins.sh Outdated Show resolved Hide resolved

jeynmann mentioned this pull request Nov 6, 2025

TOOLS/DEVICE: support channel id in perftest #10993

Open

iyastreb added 2 commits November 6, 2025 09:53

UCP: Merge branch 'master' into ucp-perf-warp-fixes

ca1a574

UCP/PERF: CR fixes

fe5a631

coderabbitai bot reviewed Nov 6, 2025

View reviewed changes

ofirfarjun7 approved these changes Nov 6, 2025

View reviewed changes

ofirfarjun7 merged commit e29e5f5 into openucx:master Nov 6, 2025
148 checks passed

iyastreb deleted the ucp-perf-warp-fixes branch November 6, 2025 13:40

This was referenced Nov 14, 2025

Gaudi gdr integration #11008

Closed

UCT/GDA: channel_id implementation #11013

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UCP/PERF: Support for WARP level and improvements #10906

UCP/PERF: Support for WARP level and improvements #10906

Uh oh!

iyastreb commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Nov 6, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 6, 2025 •

edited

Loading

Uh oh!

iyastreb Nov 6, 2025

Uh oh!

coderabbitai bot Nov 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

UCP/PERF: Support for WARP level and improvements #10906

UCP/PERF: Support for WARP level and improvements #10906

Uh oh!

Conversation

iyastreb commented Sep 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What?

Summary by CodeRabbit

Release Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iyastreb Nov 6, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 6, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iyastreb commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 6, 2025 •

edited

Loading

coderabbitai bot Nov 6, 2025 •

edited

Loading