TOOLS/DEVICE: support channel id in perftest #10993
base: master
Conversation
Walkthrough
Added a device endpoint channel count parameter and propagated it through perf test initialization into the CUDA context and kernel code. The CUDA runtime now computes a per-thread channel_id (threadIdx.x % num_channels) and uses it in device_put_* calls; device-side arrays are copied via a new device_vector helper.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Config as Config (libperf)
    participant Test as Test Init
    participant Host as Host runtime
    participant GPU as CUDA Kernel
    participant NIC as Device ops
    Config->>Test: device_ep_channel_count
    Test->>Host: populate params (num_channels)
    Host->>Host: device_vector(copy indices/offsets/lengths)
    Host->>GPU: launch kernel (params, device arrays)
    GPU->>GPU: channel_id = threadIdx.x % num_channels
    GPU->>NIC: device_put_* (with channel_id)
    NIC-->>GPU: completion
```
Actionable comments posted: 3
📒 Files selected for processing (9)
- contrib/test_jenkins.sh (1 hunk)
- contrib/ucx_perftest_config/test_types_ucp_device_cuda (2 hunks)
- src/tools/perf/api/libperf.h (3 hunks)
- src/tools/perf/cuda/cuda_kernel.cuh (4 hunks)
- src/tools/perf/cuda/ucp_cuda_kernel.cu (7 hunks)
- src/tools/perf/perftest.c (2 hunks)
- src/tools/perf/perftest.h (1 hunk)
- src/tools/perf/perftest_params.c (3 hunks)
- test/gtest/common/test_perf.cc (1 hunk)
src/tools/perf/cuda/cuda_kernel.cuh (outdated)
```cpp
if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
    ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
    ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
                                     (completed - m_ctx.completed_iters);
    m_last_report_time    = cur_time;
    m_ctx.completed_iters = completed;
    __threadfence_system();
    m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
                                 m_max_iters);
}
```
Guard against zero deltas before dividing.
Here we divide by both (completed - m_ctx.completed_iters) and iter_time. If the progress callback fires twice with the same completed (e.g. because m_report_interval_ns / iter_time rounded to zero) or if the GPU clock hasn’t advanced yet, both denominators become zero and the SM traps with a divide-by-zero. Please bail out when delta == 0 and clamp iter_time to at least 1 before the final division.
```diff
@@
-    if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
-        ucx_perf_cuda_time_t cur_time  = ucx_perf_cuda_get_time_ns();
-        ucx_perf_cuda_time_t iter_time = (cur_time - m_last_report_time) /
-                                         (completed - m_ctx.completed_iters);
+    if ((threadIdx.x == 0) && ucs_unlikely(completed >= m_next_report_iter)) {
+        ucx_perf_counter_t delta_iters = completed - m_ctx.completed_iters;
+        if (delta_iters == 0) {
+            return;
+        }
+
+        ucx_perf_cuda_time_t cur_time   = ucx_perf_cuda_get_time_ns();
+        ucx_perf_cuda_time_t delta_time = cur_time - m_last_report_time;
+        if (delta_time == 0) {
+            delta_time = 1;
+        }
+
+        ucx_perf_cuda_time_t iter_time = ucs_max<ucx_perf_cuda_time_t>(delta_time / delta_iters, 1);
         m_last_report_time    = cur_time;
         m_ctx.completed_iters = completed;
         __threadfence_system();
-        m_next_report_iter = ucs_min(completed + (m_report_interval_ns / iter_time),
-                                     m_max_iters);
+        ucx_perf_cuda_time_t next_interval = m_report_interval_ns / iter_time;
+        if (next_interval == 0) {
+            next_interval = 1;
+        }
+        m_next_report_iter = ucs_min(completed + next_interval, m_max_iters);
     }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
In src/tools/perf/cuda/cuda_kernel.cuh around lines 52 to 61, the code can
divide by zero when computing per-iteration times because both delta =
(completed - m_ctx.completed_iters) and iter_time can be zero; compute and store
delta first, if delta == 0 bail out/skip the reporting block to avoid dividing
by zero, then compute iter_time and clamp it to at least 1 (e.g., use a max with
1) before using it in any division or in calculating m_next_report_iter so no
division by zero occurs.
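The guard logic the review asks for can be sketched in isolation as a small host-side helper; the types and names below are illustrative stand-ins for the perftest CUDA types, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t perf_time_t;

/* Sketch of the guarded per-iteration time computation. Returns 0 when no
 * iterations completed since the last report (the caller should skip the
 * reporting block entirely); otherwise returns iter_time clamped to >= 1 ns
 * so a later division by it can never trap. */
static perf_time_t safe_iter_time(perf_time_t cur_time, perf_time_t last_time,
                                  uint64_t completed, uint64_t last_completed)
{
    uint64_t delta_iters = completed - last_completed;
    if (delta_iters == 0) {
        return 0; /* nothing new: bail out before dividing */
    }

    perf_time_t delta_time = cur_time - last_time;
    if (delta_time == 0) {
        delta_time = 1; /* clock has not advanced yet */
    }

    perf_time_t iter_time = delta_time / delta_iters;
    return (iter_time != 0) ? iter_time : 1;
}
```

Both denominators from the original code (iteration delta and iter_time) are thus guaranteed non-zero before any division.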
```cpp
if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
    req   = nullptr;
    flags = static_cast<ucp_device_flags_t>(0);
} else {
    req = &m_requests[index];
    m_pending_map |= UCS_BIT(index);
}
```
Do not hand out a null request under flow control
When get_request() returns req == nullptr, the very next call to ucp_perf_cuda_send_async() passes that pointer straight into ucp_device_put_single/multi/multi_partial, all of which dereference req unconditionally. With device_fc_window > 1, we therefore crash every time this branch fires. Please always return a valid ucp_device_request_t* here and keep the existing bitmap semantics for the FC “prefetch” case.
```diff
- if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
-     req = nullptr;
-     flags = static_cast<ucp_device_flags_t>(0);
- } else {
-     req = &m_requests[index];
-     m_pending_map |= UCS_BIT(index);
- }
+ req = &m_requests[index];
+ if (fc && (m_pending_count < m_size) && (m_pending[index] < m_fc_window)) {
+     flags = static_cast<ucp_device_flags_t>(0);
+ } else {
+     m_pending_map |= UCS_BIT(index);
+ }
```
```cpp
case 'F':
    return parse_int(opt_arg, &params->super.device_fc_window,
                     "device flow control window size", 1, INT_MAX);
case 'y':
```
Fix the -F parser’s target type.
parse_int() expects an int*, but we are handing it an unsigned*. With -Werror (the default in CI) this is an incompatible pointer type error, so the whole build breaks. Please parse into a local int and assign it back once validation succeeds.
```diff
@@
- case 'F':
-     return parse_int(opt_arg, &params->super.device_fc_window,
-                      "device flow control window size", 1, INT_MAX);
+ case 'F': {
+     int fc_window;
+
+     status = parse_int(opt_arg, &fc_window,
+                        "device flow control window size", 1, INT_MAX);
+     if (status == UCS_OK) {
+         params->super.device_fc_window = fc_window;
+     }
+     return status;
+ }
```
+ }🤖 Prompt for AI Agents
In src/tools/perf/perftest_params.c around lines 728 to 731, the -F option
passes an unsigned int pointer to parse_int which expects an int*, causing an
incompatible pointer type error under -Werror; fix it by parsing into a local
int variable (call parse_int with &local_int and the same bounds), check
parse_int succeeded, then assign local_int to params->super.device_fc_window
(cast/convert to the unsigned field) after validation; ensure you preserve the
same bounds and error handling behavior.
Force-pushed: 7355ab4 to 84da1e1
```cpp
void init_params(const ucx_perf_context_t &perf)
{
    m_params.num_channels = perf.params.device_ep_channel_count;
```
Maybe just set it in ucp_perf_cuda_params_handler and remove this func.
src/tools/perf/perftest.c (outdated)
```cpp
params->super.ucp.send_datatype       = UCP_PERF_DATATYPE_CONTIG;
params->super.ucp.recv_datatype       = UCP_PERF_DATATYPE_CONTIG;
params->super.ucp.am_hdr_size         = 0;
params->super.device_ep_channel_count = 1;
```
IMO the default should be some large value like UINT_MAX, so by default each thread would use a different channel, since this is also controlled by the NUM_CHANNELS configuration of the GDAKI transport
```cpp
                       ucx_perf_counter_t idx, ucp_device_request_t *req,
                       ucp_device_flags_t flags = UCP_DEVICE_FLAG_NODELAY)
{
    const unsigned channel_id = threadIdx.x % params.num_channels;
```
IMO the channel id should be some random value (or at least have a "mode" where the channel is generated randomly), e.g. channel_mode = enum { single, random, per-thread }, because in a real scenario the "expert" index is random.
yosefe left a comment:
@iyastreb can you pls review as well?
```cpp
typedef enum {
    UCX_PERF_CHANNEL_MODE_SINGLE,     /* Use a single fixed channel ID (0) */
    UCX_PERF_CHANNEL_MODE_RANDOM,     /* Use random channel ID per operation */
    UCX_PERF_CHANNEL_MODE_PER_THREAD, /* Use thread ID modulo num_channels */
    UCX_PERF_CHANNEL_MODE_LAST
} ucx_perf_channel_mode_t;
```
nice
src/tools/perf/perftest_params.c (outdated)
```cpp
printf("                     request is sent.\n");
printf("  -N <mode>      channel selection mode for device tests (single)\n");
printf("                     single - use a single fixed channel (channel 0, default)\n");
printf("                     random - use random channel per operation\n");
```
maybe add the seed as part of the random mode string, e.g. `random:1245`, to not consume another argument letter ('S')
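Parsing a combined `random[:<seed>]` argument, as suggested, could look like the sketch below; the function name and return convention are illustrative, not the PR's actual code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: parse "random" or "random:<seed>" so the seed does not need its
 * own option letter. Returns 0 on success; *seed keeps its default value
 * when no ":<seed>" suffix is present. */
static int parse_random_mode(const char *arg, unsigned long *seed)
{
    static const char prefix[] = "random";
    const char *rest;
    char *end;
    unsigned long v;

    if (strncmp(arg, prefix, sizeof(prefix) - 1) != 0) {
        return -1; /* not the random mode at all */
    }

    rest = arg + sizeof(prefix) - 1;
    if (*rest == '\0') {
        return 0; /* plain "random": keep the default seed */
    }
    if (*rest != ':') {
        return -1; /* e.g. "randomx" is invalid */
    }

    v = strtoul(rest + 1, &end, 10);
    if ((end == rest + 1) || (*end != '\0')) {
        return -1; /* empty or trailing-garbage seed */
    }

    *seed = v;
    return 0;
}
```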
```cpp
unsigned reqs_count = ucs_div_round_up(ctx.max_outstanding,
                                       ctx.device_fc_window);
ucp_device_request_t *reqs = &shared_requests[reqs_count * thread_index];
curandState rand_state;
```
maybe add rand_state inside ucp_perf_cuda_request_manager, to not pass another parameter to all functions
```cpp
m_params.num_threads  = perf.params.device_thread_count;
m_params.num_channels = perf.params.device_num_channels;
m_params.channel_mode = perf.params.device_channel_mode;
m_params.random_seed  = perf.params.random_seed;
```
The idea was that params represent a request parameters.
Now it's a mixture of static config + request params. Since perf context is already passed to the kernel, what's the point of duplicating config in params?
```cpp
    uint64_t *counter_recv;
};

template<ucs_device_level_t level>
```
IMO making the entire class a template is not worth it:
- it's needed for just 2 functions; expanding the scope just makes compilation longer for no reason
- the caller's code becomes ugly: `ucs_status_t status = req_mgr.template get_request<fc>(req, flags);`

What's the rationale for this change?
right, if it was needed just because of adding m_rand_state, it is not worth it (or add it as a pointer)
src/tools/perf/perftest_params.c (outdated)
```cpp
printf("                     request is sent.\n");
printf("  -N <mode>      channel selection mode for device tests (single)\n");
printf("                     single - use a single fixed channel (channel 0, default)\n");
printf("                     random:<seed> - use random channel per operation with the given seed\n");
```
```diff
- printf("                     random:<seed> - use random channel per operation with the given seed\n");
+ printf("                     random[:<seed>] - use random channel per operation with optional random seed\n");
```
src/tools/perf/perftest.c (outdated)
```cpp
params->super.ucp.am_hdr_size      = 0;
params->super.device_num_channels  = UINT_MAX;
params->super.device_channel_mode  = UCX_PERF_CHANNEL_MODE_SINGLE;
params->super.random_seed          = time(0) ^ getpid();;
```
```diff
- params->super.random_seed = time(0) ^ getpid();;
+ params->super.random_seed = ucs_generate_uuid((uintptr_t)params);
```
```cpp
unsigned channel_id;

switch (ctx.channel_mode) {
case UCX_PERF_CHANNEL_MODE_SINGLE:
    channel_id = 0;
    break;
case UCX_PERF_CHANNEL_MODE_RANDOM:
    channel_id = curand(rand_state) % ctx.num_channels;
    break;
case UCX_PERF_CHANNEL_MODE_PER_THREAD:
default:
    channel_id = (blockIdx.x *
                  ucx_perf_cuda_thread_index<level>(ctx.num_threads) +
                  ucx_perf_cuda_thread_index<level>(threadIdx.x)) %
                 ctx.num_channels;
    break;
}
```
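The three selection modes above can be emulated on the host for a quick sanity check; `rand_fn` stands in for `curand()`, the CUDA builtins become explicit arguments, and all names are illustrative:

```c
#include <assert.h>

typedef enum {
    CHANNEL_MODE_SINGLE,
    CHANNEL_MODE_RANDOM,
    CHANNEL_MODE_PER_THREAD
} channel_mode_t;

/* Host-side emulation of the kernel's channel selection switch:
 * single  -> always channel 0
 * random  -> rand_fn() modulo num_channels (rand_fn emulates curand)
 * per-thread -> global thread index modulo num_channels */
static unsigned select_channel(channel_mode_t mode, unsigned num_channels,
                               unsigned thread_idx, unsigned block_idx,
                               unsigned block_dim, unsigned (*rand_fn)(void))
{
    switch (mode) {
    case CHANNEL_MODE_SINGLE:
        return 0;
    case CHANNEL_MODE_RANDOM:
        return rand_fn() % num_channels;
    case CHANNEL_MODE_PER_THREAD:
    default:
        return (block_idx * block_dim + thread_idx) % num_channels;
    }
}

/* Deterministic stand-in for curand(), for testing only */
static unsigned fake_rand(void)
{
    return 12345;
}
```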
This change adds 2 more dependencies to the function, and makes it violate the single responsibility principle.
I propose 3 things:
- This function should take only a precomputed channel_id
- channel_id should be calculated in the request manager, so that the manager returns the valid channel_id, not `req_mgr.get_rand_state();`
- Inside the manager we may cache `num_threads` and `num_channels` to avoid expensive reads from the shared memory (ctx)
```cpp
const char *channel_str = getenv("UCX_RC_GDA_NUM_CHANNELS");
if (channel_str) {
    params->super.device_num_channels = atoi(channel_str);
}
```
seems very weird, why is this needed?
we can pass arbitrarily large channel id to the UCP/UCT device APIs, and the transport should anyway do % operation to select the right channel according to the number of QPs
UCP/UCT doesn't do %. Perftest needs to know how many channels were created and do % on the user-app side.
Maybe add an option for perftest and adjust the ucp config in perftest? Or any better suggestions?
UCT should do %.
because from API perspective, there is no function that returns number of channels. So any number should work.
```cpp
return (blockIdx.x *
        ucx_perf_cuda_thread_index<level>(m_num_threads) +
        ucx_perf_cuda_thread_index<level>(threadIdx.x)) %
       m_num_channels;
```
can we do smth like `threadIdx.x + blockIdx.x * blockDim.x`?
IMO, no need to calculate the modulo by number of channels
then we can remove m_num_threads/m_num_channels member variables
Removed m_num_threads. We may still need ucx_perf_cuda_thread_index<level>(..) to get thread_id, because the perftest might run at warp level.
can we simplify the calculation to be `ucx_perf_cuda_thread_index<level>(threadIdx.x + blockIdx.x * blockDim.x)`?
```cpp
case UCX_PERF_CHANNEL_MODE_SINGLE:
    return 0;
case UCX_PERF_CHANNEL_MODE_RANDOM:
    return curand(m_rand_state) % m_num_channels;
```
maybe

```diff
- return curand(m_rand_state) % m_num_channels;
+ return curand(m_rand_state) % (gridDim.x * blockDim.x);
```
do we really need to save num_channels?
can we calculate ucx_perf_cuda_thread_index<level>(gridDim.x * blockDim.x) ?
```cpp
unsigned thread_index = ucx_perf_cuda_thread_index<level>(threadIdx.x);
unsigned num_threads  = ucx_perf_cuda_thread_index<level>(ctx.num_threads);
unsigned global_thread_id = blockIdx.x * num_threads + thread_index;
```
IMO the global thread id can be calculated in a much simpler way: `threadIdx.x + blockIdx.x * blockDim.x`, so the helper function `ucp_perf_cuda_init_rand_state` is not needed.
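The simpler global thread id proposed here is the standard CUDA linearization; a host-side sketch with the builtins replaced by explicit arguments:

```c
#include <assert.h>

/* Sketch of the simplified global thread id:
 * threadIdx.x + blockIdx.x * blockDim.x
 * (CUDA builtins passed as arguments for a host-side check) */
static unsigned global_thread_id(unsigned thread_idx, unsigned block_idx,
                                 unsigned block_dim)
{
    return thread_idx + block_idx * block_dim;
}
```

With this form, neither `num_threads` nor a separate helper is needed: the id is unique across the whole grid as long as `thread_idx < block_dim`.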
What?
Support channel id in ucx perftest.
Why?
Improve performance by distributing requests across QPs.
How?
Select the channel by thread ID modulo the number of channels.