Skip to content

[nccl-profiler] fix profiling logic for proxyCtrl idle/active events #1662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

jushg
Copy link

@jushg jushg commented Mar 26, 2025

Context for this change

ProxyCtrl event state is only recorded in exampleProfilerRecordEventState, and not exampleProfilerStartEvent or exampleProfilerStopEvent.

This is fine as long as we can guarantee that the start/stop goes together with the record state function. However, that's not true for ncclProfilerProxyCtrlIdle/ncclProfilerProxyCtrlActive state used in this code excerpt from proxy.cc

int lastIdle = 0;
  /* Too frequent call of ncclProxyGetPostedOps() will result in perf regression for small message
   * communication. proxyOpAppendCounter is a counter that helps us decide if we need to append proxy ops.
   * After each progress, proxyOpAppendCounter will increase by 1 and compare with environment variable
   * ncclParamProgressAppendOpFreq(). If they are equal, we will append proxy ops. This will decrease the
   * frequency of calling ncclProxyGetPostedOps() and reduce the perf impact. */
  int proxyOpAppendCounter = 0;
  while (state->stop == 0 || (state->stop == 1 && state->active)) {
    int idle = 1;
    ncclResult_t ret = progressOps(proxyState, state, state->active, &idle);
    if (ret != ncclSuccess) {
      __atomic_store_n(&proxyState->asyncResult, ret, __ATOMIC_RELEASE);
      INFO(NCCL_ALL,"%s:%d -> %d [Progress Thread]", __FILE__, __LINE__, ret);
      continue;
    }
    void* eHandle;
    ncclProfilerStartProxyCtrlEvent(proxyState->profilerContext, &eHandle);
    if (lastIdle == 0 && idle == 1) ncclProfilerRecordProxyCtrlEventState(eHandle, 0, ncclProfilerProxyCtrlIdle);
    if (lastIdle == 1 && idle == 0) ncclProfilerRecordProxyCtrlEventState(eHandle, 0, ncclProfilerProxyCtrlActive);
    ncclProfilerStopProxyCtrlEvent(eHandle);
   .....
   .....
    lastIdle = idle;
}

When idle = 1 and lastIdle = 1 (or both = 0), then we cannot register the correct state for ProxyCtrl event -> that state field become uninitialized, or some default undefined state.

The fix

Without the correct state, we are attempting to fprintf an uninitialized const char* str;, which is undefined behavior. This fix (1st commit) address that by adding an else clause for early return in that case.

I also propose a fix (2nd commit) for the root cause of this by adding an if guard around the proxyCtrl profiling logic mentioned above. This will also improve the profiler performance a lot since we would not have to constantly profile proxy Ctrl event if proxy thread idling state doesn't change.

@gcongiu
Copy link
Contributor

gcongiu commented Apr 8, 2025

Thank you for the contribution. This will be in version 2.27

@jushg
Copy link
Author

jushg commented Apr 9, 2025

Thanks for taking a look at this. I will close this PR now

@jushg jushg closed this Apr 9, 2025
@jushg
Copy link
Author

jushg commented Apr 9, 2025

@gcongiu just realized that the expression can probably be simplified to lastIdle != idle FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants