Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Cupti PM sampling support for gpu profiler #24406

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

cashton-nvidia
Copy link

Adds perf metric sampling to the Cupti gpu tracer.

Adds a unit test which validates expected number of flops are sampled.

Copy link

google-cla bot commented Mar 31, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Copy link

@rengoog rengoog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cashton-nvidia ! 2 more nits: other than that LGTM.

@cashton-nvidia cashton-nvidia marked this pull request as draft March 31, 2025 21:05
@cashton-nvidia cashton-nvidia force-pushed the add-cupti-pm-sampling branch from 19630ba to f2e570b Compare April 1, 2025 15:56
@cashton-nvidia cashton-nvidia marked this pull request as ready for review April 1, 2025 15:59
@cashton-nvidia cashton-nvidia marked this pull request as draft April 2, 2025 17:40
@cashton-nvidia cashton-nvidia marked this pull request as ready for review April 2, 2025 17:40
@cashton-nvidia cashton-nvidia requested a review from rengoog April 2, 2025 17:40
Comment on lines 31 to 37
// CUPTI PM sampling headers added in 12.6
// Note - bug fix for PM sampling added after 12.6.2
#if CUDA_VERSION >= 12060
#define CUPTI_PM_SAMPLING 1
#else
#define CUPTI_PM_SAMPLING 0
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#if, #ifdef, or #ifndef macros are banned in XLA unless they cannot be avoided. Do you think you can refactor the code to avoid preprocessor conditionals (or at least significantly reduce the number of usages)? E.g.

  • Return a absl::Unimplemented errors (or equivalent)
  • Skip tests. I.e. if (!CUPTI_PM_SAMPLING) { GTEST_SKIP() }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are #define macros banned as well? Or just the #if test?

Do you have an example of how XLA currently handles versioning in external dependencies where APIs / external headers & symbols are possibly not present, and does this without macros?

If not, first thoughts -
I'd imagine most / all of the #if cases in the tracer can be replaced with unimplemented errors.

The interface though seems more problematic - the new headers need to be avoided. Since the preprocessor runs first, I'm not aware of any tricks except #if to avoid those includes.

Beyond including the headers, there's also the issue of the wrapper itself. I could use #if #else to have alternate wrappers for these new possibly missing functions, or I could probably write some weak symbol replacements for the missing symbols. I might be able to get the compiler to completely eliminate the references to the missing symbols and not need this, but I think that would rely on the compiler optimizations more than we can guarantee. All of those options are more complicated to maintain than what's there currently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done what I can - couldn't completely remove #if in a couple of cases - I needed it to include the headers, and to define a couple of types before use, but everywhere else I used compiler elimination of dead code to avoid errors due to missing functions / types. Can you take a look?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are #define macros banned as well? Or just the #if test?

Just #if tests.

I needed it to include the headers, and to define a couple of types before use, but everywhere else I used compiler elimination of dead code to avoid errors due to missing functions / types. Can you take a look?

Right, guarding new headers can't be avoided. Overall LGTM.

@@ -0,0 +1,29 @@
/* Copyright 2024 The OpenXLA Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add missing #ifndef header guard.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh is pragma once not sufficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of Google C++ style guide: https://google.github.io/styleguide/cppguide.html#The__define_Guard

It triggers a blocking LINT error internally. Please remove #pragma once in favor of #ifndef header guard

@@ -0,0 +1,29 @@
/* Copyright 2024 The OpenXLA Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of Google C++ style guide: https://google.github.io/styleguide/cppguide.html#The__define_Guard

It triggers a blocking LINT error internally. Please remove #pragma once in favor of #ifndef header guard

Comment on lines +34 to +49

using xla::profiler::CuptiInterface;
using xla::profiler::CuptiTracer;
using xla::profiler::CuptiTracerCollectorOptions;
using xla::profiler::CuptiTracerOptions;
using xla::profiler::CuptiWrapper;

// Needed to create different cupti tracer for each test cases.
class TestableCuptiTracer : public CuptiTracer {
public:
explicit TestableCuptiTracer()
: CuptiTracer(new CuptiErrorManager(std::make_unique<CuptiWrapper>())) {}
};


namespace {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved into the anonymous namespace?

Suggested change
using xla::profiler::CuptiInterface;
using xla::profiler::CuptiTracer;
using xla::profiler::CuptiTracerCollectorOptions;
using xla::profiler::CuptiTracerOptions;
using xla::profiler::CuptiWrapper;
// Needed to create different cupti tracer for each test cases.
class TestableCuptiTracer : public CuptiTracer {
public:
explicit TestableCuptiTracer()
: CuptiTracer(new CuptiErrorManager(std::make_unique<CuptiWrapper>())) {}
};
namespace {
namespace {
using xla::profiler::CuptiInterface;
using xla::profiler::CuptiTracer;
using xla::profiler::CuptiTracerCollectorOptions;
using xla::profiler::CuptiTracerOptions;
using xla::profiler::CuptiWrapper;
// Needed to create different cupti tracer for each test cases.
class TestableCuptiTracer : public CuptiTracer {
public:
explicit TestableCuptiTracer()
: CuptiTracer(new CuptiErrorManager(std::make_unique<CuptiWrapper>())) {}
};

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the style guide link. I'd looked through xla source for a coding conventions reference and didn't find anything, though some sub parts of it referenced other conventions. (I'm surrpised pragma once is forbidden since every compiler I'm aware of supports it, but I know it's technically non-standard.) Will make these changes.

@@ -91,8 +91,113 @@ class MockCupti : public xla::profiler::CuptiInterface {
MOCK_METHOD(CUptiResult, GetGraphExecId,
(CUgraphExec graph_exec, uint32_t* graph_id), (override));

#if CUPTI_PM_SAMPLING
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this - I'm surprised it worked. Tested by building with CUPTI_PM_SAMPLING forced to both 0 and 1, and it worked. Must not be any live dependencies on the mock code? Fixed.

Comment on lines +1670 to +1672
if (VLOG_IS_ON(2)) {
LOG(INFO) << "(Profiling::PM Sampling) Top of decode loop";
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (VLOG_IS_ON(2)) {
LOG(INFO) << "(Profiling::PM Sampling) Top of decode loop";
}
VLOG(2) << "(Profiling::PM Sampling) Top of decode loop";

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't clear on what vlog resolved to - if it's equivalent to log(info), I'll switch all these to the simpler code where possible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is equivalent to LOG(INFO). I'd say you only want to use if (VLOG_IS_ON(...)) to guard against evaluating an expensive statement. E.g.

if (VLOG_IS_ON(1)) {
  Foo foo = Expensive();
  LOG(INFO) << foo;
}

But you can often rewrite as:

VLOG(1) << Expensive();

which is equivalent.

Comment on lines +1682 to +1685
if (VLOG_IS_ON(2)) {
LOG(INFO) << "(Profiling::PM Sampling) Beginning decode for device " <<
dev->device_id_;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (VLOG_IS_ON(2)) {
LOG(INFO) << "(Profiling::PM Sampling) Beginning decode for device " <<
dev->device_id_;
}
VLOG(2) << "(Profiling::PM Sampling) Beginning decode for device " <<
dev->device_id_;

"restart session";
}
if (ret != CUPTI_SUCCESS) {
return tsl::errors::Internal("CUPTI error during "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace all tsl::errors:: usages with absl:: equivalents. E.g. absl::InternalError.

class PmSamplingDevice {
// Internal state
// Declared roughly in order of initialization
char const* chipName_ = nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CUPTI APIs use char *, so this avoids converting to / from strings. This chip name is produced by CUPTI and then later passed to other APIs. It's not used by other XLA / xprof code. Do we still need to convert it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am asking because it looks there is a memory leak.

We have chipName_ = strdup(p.pChipName) which IIUC creates a copy of p.pChipName. Should the copy be be freed when PmSamplingDevice is destroyed? If chipName_ is stored as a std::string it will be freed automatically when PmSamplingDevice is destroyed.

This chip name is produced by CUPTI and then later passed to other APIs.

Note you can just call std::string::data to get back a const char* when calling other APIs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, true. The string size would be small, ~6 bytes, but would be lost in this case. Switching seemed more complex but w/ it being so simple I hadn't considered the leak.

Comment on lines +105 to +106
auto start_walltime_ns = absl::GetCurrentTimeNanos();
auto start_gputime_ns = CuptiTracer::GetTimestamp();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use auto here. It only saves a few characters.

https://google.github.io/styleguide/cppguide.html#Type_deduction

std::atomic_uint64_t atomic_total_fp64 = 0;
std::atomic_uint64_t atomic_total_read = 0;
std::atomic_uint64_t atomic_total_write = 0;
bool skip_first = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be std::atomic_bool?

Comment on lines 31 to 37
// CUPTI PM sampling headers added in 12.6
// Note - bug fix for PM sampling added after 12.6.2
#if CUDA_VERSION >= 12060
#define CUPTI_PM_SAMPLING 1
#else
#define CUPTI_PM_SAMPLING 0
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are #define macros banned as well? Or just the #if test?

Just #if tests.

I needed it to include the headers, and to define a couple of types before use, but everywhere else I used compiler elimination of dead code to avoid errors due to missing functions / types. Can you take a look?

Right, guarding new headers can't be avoided. Overall LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants