-
Notifications
You must be signed in to change notification settings - Fork 533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Cupti PM sampling support for gpu profiler #24406
base: main
Are you sure you want to change the base?
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cashton-nvidia ! 2 more nits: other than that LGTM.
19630ba
to
f2e570b
Compare
…-sampling' into add-cupti-pm-sampling
// CUPTI PM sampling headers added in 12.6 | ||
// Note - bug fix for PM sampling added after 12.6.2 | ||
#if CUDA_VERSION >= 12060 | ||
#define CUPTI_PM_SAMPLING 1 | ||
#else | ||
#define CUPTI_PM_SAMPLING 0 | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#if
, #ifdef
, or #ifndef
macros are banned in XLA unless they cannot be avoided. Do you think you can refactor the code to avoid preprocessor conditionals (or at least significantly reduce the number of usages)? E.g.
- Return a
absl::Unimplemented
errors (or equivalent) - Skip tests. I.e.
if (!CUPTI_PM_SAMPLING) { GTEST_SKIP() }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are #define macros banned as well? Or just the #if test?
Do you have an example of how XLA currently handles versioning in external dependencies where APIs / external headers & symbols are possibly not present, and does this without macros?
If not, first thoughts -
I'd imagine most / all of the #if cases in the tracer can be replaced with unimplemented errors.
The interface though seems more problematic - the new headers need to be avoided. Since the preprocessor runs first, I'm not aware of any tricks except #if to avoid those includes.
Beyond including the headers, there's also the issue of the wrapper itself. I could use #if #else to have alternate wrappers for these new possibly missing functions, or I could probably write some weak symbol replacements for the missing symbols. I might be able to get the compiler to completely eliminate the references to the missing symbols and not need this, but I think that would rely on the compiler optimizations more than we can guarantee. All of those options are more complicated to maintain than what's there currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done what I can - couldn't completely remove #if in a couple of cases - I needed it to include the headers, and to define a couple of types before use, but everywhere else I used compiler elimination of dead code to avoid errors due to missing functions / types. Can you take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are #define macros banned as well? Or just the #if test?
Just #if tests.
I needed it to include the headers, and to define a couple of types before use, but everywhere else I used compiler elimination of dead code to avoid errors due to missing functions / types. Can you take a look?
Right, guarding new headers can't be avoided. Overall LGTM.
@@ -0,0 +1,29 @@ | |||
/* Copyright 2024 The OpenXLA Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add missing #ifndef header guard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh is pragma once not sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of Google C++ style guide: https://google.github.io/styleguide/cppguide.html#The__define_Guard
It triggers a blocking LINT error internally. Please remove #pragma once
in favor of #ifndef header guard
@@ -0,0 +1,29 @@ | |||
/* Copyright 2024 The OpenXLA Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of Google C++ style guide: https://google.github.io/styleguide/cppguide.html#The__define_Guard
It triggers a blocking LINT error internally. Please remove #pragma once
in favor of #ifndef header guard
|
||
using xla::profiler::CuptiInterface; | ||
using xla::profiler::CuptiTracer; | ||
using xla::profiler::CuptiTracerCollectorOptions; | ||
using xla::profiler::CuptiTracerOptions; | ||
using xla::profiler::CuptiWrapper; | ||
|
||
// Needed to create different cupti tracer for each test cases. | ||
class TestableCuptiTracer : public CuptiTracer { | ||
public: | ||
explicit TestableCuptiTracer() | ||
: CuptiTracer(new CuptiErrorManager(std::make_unique<CuptiWrapper>())) {} | ||
}; | ||
|
||
|
||
namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be moved into the anonymous namespace?
using xla::profiler::CuptiInterface; | |
using xla::profiler::CuptiTracer; | |
using xla::profiler::CuptiTracerCollectorOptions; | |
using xla::profiler::CuptiTracerOptions; | |
using xla::profiler::CuptiWrapper; | |
// Needed to create different cupti tracer for each test cases. | |
class TestableCuptiTracer : public CuptiTracer { | |
public: | |
explicit TestableCuptiTracer() | |
: CuptiTracer(new CuptiErrorManager(std::make_unique<CuptiWrapper>())) {} | |
}; | |
namespace { | |
namespace { | |
using xla::profiler::CuptiInterface; | |
using xla::profiler::CuptiTracer; | |
using xla::profiler::CuptiTracerCollectorOptions; | |
using xla::profiler::CuptiTracerOptions; | |
using xla::profiler::CuptiWrapper; | |
// Needed to create different cupti tracer for each test cases. | |
class TestableCuptiTracer : public CuptiTracer { | |
public: | |
explicit TestableCuptiTracer() | |
: CuptiTracer(new CuptiErrorManager(std::make_unique<CuptiWrapper>())) {} | |
}; | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the style guide link. I'd looked through xla source for a coding conventions reference and didn't find anything, though some sub parts of it referenced other conventions. (I'm surrpised pragma once is forbidden since every compiler I'm aware of supports it, but I know it's technically non-standard.) Will make these changes.
@@ -91,8 +91,113 @@ class MockCupti : public xla::profiler::CuptiInterface { | |||
MOCK_METHOD(CUptiResult, GetGraphExecId, | |||
(CUgraphExec graph_exec, uint32_t* graph_id), (override)); | |||
|
|||
#if CUPTI_PM_SAMPLING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this - I'm surprised it worked. Tested by building with CUPTI_PM_SAMPLING forced to both 0 and 1, and it worked. Must not be any live dependencies on the mock code? Fixed.
if (VLOG_IS_ON(2)) { | ||
LOG(INFO) << "(Profiling::PM Sampling) Top of decode loop"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (VLOG_IS_ON(2)) { | |
LOG(INFO) << "(Profiling::PM Sampling) Top of decode loop"; | |
} | |
VLOG(2) << "(Profiling::PM Sampling) Top of decode loop"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't clear on what vlog resolved to - if it's equivalent to log(info), I'll switch all these to the simpler code where possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is equivalent to LOG(INFO)
. I'd say you only want to use if (VLOG_IS_ON(...))
to guard against evaluating an expensive statement. E.g.
if (VLOG_IS_ON(1)) {
Foo foo = Expensive();
LOG(INFO) << foo;
}
But you can often rewrite as:
VLOG(1) << Expensive();
which is equivalent.
if (VLOG_IS_ON(2)) { | ||
LOG(INFO) << "(Profiling::PM Sampling) Beginning decode for device " << | ||
dev->device_id_; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (VLOG_IS_ON(2)) { | |
LOG(INFO) << "(Profiling::PM Sampling) Beginning decode for device " << | |
dev->device_id_; | |
} | |
VLOG(2) << "(Profiling::PM Sampling) Beginning decode for device " << | |
dev->device_id_; |
"restart session"; | ||
} | ||
if (ret != CUPTI_SUCCESS) { | ||
return tsl::errors::Internal("CUPTI error during " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace all tsl::errors::
usages with absl::
equivalents. E.g. absl::InternalError
.
class PmSamplingDevice { | ||
// Internal state | ||
// Declared roughly in order of initialization | ||
char const* chipName_ = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CUPTI APIs use char *, so this avoids converting to / from strings. This chip name is produced by CUPTI and then later passed to other APIs. It's not used by other XLA / xprof code. Do we still need to convert it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am asking because it looks there is a memory leak.
We have chipName_ = strdup(p.pChipName)
which IIUC creates a copy of p.pChipName
. Should the copy be be freed when PmSamplingDevice
is destroyed? If chipName_
is stored as a std::string
it will be freed automatically when PmSamplingDevice
is destroyed.
This chip name is produced by CUPTI and then later passed to other APIs.
Note you can just call std::string::data
to get back a const char*
when calling other APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, true. The string size would be small, ~6 bytes, but would be lost in this case. Switching seemed more complex but w/ it being so simple I hadn't considered the leak.
auto start_walltime_ns = absl::GetCurrentTimeNanos(); | ||
auto start_gputime_ns = CuptiTracer::GetTimestamp(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use auto here. It only saves a few characters.
https://google.github.io/styleguide/cppguide.html#Type_deduction
std::atomic_uint64_t atomic_total_fp64 = 0; | ||
std::atomic_uint64_t atomic_total_read = 0; | ||
std::atomic_uint64_t atomic_total_write = 0; | ||
bool skip_first = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be std::atomic_bool?
// CUPTI PM sampling headers added in 12.6 | ||
// Note - bug fix for PM sampling added after 12.6.2 | ||
#if CUDA_VERSION >= 12060 | ||
#define CUPTI_PM_SAMPLING 1 | ||
#else | ||
#define CUPTI_PM_SAMPLING 0 | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are #define macros banned as well? Or just the #if test?
Just #if tests.
I needed it to include the headers, and to define a couple of types before use, but everywhere else I used compiler elimination of dead code to avoid errors due to missing functions / types. Can you take a look?
Right, guarding new headers can't be avoided. Overall LGTM.
Adds perf metric sampling to the Cupti gpu tracer.
Adds a unit test which validates expected number of flops are sampled.