-
Notifications
You must be signed in to change notification settings - Fork 603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-8687 Track vassert messages for crash log #25051
base: dev
Are you sure you want to change the base?
CORE-8687 Track vassert messages for crash log #25051
Conversation
5b24f93
to
cf31a6c
Compare
Retry command for Build#61645please wait until all jobs are finished before running the slash command
|
CI test resultstest results on build#61645
test results on build#61689
test results on build#61730
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, mostly just questions from me. (+ fyi the new unit tests are failing)
@@ -678,6 +678,82 @@ | |||
] | |||
} | |||
] | |||
}, | |||
{ | |||
"path": "/v1/debug/store_message/{shard}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Docs Team reviewers: These routes are only for testing and should not be documented.
// According to https://en.cppreference.com/w/c/program/signal, referring to | ||
// objects with static or thread-local storage duration that is not | ||
// lock-free atomic is UB. | ||
static thread_local inline std::atomic<const char*> _abuffer{nullptr}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please remind me again why this needs to be static thread_local inline
? And it does look like this needs to be static thread_local inline
because it doesn't work as a regular member variable based on my testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we need it unique to each thread, otherwise if >1 thread vasserts at the same time (and it was declared say static inline
) it would write to the same buffer
Yeah the perils of having two build systems. Forgot |
cf31a6c
to
6bd5f0c
Compare
Force push 6bd5f0c:
|
@@ -99,6 +99,6 @@ static thread_local assert_log_holder g_assert_log_holder; | |||
__LINE__, \ | |||
#x, \ | |||
##args); \ | |||
__builtin_trap(); \ | |||
abort(); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I might suggest
std::abort()
to avoid any potential for the wrong symbol being used here since this file is included pretty much everywhere in redpanda. -
are we at any risk for the same deadlock problems as glibc 2.26 (listed in the notes on the man page) related to the closing/flushing of IO streams, particularly since we are going to be doing I/O in response to sigabort?
-
one of the original motivations behind builtin_trap was that it minimized the amount of code that was inlined into a vassert usage. presumably this isn't really an issue (at least in godbolt i see just a
call
instruction), but just noting that here for completeness.
The only downside of this switch is that now we will end up printing a stacktrace twice on vassert crashes (once in the vassert macro and once in seastar's SIGABRT handler).
presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?
Our issue with builtin_trap is that what it does is implementation dependent. In I think most cases it raises a SIGILL
but not all. We could change it to call raise(SIGILL)
if we wish to maintain a similar behavior and get around any possible deadlocking issues caused by abort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me rephrase. is there a benefit to
vassert():
logging
signal
signal_handler():
crash reporting
vs
vassert():
logging
crash reporting
builtin_trap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
talked with @pgellert about this and we believe the second approach is preferrable. Less work in the handler is better, and avoids any annoying issues with memory of I/O getting messed up in the signal handler.
if constexpr (admin_server::is_store_message_enabled()) { | ||
register_route<superuser>( | ||
ss::httpd::debug_json::put_store_message, | ||
[this](std::unique_ptr<ss::http::request> req) { | ||
return put_store_message(std::move(req)); | ||
}); | ||
register_route<superuser>( | ||
ss::httpd::debug_json::get_store_message, | ||
[this](std::unique_ptr<ss::http::request> req) { | ||
return get_store_message(std::move(req)); | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its worth debating if these are actually compiled out vs compiled in but disabled. for example compiling them in, and requiring some test-only runtime switch to make them active woudl mean we could test in both debug and release modes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took inspiration from the failure injector stuff. I agree that doing a runtime switch is preferable (we kind of do that with the RP_FIXTURE_TEST
environment variable for the OpenSSL Context stuff in unit tests). Happy to discuss a path forward on this and adapting it to other things like failure injector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't matter to me one way or the other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's continue with this approach for now and get some eyes/consensus on 9051. Once that's settled we'll come back and adjust.
Created a structure that will be used to hold assert messages and the backtrace from calls to `vassert` that will be used by the crash tracker when a SIGABRT is signaled. Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
* Added assertions crash_type * When in sigabrt, check if there is an event held in the assert log holder and if there is, use that as the crash description Signed-off-by: Michael Boquard <[email protected]>
abort() raises SIGABRT which is catchable so we can log the error in the crash report (unlike __builtin_trap which may raise SIGILL, depending on implementation). Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Implement handlers for storing and retrieving messages stored in the assert log holder. Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Changes necessary due to changes in how vassert messaging is constructed. Signed-off-by: Michael Boquard <[email protected]>
6bd5f0c
to
432d396
Compare
Force push 432d396:
|
// lock-free atomic is UB. | ||
static thread_local inline std::atomic<const char*> _abuffer{nullptr}; | ||
}; | ||
static thread_local assert_log_holder g_assert_log_holder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm concerned this isn't going to do what you want (or that i'm mistaken about what the exact goal is). here you're going to i think get a separate instance of assert_log_holder
in each translation unit (static foo
in a header--very rare what is intended), then those would all contain another thread_local _abuffer
that points to a single instance (since its inline)? it's a bit of brain twister.
would this be simpler:
class assert_log_holder {
std::atomic...;
};
inline thread_local assert_log_holder g_assert_log_holder
here you get a single global instance of assert_log_holder
per thread.
This pull request adds the capability to the crash tracker to log out aborts caused by
vassert
.vassert
messages are now captured in a thread local storage instance that will be accessed bythe
SIGABRT
signal handler. This message will be written out to the generated crash reportvassert
macro will now callabort()
rather than__builtin_trap()
. This is because the implementation of__builtin_trap()
is implementation dependent and may or may not raise aSIGILL
signal.Backports Required
Release Notes