Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8687 Track vassert messages for crash log #25051

Open
wants to merge 11 commits into
base: dev
Choose a base branch
from

Conversation

michael-redpanda
Copy link
Contributor

@michael-redpanda michael-redpanda commented Feb 6, 2025

This pull request adds the capability to the crash tracker to log out aborts caused by vassert.

  • vassert messages are now captured in a thread local storage instance that will be accessed by
    the SIGABRT signal handler. This message will be written out to the generated crash report
  • The vassert macro will now call abort() rather than __builtin_trap(). This is because the implementation of __builtin_trap() is implementation dependent and may or may not raise a SIGILL signal.
  • New debug mode only endpoints are added for testing purposes to set the value contained by the assert message storage class for testing

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • None

@michael-redpanda michael-redpanda requested review from pgellert and a team February 6, 2025 01:47
@michael-redpanda michael-redpanda self-assigned this Feb 6, 2025
@michael-redpanda michael-redpanda requested a review from a team as a code owner February 6, 2025 01:48
@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch 3 times, most recently from 5b24f93 to cf31a6c Compare February 6, 2025 01:52
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 6, 2025

Retry command for Build#61645

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/storage_failure_injection_test.py::StorageFailureInjectionTest.test_storage_failure_injection

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 6, 2025

CI test results

test results on build#61645
test_id test_kind job_url test_status passed
assert_log_holder_test_rpunit.assert_log_holder_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61645#0194d8f6-2372-4a0e-ba8d-84c0f118a065 FAIL 0/2
assert_log_holder_test_rpunit.assert_log_holder_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61645#0194d8f6-2373-411c-a317-0909013d80f2 FAIL 0/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d954-7b74-4708-9428-2495e0f5876f FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d954-7b74-4708-9428-2495e0f5876f FLAKY 1/2
rptest.tests.storage_failure_injection_test.StorageFailureInjectionTest.test_storage_failure_injection ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d938-77ec-4564-9024-c364ca23d33d FAIL 0/20
rptest.tests.storage_failure_injection_test.StorageFailureInjectionTest.test_storage_failure_injection ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d954-7b75-4de3-aa6f-724aba69073c FAIL 0/20
test results on build#61689
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd28-3869-4b8b-984d-460f70d5cf39 FLAKY 1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abf-4d74-967d-9ca4d910b390 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd28-3869-483d-ab3e-70c2c8beb87d FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abd-4b2c-91f4-f3b5f295937e FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abd-47b1-af84-366e29982458 FLAKY 1/4
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abd-4b2c-91f4-f3b5f295937e FLAKY 1/2
test results on build#61730
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e113-6369-4853-8da4-fbf4865b1ff5 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e0f7-5783-4198-8ed7-90516a7ec14a FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e113-6368-4a7b-8793-07cb1d10d25b FLAKY 1/5
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e113-6368-454a-a30e-7d261e447951 FLAKY 1/3
rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_node_recovery.recovery_type=full ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e0f7-5781-4532-9b56-54f6ec3a1fb5 FLAKY 1/2

Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, mostly just questions from me. (+ fyi the new unit tests are failing)

src/v/base/vassert.h Show resolved Hide resolved
@@ -678,6 +678,82 @@
]
}
]
},
{
"path": "/v1/debug/store_message/{shard}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Docs Team reviewers: These routes are only for testing and should not be documented.

src/v/base/vassert.h Show resolved Hide resolved
// According to https://en.cppreference.com/w/c/program/signal, referring to
// objects with static or thread-local storage duration that is not
// lock-free atomic is UB.
static thread_local inline std::atomic<const char*> _abuffer{nullptr};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remind me again why this needs to be static thread_local inline? And it does look like this needs to be static thread_local inline because it doesn't work as a regular member variable based on my testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we need it unique to each thread, otherwise if >1 thread vasserts at the same time (and it was declared say static inline) it would write to the same buffer

src/v/crash_tracker/recorder.cc Show resolved Hide resolved
@michael-redpanda michael-redpanda requested review from a team, BenPope and IoannisRP and removed request for a team February 6, 2025 19:16
@michael-redpanda
Copy link
Contributor Author

(+ fyi the new unit tests are failing)

Yeah the perils of having two build systems. Forgot GTEST in rp_unit

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from cf31a6c to 6bd5f0c Compare February 6, 2025 19:44
@michael-redpanda
Copy link
Contributor Author

Force push 6bd5f0c:

  • Added use of the assert storage class to unreachable()
  • Fixed issue with tests CMake file
  • Adjusted failure injection log allow list

@@ -99,6 +99,6 @@ static thread_local assert_log_holder g_assert_log_holder;
__LINE__, \
#x, \
##args); \
__builtin_trap(); \
abort(); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I might suggest std::abort() to avoid any potential for the wrong symbol being used here since this file is included pretty much everywhere in redpanda.

  • are we at any risk for the same deadlock problems as glibc 2.26 (listed in the notes on the man page) related to the closing/flushing of IO streams, particularly since we are going to be doing I/O in response to sigabort?

  • one of the original motivations behind builtin_trap was that it minimized the amount of code that was inlined into a vassert usage. presumably this isn't really an issue (at least in godbolt i see just a call instruction), but just noting that here for completeness.

The only downside of this switch is that now we will end up printing a stacktrace twice on vassert crashes (once in the vassert macro and once in seastar's SIGABRT handler).

presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?

Our issue with builtin_trap is that what it does is implementation dependent. In I think most cases it raises a SIGILL but not all. We could change it to call raise(SIGILL) if we wish to maintain a similar behavior and get around any possible deadlocking issues caused by abort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me rephrase. is there a benefit to

vassert():
logging
signal

signal_handler():
crash reporting

vs

vassert():
logging
crash reporting
builtin_trap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked with @pgellert about this and we believe the second approach is preferrable. Less work in the handler is better, and avoids any annoying issues with memory of I/O getting messed up in the signal handler.

Comment on lines +505 to +516
if constexpr (admin_server::is_store_message_enabled()) {
register_route<superuser>(
ss::httpd::debug_json::put_store_message,
[this](std::unique_ptr<ss::http::request> req) {
return put_store_message(std::move(req));
});
register_route<superuser>(
ss::httpd::debug_json::get_store_message,
[this](std::unique_ptr<ss::http::request> req) {
return get_store_message(std::move(req));
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think its worth debating if these are actually compiled out vs compiled in but disabled. for example compiling them in, and requiring some test-only runtime switch to make them active woudl mean we could test in both debug and release modes...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took inspiration from the failure injector stuff. I agree that doing a runtime switch is preferable (we kind of do that with the RP_FIXTURE_TEST environment variable for the OpenSSL Context stuff in unit tests). Happy to discuss a path forward on this and adapting it to other things like failure injector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't matter to me one way or the other

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue with this approach for now and get some eyes/consensus on 9051. Once that's settled we'll come back and adjust.

Created a structure that will be used to hold assert messages and the
backtrace from calls to `vassert` that will be used by the crash tracker
when a SIGABRT is signaled.

Signed-off-by: Michael Boquard <[email protected]>
* Added assertions crash_type
* When in sigabrt, check if there is an event held in the assert log
  holder and if there is, use that as the crash description

Signed-off-by: Michael Boquard <[email protected]>
abort() raises SIGABRT which is catchable so we can log the error in the
crash report (unlike __builtin_trap which may raise SIGILL, depending on
implementation).

Signed-off-by: Michael Boquard <[email protected]>
Implement handlers for storing and retrieving messages stored in the
assert log holder.

Signed-off-by: Michael Boquard <[email protected]>
Changes necessary due to changes in how vassert messaging is
constructed.

Signed-off-by: Michael Boquard <[email protected]>
@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from 6bd5f0c to 432d396 Compare February 7, 2025 13:53
@michael-redpanda
Copy link
Contributor Author

Force push 432d396:

  • Rebase off dev to fix conflicts. No changes

// lock-free atomic is UB.
static thread_local inline std::atomic<const char*> _abuffer{nullptr};
};
static thread_local assert_log_holder g_assert_log_holder;
Copy link
Member

@dotnwat dotnwat Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm concerned this isn't going to do what you want (or that i'm mistaken about what the exact goal is). here you're going to i think get a separate instance of assert_log_holder in each translation unit (static foo in a header--very rare what is intended), then those would all contain another thread_local _abuffer that points to a single instance (since its inline)? it's a bit of brain twister.

would this be simpler:

class assert_log_holder {
  std::atomic...;
};

inline thread_local assert_log_holder g_assert_log_holder

here you get a single global instance of assert_log_holder per thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants