-
Notifications
You must be signed in to change notification settings - Fork 15
[PROF-12853] Catch panics inside wrap_with_ffi_result and wrap_with_void_ffi_result
#1334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…h_void_ffi_result` **What does this PR do?** This PR updates the `wrap_with_ffi_result` and `wrap_with_void_ffi_result` macros to catch any panics that happen inside them, returning them as errors. The error handling is made in such a way (see `handle_panic_error` for details) that it should be able to report back an error even if we fail to do any allocations. Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros. **Motivation:** In <https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing> (Datadog-only link, sorry!) we saw `ddog_prof_Exporter_send` crashing due to what can be summed up as `ddog_prof_Exporter_send` (report a profile) -> hyper-util tries to do dns resolution in a separate thread pool -> tokio failed to create a new thread -> panic and we tear down the app because we can't report a profile This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083. **Additional Notes:** While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the `..._no_catch` variants of the macros. The return type change in `ddog_crasht_CrashInfoBuilder_build` does change the tag enum entries, which unfortunately is a breaking change. Ideas on how to work around this? This makes the following enum entries change: * `DDOG_CRASHT_CRASH_INFO_NEW_RESULT_OK` => `DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_OK_HANDLE_CRASH_INFO` * `DDOG_CRASHT_CRASH_INFO_NEW_RESULT_ERR` => `DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_ERR_HANDLE_CRASH_INFO` **How to test the change?** This change includes test coverage. I've also separately tried to sprinkle a few `panic!` calls manually and tested that it works as expected.
danielsn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the consumer supposed to do if there was a panic caught? We should clearly document that.
| const CANNOT_ALLOCATE: &std::ffi::CStr = | ||
| c"libdatadog failed: (panic) Cannot allocate error message"; | ||
| const CANNOT_ALLOCATE_CHAR_SLICE: CharSlice = unsafe { | ||
| crate::Slice::from_raw_parts( | ||
| CANNOT_ALLOCATE.as_ptr(), | ||
| CANNOT_ALLOCATE.to_bytes_with_nul().len(), | ||
| ) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to run this with MIRI to make sure we get lifetimes right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does miri not run our regular test suite? E.g. I somewhat assumed the new tests in utils.rs would be automatically covered... aren't they? 👀
| match error { | ||
| None => CharSlice::empty(), | ||
| Some(err) => CharSlice::from(err.as_ref()), | ||
| // When the error is empty (CANNOT_ALLOCATE_ERROR) we assume we failed to allocate an actual |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to special case ddog_error_drop to avoid dropping the static const CANNOT_ALLOCATE_ERROR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's the case? Specifically ddog_error_drop receives the Error, not the CharSlice we produce here, so the ddog_error_drop should see the vec with capacity == 0, not the string produced here. But I may be missing something...? 👀
| // This pattern of String vs &str comes from | ||
| // https://doc.rust-lang.org/std/panic/struct.PanicHookInfo.html#method.payload | ||
| if let Some(s) = error.downcast_ref::<String>() { | ||
| anyhow::anyhow!("{} failed: (panic) {}", function_name, s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anyhow itself can allocate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! That's why there's this second catch_unwind: Either anyhow succeeds and we're able to return this error OR it fails and we fall back to the CANNOT_ALLOCATE_ERROR.
This way we should be able to return the nice error in most cases.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1334 +/- ##
==========================================
+ Coverage 70.87% 71.20% +0.32%
==========================================
Files 385 392 +7
Lines 61838 62763 +925
==========================================
+ Hits 43828 44689 +861
- Misses 18010 18074 +64
🚀 New features to boost your workflow:
|
BenchmarksComparisonBenchmark execution time: 2025-11-20 11:32:11 Comparing candidate commit 7e19376 in PR branch Found 0 performance improvements and 13 performance regressions! Performance is the same for 42 metrics, 2 unstable metrics. scenario:credit_card/is_card_number/ 3782-8224-6310-005
scenario:credit_card/is_card_number/ 378282246310005
scenario:credit_card/is_card_number/378282246310005
scenario:credit_card/is_card_number/37828224631000521389798
scenario:credit_card/is_card_number_no_luhn/ 378282246310005
scenario:credit_card/is_card_number_no_luhn/378282246310005
scenario:credit_card/is_card_number_no_luhn/37828224631000521389798
CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
Group 13
Group 14
Group 15
Group 16
Group 17
BaselineOmitted due to size. |
| // A Rust Vec of size 0 [has no allocated memory](https://doc.rust-lang.org/std/vec/struct.Vec.html#guarantees): | ||
| // "In particular, if you construct a Vec with capacity 0 via Vec::new, vec![], | ||
| // Vec::with_capacity(0), or by calling shrink_to_fit on an empty Vec, it will not allocate | ||
| // memory." And as per https://doc.rust-lang.org/nomicon/vec/vec-dealloc.html: | ||
| // "We must not call alloc::dealloc when self.cap == 0, as in this case we haven't actually | ||
| // allocated any memory." | ||
| if self.capacity == 0 { | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this necessary in some case?
This should naturally be handled by Drop for std::vec::Vec, because it's safe to drop an empty std::vec::Vec. After creating a std::vec::Vec of its contents, the FFI Vec will now be "dropped" but drops on pointers do nothing. I can't see why it would be necessary to early-return here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question... At some point in my experiments I was using the mockalloc crate and it complained that we were still trying to drop the vec...
But I'm not confident if that's a real issue with our code or just some artifact of that crate/how I was setting up my test.
As you mentioned the alloc::vec::Vec::from_raw_parts doc kinda states that ptr only needs to be non-null and aligned if capacity is zero (e.g. doesn't need to be allocated using the global allocator) which means it won't touch it.
I'll leave it here to see if anyone has any ideas on why this may or may not be needed; and if nobody else chimes in I'll undo this part of the change.
|
If we do catch_unwind, we lose the rust frames for debugging. Anything we can do about that? |
|
Possibly we should publish a crash report when this panic happens? |
That would be cool! As a bonus we could actually include the panic message, which right now we don't get. |
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-apple-darwin
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-apple-darwin
x86_64-unknown-linux-gnu
|
| // When the error is empty (CANNOT_ALLOCATE_ERROR) we assume we failed to allocate an actual | ||
| // error and return this placeholder message instead. | ||
| Some(err) => { | ||
| if *err == CANNOT_ALLOCATE_ERROR { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is going to match any Error with empty message, not necessarily CANNOT_ALLOCATE_ERROR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup -- I've tweaked the comment to clarify this is intentional: f060873 .
I tried to have a different, static non-empty marker but it needed other changes to make the type system happy so that's why I ended up picking this direction.
|
|
||
| #[named] | ||
| unsafe fn ddog_crasht_crash_info_builder_build_impl( | ||
| pub unsafe extern "C" fn ddog_crasht_CrashInfoBuilder_build( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are catching these, I wonder what the difference is, if any, with extern "C" vs extern "C-unwind". Have you come across any size differences or anything else which might suggest using one or the other in cases when we catch panics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from my reading of https://rust-lang.github.io/rfcs/2945-c-unwind-abi.html I'm not sure extern "C-unwind" is different from extern "C" here... 🤔
In particular, there's never unwinding from another language to rust (that I know of?), and there's no unwinding from rust into another language -- previously we panic'd directly, and now we're handing the rust panic in rust code, and returning an error.
So there's never, that I can think of, to/from rust boundaries being crossed with unwinding, so this should never apply.
I suspect in practice, since this is a static string, it doesn't make a difference but let's fix it still.
Ooops!
What does this PR do?
This PR updates the
wrap_with_ffi_resultandwrap_with_void_ffi_resultmacros to catch any panics that happen inside them, returning them as errors.The error handling is made in such a way (see
handle_panic_errorfor details) that it should be able to report back an error even if we fail to do any allocations.Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros.
Motivation
In https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing (Datadog-only link, sorry!) we saw
ddog_prof_Exporter_sendcrashing due to what can be summed up asddog_prof_Exporter_send(report a profile) ->hyper-util tries to do dns resolution in a separate thread pool ->
tokio failed to create a new thread ->
panic and we tear down the app because we can't report a profile
This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083.
Additional Notes
While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the
..._no_catchvariants of the macros.The return type change in
ddog_crasht_CrashInfoBuilder_builddoes change the tag enum entries, which unfortunately is a breaking change.Ideas on how to work around this? This makes the following enum entries change:
DDOG_CRASHT_CRASH_INFO_NEW_RESULT_OK=>DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_OK_HANDLE_CRASH_INFODDOG_CRASHT_CRASH_INFO_NEW_RESULT_ERR=>DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_ERR_HANDLE_CRASH_INFOHow to test the change?
This change includes test coverage. I've also separately tried to sprinkle a few
panic!calls manually and tested that it works as expected.