Skip to content

Conversation

@realFlowControl
Copy link
Member

@realFlowControl realFlowControl commented Nov 28, 2025

Description

We got some crash reports with stacks like this:

#14  0x00006101e274a1e3 execute_ex 
#15  0x00006101e26ca4f6 zend_call_function 
#16  0x00006101e26ca8ad zend_call_known_function 
#17  0x00006101e26c9647 zend_lookup_class_ex 
#18  0x00006101e26cac0e zend_fetch_class 

None of these are actually helpful, because this would just look like a crash in the engine's execute_ex function somewhere. But now we got a new crash trace adding some stack frames that look like stack corruption, but reveal something interesting nonetheless:

#0   0x00006101e26a8712 zend_mm_gc 
#1   0x00007b434a0a82e1 get_or_init<core::cell::RefCell<datadog_php_profiling::RequestLocals>, (), fn() -> core::cell::RefCell<datadog_php_profiling::RequestLocals>> (native/lazy.rs:53:15)
#2   0x00007b434a0a82e1 {closure#0} (native/mod.rs:94:25)
#3   0x00007b434a0a82e1 call_once<datadog_php_profiling::REQUEST_LOCALS::{constant#0}::{closure_env#0}, (core::option::Option<&mut core::option::Option<core::cell::RefCell<datadog_php_profiling::RequestLocals>>>)> (src/ops/function.rs:250:5)
#4   0x00007b434a0a82e1 try_with<core::cell::RefCell<datadog_php_profiling::RequestLocals>, datadog_php_profiling::{impl#2}::try_with_borrow::{closure_env#0}<datadog_php_profiling::RequestLocals, datadog_php_profiling::allocation::alloc_prof_rshutdown::{closure_env#0}, bool>, core::result::Result<bool, core::cell::BorrowError>> (src/thread/local.rs:282:37)
#5   0x00007b434a0a82e1 try_with_borrow<datadog_php_profiling::RequestLocals, datadog_php_profiling::allocation::alloc_prof_rshutdown::{closure_env#0}, bool> (profiling/src/lib.rs:498:12)
#6   0x00007b434a0a82e1 alloc_prof_rshutdown (allocation/mod.rs:152:30)
#7   0x00007b434a0a81be get_or_init<core::cell::RefCell<datadog_php_profiling::allocation::AllocationProfilingStats>, (), fn() -> core::cell::RefCell<datadog_php_profiling::allocation::AllocationProfilingStats>> (native/lazy.rs:56:40)
#8   0x00007b434a0a81be {closure#0} (native/mod.rs:94:25)
#9   0x00007b434a0a81be call_once<datadog_php_profiling::allocation::ALLOCATION_PROFILING_STATS::{constant#0}::{closure_env#0}, (core::option::Option<&mut core::option::Option<core::cell::RefCell<datadog_php_profiling::allocation::AllocationProfilingStats>>>)> (src/ops/function.rs:250:5)
#10  0x00007b434a0a81be try_with<core::cell::RefCell<datadog_php_profiling::allocation::AllocationProfilingStats>, datadog_php_profiling::{impl#2}::try_with_borrow_mut::{closure_env#0}<datadog_php_profiling::allocation::AllocationProfilingStats, datadog_php_profiling::allocation::allocation_le83::alloc_prof_realloc::{closure_env#0}, bool>, core::result::Result<bool, core::cell::BorrowMutError>> (src/thread/local.rs:282:37)
#11  0x00007b434a0a81be try_with_borrow_mut<datadog_php_profiling::allocation::AllocationProfilingStats, datadog_php_profiling::allocation::allocation_le83::alloc_prof_realloc::{closure_env#0}, bool> (profiling/src/lib.rs:507:12)
#12  0x00007b434a0a81be borrow_mut_or_false<std::thread::local::LocalKey<core::cell::RefCell<datadog_php_profiling::allocation::AllocationProfilingStats>>, datadog_php_profiling::allocation::AllocationProfilingStats, datadog_php_profiling::allocation::allocation_le83::alloc_prof_realloc::{closure_env#0}> (profiling/src/lib.rs:489:9)
#13  0x00007b434a0a81be alloc_prof_realloc (allocation/allocation_le83.rs:418:10)
#14  0x00006101e274a1e3 execute_ex 
#15  0x00006101e26ca4f6 zend_call_function 
#16  0x00006101e26ca8ad zend_call_known_function 
#17  0x00006101e26c9647 zend_lookup_class_ex 
#18  0x00006101e26cac0e zend_fetch_class 

There is just no way that alloc_prof_realloc would call into alloc_prof_rshutdown and from there, there is no way to call into zend_mm_gc...
The stack trace is still weird, but alloc_prof_realloc in case ZendMM can't allocate, will bailout (longjmp() to EG(bailout) which usually leads to RSHUTDOWN and as such to alloc_prof_rshutdown. I can't see a path though from alloc_prof_rshutdown to zend_mm_gc...

But there are a few ideas and things to this stack anyway: according to the runtime stack trace, it is in Composer\Autoload\ClassLoader::findFileWithExtension() line 505 which could trigger a reallocation, which on the way down, could call into zend_mm_gc. So at least this makes sense, not 100% with the stack trace we see, but ...

So anyway, I was thinking:

  • we might not need to do all of this get_or_init() dance with the RefCell in ALLOCATION_PROFILING_STATS at runtime, when we know there is no reentrancy
  • we also do not need to do the runtime lazy init for the thread local ALLOCATION_PROFILING_STATS itself when we make it a const
  • SeqCst memory ordering for stats and intervals is not needed, we only need atomicity, there is no happen-before relation to other memory, so we can use Relaxed and also move all the stats behind a feature flag because we are the only ones using it

Reviewer checklist

  • Test coverage seems ok.
  • Appropriate labels assigned.

@github-actions github-actions bot added profiling Relates to the Continuous Profiler tracing labels Nov 28, 2025
@realFlowControl realFlowControl marked this pull request as ready for review November 28, 2025 10:54
@realFlowControl realFlowControl requested a review from a team as a code owner November 28, 2025 10:54
@realFlowControl realFlowControl force-pushed the florian/stabilize-and-speedup-allocations branch from 5049f86 to be579db Compare December 2, 2025 10:16
@codecov-commenter
Copy link

codecov-commenter commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.60%. Comparing base (867b36c) to head (df68409).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3505      +/-   ##
==========================================
- Coverage   61.74%   61.60%   -0.14%     
==========================================
  Files         142      142              
  Lines       12975    12975              
  Branches     1700     1700              
==========================================
- Hits         8011     7993      -18     
- Misses       4204     4222      +18     
  Partials      760      760              

see 4 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 867b36c...df68409. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@realFlowControl realFlowControl force-pushed the florian/stabilize-and-speedup-allocations branch from be579db to 1f5dd79 Compare December 2, 2025 10:29
@pr-commenter
Copy link

pr-commenter bot commented Dec 2, 2025

Benchmarks [ profiler ]

Benchmark execution time: 2025-12-05 09:52:55

Comparing candidate commit df68409 in PR branch florian/stabilize-and-speedup-allocations with baseline commit 867b36c in branch master.

Found 2 performance improvements and 6 performance regressions! Performance is the same for 22 metrics, 6 unstable metrics.

scenario:php-profiler-exceptions-with-profiler

  • 🟩 execution_time [-8.305ms; -3.247ms] or [-8.175%; -3.196%]

scenario:php-profiler-exceptions-with-profiler-and-timeline

  • 🟩 execution_time [-7.393ms; -4.521ms] or [-7.062%; -4.318%]

scenario:php-profiler-timeline-memory-control

  • 🟥 cpu_user_time [+34.651ms; +40.145ms] or [+5.756%; +6.669%]
  • 🟥 execution_time [+34.303ms; +38.972ms] or [+5.441%; +6.181%]

scenario:php-profiler-timeline-memory-with-profiler

  • 🟥 cpu_user_time [+24.147ms; +64.942ms] or [+2.059%; +5.537%]
  • 🟥 execution_time [+31.811ms; +46.897ms] or [+3.375%; +4.976%]

scenario:php-profiler-timeline-memory-with-profiler-and-timeline

  • 🟥 cpu_user_time [+38.069ms; +73.850ms] or [+2.415%; +4.685%]
  • 🟥 execution_time [+36.537ms; +53.509ms] or [+2.930%; +4.291%]

@pr-commenter
Copy link

pr-commenter bot commented Dec 2, 2025

Benchmarks [ tracer ]

Benchmark execution time: 2025-12-02 11:36:13

Comparing candidate commit 1f5dd79 in PR branch florian/stabilize-and-speedup-allocations with baseline commit 17189c6 in branch master.

Found 3 performance improvements and 0 performance regressions! Performance is the same for 190 metrics, 1 unstable metrics.

scenario:BM_TeaSapiSpindown

  • 🟩 execution_time [-33.401µs; -12.352µs] or [-6.168%; -2.281%]

scenario:ComposerTelemetryBench/benchTelemetryParsing

  • 🟩 execution_time [-1.427µs; -0.773µs] or [-12.738%; -6.905%]

scenario:TraceSerializationBench/benchSerializeTrace

  • 🟩 execution_time [-54.266µs; -42.834µs] or [-12.026%; -9.492%]

Copy link
Collaborator

@morrisonlevi morrisonlevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is just no way that alloc_prof_realloc would call into alloc_prof_rshutdown and from there, there is no way to call into zend_mm_gc...

I don't think this quite right. The function zend_mm_safe_error is called by various allocation functions such as zend_mm_realloc_huge, and zend_mm_safe_error has zend_bailout which jumps to EG(bailout). It could jump to shutdown, yeah?

I was thinking: we might not need to do all of this get_or_init() dance and use a RefCell at runtime, when we can instead initialize the ALLOCATION_PROFILING_STATS in GINIT and then just "know" that this is initialized and use it.

RefCell is not about being initialized or not, that is the thread-local lazy initialization. The RefCell is about enforcing Rust's borrowing rules at runtime.


I think there's possibly some optimization though. We could use MaybeUninit<RefCell<AllocationProfilingStats>> and make it const { MaybeUninit::uninit() } so it doesn't need lazy-initialization checks at runtime. We initialize it in ginit. The RefCell still is there to enforce borrow rules at runtime (cover our butts, yes, it causes crashes, but that's better than undefined behavior).

@realFlowControl realFlowControl force-pushed the florian/stabilize-and-speedup-allocations branch from c65a5b8 to d84a013 Compare December 4, 2025 06:40
@realFlowControl
Copy link
Member Author

With b0ce8e9 I've guarded the allocation (and exception) stats we are sending with the metadata JSON behind a feature flag because it looks like we are spending nearly ⅓ of our time just in incrementing these counters:

image

They are also only needed by us as a measurement to validate the upscaling algorithm. I am gonna measure again once the binary is ready to benchmark.

@realFlowControl realFlowControl marked this pull request as draft December 4, 2025 14:33
@realFlowControl
Copy link
Member Author

There is just no way that alloc_prof_realloc would call into alloc_prof_rshutdown and from there, there is no way to call into zend_mm_gc...

I don't think this quite right. The function zend_mm_safe_error is called by various allocation functions such as zend_mm_realloc_huge, and zend_mm_safe_error has zend_bailout which jumps to EG(bailout). It could jump to shutdown, yeah?

You are right, in case ZendMM can't allocate it bails out. Okay, that that part of the stack trace is fine. Just don't know how zend_mm_gc ends up there 😞

I was thinking: we might not need to do all of this get_or_init() dance and use a RefCell at runtime, when we can instead initialize the ALLOCATION_PROFILING_STATS in GINIT and then just "know" that this is initialized and use it.

RefCell is not about being initialized or not, that is the thread-local lazy initialization. The RefCell is about enforcing Rust's borrowing rules at runtime.

I think there's possibly some optimization though. We could use MaybeUninit<RefCell<AllocationProfilingStats>> and make it const { MaybeUninit::uninit() } so it doesn't need lazy-initialization checks at runtime. We initialize it in ginit. The RefCell still is there to enforce borrow rules at runtime (cover our butts, yes, it causes crashes, but that's better than undefined behavior).

We could benchmark the different options against each other and see. There are probably easier and more safe ways to bring down overhead without doing too much unsafe stuff 😉

@morrisonlevi
Copy link
Collaborator

The RefCell still is there to enforce borrow rules at runtime (cover our butts, yes, it causes crashes, but that's better than undefined behavior).

We are okay to use UnsafeCell instead of RefCell. We don't hold a reference while doing anything which could cause reentrancy. We need to better document this though.

@morrisonlevi morrisonlevi changed the title fix(prof): Speedup hot path in allocator and fix possible crash perf(prof): speedup hot path in allocator Dec 4, 2025
@realFlowControl realFlowControl marked this pull request as ready for review December 5, 2025 15:40
@morrisonlevi
Copy link
Collaborator

I had a thought: when an Apache graceful restart is done, the process goes through mshutdown and back into minit. What is the story there with ginit and gshutdown?

Copy link
Collaborator

@morrisonlevi morrisonlevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to encapsulate the access to ALLOCATION_PROFILING_STATS so that other modules are not repeating the unsafe code (and comments that ought to accompany it). I'll work on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

profiling Relates to the Continuous Profiler tracing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants