Skip to content

Program hangs if sample-collection time is > interval #123

@benbuckman

Description

@benbuckman

Context

We are trying to enable StackProf in production on a REST API service. We would like to use mode: :wall, raw: true, aggregate: true, but it has been challenging to calibrate the interval. The segfault bug described in #81 makes very low/fast sampling rates concerning (especially because our hypothesis about the causes of those segfaults, which we have not verified, centered on overlapping signal handlers). But at an interval of (e.g.) 20ms the data is not particularly useful.

Observed behavior

At lower intervals, however, we encounter a separate bug in which the program doesn't crash, but rather it hangs – the CPU is pegged to 100% and the program being profiled does not advance. This specifically seems to occur when the time to capture a single sample is greater than the interval between samples. (The time it takes to capture a sample fluctuates, I assume, based on the size of the stack, CPU load, etc, so it could take a bunch of samples successfully and then encounter this race.) What I suspect is occurring – and I am still in the process of debugging and verifying the details – is that it starts piling up jobs with rb_postponed_job_register_one, and by the time it finishes one sample, it's started another, (and so on forever), blocking any other instructions.

Proposal

I think some code needs to be added, to stackprof_record_sample or stackprof_record_sample_for_stack, measuring the time it took to capture the sample. If that time is >= interval, it needs to handle that somehow. I can think of a few options at this point:

  1. Calculate how many sample(s)' time was borrowed and skip that number of next samples. (This seems complicated and risky.)
  2. Inject fake frames into the results, similar to GC frames, that make it clear (to the final report or flamegraph) that this situation was encountered. (I'm not sure how feasible this would be.)
  3. Internally stop profiling, and when StackProf.results is called, raise an exception that makes it clear that the profiling was stopped due to this situation. (We don't want to raise mid-profiling because that could bubble up anywhere in the program. This seems the easiest of the 3.)

Next steps for me

  • Distill this bug down to a simple repro script.
  • Confirm that the bug is in fact caused by a pile-on of signal handlers / postponed jobs.
  • Implement one of the 3 solutions (probably the 3rd) behind a param passed to start or run and confirm that it fixes the bug.

(I'll update this issue as I do these.)

In the meantime, I'd be interested in anyone else's thoughts – have you also encountered this, do you know why it might be occurring, and what fix do you propose?

Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions