Program hangs if sample-collection time is > interval

### Context
We are trying to enable StackProf in production on a REST API service. We would like to use `mode: :wall, raw: true, aggregate: true`, but it has been challenging to calibrate the `interval`. The segfault bug described in #81 makes very low/fast sampling rates concerning (especially because our hypothesis about the causes of those segfaults, which we have not verified, centered on [overlapping signal handlers](http://man7.org/linux/man-pages/man7/signal-safety.7.html)). But at an interval of (e.g.) 20ms the data is not particularly useful.

### Observed behavior
At lower intervals, however, we encounter a separate bug in which the program doesn't crash, but rather it _hangs_ – the CPU is pegged to 100% and the program being profiled does not advance. This specifically seems to occur **when the time to capture a single sample is greater than the interval between samples.** (The time it takes to capture a sample fluctuates, I assume, based on the size of the stack, CPU load, etc, so it could take a bunch of samples successfully and then encounter this race.) What I suspect is occurring – and I am still in the process of debugging and verifying the details – is that it starts [piling up jobs with `rb_postponed_job_register_one`](https://github.com/tmm1/stackprof/blob/b886c3c/ext/stackprof/stackprof.c#L561-L570), and by the time it finishes one sample, it's started another, (and so on forever), blocking any other instructions.

### Proposal
I think some code needs to be added, to [`stackprof_record_sample`](https://github.com/tmm1/stackprof/blob/b886c3c/ext/stackprof/stackprof.c#L491) or [`stackprof_record_sample_for_stack`](https://github.com/tmm1/stackprof/blob/b886c3c/ext/stackprof/stackprof.c#L381), measuring the time it took to capture the sample. If that time is `>= interval`, it needs to handle that somehow. I can think of a few options at this point:

1. Calculate how many sample(s)' time was borrowed and skip that number of next samples. (This seems complicated and risky.)
1. Inject fake frames into the results, similar to GC frames, that make it clear (to the final report or flamegraph) that this situation was encountered. (I'm not sure how feasible this would be.)
1. Internally stop profiling, and when `StackProf.results` is called, raise an exception that makes it clear that the profiling was stopped due to this situation. (We don't want to raise mid-profiling because that could bubble up anywhere in the program. This seems the easiest of the 3.)

### Next steps for me
- [x] Distill this bug down to a simple repro script.
- [x] Confirm that the bug is in fact caused by a pile-on of signal handlers / postponed jobs.
- [ ] Implement one of the 3 solutions (probably the 3rd) behind a param passed to `start` or `run` and confirm that it fixes the bug.

_(I'll update this issue as I do these.)_

In the meantime, I'd be interested in anyone else's thoughts – have you also encountered this, do you know why it might be occurring, and what fix do you propose?

Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Program hangs if sample-collection time is > interval #123

Context

Observed behavior

Proposal

Next steps for me

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Program hangs if sample-collection time is > interval #123

Description

Context

Observed behavior

Proposal

Next steps for me

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions