[cpu] higher check runner counts may produce CPU spikes #919
Comments
On a machine with 2 cores, running about 15 python checks total (most of them […]). The spikes happen every 2 minutes because that's the frequency at which process checks refresh their caches (when they do, they call […]).
Done some more testing: this behavior of the python runtime can be reproduced outside of Agent6, with 2 simple python scripts that attempt to mimic the behavior of the process check. Given:

```python
import psutil

def list_psutil_processes():
    for p in psutil.process_iter():
        print(p.name())
```

every n seconds (and accounting for the total execution time of each sequence to compute the next run) we run it directly with […]
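A minimal sketch of the kind of driver described above (the exact scripts aren't shown here, so the interval, instance count, and concurrency values below are placeholders, not the values used in the original test):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import psutil

INTERVAL = 15        # seconds between passes (placeholder)
NUM_INSTANCES = 15   # how many "check instances" to mimic (placeholder)
CONCURRENCY = 4      # 1 reproduces the serial, agent5-like profile

def list_psutil_processes():
    # same workload as the snippet above, minus the printing
    for p in psutil.process_iter():
        p.name()

def run_pass():
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = [pool.submit(list_psutil_processes) for _ in range(NUM_INSTANCES)]
        for f in futures:
            f.result()

while True:
    start = time.monotonic()
    run_pass()
    elapsed = time.monotonic() - start
    # account for the execution time of the pass when scheduling the next one
    time.sleep(max(0.0, INTERVAL - elapsed))
```

Raising `CONCURRENCY` towards `NUM_INSTANCES` mimics a high runner count, while `CONCURRENCY = 1` mimics the serial, agent5-style profile.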
On a Dev account, the scripts use, on average, on an 8-core Linux VM: […]
In general, I'm more interested in getting numbers that describe how much overhead a highly concurrent scheduler adds to the metrics collection cycle - the use case here (IO-intensive checks with multiple instances) looks a lot like a corner case, and I'd like to collect more info and feedback before claiming we have a fire to put out. With this in mind, and with regard to possible fixes:
I strongly advise against this: the implementation would add significant complexity, especially considering that Autodiscovery can easily change the number of check instances running at any given time.
This can be done to some extent in order to provide a more reasonable default, but IMO we should still prefer concurrency - users should be able to reduce it, even drastically, but only when spikes happen and actually represent a problem.
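For reference, a toy Python model of the runner pool being discussed (the real runners are Go goroutines, so this is only an illustration of why the runner count is the knob that bounds concurrency):

```python
import queue
import threading

NUM_RUNNERS = 4  # placeholder default; the point is that it stays user-tunable

def runner(pending):
    # each runner pulls scheduled check instances off the shared queue
    while True:
        check = pending.get()
        if check is None:      # sentinel: no more work this cycle
            return
        check()                # run one check instance

def run_collection_cycle(checks, num_runners=NUM_RUNNERS):
    pending = queue.Queue()
    workers = [threading.Thread(target=runner, args=(pending,))
               for _ in range(num_runners)]
    for w in workers:
        w.start()
    for check in checks:
        pending.put(check)
    for _ in workers:          # one sentinel per runner
        pending.put(None)
    for w in workers:
        w.join()
```

With `num_runners=1` all instances of a cycle run serially, agent5-style; larger values let more python checks overlap and release the GIL on syscalls at the same time.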
Spiking cpu/mem usage will cause issues with the docker agent if limits are set:
This means that if we want reliable behaviour in containers, we must aim for the flattest resource usage profile possible. Could we "autoscale" the runner count depending on the total collection time?
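A rough sketch of what such an autoscaling rule could look like - the thresholds below are made up for illustration, not proposed values:

```python
def autoscale_runners(current, collection_time, interval,
                      min_runners=1, max_runners=8):
    """Return the runner count to use for the next collection cycle."""
    if collection_time > 0.9 * interval and current < max_runners:
        return current + 1     # falling behind: allow more concurrency
    if collection_time < 0.5 * interval and current > min_runners:
        return current - 1     # plenty of headroom: flatten the profile
    return current
```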
Describe what happened:
Due to the way we process and schedule checks, when the number of check-runner goroutines is high there is a chance of experiencing CPU spikes.
As discussed, we believe this is because we schedule checks to run at fixed intervals: when the number of runners is high, and in particular when the checks are likely to wait on system calls (check instances - i.e. check runs - release the GIL while waiting for the OS to return), the number of python checks running concurrently goes up and drives up CPU utilization. A lower number of check runners reduces the concurrency and lowers the CPU utilization.
A single python check runner (long-running checks aside) replicates the agent5 behavior, where instances ran serially - resulting in the lowest possible CPU load.
The overall CPU usage also showed an increase even after being averaged out, so it's not just a spike but a higher overall CPU load (scheduling and context-switching overhead?).
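One way to probe that observation outside the agent (a sketch, not the measurement methodology used here): run the same total amount of psutil work serially and with several threads, and compare the total user+system CPU time charged to the process. A consistent gap between the two totals would point at concurrency overhead such as GIL contention and context switching.

```python
import os
import threading

import psutil

def list_psutil_processes():
    # same workload as the repro snippet, minus the printing
    for p in psutil.process_iter():
        p.name()

def cpu_seconds():
    t = os.times()
    return t.user + t.system

def measure(workers, passes):
    start = cpu_seconds()
    threads = [
        threading.Thread(target=lambda: [list_psutil_processes() for _ in range(passes)])
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cpu_seconds() - start

# same total work (80 passes) in both cases
serial = measure(workers=1, passes=80)
concurrent = measure(workers=4, passes=20)
print("serial CPU: %.2fs  concurrent CPU: %.2fs" % (serial, concurrent))
```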
Describe what you expected:
A flatter+lower CPU profile/footprint would be preferable.
Steps to reproduce the issue:
Possible fixes