Fix data race in checkin API and improve overall performance #5834
Conversation
This is because the checkin bulker has a default timeout of 10 seconds, so keeping the original 10-second wait could result in the write being missed by the check.
Snyk checks have passed. No issues have been found so far.
pchila
left a comment
Only went through handleCheckin* and left a couple of comments. I may come back with more once I go over the rest of the PR.
I also think that @michel-laterman should have a look at this, since he has better context around agent checkin in fleet-server.
michel-laterman
left a comment
Overall, I really like the refactor.
My biggest concern is the possible 2nd ES call to remove the audit/unenrolled attributes.
@pchila @michel-laterman Can you give this another review?
pchila
left a comment
Thank you for addressing my comments. LGTM
What is the problem this PR solves?
Fixes a data race in the checkin API to prevent multiple Fleet Servers handling the same Elastic Agent from overwriting each other's updates. The Elastic Agent document is now updated before the checkin response is sent, so the state between Fleet and the Elastic Agent stays consistent.
Improves the checkin logic to make it easier to understand and removes several moving parts: it eliminates the extra ticker in each long-poll connection select, and removes the need to compute unique Elastic Agent bodies to send in the MUpdate. This reduces memory allocations and the amount of data sent over the MUpdate.
How does this PR solve the problem?
It solves the issue by writing the Elastic Agent document before sending a response. If the Elastic Agent then quickly makes another connection to a different Fleet Server, that server is guaranteed to be working with the state written by the previous checkin.
Design Checklist
- [ ] I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc. (no changes in this PR require this)

Checklist
- [ ] I have made corresponding changes to the documentation (none related in this PR, all internal)
- [ ] I have made corresponding changes to the default configuration files (none)
- [ ] I have added an entry in ./changelog/fragments using the changelog tool

Related issues