
@blakerouse
Contributor

@blakerouse blakerouse commented Nov 4, 2025

What is the problem this PR solves?

Fixes a data race in the check-in API so that multiple Fleet Servers handling the same Elastic Agent no longer overwrite each other's updates. The Elastic Agent document is now updated before the check-in response is sent, so that the state between Fleet and Elastic Agent is always consistent.

Improves the checkin logic to make it easier to understand and removes moving parts. Removes the need for an extra ticker in each long poll connection select. Removes the need to compute unique Elastic Agent bodies to send in the MUpdate. This reduces the number of memory allocations and the amount of data that is sent over the MUpdate.

How does this PR solve the problem?

It solves the issue by writing the Elastic Agent document before sending the response. If the Elastic Agent then quickly makes another connection to a different Fleet Server, that server works with the state that was written by the previous checkin.
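
The ordering described above (persist first, respond second) can be illustrated with a minimal Go sketch. This is not the code from this PR: `AgentDoc`, `Store`, `Server`, and `Checkin` are hypothetical names, and an in-memory map stands in for the shared Elasticsearch index. It only shows why a second server sees the first server's write when the write completes before the response is returned.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// AgentDoc is a hypothetical stand-in for the Elastic Agent document.
type AgentDoc struct {
	ID     string
	Status string
	Seq    int // incremented on every persisted check-in
}

// Store is a hypothetical shared store; in this sketch an in-memory map
// plays the role of the index shared by all Fleet Server instances.
type Store struct {
	mu   sync.Mutex
	docs map[string]AgentDoc
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{docs: make(map[string]AgentDoc)}
}

// Update persists the new status and returns the stored document.
func (s *Store) Update(id, status string) (AgentDoc, error) {
	if id == "" {
		return AgentDoc{}, errors.New("empty agent id")
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	doc := s.docs[id]
	doc.ID = id
	doc.Status = status
	doc.Seq++
	s.docs[id] = doc
	return doc, nil
}

// Get returns the stored document for an agent, if any.
func (s *Store) Get(id string) (AgentDoc, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	doc, ok := s.docs[id]
	return doc, ok
}

// Server is one of several hypothetical Fleet Server instances that share a Store.
type Server struct {
	Name  string
	store *Store
}

// Checkin writes the agent document first and only then builds the response,
// so any later check-in, on any server, starts from the persisted state.
func (srv *Server) Checkin(id, status string) (string, error) {
	doc, err := srv.store.Update(id, status)
	if err != nil {
		return "", fmt.Errorf("persist agent %q: %w", id, err)
	}
	return fmt.Sprintf("%s acked %s seq=%d", srv.Name, doc.ID, doc.Seq), nil
}

func main() {
	store := NewStore()
	a := &Server{Name: "fleet-a", store: store}
	b := &Server{Name: "fleet-b", store: store}

	// The agent checks in with fleet-a, then immediately with fleet-b.
	for _, step := range []struct {
		srv    *Server
		status string
	}{{a, "online"}, {b, "degraded"}} {
		resp, err := step.srv.Checkin("agent-1", step.status)
		if err != nil {
			fmt.Println("checkin failed:", err)
			return
		}
		fmt.Println(resp)
	}
}
```

Because `Checkin` returns only after `Update` succeeds, the second server in this sketch always increments the sequence written by the first rather than starting from stale state.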

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • [ ] I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc. (no changes in this PR require this)

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation (none related in this PR, all internal)
  • [ ] I have made corresponding change to the default configuration files (none)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@blakerouse blakerouse self-assigned this Nov 4, 2025
@blakerouse blakerouse added the backport-skip Skip notification from the automated backport with mergify label Nov 4, 2025
@blakerouse blakerouse requested a review from a team as a code owner November 4, 2025 23:43
@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Nov 4, 2025
@prodsecmachine

prodsecmachine commented Nov 4, 2025

Snyk checks have passed. No issues have been found so far.

Status | Scanner | Critical | High | Medium | Low | Total (0)
(not captured) | Licenses | 0 | 0 | 0 | 0 | 0 issues
(not captured) | Open Source Security | 0 | 0 | 0 | 0 | 0 issues


@blakerouse blakerouse marked this pull request as draft November 5, 2025 02:15
@blakerouse blakerouse marked this pull request as ready for review November 6, 2025 16:38
Member

@pchila pchila left a comment


Only went through handleCheckin* and I left a couple of comments. May come back with more once I go over the rest of the PR.

I also think that @michel-laterman could have a look at this, since he should have better context around agent checkin in fleet-server.

Contributor

@michel-laterman michel-laterman left a comment


Overall, I really like the refactor.
My biggest concern is the possible 2nd ES call to remove the audit/unenrolled attributes.

@blakerouse
Contributor Author

@pchila @michel-laterman Can you give this another review?

Member

@pchila pchila left a comment


Thank you for addressing my comments. LGTM

@blakerouse blakerouse merged commit f740682 into elastic:main Nov 19, 2025
9 checks passed
@blakerouse blakerouse deleted the improve-checking branch November 19, 2025 16:25


Development

Successfully merging this pull request may close these issues.

  • CheckIn bulker 10 second window is a problem
  • Improve checkin bulker performance

4 participants