Fix data race in checkin API and improve overall performance #5834
Conversation
This is because the checkin bulker has a default timeout of 10 seconds, so keeping the original 10-second wait could result in the write being missed by the check.
Snyk checks have passed. No issues have been found so far.
pchila
left a comment
Only went through handleCheckin* and left a couple of comments. I may come back with more once I go over the rest of the PR.
I also think that @michel-laterman should have a look at this, since he has better context around agent checkin in fleet-server.
michel-laterman
left a comment
Overall, I really like the refactor.
My biggest concern is the possible 2nd ES call to remove the audit/unenrolled attributes.
@pchila @michel-laterman Can you give this another review?
pchila
left a comment
Thank you for addressing my comments. LGTM
What is the problem this PR solves?
Fixes a data race in the checkin API to prevent multiple Fleet Servers handling the same Elastic Agent from overwriting each other's updates. The Elastic Agent document is now updated before the checkin response is sent, so the state between Fleet and the Elastic Agent stays consistent.
Improves the checkin logic to make it easier to understand and removes several moving parts: it eliminates the extra ticker in each long-poll connection select, and removes the need to compute unique Elastic Agent bodies to send in the MUpdate. This reduces memory allocations and the amount of data sent over the MUpdate.
How does this PR solve the problem?
It solves the issue by writing the Elastic Agent document before sending a response. If the Elastic Agent then quickly makes another connection to a different Fleet Server, that server is guaranteed to be working with the state written by the previous checkin.
Design Checklist
- [ ] I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc. (no changes in this PR require this)

Checklist
- [ ] I have made corresponding changes to the documentation (none related in this PR, all internal)
- [ ] I have made corresponding changes to the default configuration files (none)
- [ ] I have added an entry in ./changelog/fragments using the changelog tool

Related issues