Elastic Agent upgrade: lock-free manual rollback #8767

pchila · 2025-07-01T16:42:46Z

What does this PR do?

This PR introduces a new --rollback flag to elastic-agent upgrade command that will switch back to a previous elastic-agent installation, effectively rolling back an upgrade.
This PR makes the elastic agent main process "take over" the watcher applocker to write the rollback request before running the watcher again, which will perform the rollback.
The Upgrader now receives a rollback boolean which will trigger:

a takeover of the watcher applocker
a modification of the update marker, specifying that a rollback has been requested
a new watcher being launched which will take care of the actual rollback

On the watcher side, the watch loop is now interruptible to be able to have the watcher exit gracefully.

Why is it important?

This is the initial implementation for a manual rollback flow.
In this first iteration, the rollback only works during the grace period (that is while the watcher is still running), in follow-up PRs this functionality will be extended for the duration of agent.upgrade.rollback.window.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~[ ] I have added an entry in ./changelog/fragments using the changelog tool~~
~~[ ] I have added an integration test or an E2E test~~

Disruptive User Impact

How to test this PR locally

Package elastic agent twice from this PR:

SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz  PLATFORMS="linux/amd64" mage -v package

AGENT_PACKAGE_VERSION="9.2.0+20250701000000" SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz  PLATFORMS="linux/amd64" mage -v package

Install the version 9.2.0-SNAPSHOT as usual
Set a rollback window duration > 0, for example 2 hours:

  agent.upgrade:
    rollback:
      window: 2h

Trigger an update to the other package (saved on disk):

elastic-agent upgrade --skip-verify --source-uri=file:///vagrant/build/distributions 9.2.0+20250701000000-SNAPSHOT

Wait for the new agent to come online and the upgrade details to signal UPG_WATCHING state
Manually rollback to the previous version:

elastic-agent upgrade --rollback 9.2.0-SNAPSHOT

Notes:

Trying to rollback after the grace period does not work and may break the agent install (this is because the watcher is still cleaning up the upgrade marker and the previous install at the end of the grace period)

Related issues

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

mergify · 2025-07-01T16:43:24Z

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

cmacknz · 2025-07-02T19:49:02Z

internal/pkg/agent/application/upgrade/upgrade.go

+						u.log.Errorf("error finding process with PID: %d: %s", pid, findProcErr)
+						continue
+					}
+					killProcErr := process.Kill()


I think this is dangerous without coordination between the agent and the watcher to make sure we don't kill the watcher while it is in the process of rolling back the currently running agent.

I still think if the watcher is running, we could communicate the need to rollback via the StateWatch it already has on this agent instead of having to kill it:

elastic-agent/internal/pkg/agent/application/upgrade/watcher.go

Line 157 in 676e18d

watch, err := ch.agentClient.StateWatch(stateCtx)

Is there a reason that wouldn't work? It seems like it is a safer way to trigger this to me.

The edge case to this approach is if the grace period expires as you indicate a rollback should happen, the watcher might exit before it can see the need to rollback.

That is a nicer edge case than the agent installation being broken, which is my main worry with unconditionally killing the watcher.

The Kill() is the PoC part here, the point to bring home is that the elastic agent main process should take over any watcher process that is competing over the appLocker.
This is needed to remove concurrent writes to the upgrade marker: the actual rollback will start later when a new watcher is launched.
The StateWatch usage that you are proposing would just make the watcher exit or it would trigger an immediate rollback (driven by the watcher) ?

The StateWatch usage that you are proposing would just make the watcher exit or it would trigger an immediate rollback (driven by the watcher) ?

It could be either of these (probably immediate rollback is simpler?), the main point is to avoid a situation where we kill the watcher while it is rolling back we need a graceful shutdown mechanism or a way to avoid having to shut down the watcher to trigger a rollback.

One thing I remember about file locks on Windows is that they are not always released immediately.

On windows the lock is implemented with LockFileEx

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.

So we need to be careful with locks in the case where the program can exit unexpectedly (e.g. panic). The watcher is simple enough that this is pretty unlikely and I don't think we've seen it before. We did observe the Beats having problems with this though in the past as they are more likely to panic or be OOMKilled.

@cmacknz Tried using SIGTERM and SIGINT (for the windows part) in 18c1804 but we receive those signals when the agent starting the upgrade rexecs, which then brings the watcher to terminate the watch immediately and cleanup 😞

We need less commonly used signals or an entirely different way of making the watcher shutdown gracefully (without writing concurrently files)

CTRL_BREAK might do it #7738, though in the receiving process Go will translate it into SIGINT so maybe not.

I think my main worry about the use of locks is one of the processes being killed, not properly releasing the lock, and dooming future upgrades.

Do we need to kill the watcher? We could use the state watch to communicate the intent to rollback, and then have the agent ensure the watcher is always running when a rolback is request but hasn't been performed yet (e.g. if the watcher exits after the grace period because of perfect timeing it just gets started again). We could also look at using a separate socket/named pipe to have agent tell the watcher to rollback.

@cmacknz, yesterday looking at the PR with @pkoutsovasilis we found out the source of the SIGINT signal when re-executing: it was explicitly set as Pdeathsig signal when launching watcher
4f73814#diff-168080314caf1d3868d593889dc36edec0ec2f12fd50e37d5a3d57e05274a10aL41

Removing that restored watching functionality on agent restart

mergify · 2025-07-11T10:18:59Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lock-free-manual-rollback upstream/lock-free-manual-rollback
git merge upstream/main
git push upstream lock-free-manual-rollback

…rker

elasticmachine · 2025-08-01T15:37:16Z

💔 Build Failed

Buildkite Build
Commit: cba5465

Failed CI Steps

History

💚 Build #24622 succeeded 3ade6df
💔 Build #24463 failed 6873b56
💔 Build #24401 failed 2a3d70c
💔 Build #24291 failed 807944a

cc @pchila

elastic-sonarqube · 2025-08-01T16:39:38Z

Quality Gate passed

Issues
11 New issues
3 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
73.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

mergify bot assigned pchila Jul 1, 2025

pchila added the skip-changelog label Jul 1, 2025

cmacknz reviewed Jul 2, 2025

View reviewed changes

pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from a7e6486 to 33bfe58 Compare July 11, 2025 07:09

pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from 764b7ad to e4f6b45 Compare July 15, 2025 16:41

pchila mentioned this pull request Jul 16, 2025

Pass rollback window duration to upgrade watcher command #8177

Closed

8 tasks

pchila changed the title ~~[DO NOT MERGE] - Lock free manual rollback PoC~~ Lock free manual rollback Jul 25, 2025

pchila changed the title ~~Lock free manual rollback~~ Elastic Agent upgrade: lock-free manual rollback Jul 25, 2025

pchila mentioned this pull request Jul 25, 2025

Add rollback_window to elastic agent configuration with a default value of 7d #6881

Open

pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from fa7ce8f to 6873b56 Compare July 28, 2025 08:38

pchila added 15 commits July 30, 2025 16:54

Add rollback field to UpgradeRequest

a6ba356

introduce rollback parameter to upgrade

146a9ae

manual rollback from CLI PoC

e350419

Concurrently retry taking over watcher

f525165

Gracefully shutdown agent watcher

7255ab5

move desired outcome check before grace period evaluation

e31f9ab

Add rollbacks available to upgrade marker

c9865c1

remove fakeAcker in favour of generated Acker mock

69b5329

Introduce WatcherHelper

b806260

Add tests for available_rollbacks

4a76020

Add tests for takeOverWatcher

f455903

add testlocker binary to sonar exclusions

b1720df

disable rollback window by default

30bdffa

Add formal checks to manual rollback arguments

7567b8d

rename forceRollbackToPreviousVersion

55ad3c4

test watchloop

3901e20

pchila force-pushed the lock-free-manual-rollback branch from 6873b56 to 3901e20 Compare July 30, 2025 14:54

pchila added 2 commits July 30, 2025 19:10

fixup! test watchloop

3ade6df

Re-invoke watcher after takeover

a6052ca

pchila linked an issue Aug 1, 2025 that may be closed by this pull request

Handle rollback command/action in upgrade flow #6889

Open

pchila added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team enhancement New feature or request labels Aug 1, 2025

Add minimum version check for creating rollbacks entries in update ma…

cba5465

…rker

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Elastic Agent upgrade: lock-free manual rollback #8767

Elastic Agent upgrade: lock-free manual rollback #8767

pchila commented Jul 1, 2025 •

edited

Loading

Uh oh!

mergify bot commented Jul 1, 2025

Uh oh!

cmacknz Jul 2, 2025

Uh oh!

cmacknz Jul 2, 2025 •

edited

Loading

Uh oh!

pchila Jul 3, 2025 •

edited

Loading

Uh oh!

cmacknz Jul 3, 2025

Uh oh!

cmacknz Jul 3, 2025 •

edited

Loading

Uh oh!

pchila Jul 8, 2025 •

edited

Loading

Uh oh!

cmacknz Jul 8, 2025

Uh oh!

pchila Jul 9, 2025

Uh oh!

mergify bot commented Jul 11, 2025

Uh oh!

elasticmachine commented Aug 1, 2025 •

edited

Loading

Uh oh!

elastic-sonarqube bot commented Aug 1, 2025

Uh oh!

Uh oh!

Elastic Agent upgrade: lock-free manual rollback #8767

Are you sure you want to change the base?

Elastic Agent upgrade: lock-free manual rollback #8767

Conversation

pchila commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is it important?

Checklist

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

Uh oh!

mergify bot commented Jul 1, 2025

Uh oh!

cmacknz Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

cmacknz Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pchila Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmacknz Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

cmacknz Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pchila Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cmacknz Jul 8, 2025

Choose a reason for hiding this comment

Uh oh!

pchila Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

mergify bot commented Jul 11, 2025

Uh oh!

elasticmachine commented Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💔 Build Failed

Failed CI Steps

History

Uh oh!

elastic-sonarqube bot commented Aug 1, 2025

Quality Gate passed

Uh oh!

Uh oh!

pchila commented Jul 1, 2025 •

edited

Loading

cmacknz Jul 2, 2025 •

edited

Loading

pchila Jul 3, 2025 •

edited

Loading

cmacknz Jul 3, 2025 •

edited

Loading

pchila Jul 8, 2025 •

edited

Loading

elasticmachine commented Aug 1, 2025 •

edited

Loading