
Elastic Agent upgrade: lock-free manual rollback #8767


Draft: pchila wants to merge 14 commits into main from lock-free-manual-rollback

Conversation

@pchila pchila commented Jul 1, 2025

What does this PR do?

This is a PoC for a lock-free implementation of #6887 and #6889

It is still very rough around the edges, and rollback only works correctly when it is triggered during the watching phase.
This can be a starting point for discussing the real implementation.

This PR makes the Elastic Agent main process "take over" the watcher applocker so that it can write the rollback request, and then runs the watcher again, which performs the actual rollback.
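
To make the intended flow concrete, here is a minimal sketch of that sequence (a plain lock file and a JSON marker are used as stand-ins for the agent's real applocker and upgrade marker, and launching the watcher is left as a comment; none of the names below are the agent's actual APIs):

package rollback

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

type upgradeMarker struct {
	PrevVersion       string `json:"prev_version"`
	RollbackRequested string `json:"rollback_requested,omitempty"`
}

// takeOverLock approximates "taking over" the watcher applocker with an
// O_EXCL lock file: it waits until the current holder releases it, then
// acquires it for the caller.
func takeOverLock(path string, timeout time.Duration) (release func(), err error) {
	deadline := time.Now().Add(timeout)
	for {
		f, openErr := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
		if openErr == nil {
			f.Close()
			return func() { os.Remove(path) }, nil
		}
		if !errors.Is(openErr, os.ErrExist) || time.Now().After(deadline) {
			return nil, fmt.Errorf("acquiring lock %s: %w", path, openErr)
		}
		time.Sleep(200 * time.Millisecond)
	}
}

// requestRollback records the rollback request in the marker while holding
// the lock, so no watcher can write the marker concurrently.
func requestRollback(topDir, version string) error {
	release, err := takeOverLock(filepath.Join(topDir, "watcher.lock"), 30*time.Second)
	if err != nil {
		return err
	}
	defer release()

	markerPath := filepath.Join(topDir, "upgrade_marker.json")
	data, err := os.ReadFile(markerPath)
	if err != nil {
		return fmt.Errorf("reading upgrade marker: %w", err)
	}
	var marker upgradeMarker
	if err := json.Unmarshal(data, &marker); err != nil {
		return fmt.Errorf("parsing upgrade marker: %w", err)
	}
	marker.RollbackRequested = version
	out, err := json.MarshalIndent(marker, "", "  ")
	if err != nil {
		return err
	}
	if err := os.WriteFile(markerPath, out, 0o600); err != nil {
		return fmt.Errorf("writing upgrade marker: %w", err)
	}

	// At this point a new watcher would be launched; it reads the marker,
	// sees the rollback request, and performs the rollback itself.
	return nil
}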

Why is it important?

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

  1. Package elastic agent twice from this PR:
    SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz  PLATFORMS="linux/amd64" mage -v package
    
    AGENT_PACKAGE_VERSION="9.2.0+20250701000000" SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz  PLATFORMS="linux/amd64" mage -v package
    
  2. Install the version 9.2.0-SNAPSHOT as usual
  3. Trigger an update to the other package (saved on disk):
    elastic-agent upgrade --skip-verify --source-uri=file:///vagrant/build/distributions 9.2.0+20250701000000-SNAPSHOT
    
  4. Wait for the new agent to come online and the upgrade details to signal the UPG_WATCHING state
  5. Manually roll back to the previous version:
    elastic-agent upgrade --rollback 9.2.0-SNAPSHOT

Notes:

  • Trying to roll back after the grace period does not work and may break the agent install (this is because the watcher is still cleaning up the upgrade marker at the end of the grace period)
  • There are no sanity checks yet (that the version we roll back to exists, that a rollback is actually available, etc.); I tried to add TODO comments where those should be placed (see the sketch after these notes)
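
As an illustration of the kind of checks those TODOs point at, here is a rough sketch; the data/elastic-agent-* layout and the exact conditions are assumptions, not the agent's actual validation logic:

package rollback

import (
	"fmt"
	"os"
	"path/filepath"
)

// validateRollbackTarget sketches the minimum checks before accepting
// `elastic-agent upgrade --rollback <version>`: a target was given, it is not
// the running version, and a previous installation is still present on disk.
func validateRollbackTarget(topDir, currentVersion, targetVersion string) error {
	if targetVersion == "" {
		return fmt.Errorf("a rollback target version must be specified")
	}
	if targetVersion == currentVersion {
		return fmt.Errorf("already running version %s", currentVersion)
	}
	matches, err := filepath.Glob(filepath.Join(topDir, "data", "elastic-agent-*"))
	if err != nil {
		return err
	}
	installs := 0
	for _, dir := range matches {
		if info, statErr := os.Stat(dir); statErr == nil && info.IsDir() {
			installs++
		}
	}
	if installs < 2 {
		return fmt.Errorf("no previous installation found to roll back to")
	}
	// A real check would also map one of those directories back to
	// targetVersion (e.g. via its manifest) instead of just counting them.
	return nil
}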

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@mergify mergify bot assigned pchila Jul 1, 2025

mergify bot commented Jul 1, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

u.log.Errorf("error finding process with PID: %d: %s", pid, findProcErr)
continue
}
killProcErr := process.Kill()
Member

I think this is dangerous without coordination between the agent and the watcher to make sure we don't kill the watcher while it is in the process of rolling back the currently running agent.

I still think that if the watcher is running, we could communicate the need to roll back via the StateWatch it already has on this agent instead of having to kill it:

watch, err := ch.agentClient.StateWatch(stateCtx)

Is there a reason that wouldn't work? It seems like it is a safer way to trigger this to me.
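
For illustration, a rough sketch of what the watcher side of that could look like, using stand-in types since the real control protocol state (and any rollback-request field or state name) is not defined in this PR:

package watcher

import "fmt"

// Stand-in types: the real control protocol state and upgrade details differ,
// and "UPG_ROLLBACK_REQUESTED" below is a hypothetical state name.
type upgradeDetails struct {
	State         string
	TargetVersion string
}

type agentState struct {
	UpgradeDetails *upgradeDetails
}

type stateWatch interface {
	Recv() (*agentState, error)
}

// watchForRollback blocks on the state watch the watcher already holds and
// triggers a watcher-driven rollback when the agent reports that one was
// requested, instead of the agent having to kill the watcher process.
func watchForRollback(watch stateWatch, rollback func(version string) error) error {
	for {
		state, err := watch.Recv()
		if err != nil {
			return fmt.Errorf("state watch closed: %w", err)
		}
		if d := state.UpgradeDetails; d != nil && d.State == "UPG_ROLLBACK_REQUESTED" {
			return rollback(d.TargetVersion)
		}
	}
}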

@cmacknz cmacknz (Member) commented Jul 2, 2025

The edge case with this approach is that if the grace period expires just as you indicate a rollback should happen, the watcher might exit before it can see the need to roll back.

That is a nicer edge case than the agent installation being broken, which is my main worry with unconditionally killing the watcher.

@pchila pchila (Member, Author) commented Jul 3, 2025

The Kill() is the PoC part here; the point to bring home is that the Elastic Agent main process should take over from any watcher process that is competing over the appLocker.
This is needed to remove concurrent writes to the upgrade marker: the actual rollback will start later, when a new watcher is launched.
Would the StateWatch usage you are proposing just make the watcher exit, or would it trigger an immediate rollback (driven by the watcher)?

Member

Would the StateWatch usage you are proposing just make the watcher exit, or would it trigger an immediate rollback (driven by the watcher)?

It could be either of these (probably an immediate rollback is simpler?). The main point is to avoid a situation where we kill the watcher while it is rolling back: we need either a graceful shutdown mechanism or a way to avoid having to shut down the watcher at all to trigger a rollback.

@cmacknz cmacknz (Member) commented Jul 3, 2025

One thing I remember about file locks on Windows is that they are not always released immediately.

On Windows the lock is implemented with LockFileEx:

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.

So we need to be careful with locks in the case where the program can exit unexpectedly (e.g. a panic). The watcher is simple enough that this is pretty unlikely, and I don't think we've seen it before. We did observe Beats having problems with this in the past, though, as they are more likely to panic or be OOM-killed.
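
A small sketch of the "explicitly unlock on exit" pattern this suggests; the locker interface below is a stand-in, not the agent's real applocker API:

package watcher

import (
	"context"
	"os"
	"os/signal"
)

// appLocker is a stand-in for the agent's applocker; only the explicit
// lock/unlock pattern matters here.
type appLocker interface {
	TryLock() error
	Unlock() error
}

// runLocked acquires the lock and releases it explicitly on every normal exit
// path, including an interrupt, so we only depend on the OS reclaiming the
// lock (possibly with a delay on Windows, per the LockFileEx docs quoted
// above) after a hard kill.
func runLocked(locker appLocker, run func(ctx context.Context) error) error {
	if err := locker.TryLock(); err != nil {
		return err
	}
	defer locker.Unlock() // explicit release; don't rely on process exit

	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	return run(ctx)
}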

@pchila pchila (Member, Author) commented Jul 8, 2025

@cmacknz I tried using SIGTERM and SIGINT (for the Windows part) in 18c1804, but we receive those signals when the agent that is starting the upgrade re-execs, which causes the watcher to terminate the watch immediately and clean up 😞

We need less commonly used signals, or an entirely different way of making the watcher shut down gracefully (without concurrent writes to files).

Member

CTRL_BREAK might do it (#7738), though in the receiving process Go will translate it into SIGINT, so maybe not.

I think my main worry about the use of locks is one of the processes being killed, not properly releasing the lock, and dooming future upgrades.

Do we need to kill the watcher? We could use the state watch to communicate the intent to roll back, and then have the agent ensure the watcher is always running while a rollback has been requested but not yet performed (e.g. if the watcher exits right after the grace period because of perfect timing, it just gets started again). We could also look at using a separate socket/named pipe to have the agent tell the watcher to roll back.
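
For illustration, a rough sketch of the "ensure the watcher is always running while a rollback is pending" idea; the helpers passed in are hypothetical, not existing agent code:

package agent

import (
	"context"
	"os/exec"
	"time"
)

// superviseWatcher keeps relaunching the watcher for as long as a rollback has
// been requested but not yet performed, so a perfectly timed grace-period exit
// cannot lose the request. rollbackPending and newWatcherCmd are stand-ins for
// whatever the agent would use to track the request and build the command.
func superviseWatcher(ctx context.Context, rollbackPending func() bool, newWatcherCmd func() *exec.Cmd) error {
	for rollbackPending() {
		cmd := newWatcherCmd()
		if err := cmd.Start(); err != nil {
			return err
		}
		_ = cmd.Wait() // watcher exited; relaunch if the rollback is still pending

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // small backoff between restarts
		}
	}
	return nil
}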

Member Author

@cmacknz, yesterday, while looking at the PR with @pkoutsovasilis, we found the source of the SIGINT signal when re-executing: it was explicitly set as the Pdeathsig when launching the watcher:
4f73814#diff-168080314caf1d3868d593889dc36edec0ec2f12fd50e37d5a3d57e05274a10aL41

Removing it restored the watching functionality on agent restart.
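
For context, a Linux-only sketch of how such a parent-death signal is attached to a launched process; the binary path and the exact `watch` invocation below are assumptions, not the agent's actual launch code:

//go:build linux

package launcher

import (
	"os/exec"
	"syscall"
)

// watcherCmd shows how a parent-death signal gets attached when launching the
// watcher: with Pdeathsig set, the kernel sends SIGINT to the watcher as soon
// as its parent exits, which is exactly what happens when the agent re-execs
// during an upgrade. Dropping the Pdeathsig line is what the linked change does.
func watcherCmd(agentBinary string) *exec.Cmd {
	cmd := exec.Command(agentBinary, "watch")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Pdeathsig: syscall.SIGINT, // delivered to the watcher when the parent dies
	}
	return cmd
}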

@pchila pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from a7e6486 to 33bfe58 Compare July 11, 2025 07:09

mergify bot commented Jul 11, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lock-free-manual-rollback upstream/lock-free-manual-rollback
git merge upstream/main
git push upstream lock-free-manual-rollback

@pchila pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from 764b7ad to e4f6b45 Compare July 15, 2025 16:41
@pchila pchila changed the title [DO NOT MERGE] - Lock free manual rollback PoC Lock free manual rollback Jul 25, 2025
@pchila pchila changed the title Lock free manual rollback Elastic Agent upgrade: lock-free manual rollback Jul 25, 2025
@pchila pchila force-pushed the lock-free-manual-rollback branch from 2a3d70c to fa7ce8f Compare July 28, 2025 08:14
@pchila pchila force-pushed the lock-free-manual-rollback branch from fa7ce8f to 6873b56 Compare July 28, 2025 08:38
@elasticmachine (Collaborator) commented Jul 28, 2025

💔 Build Failed

Failed CI Steps

History

cc @pchila
