Elastic Agent upgrade: lock-free manual rollback #8767
Conversation
This pull request does not have a backport label. Could you fix it @pchila? 🙏
```go
u.log.Errorf("error finding process with PID: %d: %s", pid, findProcErr)
continue
}
killProcErr := process.Kill()
```
I think this is dangerous without coordination between the agent and the watcher to make sure we don't kill the watcher while it is in the process of rolling back the currently running agent.

I still think that, if the watcher is running, we could communicate the need to roll back via the `StateWatch` it already has on this agent, instead of having to kill it:

```go
watch, err := ch.agentClient.StateWatch(stateCtx)
```

Is there a reason that wouldn't work? It seems like a safer way to trigger this to me.
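For illustration, a rough sketch of what driving the rollback off the watcher's existing state watch could look like. Everything below is hypothetical (the `RollbackRequested` field does not exist in the control protocol today and would have to be added):

```go
import "context"

// agentState is a hypothetical stand-in for the state pushed by the agent,
// extended with a flag carrying the "manual rollback requested" intent.
type agentState struct {
	RollbackRequested bool
}

// stateWatch mirrors the stream obtained from agentClient.StateWatch.
type stateWatch interface {
	Recv() (*agentState, error)
}

// watchForRollback keeps consuming states from the watch the watcher
// already holds and triggers the rollback when the agent asks for one.
func watchForRollback(ctx context.Context, watch stateWatch, rollback func(context.Context) error) error {
	for {
		state, err := watch.Recv() // blocks until the agent pushes a new state
		if err != nil {
			return err // stream broken: existing unhealthy-agent handling applies
		}
		if state.RollbackRequested {
			return rollback(ctx) // immediate rollback, driven by the watcher
		}
	}
}
```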
The edge case with this approach is that if the grace period expires just as you indicate a rollback should happen, the watcher might exit before it can see the rollback request.
That is a nicer edge case than the agent installation being broken, which is my main worry with unconditionally killing the watcher.
The `Kill()` is the PoC part here; the point to bring home is that the elastic agent main process should take over any watcher process that is competing over the `appLocker`.
This is needed to remove concurrent writes to the upgrade marker: the actual rollback will start later when a new watcher is launched.
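A rough sketch of that takeover sequence (the lock API loosely approximates the agent's file-lock helpers; `writeRollbackRequest` and `startWatcher` are placeholders for illustration):

```go
import (
	"context"
	"fmt"
)

// requestRollback sketches the intended flow: once the competing watcher
// is gone, the main process owns the lock, records the rollback request
// without racing anyone, then hands off to a fresh watcher.
func requestRollback(ctx context.Context, topDir string) error {
	// The watcher that held the lock was killed above (the PoC part),
	// so taking the lock ourselves should now succeed.
	locker := filelock.NewAppLocker(topDir, "watcher.lock")
	if err := locker.TryLock(); err != nil {
		return fmt.Errorf("watcher lock still held: %w", err)
	}

	// With exclusive ownership there is no concurrent writer to the
	// upgrade marker while the rollback request is recorded.
	if err := writeRollbackRequest(topDir); err != nil {
		locker.Unlock()
		return err
	}

	// Release the lock and launch a fresh watcher: that watcher reads
	// the marker and performs the actual rollback later.
	if err := locker.Unlock(); err != nil {
		return err
	}
	return startWatcher(ctx)
}
```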
Would the `StateWatch` usage that you are proposing just make the watcher exit, or would it trigger an immediate rollback (driven by the watcher)?
> Would the `StateWatch` usage that you are proposing just make the watcher exit, or would it trigger an immediate rollback (driven by the watcher)?

It could be either of these (probably immediate rollback is simpler?). The main point is to avoid a situation where we kill the watcher while it is rolling back: we either need a graceful shutdown mechanism or a way to avoid having to shut down the watcher to trigger a rollback.
One thing I remember about file locks on Windows is that they are not always released immediately.
On Windows the lock is implemented with `LockFileEx`:

> If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.
So we need to be careful with locks in the case where the program can exit unexpectedly (e.g. a panic). The watcher is simple enough that this is pretty unlikely, and I don't think we've seen it before. We did observe Beats having problems with this in the past, though, as they are more likely to panic or be OOM-killed.
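To make the failure mode concrete, a minimal Windows-only sketch of that lock lifecycle (illustrative only, not the agent's actual locking code):

```go
//go:build windows

package main

import (
	"os"

	"golang.org/x/sys/windows"
)

// withFileLock takes an exclusive LockFileEx lock, runs the critical
// section, and explicitly releases the lock. If the process dies before
// the deferred unlock runs (panic, OOM-kill), the OS releases the lock
// "eventually" -- and that delay is exactly the problem quoted above.
func withFileLock(path string, critical func() error) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	h := windows.Handle(f.Fd())
	ol := new(windows.Overlapped)

	// Exclusive, non-blocking lock over the first byte of the file.
	flags := uint32(windows.LOCKFILE_EXCLUSIVE_LOCK | windows.LOCKFILE_FAIL_IMMEDIATELY)
	if err := windows.LockFileEx(h, flags, 0, 1, 0, ol); err != nil {
		return err
	}
	// Explicit unlock, as the docs recommend; skipped on an unexpected exit.
	defer windows.UnlockFileEx(h, 0, 1, 0, ol)

	return critical()
}
```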
@cmacknz Tried using SIGTERM and SIGINT (for the Windows part) in 18c1804, but we receive those signals when the agent starting the upgrade re-execs, which makes the watcher terminate the watch immediately and clean up 😞
We need less commonly used signals, or an entirely different way of making the watcher shut down gracefully (without concurrent writes to files).
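For reference, a sketch of the shape of what was tried (the `run` callback stands in for the actual watch loop):

```go
import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// runWatcher ties the watcher's context to SIGTERM/SIGINT so it can
// clean up before exiting. The catch described above: the re-exec during
// a normal upgrade delivers these same signals, so the watch gets torn
// down prematurely even when no shutdown was intended.
func runWatcher(run func(context.Context) error) error {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()
	return run(ctx)
}
```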
CTRL_BREAK might do it #7738, though in the receiving process Go will translate it into SIGINT, so maybe not.
I think my main worry about the use of locks is one of the processes being killed, not properly releasing the lock, and dooming future upgrades.
Do we need to kill the watcher at all? We could use the state watch to communicate the intent to roll back, and then have the agent ensure the watcher is always running when a rollback has been requested but hasn't been performed yet (e.g. if the watcher exits after the grace period because of perfect timing, it just gets started again); see the sketch below. We could also look at using a separate socket/named pipe to have the agent tell the watcher to roll back.
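A sketch of what that could look like on the agent side (every function used here is an illustrative placeholder, not an existing agent API):

```go
import (
	"context"
	"time"
)

// ensureWatcher keeps a watcher process alive for as long as a rollback
// has been requested but not yet performed, instead of ever killing one.
func ensureWatcher(ctx context.Context) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for rollbackRequested() && !rollbackPerformed() {
		if !watcherRunning() {
			// Covers the perfect-timing case: a watcher that exited at
			// the end of the grace period just gets started again.
			if err := startWatcher(ctx); err != nil {
				return err
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
	return nil
}
```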
@cmacknz, yesterday, looking at the PR with @pkoutsovasilis, we found the source of the SIGINT signal when re-executing: it was explicitly set as the `Pdeathsig` signal when launching the watcher:
4f73814#diff-168080314caf1d3868d593889dc36edec0ec2f12fd50e37d5a3d57e05274a10aL41
Removing that restored the watching functionality on agent restart.
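For context, the launch code before the fix looked roughly like this (Linux-only field; `watcherBinary` is a placeholder name):

```go
// With Pdeathsig set, the kernel delivers SIGINT to the watcher as soon
// as its parent (the re-executing agent) exits -- exactly the spurious
// signal observed above. Dropping this field restored watching across
// agent restarts.
cmd := exec.Command(watcherBinary, "watch")
cmd.SysProcAttr = &syscall.SysProcAttr{
	Pdeathsig: syscall.SIGINT, // removed in the fix
}
```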
This pull request is now in conflicts. Could you fix it? 🙏
💔 Build Failed
cc @pchila
What does this PR do?
This is a PoC for a lock-free implementation of #6887 and #6889.
It's still very rough around the edges, and rollback only works correctly when rolling back during the watching phase.
This can be a starting point for discussing the real implementation.
This PR makes the elastic agent main process "take over" the watcher applocker to write the rollback request before running the watcher again; the new watcher will then perform the rollback.
Why is it important?
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding changes to the default configuration files
- [ ] I have added an entry in `./changelog/fragments` using the changelog tool
- [ ] I have added an integration test or an E2E test

Disruptive User Impact
How to test this PR locally
- Upgrade to `9.2.0-SNAPSHOT` as usual
- Run `elastic-agent upgrade --rollback 9.2.0-SNAPSHOT`
Notes:
Related issues
- … `rollback_window` is set #6880
- `--rollback` option to `elastic-agent` upgrade subcommand #6887
- … `UPG_WATCHING` state #6890

Questions to ask yourself