want sled-agent timeout on completion of requests to stop a Propolis VM #4004

Open
gjcolombo opened this issue Aug 31, 2023 · 13 comments · May be fixed by #7548 or #6795
Labels: known issue · nexus · Sled Agent

@gjcolombo
Contributor

Today, requests to stop a running instance necessarily involve the instance's active Propolis: sled agent sends the stop request to Propolis, the instance is considered stopped when Propolis says so, and at that point sled agent cleans up the Propolis zone and all related objects. If an instance's Propolis is not responding, or there is no active Propolis, there is no obvious way to clean up the instance and recover.

A short-term workaround is to expose some form of API access to sled agent's "unregister instance" API, which forcibly executes the termination path (tearing down the Propolis zone and removing the instance from the sled's instance table) and can force an instance into a stopped state.

In the long run, instance lifecycle management needs to be made more robust to Propolis failure and/or non-responsiveness.
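
For illustration, here is a rough sketch of the stop path described above, using hypothetical names and signatures (this is not the actual sled-agent code): the request only resolves once Propolis reports the VM destroyed, so a hung or unreachable Propolis leaves the instance stuck.

```rust
use std::time::Duration;

#[derive(PartialEq)]
enum VmState { Stopped, Destroyed }

// Stand-ins for sled agent's Propolis client and zone teardown; the names
// and signatures here are hypothetical.
struct PropolisClient;
impl PropolisClient {
    async fn put_state(&self, _s: VmState) -> Result<(), String> { Ok(()) }
    async fn get_state(&self) -> Result<VmState, String> { Ok(VmState::Destroyed) }
}
async fn teardown_propolis_zone() {}

async fn stop_instance(propolis: &PropolisClient) -> Result<(), String> {
    // Ask Propolis to stop the VM.
    propolis.put_state(VmState::Stopped).await?;
    // Wait for Propolis to say the VM is gone. There is no timeout here, so
    // if Propolis never answers, this loop never exits and the zone is never
    // cleaned up -- the failure mode this issue describes.
    while propolis.get_state().await? != VmState::Destroyed {
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    // Only then are the Propolis zone and its related objects torn down.
    teardown_propolis_zone().await;
    Ok(())
}
```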

@gjcolombo gjcolombo added the Sled Agent and nexus labels Aug 31, 2023
@jordanhendricks
Contributor

Related: #3209

@gjcolombo gjcolombo changed the title Instance stop doesn't work if Propolis panics or is unreachable Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state Oct 6, 2023
@gjcolombo
Contributor Author

In #4194, sled agent's "instance unregister" API assumes that it can produce the correct posterior VMM and instance states by emulating a "Propolis destroyed" state transition. (That is, sled agent's "rudely terminate this instance" function computes the next state by pretending that it immediately got a message from the current VMM that says "I am destroyed and my current migration has failed.")

This is fine today because the unregister API is only used when unwinding start and migrate sagas, where the VMMs that are subject to unregistration have by definition not gotten to do anything interesting yet. It's less fine if Propolis has already begun to run, and especially not fine if we're force-quitting an instance that's a migration target:

  1. Source and target VMMs agree that migration is done
  2. Just before they tell their respective sled agents, Nexus shows up and rudely terminates the source
  3. The emulated "VMM destroyed, migration failed" transition retires the source and the migration, which makes it possible to start the instance again...
  4. ...but the target might actually begin to run while this is happening!

This seems dangerous. We probably want to adjust the synchronization here to be something more like the following:

  1. Nexus "disowns" a particular VMM, e.g. by replacing its ID in an instance record with a zeroed ID (or some other set of sentinel values that prevents the instance from restarting)
  2. Nexus terminates all of the disowned VMMs
  3. Nexus clears out the sentinels to allow the instance to start again

This would need to be done in a saga to ensure the whole process runs to completion. More design work's needed here. We just need to do that work before we hook up any external APIs to the existing instance_ensure_unregistered sled agent interface.
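
A minimal sketch of the disown → terminate → release ordering described above, with hypothetical datastore and helper names (the real design would be a proper saga with one node per step):

```rust
use uuid::Uuid;

// Hypothetical stand-ins for the Nexus datastore and sled-agent call; not
// the real APIs.
struct Datastore;
impl Datastore {
    /// Step 1: replace the instance's active-VMM pointer with a sentinel so
    /// the instance cannot be restarted while teardown is in flight.
    async fn disown_vmm(&self, _instance: Uuid, _vmm: Uuid) -> Result<(), String> { Ok(()) }
    /// Step 3: clear the sentinel once all disowned VMMs are gone, allowing
    /// the instance to start again.
    async fn clear_sentinel(&self, _instance: Uuid) -> Result<(), String> { Ok(()) }
}
/// Step 2: rudely terminate a VMM (tear down its Propolis zone).
async fn terminate_vmm(_vmm: Uuid) -> Result<(), String> { Ok(()) }

// Each step would be a saga node so the sequence runs to completion even if
// the Nexus driving it dies partway through.
async fn force_terminate(db: &Datastore, instance: Uuid, vmms: &[Uuid]) -> Result<(), String> {
    for &vmm in vmms {
        db.disown_vmm(instance, vmm).await?;
    }
    for &vmm in vmms {
        terminate_vmm(vmm).await?;
    }
    db.clear_sentinel(instance).await
}
```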

@davepacheco
Collaborator

Also related: #4872

@askfongjojo askfongjojo added the known issue label Mar 9, 2024
@askfongjojo askfongjojo added this to the 8 milestone Mar 9, 2024
@askfongjojo askfongjojo modified the milestones: 8, 9 Apr 24, 2024
@morlandi7 morlandi7 modified the milestones: 9, 10 Jul 1, 2024
hawkw added a commit that referenced this issue Aug 9, 2024
A number of bugs relating to guest instance lifecycle management have
been observed. These include:

- Instances getting "stuck" in a transient state, such as `Starting` or
`Stopping`, with no way to forcibly terminate them (#4004)
- Race conditions between instances starting and receiving state
updates, which cause provisioning counters to underflow (#5042)
- Instances entering and exiting the `Failed` state when nothing is
actually wrong with them, potentially leaking virtual resources (#4226)

These typically require support intervention to resolve.

Broadly, these issues exist because the control plane's current
mechanisms for understanding and managing an instance's lifecycle state
machine are "kind of a mess". In particular:

- **(Conceptual) ownership of the CRDB `instance` record is currently
split between Nexus and sled-agent(s).** Although Nexus is the only
entity that actually reads or writes to the database, the instance's
runtime state is also modified by the sled-agents that manage its active
Propolis (and, if it's migrating, its target Propolis), and written to
CRDB on their behalf by Nexus. This means that there are multiple copies
of the instance's state in different places at the same time, which can
potentially get out of sync. When an instance is migrating, its state is
updated by two different sled-agents, and they may potentially generate
state updates that conflict with each other. And, splitting the
responsibility between Nexus and sled-agent makes the code more complex
and harder to understand: there is no one place where all instance state
machine transitions are performed.
- **Nexus doesn't ensure that instance state updates are processed
reliably.** Instance state transitions triggered by user actions, such
as `instance-start` and `instance-delete`, are performed by distributed
sagas, ensuring that they run to completion even if the Nexus instance
executing them comes to an untimely end. This is *not* the case for
operations that result from instance state transitions reported by
sled-agents, which just happen in the HTTP APIs for reporting instance
states. If the Nexus processing such a transition crashes, gets
network-partitioned, or encounters a transient error, the instance is left in
an incomplete state and the remainder of the operation will not be
performed.

This branch rewrites much of the control plane's instance state
management subsystem to resolve these issues. At a high level, it makes
the following changes:

- **Nexus is now the sole owner of the `instance` record.** Sled-agents
no longer have their own copies of an instance's `InstanceRuntimeState`,
and do not generate changes to that state when reporting instance
observations to Nexus. Instead, the sled-agent only publishes updates to
the `vmm` and `migration` records (which are never modified by Nexus
directly) and Nexus is the only entity responsible for determining how
an instance's state should change in response to a VMM or migration
state update.
- **When an instance has an active VMM, its effective external state is
determined primarily by the active `vmm` record**, so that fewer state
transitions *require* changes to the `instance` record. PR #5854 laid
the groundwork for this change, but it's relevant here as well.
- **All updates to an `instance` record (and resources conceptually
owned by that instance) are performed by a distributed saga.** I've
introduced a new `instance-update` saga, which is responsible for
performing all changes to the `instance` record, virtual provisioning
resources, and instance network config that are performed as part of a
state transition. Moving this to a saga helps us to ensure that these
operations are always run to completion, even in the event of a sudden
Nexus death.
- **Consistency of instance state changes is ensured by distributed
locking.** State changes may be published by multiple sled-agents to
different Nexus replicas. If one Nexus replica is processing a state
change received from a sled-agent, and then the instance's state changes
again, and the sled-agent publishes that state change to a *different*
Nexus...lots of bad things can happen, since the second state change may
be performed from the previous initial state, when it *should* have a
"happens-after" relationship with the other state transition. And, some
operations may contradict each other when performed concurrently.

To prevent these race conditions, this PR has the dubious honor of using
the first _distributed lock_ in the Oxide control plane, the "instance
updater lock". I introduced the locking primitives in PR #5831 --- see
that branch for more discussion of locking.
- **Background tasks are added to prevent missed updates**. To ensure we
cannot accidentally miss an instance update even if a Nexus dies, hits a
network partition, or just chooses to eat the state update accidentally,
we add a new `instance-updater` background task, which queries the
database for instances that are in states that require an update saga
without such a saga running, and starts the requisite sagas.

Currently, the instance update saga runs in the following cases:

- An instance's active VMM transitions to `Destroyed`, in which case the
instance's virtual resources are cleaned up and the active VMM is
unlinked.
- Either side of an instance's live migration reports that the migration
has completed successfully.
- Either side of an instance's live migration reports that the migration
has failed.

The inner workings of the instance-update saga itself are fairly complex,
and it has some interesting idiosyncrasies relative to the existing
sagas. I've written up a [lengthy comment] that provides an overview of
the theory behind the design of the saga and its principles of
operation, so I won't reproduce that in this commit message.

[lengthy comment]:
https://github.com/oxidecomputer/omicron/blob/357f29c8b532fef5d05ed8cbfa1e64a07e0953a5/nexus/src/app/sagas/instance_update/mod.rs#L5-L254
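
To make the "instance updater lock" mentioned above more concrete, here is an illustrative compare-and-swap sketch with made-up helper names; the actual locking primitives were introduced in PR #5831 and differ in detail.

```rust
use uuid::Uuid;

// Made-up datastore helpers; in the real system the lock lives in the
// `instance` record and is manipulated with conditional database updates.
struct Datastore;
impl Datastore {
    /// Atomically set the instance's updater-lock field to `saga_id` iff it
    /// is currently unset; returns whether the lock was acquired.
    async fn try_lock_instance(&self, _instance: Uuid, _saga_id: Uuid) -> Result<bool, String> {
        Ok(true)
    }
    /// Clear the lock, but only if `saga_id` still holds it.
    async fn unlock_instance(&self, _instance: Uuid, _saga_id: Uuid) -> Result<(), String> {
        Ok(())
    }
}

async fn run_update_saga(db: &Datastore, instance: Uuid) -> Result<(), String> {
    let saga_id = Uuid::new_v4();
    if !db.try_lock_instance(instance, saga_id).await? {
        // Another update saga already holds the lock; the `instance-updater`
        // background task will start a new saga later if more work remains.
        return Ok(());
    }
    // ... recompute the instance's state from its `vmm` and `migration`
    // records and apply the result ...
    db.unlock_instance(instance, saga_id).await
}
```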
@morlandi7 morlandi7 modified the milestones: 10, 11 Aug 14, 2024
@morlandi7 morlandi7 modified the milestones: 11, 12 Sep 26, 2024
@hawkw
Member

hawkw commented Oct 3, 2024

Do we expect the interface into the forcibly-terminate operation to be exposed in the external API, and if so, would we want to make it a more privileged operation than the normal instance-stop and instance-delete APIs?

@gjcolombo
Contributor Author

> Do we expect the interface into the forcibly-terminate operation to be exposed in the external API, and if so, would we want to make it a more privileged operation than the normal instance-stop and instance-delete APIs?

I think the answers are "yes" and "no." The idea of the API is to give users a crowbar that they can use to unstick an instance in the unlikely event that it gets stuck in a transitional Propolis state like Starting or Stopping. I would give the API an appropriately forceful name ("force-quit"? "force-terminate"?) to try to emphasize that this just blows away the entire VM process and doesn't give anything in it the chance to run any cleanup logic, and I'm not sure I'd add a console option for it (right away, anyway), but I do think it should be available to regular users.

@hawkw hawkw self-assigned this Oct 4, 2024
@hawkw
Member

hawkw commented Oct 4, 2024

Yup, that makes sense.

@hawkw
Member

hawkw commented Oct 9, 2024

We recently discussed this, and came to the conclusion that it seems unfortunate to present the user with two different ways to stop an instance, one of which has a big warning label on it that says "only do this in case of emergencies", even though the two have no difference in observable effects from the guest's perspective.¹ This forces the user to decide which way of essentially just pulling the virtual power cord out of their VM to use.

Instead, we might consider just making the system resilient to Propolis getting stuck or misbehaving whilst shutting down --- a normal instance-stop request could cause sled-agent to set a timeout, after which the Propolis zone is forcefully deleted if Propolis doesn't report in to say it's exited normally.

Furthermore, we would like to eventually make Propolis attempt to shut guests down more gracefully, but that's out of scope for this issue. See oxidecomputer/propolis#784

Footnotes

  1. Given that Propolis does not currently attempt to gracefully shut down the guest, the difference between "stop" and "force-terminate" is just whether Propolis itself is given the opportunity to put its affairs in order before the zone is torn down.
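
A rough sketch of the proposed behavior, assuming hypothetical helper functions: ask Propolis to stop, then fall back to forcibly deleting the zone if it doesn't report the VM stopped within the deadline.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical helpers standing in for sled-agent's real machinery.
async fn request_graceful_stop() {}
async fn wait_for_vm_stopped() {}
async fn forcibly_destroy_propolis_zone() {}

async fn stop_with_deadline(deadline: Duration) {
    // Politely ask Propolis to stop the VM.
    request_graceful_stop().await;
    match timeout(deadline, wait_for_vm_stopped()).await {
        // Propolis reported the VM stopped in time; tear down the zone the
        // normal way.
        Ok(()) => {}
        // Propolis never reported back; pull the virtual power cord and
        // delete the zone out from under it.
        Err(_elapsed) => forcibly_destroy_propolis_zone().await,
    }
}
```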

@morlandi7 morlandi7 modified the milestones: 12, 13 Nov 26, 2024
@hawkw
Member

hawkw commented Jan 30, 2025

#6915 and #6913 solved at least a subset of the problems related to killing instances that are in a stuck state, so now the normal shutdown API in sled-agent should basically be equivalent to force-shutdown, as far as we are capable of having such a thing. I think this probably ought to be closed; what do you think @gjcolombo?

@gjcolombo
Contributor Author

> Instead, we might consider just making the system resilient to Propolis getting stuck or misbehaving whilst shutting down

I thought we still needed to deal with this part (i.e., if Propolis is stuck in Starting or Stopping, then a request to stop its instance won't resolve). I think we should have an issue (this one or another one) tracking that behavior, since there's currently no other way to deal with an instance that's wedged waiting for Crucible activation (though that specific case might be resolvable by other means, see oxidecomputer/propolis#841).

@hawkw
Member

hawkw commented Feb 11, 2025

@gjcolombo sorry I missed your reply from earlier. Just to clarify, is this:

> I thought we still needed to deal with this part (i.e., if Propolis is stuck in Starting or Stopping, then a request to stop its instance won't resolve). I think we should have an issue (this one or another one) tracking that behavior...

a change in Propolis, or in Omicron?

As I understand it, Nexus will happily send a stop request to sled-agent for an instance that's Starting, but I believe that the sled-agent's only way of handling that request is to send an HTTP request to Propolis. As of #6915, the sled-agent should send that request immediately even if it's waiting on an outstanding request to perform some other operation. But, I'm not sure off the top of my head whether Propolis will be able to handle that request if it's waiting for Crucible activation.

Is my understanding correct? If so, it seems to me that maybe we should close this and go open a Propolis ticket about prioritizing requests to terminate...or, we should give the sled-agent a way to tear down the Propolis zone violently should it fail to handle a shutdown request.

@morlandi7 morlandi7 modified the milestones: 13, 14 Feb 12, 2025
@gjcolombo
Contributor Author

We checked in on this in today's hypervisor huddle. I believe the consensus was that

  1. sled agent should have a timeout (a long timeout, on the order of several minutes, but a timeout) on requests to stop a Propolis VM; if the timeout expires before the VM enters the Destroyed state, sled agent should rudely terminate the Propolis zone and move the instance to Failed; and
  2. we should still aim to fix propolis#841 as soon as possible, since this is the primary issue causing instances to get into a state where one would rely on this kind of timeout.

@gjcolombo gjcolombo changed the title Want mechanism to forcibly remove an instance's active VMMs irrespective of instance state want sled-agent timeout on completion of requests to stop a Propolis VM Feb 13, 2025
@leftwo
Contributor

leftwo commented Feb 13, 2025

I believe I found another way to trigger this situation.
If you expunge a sled with running instances, those instances are auto-restarted on a new sled.
That start can race with downstairs repair: if an instance is started first, it will have an old VCR (one that needs repair).
The instance with the old VCR won't be able to activate, because one of its three downstairs is now missing.
So, no activation, and that means Propolis can't act on any repair request that Nexus sends it.

hawkw added a commit that referenced this issue Feb 14, 2025
Presently, sled-agent's `InstanceRunner` has two mechanisms for shutting
down a VMM: sending an instance state PUT request to the
`propolis-server` process for the `Stopped` state, or forcibly
terminating the `propolis-server` and tearing down the zone. At present,
when a request to stop an instance is sent to the sled-agent, it uses
the first mechanism, where Propolis is politely asked to stop the
instance --- which I'll refer to as "graceful shutdown". The forceful
termination path is used when asked to unregister an instance where the
VMM has not started up yet, when encountering an unrecoverable VMM
error, or when killing an instance that was making use of an expunged
disk. Currently, these two paths don't really overlap: when Nexus asks
a sled-agent to stop an instance, all it will do is politely ask
Propolis to please stop the instance gracefully, and will only fall back
to violently shooting the zone in the face if Propolis returns the error
that indicates it never knew about that instance in the first place.

This means that, should a VMM get *stuck* while shutting down the
instance, stopping it will never complete successfully, and the Propolis
zone won't get cleaned up. This can happen due to e.g. [a Crucible
activation that will never complete][1]. Thus, the sled-agent should
attempt to violently terminate a Propolis zone when a graceful shutdown
of the VMM fails to complete in a timely manner.

This commit introduces a timeout for the graceful shutdown process.
Now, when we send a PUT request to Propolis with the `Stopped` instance
state, the sled-agent will start a 10-minute timer. If no update from
Propolis that indicates the instance has transitioned to `Stopped` is
received before the timer completes, the sled-agent will proceed with
the forceful termination of the Propolis zone.

Fixes #4004.

[1]: #4004 (comment)
@hawkw hawkw linked a pull request Feb 14, 2025 that will close this issue
@hawkw
Member

hawkw commented Feb 14, 2025

Okay, I put together a quick pass on the shutdown timeout: #7548

I still need to actually test that this handles the cases we care about before merging that PR, but I'm pretty sure the timeout should cover the entire shutdown process, starting from when we send the PUT to Propolis with the Stopped state.

hawkw added a commit that referenced this issue Feb 21, 2025