Move waiting logic for host orchestrator settlement to backend#521
Move waiting logic for host orchestrator settlement to backend#521k311093 wants to merge 1 commit into
Conversation
deec9a3 to
2091716
Compare
| if err != nil { | ||
| return nil, err | ||
| } | ||
| client, err := g.InstanceManager.GetHostClient(g.Zone, host.Name) |
There was a problem hiding this comment.
From GceIM, GetHostClient internally invokes getHostInstance, which seems to be calling GCE API to retrieve the host. However, this method already has ins which is typed as *compute.Instance.
I think it's better not to get *compute.Instance twice, also not to bring new fields(InstanceManager, Zone) into opResultGetter.
There was a problem hiding this comment.
Changed logic to pass *compute.Instance and reuse it if specified.
493fffb to
b5e8c57
Compare
There was a problem hiding this comment.
Adding an inside logic to wait for when the host is ready to interact with shouldn't be part of the CreateHost request logic on the backend side.
CreateHost returns success when the host is created and host resources are allocated, not when the host is ready to interact with. To do this properly, you'd need to add and implement a new http endpoint that tests whether the host is ready, still clients would need to make these calls after CreateHost returned in order to determine whether the host is ready to interact with. However, it's up to every client implementation how they want to handle this, they can always keep doing what the current client is doing.
What use is a host that doesn't have a host orchestrator? If the answer is "none", then it would make sense to only mark the operation to create a host as completed once the host orchestrator is ready. It simplifies the api and puts the need for waiting in a single place (the cloud orchestrator) rather than in every client. The CO is already waiting for the host to be created, it might as well wait a bit longer. |
Please add more details to the description and the title of the PR. What is "host orchestrator settlement"? Which backend? The HO or the CO or something else? Why is this change useful/needed? What are the pros and cons? Even if you write a description in the bug that's not publicly available like this PR is. |
Updated commit message with descriptions. |
+1. I expect users are understanding the meaning of host as a host environment to launch Cuttlefish instances, and the route of orchestrating CF instances is Host Orchestrator. I think health of Host Orchestrator is more representative than health of computing unit(e.g. GCE or docker instance). From API view, I expect users create a host and verify with 3 steps; |
b3d13e7 to
8d2831f
Compare
| if err != nil { | ||
| return nil, err | ||
| } | ||
| if err := client.WaitForHostReady(); err != nil { |
There was a problem hiding this comment.
+1
Waiting for the host to be ready (HO is up and running) should be done in the wait-for-create-operation-to-complete endpoint. Also, please take into account the following caveats:
- The /:wait endpoint isn't just for waiting for host creation to complete, it's used to wait for all types of operations. You're going to need to differentiate between the different operation types to only add the extra wait to the correct operation. You'll also need a way to figure out the host name from the operation.
- Waiting for the HO in the CO should have a timeout, after which the create operation should return failure. The timeout should depend on the actual InstanceManager implementation, as GCE hosts boot times are different than docker containers'.
- The returned failure be clear enough to differentiate between a failure to create the host and the HO not becoming reachable before the timeout.
- The wait for operation must detect and handle correctly (return failure) if a host was successfully created but deleted by a concurrent call before the create operation succeeded (it's ok to wait until the timeout the first time, but any subsequent wait should return immediately; this means in practices that you should check the host exists before you try to contact the HO).
While the wait-for-create-operation is being called, the list operation may return the existing host, even when the HO isn't ready in it. That's the way it works today, it's fine to keep that behavior for now.
Let's do #521 (comment) |
Got it. Thanks! |
f8c9f9f to
ef76e5e
Compare
|
Added endpoint |
| // `Get`/`Create`/`Update`, the response should be the relevant resource. | ||
| router.Handle("/v1/zones/{zone}/operations/{operation}/:wait", c.Authenticate(c.waitOperation)).Methods("POST") | ||
| router.Handle("/v1/zones/{zone}/hosts/{host}", c.Authenticate(c.deleteHost)).Methods("DELETE") | ||
| router.Handle("/v1/zones/{zone}/hosts/{host}/:wait-host-availability", c.Authenticate(c.waitCreateHost)).Methods("POST") |
There was a problem hiding this comment.
The CO API is designed to have only one :wait endpoint /v1/zones/{zone}/operations/{operation}/:wait, there are no specific wait endpoints like: /:wait-for-create or /:wait-for-delete for example.
For the purpose of this PR specifically wondering whether can you can extend the wait logic when the operation was a create host operation and wait a bit longer till the HO is reachable rather than adding a new endpoint.
There was a problem hiding this comment.
In that case, creating another endpoint for creating operation for waiting host orchestrator's settlement and wait it using previous :wait would be a proper way?
There was a problem hiding this comment.
The purpose of this PR was to simplify the client logic. Creating a new operation for waiting on HO readiness doesn't simplify the client implementation, it makes things even more complicated than what we have right now.
The way to make client logic simpler is for the client to make two calls only:
POST /host
POST /operation/foo/:wait
When the operation was a "create host operation". POST /operation/foo/:wait should be resolved when the host was created and the HO is ready. Right now we resolve "create host operation" the moment the host was created without waiting for the HO to be ready, we need to extend this waiting logic for create host operations and not add a new endpoint.
There was a problem hiding this comment.
Changed to wait for host ready at :wait like other operations. Moved host wait logic to WaitOperation function, not Get() method like before.
Checked with cloud orchestrator at GCP, docker and confirmed that it successfully created host and subsequent tasks (for GCP side, didn't checked CVD creation completely, but checked that uploading, extracting artifacts worked well).
There was a problem hiding this comment.
If the wait logic was moved behind POST /v1/zones/{zone}/operations/{operation}/:wait can we remove POST /v1/zones/{zone}/hosts/{host}/:wait-host-availability?
1a22a6e to
ec38115
Compare
| // `Get`/`Create`/`Update`, the response should be the relevant resource. | ||
| router.Handle("/v1/zones/{zone}/operations/{operation}/:wait", c.Authenticate(c.waitOperation)).Methods("POST") | ||
| router.Handle("/v1/zones/{zone}/hosts/{host}", c.Authenticate(c.deleteHost)).Methods("DELETE") | ||
| router.Handle("/v1/zones/{zone}/hosts/{host}/:wait-host-availability", c.Authenticate(c.waitCreateHost)).Methods("POST") |
There was a problem hiding this comment.
If the wait logic was moved behind POST /v1/zones/{zone}/operations/{operation}/:wait can we remove POST /v1/zones/{zone}/hosts/{host}/:wait-host-availability?
c4d69b0 to
ce31cf8
Compare
45a38bc to
0d668a2
Compare
While interacting with host orchestrator right after creating host API is returned, there is a possibility of getting 503 service temporarily unavailable error because host is created and booted but host orchestrator is not executed / ready yet. Handling logic for waiting host orchestrator readiness is currently in the client logic. Moving this logic into cloud orchestrator API endpoint (specifically inside WaitOperation) makes the client simpler and reduces redundant API endpoints. Context: b/501288123 TAG=agy CONV=f5f788b6-7aa2-4614-b441-87f3ad9fed4e
While interacting with host orchestrator right after creating host
API is returned, there is a possibility of getting 503 service
temporarily unavailable error because host it created and booted but
host orchestrator is not executed / ready yet.
Handling logic for waiting host orchestrator readiness is currently
in the client logic. Moving this logic into cloud orchestrator API
endpoint would make the client that uses host chestrator API simpler.
Context: b/501288123