feat(kubernetes): support HA gateway rebalancing #1868

Open
TaylorMutch wants to merge 6 commits into main from 1021-ha-gateway-rebalancing/tm

@TaylorMutch TaylorMutch commented Jun 11, 2026

Collaborator

Summary

Adds HA gateway rebalancing support for Kubernetes deployments so client and supervisor traffic can survive gateway replica scale-up, scale-down, and pod rotation.

This PR targets main directly. The reconciler lease work from #1577 has already landed, so this PR now focuses on peer authentication/routing, supervisor relay handoff, Kubernetes ownership behavior, Helm/Skaffold HA wiring, CLI retry hardening, and HA validation.

How it works

The Gateway now exposes a peer Service to let gateway replicas discover and call each other to reach supervisors. When a client request lands on a replica that does not currently own the target sandbox supervisor session, that gateway resolves the owning replica and relays the supervisor traffic to the peer instead of failing the request. Kubernetes lease/reconciler ownership keeps sandbox supervision coordinated as gateway pods scale, roll, or disappear, while the CLI retries transient sync probes during those handoffs.
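
The routing decision described above can be sketched roughly as follows. This is an illustrative model only, not the PR's implementation: `OwnershipTable`, `Route`, the replica IDs, and the sandbox IDs are hypothetical names, and the sketch assumes each replica can look up which peer currently owns a sandbox's supervisor session.

```rust
use std::collections::HashMap;

/// Where a gateway replica should send supervisor traffic for a sandbox.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum Route {
    /// This replica owns the supervisor session; serve the request directly.
    Local,
    /// Another replica owns the session; relay to it via the peer Service.
    RelayToPeer(String),
    /// No owner is currently known, e.g. in the middle of a handoff.
    NoOwner,
}

/// Hypothetical ownership view: sandbox ID -> owning replica ID.
struct OwnershipTable {
    self_id: String,
    owners: HashMap<String, String>,
}

impl OwnershipTable {
    fn new(self_id: &str) -> Self {
        Self {
            self_id: self_id.to_string(),
            owners: HashMap::new(),
        }
    }

    /// Records (or replaces) the replica that owns a sandbox's session.
    fn set_owner(&mut self, sandbox_id: &str, replica_id: &str) {
        self.owners
            .insert(sandbox_id.to_string(), replica_id.to_string());
    }

    /// Decides whether to serve locally, relay to a peer, or report no owner.
    fn route(&self, sandbox_id: &str) -> Route {
        match self.owners.get(sandbox_id) {
            Some(owner) if *owner == self.self_id => Route::Local,
            Some(owner) => Route::RelayToPeer(owner.clone()),
            None => Route::NoOwner,
        }
    }
}
```

In this model, a request that lands on `gateway-0` for a sandbox owned by `gateway-1` yields `Route::RelayToPeer("gateway-1")` rather than an error, and a `NoOwner` result is the case a client-side retry would be expected to cover.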

Related Issue

Closes #1021

Related: #1012, #1429, #1577, #1731, #1488

Changes

  • Adds gateway peer authentication and peer routing for HA supervisor relay handoff.
  • Adds Kubernetes compute lease/reconciler ownership behavior for multi-replica gateways.
  • Adds Helm peer Service/RBAC rendering and Skaffold HA/Envoy dev profile support.
  • Adds Kubernetes HA rebalancing e2e coverage and removes the noisy readyz e2e smoke.
  • Retries CLI sandbox file sync and transient sync probe failures after gateway rollouts.
  • Fixes z3 cross-builds by using Zig archive tools for generated gateway artifacts.
  • Updates architecture and local cluster/debug skills for HA gateway development.
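
As an illustration of the CLI retry hardening listed above, a bounded retry with exponential backoff might look like the sketch below. The names `ProbeError` and `retry_transient`, the split into transient and permanent errors, and the doubling delay are assumptions made for this example; the PR's actual retry policy may differ.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical classification of a sync probe failure.
#[derive(Debug, PartialEq, Eq)]
enum ProbeError {
    /// Likely to clear once a gateway handoff completes (e.g. connection reset).
    Transient(String),
    /// Retrying will not help (e.g. the sandbox does not exist).
    Permanent(String),
}

/// Calls `probe` at least once and retries only transient failures,
/// doubling the delay each time, until `max_attempts` calls have been made.
/// The closure receives the 1-based attempt number.
fn retry_transient<T, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut probe: F,
) -> Result<T, ProbeError>
where
    F: FnMut(u32) -> Result<T, ProbeError>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match probe(attempt) {
            Ok(value) => return Ok(value),
            // Transient failure with attempts remaining: back off and retry.
            Err(ProbeError::Transient(_)) if attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Permanent failure, or attempts exhausted: surface the error.
            Err(err) => return Err(err),
        }
    }
}
```

Bounding the attempts keeps a genuinely broken gateway from hanging the CLI, while the backoff gives a rolling replica time to hand its sessions to a peer.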

Testing

  • cargo fmt --all -- --check
  • mise run helm:lint
  • cargo check -p openshell-server --features test-support
  • mise run pre-commit
  • Local Kubernetes HA validation with Envoy Gateway, external PostgreSQL, and gateway scale/rotation
  • GitHub Branch Checks passed
  • GitHub Helm Lint passed
  • GitHub Kubernetes HA E2E smoke passed with test:e2e-kubernetes

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@TaylorMutch TaylorMutch requested review from a team, derekwaynecarr and mrunalp as code owners June 11, 2026 04:47
@TaylorMutch TaylorMutch added the test:e2e-kubernetes Requires Kubernetes end-to-end coverage label Jun 11, 2026
@github-actions

Label test:e2e-kubernetes applied for ad9f04d. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute Kubernetes HA E2E after building the required gateway and supervisor images once. This is an optional proof-of-life suite; failures are visible in the workflow run but do not publish a required CI gate status.

@TaylorMutch TaylorMutch force-pushed the 1021-ha-gateway-rebalancing/tm branch from 24c1003 to 3e590e6 Compare June 11, 2026 17:35
Comment thread deploy/helm/openshell/skaffold.yaml Outdated
      - op: add
        path: /deploy/helm/releases/0/valuesFiles/-
        value: ci/values-high-availability.yaml
  - name: ha-envoy

Member

When I first read the docs, it was not clear that ha-envoy included the high-availability profile. Could we call this out explicitly (perhaps by renaming the profile), or make the two profiles composable?

Collaborator Author

I'm collapsing the two since they are used together in practice.

@TaylorMutch TaylorMutch force-pushed the 1021-ha-gateway-rebalancing/tm branch 3 times, most recently from fb46193 to a60c79c Compare June 16, 2026 20:59
@TaylorMutch TaylorMutch requested a review from maxamillion as a code owner June 16, 2026 20:59
@TaylorMutch TaylorMutch force-pushed the 1021-ha-gateway-rebalancing/tm branch 2 times, most recently from e93a30d to 493f3da Compare June 23, 2026 19:23
@TaylorMutch TaylorMutch requested a review from elezar June 23, 2026 19:30
@TaylorMutch TaylorMutch force-pushed the 1021-ha-gateway-rebalancing/tm branch from 493f3da to a831deb Compare June 23, 2026 20:39
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@TaylorMutch TaylorMutch force-pushed the 1021-ha-gateway-rebalancing/tm branch from a831deb to 236c8ab Compare June 26, 2026 16:55
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Labels

test:e2e-kubernetes Requires Kubernetes end-to-end coverage

Development

Successfully merging this pull request may close these issues.

feat(k8s, helm): Enable running OpenShell Gateway with multiple replicas

2 participants