Commit b2feacc

fix(bootstrap): stabilize release canary gateway startup (#1210)
1 parent 86b8ffd commit b2feacc

4 files changed

Lines changed: 12 additions & 3 deletions

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 1 addition & 0 deletions
@@ -198,6 +198,7 @@ openshell logs <sandbox-name>
 | Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` |
 | CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
 | Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
+| `K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource` | Optional Gateway API manifest was auto-applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait | Confirm the cluster image only bundles core manifests; apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only when `grpcRoute` is enabled |

 ## Reporting

crates/openshell-bootstrap/src/lib.rs

Lines changed: 3 additions & 1 deletion
@@ -1189,7 +1189,9 @@ async fn wait_for_namespace(
 ) -> Result<()> {
     use miette::WrapErr;

-    let attempts = 60;
+    // Shared CPU runners can take several minutes to cold-start k3s, apply
+    // bundled manifests, and let the k3s Helm controller create the namespace.
+    let attempts = 150;
     let max_backoff = std::time::Duration::from_secs(2);
     let mut backoff = std::time::Duration::from_millis(200);
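
The hunk shows only the attempt count, the initial delay, and the cap, not how the delay grows between polls. As a rough check on the "2 minutes" and "about 5 minutes" figures quoted in `doctor_llm_prompt.md`, here is a minimal sketch that assumes the delay doubles after each failed poll until it reaches `max_backoff`. The `worst_case_wait` helper is hypothetical and not part of this commit, and the time taken by each poll itself is ignored.

```rust
use std::time::Duration;

/// Hypothetical helper, not part of the commit: total time spent sleeping
/// across `attempts` polls, ASSUMING the delay starts at `initial`, doubles
/// after every failed poll, and is capped at `max_backoff`.
fn worst_case_wait(attempts: u32, initial: Duration, max_backoff: Duration) -> Duration {
    let mut total = Duration::ZERO;
    let mut backoff = initial;
    for _ in 0..attempts {
        total += backoff;
        // Assumed growth rule: double the delay, never exceeding the cap.
        backoff = (backoff * 2).min(max_backoff);
    }
    total
}

fn main() {
    // Values taken from the hunk above.
    let initial = Duration::from_millis(200);
    let max_backoff = Duration::from_secs(2);

    // 60 is the old attempt count, 150 the new one.
    for attempts in [60, 150] {
        let wait = worst_case_wait(attempts, initial, max_backoff);
        println!("{attempts} attempts: about {:.0} s", wait.as_secs_f64());
    }
}
```

Under that assumption the sleep budget grows from 115 s (just under 2 minutes) to 295 s (just under 5 minutes), which matches the wording change in the prompt file. If the real loop grows the delay differently, the totals will differ.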

crates/openshell-cli/src/doctor_llm_prompt.md

Lines changed: 1 addition & 1 deletion
@@ -225,7 +225,7 @@ openshell doctor exec -- kubectl -n openshell get secret openshell-client-tls -o

 Common mTLS issues:

-- **Secrets missing**: The `openshell` namespace may not have been created yet (Helm controller race). Bootstrap waits up to 2 minutes for the namespace.
+- **Secrets missing**: The `openshell` namespace may not have been created yet (Helm controller race). Bootstrap waits up to about 5 minutes for the namespace.
 - **mTLS mismatch after manual secret deletion**: Delete all three secrets and redeploy — bootstrap will regenerate and restart the workload.
 - **CLI can't connect after redeploy**: Check that `~/.config/openshell/gateways/<name>/mtls/` contains `ca.crt`, `tls.crt`, `tls.key` and that they were updated at deploy time.
 - **Local mTLS files missing**: The gateway was deployed but CLI credentials weren't persisted (e.g., interrupted deploy). Extract from the cluster secret as shown above.

deploy/docker/Dockerfile.images

Lines changed: 7 additions & 1 deletion
@@ -202,7 +202,13 @@ COPY deploy/docker/cluster-healthcheck.sh /usr/local/bin/cluster-healthcheck.sh
 RUN chmod +x /usr/local/bin/cluster-healthcheck.sh

 COPY deploy/docker/.build/charts/*.tgz /opt/openshell/charts/
-COPY deploy/kube/manifests/*.yaml /opt/openshell/manifests/
+# Only the core k3s auto-deploy manifests belong in the cluster image.
+# Gateway API routing is optional and requires Envoy Gateway CRDs, so
+# deploy/kube/manifests/envoy-gateway-openshell.yaml stays repo-local and is
+# applied manually by `mise run helm:gateway:apply` when grpcRoute is enabled.
+COPY deploy/kube/manifests/openshell-helmchart.yaml \
+    deploy/kube/manifests/agent-sandbox.yaml \
+    /opt/openshell/manifests/
 COPY deploy/kube/gpu-manifests/*.yaml /opt/openshell/gpu-manifests/

 ENTRYPOINT ["/usr/local/bin/cluster-entrypoint.sh"]
