Commit d44d8a1

feat: Openshell driver podman (#904)
* feat(podman): add Podman compute driver for rootless sandbox management

Adds openshell-driver-podman, a new compute driver that manages OpenShell sandboxes as rootless Podman containers via the Podman REST API over a Unix socket. Enables local workstation sandboxes without Kubernetes.

Driver features:
- Bridge networking with ephemeral host-port mapping for rootless SSH reachability
- Named volumes for workspace storage, Podman native health checks, GPU via CDI
- Supervisor binary sideloaded via image volume mount (BYOC-compatible)
- SSH handshake secret injected via Podman secrets API (not plaintext env)
- Typed ContainerSpec structs, input validation, and path-traversal guards
- Cgroups v2 required; fails fast on v1 hosts
- Bounded event stream buffer; watch stream reconnection handled by server watch_loop
- Graceful shutdown and standalone driver binary with gRPC bridge

Rootless-specific fixes:
- Skip drop_privileges when user namespace lacks SETUID/SETGID/DAC_READ_SEARCH caps
- Add /run/netns tmpfs mount for ip netns in rootless containers
- Use secret_env map (not secrets array) for env-var injection in libpod API
- Resolve SSH endpoint to 127.0.0.1:<host_port> instead of unreachable bridge IP

Server/sandbox hardening:
- Split loopback and link-local SSRF gates; Podman/VM drivers allow loopback
- Close SSRF bypass in SSH tunnel Host path by resolving DNS before connecting
- Prevent OPENSHELL_* env var override by user-supplied spec environment maps
- Disable SQLite pool idle_timeout/max_lifetime for in-memory databases
- Emit deleted_event on 404-during-inspect instead of regressing sandbox phase
- Key delete cleanup by stable sandbox_id to survive container label drift

CLI fixes:
- Restore --name as a named flag on sandbox create (not positional)
- Fix exec command arg parsing to not consume sandboxed-command flags
- Propagate SSH verbosity via OPENSHELL_SSH_LOG_LEVEL

Build tooling:
- Add tasks/scripts/container-engine.sh: auto-detects Podman or Docker, exposes unified ce_* helpers; all build/cluster/VM scripts updated to use it
- Add docker:build:supervisor mise task for standalone supervisor image
- Add openshell-driver-podman to Dockerfile.images pre-fetch/build stages
- Add e2e/rust/e2e-podman.sh and e2e:podman mise task for full lifecycle testing

Signed-off-by: Adam Miller <admiller@redhat.com>

* fix(driver-podman): derive grpc endpoint from server bind port

When a user starts the gateway on a non-default port (e.g. --port 8081), sandbox containers were receiving OPENSHELL_ENDPOINT pointing at the default port 8080. The driver's auto-detection fallback read OPENSHELL_BIND_ADDRESS from the environment, which was stale or unset, and fell back to DEFAULT_SERVER_PORT.

Add gateway_port to PodmanComputeConfig and thread config.bind_address.port() from the server into the driver so the fallback uses the actual listening port. Remove the OPENSHELL_BIND_ADDRESS env var read and the extract_port_from_bind_address helper which are no longer needed.

Add --gateway-port / OPENSHELL_GATEWAY_PORT to the standalone driver binary for parity when the driver is run outside the embedded server path.

Signed-off-by: Adam Miller <admiller@redhat.com>

* fix(driver-podman): address PR feedback on env test safety and cluster DNS docs

Replace hand-rolled unsafe TempEnvVar RAII guard with temp_env::with_vars and a static ENV_LOCK mutex, fixing a data race in parallel test execution. The prior safety comment incorrectly claimed Cargo runs tests single-threaded.

Update debug-openshell-cluster skill to accurately document the DNS proxy strategy (setup_dns_proxy + public DNS fallback) and clarify the separation between cluster DNS and sandbox agent DNS enforcement.

Signed-off-by: Adam Miller <admiller@redhat.com>

* fix(e2e): resolve CI failures in auth timeout, test harness, and formatting

- Short-circuit browser_auth_flow when OPENSHELL_NO_BROWSER=1 instead of waiting the full 120s AUTH_TIMEOUT for a callback that never arrives
- Add timeout to SandboxGuard::create() and create_with_upload() to prevent indefinite hangs (matches create_keep() which already had one)
- Add missing '--' separator in no_proxy test before command args
- Add #![cfg(feature = "e2e")] gate to sandbox_lifecycle.rs
- Run cargo fmt on openshell-driver-podman
- Refine cluster DNS docs for Podman in debug-openshell-cluster skill

Signed-off-by: Adam Miller <admiller@redhat.com>

* refactor(server): remove allows_loopback_endpoints from ComputeRuntime

SSRF protection is now handled at the network and proxy layers (openshell-core net.rs, openshell-sandbox proxy.rs) rather than requiring per-driver flags on ComputeRuntime. Update architecture docs to reflect supervisor relay SSH transport and add rootless networking deep-dive.

Signed-off-by: Adam Miller <admiller@redhat.com>

---------

Signed-off-by: Adam Miller <admiller@redhat.com>
1 parent 77a88c3 commit d44d8a1

50 files changed

Lines changed: 5873 additions & 391 deletions

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 30 additions & 15 deletions
@@ -7,16 +7,16 @@ description: Debug why a openshell cluster failed to start or is unhealthy. Use
77

88
Diagnose why a openshell cluster failed to start after `openshell gateway start`.
99

10-
Use **only** `openshell` CLI commands (`openshell status`, `openshell doctor logs`, `openshell doctor exec`) to inspect and fix the cluster. Do **not** use raw `docker`, `ssh`, or `kubectl` commands directly — always go through the `openshell doctor` interface. The CLI auto-resolves local vs remote gateways, so the same commands work everywhere.
10+
Use **only** `openshell` CLI commands (`openshell status`, `openshell doctor logs`, `openshell doctor exec`) to inspect and fix the cluster. Do **not** use raw `docker`, `podman`, `ssh`, or `kubectl` commands directly — always go through the `openshell doctor` interface. The CLI auto-resolves local vs remote gateways, so the same commands work everywhere.
1111

1212
## Overview
1313

14-
`openshell gateway start` creates a Docker container running k3s with the OpenShell server deployed via Helm. The deployment stages, in order, are:
14+
`openshell gateway start` creates a container (via Docker or Podman) running k3s with the OpenShell server deployed via Helm. The build and deploy scripts use a container-engine abstraction layer (`tasks/scripts/container-engine.sh`) that auto-detects Docker or Podman and provides a unified `ce` command interface. Set `CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to override auto-detection. The deployment stages, in order, are:
1515

1616
1. **Pre-deploy check**: `openshell gateway start` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
1717
2. Ensure cluster image is available (local build or remote pull)
18-
3. Create Docker network (`openshell-cluster`) and volume (`openshell-cluster-{name}`)
19-
4. Create and start a privileged Docker container (`openshell-cluster-{name}`)
18+
3. Create container network (`openshell-cluster`) and volume (`openshell-cluster-{name}`)
19+
4. Create and start a privileged container (`openshell-cluster-{name}`)
2020
5. Wait for k3s to generate kubeconfig (up to 60s)
2121
6. **Clean stale nodes**: Remove any `NotReady` k3s nodes left over from previous container instances that reused the same persistent volume
2222
7. **Prepare local images** (if `OPENSHELL_PUSH_IMAGES` is set): In `internal` registry mode, bootstrap waits for the in-cluster registry and pushes tagged images there. In `external` mode, bootstrap uses legacy `ctr -n k8s.io images import` push-mode behavior.
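
The engine selection mentioned in the overview can be pictured with a short sketch. This is not the contents of `tasks/scripts/container-engine.sh`: the function name `detect_container_engine` and the Podman-before-Docker probe order are assumptions made for illustration. Only the `CONTAINER_ENGINE` override and its two accepted values come from the text above.

```bash
# Hypothetical sketch of container-engine auto-detection.
# From the docs: CONTAINER_ENGINE=docker|podman overrides auto-detection.
# Assumed: the function name and the order in which engines are probed.
detect_container_engine() {
  if [ -n "${CONTAINER_ENGINE:-}" ]; then
    case "$CONTAINER_ENGINE" in
      docker|podman)
        printf '%s\n' "$CONTAINER_ENGINE"
        return 0
        ;;
      *)
        echo "unsupported CONTAINER_ENGINE: $CONTAINER_ENGINE" >&2
        return 2
        ;;
    esac
  fi

  local engine
  for engine in podman docker; do
    # command -v succeeds only if the binary is on PATH
    if command -v "$engine" >/dev/null 2>&1; then
      printf '%s\n' "$engine"
      return 0
    fi
  done

  echo "no container engine found (install docker or podman)" >&2
  return 1
}
```

Rejecting unknown override values early keeps a typo in `CONTAINER_ENGINE` from silently falling through to auto-detection.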
@@ -28,10 +28,11 @@ Use **only** `openshell` CLI commands (`openshell status`, `openshell doctor log
2828
- TLS secrets `openshell-server-tls` and `openshell-client-tls` exist in `openshell` namespace
2929
- Sandbox supervisor binary exists at `/opt/openshell/bin/openshell-sandbox` (emits `HEALTHCHECK_MISSING_SUPERVISOR` marker if absent)
3030

31-
For local deploys, metadata endpoint selection now depends on Docker connectivity:
31+
For local deploys, metadata endpoint selection depends on the container engine and its connectivity:
3232

3333
- default local Docker socket (`unix:///var/run/docker.sock`): `https://127.0.0.1:{port}` (default port 8080)
3434
- TCP Docker daemon (`DOCKER_HOST=tcp://<host>:<port>`): `https://<host>:{port}` for non-loopback hosts
35+
- Podman (rootless): `https://127.0.0.1:{port}` via `host.containers.internal` (the Podman equivalent of `host.docker.internal`)
3536

3637
The host port is configurable via `--port` on `openshell gateway start` (default 8080) and is stored in `ClusterMetadata.gateway_port`.
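
The endpoint rules listed above can be expressed as a small lookup. This is a sketch under stated assumptions: `metadata_endpoint` is a hypothetical helper, not a function in the OpenShell codebase, and mapping a loopback TCP `DOCKER_HOST` back to `127.0.0.1` is a guess, since the text only specifies the non-loopback case.

```bash
# Hypothetical helper mirroring the endpoint rules listed above.
# Usage: metadata_endpoint <engine> <docker_host> <port>
# IPv6 literals in DOCKER_HOST are not handled by this sketch.
metadata_endpoint() {
  local engine="$1" docker_host="${2:-}" port="${3:-8080}"
  local host="127.0.0.1" candidate

  if [ "$engine" = "docker" ]; then
    case "$docker_host" in
      tcp://*)
        candidate="${docker_host#tcp://}"   # strip the scheme
        candidate="${candidate%%[:/]*}"     # strip :port and any path
        case "$candidate" in
          ""|localhost|127.*) ;;            # loopback: keep 127.0.0.1 (assumed)
          *) host="$candidate" ;;           # non-loopback TCP daemon host
        esac
        ;;
    esac
  fi

  # Local Docker socket and rootless Podman both resolve to loopback.
  printf 'https://%s:%s\n' "$host" "$port"
}
```

For example, `metadata_endpoint docker "tcp://10.0.0.5:2376" 8080` prints `https://10.0.0.5:8080`, while any Podman invocation prints a `127.0.0.1` endpoint.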
3738

@@ -41,7 +42,8 @@ The default cluster name is `openshell`. The container is `openshell-cluster-{na
4142

4243
## Prerequisites
4344

44-
- Docker must be running (locally or on the remote host)
45+
- Docker or Podman must be running (locally or on the remote host). The build system auto-detects which engine is available; set `CONTAINER_ENGINE=docker` or `CONTAINER_ENGINE=podman` to override.
46+
- For rootless Podman: ensure the Podman socket is active (`systemctl --user start podman.socket`) and subuid/subgid ranges are configured (`sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)`)
4547
- The `openshell` CLI must be available
4648
- For remote clusters: SSH access to the remote host
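
The rootless Podman prerequisites above can be checked before starting the gateway. The sketch below is illustrative only: both function names are hypothetical, and the socket path assumes the usual rootless layout under `$XDG_RUNTIME_DIR`. It matches subid entries by user name, so hosts that key `/etc/subuid` by numeric UID would need an extra check.

```bash
# Hypothetical preflight helper: does <user> have a non-empty range in <file>?
# Entries in /etc/subuid and /etc/subgid have the form  name:start:count
has_subid_range() {
  local user="$1" file="$2"
  [ -r "$file" ] || return 1
  awk -F: -v u="$user" '$1 == u && $3 > 0 { found = 1 } END { exit !found }' "$file"
}

# Hypothetical wrapper that reports every missing prerequisite at once.
rootless_podman_preflight() {
  local user="${1:-$(id -un)}" failed=0
  has_subid_range "$user" /etc/subuid || { echo "no subuid range for $user" >&2; failed=1; }
  has_subid_range "$user" /etc/subgid || { echo "no subgid range for $user" >&2; failed=1; }

  # Assumed default location of the rootless API socket.
  local sock="${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/podman/podman.sock"
  if [ ! -S "$sock" ]; then
    echo "podman socket not found at $sock; try: systemctl --user start podman.socket" >&2
    failed=1
  fi
  return "$failed"
}
```

Running `rootless_podman_preflight` with no arguments checks the current user and returns non-zero if any prerequisite is missing.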
4749

@@ -300,10 +302,10 @@ Common issues:
300302
DNS misconfiguration is a common root cause, especially on remote/Linux hosts:
301303

302304
```bash
303-
# Check the resolv.conf k3s is using
305+
# Check the resolv.conf k3s is using for pod DNS
304306
openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf
305307

306-
# Test DNS resolution from inside the container
308+
# Test DNS from inside the container
307309
openshell doctor exec -- sh -c 'nslookup google.com || wget -q -O /dev/null http://google.com && echo "network ok" || echo "network unreachable"'
308310
```
309311

@@ -313,23 +315,31 @@ Check the entrypoint's DNS decision in the container logs:
313315
openshell doctor logs --lines 20
314316
```
315317

316-
The entrypoint script selects DNS resolvers in this priority:
318+
The entrypoint (`cluster-entrypoint.sh`) sets k3s pod DNS via a single strategy with one fallback:
317319

318-
1. Viable nameservers from `/etc/resolv.conf` (not loopback/link-local)
319-
2. Docker `ExtServers` from `/etc/resolv.conf` comments
320-
3. Host gateway IP (Docker Desktop only, `192.168.*`)
321-
4. Fallback to `8.8.8.8` / `8.8.4.4`
320+
1. **Docker DNS proxy** (`setup_dns_proxy()`): reads the `DOCKER_OUTPUT` iptables chain to discover where Docker's embedded DNS (`127.0.0.11`) is actually listening, installs DNAT rules so k3s pods can reach it via the container's `eth0` IP, and writes `nameserver <eth0-ip>` to `/etc/rancher/k3s/resolv.conf`. On success logs: `Setting up DNS proxy: <ip>:53 -> 127.0.0.11`.
321+
2. **Public DNS fallback**: if `setup_dns_proxy()` fails for any reason, logs `Warning: Could not discover Docker DNS ports from iptables` and writes `nameserver 8.8.8.8` / `nameserver 8.8.4.4` to `/etc/rancher/k3s/resolv.conf`.
322+
323+
After either path, `verify_dns()` runs `nslookup` (5 retries) against the configured registry host (default `ghcr.io`). On failure it emits `DNS_PROBE_FAILED` into the logs. The Rust-side bootstrap (`runtime.rs` / `lib.rs`) watches for this marker and aborts early rather than spinning for the full 6-minute timeout.
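
The retry-then-marker behavior described above can be pictured as follows. This is a hypothetical re-creation, not the code in `cluster-entrypoint.sh`: the function name `verify_dns_sketch` and the `LOOKUP_CMD` and `RETRY_DELAY` knobs are invented for illustration, and the exact log line around the marker is not shown in the source.

```bash
# Hypothetical re-creation of the retry-then-marker behavior.
# LOOKUP_CMD and RETRY_DELAY are illustrative knobs, not entrypoint settings.
verify_dns_sketch() {
  local host="${1:-ghcr.io}" retries="${2:-5}" attempt=1

  while [ "$attempt" -le "$retries" ]; do
    # Any successful lookup ends the probe immediately.
    if ${LOOKUP_CMD:-nslookup} "$host" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"
  done

  # Marker the Rust-side bootstrap watches for, per the text above.
  echo "DNS_PROBE_FAILED"
  return 1
}
```

Because the lookup uses whatever resolver the container's system `/etc/resolv.conf` names, a probe of this shape says nothing about the nameservers written to `/etc/rancher/k3s/resolv.conf`.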
324+
325+
**Important:** there are two independent DNS paths inside the cluster container. The entrypoint only writes `/etc/rancher/k3s/resolv.conf` (pod DNS). The container's system `/etc/resolv.conf` (used by containerd for image pulls and by `nslookup`) is set by Docker or Podman at container start and is never touched by the entrypoint. These can point at different nameservers.
326+
327+
**Under Podman:** `setup_dns_proxy()` always fails — Podman does not create a `DOCKER_OUTPUT` chain. k3s pod DNS always falls back to `8.8.8.8`/`8.8.4.4`. The cluster container runs on a named Podman bridge network, which uses **netavark + aardvark-dns**. Aardvark-dns listens on the bridge gateway IP (e.g. `10.89.x.1`) and forwards external queries to the host resolver. Podman sets the container's system `/etc/resolv.conf` to that address — so `nslookup ghcr.io` works fine even when `8.8.8.8` is blocked. This means **`DNS_PROBE_FAILED` is never emitted under Podman** even when pod-level DNS is broken: the entrypoint's `verify_dns()` and the Rust-side `probe_container_dns()` both call bare `nslookup`, which hits aardvark-dns via the system resolv.conf, not the k3s resolv.conf. Pod DNS failures surface later as CoreDNS upstream forwarding timeouts, not as an early bootstrap abort.
328+
329+
To debug Podman pod DNS failures: check `/etc/rancher/k3s/resolv.conf` confirms `8.8.8.8` is there, then verify `8.8.8.8:53` UDP is reachable from the host with `nc -vzu 8.8.8.8 53`.
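
The first half of that check can be scripted. The helper below is a sketch with a hypothetical name (`uses_public_dns_fallback`); it only inspects resolv.conf text passed on stdin and succeeds when every `nameserver` entry is one of the two fallback resolvers named above.

```bash
# Reads resolv.conf content on stdin. Succeeds only if at least one
# nameserver is listed and all of them are 8.8.8.8 or 8.8.4.4.
uses_public_dns_fallback() {
  awk '
    $1 == "nameserver" {
      seen = 1
      if ($2 != "8.8.8.8" && $2 != "8.8.4.4") other = 1
    }
    END { exit !(seen && !other) }
  '
}

# Usage, combining the commands from the guidance above:
#   openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf \
#     | uses_public_dns_fallback && nc -vzu 8.8.8.8 53
```

Note that a UDP probe with `nc` can report success without a resolver actually answering, so treat a passing result as a hint rather than proof that pod DNS works.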
322330

323331
If DNS is broken, all image pulls from the distribution registry will fail, as will pods that need external network access.
324332

333+
**Sandbox agent DNS is a separate enforcement layer.** The cluster DNS above controls what k3s pods and containerd can resolve. It has no bearing on what agent workloads inside sandboxes can reach. The sandbox supervisor (`openshell-sandbox`) creates an isolated Linux network namespace for each agent process with a veth pair, then installs iptables rules inside that namespace that REJECT all outbound UDP — including port 53 — except traffic destined for the supervisor's CONNECT proxy. Agent workloads cannot make raw DNS queries regardless of what nameservers are configured anywhere in the cluster. DNS must go through the HTTP CONNECT proxy. This is a kernel-enforced boundary at the netns level, not a configuration setting. The bypass monitor detects and logs any direct DNS attempt with the hint: `"DNS queries should route through the sandbox proxy; check resolver configuration"`.
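
To make the shape of that boundary concrete, the sketch below prints (and does not apply) a pair of rules of the kind described. Everything specific here is an assumption: the function name, the proxy address and port, and the rule order are illustrative, and the supervisor's real rule set is not shown in this document.

```bash
# Prints an illustrative rule set for an agent network namespace.
# Hypothetical: proxy address/port and exact rules are assumptions; the
# supervisor's real rules may differ. Nothing is applied to the host.
print_agent_netns_rules() {
  local proxy_ip="$1" proxy_port="$2"
  cat <<EOF
iptables -A OUTPUT -p tcp -d ${proxy_ip} --dport ${proxy_port} -j ACCEPT
iptables -A OUTPUT -p udp -j REJECT
EOF
}
```

With rules of this shape, a raw query to any resolver on UDP port 53 is rejected inside the namespace, which is why changing nameserver configuration elsewhere in the cluster has no effect on agent workloads.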
334+
325335
## Common Failure Patterns
326336

327337
| Symptom | Likely Cause | Fix |
328338
|---------|-------------|-----|
329339
| `tls handshake eof` from `openshell status` | Server not running or mTLS credentials missing/mismatched | Check StatefulSet replicas (Step 3) and mTLS files (Step 6) |
330340
| StatefulSet `0/0` replicas | StatefulSet scaled to zero (failed deploy, manual scale-down, or Helm misconfiguration) | `openshell doctor exec -- kubectl -n openshell scale statefulset openshell --replicas=1` |
331341
| Local mTLS files missing | Deploy was interrupted before credentials were persisted | Extract from cluster secret `openshell-client-tls` (Step 6) |
332-
| Container not found | Image not built | `mise run docker:build:cluster` (local) or re-deploy (remote) |
342+
| Container not found | Image not built | `mise run docker:build:cluster` (local, works with both Docker and Podman) or re-deploy (remote) |
333343
| Container exited, OOMKilled | Insufficient memory | Increase host memory or reduce workload |
334344
| Container exited, non-zero exit | k3s crash, port conflict, privilege issue | Check `openshell doctor logs` for details |
335345
| `/readyz` fails | k3s still starting or crashed | Wait longer or check container logs for k3s errors |
@@ -343,10 +353,15 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
343353
| mTLS mismatch after redeploy | PKI rotated but workload not restarted, or rollout failed | Check that all three TLS secrets exist and that the openshell pod restarted after cert rotation (Step 6) |
344354
| Helm install job failed | Chart values error or dependency issue | `openshell doctor exec -- kubectl -n kube-system logs -l job-name=helm-install-openshell` |
345355
| NFD/GFD DaemonSets present (`node-feature-discovery`, `gpu-feature-discovery`) | Cluster was deployed before NFD/GFD were disabled (pre-simplify-device-plugin change) | These are harmless but add overhead. Clean up: `openshell doctor exec -- kubectl delete daemonset -n nvidia-device-plugin -l app.kubernetes.io/name=node-feature-discovery` and similarly for GFD. The `nvidia.com/gpu.present` node label is no longer applied; device plugin scheduling no longer requires it. |
356+
| Podman socket not found | Rootless Podman service not started | `systemctl --user start podman.socket` and verify with `podman info` |
357+
| Container creation fails with subuid/subgid error (Podman) | Missing user namespace ID mappings | `sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 $(whoami)` then `podman system migrate` |
358+
| Cgroups v1 detected (Podman) | Podman driver requires unified cgroup hierarchy | Set `systemd.unified_cgroup_hierarchy=1` kernel parameter and reboot |
359+
| `--restart=always` ignored (Podman rootless) | Rootless Podman does not support `--restart=always` for containers | Use a systemd user service instead: `loginctl enable-linger $(whoami)` then create a `~/.config/systemd/user/` unit |
346360
| Architecture mismatch (remote) | Built on arm64, deploying to amd64 | Cross-build the image for the target architecture |
347361
| Port conflict | Another service on the configured gateway host port (default 8080) | Stop conflicting service or use `--port` on `openshell gateway start` to pick a different host port |
348362
| gRPC connect refused to `127.0.0.1:443` in CI | Docker daemon is remote (`DOCKER_HOST=tcp://...`) but metadata still points to loopback | Verify metadata endpoint host matches `DOCKER_HOST` and includes non-loopback host |
349-
| DNS failures inside container | Entrypoint DNS detection failed | `openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf` and `openshell doctor logs --lines 20` |
363+
| DNS failures inside container (Docker) | `setup_dns_proxy()` failed to find `DOCKER_OUTPUT` iptables chain | `openshell doctor logs --lines 20` for `Warning: Could not discover Docker DNS ports`; try `docker network prune -f` and redeploy |
364+
| Pod external name resolution fails (Podman) | `setup_dns_proxy()` always fails under Podman; k3s resolv.conf falls back to `8.8.8.8`/`8.8.4.4`, which is blocked on this network | `DNS_PROBE_FAILED` will NOT appear — entrypoint and Rust-side probes resolve via aardvark-dns (system `/etc/resolv.conf`), not the k3s resolv.conf; check `openshell doctor exec -- cat /etc/rancher/k3s/resolv.conf` to confirm fallback; verify `8.8.8.8:53` UDP reachable from host via `nc -vzu 8.8.8.8 53` |
350365
| Pods can't reach kube-dns / ClusterIP services | `br_netfilter` not loaded; bridge traffic bypasses iptables DNAT rules | `sudo modprobe br_netfilter` on the host, then `echo br_netfilter \| sudo tee /etc/modules-load.d/br_netfilter.conf` to persist. Known to be required on Jetson Linux 5.15-tegra; other kernels (e.g. standard x86/aarch64 Linux) may have bridge netfilter built in and work without the module. The entrypoint logs a warning when `/proc/sys/net/bridge/bridge-nf-call-iptables` is absent but does not abort — only act on it if DNS or service connectivity is actually broken. |
351366
| Node DiskPressure / MemoryPressure / PIDPressure | Insufficient disk, memory, or PIDs on host | Free disk (`docker system prune -a --volumes`), increase memory, or expand host resources. Bootstrap auto-detects via `HEALTHCHECK_NODE_PRESSURE` marker |
352367
| Pods evicted with "The node had condition: [DiskPressure]" | Host disk full, kubelet evicting pods | Free disk space on host, then `openshell gateway destroy <name> && openshell gateway start` |
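
For the `Cgroups v1 detected (Podman)` row above, a host-side check can confirm which hierarchy is mounted before changing kernel parameters. How the driver itself detects cgroups v1 is not shown here; the sketch below uses the common approach of inspecting the filesystem type of `/sys/fs/cgroup` with GNU `stat`, and both function names are hypothetical.

```bash
# Map the filesystem type of /sys/fs/cgroup to a cgroup mode.
classify_cgroup_fs() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy or hybrid layout
    *)         echo "unknown" ;;  # stat failed or unexpected type
  esac
}

# Report the cgroup mode of the current host (requires GNU stat).
cgroup_mode() {
  classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}
```

If `cgroup_mode` prints `v1`, the `systemd.unified_cgroup_hierarchy=1` fix from the table applies.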
