Commit d4055ca

feat(credentials): add provider credential storage drivers
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
1 parent 8e831f3 commit d4055ca

61 files changed

Lines changed: 8612 additions & 89 deletions

.agents/skills/debug-openshell-cluster/SKILL.md

Lines changed: 9 additions & 0 deletions
@@ -143,6 +143,15 @@ release. Look for failed installs, unexpected values, missing namespace, wrong
 image tag, TLS settings that do not match the registered endpoint, and
 scheduling failures.
 
+When no external credential driver is enabled, the Helm chart uses the
+gateway's default encrypted database credential storage. The chart creates a
+retained Kubernetes Secret for the shared KEK, injects it into gateway pods, and
+stores encrypted credential envelopes in the OpenShell database. For
+`workload.kind=deployment` or multi-replica gateways, confirm
+`server.externalDbSecret` points at a shared database. A render/install error
+mentioning `server.credentialDrivers` means the values selected multiple
+external credential backends.
+
 For HA or PostgreSQL-backed installs, also check the external database Secret
 referenced by `server.externalDbSecret` and the PostgreSQL workload if the test
 or operator deployed one in-cluster:

.github/workflows/branch-e2e.yml

Lines changed: 45 additions & 2 deletions
@@ -24,6 +24,7 @@ jobs:
 run_core_e2e: ${{ steps.labels.outputs.run_core_e2e }}
 run_gpu_e2e: ${{ steps.labels.outputs.run_gpu_e2e }}
 run_kubernetes_ha_e2e: ${{ steps.labels.outputs.run_kubernetes_ha_e2e }}
+run_kubernetes_credential_drivers_e2e: ${{ steps.labels.outputs.run_kubernetes_credential_drivers_e2e }}
 run_any_e2e: ${{ steps.labels.outputs.run_any_e2e }}
 steps:
 - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7.0.0
@@ -41,12 +42,14 @@ jobs:
 run_core_e2e=true
 run_gpu_e2e=true
 run_kubernetes_ha_e2e=true
+run_kubernetes_credential_drivers_e2e=true
 else
 run_core_e2e="$(jq -r 'index("test:e2e") != null' <<< "$LABELS_JSON")"
 run_gpu_e2e="$(jq -r 'index("test:e2e-gpu") != null' <<< "$LABELS_JSON")"
 run_kubernetes_ha_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
+run_kubernetes_credential_drivers_e2e="$(jq -r 'index("test:e2e-kubernetes") != null' <<< "$LABELS_JSON")"
 fi
-if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ] || [ "$run_kubernetes_ha_e2e" = "true" ]; then
+if [ "$run_core_e2e" = "true" ] || [ "$run_gpu_e2e" = "true" ] || [ "$run_kubernetes_ha_e2e" = "true" ] || [ "$run_kubernetes_credential_drivers_e2e" = "true" ]; then
 run_any_e2e=true
 else
 run_any_e2e=false
@@ -55,12 +58,13 @@ jobs:
 echo "run_core_e2e=$run_core_e2e"
 echo "run_gpu_e2e=$run_gpu_e2e"
 echo "run_kubernetes_ha_e2e=$run_kubernetes_ha_e2e"
+echo "run_kubernetes_credential_drivers_e2e=$run_kubernetes_credential_drivers_e2e"
 echo "run_any_e2e=$run_any_e2e"
 } >> "$GITHUB_OUTPUT"
 
 build-gateway:
 needs: [pr_metadata]
-if: needs.pr_metadata.outputs.should_run == 'true' && (needs.pr_metadata.outputs.run_core_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true')
+if: needs.pr_metadata.outputs.should_run == 'true' && (needs.pr_metadata.outputs.run_core_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_ha_e2e == 'true' || needs.pr_metadata.outputs.run_kubernetes_credential_drivers_e2e == 'true')
 permissions:
 contents: read
 packages: write
@@ -135,6 +139,18 @@ jobs:
 extra-helm-values: deploy/helm/openshell/ci/values-high-availability.yaml
 external-postgres-secret: openshell-ha-pg
 
+kubernetes-credential-drivers-e2e:
+needs: [pr_metadata, build-gateway, build-supervisor]
+if: needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_kubernetes_credential_drivers_e2e == 'true'
+permissions:
+contents: read
+packages: read
+uses: ./.github/workflows/e2e-kubernetes-test.yml
+with:
+image-tag: ${{ github.sha }}
+job-name: Kubernetes Credential Drivers E2E
+e2e-task: e2e:kubernetes:credential-drivers
+
 core-e2e-result:
 name: Core E2E result
 needs: [pr_metadata, build-gateway, build-supervisor, e2e, kubernetes-e2e]
@@ -215,3 +231,30 @@ jobs:
 fi
 done
 exit "$failed"
+
+kubernetes-credential-drivers-e2e-result:
+name: Kubernetes Credential Drivers E2E result
+needs: [pr_metadata, build-gateway, build-supervisor, kubernetes-credential-drivers-e2e]
+if: always() && needs.pr_metadata.outputs.should_run == 'true' && needs.pr_metadata.outputs.run_kubernetes_credential_drivers_e2e == 'true'
+runs-on: ubuntu-latest
+steps:
+- name: Verify Kubernetes credential drivers E2E jobs
+env:
+BUILD_GATEWAY_RESULT: ${{ needs.build-gateway.result }}
+BUILD_SUPERVISOR_RESULT: ${{ needs.build-supervisor.result }}
+KUBERNETES_CREDENTIAL_DRIVERS_E2E_RESULT: ${{ needs.kubernetes-credential-drivers-e2e.result }}
+run: |
+set -euo pipefail
+failed=0
+for item in \
+"build-gateway:$BUILD_GATEWAY_RESULT" \
+"build-supervisor:$BUILD_SUPERVISOR_RESULT" \
+"kubernetes-credential-drivers-e2e:$KUBERNETES_CREDENTIAL_DRIVERS_E2E_RESULT"; do
+name="${item%%:*}"
+result="${item#*:}"
+if [ "$result" != "success" ]; then
+echo "::error::$name concluded $result"
+failed=1
+fi
+done
+exit "$failed"
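
The result job above treats any upstream conclusion other than `success` as a failure. A minimal standalone sketch of that loop follows, assuming a hypothetical helper named `verify_results` that takes `name:result` pairs as arguments; the workflow itself inlines the loop and reads the results from environment variables.

```shell
# Hypothetical helper (not part of the commit): takes "name:result" pairs and
# returns 1 if any result is not "success", mirroring the workflow's loop.
verify_results() {
  local failed=0 item name result
  for item in "$@"; do
    name="${item%%:*}"    # text before the first colon
    result="${item#*:}"   # text after the first colon
    if [ "$result" != "success" ]; then
      echo "::error::$name concluded $result"
      failed=1
    fi
  done
  return "$failed"
}
```

Because the comparison is against `success` only, a `skipped` or `cancelled` upstream job also fails the result job under this logic.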

.github/workflows/e2e-kubernetes-test.yml

Lines changed: 8 additions & 2 deletions
@@ -32,6 +32,11 @@ on:
 required: false
 type: string
 default: ""
+e2e-task:
+description: "mise task to run for the Kubernetes e2e job"
+required: false
+type: string
+default: "e2e:kubernetes"
 mise-version:
 description: "mise version to install on the bare Kubernetes e2e runner"
 required: false
@@ -112,11 +117,12 @@ jobs:
 kind load image-archive "$archive" --name "$KIND_CLUSTER_NAME"
 done
 
-- name: Run Kubernetes E2E (Rust smoke)
+- name: Run Kubernetes E2E
 env:
 OPENSHELL_E2E_KUBE_CONTEXT: kind-${{ env.KIND_CLUSTER_NAME }}
 OPENSHELL_E2E_KUBE_EXTRA_VALUES: ${{ inputs.extra-helm-values }}
 OPENSHELL_E2E_KUBE_EXTERNAL_POSTGRES_SECRET: ${{ inputs.external-postgres-secret }}
 IMAGE_TAG: ${{ inputs.image-tag }}
 OPENSHELL_REGISTRY: ghcr.io/nvidia/openshell
-run: mise run --no-deps --skip-deps e2e:kubernetes
+E2E_TASK: ${{ inputs.e2e-task }}
+run: mise run --no-deps --skip-deps "$E2E_TASK"
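
The step now passes `inputs.e2e-task` through the `E2E_TASK` environment variable and quotes it, instead of interpolating the expression into the `run:` script. The sketch below illustrates the effect of that quoting. It assumes a hypothetical `run_task` function that only prints the argument vector and never calls `mise`, and it applies the default in shell, whereas the workflow applies it through the input's `default:`.

```shell
# Hypothetical stand-in for the workflow step: prints each argument that would
# be handed to mise on its own line instead of running anything.
run_task() {
  # Assumption: fall back to the input's documented default when unset.
  local task="${E2E_TASK:-e2e:kubernetes}"
  printf '%s\n' mise run --no-deps --skip-deps "$task"
}
```

With `E2E_TASK='x; echo pwned'`, the function still prints five lines, because the quoted variable stays a single argument and is never re-parsed as shell syntax.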

.github/workflows/e2e-label-help.yml

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ jobs:
 status_summary="The matching required CI gate status on this PR will flip green automatically once the run finishes."
 ;;
 test:e2e-kubernetes)
-suite_summary="Kubernetes HA E2E"
+suite_summary="Kubernetes HA and credential-driver E2E"
 build_summary="gateway and supervisor images"
 status_summary="This is an optional proof-of-life suite; failures are visible in the workflow run but do not publish a required CI gate status."
 ;;

AGENTS.md

Lines changed: 2 additions & 0 deletions
@@ -39,6 +39,8 @@ These pipelines connect skills into end-to-end workflows. Individual skill files
 | `crates/openshell-core/` | Shared core | Common types, configuration, error handling |
 | `crates/openshell-providers/` | Provider management | Credential provider backends |
 | `crates/openshell-tui/` | Terminal UI | Ratatui-based dashboard for monitoring |
+| `crates/openshell-driver-kubernetes-secrets/` | Kubernetes Secrets credential driver | In-process `CredentialDriver` backend for OpenShell-managed K8s Secret storage |
+| `crates/openshell-driver-openbao/` | OpenBao credential driver | In-process `CredentialDriver` backend for OpenBao/Vault-style KV storage |
 | `crates/openshell-driver-kubernetes/` | Kubernetes compute driver | In-process `ComputeDriver` backend for K8s sandbox pods |
 | `crates/openshell-driver-docker/` | Docker compute driver | In-process `ComputeDriver` backend for local Docker sandbox containers |
 | `crates/openshell-driver-vm/` | VM compute driver | Standalone libkrun-backed `ComputeDriver` subprocess (embeds its own rootfs + runtime) |

CI.md

Lines changed: 4 additions & 3 deletions
@@ -15,10 +15,11 @@ Three opt-in labels enable the long-running E2E suites:
 - `test:e2e` runs the standard E2E suite in `Branch E2E Checks`
 - `test:e2e-gpu` runs GPU E2E in `Branch E2E Checks`
 - `test:e2e-kubernetes` runs Kubernetes E2E with the HA Helm overlay
-(`replicaCount: 2` and bundled PostgreSQL) in `Branch E2E Checks`
+(`replicaCount: 2` and bundled PostgreSQL) and the credential-driver suite
+(Kubernetes Secrets plus OpenBao) in `Branch E2E Checks`
 
 When multiple labels are present, `Branch E2E Checks` builds the shared gateway and supervisor images once and fans out all enabled suites in parallel.
-The `OpenShell / E2E` and `OpenShell / GPU E2E` required statuses are evaluated from separate suite result jobs inside that workflow. `test:e2e-kubernetes` is optional while HA behavior is under active iteration: failures are visible in the workflow run but do not publish a required CI gate status.
+The `OpenShell / E2E` and `OpenShell / GPU E2E` required statuses are evaluated from separate suite result jobs inside that workflow. `test:e2e-kubernetes` is optional while Kubernetes HA and credential-driver behavior are under active iteration: failures are visible in the workflow run but do not publish a required CI gate status.
 
 The GitHub ruleset should require the `OpenShell / ...` statuses published by `Required CI Gates`, not the push-triggered workflow jobs directly.
 
@@ -110,7 +111,7 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma
 | File | Role |
 |---|---|
 | `.github/workflows/branch-checks.yml` | Required non-E2E PR checks. Triggers on `push: pull-request/[0-9]+`. |
-| `.github/workflows/branch-e2e.yml` | Opt-in standard, GPU, and Kubernetes HA E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e`, `test:e2e-gpu`, or `test:e2e-kubernetes`. |
+| `.github/workflows/branch-e2e.yml` | Opt-in standard, GPU, Kubernetes HA, and Kubernetes credential-driver E2E. Triggers on `push: pull-request/[0-9]+` and runs jobs selected by `test:e2e`, `test:e2e-gpu`, or `test:e2e-kubernetes`. |
 | `.github/workflows/helm-lint.yml` | Helm chart validation. Triggers on `push: pull-request/[0-9]+` and skips lint jobs unless Helm inputs changed. |
 | `.github/actions/pr-gate/action.yml` | Composite action that resolves PR metadata and verifies the required label is set. |
 | `.github/actions/pr-merge-base/action.yml` | Composite action that resolves and fetches the merge-base commit for `pull-request/<N>` push workflows. |
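
The label-to-suite mapping documented in this file can be sketched as a small shell function. The name `suites_for_label` and the suite identifiers it prints are illustrative assumptions; the workflow itself derives boolean outputs from the PR's label JSON with `jq`.

```shell
# Hypothetical helper: prints the suites that a single PR label enables,
# one per line, following the mapping documented in CI.md.
suites_for_label() {
  case "$1" in
    test:e2e)            echo "core" ;;
    test:e2e-gpu)        echo "gpu" ;;
    test:e2e-kubernetes) printf '%s\n' "kubernetes-ha" "kubernetes-credential-drivers" ;;
    *)                   return 1 ;;   # label does not enable an E2E suite
  esac
}
```

The single `test:e2e-kubernetes` label fans out to both Kubernetes suites, which matches `branch-e2e.yml` checking the same label for both the HA and credential-driver outputs.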

Cargo.lock

Lines changed: 62 additions & 0 deletions
Generated file; diff not rendered.

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ sha2 = "0.10"
 rand = "0.9"
 jsonwebtoken = "9"
 getrandom = "0.3"
+ring = "0.17"
 spiffe = { version = "0.15", default-features = false, features = ["workload-api-jwt", "tracing"] }
 
 # Filesystem embedding

TESTING.md

Lines changed: 10 additions & 0 deletions
@@ -150,6 +150,7 @@ Suites:
 - Common suite (`--features e2e`) - driver-neutral CLI behavior, sandbox lifecycle, sync, port forwarding, policy, and provider tests.
 - Docker suite (`--features e2e-docker`) - common suite plus Docker-only coverage such as Dockerfile image builds, Docker preflight checks, and managed Docker gateway resume.
 - Docker GPU suite (`--features e2e-docker-gpu`) - Docker suite plus GPU sandbox smoke coverage.
+- Kubernetes credential-driver suite (`--features e2e-kubernetes-credential-drivers`) - targeted Kubernetes Secrets and OpenBao provider credential storage coverage.
 
 GPU device-selection tests compare OpenShell sandboxes against a plain Docker or
 Podman container that requests `--device nvidia.com/gpu=all`. The probe image
@@ -173,6 +174,14 @@ Run the Podman-backed Rust CLI e2e suite:
 mise run e2e:podman
 ```
 
+Run the targeted Kubernetes credential-driver e2e suite. This deploys OpenBao
+into the test cluster and validates Kubernetes Secrets and OpenBao storage
+backends one at a time:
+
+```shell
+mise run e2e:kubernetes:credential-drivers
+```
+
 Run a single test directly with cargo:
 
 ```shell
@@ -203,3 +212,4 @@ The harness (`e2e/rust/src/harness/`) provides:
 | `OPENSHELL_GATEWAY` | Override active gateway name for E2E tests |
 | `OPENSHELL_GATEWAY_ENDPOINT` | Run E2E tests against an existing plaintext HTTP gateway endpoint |
 | `OPENSHELL_E2E_DRIVER` | Driver name exported by the e2e gateway wrapper (`docker`, `podman`, or `vm`) |
+| `OPENSHELL_E2E_CREDENTIAL_DRIVERS` | Enables the Kubernetes credential-driver fixture path in `e2e/with-kube-gateway.sh` |

architecture/gateway.md

Lines changed: 8 additions & 3 deletions
@@ -159,9 +159,14 @@ default WAL journal mode), which mirror the same sensitive contents.
 Persisted state includes sandboxes, providers, provider credential refresh
 state, SSH sessions, policy revisions, settings, inference configuration, and
 deployment records. Provider refresh material is stored as a separate object
-scoped to the provider instance through `objects.scope`; the provider record
-keeps only the current injectable credential values and optional per-credential
-expiry timestamps.
+scoped to the provider instance through `objects.scope`. Provider records keep
+inline credential values only for legacy records created before credential
+driver storage. New provider writes keep driver-owned credential handles and
+optional per-credential expiry timestamps. When no external credential driver
+is configured, gateways use server-owned encrypted database credential storage
+for defense in depth. Multi-replica deployments can use that default with a
+shared database and shared key-encryption key, or opt into an external backend such as OpenBao
+or Kubernetes Secrets.
 
 ### Optimistic Concurrency (CAS)
 
