fix: multi-node session fixes + Cloud Run deployment #142
Closed
scion-gteam[bot] wants to merge 6 commits into
added 6 commits
June 5, 2026 16:34
OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas.

The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie).
…ross-replica login loop

The cookie-store fix (0515e2a) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login.
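The cross-replica round trip can be illustrated as follows. This is a sketch under stated assumptions: `hubID` reads `sha256(hostname)[:12]` as the first 12 hex characters, `signingKey` uses an HMAC-based derivation with a made-up label prefix, and `mint`/`verify` are a simplified stand-in for HS256 JWT handling rather than the go-jose code the hub uses.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"strings"
)

// hubID follows the description "sha256(hostname)[:12]".
// Assumption: the slice is taken over the hex encoding.
func hubID(hostname string) string {
	sum := sha256.Sum256([]byte(hostname))
	return hex.EncodeToString(sum[:])[:12]
}

// signingKey derives a key from the shared secret and a key name only,
// so it does not depend on hubID. The derivation is hypothetical.
func signingKey(sharedSecret, keyName string) []byte {
	mac := hmac.New(sha256.New, []byte(sharedSecret))
	mac.Write([]byte("signing-key/" + keyName))
	return mac.Sum(nil)
}

// mint returns "payload.signature", a simplified stand-in for an HS256 JWT.
func mint(key []byte, payload string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return payload + "." + base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares it in constant time.
func verify(key []byte, token string) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	sig, err := base64.RawURLEncoding.DecodeString(token[i+1:])
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(token[:i]))
	return hmac.Equal(sig, mac.Sum(nil))
}

func main() {
	// Two replicas of one logical hub: different hostnames, different hubIDs.
	fmt.Println(hubID("replica-a") != hubID("replica-b"))

	token := mint(signingKey("shared", "user"), "sub=alice")   // minted on A
	fmt.Println(verify(signingKey("shared", "user"), token))   // B, same secret
	fmt.Println(verify(signingKey("other", "user"), token))    // different secret
	fmt.Println(verify(signingKey("shared", "agent"), token))  // different key name
}
```

Because the key depends only on the secret and the key name, the hostname-derived hubID no longer influences whether a token verifies.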
…broker Adds scripts/cloudrun/ with Dockerfile, deploy script, hub settings template, and README for deploying the Scion hub as a Cloud Run service (min=max=1) with a co-located broker targeting scion-demo-cluster.
…d Run deploy fixes
- Add /health as an alias for /healthz in web.go, auth.go and isPublicRoute() (Cloud Run's Google Frontend intercepts /healthz and returns 404 before the container sees it; /health is not intercepted)
- hubclient: fall back to /health when /healthz returns 404 for Cloud Run compat
- Dockerfile: use entrypoint.sh wrapper; fix /run/secrets dir permissions
- entrypoint.sh: copy secret-mounted settings.yaml via cat (symlink-safe) before starting the hub; use --enable-runtime-broker + --dev-auth flags
- deploy.sh: mount settings secret at /run/secrets/settings.yaml
- hub-settings-template.yaml: add active_profile: default
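The client-side fallback from /healthz to /health might look roughly like this. `checkHealth` and `newFrontend` are hypothetical names for this sketch; the test server only imitates a frontend that answers 404 on /healthz, and the real hubclient code may be structured differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checkHealth probes /healthz first and falls back to /health only when
// /healthz answers 404. It returns the path that reported healthy.
func checkHealth(c *http.Client, baseURL string) (string, error) {
	for _, path := range []string{"/healthz", "/health"} {
		resp, err := c.Get(baseURL + path)
		if err != nil {
			return "", err
		}
		resp.Body.Close()
		if path == "/healthz" && resp.StatusCode == http.StatusNotFound {
			continue // possibly intercepted upstream; try the alias
		}
		if resp.StatusCode != http.StatusOK {
			return "", fmt.Errorf("%s returned %d", path, resp.StatusCode)
		}
		return path, nil
	}
	return "", fmt.Errorf("no health endpoint responded")
}

// newFrontend starts a test server that serves /health but not /healthz,
// mimicking the behavior described for Cloud Run.
func newFrontend() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFrontend()
	defer srv.Close()

	path, err := checkHealth(srv.Client(), srv.URL)
	fmt.Println(path, err)
}
```

Limiting the fallback to a 404 on /healthz keeps other failures (for example a 500 from an unhealthy hub) visible instead of masking them behind the alias.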
…un auth

When running on GCP (GKE, Cloud Run, GCE), automatically fetches an OIDC identity token from the metadata server and adds it as Authorization: Bearer on all hub requests. Cloud Run's --no-allow-unauthenticated requires this as the outer auth layer; the inner hub auth continues to use the existing X-Scion-Agent-Token custom header (no conflict).
- oidc.go: oidcTokenSource (cached, 5-min refresh margin), oidcTransport (http.RoundTripper wrapper), maybeConfigureOIDC() called from both NewClient() and NewClientWithConfig()
- Audience defaults to hub URL, overridable via SCION_HUB_OIDC_AUDIENCE
- isOnGCPFunc injectable for tests; graceful degradation if metadata unavailable
- 10 new tests in oidc_test.go; all 40 pkg tests pass
Required for the broker to know where to pull agent container images on GKE. Found during E2E test.
Summary
Delta fixes on top of PRs GoogleCloudPlatform#304 (Postgres), GoogleCloudPlatform#305 (broker dispatch), GoogleCloudPlatform#306 (NFS) which are already merged.
Changes
Multi-Node Session Fixes
- SESSION_SECRET so all replicas produce/verify identical tokens. Fixes the login loop (session_expired) when requests alternate between replicas behind a load balancer.
Cloud Run Deployment
- scripts/cloudrun/*)
- /health alias for Cloud Run health checks
- pkg/sciontool/hub/oidc.go)
- image_registry in hub settings template
What's NOT in this PR (already merged)
Test Plan
- go build ./... PASS
- go test ./pkg/store/... PASS