fix(hub): multi-node session fixes — OAuth state_mismatch + shared signing keys by scion-gteam[bot] · Pull Request #146 · ptone/scion

scion-gteam · 2026-06-05T23:53:22Z

Summary

Fixes two multi-node session issues that cause login failures when the hub runs behind a load balancer with multiple replicas sharing a Postgres database.

Changes

1. OAuth state_mismatch fix (`web.go`)

Replaces gorilla/sessions FilesystemStore (per-replica local disk) with an encrypted CookieStore. The entire session now lives in the client's cookie — any replica sharing SESSION_SECRET can read it. This fixes OAuth state_mismatch errors when /auth/login and /auth/callback hit different replicas.

2. Shared JWT signing keys (`web.go`)

When SESSION_SECRET is set, JWT signing keys (agent + user) are derived deterministically from it rather than from per-host sha256(hostname). This ensures all replicas produce and verify identical tokens, fixing the session_expired redirect loop where a JWT signed by Hub A failed verification on Hub B.

Context

These issues were discovered during live multi-node testing at https://multi.demo.scion-ai.dev (two hubs behind GCLB, shared CloudSQL). The core Postgres/broker-dispatch/NFS work is already merged via PRs GoogleCloudPlatform#304, GoogleCloudPlatform#305, GoogleCloudPlatform#306.

Test Plan

go build ./... ✅
Cross-replica session round-trip regression test ✅
Wrong-secret negative test ✅
Live validated on two VMs behind GCLB

Deploy Note

Existing scion_sess cookies become invalid on deploy (encoding changed) — users re-login once.

OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas.

The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens, the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (an oversize session would now error and log, not silently drop).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie).

…ross-replica login loop

The cookie-store fix (0515e2a) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login.
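The wiring described in the fix, where one secret feeds both the cookie store and the hub config, might look roughly like the sketch below. The commit message names resolveSessionSecret() and ServerConfig.SharedSigningSecret but does not show them, so the signature, the flag-over-environment precedence, and the `webConfig`, `hubConfig`, and `keySource` names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// resolveSessionSecret is a hypothetical reconstruction: prefer the value
// of the --session-secret flag, otherwise fall back to the SESSION_SECRET
// environment variable. The precedence order is an assumption.
func resolveSessionSecret(flagValue string) string {
	if flagValue != "" {
		return flagValue
	}
	return os.Getenv("SESSION_SECRET")
}

// webConfig and hubConfig are stand-ins for the real configuration types.
type webConfig struct{ SessionSecret string }
type hubConfig struct{ SharedSigningSecret string }

// keySource reports which signing-key path a replica would take: an empty
// secret keeps the per-hub behavior, a non-empty one selects shared derivation.
func keySource(cfg hubConfig) string {
	if cfg.SharedSigningSecret == "" {
		return "per-hub"
	}
	return "shared"
}

func main() {
	// Resolve once, then hand the same value to both consumers so the
	// cookie keys and the JWT signing keys can never drift apart.
	secret := resolveSessionSecret("")
	web := webConfig{SessionSecret: secret}
	hub := hubConfig{SharedSigningSecret: secret}

	fmt.Println("same secret in both configs:", web.SessionSecret == hub.SharedSigningSecret)
	fmt.Println("signing key source:", keySource(hub))
}
```

Resolving the secret in a single place is the design point: two independent lookups could disagree, which would reintroduce a mismatch between the cookie and the tokens inside it.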

Scion added 2 commits June 5, 2026 16:58

ptone closed this Jun 5, 2026

scion-gteam[bot] wants to merge 2 commits into main from postgres/delta-fixes
