
fix(hub): multi-node session fixes — OAuth state_mismatch + shared signing keys#146

Closed
scion-gteam[bot] wants to merge 2 commits into main from postgres/delta-fixes

@scion-gteam scion-gteam Bot commented Jun 5, 2026

Summary

Fixes two multi-node session issues that cause login failures when the hub runs behind a load balancer with multiple replicas sharing a Postgres database.

Changes

1. OAuth state_mismatch fix (web.go)

Replaces gorilla/sessions FilesystemStore (per-replica local disk) with an encrypted CookieStore. The entire session now lives in the client's cookie — any replica sharing SESSION_SECRET can read it. This fixes OAuth state_mismatch errors when /auth/login and /auth/callback hit different replicas.

2. Shared JWT signing keys (web.go)

When SESSION_SECRET is set, JWT signing keys (agent + user) are derived deterministically from it instead of being scoped per host (hubID = sha256(hostname)[:12]). All replicas then produce and verify identical tokens, which fixes the session_expired redirect loop where a JWT signed by Hub A failed verification on Hub B.

Context

These issues were discovered during live multi-node testing at https://multi.demo.scion-ai.dev (two hubs behind GCLB, shared CloudSQL). The core Postgres/broker-dispatch/NFS work is already merged via PRs GoogleCloudPlatform#304, GoogleCloudPlatform#305, GoogleCloudPlatform#306.

Test Plan

  • go build ./...
  • Cross-replica session round-trip regression test ✅
  • Wrong-secret negative test ✅
  • Live validated on two VMs behind GCLB

Deploy Note

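Existing scion_sess cookies become invalid on deploy (encoding changed), and the signing keys rotate because they are now derived from SESSION_SECRET, so existing web/CLI tokens are invalidated once. Users re-login once.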

Scion added 2 commits June 5, 2026 16:58
OAuth login behind the load balancer intermittently failed with
state_mismatch: the CSRF state token (and the entire web session) was
stored in a gorilla FilesystemStore on the handling replica's local
disk, while the browser only carried a session-ID cookie. When the LB
routed /auth/login and /auth/callback to different replicas, the
callback replica had no matching session file -> empty state ->
state_mismatch. It only "worked" when both hops happened to hit the
same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware
reads the Hub access/refresh JWTs from that disk-local store on every API
request, so sessions silently dropped whenever a follow-up request
landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla
CookieStore so the whole session lives in the client's cookie and any
replica sharing SESSION_SECRET can read it. Keys are derived
deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte
AES-256 encryption key, domain-separated). No DB, no migration; works
with N replicas.

The original switch to disk was motivated by a "JWT tokens exceed 4096
bytes" concern. Measured against the current compact HS256 tokens, the
full session (identity + access + refresh) encodes to ~2.6 KB, well
under the browser's ~4 KB per-cookie cap, so the securecookie length
limit is left in force (an oversize session now errors and is logged
instead of being silently dropped).
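
For a rough feel of the headroom, the sketch below estimates the encoded cookie size from the serialized session size. It assumes the usual securecookie layout (a 16-byte IV plus ciphertext, base64-encoded, then wrapped with a timestamp and a 32-byte MAC and base64-encoded again) and a 10-digit timestamp. It is an approximation, not the measurement used in this commit, and `estimateEncodedLen` is a hypothetical helper.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// cookieLimit is securecookie's default maximum encoded length.
const cookieLimit = 4096

// estimateEncodedLen approximates the final cookie value length for a
// serialized session of payloadLen bytes. The layout constants below are
// assumptions about the encoding, used only for estimation.
func estimateEncodedLen(payloadLen int) int {
	const (
		ivLen        = 16 // AES block-size IV prepended to the ciphertext
		macLen       = 32 // HMAC-SHA256 output
		timestampLen = 10 // decimal Unix timestamp
	)
	inner := base64.URLEncoding.EncodedLen(ivLen + payloadLen)
	// Outer value is "timestamp|inner|mac", base64-encoded once more.
	return base64.URLEncoding.EncodedLen(timestampLen + 1 + inner + 1 + macLen)
}

func main() {
	for _, n := range []int{500, 1400, 2200, 3000} {
		size := estimateEncodedLen(n)
		fmt.Printf("payload %4d B -> cookie ~%4d B, fits: %v\n", n, size, size <= cookieLimit)
	}
}
```

The double base64 pass is why the encoded cookie is noticeably larger than the raw session: each pass grows the data by about a third.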

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica
round-trip regression test (cookie minted by replica A decodes on
replica B with the same secret; carries OAuth state + post-login tokens)
plus a negative test (a different secret cannot decode the cookie).
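
The shape of those two tests can be sketched with the standard library alone. This is not the PR's test code: AES-256-GCM stands in for gorilla's securecookie encoding, and the key derivation, helper names, and payload are illustrative assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM cipher from a key derived from the secret.
// The HMAC-SHA256 derivation and its label are assumptions.
func newGCM(secret string) (cipher.AEAD, error) {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte("cookie-enc"))
	block, err := aes.NewCipher(mac.Sum(nil))
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts and authenticates a session payload ("replica A").
func seal(secret string, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(secret)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// The nonce is prepended so the reader can recover it.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open decrypts a sealed payload ("replica B"); it fails on a wrong key.
func open(secret string, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(secret)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("cookie too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	cookie, err := seal("shared-secret", []byte("oauth_state=abc123"))
	if err != nil {
		panic(err)
	}

	got, err := open("shared-secret", cookie) // same secret, other replica
	fmt.Printf("same secret:  %q, err=%v\n", got, err)

	_, err = open("other-secret", cookie) // different secret must fail
	fmt.Println("wrong secret rejected:", err != nil)
}
```

Nothing in `open` depends on which process called `seal`, which is the property the regression test pins down.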
…ross-replica login loop

The cookie-store fix (0515e2a) made the web session replica-portable, but
the Hub JWT *inside* the cookie is still signed with a per-replica key:
ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and
hubID = sha256(hostname)[:12]. The integration env runs two replicas of one
logical hub behind a single LB, sharing one Postgres DB and one
SESSION_SECRET but with different hostnames -> different hubIDs -> different
HS256 signing keys.
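
A small sketch of why the two replicas diverge. It assumes the `[:12]` truncation applies to the hex-encoded digest (the commit message does not say whether it is hex characters or raw bytes), and the hostnames are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hubID mirrors the sha256(hostname)[:12] rule quoted above.
// Truncating the hex string rather than the raw bytes is an assumption.
func hubID(hostname string) string {
	sum := sha256.Sum256([]byte(hostname))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	a := hubID("hub-a.internal") // hypothetical hostname of replica A
	b := hubID("hub-b.internal") // hypothetical hostname of replica B

	fmt.Println("replica A hubID:", a)
	fmt.Println("replica B hubID:", b)
	fmt.Println("same key scope: ", a == b)
}
```

Because the signing key is looked up by hubID, two hosts with different names never share a key unless something else ties them together.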

So a user JWT minted on replica A failed signature verification on replica B
(go-jose: error in cryptographic primitive); refresh failed too (refresh
token signed with the same foreign key), so sessionToBearerMiddleware
declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1)
and returned session_expired. The cookie deletion turns it into a redirect
loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a approach (replica-portable via the shared secret)
from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret;
when set, ensureSigningKey derives the agent and user signing keys
deterministically from it (domain-separated by key name) and bypasses
per-host secret-backend storage. cmd feeds the same --session-secret /
SESSION_SECRET value into both the web cookie store and the hub config via a
new resolveSessionSecret() helper. Empty secret keeps the existing per-hub
behavior (no regression for single-node/local dev).

Tests: cross-replica round trip (different hubID + same secret -> identical
keys, token minted on A validates on B; different secret cannot) plus
pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so
existing web/CLI tokens are invalidated once and users re-login.
@ptone ptone closed this Jun 5, 2026