Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions .design/project-log/2026-06-05-broker-disconnect-race-fix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Project Log: Broker Disconnect Reconnect Race Fix (Issue #131)

**Date:** 2026-06-05
**Task:** Unify broker disconnect race fix from two branches into PR #303

## Problem

When a broker disconnects and reconnects rapidly, the stale disconnect callback's
offline stamp can clobber the new connection's online status. The root cause is a
TOCTOU race: `ReleaseRuntimeBrokerConnection` and `UpdateRuntimeBrokerHeartbeat`
were separate calls — the heartbeat update has no session guard and unconditionally
overwrites status. Provider statuses are also clobbered and never restored by
heartbeats, leaving the broker permanently invisible until hub restart.

## Solution

Added `ReleaseAndMarkBrokerOffline` to the store interface — a single CAS write
that atomically clears affinity AND stamps status=offline, only if the session
still matches. If a concurrent reconnect has already claimed the broker with a
new session, the compare fails and the callback is a no-op.

Also added a re-check guard in `server.go` before updating provider statuses:
after the atomic release, re-read the broker to confirm no concurrent
`markBrokerOnline` has re-claimed it before stamping providers offline.

## Branch Unification

Two branches addressed this issue:
- `scion/dev-issue-131` (PR #303): had only a docs/project-log commit, no code fix
- `origin/fix/session-guarded-broker-disconnect` (fork PR #144): had the complete
code fix with tests

The fork branch's fix was the more complete solution. Rebased PR #303 onto
upstream main and cherry-picked the fork's fix commit to produce a single
unified branch.

## Files Changed

- `pkg/store/store.go` — added `ReleaseAndMarkBrokerOffline` to `RuntimeBrokerStore` interface
- `pkg/store/entadapter/project_store.go` — implemented `ReleaseAndMarkBrokerOffline` with CAS retry loop
- `pkg/hub/server.go` — rewired `SetOnDisconnect` callback to use the atomic method + provider re-check guard
- `pkg/store/entadapter/broker_affinity_test.go` — 4 new tests covering the atomic method

## Verification

- All 10 broker affinity tests pass (4 new + 6 existing)
- Hub package compiles cleanly
- Pre-existing test failures in `pkg/hub` (unrelated to this change) confirmed on upstream main
338 changes: 338 additions & 0 deletions .design/resource-clone-delete.md

Large diffs are not rendered by default.

22 changes: 18 additions & 4 deletions cmd/server_foreground.go
Original file line number Diff line number Diff line change
Expand Up @@ -859,6 +859,17 @@ func parseAdminEmails(cfg *config.GlobalConfig) []string {
return adminEmailList
}

// resolveSessionSecret resolves the deployment-wide session secret from the
// --session-secret flag, falling back to the SCION_SERVER_SESSION_SECRET env
// var. The same value backs both the web session cookie store and the hub JWT
// signing keys so that all replicas behind the load balancer agree.
func resolveSessionSecret() string {
if webSessionSecret != "" {
return webSessionSecret
}
return os.Getenv("SCION_SERVER_SESSION_SECRET")
}

// initHubServer creates and configures the Hub server.
func initHubServer(ctx context.Context, cfg *config.GlobalConfig, s store.Store, hubEndpoint, devAuthToken string, adminEmailList []string, adminMode bool, maintenanceMessage string, requestLogger, messageLogger *slog.Logger, globalDir string, pluginMgr *scionplugin.Manager, secretBackend secret.SecretBackend) (*hub.Server, error) {
hubCfg := hub.ServerConfig{
Expand Down Expand Up @@ -929,6 +940,12 @@ func initHubServer(ctx context.Context, cfg *config.GlobalConfig, s store.Store,
MaintenanceConfig: resolveMaintenanceConfig(cfg),
SecretBackend: secretBackend,
GCPProjectID: cfg.Hub.GCPProjectID,
// Derive the agent/user JWT signing keys from the same shared session
// secret the web cookie store uses, so every replica behind the load
// balancer agrees on the signing key regardless of its host-derived
// HubID. Without this, a JWT minted by one replica fails validation on
// another (cross-replica "session_expired" login loop).
SharedSigningSecret: resolveSessionSecret(),
}

hubSrv, err := hub.New(hubCfg, s)
Expand Down Expand Up @@ -1123,10 +1140,7 @@ func initWebServer(ctx context.Context, cfg *config.GlobalConfig, hubSrv *hub.Se
}

// Allow env var overrides for session/OAuth config
sessionSecret := webSessionSecret
if sessionSecret == "" {
sessionSecret = os.Getenv("SCION_SERVER_SESSION_SECRET")
}
sessionSecret := resolveSessionSecret()
baseURL := webBaseURL
if baseURL == "" {
baseURL = os.Getenv("SCION_SERVER_BASE_URL")
Expand Down
13 changes: 0 additions & 13 deletions pkg/ent/harnessconfig.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 0 additions & 10 deletions pkg/ent/harnessconfig/harnessconfig.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

15 changes: 0 additions & 15 deletions pkg/ent/harnessconfig/where.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

65 changes: 0 additions & 65 deletions pkg/ent/harnessconfig_create.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

34 changes: 0 additions & 34 deletions pkg/ent/harnessconfig_update.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading