Skip to content

fix: prevent stale disconnect from marking reconnected broker offline#303

Merged
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/dev-issue-131
Jun 5, 2026
Merged

fix: prevent stale disconnect from marking reconnected broker offline#303
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/dev-issue-131

Conversation

@ptone
Copy link
Copy Markdown
Member

@ptone ptone commented Jun 4, 2026

Summary

  • Root cause: When a broker's control channel disconnects and reconnects rapidly (within the same second), the old connection's deferred removeConnection goroutine races with the new connection. It removes the new connection from the map and fires onDisconnect, leaving onlineProviders=0 despite a live WebSocket session.
  • Fix: removeConnection now takes a *BrokerConnection pointer and only removes the map entry / fires the disconnect callback when the passed connection is still the active one for that brokerID. Superseded connections are silently skipped.
  • Adds two new tests reproducing the exact race scenario described in the issue.

Fixes issue 131

Test plan

  • TestControlChannelManager_ReconnectDoesNotTriggerDisconnect — verifies a stale removeConnection from the old connection does NOT fire onDisconnect or remove the new connection
  • TestControlChannelManager_StaleRemoveAfterReconnect_ThenRealDisconnect — verifies the full lifecycle: stale remove is skipped, then a real disconnect of the current connection fires the callback correctly
  • Existing TestControlChannelManager_OnDisconnectCallback and _NilSafe updated and passing
  • Full pkg/hub/... test suite passes

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition where a rapid disconnect and reconnect of a broker's control channel could leave the broker permanently offline. The fix updates removeConnection to accept and verify the specific *BrokerConnection pointer, ensuring that stale deferred disconnects do not remove newly established connections. Corresponding unit tests have been added to verify this behavior. There are no review comments, and I have no additional feedback to provide.

ptone added 2 commits June 5, 2026 20:48
…ce (#131)

The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection
and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects
rapidly, the stale disconnect's offline stamp can clobber the new connection's
online status because UpdateRuntimeBrokerHeartbeat has no session guard — it
unconditionally overwrites status. Provider statuses are also clobbered and never
restored by heartbeats, leaving the broker permanently invisible until hub restart.

Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps
status=offline in a single CAS write. If a concurrent reconnect has already
claimed the broker with a new session, the compare fails and the callback is
a no-op. Also add a re-check guard before updating provider statuses.
@scion-gteam scion-gteam Bot force-pushed the scion/dev-issue-131 branch from 9a7a7c1 to 1f903ff Compare June 5, 2026 20:52
@ptone ptone merged commit 3aab748 into GoogleCloudPlatform:main Jun 5, 2026
5 checks passed
@scion-gteam scion-gteam Bot deleted the scion/dev-issue-131 branch June 5, 2026 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant