Bug: Runtime broker control channel disconnects after ~4h, does not recover without hub restart #131

@scion-gteam

Summary

The runtime broker's control channel disconnects after the hub has been running for roughly 4h15m, causing no_runtime_broker errors on all agent start attempts. The connection is not durably re-established on its own: brief reconnects were observed, but the broker ended up permanently offline and a full hub restart was required to recover.

Environment

  • Instance: scion-next (https://next.demo.scion-ai.dev)
  • Hub binary: upstream main at b2eaa59 (cross-compiled, deployed 2026-06-03)
  • Runtime broker: isHubManaged=false, Docker runtime, listening on localhost:9800

Expected behavior

The runtime broker control channel should either stay connected indefinitely or automatically reconnect when disconnected, without requiring a hub restart.

Actual behavior

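After ~4h15m of uptime the control channel disconnects and the hub marks the broker offline (onlineProviders=0). Reconnects either fail or are short-lived, with the channel dropping again shortly afterwards. While the broker is offline, all new agent start requests return: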

no_runtime_broker: Default runtime broker is unavailable and no alternatives found (status: 422)

Log sequence (from scion-next journal)

  • 14:28 — hub started (PID 2606172), runtime broker registered at localhost:9800
  • 18:40–18:43 — broker healthy, onlineProviders=1, agents launching successfully
  • 18:43:34 — "Broker control channel disconnected" / "Broker disconnected, marking offline"
  • 18:46:33 — brief reconnect then disconnected again
  • 18:51:22 — final disconnect, no recovery
  • 20:16 — manual hub restart → broker re-registered → fixed
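
The uptime figures can be recomputed from the timestamps above. A minimal Python sketch, assuming all entries fall on the same day and that the hub start and the manual restart (logged only to the minute) happened at :00 seconds:

```python
from datetime import datetime, timedelta

# Timestamps copied from the log sequence in this report. The hub start and
# the manual restart are only known to the minute, so ":00" is an assumption.
HUB_START = "14:28:00"
EVENTS = {
    "first disconnect": "18:43:34",
    "reconnect, then second disconnect": "18:46:33",
    "final disconnect": "18:51:22",
    "manual hub restart": "20:16:00",
}


def elapsed(start: str, end: str) -> timedelta:
    """Return end - start for two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def uptime_table() -> dict[str, timedelta]:
    """Map each event name to the hub uptime at which it occurred."""
    return {name: elapsed(HUB_START, ts) for name, ts in EVENTS.items()}


if __name__ == "__main__":
    for name, delta in uptime_table().items():
        print(f"{name}: {delta} after hub start")
    outage = elapsed(EVENTS["final disconnect"], EVENTS["manual hub restart"])
    print(f"offline until restart: {outage}")
```

Under those assumptions the first disconnect lands at 4:15:34 of uptime, and the broker stayed offline for about 1h25m between the final disconnect and the manual restart.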

Key log messages:

INFO "Broker control channel disconnected" brokerID=a6487539-...
INFO "Broker disconnected, marking offline" brokerID=a6487539-...
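
To check whether other instances show the same pattern, the two messages above can be counted per broker in an exported journal. This is a rough sketch: the regular expression is inferred only from the two sample lines in this report, so real log lines with extra fields (timestamps, additional attributes) may need a different pattern. `count_disconnects` and `SAMPLE` are illustrative names, not part of scion.

```python
import re
from collections import Counter

# Pattern inferred from the two sample lines in this report (assumption):
# LEVEL "message" brokerID=<id>
LINE_RE = re.compile(r'^(?P<level>[A-Z]+) "(?P<msg>[^"]*)" brokerID=(?P<broker>\S+)')

DISCONNECT_MESSAGES = {
    "Broker control channel disconnected",
    "Broker disconnected, marking offline",
}


def count_disconnects(lines):
    """Count disconnect-related messages per (brokerID, message) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg") in DISCONNECT_MESSAGES:
            counts[(match.group("broker"), match.group("msg"))] += 1
    return counts


# The two lines quoted in this report, broker ID left truncated as logged here.
SAMPLE = [
    'INFO "Broker control channel disconnected" brokerID=a6487539-...',
    'INFO "Broker disconnected, marking offline" brokerID=a6487539-...',
]

if __name__ == "__main__":
    for (broker, msg), n in count_disconnects(SAMPLE).items():
        print(f"{broker}  {n}x  {msg}")
```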

Timing note

The first disconnect occurred about 4h15m after hub startup (14:28 to 18:43:34), which may indicate a connection TTL, OAuth token expiry, or keep-alive timeout. The pattern of three disconnects with brief recoveries before a permanent offline state suggests a retry backoff that eventually gives up.
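
If the retry logic does give up after a bounded number of attempts, one possible shape for the fix is a backoff that is capped in delay but unbounded in attempts. The Python sketch below only illustrates that idea. It is not the hub's or broker's actual reconnect code, this report does not establish which side initiates the connection, and `backoff_delays`, `reconnect_forever`, and the `connect` callable are hypothetical names.

```python
import itertools
import random
import time


def backoff_delays(base=1.0, cap=60.0, jitter=0.2, rng=random.random):
    """Yield delays forever: exponential growth up to `cap`, with +/- jitter."""
    for attempt in itertools.count():
        # Clamp the exponent so the float multiplication never overflows.
        delay = min(cap, base * 2 ** min(attempt, 32))
        spread = delay * jitter
        yield delay - spread + 2 * spread * rng()


def reconnect_forever(connect, sleep=time.sleep, delays=None):
    """Call `connect` until it succeeds; return (result, attempts used).

    `connect` is a hypothetical callable that raises ConnectionError on
    failure. With the default endless delay generator this never gives up.
    """
    delays = delays if delays is not None else backoff_delays()
    for attempt, delay in enumerate(delays, start=1):
        try:
            return connect(), attempt
        except ConnectionError:
            sleep(delay)
```

The key property is that the delay stops growing at `cap` while the attempt count has no limit, so a broker that drops after a long uptime is retried indefinitely instead of being left offline until a restart.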

Workaround

Restart scion-hub. The runtime broker re-registers on startup and resumes normally.

Additional context

  • Not observed on the previous binary (branch grove-rename2). May be a regression in how the runtime broker control channel lifecycle is managed in recent upstream/main code.
  • The scion-sagan instance (which uses isHubManaged=true) did not exhibit this pattern.
  • Likely related to: the 422 no_runtime_broker errors currently blocking new agent creation on the scion-gteam instance.
