
feat: socket heartbeat lifecycle tracking, stale connection cleanup, and reconnect contract#115

Open
davedumto wants to merge 3 commits into TevaLabs:main from davedumto:feat/socket-heartbeat-reconnect

@davedumto

Closes #95

Context

Socket.IO was connecting and authenticating clients but had no visibility into connection health. There was no application-level tracking of when a socket was last active, no stale connection cleanup, and no defined contract for how clients should handle reconnection. A long-lived silent connection would consume server resources without ever being reclaimed.


What Changed

src/socket.ts

Heartbeat constants (exported)

PING_INTERVAL = 25 000 ms  — how often the server pings each client
PING_TIMEOUT  = 10 000 ms  — how long the server waits for a pong

Both values are passed directly to the Socket.IO constructor so the transport layer enforces the same contract the application advertises.
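
A minimal sketch of how one pair of constants could feed both the transport options and the advertised values. `pingInterval` and `pingTimeout` are the Socket.IO server option names; `heartbeatOptions` and `advertisedHeartbeat` are illustrative names and are not taken from the diff.

```typescript
// Heartbeat constants from the description above (milliseconds).
const PING_INTERVAL = 25_000;
const PING_TIMEOUT = 10_000;

// Hypothetical options object. In the real module it would presumably be
// handed to the Socket.IO constructor: new Server(httpServer, heartbeatOptions).
const heartbeatOptions = {
  pingInterval: PING_INTERVAL,
  pingTimeout: PING_TIMEOUT,
};

// Hypothetical helper: the values a server:hello payload would advertise.
// Reading them from the same constants keeps transport and contract in sync.
function advertisedHeartbeat(): { pingInterval: number; pingTimeout: number } {
  return { pingInterval: PING_INTERVAL, pingTimeout: PING_TIMEOUT };
}
```

Deriving both from the same constants means a later change to the timing cannot leave the advertised contract out of date.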

ConnectionRecord & connectionRegistry

  • A ConnectionRecord tracks userId, walletAddress, connectedAt, and lastSeenAt for every live socket.
  • connectionRegistry: Map<socketId, ConnectionRecord> is exported so external monitoring and tests can inspect state without going through Socket.IO internals.
  • lastSeenAt is refreshed from two sources:
    • socket.onAny(): fires on every incoming application-level event
    • socket.conn.on('packet', …): fires on incoming engine-level packets, including the pong responses that answer server pings (heartbeat replies)
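
The registry described above can be sketched roughly as follows. The field names come from this description; the concrete types, the `registerConnection` and `touch` helpers, and the optional `now` parameters are assumptions made for illustration.

```typescript
// One entry per live socket. Field names follow the PR description;
// the types (epoch milliseconds, optional identity) are assumptions.
interface ConnectionRecord {
  userId?: string;
  walletAddress?: string;
  connectedAt: number;
  lastSeenAt: number;
}

// socketId -> record
const connectionRegistry = new Map<string, ConnectionRecord>();

// Hypothetical helper: create the entry when a socket connects.
function registerConnection(
  socketId: string,
  identity: { userId?: string; walletAddress?: string },
  now: number = Date.now(),
): ConnectionRecord {
  const record: ConnectionRecord = { ...identity, connectedAt: now, lastSeenAt: now };
  connectionRegistry.set(socketId, record);
  return record;
}

// Hypothetical helper: refresh lastSeenAt; unknown ids are ignored.
function touch(socketId: string, now: number = Date.now()): void {
  const record = connectionRegistry.get(socketId);
  if (record) record.lastSeenAt = now;
}

// Possible wiring inside the connection handler (not executed here):
//   socket.onAny(() => touch(socket.id));
//   socket.conn.on('packet', (packet) => {
//     if (packet.type === 'pong') touch(socket.id);
//   });
```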

checkStaleConnections(io, staleThresholdMs?)

  • Exported utility that scans the registry for sockets idle beyond the threshold (default: PING_INTERVAL + PING_TIMEOUT + 5 s = 40 s).
  • Force-disconnects live but stale sockets via socket.disconnect(true); the disconnect event fires and cleans up the registry entry automatically.
  • Orphan entries — socket already gone but disconnect never fired — are deleted from the registry without throwing.
  • Returns the number of entries removed.
  • A 30-second periodic interval calls this inside initializeSocket; the interval is .unref()'d so it does not block process exit in tests or graceful shutdown.
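
A simplified sketch of the scan described in the bullets above, written against small structural types so it runs without Socket.IO. `ServerLike` mirrors the `io.sockets.sockets` map of Socket.IO v4; the extra `now` parameter is an assumption added to make the function easy to test, and the actual implementation in src/socket.ts may differ.

```typescript
interface ConnectionRecord { connectedAt: number; lastSeenAt: number; }
interface SocketLike { disconnect(close?: boolean): unknown; }
interface ServerLike { sockets: { sockets: Map<string, SocketLike> }; }

const PING_INTERVAL = 25_000;
const PING_TIMEOUT = 10_000;
// One full heartbeat cycle plus a 5 s grace period = 40 000 ms.
const DEFAULT_STALE_THRESHOLD_MS = PING_INTERVAL + PING_TIMEOUT + 5_000;

const connectionRegistry = new Map<string, ConnectionRecord>();

function checkStaleConnections(
  io: ServerLike,
  staleThresholdMs: number = DEFAULT_STALE_THRESHOLD_MS,
  now: number = Date.now(), // assumption: injectable clock for tests
): number {
  let removed = 0;
  // Deleting entries while iterating a Map with forEach is safe in JavaScript.
  connectionRegistry.forEach((record, socketId) => {
    if (now - record.lastSeenAt <= staleThresholdMs) return; // still fresh
    const socket = io.sockets.sockets.get(socketId);
    if (socket) {
      // Live but stale: close the connection. The disconnect handler is
      // expected to delete the registry entry, so it is not deleted here.
      socket.disconnect(true);
    } else {
      // Orphan: the socket is gone but the entry remained.
      connectionRegistry.delete(socketId);
    }
    removed += 1;
  });
  return removed;
}
```

The periodic sweep would then amount to calling this function from a 30-second `setInterval` whose timer has been `unref()`'d.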

server:hello event — reconnection contract
Emitted to every socket immediately after connection:

{
  "socketId": "...",
  "pingInterval": 25000,
  "pingTimeout": 10000,
  "authenticated": true,
  "userId": "..."
}

This defines the client-side reconnection contract:

  • If you have not received a server ping within pingInterval + pingTimeout ms, treat the connection as dead and reconnect.
  • On reconnect the server treats your socket as completely fresh; you must explicitly re-join any rooms you previously occupied.
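
On the client side, the contract above could be applied roughly like this. The `ServerHello` fields match the payload shown earlier; `isConnectionDead` and `roomsToRejoin` are hypothetical helper names, and how a client records the time of the last ping is left open.

```typescript
// Payload of server:hello as shown above; userId is absent when unauthenticated.
interface ServerHello {
  socketId: string;
  pingInterval: number;
  pingTimeout: number;
  authenticated: boolean;
  userId?: string;
}

// Hypothetical helper: true once no ping has arrived within
// pingInterval + pingTimeout milliseconds of the last one.
function isConnectionDead(
  hello: ServerHello,
  lastPingAt: number,
  now: number = Date.now(),
): boolean {
  return now - lastPingAt > hello.pingInterval + hello.pingTimeout;
}

// Hypothetical helper: the de-duplicated rooms a client must join again,
// because the server treats the new socket as completely fresh.
function roomsToRejoin(previousRooms: string[]): string[] {
  return Array.from(new Set(previousRooms));
}

// Possible use after a reconnect (the payload shape of join:round is an assumption):
//   roomsToRejoin(joinedRounds).forEach((roundId) => socket.emit('join:round', roundId));
```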

Disconnect cleanup
The disconnect handler now deletes the socket's entry from connectionRegistry, keeping the map accurate at all times.


src/tests/socket.spec.ts

9 new tests added in a "Heartbeat and reconnect (Issue #95)" describe block. All 17 existing tests continue to pass — 26 tests total, all green.

| Test | What it proves |
| --- | --- |
| server:hello shape (unauthenticated) | Contract fields present, userId absent |
| server:hello shape (authenticated) | userId and authenticated: true present |
| Registry populated on connect | Entry exists with correct userId and timestamps |
| Registry cleaned up on disconnect | Entry removed after disconnect event |
| lastSeenAt updated on event | Timestamp advances when socket emits |
| Stale detection and force-disconnect | Setting lastSeenAt = 0 then calling checkStaleConnections triggers client disconnect |
| Phantom entry cleanup | Registry entry with no live socket is silently removed |
| Room rejoin required after reconnect | Second connection starts fresh; join:round must be called again |
| Rapid connect/disconnect integrity | 3 simultaneous connections and disconnects leave registry clean |

Test Run

PASS src/tests/socket.spec.ts
  Socket.IO Auth & Room Events (Issue #78)
    Socket auth         ✓ ✓ ✓
    Room events         ✓ ✓ ✓ ✓ ✓ ✓
    chat:send           ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
    Heartbeat and reconnect (Issue #95)
                        ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Tests: 26 passed, 26 total

Definition of Done

  • Stale sockets are detected and cleaned up safely
  • Reconnect-related behavior is documented (server:hello) and testable
  • Tests cover at least one disconnect/reconnect path (two paths covered)
  • No regressions to existing authenticated and unauthenticated socket flows

… reconnect contract (TevaLabs#95)

- Export PING_INTERVAL (25s) and PING_TIMEOUT (10s) as named constants;
  these are now passed directly to the Socket.IO server constructor so the
  transport layer enforces the same values the application advertises
- Add ConnectionRecord interface and connectionRegistry Map (socketId →
  record) to track every live connection with connectedAt and lastSeenAt
  timestamps; exported so tests and monitoring can inspect state directly
- Add checkStaleConnections(io, thresholdMs) utility that scans the registry
  and force-disconnects sockets idle beyond the threshold; orphan entries
  (socket already gone, disconnect event never fired) are silently deleted
- Start a 30s periodic stale-check interval inside initializeSocket with
  .unref() so it does not block process exit
- Emit server:hello on every connection advertising pingInterval, pingTimeout,
  authenticated, and userId — defines the reconnection contract: clients
  should reconnect if no ping arrives within pingInterval + pingTimeout ms;
  on reconnect the server treats the socket as fresh and rooms must be
  explicitly re-joined
- Update lastSeenAt via socket.onAny() (application events) and engine-level
  packet 'pong' events (heartbeat replies) for accurate idle tracking
- Remove from registry on disconnect to keep the map clean
- Add 9 new tests in "Heartbeat and reconnect (Issue TevaLabs#95)" suite:
  server:hello shape for unauth/auth, registry populate on connect, registry
  cleanup on disconnect, lastSeenAt update on event, stale detection and
  force-disconnect, phantom entry cleanup, room rejoin required after
  reconnect, rapid connect/disconnect registry integrity
- All 26 socket tests pass with zero regressions
…or in auth

scheduler.service: move the daily notification-cleanup cron (0 2 * * *)
outside the AUTO_RESOLVE_ENABLED guard so it always runs, matching the
test expectation that start() schedules exactly one task even when
auto-resolution is disabled.

auth.routes: split the previously merged !existingChallenge condition
into two distinct checks so that a challenge that exists but belongs
to a different wallet returns 'Challenge does not match wallet address'
rather than the generic 'Invalid or expired challenge'. Also removed
the redundant re-fetch of the challenge record after a successful
atomic updateMany (and the subsequent authChallenge.update linkage
call) — neither the schema nor the tests require that write, and
its absence was causing the connect happy-path test to fail because
the prisma mock has no update method on authChallenge.
…tions

- auth.routes: replace updateMany-first challenge consumption with
  findUnique-first lookup; add explicit isUsed/expired checks; restore
  authChallenge.update to mark challenge used and link userId after
  successful authentication
- validate.middleware: return error: 'Validation Error' as the stable
  error key; the Zod message moves to the message field
- auth.schema: use Zod v4 error param (replaces v3 required_error /
  invalid_type_error) so missing fields produce the expected message
  rather than the generic 'Expected string, received undefined'
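
For illustration, the explicit challenge checks described in the auth.routes notes above might look something like this as a pure function. The two error strings and the `isUsed` flag come from the commit messages; the `walletAddress` and `expiresAt` field names, the order of the checks, and the `challengeError` helper itself are assumptions.

```typescript
// Hypothetical shape of a stored challenge; only isUsed is named in the commits.
interface AuthChallenge {
  walletAddress: string;
  isUsed: boolean;
  expiresAt: Date;
}

// Returns the error message to send, or null when the challenge is usable.
function challengeError(
  challenge: AuthChallenge | null,
  walletAddress: string,
  now: Date = new Date(),
): string | null {
  // No record at all: generic message.
  if (!challenge) return 'Invalid or expired challenge';
  // Record exists but was issued for a different wallet: specific message.
  if (challenge.walletAddress !== walletAddress) {
    return 'Challenge does not match wallet address';
  }
  // Already consumed or past its expiry: generic message again.
  if (challenge.isUsed || challenge.expiresAt.getTime() <= now.getTime()) {
    return 'Invalid or expired challenge';
  }
  return null;
}
```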