Skip to content

fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect#1367

Closed
nesquena-hermes wants to merge 1 commit into
masterfrom
fix/clarify-sse-stale-detector
Closed

fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect#1367
nesquena-hermes wants to merge 1 commit into
masterfrom
fix/clarify-sse-stale-detector

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Follow-up to v0.50.249 / PR #1365 — clarify SSE stale-detector

Polish PR addressing Opus SHOULD-FIX #2 from the v0.50.249 pre-release review. Originally reset out of #1365 because @nesquena flagged it as out-of-scope; bringing it back per follow-up guidance that correctness-improving changes should ship even when out of scope.

Problem

The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the "no event in 60s" detector its comment claimed:

// Health timer: if no event received in 60s, reconnect.
_clarifyHealthTimer = setInterval(function() {
  if (_clarifyEventSource) {
    try { _clarifyEventSource.close(); } catch(_){}
    _clarifyEventSource = null;
  }
  clearInterval(_clarifyHealthTimer); _clarifyHealthTimer = null;
  startClarifyPolling(sid);
}, 60000);

Every 60s, regardless of whether events arrived: close, clear, restart. Effects on healthy connections:

  • One TCP/SSE setup+teardown per minute per active session
  • Each reconnect = clarify._lock round-trip (subscribe + unsubscribe) + a fresh initial snapshot push from the server
  • No correctness loss (recoverable via the next initial event) — just waste, plus an inaccurate comment

Fix

Track lastEventAt on initial + clarify event arrivals; only reconnect when the gap exceeds 60s. The server sends keepalive comments every 30s, but EventSource silently consumes those (browser parser eats : comment lines). So lastEventAt only bumps on real application events.

Under healthy conditions:

  • Server sends _clarify_sse_notify events on every submit_pending / resolve_clarify (head/empty-state)
  • Plus 30s keepalives (consumed silently by EventSource)
  • lastEventAt updates on every real event → gap stays under 60s → timer never fires

When the server is genuinely silent for 60s (e.g. silently broken mid-stream, proxy ate the connection), reconnect kicks in and rebuilds via the initial snapshot.

let _lastClarifyEventAt = Date.now();
const _markClarifyEvent = () => { _lastClarifyEventAt = Date.now(); };
_clarifyEventSource.addEventListener('initial', _markClarifyEvent);
_clarifyEventSource.addEventListener('clarify', _markClarifyEvent);
_clarifyHealthTimer = setInterval(function() {
  if (Date.now() - _lastClarifyEventAt < 60000) return;
  // … existing reconnect logic …
}, 60000);

Tests

  • Full pytest suite: 3447 passed, 0 failed
  • No new tests for this change since the timer behavior is hard to assert in static analysis. The behavioral pattern is the same as the working approval SSE handler's pattern.

Risk

Low. If the new lastEventAt gate has a logic bug, worst case is the old behavior returns (force-reconnect every 60s). The fallback path (onerror → 3s polling) is unchanged.

Compat notes

  • Frontend-only change (static/messages.js)
  • No backend / API / config changes
  • No new dependencies

…al reconnect

Follow-up to v0.50.249 / PR #1365. The clarify SSE health timer at
static/messages.js:1715 was an unconditional 60s force-reconnect:
every minute it would close the EventSource, clear itself, and call
startClarifyPolling(sid) again — even on a perfectly healthy
connection. Comment claimed it was a 'no event in 60s' detector but
the implementation didn't track event timestamps.

Effects of the bug:
- One TCP/SSE setup+teardown per minute per active session
- Each reconnect causes a clarify._lock round-trip (subscribe +
  unsubscribe) and a fresh 'initial' snapshot push from the server
- No correctness loss (recoverable via the next initial event), just
  waste

Fix: track lastEventAt on initial+clarify event arrivals; only
reconnect when the gap exceeds 60s. The server sends keepalive
comments every 30s but EventSource silently consumes those — we can
only bump lastEventAt on real application events. Under healthy
conditions (real events on submit/resolve plus the periodic
keepalives) the timer never fires.

Caught by Opus pre-release review of v0.50.249 as SHOULD-FIX #2.
Originally reset out of v0.50.249 because the reviewer flagged it as
out-of-scope; now reapplied per the user's guidance that
correctness-improving changes should ship even when out of scope.
Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (clean approve, behavioural harness confirms timer logic)

Single-commit follow-up to v0.50.249 / PR #1365. Replaces the unconditional 60s force-reconnect on the clarify SSE connection with a stale-detector that only reconnects when no real event has arrived in 60s. Frontend-only — static/messages.js, +15/-1.

What this ships

The clarify SSE health timer originally shipped in PR #1355 (part of v0.50.249 / merged via #1365 commit d72399a) at static/messages.js:1726-1734 was an unconditional 60s reconnect:

// Pre-fix (from PR #1355):
_clarifyHealthTimer = setInterval(function() {
  if (_clarifyEventSource) {
    try { _clarifyEventSource.close(); } catch(_){}
    _clarifyEventSource = null;
  }
  clearInterval(_clarifyHealthTimer); _clarifyHealthTimer = null;
  startClarifyPolling(sid);
}, 60000);

The comment claimed "if no event received in 60s, reconnect" but the code didn't track event timestamps — every 60s, regardless of whether events arrived, it would close + clear + restart.

The fix at static/messages.js:1722-1727 introduces _lastClarifyEventAt bumped on every 'initial' and 'clarify' event, with the timer body early-returning when the gap is < 60s:

let _lastClarifyEventAt = Date.now();
const _markClarifyEvent = () => { _lastClarifyEventAt = Date.now(); };
_clarifyEventSource.addEventListener('initial', _markClarifyEvent);
_clarifyEventSource.addEventListener('clarify', _markClarifyEvent);
_clarifyHealthTimer = setInterval(function() {
  if (Date.now() - _lastClarifyEventAt < 60000) return;
  /* … existing reconnect logic … */
}, 60000);

Traced against upstream hermes-agent

Frontend-only change. No upstream interaction.

End-to-end trace

  1. Connection open: startClarifyPolling(sid) at static/messages.js:1679 creates a new EventSource('/api/clarify/stream?…') and registers four listeners: existing 'initial' and 'clarify' handlers (lines 1691, 1699), plus the two new _markClarifyEvent handlers (lines 1724, 1725).
  2. Server pushes an 'initial' event immediately (post-fix #1365 deadlock fix at routes.py:2906-2924), then 'clarify' events on every submit_pending/resolve_clarify. The keepalive : keepalive\n\n comment lines arrive every 30s but are silently consumed by EventSource per the HTML Living Standard — comment lines don't fire message or named events.
  3. Health timer fires every 60s. With at least one real event in the last 60s, the timer early-returns at line 1727. With no events, it closes the EventSource at 1728-1731, clears itself at 1732, and recursively calls startClarifyPolling(sid) at 1733 — which calls stopClarifyPolling() first (already null after the explicit clear) then opens a fresh connection.
  4. Connection close paths (turn end, session change, error): stopClarifyPolling() at lines 943, 1021, 1060, 1101, etc. clears _clarifyEventSource, _clarifyFallbackTimer, and _clarifyHealthTimer. Old listeners on the closed EventSource are dead → garbage-collectable along with the captured _lastClarifyEventAt closure.
  5. onerror path at line 1707: calls stopClarifyPolling() then _startClarifyFallbackPoll(sid). Health timer can never run after this since it was cleared.

Race / lock analysis

  • Timer self-clear vs reconnect: at line 1732 the body calls clearInterval(_clarifyHealthTimer) (still pointing at the OLD interval) → sets _clarifyHealthTimer = null → calls startClarifyPolling(sid) which sets _clarifyHealthTimer to a NEW interval. The OLD interval's clearInterval already fired before reassignment, so the OLD one won't fire again. ✅
  • Closure scope: _lastClarifyEventAt is let-scoped to each call of startClarifyPolling. Each connection gets its own variable; old closures captured by old listeners are independent. No cross-contamination. ✅
  • Listener registration order: the two new _markClarifyEvent listeners are registered AFTER the existing data-handler listeners. Browsers fire all listeners on each event, so the data handlers run first, then _markClarifyEvent. Order doesn't matter for correctness — _markClarifyEvent only mutates _lastClarifyEventAt. Even if the data handler throws inside its try { } catch(e) {}, _markClarifyEvent still bumps the timestamp because it's a separate listener. ✅
  • No new shared state across threads — single-threaded JS event loop.

Behavioural harness — timer logic at simulated time

Extracted the gate logic into a Node simulation:

Test 1: events every 5s (healthy connection)
  reconnects: 0 (expected 0)               ✅

Test 2: silent server (no events for 600s)
  reconnects: 10 (expected ~10)            ✅
  log: reconnect at t=60000, 120000, …, 600000

Test 3: bursty events every 30s
  reconnects: 0 (expected 0)               ✅

Pre-fix behavior (no lastEventAt gate) would have produced 10 reconnects in all three scenarios — confirming the strict improvement.

Other audit — confirmed correct

  • No new endpoints / config / env vars. Frontend-only.
  • No XSS surface_markClarifyEvent reads no event data; just bumps an integer timestamp.
  • No memory leak — old EventSource closed → its listeners (and their closure-captured _lastClarifyEventAt) become unreachable.
  • No regression in onerror path — the timer never fires after stopClarifyPolling(). Fallback poll still runs at 3s.
  • Compat with v0.50.249 deadlock fix — built on master d72399a which contains my PR #1365 deadlock fix 604b44a. The handler at api/routes.py:2906-2924 inlines _gateway_queues reads under _clarify_lock instead of calling the recursive get_clarify_pending. Verified tests/test_pr1355_sse_handler_no_deadlock.py is present and passing.

Edge-case matrix

Scenario Pre-fix behavior Post-fix behavior
Healthy connection, events every 5s Reconnects every 60s anyway Never reconnects ✅
Bursty events every 30s Reconnects every 60s Never reconnects ✅
Silent server, no events at all Reconnects every 60s Reconnects every 60s ✅ (correct — assumes broken)
Idle long-running turn (no clarify needed) Reconnects every 60s Reconnects every 60s (rare, bounded by turn lifetime) ⚠️ minor
EventSource fires onerror Timer cleared via stopClarifyPolling, fallback poll starts Same ✅
Page hidden / tab backgrounded Browser may throttle setInterval (>1s); EventSource may suspend Behavior unchanged ✅
JSON.parse(ev.data) throws on malformed event Data handler catches, but timer still resets (separate listener) Same ✅
submit_pending followed by resolve_clarify (full lifecycle) Two events fire → lastEventAt bumped twice Timer never fires for that connection ✅

Tests

  • Full pytest suite: 3395 passed, 54 skipped, 3 xpassed, 0 failed in 16.39s on 1a5fe35. (Includes the 3 deadlock regression tests from PR #1365's fix.)
  • No new pytest tests — JS timer behavior is hard to assert in static-source analysis. The behavioral harness above confirms the gate logic.
  • CI: 3.11 ✅ pass, 3.12 ✅ pass, 3.13 was pending at fetch time but the pattern matches prior PRs that turned green within seconds.

Minor observations (non-blocking)

  • Idle but healthy connection still reconnects every 60s: when a turn is long-running and no clarify is needed, the SSE produces only the 'initial' event (at lastEventAt = T0). At T0+60s the timer fires and reconnects. Strictly improved over the pre-fix behavior (which always reconnected) but not perfect for idle dormancy. Could be addressed by checking _clarifyEventSource.readyState === EventSource.OPEN and using a longer threshold (e.g. 5 min). Out of scope for this fix.
  • No JS unit test for timer logic. A jsdom + fake timers harness could lock this in (vi.useFakeTimers() style). Worth adding if the codebase grows a JS test infrastructure; standalone Node harness above is sufficient validation for review purposes.
  • Regex on _clarifyHealthTimer in tests/test_clarify_sse.py:test_frontend_has_health_timer still passes — the marker name is preserved.
  • Comment block (lines 1712-1721) is well-explained and references "Opus pre-release review of v0.50.249" — clear traceability to why the change was made. Future maintainers will understand the intent.

Recommendation

Approved. Strict improvement over the pre-fix unconditional reconnect: healthy connections no longer churn TCP/SSE setup once per minute. Logic verified via behavioural harness against three traffic patterns (healthy / silent / bursty). No race or memory-leak concerns with the closure-scoped _lastClarifyEventAt and the recursive setInterval → clearInterval → startClarifyPolling → setInterval pattern. Built on a master that already contains the v0.50.249 deadlock fix, so this PR is independently shippable.

Parked at approval — ready for the release agent's merge/tag pipeline.

@nesquena-hermes nesquena-hermes mentioned this pull request Apr 30, 2026
6 tasks
@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Shipped in v0.50.250 via batch release PR #1368 (merge ffd11037).

Live at https://github.com/nesquena/hermes-webui/releases/tag/v0.50.250.

The clarify SSE health timer is now an actual stale-detector. Tracks lastEventAt on initial/clarify event arrivals; only reconnects when the gap exceeds 60s. Under steady clarify traffic the timer never reconnects; on idle sessions it still reconnects every ~60-120s (residual idle reconnect tracked as a follow-up — the original PR description's "never fires under healthy conditions" was technically wrong, fixed in CHANGELOG at 0735042 per Opus NIT-2).

This was the SHOULD-FIX #2 from the v0.50.249 pre-release Opus review, originally pulled out of #1365 as out-of-scope, brought back per the rule that small correctness-improving fixes ship even when flagged out-of-scope (mistake-log entry on the v0.50.249 reset incident at ~/WebUI/docs/mistake-log.md).

JKJameson pushed a commit to JKJameson/hermes-webui that referenced this pull request Apr 30, 2026
…al reconnect (nesquena#1367)

Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX nesquena#2.
Originally reset out of nesquena#1365 because the reviewer flagged it as
out-of-scope; brought back per follow-up guidance that
correctness-improving changes should ship even when out of scope.

The clarify SSE health timer at static/messages.js:1715 was an
unconditional 60s force-reconnect, not the 'no event in 60s' detector
its comment claimed. Now actually a stale-detector that tracks
lastEventAt on initial+clarify event arrivals; only reconnects when
the gap exceeds 60s. Under healthy conditions the timer never fires.

Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>
JKJameson pushed a commit to JKJameson/hermes-webui that referenced this pull request Apr 30, 2026
Bundles 2 PRs:
- nesquena#1366 fix: guard finalizeThinkingCard with session ID check (with pre-release fix)
- nesquena#1367 fix(clarify-sse): stale-detector health timer (Opus SHOULD-FIX from v0.50.249)

Pre-release fix on nesquena#1366: the contributor's guard depends on
liveAssistantTurn.dataset.sessionId, but no code in the repo sets
that attribute. Without the fix, the guard would always early-return
(undefined !== sid is always true), breaking the streaming UI
completely — every assistant turn's thinking card would stay open
forever. Added per-site stamps at all 3 places that create
liveAssistantTurn in static/ui.js, plus a regression test that fails
any future creation site that forgets the stamp.
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
…al reconnect (nesquena#1367)

Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX #2.
Originally reset out of nesquena#1365 because the reviewer flagged it as
out-of-scope; brought back per follow-up guidance that
correctness-improving changes should ship even when out of scope.

The clarify SSE health timer at static/messages.js:1715 was an
unconditional 60s force-reconnect, not the 'no event in 60s' detector
its comment claimed. Now actually a stale-detector that tracks
lastEventAt on initial+clarify event arrivals; only reconnects when
the gap exceeds 60s. Under healthy conditions the timer never fires.

Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>
GeoffBao pushed a commit to GeoffBao/hermes-webui that referenced this pull request May 1, 2026
Bundles 2 PRs:
- nesquena#1366 fix: guard finalizeThinkingCard with session ID check (with pre-release fix)
- nesquena#1367 fix(clarify-sse): stale-detector health timer (Opus SHOULD-FIX from v0.50.249)

Pre-release fix on nesquena#1366: the contributor's guard depends on
liveAssistantTurn.dataset.sessionId, but no code in the repo sets
that attribute. Without the fix, the guard would always early-return
(undefined !== sid is always true), breaking the streaming UI
completely — every assistant turn's thinking card would stay open
forever. Added per-site stamps at all 3 places that create
liveAssistantTurn in static/ui.js, plus a regression test that fails
any future creation site that forgets the stamp.
@nesquena-hermes nesquena-hermes deleted the fix/clarify-sse-stale-detector branch May 11, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants