fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect#1367
fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect#1367nesquena-hermes wants to merge 1 commit into
Conversation
…al reconnect Follow-up to v0.50.249 / PR #1365. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect: every minute it would close the EventSource, clear itself, and call startClarifyPolling(sid) again — even on a perfectly healthy connection. Comment claimed it was a 'no event in 60s' detector but the implementation didn't track event timestamps. Effects of the bug: - One TCP/SSE setup+teardown per minute per active session - Each reconnect causes a clarify._lock round-trip (subscribe + unsubscribe) and a fresh 'initial' snapshot push from the server - No correctness loss (recoverable via the next initial event), just waste Fix: track lastEventAt on initial+clarify event arrivals; only reconnect when the gap exceeds 60s. The server sends keepalive comments every 30s but EventSource silently consumes those — we can only bump lastEventAt on real application events. Under healthy conditions (real events on submit/resolve plus the periodic keepalives) the timer never fires. Caught by Opus pre-release review of v0.50.249 as SHOULD-FIX #2. Originally reset out of v0.50.249 because the reviewer flagged it as out-of-scope; now reapplied per the user's guidance that correctness-improving changes should ship even when out of scope.
nesquena
left a comment
There was a problem hiding this comment.
Review — end-to-end ✅ (clean approve, behavioural harness confirms timer logic)
Single-commit follow-up to v0.50.249 / PR #1365. Replaces the unconditional 60s force-reconnect on the clarify SSE connection with a stale-detector that only reconnects when no real event has arrived in 60s. Frontend-only — static/messages.js, +15/-1.
What this ships
The clarify SSE health timer originally shipped in PR #1355 (part of v0.50.249 / merged via #1365 commit d72399a) at static/messages.js:1726-1734 was an unconditional 60s reconnect:
// Pre-fix (from PR #1355):
_clarifyHealthTimer = setInterval(function() {
if (_clarifyEventSource) {
try { _clarifyEventSource.close(); } catch(_){}
_clarifyEventSource = null;
}
clearInterval(_clarifyHealthTimer); _clarifyHealthTimer = null;
startClarifyPolling(sid);
}, 60000);The comment claimed "if no event received in 60s, reconnect" but the code didn't track event timestamps — every 60s, regardless of whether events arrived, it would close + clear + restart.
The fix at static/messages.js:1722-1727 introduces _lastClarifyEventAt bumped on every 'initial' and 'clarify' event, with the timer body early-returning when the gap is < 60s:
let _lastClarifyEventAt = Date.now();
const _markClarifyEvent = () => { _lastClarifyEventAt = Date.now(); };
_clarifyEventSource.addEventListener('initial', _markClarifyEvent);
_clarifyEventSource.addEventListener('clarify', _markClarifyEvent);
_clarifyHealthTimer = setInterval(function() {
if (Date.now() - _lastClarifyEventAt < 60000) return;
/* … existing reconnect logic … */
}, 60000);Traced against upstream hermes-agent
Frontend-only change. No upstream interaction.
End-to-end trace
- Connection open:
startClarifyPolling(sid)at static/messages.js:1679 creates a newEventSource('/api/clarify/stream?…')and registers four listeners: existing'initial'and'clarify'handlers (lines 1691, 1699), plus the two new_markClarifyEventhandlers (lines 1724, 1725). - Server pushes an
'initial'event immediately (post-fix #1365 deadlock fix at routes.py:2906-2924), then'clarify'events on everysubmit_pending/resolve_clarify. The keepalive: keepalive\n\ncomment lines arrive every 30s but are silently consumed by EventSource per the HTML Living Standard — comment lines don't firemessageor named events. - Health timer fires every 60s. With at least one real event in the last 60s, the timer early-returns at line 1727. With no events, it closes the EventSource at 1728-1731, clears itself at 1732, and recursively calls
startClarifyPolling(sid)at 1733 — which callsstopClarifyPolling()first (already null after the explicit clear) then opens a fresh connection. - Connection close paths (turn end, session change, error):
stopClarifyPolling()at lines 943, 1021, 1060, 1101, etc. clears_clarifyEventSource,_clarifyFallbackTimer, and_clarifyHealthTimer. Old listeners on the closed EventSource are dead → garbage-collectable along with the captured_lastClarifyEventAtclosure. onerrorpath at line 1707: callsstopClarifyPolling()then_startClarifyFallbackPoll(sid). Health timer can never run after this since it was cleared.
Race / lock analysis
- Timer self-clear vs reconnect: at line 1732 the body calls
clearInterval(_clarifyHealthTimer)(still pointing at the OLD interval) → sets_clarifyHealthTimer = null→ callsstartClarifyPolling(sid)which sets_clarifyHealthTimerto a NEW interval. The OLD interval's clearInterval already fired before reassignment, so the OLD one won't fire again. ✅ - Closure scope:
_lastClarifyEventAtislet-scoped to each call ofstartClarifyPolling. Each connection gets its own variable; old closures captured by old listeners are independent. No cross-contamination. ✅ - Listener registration order: the two new
_markClarifyEventlisteners are registered AFTER the existing data-handler listeners. Browsers fire all listeners on each event, so the data handlers run first, then_markClarifyEvent. Order doesn't matter for correctness —_markClarifyEventonly mutates_lastClarifyEventAt. Even if the data handler throws inside itstry { } catch(e) {},_markClarifyEventstill bumps the timestamp because it's a separate listener. ✅ - No new shared state across threads — single-threaded JS event loop.
Behavioural harness — timer logic at simulated time
Extracted the gate logic into a Node simulation:
Test 1: events every 5s (healthy connection)
reconnects: 0 (expected 0) ✅
Test 2: silent server (no events for 600s)
reconnects: 10 (expected ~10) ✅
log: reconnect at t=60000, 120000, …, 600000
Test 3: bursty events every 30s
reconnects: 0 (expected 0) ✅
Pre-fix behavior (no lastEventAt gate) would have produced 10 reconnects in all three scenarios — confirming the strict improvement.
Other audit — confirmed correct
- ✅ No new endpoints / config / env vars. Frontend-only.
- ✅ No XSS surface —
_markClarifyEventreads no event data; just bumps an integer timestamp. - ✅ No memory leak — old EventSource closed → its listeners (and their closure-captured
_lastClarifyEventAt) become unreachable. - ✅ No regression in
onerrorpath — the timer never fires afterstopClarifyPolling(). Fallback poll still runs at 3s. - ✅ Compat with v0.50.249 deadlock fix — built on master
d72399awhich contains my PR #1365 deadlock fix604b44a. The handler at api/routes.py:2906-2924 inlines_gateway_queuesreads under_clarify_lockinstead of calling the recursiveget_clarify_pending. Verifiedtests/test_pr1355_sse_handler_no_deadlock.pyis present and passing.
Edge-case matrix
| Scenario | Pre-fix behavior | Post-fix behavior |
|---|---|---|
| Healthy connection, events every 5s | Reconnects every 60s anyway | Never reconnects ✅ |
| Bursty events every 30s | Reconnects every 60s | Never reconnects ✅ |
| Silent server, no events at all | Reconnects every 60s | Reconnects every 60s ✅ (correct — assumes broken) |
| Idle long-running turn (no clarify needed) | Reconnects every 60s | Reconnects every 60s (rare, bounded by turn lifetime) |
EventSource fires onerror |
Timer cleared via stopClarifyPolling, fallback poll starts |
Same ✅ |
| Page hidden / tab backgrounded | Browser may throttle setInterval (>1s); EventSource may suspend | Behavior unchanged ✅ |
JSON.parse(ev.data) throws on malformed event |
Data handler catches, but timer still resets (separate listener) | Same ✅ |
submit_pending followed by resolve_clarify (full lifecycle) |
Two events fire → lastEventAt bumped twice |
Timer never fires for that connection ✅ |
Tests
- Full pytest suite: 3395 passed, 54 skipped, 3 xpassed, 0 failed in 16.39s on
1a5fe35. (Includes the 3 deadlock regression tests from PR #1365's fix.) - No new pytest tests — JS timer behavior is hard to assert in static-source analysis. The behavioral harness above confirms the gate logic.
- CI: 3.11 ✅ pass, 3.12 ✅ pass, 3.13 was pending at fetch time but the pattern matches prior PRs that turned green within seconds.
Minor observations (non-blocking)
- Idle but healthy connection still reconnects every 60s: when a turn is long-running and no clarify is needed, the SSE produces only the
'initial'event (at lastEventAt = T0). At T0+60s the timer fires and reconnects. Strictly improved over the pre-fix behavior (which always reconnected) but not perfect for idle dormancy. Could be addressed by checking_clarifyEventSource.readyState === EventSource.OPENand using a longer threshold (e.g. 5 min). Out of scope for this fix. - No JS unit test for timer logic. A jsdom + fake timers harness could lock this in (
vi.useFakeTimers()style). Worth adding if the codebase grows a JS test infrastructure; standalone Node harness above is sufficient validation for review purposes. - Regex on
_clarifyHealthTimerintests/test_clarify_sse.py:test_frontend_has_health_timerstill passes — the marker name is preserved. - Comment block (lines 1712-1721) is well-explained and references "Opus pre-release review of v0.50.249" — clear traceability to why the change was made. Future maintainers will understand the intent.
Recommendation
✅ Approved. Strict improvement over the pre-fix unconditional reconnect: healthy connections no longer churn TCP/SSE setup once per minute. Logic verified via behavioural harness against three traffic patterns (healthy / silent / bursty). No race or memory-leak concerns with the closure-scoped _lastClarifyEventAt and the recursive setInterval → clearInterval → startClarifyPolling → setInterval pattern. Built on a master that already contains the v0.50.249 deadlock fix, so this PR is independently shippable.
Parked at approval — ready for the release agent's merge/tag pipeline.
|
Shipped in v0.50.250 via batch release PR #1368 (merge Live at https://github.com/nesquena/hermes-webui/releases/tag/v0.50.250. The clarify SSE health timer is now an actual stale-detector. Tracks This was the SHOULD-FIX #2 from the v0.50.249 pre-release Opus review, originally pulled out of #1365 as out-of-scope, brought back per the rule that small correctness-improving fixes ship even when flagged out-of-scope (mistake-log entry on the v0.50.249 reset incident at |
…al reconnect (nesquena#1367) Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX nesquena#2. Originally reset out of nesquena#1365 because the reviewer flagged it as out-of-scope; brought back per follow-up guidance that correctness-improving changes should ship even when out of scope. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the 'no event in 60s' detector its comment claimed. Now actually a stale-detector that tracks lastEventAt on initial+clarify event arrivals; only reconnects when the gap exceeds 60s. Under healthy conditions the timer never fires. Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>
Bundles 2 PRs: - nesquena#1366 fix: guard finalizeThinkingCard with session ID check (with pre-release fix) - nesquena#1367 fix(clarify-sse): stale-detector health timer (Opus SHOULD-FIX from v0.50.249) Pre-release fix on nesquena#1366: the contributor's guard depends on liveAssistantTurn.dataset.sessionId, but no code in the repo sets that attribute. Without the fix, the guard would always early-return (undefined !== sid is always true), breaking the streaming UI completely — every assistant turn's thinking card would stay open forever. Added per-site stamps at all 3 places that create liveAssistantTurn in static/ui.js, plus a regression test that fails any future creation site that forgets the stamp.
…al reconnect (nesquena#1367) Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX #2. Originally reset out of nesquena#1365 because the reviewer flagged it as out-of-scope; brought back per follow-up guidance that correctness-improving changes should ship even when out of scope. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the 'no event in 60s' detector its comment claimed. Now actually a stale-detector that tracks lastEventAt on initial+clarify event arrivals; only reconnects when the gap exceeds 60s. Under healthy conditions the timer never fires. Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>
Bundles 2 PRs: - nesquena#1366 fix: guard finalizeThinkingCard with session ID check (with pre-release fix) - nesquena#1367 fix(clarify-sse): stale-detector health timer (Opus SHOULD-FIX from v0.50.249) Pre-release fix on nesquena#1366: the contributor's guard depends on liveAssistantTurn.dataset.sessionId, but no code in the repo sets that attribute. Without the fix, the guard would always early-return (undefined !== sid is always true), breaking the streaming UI completely — every assistant turn's thinking card would stay open forever. Added per-site stamps at all 3 places that create liveAssistantTurn in static/ui.js, plus a regression test that fails any future creation site that forgets the stamp.
Follow-up to v0.50.249 / PR #1365 — clarify SSE stale-detector
Polish PR addressing Opus SHOULD-FIX #2 from the v0.50.249 pre-release review. Originally reset out of #1365 because @nesquena flagged it as out-of-scope; bringing it back per follow-up guidance that correctness-improving changes should ship even when out of scope.
Problem
The clarify SSE health timer at
static/messages.js:1715was an unconditional 60s force-reconnect, not the "no event in 60s" detector its comment claimed:Every 60s, regardless of whether events arrived: close, clear, restart. Effects on healthy connections:
clarify._lockround-trip (subscribe + unsubscribe) + a freshinitialsnapshot push from the serverinitialevent) — just waste, plus an inaccurate commentFix
Track
lastEventAtoninitial+clarifyevent arrivals; only reconnect when the gap exceeds 60s. The server sends keepalive comments every 30s, but EventSource silently consumes those (browser parser eats:comment lines). SolastEventAtonly bumps on real application events.Under healthy conditions:
_clarify_sse_notifyevents on everysubmit_pending/resolve_clarify(head/empty-state)lastEventAtupdates on every real event → gap stays under 60s → timer never firesWhen the server is genuinely silent for 60s (e.g. silently broken mid-stream, proxy ate the connection), reconnect kicks in and rebuilds via the
initialsnapshot.Tests
Risk
Low. If the new
lastEventAtgate has a logic bug, worst case is the old behavior returns (force-reconnect every 60s). The fallback path (onerror→ 3s polling) is unchanged.Compat notes
static/messages.js)