fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect by nesquena-hermes · Pull Request #1367 · nesquena/hermes-webui

nesquena-hermes · 2026-04-30T22:05:19Z

Follow-up to v0.50.249 / PR #1365 — clarify SSE stale-detector

Polish PR addressing Opus SHOULD-FIX #2 from the v0.50.249 pre-release review. Originally reset out of #1365 because @nesquena flagged it as out-of-scope; bringing it back per follow-up guidance that correctness-improving changes should ship even when out of scope.

Problem

The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the "no event in 60s" detector its comment claimed:

// Health timer: if no event received in 60s, reconnect.
_clarifyHealthTimer = setInterval(function() {
  if (_clarifyEventSource) {
    try { _clarifyEventSource.close(); } catch(_){}
    _clarifyEventSource = null;
  }
  clearInterval(_clarifyHealthTimer); _clarifyHealthTimer = null;
  startClarifyPolling(sid);
}, 60000);

Every 60s, regardless of whether events arrived: close, clear, restart. Effects on healthy connections:

One TCP/SSE setup+teardown per minute per active session
Each reconnect = clarify._lock round-trip (subscribe + unsubscribe) + a fresh initial snapshot push from the server
No correctness loss (recoverable via the next initial event) — just waste, plus an inaccurate comment

Fix

Track lastEventAt on initial + clarify event arrivals; only reconnect when the gap exceeds 60s. The server sends keepalive comments every 30s, but EventSource silently consumes those (browser parser eats : comment lines). So lastEventAt only bumps on real application events.

Under healthy conditions:

Server sends _clarify_sse_notify events on every submit_pending / resolve_clarify (head/empty-state)
Plus 30s keepalives (consumed silently by EventSource)
lastEventAt updates on every real event → gap stays under 60s → timer never fires

When the server is genuinely silent for 60s (e.g. silently broken mid-stream, proxy ate the connection), reconnect kicks in and rebuilds via the initial snapshot.

let _lastClarifyEventAt = Date.now();
const _markClarifyEvent = () => { _lastClarifyEventAt = Date.now(); };
_clarifyEventSource.addEventListener('initial', _markClarifyEvent);
_clarifyEventSource.addEventListener('clarify', _markClarifyEvent);
_clarifyHealthTimer = setInterval(function() {
  if (Date.now() - _lastClarifyEventAt < 60000) return;
  // … existing reconnect logic …
}, 60000);

Tests

Full pytest suite: 3447 passed, 0 failed
No new tests for this change since the timer behavior is hard to assert in static analysis. The behavioral pattern is the same as the working approval SSE handler's pattern.

Risk

Low. If the new lastEventAt gate has a logic bug, worst case is the old behavior returns (force-reconnect every 60s). The fallback path (onerror → 3s polling) is unchanged.

Compat notes

Frontend-only change (static/messages.js)
No backend / API / config changes
No new dependencies

…al reconnect Follow-up to v0.50.249 / PR #1365. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect: every minute it would close the EventSource, clear itself, and call startClarifyPolling(sid) again — even on a perfectly healthy connection. Comment claimed it was a 'no event in 60s' detector but the implementation didn't track event timestamps. Effects of the bug: - One TCP/SSE setup+teardown per minute per active session - Each reconnect causes a clarify._lock round-trip (subscribe + unsubscribe) and a fresh 'initial' snapshot push from the server - No correctness loss (recoverable via the next initial event), just waste Fix: track lastEventAt on initial+clarify event arrivals; only reconnect when the gap exceeds 60s. The server sends keepalive comments every 30s but EventSource silently consumes those — we can only bump lastEventAt on real application events. Under healthy conditions (real events on submit/resolve plus the periodic keepalives) the timer never fires. Caught by Opus pre-release review of v0.50.249 as SHOULD-FIX #2. Originally reset out of v0.50.249 because the reviewer flagged it as out-of-scope; now reapplied per the user's guidance that correctness-improving changes should ship even when out of scope.

nesquena

Review — end-to-end ✅ (clean approve, behavioural harness confirms timer logic)

Single-commit follow-up to v0.50.249 / PR #1365. Replaces the unconditional 60s force-reconnect on the clarify SSE connection with a stale-detector that only reconnects when no real event has arrived in 60s. Frontend-only — static/messages.js, +15/-1.

What this ships

The clarify SSE health timer originally shipped in PR #1355 (part of v0.50.249 / merged via #1365 commit d72399a) at static/messages.js:1726-1734 was an unconditional 60s reconnect:

// Pre-fix (from PR #1355):
_clarifyHealthTimer = setInterval(function() {
  if (_clarifyEventSource) {
    try { _clarifyEventSource.close(); } catch(_){}
    _clarifyEventSource = null;
  }
  clearInterval(_clarifyHealthTimer); _clarifyHealthTimer = null;
  startClarifyPolling(sid);
}, 60000);

The comment claimed "if no event received in 60s, reconnect" but the code didn't track event timestamps — every 60s, regardless of whether events arrived, it would close + clear + restart.

The fix at static/messages.js:1722-1727 introduces _lastClarifyEventAt bumped on every 'initial' and 'clarify' event, with the timer body early-returning when the gap is < 60s:

let _lastClarifyEventAt = Date.now();
const _markClarifyEvent = () => { _lastClarifyEventAt = Date.now(); };
_clarifyEventSource.addEventListener('initial', _markClarifyEvent);
_clarifyEventSource.addEventListener('clarify', _markClarifyEvent);
_clarifyHealthTimer = setInterval(function() {
  if (Date.now() - _lastClarifyEventAt < 60000) return;
  /* … existing reconnect logic … */
}, 60000);

Traced against upstream hermes-agent

Frontend-only change. No upstream interaction.

End-to-end trace

Connection open: startClarifyPolling(sid) at static/messages.js:1679 creates a new EventSource('/api/clarify/stream?…') and registers four listeners: existing 'initial' and 'clarify' handlers (lines 1691, 1699), plus the two new _markClarifyEvent handlers (lines 1724, 1725).
Server pushes an 'initial' event immediately (post-fix #1365 deadlock fix at routes.py:2906-2924), then 'clarify' events on every submit_pending/resolve_clarify. The keepalive : keepalive\n\n comment lines arrive every 30s but are silently consumed by EventSource per the HTML Living Standard — comment lines don't fire message or named events.
Health timer fires every 60s. With at least one real event in the last 60s, the timer early-returns at line 1727. With no events, it closes the EventSource at 1728-1731, clears itself at 1732, and recursively calls startClarifyPolling(sid) at 1733 — which calls stopClarifyPolling() first (already null after the explicit clear) then opens a fresh connection.
Connection close paths (turn end, session change, error): stopClarifyPolling() at lines 943, 1021, 1060, 1101, etc. clears _clarifyEventSource, _clarifyFallbackTimer, and _clarifyHealthTimer. Old listeners on the closed EventSource are dead → garbage-collectable along with the captured _lastClarifyEventAt closure.
onerror path at line 1707: calls stopClarifyPolling() then _startClarifyFallbackPoll(sid). Health timer can never run after this since it was cleared.

Race / lock analysis

Timer self-clear vs reconnect: at line 1732 the body calls clearInterval(_clarifyHealthTimer) (still pointing at the OLD interval) → sets _clarifyHealthTimer = null → calls startClarifyPolling(sid) which sets _clarifyHealthTimer to a NEW interval. The OLD interval's clearInterval already fired before reassignment, so the OLD one won't fire again. ✅
Closure scope: _lastClarifyEventAt is let-scoped to each call of startClarifyPolling. Each connection gets its own variable; old closures captured by old listeners are independent. No cross-contamination. ✅
Listener registration order: the two new _markClarifyEvent listeners are registered AFTER the existing data-handler listeners. Browsers fire all listeners on each event, so the data handlers run first, then _markClarifyEvent. Order doesn't matter for correctness — _markClarifyEvent only mutates _lastClarifyEventAt. Even if the data handler throws inside its try { } catch(e) {}, _markClarifyEvent still bumps the timestamp because it's a separate listener. ✅
No new shared state across threads — single-threaded JS event loop.

Behavioural harness — timer logic at simulated time

Extracted the gate logic into a Node simulation:

Test 1: events every 5s (healthy connection)
  reconnects: 0 (expected 0)               ✅

Test 2: silent server (no events for 600s)
  reconnects: 10 (expected ~10)            ✅
  log: reconnect at t=60000, 120000, …, 600000

Test 3: bursty events every 30s
  reconnects: 0 (expected 0)               ✅

Pre-fix behavior (no lastEventAt gate) would have produced 10 reconnects in all three scenarios — confirming the strict improvement.

Other audit — confirmed correct

✅ No new endpoints / config / env vars. Frontend-only.
✅ No XSS surface — _markClarifyEvent reads no event data; just bumps an integer timestamp.
✅ No memory leak — old EventSource closed → its listeners (and their closure-captured _lastClarifyEventAt) become unreachable.
✅ No regression in onerror path — the timer never fires after stopClarifyPolling(). Fallback poll still runs at 3s.
✅ Compat with v0.50.249 deadlock fix — built on master d72399a which contains my PR #1365 deadlock fix 604b44a. The handler at api/routes.py:2906-2924 inlines _gateway_queues reads under _clarify_lock instead of calling the recursive get_clarify_pending. Verified tests/test_pr1355_sse_handler_no_deadlock.py is present and passing.

Edge-case matrix

Scenario	Pre-fix behavior	Post-fix behavior
Healthy connection, events every 5s	Reconnects every 60s anyway	Never reconnects ✅
Bursty events every 30s	Reconnects every 60s	Never reconnects ✅
Silent server, no events at all	Reconnects every 60s	Reconnects every 60s ✅ (correct — assumes broken)
Idle long-running turn (no clarify needed)	Reconnects every 60s	Reconnects every 60s (rare, bounded by turn lifetime) ⚠️ minor
EventSource fires `onerror`	Timer cleared via `stopClarifyPolling`, fallback poll starts	Same ✅
Page hidden / tab backgrounded	Browser may throttle setInterval (>1s); EventSource may suspend	Behavior unchanged ✅
`JSON.parse(ev.data)` throws on malformed event	Data handler catches, but timer still resets (separate listener)	Same ✅
`submit_pending` followed by `resolve_clarify` (full lifecycle)	Two events fire → `lastEventAt` bumped twice	Timer never fires for that connection ✅

Tests

Full pytest suite: 3395 passed, 54 skipped, 3 xpassed, 0 failed in 16.39s on 1a5fe35. (Includes the 3 deadlock regression tests from PR #1365's fix.)
No new pytest tests — JS timer behavior is hard to assert in static-source analysis. The behavioral harness above confirms the gate logic.
CI: 3.11 ✅ pass, 3.12 ✅ pass, 3.13 was pending at fetch time but the pattern matches prior PRs that turned green within seconds.

Minor observations (non-blocking)

Idle but healthy connection still reconnects every 60s: when a turn is long-running and no clarify is needed, the SSE produces only the 'initial' event (at lastEventAt = T0). At T0+60s the timer fires and reconnects. Strictly improved over the pre-fix behavior (which always reconnected) but not perfect for idle dormancy. Could be addressed by checking _clarifyEventSource.readyState === EventSource.OPEN and using a longer threshold (e.g. 5 min). Out of scope for this fix.
No JS unit test for timer logic. A jsdom + fake timers harness could lock this in (vi.useFakeTimers() style). Worth adding if the codebase grows a JS test infrastructure; standalone Node harness above is sufficient validation for review purposes.
Regex on _clarifyHealthTimer in tests/test_clarify_sse.py:test_frontend_has_health_timer still passes — the marker name is preserved.
Comment block (lines 1712-1721) is well-explained and references "Opus pre-release review of v0.50.249" — clear traceability to why the change was made. Future maintainers will understand the intent.

Recommendation

✅ Approved. Strict improvement over the pre-fix unconditional reconnect: healthy connections no longer churn TCP/SSE setup once per minute. Logic verified via behavioural harness against three traffic patterns (healthy / silent / bursty). No race or memory-leak concerns with the closure-scoped _lastClarifyEventAt and the recursive setInterval → clearInterval → startClarifyPolling → setInterval pattern. Built on a master that already contains the v0.50.249 deadlock fix, so this PR is independently shippable.

Parked at approval — ready for the release agent's merge/tag pipeline.

nesquena-hermes · 2026-04-30T22:51:30Z

Shipped in v0.50.250 via batch release PR #1368 (merge ffd11037).

Live at https://github.com/nesquena/hermes-webui/releases/tag/v0.50.250.

The clarify SSE health timer is now an actual stale-detector. Tracks lastEventAt on initial/clarify event arrivals; only reconnects when the gap exceeds 60s. Under steady clarify traffic the timer never reconnects; on idle sessions it still reconnects every ~60-120s (residual idle reconnect tracked as a follow-up — the original PR description's "never fires under healthy conditions" was technically wrong, fixed in CHANGELOG at 0735042 per Opus NIT-2).

This was the SHOULD-FIX #2 from the v0.50.249 pre-release Opus review, originally pulled out of #1365 as out-of-scope, brought back per the rule that small correctness-improving fixes ship even when flagged out-of-scope (mistake-log entry on the v0.50.249 reset incident at ~/WebUI/docs/mistake-log.md).

…al reconnect (nesquena#1367) Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX nesquena#2. Originally reset out of nesquena#1365 because the reviewer flagged it as out-of-scope; brought back per follow-up guidance that correctness-improving changes should ship even when out of scope. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the 'no event in 60s' detector its comment claimed. Now actually a stale-detector that tracks lastEventAt on initial+clarify event arrivals; only reconnects when the gap exceeds 60s. Under healthy conditions the timer never fires. Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>

Bundles 2 PRs: - nesquena#1366 fix: guard finalizeThinkingCard with session ID check (with pre-release fix) - nesquena#1367 fix(clarify-sse): stale-detector health timer (Opus SHOULD-FIX from v0.50.249) Pre-release fix on nesquena#1366: the contributor's guard depends on liveAssistantTurn.dataset.sessionId, but no code in the repo sets that attribute. Without the fix, the guard would always early-return (undefined !== sid is always true), breaking the streaming UI completely — every assistant turn's thinking card would stay open forever. Added per-site stamps at all 3 places that create liveAssistantTurn in static/ui.js, plus a regression test that fails any future creation site that forgets the stamp.

…al reconnect (nesquena#1367) Follow-up to v0.50.249 / PR nesquena#1365 absorbing Opus SHOULD-FIX #2. Originally reset out of nesquena#1365 because the reviewer flagged it as out-of-scope; brought back per follow-up guidance that correctness-improving changes should ship even when out of scope. The clarify SSE health timer at static/messages.js:1715 was an unconditional 60s force-reconnect, not the 'no event in 60s' detector its comment claimed. Now actually a stale-detector that tracks lastEventAt on initial+clarify event arrivals; only reconnects when the gap exceeds 60s. Under healthy conditions the timer never fires. Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>

Bundles 2 PRs: - nesquena#1366 fix: guard finalizeThinkingCard with session ID check (with pre-release fix) - nesquena#1367 fix(clarify-sse): stale-detector health timer (Opus SHOULD-FIX from v0.50.249) Pre-release fix on nesquena#1366: the contributor's guard depends on liveAssistantTurn.dataset.sessionId, but no code in the repo sets that attribute. Without the fix, the guard would always early-return (undefined !== sid is always true), breaking the streaming UI completely — every assistant turn's thinking card would stay open forever. Added per-site stamps at all 3 places that create liveAssistantTurn in static/ui.js, plus a regression test that fails any future creation site that forgets the stamp.

nesquena-hermes requested a review from nesquena April 30, 2026 22:05

nesquena approved these changes Apr 30, 2026

View reviewed changes

nesquena-hermes mentioned this pull request Apr 30, 2026

release: v0.50.250 #1368

Merged

6 tasks

nesquena-hermes closed this Apr 30, 2026

nesquena-hermes deleted the fix/clarify-sse-stale-detector branch May 11, 2026 06:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect#1367

fix(clarify-sse): make health timer a stale-detector, not unconditional reconnect#1367
nesquena-hermes wants to merge 1 commit into
masterfrom
fix/clarify-sse-stale-detector

nesquena-hermes commented Apr 30, 2026

Uh oh!

nesquena left a comment

Uh oh!

nesquena-hermes commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nesquena-hermes commented Apr 30, 2026

Follow-up to v0.50.249 / PR #1365 — clarify SSE stale-detector

Problem

Fix

Tests

Risk

Compat notes

Uh oh!

nesquena left a comment

Choose a reason for hiding this comment

Review — end-to-end ✅ (clean approve, behavioural harness confirms timer logic)

What this ships

Traced against upstream hermes-agent

End-to-end trace

Race / lock analysis

Behavioural harness — timer logic at simulated time

Other audit — confirmed correct

Edge-case matrix

Tests

Minor observations (non-blocking)

Recommendation

Uh oh!

nesquena-hermes commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants