feat(streaming): wake the agent server-side when a notify_on_complete background task finishes while the WebUI tab/session is idle#2242
Conversation
|
Self-review Q1-Q4 from internal reviewer (decisions, ahead of upstream review): Q1 Event name — keeping Q2 /api/process-complete-ack auth — keeping current (mirror /api/goal-continue ownership check). Endpoint is diagnostic-only; the real auth-bearing path is /api/chat/start. Q3 watch-pattern UX — keeping single event channel; will branch toast text in frontend on Q4 hermes-agent ±5 LOC companion patch — deferring indefinitely. WebUI-side env export is sufficient and matches the CLI/gateway pattern. Heads-up for maintainers: PR #2279 was opened today by another contributor as the Option A drain implementation for issue #2278. We are rebasing this PR on top of #2279 to position it as the Option B proactive-notification layer (per maintainer endorsement in #2278: "Option B can land later if there is demand"). #2262 compression-marker fix attribution stays with #2279. Updated PR body + integrated branch coming after local verification. |
|
Maintainer signal on Q1–Q4 so you can move forward: DecisionsQ1 — Event name Q2 — Q3 — Single event channel + frontend branching on Q4 — Defer hermes-agent companion patch: approved. WebUI-side env export matches the CLI/gateway pattern, no need for an agent-side change in this slice. StatusPR #2279 already merged (May 15) as the Option A drain layer. Path forward for this PR:
If the rebase is clean, no maintainer intervention needed — just push the rebase + flip out of draft and we'll review. Ping back if any of the Q1–Q4 decisions affect implementation enough that you need to revisit before rebasing. |
d168100 to
4358d2f
Compare
There was a problem hiding this comment.
Pull request overview
This PR layers proactive server-side agent wakeup ("Option Z") and a persistent per-session SSE live-view channel ("Option X") on top of the recently merged upstream PR #2279 (next-turn drain for terminal(notify_on_complete=true)). It adds a new api/background_process.py module containing a drain thread that reads tools.process_registry.completion_queue and, when the owning session is idle, starts a wakeup agent turn directly server-side — making the "fire a background task, close the tab, come back later" use case work the way CLI/gateway hosts already do. It also adds several follow-ups: an SSE write-deadline to prevent thread exhaustion, a model-resolve cache-only fast path to avoid a Copilot token-exchange hang during cold-cache wakeups, and a defer-race teardown hook so fast background tasks that complete during a turn's teardown window are redelivered when the session goes idle.
Changes:
- New
api/background_process.py(drain thread,SessionChannelregistry + reaper, deferred-wakeup persistence, server-side wakeup starter) and a new/api/session/streamSSE endpoint andstart_session_turnheadless entrypoint inapi/routes.py. api/config.pygainsget_available_models(prefer_cache=True)+ a bounded live-rebuild worker (_LIVE_REBUILD_BUDGET_SECONDS,_minimal_static_models_catalog) and new state (DEFERRED_PROCESS_WAKEUPS,PROCESS_SESSION_INDEX,SESSION_CHANNEL_*TTLs);api/streaming.pyadds_sse_set_write_deadline, env wiring forHERMES_SESSION_CHAT_ID, and a turn-teardown drain hook.- Frontend (
static/messages.js,static/sessions.js) opens/closes a per-session SSE stream, dedupes process_complete events module-wide, and attaches the existingattachLiveStreamrenderer when the server fans aserver_turn_startedframe.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
api/background_process.py |
New module: drain thread, SessionChannel + reaper, deferred-wakeup persistence, server-side wakeup runner. |
api/config.py |
Adds prefer_cache mode, bounded rebuild worker, minimal static catalog, and new state buckets for wakeup/SessionChannel. |
api/routes.py |
Adds start_session_turn, /api/session/stream + /api/process-complete-ack handlers, threads prefer_cached_catalog, arms SSE write deadlines, fans server_turn_started. |
api/streaming.py |
Adds _sse_set_write_deadline, exports HERMES_SESSION_CHAT_ID, registers per-session process mapping, calls turn-teardown deferred-wakeup drain. |
server.py |
Starts/stops the drain thread and SessionChannel reaper. |
static/messages.js |
Module-scope dedupe set, process_complete listener, startSessionStream/stopSessionStream, server_turn_started attach-existing-renderer logic. |
static/sessions.js |
Wires session-stream start/stop into newSession/loadSession. |
CHANGELOG.md |
Long entries describing the Option Z pivot, Option X channel, model-resolve fix, defer-race fix, and SSE backpressure fix. |
tests/test_session_channel_option_x.py, test_process_complete_wakeup.py, test_process_complete_ab_coexistence.py, test_optionz_liveview_perf.py, test_wakeup_model_resolve_hang.py, test_wakeup_defer_race.py, manual_repro_wakeup_hang.py |
New tests for SessionChannel lifecycle, AB-coexistence dedupe, live-view fan-out + SSE backpressure, model-resolve hang fix, and defer-race teardown hook. |
tests/test_bugbatch_apr2026.py |
Source-grep updated for new get_available_models( signature. |
wakeup-model-resolve-hang-fix.md, wakeup-defer-race-fix.md, optionz-liveview-perf-fix.md |
New design/repro notes (markdown only). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| The user restarted the real instance (pid 765), ran an empty background task | ||
| with `notify_on_complete=true`. The `process_complete` event fired |
There was a problem hiding this comment.
Thanks — agreed. These three files were working notes, not maintained project documentation, and the substance is already captured (operator details stripped) in the [Unreleased] section of CHANGELOG.md. I've removed all three from the repo root rather than relocating them, since nothing in them is unique to keep. Done in commit a76c050.
| "ok": True, | ||
| "session_id": s.session_id, | ||
| "process_id": pid, | ||
| "pending_consumed": False, # frontend re-POSTs /api/chat/start to actually consume |
There was a problem hiding this comment.
Good catch — that field and comment were stale after the Option Z pivot and contradicted the function's own docstring. Since the endpoint is now a pure no-op for state (the wakeup turn is started server-side; the browser never re-POSTs /api/chat/start), I've dropped "pending_consumed": False and replaced it with "noop": True plus a comment that matches the documented behavior. I verified nothing reads the old field first — the frontend fires the ack fire-and-forget (.catch(() => {})) without inspecting the body, and no test asserts it. Fixed in commit cee897a.
| for stream_id, channel in items: | ||
| meta = _cfg.ACTIVE_RUNS.get(stream_id) if hasattr(_cfg, "ACTIVE_RUNS") else None | ||
| owner_sid = (meta or {}).get("session_id") if isinstance(meta, dict) else None | ||
| # When no ACTIVE_RUNS row matches, fall back to broadcasting — the | ||
| # frontend ignores events for the wrong session_id by inspecting the | ||
| # payload (data.session_id !== activeSid). | ||
| if owner_sid and owner_sid != session_id: | ||
| continue | ||
| try: | ||
| channel.put_nowait((event, data)) | ||
| emitted += 1 | ||
| except Exception: | ||
| logger.debug("process_complete emit failed for stream %s", stream_id, exc_info=True) |
There was a problem hiding this comment.
Resolved by skipping non-matching streams when the owner is unknown — the option you preferred. _emit_to_session_streams now does if owner_sid != session_id: continue (previously if owner_sid and owner_sid != session_id), so a stream with no matching ACTIVE_RUNS row is no longer broadcast to, removing the cross-session-leak surface.
This was justified with a deterministic Playwright reproduction of the related open-tab live-view defect: the per-session SessionChannel (emitted unconditionally below this loop), now backed by an on-subscribe recovery replay, is the sole authoritative cross-turn / server-initiated live-view carrier after the Option X/Z work. The STREAMS loop is in-turn defense-in-depth only and is never the path that delivers a between-turns or server-initiated wakeup, so the broadcast fallback is not load-bearing for any live-view path. Covered by regression tests (owner-unknown skip, matching-owner still delivered, cross-session isolation). Fixed in commit 29cdce2.
| if process_id: | ||
| try: | ||
| from tools.process_registry import process_registry as _pr | ||
| with _pr._lock: | ||
| _pr._completion_consumed.add(process_id) | ||
| except Exception: | ||
| logger.debug("Failed to mark process completion consumed on B drain", exc_info=True) |
There was a problem hiding this comment.
Addressed with the defensive option (b). The read side already uses the public process_registry.is_completion_consumed(); only the write side reached into the private _lock / _completion_consumed, because there is no public mark_completion_consumed() upstream. I've extracted _mark_registry_completion_consumed() which documents the coupling and narrows the exception handling: ImportError stays best-effort (the registry is legitimately absent in non-agent unit-test contexts), but an AttributeError/TypeError from a future rename or retype is now logged at ERROR with an explicit contract-violation message instead of being swallowed silently at DEBUG. A future rename therefore fails loud rather than silently reintroducing the double-wakeup.
I also added a contract test (test_registry_completion_consumed_contract) plus a fail-loud test in tests/test_process_complete_ab_coexistence.py that pin the expected ProcessRegistry private surface, so an upstream rename breaks CI here visibly. Fixed in commit 9f3c7b6.
For the longer term, option (a) — a public mark_completion_consumed(process_id) / is_completion_consumed(process_id) pair on ProcessRegistry upstream — is the durable fix; I'm happy to submit that upstream and switch this call site to it if you'd like.
|
|
||
| Run: | ||
| /home/islaliu/.hermes/hermes-agent/venv/bin/python \ | ||
| tests/manual_repro_wakeup_hang.py | ||
| """ | ||
|
|
There was a problem hiding this comment.
Moved it out of the test root: git mv tests/manual_repro_wakeup_hang.py tests/manual/repro_wakeup_hang.py. I stripped the hardcoded absolute interpreter path from the docstring (the run instruction is now parametric — python tests/manual/repro_wakeup_hang.py from the repo root) and fixed its repo-root sys.path computation for the deeper location.
For the collection exclusion I used collect_ignore_glob = ["manual/*.py"] in the existing top-level tests/conftest.py rather than a nested tests/manual/conftest.py: a second conftest module collides with the suite's top-level from conftest import TEST_BASE/TEST_STATE_DIR imports (the suite uses rootdir-relative module names, no packages). Verified pytest --collect-only tests/ collects cleanly with no collection errors and the manual script no longer surfaces. Fixed in commit 3b61772.
…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
0ad2fde to
d6753e5
Compare
…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
|
Rebased onto latest Rebase summary
Verification on the rebased tree
Ready for independent maintainer review of the sensitive streaming paths. |
…ion X per-session live-view SSE on top of merged nesquena#2279 Layered ON TOP of the now-merged upstream PR nesquena#2279 (Option A next-turn drain). This branch carries ZERO nesquena#2279 code — nesquena#2279 (_drain_webui_process _notifications / _format_process_notification / _mark_process_completion _consumed / HERMES_SESSION_ID+PLATFORM env wiring / drain call-site / and intentionally NOT re-added here. Our-original contributions only: - api/background_process.py (NEW): completion-queue drain thread; Option Z server-side wakeup primary (_start_server_side_wakeup_turn, _session_has_active_turn); Option X SessionChannel registry + reaper + /api/session/stream fan-out; defer-race persistence (record/claim/drain_deferred_wakeups_for_session). Idempotency aligns to the REAL upstream dedupe contract: process_registry._completion _consumed via the same private-marker bridge nesquena#2279 uses, plus the ours-original PROCESS_COMPLETE_EVENTS_SEEN secondary gate. - api/config.py: PROCESS_SESSION_INDEX, PENDING_PROCESS_COMPLETIONS, PROCESS_COMPLETE_EVENTS_SEEN, DEFERRED_PROCESS_WAKEUPS; model-resolve hang fix (get_available_models(prefer_cache=), _minimal_static_models _catalog, bounded _LIVE_REBUILD_BUDGET_SECONDS); SessionChannel TTLs. - api/streaming.py: HERMES_SESSION_CHAT_ID env wiring + restore (the HERMES_SESSION_ID/PLATFORM lines are upstream nesquena#2279, kept once); register_process_session bind at chat start; SSE_WRITE_DEADLINE_SECONDS + _sse_set_write_deadline (Defect A backpressure); defer-race turn-teardown drain_deferred_wakeups_for_session idle-hook. - api/routes.py: start_session_turn() reusable headless entrypoint + server_turn_started live-view fan-out (Defect B); _handle_session_sse _stream (/api/session/stream); _handle_process_complete_ack diagnostic; PENDING_PROCESS_COMPLETIONS import + discard hook; prefer_cached_catalog threaded through _resolve_compatible_session_model_state (integrated with upstream nesquena#1855 fast-path — both kept); _sse_set_write_deadline on all 6 SSE endpoints. - server.py: drain + SessionChannel reaper start/stop (integrated with upstream drain_all_on_shutdown, both run). - static/messages.js + sessions.js: per-session live-view EventSource, _handleProcessCompleteEvent shared handler, server_turn_started -> attachLiveStream, startSessionStream/stopSessionStream lifecycle. Closed-tab + fast/slow self-wake + live-view + no-double-fire-with-nesquena#2279 + no-loop preserved.
…ab (lost fire-and-forget server_turn_started)
Root cause (Playwright-reproduced, see liveview-open-tab-fix.md §1): the server_turn_started fan-out in start_session_turn is a fire-and-forget SessionChannel.emit with no replay buffer. A tab whose /api/session/stream subscriber is momentarily absent at the emit instant (transient SSE drop / reverse-proxy idle-timeout / browser conn-pool starvation) misses it permanently, so a server-initiated wakeup never renders live (needs hard refresh). Server-side wakeup itself ran+persisted fine; only live-view was lost.
Fix: background_process.active_stream_id_for_session() + on-subscribe recovery in _handle_session_sse_stream replays a synthetic server_turn_started {recovered:true} to a freshly-subscribed tab; messages.js honours recovered via the reconnecting (replay) attach path. Reuses the single attachLiveStream renderer; idempotent with the original frame.
…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
…lot nesquena#3 owner-unknown skip 8 new tests in test_optionz_liveview_perf.py: active_stream_id_for_session lookup; on-subscribe recovery behavioural contract; handler + frontend (recovered->reconnecting) wiring grep; Copilot nesquena#3 skip-owner-unknown behaviour, owner-match regression guard, cross-session isolation, source-grep. 40/40 in liveview+optionx; 37/37 wakeup-critical green.
… Copilot nesquena#3 resolution
…iew nesquena#1) The 3 top-level fix-notes (wakeup-model-resolve-hang-fix.md, wakeup-defer-race-fix.md, optionz-liveview-perf-fix.md) were internal working notes with local pids, absolute machine paths, and task tokens. Their substance is already release-note captured in CHANGELOG.md [Unreleased]; nothing unique is lost. They do not belong in the repo.
…omment (Copilot nesquena#2) The Option Z pivot made /api/process-complete-ack a pure no-op for state (the server-side drain thread starts the wakeup turn; the browser never re-POSTs /api/chat/start). The response field "pending_consumed": False with the comment "frontend re-POSTs /api/chat/start to actually consume" directly contradicted the function's own docstring and the actual frontend (which fires the ack fire-and-forget and never reads the body). Replace it with an honest "noop": True that matches the documented behavior. No caller reads the old field (frontend .catch()s the fetch; no test asserts it).
…me; use/assert the dedupe contract (Copilot nesquena#4) _process_one() reached into process_registry._lock / _completion_consumed inside a broad `except Exception` that swallowed an AttributeError, so a future upstream rename of those privates would silently reintroduce the cross-A/B double-wakeup bug. There is no public mark_completion_consumed() upstream (only the READ side is public via is_completion_consumed(), already used here), so per Copilot's option (b): - Extract _mark_registry_completion_consumed() that documents the coupling and narrows exception handling: ImportError stays best-effort (registry legitimately absent off-agent), but AttributeError/TypeError from a renamed/retyped private now logs at ERROR with an explicit contract-violation message instead of a silent DEBUG. - Add _REGISTRY_CONSUMED_CONTRACT + two tests in test_process_complete_ab_coexistence.py: test_registry_completion_consumed_contract pins the private surface so an upstream rename breaks CI loudly HERE; the fail-loud test asserts a rename surfaces as ERROR not silent DEBUG. An upstream public mark_completion_consumed() remains the durable fix (noted in code + the PR reply as a maintainer suggestion). 72/72 wakeup-critical + changed-area tests green.
…Copilot nesquena#5) manual_repro_wakeup_hang.py sat in the pytest test root with a hardcoded /home/islaliu/.hermes/... venv path in its docstring. pytest never ran it (no test_ functions) but it was confusing. - git mv tests/manual_repro_wakeup_hang.py -> tests/manual/repro_wakeup_hang.py - Strip the operator-specific absolute interpreter path from the docstring; the run instruction is now parametric ("python tests/manual/repro_wakeup_hang.py" from repo root) and fix the sys.path repo-root math for the now-deeper location. - Exclude tests/manual/ from collection via collect_ignore_glob in the existing tests/conftest.py rather than a nested conftest.py: a second conftest module would collide with the suite's top-level 'from conftest import TEST_BASE/TEST_STATE_DIR' imports (no packages / rootdir-relative module names). Verified: pytest --collect-only tests/ collects 6014 tests, 0 collection errors, the manual script no longer surfaces.
…esquena#5 follow-up) Copilot review nesquena#5 'strip local paths' missed one residue: tests/test_wakeup_model_resolve_hang.py:3-4 still embedded the reviewer's machine path /home/islaliu/.hermes/webui/bootstrap-8787.log. Rewrote lines 3-4 to a neutral description ('a live thread-stack capture of a hung wakeup turn') — root-cause explanation itself unchanged. Full re-grep of tests/ + tests/manual/ confirms this was the only remaining residue. 7/7 tests pass.
…ider-catalog rebuild Multi-tab streaming interlock RCA (kanban t_d127953d). The two display-only resolvers _resolve_effective_session_model_for_display / _resolve_effective_session_model_provider_for_display called _resolve_compatible_session_model_state() without prefer_cached_catalog=True. For sessions with no persisted model_provider (kanban/imported), the fast path is skipped and it falls through to a full live get_available_models() cold rebuild (botocore IMDS ~4s on non-AWS/WSL + anthropic/openrouter /models ~0.9s). GET /api/session is a hot, side-effect-free per-tab/per-poll path; concurrent tabs serialized on the models-cache lock and starved SSE/streaming -> BrokenPipe/Cancelled storm. POST /api/chat/start wakeup path already passed prefer_cached_catalog=True; the GET display path was the missed blind spot. Fix: pass prefer_cached_catalog=True at both display resolvers (persisted session model authoritative; network-free minimal catalog backstops the default). One logical change, 2 call sites. ~7742ms -> 133-148ms (~52x); display-path live rebuilds N -> 0. - tests/test_session_display_resolver_no_live_rebuild.py: new invariant tripwire regression (4 tests; pre-fix 3 fail proving non-tautology) - tests/test_provider_mismatch.py: stub signature aligned (lambda **_kw)
…h.json fingerprint auth.json is rewritten by credential-pool/OAuth token refresh roughly every 14 minutes. _models_cache_source_fingerprint() hashed it via mtime/size (nesquena#1699 _models_cache_file_fingerprint), so every token refresh churned the fingerprint and the 24h /api/models cache was effectively dead -- the hot GET /api/session?resolve_model=1 path paid a cold ~11.5s rebuild every few minutes (RCA t_d127953d residual nesquena#2, t_16551f61). Add _auth_store_semantic_fingerprint(): content-hash auth.json with a DENY-list of known credential-rotation-only keys (access/refresh token, expiry, per-credential status/telemetry, request_count, save updated_at) stripped. Deny-list (not allow-list) is deliberate -- any unknown field, or a real provider/endpoint/model-set change (active_provider, a new credential_pool entry, base_url, source, label, auth_type, the providers{} block, ...) stays in the fingerprint and still correctly busts the cache. Conservative fallbacks: missing file -> marked; unreadable/corrupt -> stat-based fallback (never less safe than pre-fix). config.yaml keeps the cheap stat fingerprint (deliberate edits, no timer churn). Bidirectional invariant regression test (non-tautological -- the end-to-end churn test flips RED when the auth_json axis is reverted to stat-based): token-only churn keeps fingerprint byte-identical AND keeps a valid disk cache loadable; active_provider change / new credential_pool entry / changed base_url each flip the fingerprint AND reject the stale disk cache. Measured: 5/5 cold rebuilds per 5 refresh cycles -> 0/5. Tests: 9 new pass; 28 adjacent (nesquena#1699/nesquena#1633/display-resolver) pass; 54 models_cache/fingerprint suite pass.
…n wakeup misroute root fix, Option 1) RCA t_f62ff1e8: WebUI bound per-turn session identity only to the process-global os.environ['HERMES_SESSION_KEY'] and released the env lock before the agent ran; it never set any contextvar, so get_current_session_key fell back to that racy slot. Two concurrent WebUI turns raced on one slot -> a notify_on_complete spawn captured a concurrent turn's session_id -> server-side wakeup turn started for the wrong session (agent.log:6632). _set/_reset/_bind_turn_session_identity bind the turn's sid to tools.approval._approval_session_key + gateway.session_context._SESSION_KEY in the _run_agent_streaming worker thread (concurrent tool batches inherit via copy_context); reset-token restore in the same outer finally as the env restore. Does NOT use blanket set_session_vars (would break the notify_on_complete watcher gate). English-only source (no-cjk guard); pre-init ordered to preserve the Issue nesquena#765 locator. Core unchanged. Regression: tests/test_xsession_wakeup_misroute.py (RED pre / GREEN post).
…session misroute defense-in-depth, Option 3) Independent safety net at the wakeup-routing layer, decoupled from the Option 1 root fix (different file/layer, separately revertible). On _process_one, after PROCESS_SESSION_INDEX resolves the target, _resolve_wakeup_target cross-checks it against an env-immune spawn owner on the ProcessSession (forward-compatible duck-typing of spawn_session_id/owner_session_id/turn_session_id since adding the field to core ProcessSession is out of scope for this WebUI-only change). Positive mismatch -> log ERROR + re-route the wakeup to the true owner. Pure pass-through when no env-immune owner is available (today's core, cron/CLI procs, pre-Option-1 spawns) so Option Z is never suppressed on uncertainty. Regression covered by tests/test_xsession_wakeup_misroute.py.
d76801c to
64b8294
Compare
…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
@nesquena-hermes @nesquena |
…ena#2550/nesquena#2551/nesquena#2554/nesquena#2560/nesquena#2567/nesquena#2568) into process-complete-event PR # Conflicts: # static/sessions.js
…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
1de6e8a to
6458542
Compare
Restores green CI for tests/test_xsession_wakeup_misroute.py and tests/test_process_complete_ab_coexistence.py::test_registry_completion_consumed_contract and tests/test_session_channel_option_x.py::test_real_completion_event_shape_routes_to_session_channel on Python 3.11 / 3.12 / 3.13. Failure: `ModuleNotFoundError: No module named 'tools'` from imports `from tools.process_registry import ...` and `from tools.approval import ...`. Tests depend on the hermes-agent sibling project's `tools/` namespace; they now skip when hermes-agent is not installed, matching the repo's existing agent-dependent test convention. Co-authored-by: Hermes Agent <hermes-agent@users.noreply.github.com>

Thinking Path
terminal(background=true, notify_on_complete=true)task wakes the agent when it finishes, so the agent can act on the result on its own. The CLI and the gateway hosts (Telegram, Discord, etc.) already do this by draining the completion queue inside their own run loops.What Changed
This PR layers a proactive, server-side wakeup on top of the merged #2279
next-turn drain. It carries none of #2279's code (that is upstream now);
it is rebased onto current
master.api/routes.py::start_session_turn) that reuses the exact session-load → workspace → model/provider → worker-thread path/api/chat/startalready uses. No second agent-construction path is introduced. If a turn is already active, it does not start a competing turn and instead leaves the completion for the merged Fix WebUI stream completion recovery gaps #2279 next-turn drain.GET /api/session/stream). A long-lived SSE channel keyed on the WebUIsession_id(so it survives across turns, unlike the per-turn chat stream). A reaper thread collects channels with no subscribers after a grace period and idle channels after a configurable TTL; a channel with a live subscriber is never collected. When a turn is started server-side, an open tab attaches the existing chat-stream renderer to it and shows it live; a closed tab simply gets the persisted result on next load (parity with CLI behavior).Why It Matters
The headline use case for
notify_on_completeis "start something long, walkaway, let the agent pick it up when it's done." Before this PR a WebUI user
who did that — especially with the tab closed — got nothing: the agent
appeared to hang forever. After this PR the agent wakes itself on the server
the moment the task completes, an open tab sees that turn render live, and a
closed tab finds the completed turn already persisted on next load. WebUI
now matches the wakeup behavior the CLI and gateway hosts have always had.
Relationship to #2279
PR #2279 (merged into
master) is the next-turn safety-net drain — itdelivers a pending completion at the start of the next user turn. This PR is
the proactive server-side layer on top of it: it wakes the agent without
waiting for a next user turn and without a browser round-trip. The two share a
single completion-consumed dedupe key (the contract #2279 established), so a
completion wakes the agent exactly once regardless of which path fires first;
neither double-fires and there is no wakeup loop. This PR carries none of
#2279's code.
Refs #2278 (root issue — WebUI never fires
notify_on_completewakeups;closed by #2279 for the next-turn case, completed here for the
idle/closed-tab case). Builds on #2279. The #2262 compression-marker fix was
shipped upstream via #2279 and is intentionally not re-added here.
Verification
masterbaseline.notify_on_completebackground task with the tab closed, and an open tab rendered the server-started turn live (verified with Playwright).git diff --checkclean; touched modules py-compile clean.Risks / Follow-ups
session_idand survives across turns; the reaper TTL is the upper bound on an abandoned channel's lifetime.Model Used
AI-assisted change with repository inspection, source-level root-cause
analysis, targeted editing, automated regression suites, and live-instance
verification. AI assisted with code, tests, verification, and PR text.