feat(streaming): wake the agent server-side when a notify_on_complete background task finishes while the WebUI tab/session is idle by Isla-Liu · Pull Request #2242 · nesquena/hermes-webui

Isla-Liu · 2026-05-14T08:49:28Z

Thinking Path

A core Hermes Agent guarantee is that a terminal(background=true, notify_on_complete=true) task wakes the agent when it finishes, so the agent can act on the result on its own. The CLI and the gateway hosts (Telegram, Discord, etc.) already do this by draining the completion queue inside their own run loops.
In the WebUI this guarantee did not hold. The background process exited, its completion was queued in the agent process registry, and nothing read from it — the agent only resumed when a human typed the next message. The agent had told the user "I'll continue when this finishes," and then silently never did.
The merged PR Fix WebUI stream completion recovery gaps #2279 closed part of this gap with a next-turn safety-net drain: at the start of the next user-initiated turn it drains any pending completions and prepends them to the user's message. That is the correct backstop, but it cannot help the primary case this PR targets: fire a long task, close the tab, come back later — an autonomous agent in that state has no "next user turn," and a closed tab can never re-POST anything to the server.
The structurally correct fix is therefore to start the wakeup turn on the server, with no browser round-trip — exactly the way the CLI and gateway hosts already self-wake — and to keep an open tab able to watch that server-started turn live.

What Changed

This PR layers a proactive, server-side wakeup on top of the merged #2279
next-turn drain. It carries none of #2279's code (that is upstream now);
it is rebased onto current master.

Server-side wakeup turn (primary path). A startup drain thread reads the agent process completion queue and, when the owning session is idle, starts the agent turn directly on the server via a new headless entrypoint (api/routes.py::start_session_turn) that reuses the exact session-load → workspace → model/provider → worker-thread path /api/chat/start already uses. No second agent-construction path is introduced. If a turn is already active, it does not start a competing turn and instead leaves the completion for the merged Fix WebUI stream completion recovery gaps #2279 next-turn drain.
Persistent per-session live-view channel (GET /api/session/stream). A long-lived SSE channel keyed on the WebUI session_id (so it survives across turns, unlike the per-turn chat stream). A reaper thread collects channels with no subscribers after a grace period and idle channels after a configurable TTL; a channel with a live subscriber is never collected. When a turn is started server-side, an open tab attaches the existing chat-stream renderer to it and shows it live; a closed tab simply gets the persisted result on next load (parity with CLI behavior).
SSE write-deadline backpressure. All long-lived SSE endpoints now arm a socket write deadline, so a slow or backgrounded tab whose TCP receive window has filled can no longer pin a server worker thread indefinitely (which previously also blocked the live-view reaper).
Cold model-catalog resolution no longer blocks a wakeup. A server-side wakeup fires exactly when the model catalog is cold; the cold rebuild could make a blocking network call (provider token exchange) that hung the turn on restricted networks. The wakeup now resolves from warm/disk cache or a network-free minimal catalog, and the already-persisted session model always wins. Integrated with the existing upstream model-resolve fast-path.
Fast-task completion race. When a task finishes in the narrow window while a turn is tearing down, the wakeup prompt is persisted and redelivered by a turn-teardown idle hook, making the busy-at-completion case behave the same as the idle case. The claim is atomic, so it fires exactly once and never double-fires with the merged Fix WebUI stream completion recovery gaps #2279 drain.
Cross-session wakeup isolation. Per-turn session identity is bound to a context variable for the duration of the turn (plus a completion-time target cross-check as defense in depth), so with multiple concurrent WebUI sessions a completed background task always wakes the session that launched it, never another open session.
Copilot review comments on this PR have been addressed (agent working notes removed from the repo root, a misleading ack response field corrected, owner-unknown streams no longer broadcast cross-session, the dedupe contract asserted instead of silently swallowing a registry rename, and local absolute paths stripped from a test).

Why It Matters

The headline use case for notify_on_complete is "start something long, walk
away, let the agent pick it up when it's done." Before this PR a WebUI user
who did that — especially with the tab closed — got nothing: the agent
appeared to hang forever. After this PR the agent wakes itself on the server
the moment the task completes, an open tab sees that turn render live, and a
closed tab finds the completed turn already persisted on next load. WebUI
now matches the wakeup behavior the CLI and gateway hosts have always had.

Relationship to #2279

PR #2279 (merged into master) is the next-turn safety-net drain — it
delivers a pending completion at the start of the next user turn. This PR is
the proactive server-side layer on top of it: it wakes the agent without
waiting for a next user turn and without a browser round-trip. The two share a
single completion-consumed dedupe key (the contract #2279 established), so a
completion wakes the agent exactly once regardless of which path fires first;
neither double-fires and there is no wakeup loop. This PR carries none of
#2279's code.

Refs #2278 (root issue — WebUI never fires notify_on_complete wakeups;
closed by #2279 for the next-turn case, completed here for the
idle/closed-tab case). Builds on #2279. The #2262 compression-marker fix was
shipped upstream via #2279 and is intentionally not re-added here.

Verification

Full Python regression suite: 0 regressions vs the post-Fix WebUI stream completion recovery gaps #2279-merge master baseline.
Wakeup-critical suites pass, covering: closed-tab self-wake when idle; fast vs. slow completion (completion-during-teardown); live-view of a server-started turn plus SSE write-deadline backpressure; no double-fire / no loop across the merged Fix WebUI stream completion recovery gaps #2279 and this PR in both interleavings; cold model-catalog resolution surviving a hung network probe; cross-session wakeup isolation under concurrent sessions.
Upstream Fix WebUI stream completion recovery gaps #2279 contract suites remain green and unchanged.
Live-instance verification: a real WebUI instance self-woke on both a fast and a slow notify_on_complete background task with the tab closed, and an open tab rendered the server-started turn live (verified with Playwright).
git diff --check clean; touched modules py-compile clean.

Risks / Follow-ups

Adds two long-lived server threads (the completion drain and the live-view reaper). Both are daemon threads with bounded work; the reaper's grace period and idle TTL are configurable.
The live-view channel is keyed on session_id and survives across turns; the reaper TTL is the upper bound on an abandoned channel's lifetime.
The completion-consumed dedupe key is the contract established by the merged Fix WebUI stream completion recovery gaps #2279; if a future change alters that contract, both the next-turn drain and this proactive path must be updated together.

Model Used

AI-assisted change with repository inspection, source-level root-cause
analysis, targeted editing, automated regression suites, and live-instance
verification. AI assisted with code, tests, verification, and PR text.

Isla-Liu · 2026-05-15T02:18:06Z

Self-review Q1-Q4 from internal reviewer (decisions, ahead of upstream review):

Q1 Event name — keeping process_complete. Matches outline/RFC/tests/CHANGELOG; renaming pre-merge isnt worth the churn.

Q2 /api/process-complete-ack auth — keeping current (mirror /api/goal-continue ownership check). Endpoint is diagnostic-only; the real auth-bearing path is /api/chat/start.

Q3 watch-pattern UX — keeping single event channel; will branch toast text in frontend on data.type as a follow-up commit if maintainers want polish. Splitting events doubles the contract surface for a UX nit.

Q4 hermes-agent ±5 LOC companion patch — deferring indefinitely. WebUI-side env export is sufficient and matches the CLI/gateway pattern.

Heads-up for maintainers: PR #2279 was opened today by another contributor as the Option A drain implementation for issue #2278. We are rebasing this PR on top of #2279 to position it as the Option B proactive-notification layer (per maintainer endorsement in #2278: "Option B can land later if there is demand"). #2262 compression-marker fix attribution stays with #2279. Updated PR body + integrated branch coming after local verification.

nesquena-hermes · 2026-05-17T02:37:43Z

Maintainer signal on Q1–Q4 so you can move forward:

Decisions

Q1 — Event name process_complete: keep as-is. Agreed, renaming pre-merge is churn.

Q2 — /api/process-complete-ack auth model: the ownership check mirroring /api/goal-continue is fine for the diagnostic ack endpoint. The real auth-bearing path is /api/chat/start, as you noted. Approved.

Q3 — Single event channel + frontend branching on data.type: approved. Splitting events would double the contract surface for a UX polish concern. If toast-text variants are wanted later, branching in messages.js based on the payload is the right place.

Q4 — Defer hermes-agent companion patch: approved. WebUI-side env export matches the CLI/gateway pattern, no need for an agent-side change in this slice.

Status

PR #2279 already merged (May 15) as the Option A drain layer. Path forward for this PR:

Rebase on current master. v0.51.78 / v0.51.79 / v0.51.80 have shipped since you opened this. mergeable: CONFLICTING right now.
Reposition the PR body to reflect the new layering: Option B proactive-notification layer on top of Fix WebUI stream completion recovery gaps #2279's drain helper.
CI run. Once rebased and green, mark "ready for review" — we'll route through independent review for the streaming changes (api/streaming.py + api/background_process.py are sensitive paths).

If the rebase is clean, no maintainer intervention needed — just push the rebase + flip out of draft and we'll review.

Ping back if any of the Q1–Q4 decisions affect implementation enough that you need to revisit before rebasing.

Copilot

Pull request overview

This PR layers proactive server-side agent wakeup ("Option Z") and a persistent per-session SSE live-view channel ("Option X") on top of the recently merged upstream PR #2279 (next-turn drain for terminal(notify_on_complete=true)). It adds a new api/background_process.py module containing a drain thread that reads tools.process_registry.completion_queue and, when the owning session is idle, starts a wakeup agent turn directly server-side — making the "fire a background task, close the tab, come back later" use case work the way CLI/gateway hosts already do. It also adds several follow-ups: an SSE write-deadline to prevent thread exhaustion, a model-resolve cache-only fast path to avoid a Copilot token-exchange hang during cold-cache wakeups, and a defer-race teardown hook so fast background tasks that complete during a turn's teardown window are redelivered when the session goes idle.

Changes:

New api/background_process.py (drain thread, SessionChannel registry + reaper, deferred-wakeup persistence, server-side wakeup starter) and a new /api/session/stream SSE endpoint and start_session_turn headless entrypoint in api/routes.py.
api/config.py gains get_available_models(prefer_cache=True) + a bounded live-rebuild worker (_LIVE_REBUILD_BUDGET_SECONDS, _minimal_static_models_catalog) and new state (DEFERRED_PROCESS_WAKEUPS, PROCESS_SESSION_INDEX, SESSION_CHANNEL_* TTLs); api/streaming.py adds _sse_set_write_deadline, env wiring for HERMES_SESSION_CHAT_ID, and a turn-teardown drain hook.
Frontend (static/messages.js, static/sessions.js) opens/closes a per-session SSE stream, dedupes process_complete events module-wide, and attaches the existing attachLiveStream renderer when the server fans a server_turn_started frame.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`api/background_process.py`	New module: drain thread, SessionChannel + reaper, deferred-wakeup persistence, server-side wakeup runner.
`api/config.py`	Adds `prefer_cache` mode, bounded rebuild worker, minimal static catalog, and new state buckets for wakeup/SessionChannel.
`api/routes.py`	Adds `start_session_turn`, `/api/session/stream` + `/api/process-complete-ack` handlers, threads `prefer_cached_catalog`, arms SSE write deadlines, fans `server_turn_started`.
`api/streaming.py`	Adds `_sse_set_write_deadline`, exports `HERMES_SESSION_CHAT_ID`, registers per-session process mapping, calls turn-teardown deferred-wakeup drain.
`server.py`	Starts/stops the drain thread and SessionChannel reaper.
`static/messages.js`	Module-scope dedupe set, `process_complete` listener, `startSessionStream`/`stopSessionStream`, `server_turn_started` attach-existing-renderer logic.
`static/sessions.js`	Wires session-stream start/stop into `newSession`/`loadSession`.
`CHANGELOG.md`	Long entries describing the Option Z pivot, Option X channel, model-resolve fix, defer-race fix, and SSE backpressure fix.
`tests/test_session_channel_option_x.py`, `test_process_complete_wakeup.py`, `test_process_complete_ab_coexistence.py`, `test_optionz_liveview_perf.py`, `test_wakeup_model_resolve_hang.py`, `test_wakeup_defer_race.py`, `manual_repro_wakeup_hang.py`	New tests for SessionChannel lifecycle, AB-coexistence dedupe, live-view fan-out + SSE backpressure, model-resolve hang fix, and defer-race teardown hook.
`tests/test_bugbatch_apr2026.py`	Source-grep updated for new `get_available_models(` signature.
`wakeup-model-resolve-hang-fix.md`, `wakeup-defer-race-fix.md`, `optionz-liveview-perf-fix.md`	New design/repro notes (markdown only).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Isla-Liu · 2026-05-18T05:08:02Z

background task working good behavior image:

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Isla-Liu · 2026-05-18T07:18:22Z

+The user restarted the real instance (pid 765), ran an empty background task
+with `notify_on_complete=true`. The `process_complete` event fired


Thanks — agreed. These three files were working notes, not maintained project documentation, and the substance is already captured (operator details stripped) in the [Unreleased] section of CHANGELOG.md. I've removed all three from the repo root rather than relocating them, since nothing in them is unique to keep. Done in commit a76c050.

Isla-Liu · 2026-05-18T07:18:24Z

+        "ok": True,
+        "session_id": s.session_id,
+        "process_id": pid,
+        "pending_consumed": False,  # frontend re-POSTs /api/chat/start to actually consume


Good catch — that field and comment were stale after the Option Z pivot and contradicted the function's own docstring. Since the endpoint is now a pure no-op for state (the wakeup turn is started server-side; the browser never re-POSTs /api/chat/start), I've dropped "pending_consumed": False and replaced it with "noop": True plus a comment that matches the documented behavior. I verified nothing reads the old field first — the frontend fires the ack fire-and-forget (.catch(() => {})) without inspecting the body, and no test asserts it. Fixed in commit cee897a.

Isla-Liu · 2026-05-18T07:18:26Z

+    for stream_id, channel in items:
+        meta = _cfg.ACTIVE_RUNS.get(stream_id) if hasattr(_cfg, "ACTIVE_RUNS") else None
+        owner_sid = (meta or {}).get("session_id") if isinstance(meta, dict) else None
+        # When no ACTIVE_RUNS row matches, fall back to broadcasting — the
+        # frontend ignores events for the wrong session_id by inspecting the
+        # payload (data.session_id !== activeSid).
+        if owner_sid and owner_sid != session_id:
+            continue
+        try:
+            channel.put_nowait((event, data))
+            emitted += 1
+        except Exception:
+            logger.debug("process_complete emit failed for stream %s", stream_id, exc_info=True)


Resolved by skipping non-matching streams when the owner is unknown — the option you preferred. _emit_to_session_streams now does if owner_sid != session_id: continue (previously if owner_sid and owner_sid != session_id), so a stream with no matching ACTIVE_RUNS row is no longer broadcast to, removing the cross-session-leak surface.

This was justified with a deterministic Playwright reproduction of the related open-tab live-view defect: the per-session SessionChannel (emitted unconditionally below this loop), now backed by an on-subscribe recovery replay, is the sole authoritative cross-turn / server-initiated live-view carrier after the Option X/Z work. The STREAMS loop is in-turn defense-in-depth only and is never the path that delivers a between-turns or server-initiated wakeup, so the broadcast fallback is not load-bearing for any live-view path. Covered by regression tests (owner-unknown skip, matching-owner still delivered, cross-session isolation). Fixed in commit 29cdce2.

Isla-Liu · 2026-05-18T07:18:28Z

+    if process_id:
+        try:
+            from tools.process_registry import process_registry as _pr
+            with _pr._lock:
+                _pr._completion_consumed.add(process_id)
+        except Exception:
+            logger.debug("Failed to mark process completion consumed on B drain", exc_info=True)


Addressed with the defensive option (b). The read side already uses the public process_registry.is_completion_consumed(); only the write side reached into the private _lock / _completion_consumed, because there is no public mark_completion_consumed() upstream. I've extracted _mark_registry_completion_consumed() which documents the coupling and narrows the exception handling: ImportError stays best-effort (the registry is legitimately absent in non-agent unit-test contexts), but an AttributeError/TypeError from a future rename or retype is now logged at ERROR with an explicit contract-violation message instead of being swallowed silently at DEBUG. A future rename therefore fails loud rather than silently reintroducing the double-wakeup.

I also added a contract test (test_registry_completion_consumed_contract) plus a fail-loud test in tests/test_process_complete_ab_coexistence.py that pin the expected ProcessRegistry private surface, so an upstream rename breaks CI here visibly. Fixed in commit 9f3c7b6.

For the longer term, option (a) — a public mark_completion_consumed(process_id) / is_completion_consumed(process_id) pair on ProcessRegistry upstream — is the durable fix; I'm happy to submit that upstream and switch this call site to it if you'd like.

Isla-Liu · 2026-05-18T07:18:30Z

+
+Run:
+  /home/islaliu/.hermes/hermes-agent/venv/bin/python \
+    tests/manual_repro_wakeup_hang.py
+"""
+


Moved it out of the test root: git mv tests/manual_repro_wakeup_hang.py tests/manual/repro_wakeup_hang.py. I stripped the hardcoded absolute interpreter path from the docstring (the run instruction is now parametric — python tests/manual/repro_wakeup_hang.py from the repo root) and fixed its repo-root sys.path computation for the deeper location.

For the collection exclusion I used collect_ignore_glob = ["manual/*.py"] in the existing top-level tests/conftest.py rather than a nested tests/manual/conftest.py: a second conftest module collides with the suite's top-level from conftest import TEST_BASE/TEST_STATE_DIR imports (the suite uses rootdir-relative module names, no packages). Verified pytest --collect-only tests/ collects cleanly with no collection errors and the manual script no longer surfaces. Fixed in commit 3b61772.

…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).

Isla-Liu · 2026-05-19T02:27:56Z

Rebased onto latest origin/master to resolve the merge conflicts (the branch had fallen 41 commits behind).

Rebase summary

Semantic rebase (not merge): 15 commits replayed linearly on origin/master, 0 merge commits, original authorship preserved. New head d6753e55.
Zero semantic loss: git diff <merge-base>..<old-head> --stat is byte-identical to git diff origin/master..<new-head> --stat (20 files, +4935/−20 both sides). The only content delta is one indentation change on a single static/sessions.js line, grafted into upstream's coalesced IIFE.
Conflict files resolved hunk-by-hunk preserving both intents: api/streaming.py (our Option B/Z + Option X SSE alongside upstream's reasoning-replay / fallback-warning / keyless-endpoint changes), api/routes.py (our wakeup/SSE dispatch + the prefer_cached_catalog fix integrated with upstream's bug: /api/chat/start and /api/sessions can wedge for ~125s while health remains responsive #1855 fast-path, both kept), api/config.py, static/sessions.js, static/messages.js, CHANGELOG.md (upstream release sections byte-untouched).
All 5 Copilot review items (Portability #1–fix(css): mobile responsive layout and dvh viewport fix #5) and the prior maintainer Q1–Q4 resolutions are preserved through the rebase.

Verification on the rebased tree

Wakeup-critical + RCA + fingerprint suites: 85/85 pass.
Full suite: 5961 passed; the only 3 failures are tests/test_issue1106_custom_providers_models.py — an upstream-owned test file none of these commits touch. Root-caused (4-phase) as the already-accepted model-resolve-hang budget tripping under this runner's slow DNS: re-running with the documented HERMES_WEBUI_MODELS_REBUILD_BUDGET=0 switch → 17/17 pass. Not a rebase regression; not investigated further per that prior decision.
Live-instance self-wake verified on the rebased code (d6753e55): both the fast path (background process completing during turn-teardown) and the slow path (completing after teardown, idle-hook) trigger a server-initiated wakeup turn that renders live in an already-open tab with no refresh.

Ready for independent maintainer review of the sensitive streaming paths.

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.

…ion X per-session live-view SSE on top of merged nesquena#2279 Layered ON TOP of the now-merged upstream PR nesquena#2279 (Option A next-turn drain). This branch carries ZERO nesquena#2279 code — nesquena#2279 (_drain_webui_process _notifications / _format_process_notification / _mark_process_completion _consumed / HERMES_SESSION_ID+PLATFORM env wiring / drain call-site / and intentionally NOT re-added here. Our-original contributions only: - api/background_process.py (NEW): completion-queue drain thread; Option Z server-side wakeup primary (_start_server_side_wakeup_turn, _session_has_active_turn); Option X SessionChannel registry + reaper + /api/session/stream fan-out; defer-race persistence (record/claim/drain_deferred_wakeups_for_session). Idempotency aligns to the REAL upstream dedupe contract: process_registry._completion _consumed via the same private-marker bridge nesquena#2279 uses, plus the ours-original PROCESS_COMPLETE_EVENTS_SEEN secondary gate. - api/config.py: PROCESS_SESSION_INDEX, PENDING_PROCESS_COMPLETIONS, PROCESS_COMPLETE_EVENTS_SEEN, DEFERRED_PROCESS_WAKEUPS; model-resolve hang fix (get_available_models(prefer_cache=), _minimal_static_models _catalog, bounded _LIVE_REBUILD_BUDGET_SECONDS); SessionChannel TTLs. - api/streaming.py: HERMES_SESSION_CHAT_ID env wiring + restore (the HERMES_SESSION_ID/PLATFORM lines are upstream nesquena#2279, kept once); register_process_session bind at chat start; SSE_WRITE_DEADLINE_SECONDS + _sse_set_write_deadline (Defect A backpressure); defer-race turn-teardown drain_deferred_wakeups_for_session idle-hook. - api/routes.py: start_session_turn() reusable headless entrypoint + server_turn_started live-view fan-out (Defect B); _handle_session_sse _stream (/api/session/stream); _handle_process_complete_ack diagnostic; PENDING_PROCESS_COMPLETIONS import + discard hook; prefer_cached_catalog threaded through _resolve_compatible_session_model_state (integrated with upstream nesquena#1855 fast-path — both kept); _sse_set_write_deadline on all 6 SSE endpoints. - server.py: drain + SessionChannel reaper start/stop (integrated with upstream drain_all_on_shutdown, both run). - static/messages.js + sessions.js: per-session live-view EventSource, _handleProcessCompleteEvent shared handler, server_turn_started -> attachLiveStream, startSessionStream/stopSessionStream lifecycle. Closed-tab + fast/slow self-wake + live-view + no-double-fire-with-nesquena#2279 + no-loop preserved.

…ab (lost fire-and-forget server_turn_started) Root cause (Playwright-reproduced, see liveview-open-tab-fix.md §1): the server_turn_started fan-out in start_session_turn is a fire-and-forget SessionChannel.emit with no replay buffer. A tab whose /api/session/stream subscriber is momentarily absent at the emit instant (transient SSE drop / reverse-proxy idle-timeout / browser conn-pool starvation) misses it permanently, so a server-initiated wakeup never renders live (needs hard refresh). Server-side wakeup itself ran+persisted fine; only live-view was lost. Fix: background_process.active_stream_id_for_session() + on-subscribe recovery in _handle_session_sse_stream replays a synthetic server_turn_started {recovered:true} to a freshly-subscribed tab; messages.js honours recovered via the reconnecting (replay) attach path. Reuses the single attachLiveStream renderer; idempotent with the original frame.

…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).

…lot nesquena#3 owner-unknown skip 8 new tests in test_optionz_liveview_perf.py: active_stream_id_for_session lookup; on-subscribe recovery behavioural contract; handler + frontend (recovered->reconnecting) wiring grep; Copilot nesquena#3 skip-owner-unknown behaviour, owner-match regression guard, cross-session isolation, source-grep. 40/40 in liveview+optionx; 37/37 wakeup-critical green.

… Copilot nesquena#3 resolution

…iew nesquena#1) The 3 top-level fix-notes (wakeup-model-resolve-hang-fix.md, wakeup-defer-race-fix.md, optionz-liveview-perf-fix.md) were internal working notes with local pids, absolute machine paths, and task tokens. Their substance is already release-note captured in CHANGELOG.md [Unreleased]; nothing unique is lost. They do not belong in the repo.

…omment (Copilot nesquena#2) The Option Z pivot made /api/process-complete-ack a pure no-op for state (the server-side drain thread starts the wakeup turn; the browser never re-POSTs /api/chat/start). The response field "pending_consumed": False with the comment "frontend re-POSTs /api/chat/start to actually consume" directly contradicted the function's own docstring and the actual frontend (which fires the ack fire-and-forget and never reads the body). Replace it with an honest "noop": True that matches the documented behavior. No caller reads the old field (frontend .catch()s the fetch; no test asserts it).

…me; use/assert the dedupe contract (Copilot nesquena#4) _process_one() reached into process_registry._lock / _completion_consumed inside a broad `except Exception` that swallowed an AttributeError, so a future upstream rename of those privates would silently reintroduce the cross-A/B double-wakeup bug. There is no public mark_completion_consumed() upstream (only the READ side is public via is_completion_consumed(), already used here), so per Copilot's option (b): - Extract _mark_registry_completion_consumed() that documents the coupling and narrows exception handling: ImportError stays best-effort (registry legitimately absent off-agent), but AttributeError/TypeError from a renamed/retyped private now logs at ERROR with an explicit contract-violation message instead of a silent DEBUG. - Add _REGISTRY_CONSUMED_CONTRACT + two tests in test_process_complete_ab_coexistence.py: test_registry_completion_consumed_contract pins the private surface so an upstream rename breaks CI loudly HERE; the fail-loud test asserts a rename surfaces as ERROR not silent DEBUG. An upstream public mark_completion_consumed() remains the durable fix (noted in code + the PR reply as a maintainer suggestion). 72/72 wakeup-critical + changed-area tests green.

…Copilot nesquena#5) manual_repro_wakeup_hang.py sat in the pytest test root with a hardcoded /home/islaliu/.hermes/... venv path in its docstring. pytest never ran it (no test_ functions) but it was confusing. - git mv tests/manual_repro_wakeup_hang.py -> tests/manual/repro_wakeup_hang.py - Strip the operator-specific absolute interpreter path from the docstring; the run instruction is now parametric ("python tests/manual/repro_wakeup_hang.py" from repo root) and fix the sys.path repo-root math for the now-deeper location. - Exclude tests/manual/ from collection via collect_ignore_glob in the existing tests/conftest.py rather than a nested conftest.py: a second conftest module would collide with the suite's top-level 'from conftest import TEST_BASE/TEST_STATE_DIR' imports (no packages / rootdir-relative module names). Verified: pytest --collect-only tests/ collects 6014 tests, 0 collection errors, the manual script no longer surfaces.

…esquena#5 follow-up) Copilot review nesquena#5 'strip local paths' missed one residue: tests/test_wakeup_model_resolve_hang.py:3-4 still embedded the reviewer's machine path /home/islaliu/.hermes/webui/bootstrap-8787.log. Rewrote lines 3-4 to a neutral description ('a live thread-stack capture of a hung wakeup turn') — root-cause explanation itself unchanged. Full re-grep of tests/ + tests/manual/ confirms this was the only remaining residue. 7/7 tests pass.

…ider-catalog rebuild Multi-tab streaming interlock RCA (kanban t_d127953d). The two display-only resolvers _resolve_effective_session_model_for_display / _resolve_effective_session_model_provider_for_display called _resolve_compatible_session_model_state() without prefer_cached_catalog=True. For sessions with no persisted model_provider (kanban/imported), the fast path is skipped and it falls through to a full live get_available_models() cold rebuild (botocore IMDS ~4s on non-AWS/WSL + anthropic/openrouter /models ~0.9s). GET /api/session is a hot, side-effect-free per-tab/per-poll path; concurrent tabs serialized on the models-cache lock and starved SSE/streaming -> BrokenPipe/Cancelled storm. POST /api/chat/start wakeup path already passed prefer_cached_catalog=True; the GET display path was the missed blind spot. Fix: pass prefer_cached_catalog=True at both display resolvers (persisted session model authoritative; network-free minimal catalog backstops the default). One logical change, 2 call sites. ~7742ms -> 133-148ms (~52x); display-path live rebuilds N -> 0. - tests/test_session_display_resolver_no_live_rebuild.py: new invariant tripwire regression (4 tests; pre-fix 3 fail proving non-tautology) - tests/test_provider_mismatch.py: stub signature aligned (lambda **_kw)

…h.json fingerprint auth.json is rewritten by credential-pool/OAuth token refresh roughly every 14 minutes. _models_cache_source_fingerprint() hashed it via mtime/size (nesquena#1699 _models_cache_file_fingerprint), so every token refresh churned the fingerprint and the 24h /api/models cache was effectively dead -- the hot GET /api/session?resolve_model=1 path paid a cold ~11.5s rebuild every few minutes (RCA t_d127953d residual nesquena#2, t_16551f61). Add _auth_store_semantic_fingerprint(): content-hash auth.json with a DENY-list of known credential-rotation-only keys (access/refresh token, expiry, per-credential status/telemetry, request_count, save updated_at) stripped. Deny-list (not allow-list) is deliberate -- any unknown field, or a real provider/endpoint/model-set change (active_provider, a new credential_pool entry, base_url, source, label, auth_type, the providers{} block, ...) stays in the fingerprint and still correctly busts the cache. Conservative fallbacks: missing file -> marked; unreadable/corrupt -> stat-based fallback (never less safe than pre-fix). config.yaml keeps the cheap stat fingerprint (deliberate edits, no timer churn). Bidirectional invariant regression test (non-tautological -- the end-to-end churn test flips RED when the auth_json axis is reverted to stat-based): token-only churn keeps fingerprint byte-identical AND keeps a valid disk cache loadable; active_provider change / new credential_pool entry / changed base_url each flip the fingerprint AND reject the stale disk cache. Measured: 5/5 cold rebuilds per 5 refresh cycles -> 0/5. Tests: 9 new pass; 28 adjacent (nesquena#1699/nesquena#1633/display-resolver) pass; 54 models_cache/fingerprint suite pass.

…n wakeup misroute root fix, Option 1) RCA t_f62ff1e8: WebUI bound per-turn session identity only to the process-global os.environ['HERMES_SESSION_KEY'] and released the env lock before the agent ran; it never set any contextvar, so get_current_session_key fell back to that racy slot. Two concurrent WebUI turns raced on one slot -> a notify_on_complete spawn captured a concurrent turn's session_id -> server-side wakeup turn started for the wrong session (agent.log:6632). _set/_reset/_bind_turn_session_identity bind the turn's sid to tools.approval._approval_session_key + gateway.session_context._SESSION_KEY in the _run_agent_streaming worker thread (concurrent tool batches inherit via copy_context); reset-token restore in the same outer finally as the env restore. Does NOT use blanket set_session_vars (would break the notify_on_complete watcher gate). English-only source (no-cjk guard); pre-init ordered to preserve the Issue nesquena#765 locator. Core unchanged. Regression: tests/test_xsession_wakeup_misroute.py (RED pre / GREEN post).

…session misroute defense-in-depth, Option 3) Independent safety net at the wakeup-routing layer, decoupled from the Option 1 root fix (different file/layer, separately revertible). On _process_one, after PROCESS_SESSION_INDEX resolves the target, _resolve_wakeup_target cross-checks it against an env-immune spawn owner on the ProcessSession (forward-compatible duck-typing of spawn_session_id/owner_session_id/turn_session_id since adding the field to core ProcessSession is out of scope for this WebUI-only change). Positive mismatch -> log ERROR + re-route the wakeup to the true owner. Pure pass-through when no env-immune owner is available (today's core, cron/CLI procs, pre-Option-1 spawns) so Option Z is never suppressed on uncertainty. Regression covered by tests/test_xsession_wakeup_misroute.py.

…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.

Isla-Liu · 2026-05-19T06:30:32Z

If the rebase is clean, no maintainer intervention needed — just push the rebase + flip out of draft and we'll review.

Ping back if any of the Q1–Q4 decisions affect implementation enough that you need to revisit before rebasing.

@nesquena-hermes @nesquena
Hope for some idea or advice, thanks!

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.

…ena#2550/nesquena#2551/nesquena#2554/nesquena#2560/nesquena#2567/nesquena#2568) into process-complete-event PR # Conflicts: # static/sessions.js

…own streams in _emit_to_session_streams (no cross-session broadcast) Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).

Restores green CI for tests/test_xsession_wakeup_misroute.py and tests/test_process_complete_ab_coexistence.py::test_registry_completion_consumed_contract and tests/test_session_channel_option_x.py::test_real_completion_event_shape_routes_to_session_channel on Python 3.11 / 3.12 / 3.13. Failure: `ModuleNotFoundError: No module named 'tools'` from imports `from tools.process_registry import ...` and `from tools.approval import ...`. Tests depend on the hermes-agent sibling project's `tools/` namespace; they now skip when hermes-agent is not installed, matching the repo's existing agent-dependent test convention. Co-authored-by: Hermes Agent <hermes-agent@users.noreply.github.com>

nesquena added the hold label May 14, 2026

Isla-Liu force-pushed the feat/process-complete-event-isla branch from d168100 to 4358d2f Compare May 18, 2026 04:56

Isla-Liu marked this pull request as ready for review May 18, 2026 04:59

Copilot AI review requested due to automatic review settings May 18, 2026 04:59

Copilot started reviewing on behalf of Isla-Liu May 18, 2026 05:00 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Isla-Liu requested a review from Copilot May 18, 2026 05:08

Copilot started reviewing on behalf of Isla-Liu May 18, 2026 05:09 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Isla-Liu requested a review from Copilot May 18, 2026 07:22

Copilot started reviewing on behalf of Isla-Liu May 18, 2026 07:22 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Isla-Liu requested a review from Copilot May 18, 2026 08:54

Copilot started reviewing on behalf of Isla-Liu May 18, 2026 08:54 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

nesquena-hermes mentioned this pull request May 18, 2026

Stage 384 / v0.51.91 — Release BO — 5-PR full sweep batch #2557

Merged

Isla-Liu force-pushed the feat/process-complete-event-isla branch from 0ad2fde to d6753e5 Compare May 19, 2026 02:27

Isla-Liu requested a review from Copilot May 19, 2026 02:29

Copilot started reviewing on behalf of Isla-Liu May 19, 2026 02:30 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

nesquena-hermes mentioned this pull request May 19, 2026

Stage 385 / v0.51.92 — Release BP — 7-PR full sweep batch #2570

Merged

Isla-Liu added 13 commits May 19, 2026 14:03

docs(CHANGELOG): fix open-tab live-view of server-side wakeup turns +…

7fb78ed

… Copilot nesquena#3 resolution

Isla-Liu force-pushed the feat/process-complete-event-isla branch from d76801c to 64b8294 Compare May 19, 2026 06:21

Isla-Liu requested a review from Copilot May 19, 2026 06:25

Copilot started reviewing on behalf of Isla-Liu May 19, 2026 06:25 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Isla-Liu changed the title ~~feat(streaming): process_complete event for terminal(notify_on_complete=true) agent wakeup~~ feat(streaming): wake the agent server-side when a notify_on_complete background task finishes while the WebUI tab/session is idle May 19, 2026

Isla-Liu requested a review from Copilot May 19, 2026 07:12

Copilot started reviewing on behalf of Isla-Liu May 19, 2026 07:12 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

This was referenced May 19, 2026

Stage 386 / v0.51.93 — Release BQ — 10-PR full sweep batch #2596

Merged

Stage 387 / v0.51.94 — Release BR — 10-PR full sweep batch #2606

Merged

merge: incorporate origin/master (Stage-385 + PRs nesquena#2548/nesqu…

6458542

…ena#2550/nesquena#2551/nesquena#2554/nesquena#2560/nesquena#2567/nesquena#2568) into process-complete-event PR # Conflicts: # static/sessions.js

Isla-Liu force-pushed the feat/process-complete-event-isla branch from 1de6e8a to 6458542 Compare May 22, 2026 04:50

Isla-Liu and others added 2 commits May 22, 2026 13:31

Merge branch 'nesquena:master' into feat/process-complete-event-isla

8455692

		The user restarted the real instance (pid 765), ran an empty background task
		with `notify_on_complete=true`. The `process_complete` event fired

Conversation

Isla-Liu commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Thinking Path

What Changed

Why It Matters

Relationship to #2279

Verification

Risks / Follow-ups

Model Used

Uh oh!

Isla-Liu commented May 15, 2026

Uh oh!

nesquena-hermes commented May 17, 2026

Decisions

Status

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Isla-Liu commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Isla-Liu May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Isla-Liu May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Isla-Liu May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Isla-Liu May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Isla-Liu May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Isla-Liu commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Isla-Liu commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Isla-Liu commented May 14, 2026 •

edited

Loading

Isla-Liu commented May 18, 2026 •

edited

Loading

Isla-Liu commented May 19, 2026 •

edited

Loading