feat(streaming): wake the agent server-side when a notify_on_complete background task finishes while the WebUI tab/session is idle#2242

Open
Isla-Liu wants to merge 20 commits into nesquena:master from Isla-Liu:feat/process-complete-event-isla

Conversation

@Isla-Liu
Contributor

@Isla-Liu Isla-Liu commented May 14, 2026

Thinking Path

  • A core Hermes Agent guarantee is that a terminal(background=true, notify_on_complete=true) task wakes the agent when it finishes, so the agent can act on the result on its own. The CLI and the gateway hosts (Telegram, Discord, etc.) already do this by draining the completion queue inside their own run loops.
  • In the WebUI this guarantee did not hold. The background process exited, its completion was queued in the agent process registry, and nothing read from it — the agent only resumed when a human typed the next message. The agent had told the user "I'll continue when this finishes," and then silently never did.
  • The merged PR #2279 (Fix WebUI stream completion recovery gaps) closed part of this gap with a next-turn safety-net drain: at the start of the next user-initiated turn it drains any pending completions and prepends them to the user's message. That is the correct backstop, but it cannot help the primary case this PR targets: fire a long task, close the tab, come back later. An autonomous agent in that state has no "next user turn," and a closed tab can never re-POST anything to the server.
  • The structurally correct fix is therefore to start the wakeup turn on the server, with no browser round-trip — exactly the way the CLI and gateway hosts already self-wake — and to keep an open tab able to watch that server-started turn live.

What Changed

This PR layers a proactive, server-side wakeup on top of the merged #2279
next-turn drain. It carries none of #2279's code (that is upstream now);
it is rebased onto current master.

  • Server-side wakeup turn (primary path). A startup drain thread reads the agent process completion queue and, when the owning session is idle, starts the agent turn directly on the server via a new headless entrypoint (api/routes.py::start_session_turn) that reuses the exact session-load → workspace → model/provider → worker-thread path /api/chat/start already uses. No second agent-construction path is introduced. If a turn is already active, it does not start a competing turn and instead leaves the completion for the merged #2279 next-turn drain.
  • Persistent per-session live-view channel (GET /api/session/stream). A long-lived SSE channel keyed on the WebUI session_id (so it survives across turns, unlike the per-turn chat stream). A reaper thread collects channels with no subscribers after a grace period and idle channels after a configurable TTL; a channel with a live subscriber is never collected. When a turn is started server-side, an open tab attaches the existing chat-stream renderer to it and shows it live; a closed tab simply gets the persisted result on next load (parity with CLI behavior).
  • SSE write-deadline backpressure. All long-lived SSE endpoints now arm a socket write deadline, so a slow or backgrounded tab whose TCP receive window has filled can no longer pin a server worker thread indefinitely (which previously also blocked the live-view reaper).
  • Cold model-catalog resolution no longer blocks a wakeup. A server-side wakeup fires exactly when the model catalog is cold; the cold rebuild could make a blocking network call (provider token exchange) that hung the turn on restricted networks. The wakeup now resolves from warm/disk cache or a network-free minimal catalog, and the already-persisted session model always wins. Integrated with the existing upstream model-resolve fast-path.
  • Fast-task completion race. When a task finishes in the narrow window while a turn is tearing down, the wakeup prompt is persisted and redelivered by a turn-teardown idle hook, making the busy-at-completion case behave the same as the idle case. The claim is atomic, so it fires exactly once and never double-fires with the merged #2279 drain.
  • Cross-session wakeup isolation. Per-turn session identity is bound to a context variable for the duration of the turn (plus a completion-time target cross-check as defense in depth), so with multiple concurrent WebUI sessions a completed background task always wakes the session that launched it, never another open session.
  • Copilot review comments on this PR have been addressed (agent working notes removed from the repo root, a misleading ack response field corrected, owner-unknown streams no longer broadcast cross-session, the dedupe contract asserted instead of silently swallowing a registry rename, and local absolute paths stripped from a test).
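The primary path in the list above (drain the completion queue, wake an idle session, park the completion when a turn is already active) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's code: `completion_queue`, `active_sessions`, `deferred`, `start_turn`, and the completion dict shape are all hypothetical stand-ins.

```python
import queue
import threading

# Hypothetical stand-ins; none of these names are the PR's actual API.
completion_queue = queue.Queue()   # completions pushed by background processes
active_sessions = set()            # sessions with a turn in flight
deferred = {}                      # session_id -> completions parked while busy
started_turns = []                 # (session_id, prompt) turns started server-side
_lock = threading.Lock()


def start_turn(session_id, prompt):
    """Placeholder for a headless entrypoint that starts an agent turn."""
    started_turns.append((session_id, prompt))


def process_one(completion):
    """Route one completion: wake an idle session, park it for a busy one."""
    sid = completion["session_id"]
    with _lock:
        if sid in active_sessions:
            # Never start a competing turn; leave it for a later drain.
            deferred.setdefault(sid, []).append(completion)
            return "deferred"
        active_sessions.add(sid)  # mark busy before releasing the lock
    start_turn(sid, f"Background process {completion['process_id']} finished.")
    return "woken"


def drain_once(timeout=0.1):
    """Pull at most one completion; return False when the queue is empty."""
    try:
        completion = completion_queue.get(timeout=timeout)
    except queue.Empty:
        return False
    process_one(completion)
    return True
```

A real drain thread would presumably call something like `drain_once()` in a loop until shutdown; the single-step form here only keeps the routing decision easy to test.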

Why It Matters

The headline use case for notify_on_complete is "start something long, walk
away, let the agent pick it up when it's done." Before this PR a WebUI user
who did that — especially with the tab closed — got nothing: the agent
appeared to hang forever. After this PR the agent wakes itself on the server
the moment the task completes, an open tab sees that turn render live, and a
closed tab finds the completed turn already persisted on next load. WebUI
now matches the wakeup behavior the CLI and gateway hosts have always had.

Relationship to #2279

PR #2279 (merged into master) is the next-turn safety-net drain — it
delivers a pending completion at the start of the next user turn. This PR is
the proactive server-side layer on top of it: it wakes the agent without
waiting for a next user turn and without a browser round-trip. The two share a
single completion-consumed dedupe key (the contract #2279 established), so a
completion wakes the agent exactly once regardless of which path fires first;
neither double-fires and there is no wakeup loop. This PR carries none of
#2279's code.
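To illustrate why a single shared dedupe key gives exactly-once wakeups regardless of which path fires first, here is a minimal sketch of an atomic claim. The `ConsumedSet` class and `race` helper are hypothetical; the real contract lives in the upstream process registry and is not reproduced here.

```python
import threading


class ConsumedSet:
    """Tracks which process completions have already woken the agent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._consumed = set()

    def claim(self, process_id):
        """Return True for exactly one caller per process_id."""
        with self._lock:
            if process_id in self._consumed:
                return False
            self._consumed.add(process_id)
            return True


def race(consumed, process_id, contenders=8):
    """Let several threads race to claim one completion; count the winners."""
    wins = []
    threads = [
        threading.Thread(target=lambda: wins.append(consumed.claim(process_id)))
        for _ in range(contenders)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(wins)
```

Because the check and the insert happen under one lock, the next-turn drain and the proactive path can both call `claim` and only one of them proceeds to wake the agent.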

Refs #2278 (root issue — WebUI never fires notify_on_complete wakeups;
closed by #2279 for the next-turn case, completed here for the
idle/closed-tab case). Builds on #2279. The #2262 compression-marker fix was
shipped upstream via #2279 and is intentionally not re-added here.

Verification

  • Full Python regression suite: 0 regressions vs. the post-#2279-merge master baseline.
  • Wakeup-critical suites pass, covering: closed-tab self-wake when idle; fast vs. slow completion (completion-during-teardown); live-view of a server-started turn plus SSE write-deadline backpressure; no double-fire / no loop across the merged #2279 and this PR in both interleavings; cold model-catalog resolution surviving a hung network probe; cross-session wakeup isolation under concurrent sessions.
  • Upstream #2279 contract suites remain green and unchanged.
  • Live-instance verification: a real WebUI instance self-woke on both a fast and a slow notify_on_complete background task with the tab closed, and an open tab rendered the server-started turn live (verified with Playwright).
  • git diff --check clean; touched modules py-compile clean.
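The cross-session isolation check listed above depends on binding the session identity to a context variable for the duration of a turn. A minimal sketch of that idea, assuming one worker thread per turn; `current_session_id`, `process_owner`, and the helper names are hypothetical:

```python
import contextvars
import threading

# Hypothetical variable name; the PR only states that identity is bound
# to a context variable for the duration of the turn.
current_session_id = contextvars.ContextVar("current_session_id", default=None)

process_owner = {}  # process_id -> session that launched it


def launch_background(process_id):
    """Record the launching session from the context, not from a global."""
    process_owner[process_id] = current_session_id.get()


def run_turn(session_id, process_id, barrier):
    """Bind the session for one turn, then restore the previous value."""
    token = current_session_id.set(session_id)
    try:
        barrier.wait(timeout=5)  # force both turns to overlap in time
        launch_background(process_id)
    finally:
        current_session_id.reset(token)


def run_concurrent_turns():
    """Run two overlapping turns and report who owns each process."""
    barrier = threading.Barrier(2)
    threads = [
        threading.Thread(target=run_turn, args=("session-a", "proc-a", barrier)),
        threading.Thread(target=run_turn, args=("session-b", "proc-b", barrier)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(process_owner)
```

Each thread sets the variable in its own context, so the two overlapping turns cannot observe each other's session id, which is the property a process-global variable would not give.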

Risks / Follow-ups

  • Adds two long-lived server threads (the completion drain and the live-view reaper). Both are daemon threads with bounded work; the reaper's grace period and idle TTL are configurable.
  • The live-view channel is keyed on session_id and survives across turns; the reaper TTL is the upper bound on an abandoned channel's lifetime.
  • The completion-consumed dedupe key is the contract established by the merged #2279; if a future change alters that contract, both the next-turn drain and this proactive path must be updated together.
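As a rough illustration of the reaper policy described in the risks above, the sketch below shows one plausible reading of the rules: a channel with a live subscriber is never collected, and a subscriber-less channel is collected once either its grace period or its idle TTL has elapsed. The `Channel` fields, the default durations, and the `reap` function are assumptions for illustration only.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical per-session live-view channel bookkeeping."""
    subscribers: int = 0
    last_event_at: float = field(default_factory=time.monotonic)
    last_unsubscribed_at: float = field(default_factory=time.monotonic)


def reap(channels, now, grace_seconds=30.0, idle_ttl_seconds=600.0):
    """Remove collectable channels in place; return the reaped session ids."""
    reaped = []
    for session_id, ch in list(channels.items()):
        if ch.subscribers > 0:
            continue  # a channel with a live subscriber is never collected
        no_subscriber_too_long = now - ch.last_unsubscribed_at >= grace_seconds
        idle_too_long = now - ch.last_event_at >= idle_ttl_seconds
        if no_subscriber_too_long or idle_too_long:
            del channels[session_id]
            reaped.append(session_id)
    return reaped
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the policy testable without sleeping.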

Model Used

AI-assisted change with repository inspection, source-level root-cause
analysis, targeted editing, automated regression suites, and live-instance
verification. AI assisted with code, tests, verification, and PR text.

@nesquena nesquena added the hold label May 14, 2026
@Isla-Liu
Contributor Author

Self-review Q1-Q4 from internal reviewer (decisions, ahead of upstream review):

Q1 Event name — keeping process_complete. Matches outline/RFC/tests/CHANGELOG; renaming pre-merge isn't worth the churn.

Q2 /api/process-complete-ack auth — keeping current (mirror /api/goal-continue ownership check). Endpoint is diagnostic-only; the real auth-bearing path is /api/chat/start.

Q3 watch-pattern UX — keeping single event channel; will branch toast text in frontend on data.type as a follow-up commit if maintainers want polish. Splitting events doubles the contract surface for a UX nit.

Q4 hermes-agent ±5 LOC companion patch — deferring indefinitely. WebUI-side env export is sufficient and matches the CLI/gateway pattern.

Heads-up for maintainers: PR #2279 was opened today by another contributor as the Option A drain implementation for issue #2278. We are rebasing this PR on top of #2279 to position it as the Option B proactive-notification layer (per maintainer endorsement in #2278: "Option B can land later if there is demand"). #2262 compression-marker fix attribution stays with #2279. Updated PR body + integrated branch coming after local verification.

@nesquena-hermes
Collaborator

Maintainer signal on Q1–Q4 so you can move forward:

Decisions

Q1 — Event name process_complete: keep as-is. Agreed, renaming pre-merge is churn.

Q2 — /api/process-complete-ack auth model: the ownership check mirroring /api/goal-continue is fine for the diagnostic ack endpoint. The real auth-bearing path is /api/chat/start, as you noted. Approved.

Q3 — Single event channel + frontend branching on data.type: approved. Splitting events would double the contract surface for a UX polish concern. If toast-text variants are wanted later, branching in messages.js based on the payload is the right place.

Q4 — Defer hermes-agent companion patch: approved. WebUI-side env export matches the CLI/gateway pattern, no need for an agent-side change in this slice.

Status

PR #2279 already merged (May 15) as the Option A drain layer. Path forward for this PR:

  1. Rebase on current master. v0.51.78 / v0.51.79 / v0.51.80 have shipped since you opened this. mergeable: CONFLICTING right now.
  2. Reposition the PR body to reflect the new layering: Option B proactive-notification layer on top of #2279's drain helper.
  3. CI run. Once rebased and green, mark "ready for review" — we'll route through independent review for the streaming changes (api/streaming.py + api/background_process.py are sensitive paths).

If the rebase is clean, no maintainer intervention needed — just push the rebase + flip out of draft and we'll review.

Ping back if any of the Q1–Q4 decisions affect implementation enough that you need to revisit before rebasing.

Copilot AI left a comment

Pull request overview

This PR layers proactive server-side agent wakeup ("Option Z") and a persistent per-session SSE live-view channel ("Option X") on top of the recently merged upstream PR #2279 (next-turn drain for terminal(notify_on_complete=true)). It adds a new api/background_process.py module containing a drain thread that reads tools.process_registry.completion_queue and, when the owning session is idle, starts a wakeup agent turn directly server-side — making the "fire a background task, close the tab, come back later" use case work the way CLI/gateway hosts already do. It also adds several follow-ups: an SSE write-deadline to prevent thread exhaustion, a model-resolve cache-only fast path to avoid a Copilot token-exchange hang during cold-cache wakeups, and a defer-race teardown hook so fast background tasks that complete during a turn's teardown window are redelivered when the session goes idle.
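The SSE write-deadline mentioned in this overview can be illustrated with a plain socket timeout. This is a sketch under assumptions, not the PR's `_sse_set_write_deadline`: the function names and the 5-second default are hypothetical, and it assumes the handler has direct access to the connection socket.

```python
import socket


def write_sse_frame(sock, event, data, deadline_seconds=5.0):
    """Write one SSE frame; return False if the peer did not drain in time."""
    frame = f"event: {event}\ndata: {data}\n\n".encode("utf-8")
    sock.settimeout(deadline_seconds)  # bounds every blocking send on this socket
    try:
        sock.sendall(frame)
        return True
    except (socket.timeout, BrokenPipeError, ConnectionResetError):
        # A partial frame may have been sent; the caller should close the
        # stream and free the worker thread rather than retry.
        return False


def frames_until_stalled(deadline_seconds=0.2, max_frames=10_000):
    """Show the deadline firing on a local socket pair whose reader never reads."""
    writer, reader = socket.socketpair()
    try:
        payload = "x" * 65536
        for sent in range(max_frames):
            if not write_sse_frame(writer, "token", payload, deadline_seconds):
                return sent
        return max_frames
    finally:
        writer.close()
        reader.close()
```

Without the timeout, the `sendall` call on a full send buffer would block for as long as the client stays stalled, which is the thread-pinning behavior the deadline is meant to prevent.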

Changes:

  • New api/background_process.py (drain thread, SessionChannel registry + reaper, deferred-wakeup persistence, server-side wakeup starter) and a new /api/session/stream SSE endpoint and start_session_turn headless entrypoint in api/routes.py.
  • api/config.py gains get_available_models(prefer_cache=True) + a bounded live-rebuild worker (_LIVE_REBUILD_BUDGET_SECONDS, _minimal_static_models_catalog) and new state (DEFERRED_PROCESS_WAKEUPS, PROCESS_SESSION_INDEX, SESSION_CHANNEL_* TTLs); api/streaming.py adds _sse_set_write_deadline, env wiring for HERMES_SESSION_CHAT_ID, and a turn-teardown drain hook.
  • Frontend (static/messages.js, static/sessions.js) opens/closes a per-session SSE stream, dedupes process_complete events module-wide, and attaches the existing attachLiveStream renderer when the server fans a server_turn_started frame.
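The cache-first catalog resolution with a bounded live rebuild, as summarized in the changes above, might look roughly like this. It is a sketch only: `resolve_catalog`, `MINIMAL_CATALOG`, and the 2-second default budget are hypothetical, and the real `get_available_models(prefer_cache=True)` signature is not reproduced.

```python
import concurrent.futures
import time

# Hypothetical network-free fallback catalog.
MINIMAL_CATALOG = {"models": [], "source": "minimal-static"}


def resolve_catalog(warm_cache, live_rebuild, budget_seconds=2.0):
    """Prefer the cache; otherwise give the live rebuild a bounded time budget."""
    if warm_cache is not None:
        return warm_cache  # never touch the network when a cache exists
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(live_rebuild).result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        return MINIMAL_CATALOG  # rebuild is hung or slow; do not block the turn
    finally:
        executor.shutdown(wait=False)  # let a slow rebuild finish in the background
```

The caller gets an answer within the budget either way; a rebuild that finishes late is simply discarded in this sketch, whereas a fuller version would likely store its result in the cache.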

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| api/background_process.py | New module: drain thread, SessionChannel + reaper, deferred-wakeup persistence, server-side wakeup runner. |
| api/config.py | Adds prefer_cache mode, bounded rebuild worker, minimal static catalog, and new state buckets for wakeup/SessionChannel. |
| api/routes.py | Adds start_session_turn, /api/session/stream + /api/process-complete-ack handlers, threads prefer_cached_catalog, arms SSE write deadlines, fans server_turn_started. |
| api/streaming.py | Adds _sse_set_write_deadline, exports HERMES_SESSION_CHAT_ID, registers per-session process mapping, calls turn-teardown deferred-wakeup drain. |
| server.py | Starts/stops the drain thread and SessionChannel reaper. |
| static/messages.js | Module-scope dedupe set, process_complete listener, startSessionStream/stopSessionStream, server_turn_started attach-existing-renderer logic. |
| static/sessions.js | Wires session-stream start/stop into newSession/loadSession. |
| CHANGELOG.md | Long entries describing the Option Z pivot, Option X channel, model-resolve fix, defer-race fix, and SSE backpressure fix. |
| tests/test_session_channel_option_x.py, test_process_complete_wakeup.py, test_process_complete_ab_coexistence.py, test_optionz_liveview_perf.py, test_wakeup_model_resolve_hang.py, test_wakeup_defer_race.py, manual_repro_wakeup_hang.py | New tests for SessionChannel lifecycle, AB-coexistence dedupe, live-view fan-out + SSE backpressure, model-resolve hang fix, and defer-race teardown hook. |
| tests/test_bugbatch_apr2026.py | Source-grep updated for new get_available_models( signature. |
| wakeup-model-resolve-hang-fix.md, wakeup-defer-race-fix.md, optionz-liveview-perf-fix.md | New design/repro notes (markdown only). |

@Isla-Liu
Contributor Author

Isla-Liu commented May 18, 2026

Screenshot showing the background task behaving correctly:
[Image: Screenshot 2026-05-18 104714]

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Comment thread wakeup-model-resolve-hang-fix.md Outdated
Comment on lines +10 to +11
The user restarted the real instance (pid 765), ran an empty background task
with `notify_on_complete=true`. The `process_complete` event fired
Contributor Author

Thanks — agreed. These three files were working notes, not maintained project documentation, and the substance is already captured (operator details stripped) in the [Unreleased] section of CHANGELOG.md. I've removed all three from the repo root rather than relocating them, since nothing in them is unique to keep. Done in commit a76c050.

Comment thread api/routes.py Outdated
"ok": True,
"session_id": s.session_id,
"process_id": pid,
"pending_consumed": False, # frontend re-POSTs /api/chat/start to actually consume
Contributor Author

Good catch — that field and comment were stale after the Option Z pivot and contradicted the function's own docstring. Since the endpoint is now a pure no-op for state (the wakeup turn is started server-side; the browser never re-POSTs /api/chat/start), I've dropped "pending_consumed": False and replaced it with "noop": True plus a comment that matches the documented behavior. I verified nothing reads the old field first — the frontend fires the ack fire-and-forget (.catch(() => {})) without inspecting the body, and no test asserts it. Fixed in commit cee897a.

Comment thread api/background_process.py
Comment on lines +307 to +319
for stream_id, channel in items:
    meta = _cfg.ACTIVE_RUNS.get(stream_id) if hasattr(_cfg, "ACTIVE_RUNS") else None
    owner_sid = (meta or {}).get("session_id") if isinstance(meta, dict) else None
    # When no ACTIVE_RUNS row matches, fall back to broadcasting — the
    # frontend ignores events for the wrong session_id by inspecting the
    # payload (data.session_id !== activeSid).
    if owner_sid and owner_sid != session_id:
        continue
    try:
        channel.put_nowait((event, data))
        emitted += 1
    except Exception:
        logger.debug("process_complete emit failed for stream %s", stream_id, exc_info=True)
Contributor Author

Resolved by skipping non-matching streams when the owner is unknown — the option you preferred. _emit_to_session_streams now does if owner_sid != session_id: continue (previously if owner_sid and owner_sid != session_id), so a stream with no matching ACTIVE_RUNS row is no longer broadcast to, removing the cross-session-leak surface.

This was justified with a deterministic Playwright reproduction of the related open-tab live-view defect: the per-session SessionChannel (emitted unconditionally below this loop), now backed by an on-subscribe recovery replay, is the sole authoritative cross-turn / server-initiated live-view carrier after the Option X/Z work. The STREAMS loop is in-turn defense-in-depth only and is never the path that delivers a between-turns or server-initiated wakeup, so the broadcast fallback is not load-bearing for any live-view path. Covered by regression tests (owner-unknown skip, matching-owner still delivered, cross-session isolation). Fixed in commit 29cdce2.
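The effect of changing the guard from `if owner_sid and owner_sid != session_id` to `if owner_sid != session_id` can be seen in isolation with a small sketch. `streams_to_notify` is a hypothetical helper that only models the selection step, not the actual `_emit_to_session_streams` body.

```python
def streams_to_notify(streams, active_runs, session_id):
    """Return the stream ids that should receive an event for session_id."""
    selected = []
    for stream_id in streams:
        meta = active_runs.get(stream_id)
        owner_sid = meta.get("session_id") if isinstance(meta, dict) else None
        # Stricter guard: an unknown owner (None) no longer falls through,
        # so streams without a matching run row are skipped, not broadcast to.
        if owner_sid != session_id:
            continue
        selected.append(stream_id)
    return selected
```

With the old guard, a stream whose owner resolved to `None` would have been included for every session; with the stricter one, only streams positively owned by the target session are selected.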

Comment thread api/background_process.py Outdated
Comment on lines +413 to +419
if process_id:
    try:
        from tools.process_registry import process_registry as _pr
        with _pr._lock:
            _pr._completion_consumed.add(process_id)
    except Exception:
        logger.debug("Failed to mark process completion consumed on B drain", exc_info=True)
Contributor Author

Addressed with the defensive option (b). The read side already uses the public process_registry.is_completion_consumed(); only the write side reached into the private _lock / _completion_consumed, because there is no public mark_completion_consumed() upstream. I've extracted _mark_registry_completion_consumed() which documents the coupling and narrows the exception handling: ImportError stays best-effort (the registry is legitimately absent in non-agent unit-test contexts), but an AttributeError/TypeError from a future rename or retype is now logged at ERROR with an explicit contract-violation message instead of being swallowed silently at DEBUG. A future rename therefore fails loud rather than silently reintroducing the double-wakeup.

I also added a contract test (test_registry_completion_consumed_contract) plus a fail-loud test in tests/test_process_complete_ab_coexistence.py that pin the expected ProcessRegistry private surface, so an upstream rename breaks CI here visibly. Fixed in commit 9f3c7b6.

For the longer term, option (a) — a public mark_completion_consumed(process_id) / is_completion_consumed(process_id) pair on ProcessRegistry upstream — is the durable fix; I'm happy to submit that upstream and switch this call site to it if you'd like.
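The narrowed exception handling described in this reply could be sketched as below. This is not the extracted `_mark_registry_completion_consumed()` itself: `mark_consumed` and its `load_registry` parameter (a zero-argument callable standing in for the real import) are hypothetical, introduced so the sketch runs without the agent package installed.

```python
import logging
import threading

logger = logging.getLogger("wakeup_sketch")


def mark_consumed(load_registry, process_id):
    """Mark a completion consumed; return True on success.

    load_registry is a hypothetical zero-argument callable standing in
    for the real import of the process registry.
    """
    try:
        registry = load_registry()
    except ImportError:
        # Best-effort: the registry is legitimately absent off-agent.
        logger.debug("process registry unavailable; skipping consumed mark")
        return False
    try:
        with registry._lock:
            registry._completion_consumed.add(process_id)
        return True
    except (AttributeError, TypeError):
        # A renamed or retyped private is a contract violation: fail loud.
        logger.error(
            "dedupe contract violated: registry private surface changed",
            exc_info=True,
        )
        return False
```

Splitting the two `try` blocks is what keeps an expected absence (ImportError) quiet while an unexpected rename surfaces at ERROR instead of being swallowed by a broad `except Exception`.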

Comment on lines +17 to +22

Run:
/home/islaliu/.hermes/hermes-agent/venv/bin/python \
tests/manual_repro_wakeup_hang.py
"""

Contributor Author

Moved it out of the test root: git mv tests/manual_repro_wakeup_hang.py tests/manual/repro_wakeup_hang.py. I stripped the hardcoded absolute interpreter path from the docstring (the run instruction is now parametric — python tests/manual/repro_wakeup_hang.py from the repo root) and fixed its repo-root sys.path computation for the deeper location.

For the collection exclusion I used collect_ignore_glob = ["manual/*.py"] in the existing top-level tests/conftest.py rather than a nested tests/manual/conftest.py: a second conftest module collides with the suite's top-level from conftest import TEST_BASE/TEST_STATE_DIR imports (the suite uses rootdir-relative module names, no packages). Verified pytest --collect-only tests/ collects cleanly with no collection errors and the manual script no longer surfaces. Fixed in commit 3b61772.
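For reference, the collection exclusion amounts to a single module-level name in tests/conftest.py. The fragment below shows that line plus a small, approximate helper (hypothetical, not part of pytest or the PR) for checking which relative paths the pattern covers.

```python
from fnmatch import fnmatch

# In tests/conftest.py: pytest skips collecting paths that match these
# globs, resolved relative to the directory containing this conftest.py.
collect_ignore_glob = ["manual/*.py"]


def is_ignored(relative_path):
    """Approximate check of which paths the patterns above cover."""
    return any(fnmatch(relative_path, pattern) for pattern in collect_ignore_glob)
```

The helper only mirrors the glob matching for illustration; `pytest --collect-only tests/` remains the authoritative check, as used in the verification above.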

Isla-Liu pushed a commit to Isla-Liu/hermes-webui that referenced this pull request May 18, 2026
…own streams in _emit_to_session_streams (no cross-session broadcast)

Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
@Isla-Liu Isla-Liu requested a review from Copilot May 18, 2026 07:22
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

@Isla-Liu Isla-Liu force-pushed the feat/process-complete-event-isla branch from 0ad2fde to d6753e5 Compare May 19, 2026 02:27
Isla-Liu pushed a commit to Isla-Liu/hermes-webui that referenced this pull request May 19, 2026
…own streams in _emit_to_session_streams (no cross-session broadcast)

Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
@Isla-Liu
Contributor Author

Rebased onto latest origin/master to resolve the merge conflicts (the branch had fallen 41 commits behind).

Rebase summary

  • Semantic rebase (not merge): 15 commits replayed linearly on origin/master, 0 merge commits, original authorship preserved. New head d6753e55.
  • Zero semantic loss: git diff <merge-base>..<old-head> --stat is byte-identical to git diff origin/master..<new-head> --stat (20 files, +4935/−20 both sides). The only content delta is one indentation change on a single static/sessions.js line, grafted into upstream's coalesced IIFE.
  • Conflict files resolved hunk-by-hunk preserving both intents: api/streaming.py (our Option B/Z + Option X SSE alongside upstream's reasoning-replay / fallback-warning / keyless-endpoint changes), api/routes.py (our wakeup/SSE dispatch + the prefer_cached_catalog fix integrated with upstream's #1855 fast-path, both kept), api/config.py, static/sessions.js, static/messages.js, CHANGELOG.md (upstream release sections byte-untouched).
  • All 5 Copilot review items (#1 through #5) and the prior maintainer Q1–Q4 resolutions are preserved through the rebase.

Verification on the rebased tree

  • Wakeup-critical + RCA + fingerprint suites: 85/85 pass.
  • Full suite: 5961 passed; the only 3 failures are tests/test_issue1106_custom_providers_models.py — an upstream-owned test file none of these commits touch. Root-caused (4-phase) as the already-accepted model-resolve-hang budget tripping under this runner's slow DNS: re-running with the documented HERMES_WEBUI_MODELS_REBUILD_BUDGET=0 switch → 17/17 pass. Not a rebase regression; not investigated further per that prior decision.
  • Live-instance self-wake verified on the rebased code (d6753e55): both the fast path (background process completing during turn-teardown) and the slow path (completing after teardown, idle-hook) trigger a server-initiated wakeup turn that renders live in an already-open tab with no refresh.

Ready for independent maintainer review of the sensitive streaming paths.

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.

…ion X per-session live-view SSE on top of merged nesquena#2279

Layered ON TOP of the now-merged upstream PR nesquena#2279 (Option A next-turn
drain). This branch carries ZERO nesquena#2279 code: the nesquena#2279 pieces
(_drain_webui_process_notifications / _format_process_notification /
_mark_process_completion_consumed / HERMES_SESSION_ID+PLATFORM env wiring /
drain call-site) are upstream and intentionally NOT re-added here.

Our-original contributions only:
- api/background_process.py (NEW): completion-queue drain thread;
  Option Z server-side wakeup primary (_start_server_side_wakeup_turn,
  _session_has_active_turn); Option X SessionChannel registry + reaper
  + /api/session/stream fan-out; defer-race persistence
  (record/claim/drain_deferred_wakeups_for_session). Idempotency aligns
  to the REAL upstream dedupe contract: process_registry._completion_consumed
  via the same private-marker bridge nesquena#2279 uses, plus our original
  PROCESS_COMPLETE_EVENTS_SEEN secondary gate.
- api/config.py: PROCESS_SESSION_INDEX, PENDING_PROCESS_COMPLETIONS,
  PROCESS_COMPLETE_EVENTS_SEEN, DEFERRED_PROCESS_WAKEUPS; model-resolve
  hang fix (get_available_models(prefer_cache=),
  _minimal_static_models_catalog, bounded _LIVE_REBUILD_BUDGET_SECONDS);
  SessionChannel TTLs.
- api/streaming.py: HERMES_SESSION_CHAT_ID env wiring + restore (the
  HERMES_SESSION_ID/PLATFORM lines are upstream nesquena#2279, kept once);
  register_process_session bind at chat start; SSE_WRITE_DEADLINE_SECONDS
  + _sse_set_write_deadline (Defect A backpressure); defer-race
  turn-teardown drain_deferred_wakeups_for_session idle-hook.
- api/routes.py: start_session_turn() reusable headless entrypoint +
  server_turn_started live-view fan-out (Defect B);
  _handle_session_sse_stream (/api/session/stream);
  _handle_process_complete_ack diagnostic;
  PENDING_PROCESS_COMPLETIONS import + discard hook; prefer_cached_catalog
  threaded through _resolve_compatible_session_model_state (integrated
  with upstream nesquena#1855 fast-path — both kept); _sse_set_write_deadline on
  all 6 SSE endpoints.
- server.py: drain + SessionChannel reaper start/stop (integrated with
  upstream drain_all_on_shutdown, both run).
- static/messages.js + sessions.js: per-session live-view EventSource,
  _handleProcessCompleteEvent shared handler, server_turn_started ->
  attachLiveStream, startSessionStream/stopSessionStream lifecycle.

Closed-tab + fast/slow self-wake + live-view + no-double-fire-with-nesquena#2279
+ no-loop preserved.
Isla-Liu added 13 commits May 19, 2026 14:03
…ab (lost fire-and-forget server_turn_started)

Root cause (Playwright-reproduced, see liveview-open-tab-fix.md §1): the server_turn_started fan-out in start_session_turn is a fire-and-forget SessionChannel.emit with no replay buffer. A tab whose /api/session/stream subscriber is momentarily absent at the emit instant (transient SSE drop / reverse-proxy idle-timeout / browser conn-pool starvation) misses it permanently, so a server-initiated wakeup never renders live (needs hard refresh). Server-side wakeup itself ran+persisted fine; only live-view was lost.

Fix: background_process.active_stream_id_for_session() + on-subscribe recovery in _handle_session_sse_stream replays a synthetic server_turn_started {recovered:true} to a freshly-subscribed tab; messages.js honours recovered via the reconnecting (replay) attach path. Reuses the single attachLiveStream renderer; idempotent with the original frame.
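The on-subscribe recovery described in this fix can be sketched as a replay at subscription time. The names below (`active_stream_by_session`, `subscribe`) and the frame shape are hypothetical; only the `server_turn_started` event name and the `recovered` flag come from the commit message.

```python
import queue

# Hypothetical lookup: session_id -> stream_id of the turn currently in flight.
active_stream_by_session = {}


def subscribe(session_id):
    """Create a subscriber queue, replaying the active turn if one is running."""
    subscriber = queue.Queue()
    stream_id = active_stream_by_session.get(session_id)
    if stream_id is not None:
        # The original fire-and-forget frame may have been missed, so a
        # synthetic copy is queued first, flagged as recovered.
        subscriber.put((
            "server_turn_started",
            {"session_id": session_id, "stream_id": stream_id, "recovered": True},
        ))
    return subscriber
```

A tab that reconnects mid-turn therefore sees the frame as its first event and can attach its renderer, while a tab that subscribes when nothing is running gets an empty queue.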
…own streams in _emit_to_session_streams (no cross-session broadcast)

Copilot PR nesquena#2242 review nesquena#3: when no ACTIVE_RUNS row matched, the STREAMS loop fell through and broadcast the event to that stream anyway, relying on every frontend consumer to filter by data.session_id — a fragile cross-session-leak surface. The live-view repro proved the per-session SessionChannel (now backed by on-subscribe recovery) is the SOLE authoritative cross-turn live-view carrier post Option X/Z and this STREAMS loop is in-turn defense-in-depth only. Resolution: skip non-matching AND owner-unknown streams (if owner_sid != session_id: continue). Removes the leak surface; no live-view regression (verified by repro + tests).
…lot nesquena#3 owner-unknown skip

8 new tests in test_optionz_liveview_perf.py: active_stream_id_for_session lookup; on-subscribe recovery behavioural contract; handler + frontend (recovered->reconnecting) wiring grep; Copilot nesquena#3 skip-owner-unknown behaviour, owner-match regression guard, cross-session isolation, source-grep. 40/40 in liveview+optionx; 37/37 wakeup-critical green.
…iew nesquena#1)

The 3 top-level fix-notes (wakeup-model-resolve-hang-fix.md,
wakeup-defer-race-fix.md, optionz-liveview-perf-fix.md) were internal
working notes with local pids, absolute machine paths, and task tokens.
Their substance is already release-note captured in CHANGELOG.md
[Unreleased]; nothing unique is lost. They do not belong in the repo.
…omment (Copilot nesquena#2)

The Option Z pivot made /api/process-complete-ack a pure no-op for
state (the server-side drain thread starts the wakeup turn; the
browser never re-POSTs /api/chat/start). The response field
"pending_consumed": False with the comment "frontend re-POSTs
/api/chat/start to actually consume" directly contradicted the
function's own docstring and the actual frontend (which fires the ack
fire-and-forget and never reads the body). Replace it with an honest
"noop": True that matches the documented behavior. No caller reads
the old field (frontend .catch()s the fetch; no test asserts it).
…me; use/assert the dedupe contract (Copilot nesquena#4)

_process_one() reached into process_registry._lock /
_completion_consumed inside a broad `except Exception` that swallowed
an AttributeError, so a future upstream rename of those privates would
silently reintroduce the cross-A/B double-wakeup bug.

There is no public mark_completion_consumed() upstream (only the READ
side is public via is_completion_consumed(), already used here), so per
Copilot's option (b):

- Extract _mark_registry_completion_consumed() that documents the
  coupling and narrows exception handling: ImportError stays
  best-effort (registry legitimately absent off-agent), but
  AttributeError/TypeError from a renamed/retyped private now logs at
  ERROR with an explicit contract-violation message instead of a
  silent DEBUG.
- Add _REGISTRY_CONSUMED_CONTRACT + two tests in
  test_process_complete_ab_coexistence.py:
  test_registry_completion_consumed_contract pins the private surface
  so an upstream rename breaks CI loudly HERE; the fail-loud test
  asserts a rename surfaces as ERROR not silent DEBUG.

An upstream public mark_completion_consumed() remains the durable fix
(noted in code + the PR reply as a maintainer suggestion). 72/72
wakeup-critical + changed-area tests green.
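
A sketch of the narrowed exception handling, assuming the registry exposes a `_lock` context manager and a `_completion_consumed` set. The loader callback and function name are illustrative, and the real helper may differ in shape.

```python
import logging
import threading
from types import SimpleNamespace
from typing import Callable

log = logging.getLogger("wakeup")


def mark_registry_completion_consumed(
    process_id: str, load_registry: Callable[[], object]
) -> bool:
    """Mark a completion consumed; fail loudly if the private surface changed."""
    try:
        registry = load_registry()
    except ImportError:
        # Registry legitimately absent (running without the agent): best effort.
        log.debug("process registry unavailable; skipping consumed mark")
        return False
    try:
        with registry._lock:
            registry._completion_consumed.add(process_id)
    except (AttributeError, TypeError):
        # A renamed or retyped private is a contract violation, not noise.
        log.error("process registry private contract violated", exc_info=True)
        return False
    return True


# Stand-in registries for demonstration (assumed shapes, not the real class).
good = SimpleNamespace(_lock=threading.Lock(), _completion_consumed=set())
renamed = SimpleNamespace(_lock=threading.Lock())  # private set renamed away


def missing() -> object:
    raise ImportError("tools.process_registry not installed")
```

Splitting the two `try` blocks keeps the tolerated failure (`ImportError`) apart from the one that should break loudly.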
…Copilot nesquena#5)

manual_repro_wakeup_hang.py sat in the pytest test root with a
hardcoded /home/islaliu/.hermes/... venv path in its docstring.
pytest never ran it (no test_ functions) but it was confusing.

- git mv tests/manual_repro_wakeup_hang.py -> tests/manual/repro_wakeup_hang.py
- Strip the operator-specific absolute interpreter path from the
  docstring; the run instruction is now parametric ("python
  tests/manual/repro_wakeup_hang.py" from repo root). Also fix the
  sys.path repo-root math for the now-deeper location.
- Exclude tests/manual/ from collection via collect_ignore_glob in the
  existing tests/conftest.py rather than a nested conftest.py: a second
  conftest module would collide with the suite's top-level
  'from conftest import TEST_BASE/TEST_STATE_DIR' imports (no packages
  / rootdir-relative module names).

Verified: pytest --collect-only tests/ collects 6014 tests, 0
collection errors, the manual script no longer surfaces.
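
For reference, a sketch of what the exclusion in `tests/conftest.py` could look like. `collect_ignore_glob` is a pytest conftest variable whose patterns are resolved relative to the conftest's own directory. The exact pattern used in this PR is not shown here, so `manual/*` is an assumption.

```python
# tests/conftest.py (fragment)
# Keep manual repro scripts out of pytest collection. The glob is
# relative to this conftest's directory, i.e. tests/manual/*.
collect_ignore_glob = ["manual/*"]
```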
…esquena#5 follow-up)

Copilot review nesquena#5 'strip local paths' missed one residue:
tests/test_wakeup_model_resolve_hang.py:3-4 still embedded the
reviewer's machine path /home/islaliu/.hermes/webui/bootstrap-8787.log.

Rewrote lines 3-4 to a neutral description ('a live thread-stack
capture of a hung wakeup turn') — root-cause explanation itself
unchanged. Full re-grep of tests/ + tests/manual/ confirms this was
the only remaining residue. 7/7 tests pass.
…ider-catalog rebuild

Multi-tab streaming interlock RCA (kanban t_d127953d). The two display-only
resolvers _resolve_effective_session_model_for_display /
_resolve_effective_session_model_provider_for_display called
_resolve_compatible_session_model_state() without prefer_cached_catalog=True.
For sessions with no persisted model_provider (kanban/imported), the fast
path is skipped and it falls through to a full live get_available_models()
cold rebuild (botocore IMDS ~4s on non-AWS/WSL + anthropic/openrouter
/models ~0.9s). GET /api/session is a hot, side-effect-free per-tab/per-poll
path; concurrent tabs serialized on the models-cache lock and starved
SSE/streaming -> BrokenPipe/Cancelled storm. POST /api/chat/start wakeup
path already passed prefer_cached_catalog=True; the GET display path was the
missed blind spot.

Fix: pass prefer_cached_catalog=True at both display resolvers (persisted
session model authoritative; network-free minimal catalog backstops the
default). One logical change, 2 call sites. ~7742ms -> 133-148ms (~52x);
display-path live rebuilds N -> 0.

- tests/test_session_display_resolver_no_live_rebuild.py: new invariant
  tripwire regression (4 tests; pre-fix 3 fail proving non-tautology)
- tests/test_provider_mismatch.py: stub signature aligned (lambda **_kw)
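
The effect of the flag can be illustrated with a simplified resolver. This is a sketch under assumed data shapes (a session dict, a catalog dict with a `default` entry, and a live-fetch callback). The real `_resolve_compatible_session_model_state()` is more involved.

```python
from typing import Callable, Dict, Tuple


def resolve_model_for_display(
    session: Dict[str, str],
    cached_catalog: Dict[str, Dict[str, str]],
    fetch_live_catalog: Callable[[], Dict[str, Dict[str, str]]],
    prefer_cached_catalog: bool = False,
) -> Tuple[str, str]:
    """Return (model, provider) for display purposes."""
    model = session.get("model")
    provider = session.get("model_provider")
    if model and provider:
        return model, provider  # fast path: persisted state is authoritative
    # Without the flag, sessions lacking a persisted provider pay for a
    # live (network-bound) catalog rebuild on every call.
    catalog = cached_catalog if prefer_cached_catalog else fetch_live_catalog()
    default = catalog["default"]
    return model or default["model"], provider or default["provider"]
```

With `prefer_cached_catalog=True` the display path never invokes the live fetch, which matches the "live rebuilds N -> 0" claim above.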
…h.json fingerprint

auth.json is rewritten by credential-pool/OAuth token refresh roughly every
14 minutes. _models_cache_source_fingerprint() hashed it via mtime/size
(nesquena#1699 _models_cache_file_fingerprint), so every token refresh churned the
fingerprint and the 24h /api/models cache was effectively dead -- the hot
GET /api/session?resolve_model=1 path paid a cold ~11.5s rebuild every few
minutes (RCA t_d127953d residual nesquena#2, t_16551f61).

Add _auth_store_semantic_fingerprint(): content-hash auth.json with a
DENY-list of known credential-rotation-only keys (access/refresh token,
expiry, per-credential status/telemetry, request_count, save updated_at)
stripped. Deny-list (not allow-list) is deliberate -- any unknown field, or
a real provider/endpoint/model-set change (active_provider, a new
credential_pool entry, base_url, source, label, auth_type, the providers{}
block, ...) stays in the fingerprint and still correctly busts the cache.
Conservative fallbacks: missing file -> marked; unreadable/corrupt ->
stat-based fallback (never less safe than pre-fix). config.yaml keeps the
cheap stat fingerprint (deliberate edits, no timer churn).

Bidirectional invariant regression test (non-tautological -- the
end-to-end churn test flips RED when the auth_json axis is reverted to
stat-based): token-only churn keeps fingerprint byte-identical AND keeps a
valid disk cache loadable; active_provider change / new credential_pool
entry / changed base_url each flip the fingerprint AND reject the stale
disk cache. Measured: 5/5 cold rebuilds per 5 refresh cycles -> 0/5.

Tests: 9 new pass; 28 adjacent (nesquena#1699/nesquena#1633/display-resolver) pass;
54 models_cache/fingerprint suite pass.
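
A minimal sketch of a deny-list content fingerprint, assuming `auth.json` is ordinary JSON. The key names in `ROTATION_ONLY_KEYS` are illustrative, and the corrupt-input fallback here hashes the raw text rather than using the stat-based fallback described above.

```python
import hashlib
import json
from typing import Any

# Hypothetical deny-list: keys that change on token refresh only.
ROTATION_ONLY_KEYS = frozenset(
    {"access_token", "refresh_token", "expires_at", "request_count", "updated_at"}
)


def _strip_rotation_keys(value: Any) -> Any:
    """Recursively drop deny-listed keys; every other field is kept."""
    if isinstance(value, dict):
        return {
            k: _strip_rotation_keys(v)
            for k, v in value.items()
            if k not in ROTATION_ONLY_KEYS
        }
    if isinstance(value, list):
        return [_strip_rotation_keys(v) for v in value]
    return value


def semantic_fingerprint(raw: str) -> str:
    """Hash the auth store content with rotation-only fields removed."""
    try:
        doc = json.loads(raw)
    except ValueError:
        # Simplified conservative fallback for corrupt input.
        return "raw:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()
    canonical = json.dumps(
        _strip_rotation_keys(doc), sort_keys=True, separators=(",", ":")
    )
    return "sem:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because unknown keys are kept rather than dropped, a new field in the store changes the fingerprint by default, which is the conservative direction for cache invalidation.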
…n wakeup misroute root fix, Option 1)

RCA t_f62ff1e8: WebUI bound per-turn session identity only to the
process-global os.environ['HERMES_SESSION_KEY'] and released the env
lock before the agent ran; it never set any contextvar, so
get_current_session_key fell back to that racy slot. Two concurrent
WebUI turns raced on one slot -> a notify_on_complete spawn captured a
concurrent turn's session_id -> server-side wakeup turn started for the
wrong session (agent.log:6632).

_set/_reset/_bind_turn_session_identity bind the turn's sid to
tools.approval._approval_session_key + gateway.session_context._SESSION_KEY
in the _run_agent_streaming worker thread (concurrent tool batches
inherit via copy_context); reset-token restore in the same outer
finally as the env restore. Does NOT use blanket set_session_vars
(would break the notify_on_complete watcher gate). English-only source
(no-cjk guard); pre-init ordered to preserve the Issue nesquena#765 locator.
Core unchanged.

Regression: tests/test_xsession_wakeup_misroute.py (RED pre / GREEN post).
…session misroute defense-in-depth, Option 3)

Independent safety net at the wakeup-routing layer, decoupled from the
Option 1 root fix (different file/layer, separately revertible). On
_process_one, after PROCESS_SESSION_INDEX resolves the target,
_resolve_wakeup_target cross-checks it against an env-immune spawn
owner on the ProcessSession (forward-compatible duck-typing of
spawn_session_id/owner_session_id/turn_session_id since adding the
field to core ProcessSession is out of scope for this WebUI-only
change). Positive mismatch -> log ERROR + re-route the wakeup to the
true owner. Pure pass-through when no env-immune owner is available
(today's core, cron/CLI procs, pre-Option-1 spawns) so Option Z is
never suppressed on uncertainty.

Regression covered by tests/test_xsession_wakeup_misroute.py.
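
A sketch of the duck-typed cross-check, assuming the process object may or may not carry one of the three owner attributes named above. The function body is illustrative and not the actual `_resolve_wakeup_target`.

```python
import logging
from types import SimpleNamespace
from typing import Optional

log = logging.getLogger("wakeup")

# Candidate attribute names, probed in order (taken from the commit message).
_OWNER_ATTRS = ("spawn_session_id", "owner_session_id", "turn_session_id")


def resolve_wakeup_target(indexed_sid: str, proc: object) -> str:
    """Cross-check the indexed session against an env-immune spawn owner."""
    owner: Optional[str] = None
    for attr in _OWNER_ATTRS:
        value = getattr(proc, attr, None)
        if isinstance(value, str) and value:
            owner = value
            break
    if owner is None:
        return indexed_sid  # no owner recorded: pure pass-through
    if owner != indexed_sid:
        log.error("wakeup misroute: index=%s owner=%s", indexed_sid, owner)
        return owner  # re-route to the true owner
    return indexed_sid


# Stand-in process objects for demonstration.
legacy_proc = SimpleNamespace()  # no owner field, as in today's core
owned_proc = SimpleNamespace(spawn_session_id="s-true")
```

Only a positive mismatch changes the routing, so a process without any owner field behaves exactly as before.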
@Isla-Liu Isla-Liu force-pushed the feat/process-complete-event-isla branch from d76801c to 64b8294 Compare May 19, 2026 06:21
Isla-Liu pushed a commit to Isla-Liu/hermes-webui that referenced this pull request May 19, 2026
…own streams in _emit_to_session_streams (no cross-session broadcast)

@Isla-Liu Isla-Liu requested a review from Copilot May 19, 2026 06:25

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.

@Isla-Liu
Contributor Author

Isla-Liu commented May 19, 2026

If the rebase is clean, no maintainer intervention needed — just push the rebase + flip out of draft and we'll review.

Ping back if any of the Q1–Q4 decisions affect implementation enough that you need to revisit before rebasing.

@nesquena-hermes @nesquena
Hoping for some ideas or advice, thanks!

@Isla-Liu Isla-Liu changed the title from "feat(streaming): process_complete event for terminal(notify_on_complete=true) agent wakeup" to "feat(streaming): wake the agent server-side when a notify_on_complete background task finishes while the WebUI tab/session is idle" May 19, 2026
@Isla-Liu Isla-Liu requested a review from Copilot May 19, 2026 07:12

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.

Isla-Liu pushed a commit to Isla-Liu/hermes-webui that referenced this pull request May 21, 2026
…own streams in _emit_to_session_streams (no cross-session broadcast)

@Isla-Liu Isla-Liu force-pushed the feat/process-complete-event-isla branch from 1de6e8a to 6458542 Compare May 22, 2026 04:50
Isla-Liu and others added 2 commits May 22, 2026 13:31
Restores green CI on Python 3.11 / 3.12 / 3.13 for tests/test_xsession_wakeup_misroute.py, tests/test_process_complete_ab_coexistence.py::test_registry_completion_consumed_contract, and tests/test_session_channel_option_x.py::test_real_completion_event_shape_routes_to_session_channel.

Failure: `ModuleNotFoundError: No module named 'tools'` from imports `from tools.process_registry import ...` and `from tools.approval import ...`. Tests depend on the hermes-agent sibling project's `tools/` namespace; they now skip when hermes-agent is not installed, matching the repo's existing agent-dependent test convention.

Co-authored-by: Hermes Agent <hermes-agent@users.noreply.github.com>
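
A sketch of a skip-when-missing guard using only the standard library. The repo's actual convention is not shown here, and in a pytest suite `pytest.importorskip` is the usual native equivalent. `module_available` and the test class are hypothetical names.

```python
import importlib.util
import unittest


def module_available(name: str) -> bool:
    """Return True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError for dotted names.
        return False


HAS_AGENT_TOOLS = module_available("tools.process_registry")


@unittest.skipUnless(HAS_AGENT_TOOLS, "hermes-agent tools/ namespace not installed")
class RegistryContractTest(unittest.TestCase):
    def test_registry_importable(self) -> None:
        # Only runs when the sibling project is installed.
        import tools.process_registry  # noqa: F401
```

Guarding at class level reports the tests as skipped rather than as collection errors when the sibling project is absent.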