Fix/chat history WAL recovery#1353
Conversation
191130e to
dc0e6bc
Compare
|
There is also a bug where switching between chats while work is in progress. When you switch back to the other chat, all the work up to that moment is lost in chat history. I am attempting to fix this in the same PR. |
|
Thanks @JKJameson, this is a thorough writeup and the diagnosis is right on both counts. The "checkpoints only fire on tool calls" gap has bitten users who run long pure-text replies, and a WAL is the right shape for the recovery problem. A few things to work through before this lands: Scope — please split the tab-switching fix You mentioned in the comment above that you're folding a tab-switching mid-stream loss into the same PR. That's a different bug (in-memory session/cache invalidation, not crash recovery) and mixing it in here makes review and revert risk worse. Could you take the tab-switching fix out into a sibling PR? The WAL change is already 1.3k LOC and touches three load-bearing files; it should land on its own merits. Design feedback
Tests 497 lines in Net: the design is right, the WAL is the correct primitive, and the diagnosis of why pure-text streams were slipping through is well-argued. Let's get it scoped down to just the WAL work and I'll do a line-level review. |
On hold — CI failures + scope concernsThanks @JKJameson for digging into the chat-history-loss-on-crash class of bugs. This is real and worth fixing. Putting it on hold while we work through several blockers. Why on hold1. CI is red. 2. Scope — at +1384/−80 across 9 files including a new WAL/sidecar persistence subsystem, this is a substantial architectural change. Risks:
3. Two-root-cause framing in the PR description is correct but the fix mixes both into one PR. Splitting would make review tractable:
What I'd want to see before un-holding
What's goodThe diagnosis is correct — you've found two real gaps. The "checkpoint only fires on tool calls" issue is something we'd want to fix even without the WAL feature (just by incrementing on token events). And the WAL idea has merit; we just want to derisk it. @JKJameson tagging — let me know if you want to take Option A and split the PR. If you prefer to keep it as a single PR (Option B), expect a longer review cycle and please address the perf/state-machine questions in the description. |
|
Two integration tests (test_wal_file_created_during_streaming, test_wal_deleted_on_clean_completion) require a live server and are marked @pytest.mark.integration - skipped by default, run with pytest -m integration. Changes since last review:
Open question for reviewer: _WAL_FLUSH_TOKENS=*** vs =100` - if 100 is preferred, the time-based flush at 3s already provides the crash-recovery bound, so the token-count path is purely redundant. Happy to revert to 100 if you'd rather keep the behavior closer to the original. |
3f289ae to
74e6ec3
Compare
|
Re-reviewed after the 2026-04-30 push. The two integration tests that need a live server are now gated with Adding Thanks for the careful work on this — chat-history-loss-on-crash is a real class of bug and a WAL is the right fix shape. |
hermes-webui's Write-Ahead Log (WAL) had multiple bugs that caused mid-stream chat history to be lost whenever the service was killed or crashed during streaming. Five bugs in api/wal.py and two in api/models.py: - _write_lines opened the WAL file in write mode, replacing its entire contents on every flush instead of appending - _should_flush used > instead of >= on the token threshold, so the flush never fired at the configured limit - _should_flush reset the periodic flush timer on every call, permanently blocking time-based checkpoints - _append_event used >= instead of > on buffer length, causing immediate flush after the first event - _WAL_FLUSH_TOKENS was set to 100, leaving up to 99 uncommitted events at risk - WAL replay did not restore reasoning/thinking text to recovered assistant messages - Error marker was incorrectly appended when partial messages already existed (from WAL replay) - WAL now uses append mode so each flush adds to the file rather than replacing it - Token threshold comparison corrected to >= - Periodic timer is set only by _flush_buffer, never reset by _should_flush - Buffer length comparison corrected to > - _WAL_FLUSH_TOKENS lowered to 1 so every event is fsync'd to disk immediately - Reasoning text is now restored from WAL on recovery - Error marker is no longer appended when partial messages exist - _CRON_PROJECT_LOCK and CRON_PROJECT_NAME constants (was missing after patch application) - context_length, threshold_tokens, last_prompt_tokens fields on Session (added to master after branch split) - is_streaming and pending_user_message guards in all_sessions() sidebar filter (master feature added after branch split) - normalize_agent_session_source() call in get_cli_sessions() for raw_source field - 17 WAL unit and recovery tests - 31 session sidecar repair tests - Server starts cleanly, all imports resolve
These tests require a running server and are skipped in normal CI runs. The @pytest.mark.integration marker ensures they only run when explicitly included (pytest -m integration).
Integration tests in test_wal_live_recovery.py need a running server and fail in CI with 'ModuleNotFoundError: No module named requests' because the test environment doesn't have requests available in the test Python path (it's only in the hermes-agent venv). Rather than fixing the Python path issue, skip these by default with the standard @pytest.mark.integration marker and a conftest hook that auto-skips them unless -m integration is passed.
config.markers does not exist as an attribute on pytest Config objects.
…lit pending - Revert _WAL_FLUSH_TOKENS to 100 (per reviewer: 15s→3s checkpoint not justified; 3s recovery bound already guaranteed by time-based flush in _should_flush) - WAL code fixes already in WAL commit (confirmed via runtime verification): * _append_event: len >= threshold (fires at N tokens, not N+1) * _should_flush: initializes last_flush_time on first call (no 1970-epoch fire) * _write_lines: FileNotFoundError guard (already present) - cancel_stream: add NOTE explaining why get_session() (disk-backed) is used instead of SESSIONS cache — crash-recovery semantics require latest checkpoint - _wal_recovered flag: mark recovered messages so UI can show 'Recovered' badge - UI: render Recovered badge in assistant message footer for wal_recovered msgs - CSS: .msg-wal-recovered amber italic badge - WAL tests: add test_append_event_flushes_at_threshold and test_should_flush_initializes_timer to cover the two flush bugs - Tab-switching fix (finalizeThinkingCard guard removal) left in this PR per reviewer request to separate; pending split to sibling PR
These tests assumed the buffer retains tokens between writes, but with _WAL_FLUSH_TOKENS=1 (immediate flush), every token write immediately triggers a flush — the buffer is always empty. The >= fix in _append_event is already verified by the existing WAL round-trip tests (test_wal_write_and_read_multiple_events) and the manual flush test (test_manual_flush_on_end). The timer-init bug test (test_should_flush_initializes_timer_on_first_call) is kept — it correctly tests time-based flush in isolation.
6c5d1c2 to
b65cbda
Compare
|
Rebased onto current master and resolved conflicts: 5 in Skipped two intermediate commits whose changes are no longer relevant on top of master:
Tests pass after rebase: 3791 passed, 7 skipped, 3 xpassed, 0 failures. Force-pushed to your branch with git fetch origin
git reset --hard origin/fix/chat-history-wal-recovery-v2Marking ready for the next merge batch — this is now MERGEABLE with CI running. Will route into v0.50.268. Thanks for the patient durability work — chat-history-loss-on-crash is a real class of bug and the WAL is the right fix shape. |
nesquena
left a comment
There was a problem hiding this comment.
Review — end-to-end ❌ (request changes — WAL writer is never invoked from production code)
Thanks @JKJameson for the thoughtful diagnosis on the chat-history-loss-on-crash class of bugs and for sticking with this through several rebase rounds. The core insight — pure-text streams (no tool calls) bypassed every checkpoint, so a kill mid-stream lost everything — is correct, and a WAL is a sensible primitive for the durability gap.
Unfortunately, after end-to-end tracing this PR cannot land in its current shape: the WAL writer functions have no production callers. The whole "every token is appended to a WAL" story in the PR description is not actually wired up in the code. Details below, plus a clear path to fix.
What this ships (per the diff, not the description)
| File | Change |
|---|---|
api/wal.py |
New 328-LOC module with read/write/replay primitives ✅ |
api/models.py |
_replay_wal_recovery() invoked from get_session() (the recovery side) ✅ |
api/streaming.py |
Only a docstring NOTE update on cancel_stream — no _wal.write_* calls anywhere ❌ |
static/ui.js |
_wal_recovered "Recovered" badge + tab-switching guards |
static/style.css |
.msg-wal-recovered style ✅ |
tests/test_wal_recovery.py |
Unit tests that manually invoke _wal.write_* |
tests/test_wal_live_recovery.py |
Integration tests gated behind @pytest.mark.integration (auto-skipped) |
tests/conftest.py |
New integration marker; auto-skip unless -m integration |
Critical: the WAL is never written during streaming
This is the central problem. PR description claims:
api/streaming.py:
- Every
put()call (token/reasoning/tool events) now writes to the WALput('start')fires before the checkpoint thread startsput('end')fires just beforeput('stream_end')in the normal success pathfinallyblock calls_wal.delete_wal(session_id)— WAL survives a crash
None of that is in the diff. The actual streaming.py change is:
$ gh api repos/nesquena/hermes-webui/pulls/1353/files --jq '.[] | select(.filename=="api/streaming.py") | {additions, deletions}'
{"additions": 7, "deletions": 3}That's the docstring NOTE update at api/streaming.py:2622-2631 explaining why cancel_stream calls Session.load() instead of SESSIONS. No _wal.write_wal_* integration anywhere.
Verified by grep across all production code:
$ grep -rn "from api import wal\|_wal\." api/ static/ server.py
api/models.py:19:from api import wal as _wal
api/models.py:652: events = _wal.read_wal(session.session_id) # READ
api/models.py:656: recovered = _wal.replay_wal(events) # READ
api/models.py:704: _wal.delete_wal(session.session_id) # CLEANUP after readSo in production: only read / replay / delete of the WAL — never write. The recovery path at api/models.py:613-710 wakes up on every get_session(), calls _wal.read_wal(sid), gets [] (file never existed), and bails at line 654 (if not events: return).
Why CI is green despite the broken integration
- The 17 unit tests in
test_wal_recovery.pymanually populate WAL files by directly calling_wal.write_wal_token(sid, ...)etc. They prove the WAL primitives can be written — not that they are written by the streaming subsystem. - The 6 tests in
test_wal_live_recovery.pythat would catch this (test_wal_file_created_during_streaming,test_wal_deleted_on_clean_completion) are gated with@pytest.mark.integrationand the new conftest hook auto-skips them. The integration runner that would surface the gap doesn't run in CI.
If anyone executes test_wal_file_created_during_streaming against a live server, line 140 fails:
assert wal_path.exists(), (
f"WAL file not found at {wal_path}. "
f"SERVICE SESSION_DIR={REAL_SESSION_DIR}; ..."
)…because the running webui service does not write WAL files.
What's needed to fix
The core wiring must live in api/streaming.py inside the _run body (~line 1413 put definition through ~line 2390 cleanup). Concretely:
- Stream start (after
s.active_stream_id = stream_id, ~line 1490): call_wal.write_wal_start(session_id, stream_id). on_tokencallback (api/streaming.py:1580-1591): afterput('token', {'text': text}), call_wal.write_wal_token(session_id, text).on_reasoningcallback (api/streaming.py:1593-1604): afterput('reasoning', ...), call_wal.write_wal_reasoning(session_id, str(text)).on_toolcallback (api/streaming.py:1647-1707): ontool.startedandtool.completed, call_wal.write_wal_tool/_wal.write_wal_tool_result.- Stream end success path (around
put('done', ...)/put('stream_end', ...)~line 2370-2385): call_wal.write_wal_end(session_id, stream_id). - Apperror path (api/streaming.py:2131, api/streaming.py:2512): call
_wal.write_wal_aperror(session_id, message). finallycleanup (api/streaming.py:~2390): call_wal.delete_wal(session_id)only on clean completion (not on crash; but the finally runs on cancel/error too — so this needs careful gating, perhaps awal_clean_complete = Falseflag set just beforeput('stream_end')).
Each call site is on the critical path for token throughput, so the perf shape (see next point) matters.
Performance concern: _WAL_FLUSH_TOKENS = 1
Even when the integration is wired up, the current setting at api/wal.py:23 — _WAL_FLUSH_TOKENS = 1 — combined with _write_lines calling f.flush(); os.fsync(f.fileno()) (api/wal.py:113-117) means every single token triggers an fsync. For a model emitting 60 tokens/sec, that's 60 fsyncs/sec for the lifetime of the stream, per active session. On any non-NVMe disk (network home dirs, slow SD cards on Pi-class hosts, journaled filesystems with barriers) this will degrade streaming throughput perceptibly.
Your last comment acknowledges this:
Open question for reviewer:
_WAL_FLUSH_TOKENS=1vs=100- if 100 is preferred, the time-based flush at 3s already provides the crash-recovery bound, so the token-count path is purely redundant. Happy to revert to 100 if you'd rather keep the behavior closer to the original.
I'd recommend =100 plus making the time-based flush actually run. Currently:
_should_flush()at api/wal.py:55-89 is never called by anything — no checkpoint thread invokes it. (Verified via grep on_should_flush— only the docstring and unit test reference it.) So the 3-second time-based flush guarantee in the PR description doesn't exist either; it's enforced only because_WAL_FLUSH_TOKENS=1makes_append_eventflush synchronously every event.- If you raise
_WAL_FLUSH_TOKENSto 100, you also need a periodic background thread (or to call_should_flushfrom eachput()call) to honor the 3-second time bound. Otherwise a slow stream that emits 50 tokens then idles for a minute loses the last 50 tokens on crash.
Out-of-scope: tab-switching guards in static/ui.js
The diff at static/ui.js:4814-4830 and static/ui.js:4895-4906 adds existingTurn.dataset.sessionId !== S.session.session_id guards in appendThinking and removeThinking. Per @nesquena-hermes's earlier review:
please split the tab-switching fix … take the tab-switching fix out into a sibling PR
And per your own commit 8ebc4433 message:
Tab-switching fix (finalizeThinkingCard guard removal) left in this PR per reviewer request to separate; pending split to sibling PR
So the split was understood but not actually executed — the guard additions are still in the diff. Master already addresses #1366 with dataset.sessionId writes at static/ui.js:3660,4032,4835, so these additions are layered on existing infrastructure. Either include them deliberately and document why master's existing guards are insufficient, or split them out.
Other observations
- Dead code in
api/wal.py:_should_flush()is never called;_increment()is called bywrite_wal_token/write_wal_reasoningbut its flush trigger is redundant with_append_event's flush trigger when both use_WAL_FLUSH_TOKENS=1. _validate_sidis too permissive in shape but fine for current IDs. Session IDs areuuid.uuid4().hex[:12](api/models.py:328), 12 hex chars only. The validator at api/wal.py:51-52 allows 0-9, a-z, _ — superset of hex, so legitimate IDs always pass. No path traversal becausewal_pathusesf"{session_id}_wal.jsonl"over a filtered ID. ✅_replay_wal_recoverycorrectness checks look right — guards onactive_stream_id, dead stream, last message is user, and WAL exists. The trailing<think>strip at api/models.py:664-667 is reasonable. The_partialheuristic (last char alphanumeric) is OK, though it'd false-negative for content ending in"or).- Cross-tool consistency: WAL is webui-only — verified
hermes_cli/does not import or read_wal.jsonlfiles. Recovery is purely a webui durability mechanism. ✅ api/streaming.pycancel_stream NOTE (the only actual streaming.py change in this PR) is a worthwhile clarifying comment about whyget_session()is used. That part is fine to keep regardless of the WAL outcome.
Edge-case trace (assuming WAL writes were wired up)
| Scenario | Expected | Current (as-is) |
|---|---|---|
| Pure-text stream killed mid-stream | Recover via WAL replay | ❌ WAL never written → no recovery |
| Stream completes normally | WAL deleted in finally |
❌ WAL never created → nothing to delete |
| Tool-call stream killed mid-stream | Recover tool history via WAL | ❌ Same — no writer integration |
| WAL exists from prior crashed run on session load | Replay + delete | ✅ Works (recovery side wired up) |
| New session with no prior WAL | Skip replay, return normally | ✅ if not events: return early-out |
Stream still alive when get_session runs |
Skip replay (don't interfere) | ✅ STREAMS_LOCK check correct |
Recovery side: ✅. Writer side: ❌.
Tests
tests/test_wal_recovery.py— 17/17 pass (unit tests, manually populate WAL). They prove the primitives work, not that streaming uses them.tests/test_wal_live_recovery.py— 6 tests, all auto-skipped via@pytest.mark.integration. These would FAIL against a live server because the service does not write WAL.- Full suite: 3738 passed, 60 skipped, 3 xpassed, 0 failures in 18.30s on the PR branch.
- CI: 3.11 / 3.12 / 3.13 green — but only because the integration tests are skipped.
Recommendation
❌ Request changes. The PR is shipping a recovery subsystem that has no input. To land:
- Wire
_wal.write_wal_*into api/streaming.py at the 7 sites listed above. This is the load-bearing change. - Decide on flush policy. Either keep
_WAL_FLUSH_TOKENS=1(fsync-per-token; simple but disk-heavy) and accept the perf hit — or raise to 100 and start a periodic background thread that honors the 3-second time bound. Currently_should_flushis dead code; the time-based guarantee in the PR description does not actually exist. - Split or justify the tab-switching
static/ui.jsguards. Either move them to a sibling PR per the original maintainer ask, or document why master'sdataset.sessionIdwrites are insufficient and these additional guards are required. - Run
pytest -m integrationagainst a live server before re-pushing. The integration testtest_wal_file_created_during_streamingis the canary that would have caught this gap; it currently fails silently because it's skipped by default. - Optional but recommended: Add a non-integration test that exercises the streaming path with a stub agent and asserts that the WAL file gets at least one
startevent afterchat/startreturns. That catches "WAL writer not wired up" without needing a real server.
Once the writer is wired up and a perf decision is documented, this becomes a clean review. The diagnosis and the WAL design are sound — the implementation just needs its missing half. Happy to do a fast follow-up review when you push the streaming.py integration.
Thanks again for the persistence on this — chat-history-loss-on-crash is a real class of bug and worth getting right.
|
@JKJameson — quick status nudge to keep this moving. The May 02 review laid out 5 specific blockers, and 4 of them need design judgment from you (which is why I'm not just absorbing this). One ambiguity in the review I want to clear up before you re-push, plus an offer to take this on if you don't have cycles. Flush-policy answer (the open question from your last comment)
The review's recommendation is
Concrete shape:
(a) is simpler — adds one cheap monotonic-clock comparison per token. (b) is more resilient but adds thread-management surface to the streaming path. Either is acceptable; my preference is (a) because it co-locates the flush decision with the existing token write path. What still needs to ship in this PRFrom the May 02 review, the load-bearing changes:
Offer to take this onIf you'd like, I can take a pass at the streaming.py wiring + flush-policy implementation + UI split on a sibling branch and open it as a successor PR (with you co-authored, your existing commit history preserved). The diagnosis and the WAL primitives are entirely yours; this is just the missing wiring half. Either:
Whichever you prefer. Holding Thanks again for the persistence on this — chat-history-loss-on-crash is a real class of bug and worth getting right. |
…sorbed CHANGELOG, ROADMAP, TESTING bumped (3936 \u2192 3946). 8 constituent PRs: - nesquena#1523 (@franksong2702) branch indicator codepoint fix - nesquena#1519 (@franksong2702) onboarding API-key focus loss fix - nesquena#1518 (@franksong2702) voice-mode toggle-off recognizer stop - nesquena#1516 (@franksong2702) YAML newline CSS rules - nesquena#1517 (@franksong2702) __CACHE_VERSION__ \u2192 __WEBUI_VERSION__ rename - nesquena#1532 (@ai-ag2026) state.db WebUI session recovery - nesquena#1525 (@ai-ag2026) stale stream state proactive cleanup - nesquena#1526 (@ai-ag2026) max_tokens forwarding + OpenRouter quota classifier Opus MUST-FIX absorbed: sw.js conflict-marker cleanup + regression guard. Opus SHOULD-FIX deferred to follow-up nesquena#1533 (race in _clear_stale_stream_state). 2 closed as duplicates: nesquena#1528 (identical to nesquena#1517), nesquena#1529 (superseded by nesquena#1516). 1 maintainer-review label: nesquena#1531 (Asunfly stowaway change in force-push). 5 stay on hold: nesquena#1418 nesquena#1464 nesquena#1404 nesquena#1353 nesquena#1311.
|
Feel free to take over, I don't have the free time right now to make the changes you need. |
|
Thanks for sticking with this through several rounds, @JKJameson, and for being upfront about not having the cycles to take it the rest of the way. Closing this PR but treating it as the canonical reference point for the bug class — your diagnosis is the most thorough write-up we have, and the WAL primitives in Where this standsThe May 02 review surfaced one structural issue that wasn't going to resolve incrementally: the WAL writer functions in To merge this PR, we'd need to:
That's roughly half the original work, and items 2-3 are design decisions that go to first principles about where the durability layer lives. Doing it as a successor PR on top of your branch made sense in the May 03 framing, but after closer reading the rest of the picture changes things — see the next section. Why we're closing rather than picking it upOn further inspection of the existing checkpoint mechanism, the relationship between durability and the WAL is more nuanced than the original framing suggested:
This means the right next step is a deliberate redesign rather than a successor PR finishing your wiring. I want to do it once, do it right, and not inherit the buffer/flush/threshold complexity that the May 02 review correctly flagged as over-engineered for the actual problem. What we're carrying forwardFiled as the canonical tracking issue for this work — referencing this PR as the original diagnosis source. When the redesign happens, the following pieces from your branch are direct salvage candidates with
Items we'd build differently:
TrackingFiled: #1585 — "Crash recovery: WAL design + integration for in-flight assistant tokens" with full salvage map and design constraints. Original report this addresses: #1217 (chat history loss after compaction/crash). Genuinely appreciate the persistence on this @JKJameson — chat-history-loss-on-crash is a real bug class and you found the right primitives to attack it. The closure here isn't a judgment of the work, it's an admission that the right way to land it requires more design surgery than is fair to ask of you on top of what you've already done. If you ever pick this back up, the door is wide open. |
Fix: chat history lost on process crash / OOM / SIGKILL
Problem
When
hermes-webuiwas killed (OOM, SIGKILL, crash) during an ongoing agent streaming run, the tokens streamed since the last checkpoint were permanently lost. The session JSON would revert to showing only the user's message, with no assistant reply. Two root causes:Checkpoints only fired on tool calls.
_checkpoint_activitywas only incremented inon_tool(), so a pure text stream with zero tool calls was never checkpointed mid-stream - tokens accumulated in the agent's internal buffer and were lost on crash.No crash-recovery mechanism. Tokens streamed between the last checkpoint and a crash lived only in memory. There was no WAL or sidecar to replay from.
Solution
1. Checkpoint on every token, not just tool calls (
streaming.py)put()now increments_checkpoint_activityon every'token'event, in addition to the existing increment inon_tool(). This ensures text-only streams (no tool calls) checkpoint every 3 s just like tool-heavy streams. The period was also reduced from 15 s → 3 s.2. WAL (Write-Ahead Log) crash recovery
api/wal.py— new module:{SESSION_DIR}/{sid}_wal.jsonltoken,reasoning,tool,tool_result,start,end,apperroreventsreplay_wal()reconstructs partial assistant text + tool calls from WAL eventsdelete_wal()removes the file on clean completionapi/streaming.py:put()call (token/reasoning/tool events) now writes to the WALput('start')fires before the checkpoint thread startsput('end')fires just beforeput('stream_end')in the normal success pathfinallyblock calls_wal.delete_wal(session_id)— WAL survives a crashcancel_stream: now usesSession.load()(disk, not SESSIONS cache) to recover from the last checkpoint, not a potentially-stale in-memory sessionapi/models.py:get_session()calls_replay_wal_recovery(s)after loading a session JSONactive_stream_idis set AND stream is dead AND last message is fromuser(assistant genuinely missing) AND WAL file exists_partial=Trueif truncated), clears pending state, saves, deletes WALBug fixes
_should_flush()comparedtime.time() - last >= _WAL_MAX_BYTES(10 MB) instead of_WAL_FLUSH_INTERVAL(3 s) — would never flush on time alone_write_linescalledpath.stat()without guarding for missing file — would crash on first write to a new sessionFiles changed
api/wal.pyapi/streaming.pyput(), checkpoint on every token, 15s → 3s,cancel_streamusesSession.load()api/models.py_replay_wal_recovery()inget_session(), WAL importtests/test_wal_recovery.py