[BUG] Session history lost on server restart — Fixed with OpenClaw assistance #1471

@Triyarms

Diagnosed, fixed, and verified by the OpenClaw AI agent.

1. Problem Description

When the Hermes Web UI server restarts during an active agent task (whether restarted manually from a terminal or after a crash), the session history is lost. After the restart:

  • The UI shows "Connection Error" or an empty chat.
  • Only the last user message is visible.
  • The assistant's partial response and tool usage history are lost.

2. Root Cause Analysis

The issue is a combination of server-side and client-side problems:

  1. Server-side: During streaming, messages are held in memory. If the server restarts, these are lost unless explicitly saved to disk.
  2. Server-side: active_stream_id persists in session JSON files, causing the client to attempt reconnecting to a dead stream.
  3. Client-side: loadSession in sessions.js would clear S.messages on connection errors.
  4. Client-side: boot.js lacked retry logic for server restart scenarios.

3. The Comprehensive Fix

We applied fixes across 5 files to ensure full session persistence and proper reconnection:

A. api/models.py — Read from Disk, not Cache
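When a full session (with messages) is requested, get_session now loads it from disk first and refreshes the in-memory cache entry (SESSIONS). The cache is consulted only for metadata-only requests or when the disk load returns nothing.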

--- a/api/models.py
+++ b/api/models.py
@@ -674,6 +674,15 @@ def get_session(sid, metadata_only=False):
      messages list. Use this when you only need compact() metadata and not the
      actual message history (e.g., for fast sidebar switching).
      """
+    # If loading full session (with messages), skip in-memory cache to avoid
+    # returning stale data that hasn't been flushed to disk yet.
+    if not metadata_only:
+        s = Session.load(sid)
+        if s:
+            with LOCK:
+                SESSIONS[sid] = s
+                SESSIONS.move_to_end(sid)
+            return s
     with LOCK:
         if sid in SESSIONS:
             SESSIONS.move_to_end(sid)  # LRU: mark as recently used

B. api/streaming.py — Periodic Save During Streaming
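A daemon thread saves the session to disk every 30 seconds while streaming is in progress, so a restart loses at most roughly the last 30 seconds of partial output.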

--- a/api/streaming.py
+++ b/api/streaming.py
@@ -1391,6 +1391,24 @@ def _run_agent_streaming(
     # Register this stream with the global streaming meter
     meter().begin_session(stream_id)
 
+    # ── Periodic Session Save (prevents loss on restart) ──
+    _save_stop = threading.Event()
+    _partial_save_interval = 30  # seconds
+
+    def _periodic_save():
+        while not _save_stop.wait(_partial_save_interval):
+            try:
+                if s and s.messages:
+                    s.save()
+                    logger.debug('Periodic session save completed')
+            except Exception as e:
+                logger.debug(f'Periodic save failed: {e}')
+
+    _save_thread = threading.Thread(target=_periodic_save, daemon=True)
+    _save_thread.start()
+
     # Metering ticker... (rest of the code)
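
The hunk above starts the save thread but does not show where `_save_stop` is set. The sketch below shows one way the full lifecycle could look, assuming the stop event is set in a `finally` block when streaming ends. `start_periodic_save` and `run_stream` are hypothetical names, and `save` stands in for `s.save()`.

```python
import threading


def start_periodic_save(save, interval, log=print):
    """Call save() every `interval` seconds until the returned stop() is called."""
    stop_event = threading.Event()

    def _loop():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval):
            try:
                save()
            except Exception as exc:
                # Keep the loop alive; a failed save should not end streaming.
                log(f"Periodic save failed: {exc}")

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()

    def stop():
        stop_event.set()
        thread.join()

    return stop


def run_stream(chunks, messages, save, interval=30):
    """Hypothetical stand-in for the streaming loop."""
    stop = start_periodic_save(save, interval)
    try:
        for chunk in chunks:
            messages.append(chunk)
    finally:
        stop()  # stop the saver even if streaming raises
        save()  # final flush so the last chunks are not lost
```

Without the `stop()` call, each stream would leave a saver thread running until the process exits. Logging save failures at a level above debug may also be worth considering, since a silently failing save defeats the purpose of the patch.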

C. server.py — Clear Stale Stream IDs on Startup
Clears active_stream_id from all session files on server startup.

--- a/server.py
+++ b/server.py
@@ -XX,6 +XX,18 @@ def main():
     STATE_DIR.mkdir(parents=True, exist_ok=True)
     SESSION_DIR.mkdir(parents=True, exist_ok=True)
 
+    # ── Clear stale active_stream_id on startup ──
+    try:
+        import json as _json
+        _cleared = 0
+        for _f in SESSION_DIR.glob('*.json'):
+            try:
+                _d = _json.loads(_f.read_text(encoding='utf-8'))
+                if _d.get('active_stream_id'):
+                    _d['active_stream_id'] = None
+                    _f.write_text(_json.dumps(_d, ensure_ascii=False, indent=2), encoding='utf-8')
+                    _cleared += 1
+            except Exception:
+                pass  # skip unreadable or malformed session files
+        if _cleared:
+            print(f'  [startup] Cleared active_stream_id from {_cleared} session(s) after restart.', flush=True)
+    except Exception as e:
+        print(f'  [startup] Warning: failed to clear stale stream IDs: {e}', flush=True)
+
     # Start the gateway session watcher...
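
The startup cleanup rewrites session files in place, so a crash in the middle of `write_text` could leave a truncated file. A possible hardening, sketched below, writes to a temporary file and renames it over the original. `write_json_atomic` and `clear_stale_stream_ids` are hypothetical helpers, not part of the patch.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path, data):
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(OSError):
            os.unlink(tmp_name)
        raise


def clear_stale_stream_ids(session_dir):
    """Set active_stream_id to None in every session file that has one; return the count."""
    cleared = 0
    for path in Path(session_dir).glob("*.json"):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        if isinstance(data, dict) and data.get("active_stream_id"):
            data["active_stream_id"] = None
            write_json_atomic(path, data)
            cleared += 1
    return cleared
```

The `.tmp` suffix keeps temporary files out of the `*.json` glob if a previous run was interrupted.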

D. static/sessions.js & static/boot.js
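In sessions.js, loadSession no longer clears S.messages when it hits a connection error. In boot.js, retry logic was added so the client reconnects once the server is back after a restart.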

4. Verification Steps

  1. Apply all patches.
  2. Start the server.
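  3. Give the agent a long-running task (e.g., "Write 5 anecdotes").
  4. While the agent is responding, restart the server (kill the process, then run python3 server.py). Waiting at least 30 seconds before the restart gives the periodic save a chance to run.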
  5. Expected Result: Server clears stale stream IDs, browser retries connection, full history is visible.
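
To check the expected result on disk as well as in the browser, a small report over the session directory can be compared before and after the restart. This is a sketch only: `session_report` is a hypothetical helper, and the `messages` key in the session JSON is an assumption based on the `s.messages` attribute used in the patches.

```python
import json
from pathlib import Path


def session_report(session_dir):
    """Map session id -> (active_stream_id, message count) for each readable session file."""
    report = {}
    for path in sorted(Path(session_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        if not isinstance(data, dict):
            continue
        messages = data.get("messages") or []  # assumed key name
        report[path.stem] = (data.get("active_stream_id"), len(messages))
    return report
```

After a restart with the patches applied, every entry should show `None` for the stream id, and the message count for the interrupted session should be greater than one.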

Metadata

Labels: bug, help wanted, sprint-candidate
