MCP server startup blocked by synchronous orphan session scan — causes connection timeout with large EventStore #310

@isac322

Summary

When the EventStore database grows large, the MCP server's synchronous cancel_orphaned_sessions() call during startup takes too long to complete, causing MCP clients (e.g., Claude Code) to time out before the JSON-RPC initialize handshake finishes.

Reproduction

  • Environment: ouroboros-ai 0.27.0, Claude Code CLI, Linux
  • DB: ~/.ouroboros/ouroboros.db — ~1.4 GB, 165K events, 109 sessions
  • Symptom: claude mcp list reports plugin:ouroboros:ouroboros as ✗ Failed to connect
  • Root cause: Server takes ~38s to produce first JSON-RPC response, exceeding the client's connection timeout.

Measured startup times (time to first JSON-RPC response)

| DB size | Events  | Sessions | Startup time |
|---------|---------|----------|--------------|
| 1.4 GB  | 165,187 | 109      | ~38 s        |
| 547 MB  | 57,184  | 22       | ~17 s        |
| 121 MB  | 12,129  | 5        | ~5 s         |
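
As a rough sanity check on the measurements above, dividing startup time by event count gives an average per-event cost. The numbers are copied from the table; the helper name ms_per_event is only illustrative.

```python
# Measurements copied from the table above: (events, startup seconds).
measurements = [
    (165_187, 38.0),
    (57_184, 17.0),
    (12_129, 5.0),
]


def ms_per_event(events: int, seconds: float) -> float:
    """Average startup cost per stored event, in milliseconds."""
    return seconds / events * 1000


costs = [round(ms_per_event(e, s), 2) for e, s in measurements]
print(costs)
```

The per-event cost stays within the same order of magnitude (roughly 0.2 to 0.4 ms) across all three databases, which is consistent with startup time being dominated by the number of events replayed rather than by a fixed overhead.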

Bottleneck analysis

The hot path in _run_mcp_server() (cli/commands/mcp.py):

await event_store.initialize()
repo = SessionRepository(event_store)
cancelled = await repo.cancel_orphaned_sessions()   # ← blocks serving
await server.serve()                                  # ← clients can connect only after this

find_orphaned_sessions() (orchestrator/session.py:866) performs:

  1. get_all_sessions() — loads all orchestrator.session.* events
  2. For each session → replay("session", session_id) — loads every event for that session into memory
  3. Iterates all events to determine status via _status_from_event()

This is O(S × E) where S = number of sessions and E = average events per session. The vast majority of sessions are already terminal (COMPLETED/CANCELLED/FAILED), yet they are still fully replayed every time.
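
The replay pattern described above can be sketched with an in-memory stand-in. This is not the actual ouroboros code: InMemoryEventStore, find_orphans, and the non-status event name are hypothetical, and the only point is to show that every event of every session is read even when the session is already terminal.

```python
from collections import defaultdict

TERMINAL = {"completed", "failed", "cancelled"}


class InMemoryEventStore:
    """Hypothetical stand-in for the real EventStore."""

    def __init__(self):
        self._events = defaultdict(list)  # session_id -> [event_type, ...]
        self.events_read = 0              # counts events loaded by replay()

    def append(self, session_id, event_type):
        self._events[session_id].append(event_type)

    def session_ids(self):
        return list(self._events)

    def replay(self, session_id):
        events = self._events[session_id]
        self.events_read += len(events)  # every event is loaded
        return events


def find_orphans(store):
    """Replay every session and keep those with no terminal event."""
    orphans = []
    for session_id in store.session_ids():           # S sessions
        status = "running"
        for event_type in store.replay(session_id):  # E events each
            suffix = event_type.rsplit(".", 1)[-1]
            if suffix in TERMINAL:
                status = suffix
        if status not in TERMINAL:
            orphans.append(session_id)
    return orphans


store = InMemoryEventStore()
for i in range(3):
    sid = f"s{i}"
    store.append(sid, "orchestrator.session.started")
    for _ in range(100):
        # Hypothetical non-status event, used only to add volume.
        store.append(sid, "orchestrator.progress.updated")
    if i < 2:
        store.append(sid, "orchestrator.session.completed")

orphans = find_orphans(store)
print(orphans, store.events_read)
```

In this toy setup only one of the three sessions is still open, yet all stored events are read to find it.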

Proposed optimizations

1. Defer orphan scan to after server starts serving (highest impact, simplest change)

Move cancel_orphaned_sessions() to a background task so the MCP handshake completes immediately:

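import asyncio

async def _deferred_cleanup(repo):
    # Give clients a moment to finish the initialize handshake first.
    await asyncio.sleep(2)
    await repo.cancel_orphaned_sessions()

# Keep a reference: the event loop holds only a weak reference to tasks,
# so an unreferenced task can be garbage-collected before it finishes.
cleanup_task = asyncio.create_task(_deferred_cleanup(repo))
await server.serve()  # clients can connect immediately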
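
A self-contained sketch of the same idea, using fake stand-ins so the ordering can be observed. FakeRepo, FakeServer, and the delays are assumptions made for illustration and do not reflect the real classes.

```python
import asyncio

order = []  # records the order in which things happen


class FakeRepo:
    """Hypothetical stand-in for SessionRepository."""

    async def cancel_orphaned_sessions(self):
        await asyncio.sleep(0.05)  # pretend the scan is slow
        order.append("cleanup done")
        return 0


class FakeServer:
    """Hypothetical stand-in for the MCP server."""

    async def serve(self):
        order.append("serving")
        await asyncio.sleep(0.2)  # pretend to serve for a while
        order.append("stopped")


async def _deferred_cleanup(repo, delay):
    await asyncio.sleep(delay)
    try:
        await repo.cancel_orphaned_sessions()
    except Exception as exc:  # a failed scan should not take the server down
        order.append(f"cleanup failed: {exc!r}")


async def main():
    repo, server = FakeRepo(), FakeServer()
    # Keep a reference so the task cannot be garbage-collected early.
    task = asyncio.create_task(_deferred_cleanup(repo, delay=0.01))
    await server.serve()
    await task  # make sure cleanup has finished before exiting


asyncio.run(main())
print(order)
```

One caveat: a background task only helps if the scan yields to the event loop. If cancel_orphaned_sessions() performs blocking I/O or long CPU-bound work without awaiting, it would still stall the handshake and would likely need to run in a thread (for example via asyncio.to_thread) or be combined with the SQL-level filtering proposed in this issue.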

2. SQL-level filtering to skip terminal sessions (avoids unnecessary replay)

Before calling replay() per session, use a single query to exclude every session that already has a terminal event:

SELECT DISTINCT aggregate_id FROM events
WHERE event_type = 'orchestrator.session.started'
  AND aggregate_id NOT IN (
    SELECT aggregate_id FROM events
    WHERE event_type IN (
      'orchestrator.session.completed',
      'orchestrator.session.failed',
      'orchestrator.session.cancelled'
    )
  )

Only truly RUNNING/PAUSED sessions (typically 0–1) would need full replay.
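
The query can be tried against an in-memory SQLite database. The two-column schema below is an assumption made for the demo; the real events table likely has more columns.

```python
import sqlite3

# Assumed minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (aggregate_id TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("s1", "orchestrator.session.started"),
        ("s1", "orchestrator.session.completed"),
        ("s2", "orchestrator.session.started"),
        ("s2", "orchestrator.session.failed"),
        ("s3", "orchestrator.session.started"),  # never reached a terminal state
    ],
)

QUERY = """
SELECT DISTINCT aggregate_id FROM events
WHERE event_type = 'orchestrator.session.started'
  AND aggregate_id NOT IN (
    SELECT aggregate_id FROM events
    WHERE event_type IN (
      'orchestrator.session.completed',
      'orchestrator.session.failed',
      'orchestrator.session.cancelled'
    )
  )
"""

# Only sessions without a terminal event remain as replay candidates.
candidates = [row[0] for row in conn.execute(QUERY)]
conn.close()
print(candidates)
```

If aggregate_id can ever be NULL, NOT IN returns no rows at all, so a NOT EXISTS subquery would be the safer form. An index covering event_type and aggregate_id would probably be needed for this to be fast on a 1.4 GB database.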

3. (Longer term) Snapshot table for session status

Maintain a lightweight read-model table updated on each status-changing event, eliminating the need for full event replay during orphan detection entirely.
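
A minimal sketch of such a read model, assuming SQLite and a hypothetical session_status table. The table name, the status labels, and the append_event helper are illustrative and not part of the existing codebase.

```python
import sqlite3

# Maps the status-changing event types named in this issue to a status label.
STATUS_BY_EVENT = {
    "orchestrator.session.started": "running",
    "orchestrator.session.completed": "completed",
    "orchestrator.session.failed": "failed",
    "orchestrator.session.cancelled": "cancelled",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (aggregate_id TEXT, event_type TEXT)")
# Hypothetical read-model table: one row per session.
conn.execute(
    "CREATE TABLE session_status (session_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)


def append_event(conn, session_id, event_type):
    """Append an event and update the snapshot in the same transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO events VALUES (?, ?)", (session_id, event_type))
        status = STATUS_BY_EVENT.get(event_type)
        if status is not None:
            conn.execute(
                "INSERT OR REPLACE INTO session_status VALUES (?, ?)",
                (session_id, status),
            )


def find_orphans(conn):
    """Orphan detection becomes a single indexed lookup, with no replay."""
    rows = conn.execute(
        "SELECT session_id FROM session_status "
        "WHERE status = 'running' ORDER BY session_id"
    )
    return [row[0] for row in rows]


append_event(conn, "s1", "orchestrator.session.started")
append_event(conn, "s1", "orchestrator.session.completed")
append_event(conn, "s2", "orchestrator.session.started")

orphans = find_orphans(conn)
conn.close()
print(orphans)
```

Writing the event and the snapshot row in one transaction keeps the two from drifting apart; an existing database would additionally need a one-time backfill of the snapshot table.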

Current workaround

Manually trim old events and vacuum the database:

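import os
import sqlite3

conn = sqlite3.connect(os.path.expanduser("~/.ouroboros/ouroboros.db"))
# Delete events older than the cutoff (assumes ISO-8601 text timestamps).
conn.execute("DELETE FROM events WHERE timestamp < '2025-01-01'")
conn.commit()
# VACUUM must run outside a transaction; it rebuilds the file to reclaim space.
conn.execute("VACUUM")
conn.close()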
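
Because the delete is irreversible, it may be worth copying the database first. The sketch below wraps the workaround in a hypothetical helper, backup_then_trim, and uses the sqlite3 online backup API; it assumes the events table has a text timestamp column that sorts chronologically, and it runs against a throwaway database rather than the real one.

```python
import os
import sqlite3
import tempfile


def backup_then_trim(db_path: str, backup_path: str, cutoff: str) -> int:
    """Copy the database, then delete events older than cutoff.

    Returns the number of deleted rows.
    """
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # full copy of the database (Python 3.7+)
        cur = src.execute("DELETE FROM events WHERE timestamp < ?", (cutoff,))
        src.commit()
        src.execute("VACUUM")  # reclaim the freed pages
        return cur.rowcount
    finally:
        dst.close()
        src.close()


# Demo against a temporary database, not ~/.ouroboros/ouroboros.db.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE events (timestamp TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?)", [("2024-06-01",), ("2025-03-01",)]
    )
    conn.commit()
    conn.close()

    deleted = backup_then_trim(db, os.path.join(tmp, "demo.bak.db"), "2025-01-01")

print(deleted)
```

Note that a date-based delete can also remove the early events of sessions that are still open, so the MCP server should be stopped while trimming and the result checked before the backup is discarded.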


Labels: bug (Reproducible defect or broken behavior)
