Summary
When the EventStore database grows large, the MCP server's synchronous cancel_orphaned_sessions() call during startup takes too long to complete, causing MCP clients (e.g., Claude Code) to time out before the JSON-RPC initialize handshake finishes.
Reproduction
- Environment: ouroboros-ai 0.27.0, Claude Code CLI, Linux
- DB: ~/.ouroboros/ouroboros.db — ~1.4 GB, 165K events, 109 sessions
- Symptom:
claude mcp list → plugin:ouroboros:ouroboros — ✗ Failed to connect
- Root cause: the server takes ~38 s to produce its first JSON-RPC response, exceeding the client's connection timeout.
Measured startup times (time to first JSON-RPC response)
| DB size | Events | Sessions | Startup time |
| --- | --- | --- | --- |
| 1.4 GB | 165,187 | 109 | ~38 s |
| 547 MB | 57,184 | 22 | ~17 s |
| 121 MB | 12,129 | 5 | ~5 s |
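For anyone who wants to reproduce these numbers, the following is a minimal sketch of how "time to first JSON-RPC response" could be measured. It assumes the server speaks newline-delimited JSON-RPC over stdio and accepts a standard MCP initialize request. The function name time_to_first_response, the clientInfo values, and the protocol version string are illustrative choices, not something taken from the ouroboros code base, and the command used to launch the server is left to the caller.

```python
import json
import subprocess
import time


def time_to_first_response(cmd):
    """Spawn a stdio MCP server and time its first reply to `initialize`.

    `cmd` is the argv list that starts the server (supplied by the caller).
    Returns (elapsed_seconds, parsed_reply_or_None).
    """
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            # Assumed values; adjust to whatever the client under test sends.
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "startup-timer", "version": "0.0.0"},
        },
    }
    start = time.monotonic()
    with subprocess.Popen(
        cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    ) as proc:
        try:
            proc.stdin.write(json.dumps(request) + "\n")
            proc.stdin.flush()
            line = proc.stdout.readline()  # blocks until the first line arrives
            elapsed = time.monotonic() - start
        finally:
            proc.kill()  # only the first response matters here
    reply = json.loads(line) if line.strip() else None
    return elapsed, reply


# Usage (placeholder argv, replace with the real server command):
# elapsed, reply = time_to_first_response(["<server-command>", "<args>"])
```

The timer starts before the process is spawned, so interpreter startup is included in the result, which matches what an MCP client experiences.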
Bottleneck analysis
The hot path in _run_mcp_server() (cli/commands/mcp.py):
await event_store.initialize()
repo = SessionRepository(event_store)
cancelled = await repo.cancel_orphaned_sessions() # ← blocks serving
await server.serve() # ← clients can connect only after this
find_orphaned_sessions() (orchestrator/session.py:866) performs:
- get_all_sessions(): loads all orchestrator.session.* events
- For each session → replay("session", session_id): loads every event for that session into memory
- Iterates all events to determine status via _status_from_event()
This is O(S × E) where S = number of sessions and E = average events per session. The vast majority of sessions are already terminal (COMPLETED/CANCELLED/FAILED), yet they are still fully replayed every time.
Proposed optimizations
1. Defer orphan scan to after server starts serving (highest impact, simplest change)
Move cancel_orphaned_sessions() to a background task so the MCP handshake completes immediately:
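import asyncio

async def _deferred_cleanup(repo):
    await asyncio.sleep(2)  # let clients complete the handshake first
    await repo.cancel_orphaned_sessions()

# Keep a reference so the task is not garbage-collected before it finishes.
cleanup_task = asyncio.create_task(_deferred_cleanup(repo))
await server.serve()  # clients can connect immediately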
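To show the ordering this change is meant to produce, here is a small self-contained sketch. FakeRepo and FakeServer are hypothetical stand-ins for SessionRepository and the MCP server, and the delays are shortened for illustration. The sketch also adds two things the snippet above does not have, both of which are assumptions about what a real patch might want: an exception guard so a failed scan does not go unnoticed, and cancellation of the cleanup task when the server stops.

```python
import asyncio


class FakeRepo:
    """Hypothetical stand-in for SessionRepository."""

    def __init__(self, log):
        self.log = log

    async def cancel_orphaned_sessions(self):
        await asyncio.sleep(0.05)  # simulate the slow orphan scan
        self.log.append("cleanup done")
        return []


class FakeServer:
    """Hypothetical stand-in for the MCP server."""

    def __init__(self, log):
        self.log = log

    async def serve(self):
        self.log.append("serving")
        await asyncio.sleep(0.2)  # simulate handling clients for a while


async def deferred_cleanup(repo, delay):
    await asyncio.sleep(delay)
    try:
        await repo.cancel_orphaned_sessions()
    except Exception as exc:  # a failed scan should not go unreported
        print(f"orphan cleanup failed: {exc!r}")


async def run(log):
    repo, server = FakeRepo(log), FakeServer(log)
    # Hold a reference: the event loop keeps only weak references to tasks.
    cleanup_task = asyncio.create_task(deferred_cleanup(repo, delay=0.01))
    try:
        await server.serve()
    finally:
        cleanup_task.cancel()  # no effect if the cleanup already finished
        await asyncio.gather(cleanup_task, return_exceptions=True)


log = []
asyncio.run(run(log))
print(log)  # ['serving', 'cleanup done']
```

Note that a background task only helps if the scan yields to the event loop while it runs. If the scan does long stretches of blocking work, requests arriving during the scan would still stall.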
2. SQL-level filtering to skip terminal sessions (avoids unnecessary replay)
Before calling replay() per session, use a single query to exclude sessions that already have a terminal event:
SELECT DISTINCT aggregate_id FROM events
WHERE event_type = 'orchestrator.session.started'
  AND aggregate_id NOT IN (
    SELECT aggregate_id FROM events
    WHERE event_type IN (
      'orchestrator.session.completed',
      'orchestrator.session.failed',
      'orchestrator.session.cancelled'
    )
  )
Only truly RUNNING/PAUSED sessions (typically 0–1) would need full replay.
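The query can be tried against an in-memory SQLite database. The sketch below assumes a simplified events table with only aggregate_id and event_type columns; the real schema will have more. It makes two additions that are not part of the proposal itself: an IS NOT NULL guard in the subquery, because NOT IN matches no rows at all if the subquery yields a NULL, and a composite index (the name idx_events_type_agg is made up) that may let SQLite avoid a full table scan if no equivalent index exists already.

```python
import sqlite3

TERMINAL = (
    "orchestrator.session.completed",
    "orchestrator.session.failed",
    "orchestrator.session.cancelled",
)

ACTIVE_SESSIONS_SQL = """
SELECT DISTINCT aggregate_id FROM events
WHERE event_type = 'orchestrator.session.started'
  AND aggregate_id NOT IN (
    SELECT aggregate_id FROM events
    WHERE event_type IN (?, ?, ?)
      AND aggregate_id IS NOT NULL
  )
ORDER BY aggregate_id
"""

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the two columns the query needs.
conn.execute("CREATE TABLE events (aggregate_id TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("s1", "orchestrator.session.started"),
        ("s1", "orchestrator.session.completed"),
        ("s2", "orchestrator.session.started"),
        ("s2", "orchestrator.session.failed"),
        ("s3", "orchestrator.session.started"),  # no terminal event yet
        ("s4", "orchestrator.session.started"),
        ("s4", "orchestrator.session.cancelled"),
    ],
)
# Hypothetical index covering both the outer filter and the subquery.
conn.execute(
    "CREATE INDEX idx_events_type_agg ON events (event_type, aggregate_id)"
)

active = [row[0] for row in conn.execute(ACTIVE_SESSIONS_SQL, TERMINAL)]
print(active)  # ['s3']
```

Only s3 would then be passed to replay(), instead of all four sessions.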
3. (Longer term) Snapshot table for session status
Maintain a lightweight read-model table updated on each status-changing event, eliminating the need for full event replay during orphan detection entirely.
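One possible shape for such a read model, as a rough sketch only: the table name session_status, its columns, the helper names, and the event-to-status mapping are all assumptions for illustration. Only the four event type strings and the status names come from this report.

```python
import sqlite3

# Assumed mapping from status-changing event types to session status.
STATUS_BY_EVENT = {
    "orchestrator.session.started": "RUNNING",
    "orchestrator.session.completed": "COMPLETED",
    "orchestrator.session.failed": "FAILED",
    "orchestrator.session.cancelled": "CANCELLED",
}


def create_read_model(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_status ("
        " session_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " updated_at TEXT NOT NULL)"
    )


def record_event(conn, session_id, event_type, timestamp):
    """Update the read model; ignore events that do not change status."""
    status = STATUS_BY_EVENT.get(event_type)
    if status is None:
        return
    conn.execute(
        "INSERT OR REPLACE INTO session_status VALUES (?, ?, ?)",
        (session_id, status, timestamp),
    )


def find_active_sessions(conn):
    """Orphan candidates: one indexed lookup, no event replay."""
    rows = conn.execute(
        "SELECT session_id FROM session_status "
        "WHERE status IN ('RUNNING', 'PAUSED') ORDER BY session_id"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_read_model(conn)
record_event(conn, "s1", "orchestrator.session.started", "2025-01-01T00:00:00")
record_event(conn, "s1", "orchestrator.session.completed", "2025-01-01T00:05:00")
record_event(conn, "s2", "orchestrator.session.started", "2025-01-02T00:00:00")
record_event(conn, "s2", "some.other.event", "2025-01-02T00:01:00")  # ignored
active = find_active_sessions(conn)
print(active)  # ['s2']
```

In a real implementation the read-model write would presumably happen in the same transaction as the event append, so the two cannot drift apart, and existing databases would need a one-time backfill from the event log.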
Current workaround
Manually trim old events and vacuum the database:
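import os
import sqlite3

# Stop the MCP server and back up the file first: this permanently deletes history.
conn = sqlite3.connect(os.path.expanduser("~/.ouroboros/ouroboros.db"))
conn.execute("DELETE FROM events WHERE timestamp < '2025-01-01'")
conn.commit()  # VACUUM cannot run inside an open transaction
conn.execute("VACUUM")  # rebuild the file to reclaim the freed pages
conn.close()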