MCP server startup blocked by synchronous orphan session scan — causes connection timeout with large EventStore (#310)

## Summary

When the EventStore database grows large, the MCP server's synchronous `cancel_orphaned_sessions()` call during startup takes too long to complete, causing MCP clients (e.g., Claude Code) to time out before the JSON-RPC `initialize` handshake finishes.

## Reproduction

- **Environment:** ouroboros-ai 0.27.0, Claude Code CLI, Linux
- **DB:** ~/.ouroboros/ouroboros.db — ~1.4 GB, 165K events, 109 sessions
- **Symptom:** `claude mcp list` → `plugin:ouroboros:ouroboros — ✗ Failed to connect`
- **Root cause:** Server takes ~38s to produce first JSON-RPC response, exceeding the client's connection timeout.

### Measured startup times (time to first JSON-RPC response)

| DB size | Events | Sessions | Startup time |
|---------|--------|----------|-------------|
| 1.4 GB | 165,187 | 109 | **~38 s** |
| 547 MB | 57,184 | 22 | ~17 s |
| 121 MB | 12,129 | 5 | ~5 s |

## Bottleneck analysis

The hot path in `_run_mcp_server()` (`cli/commands/mcp.py`):

```python
await event_store.initialize()
repo = SessionRepository(event_store)
cancelled = await repo.cancel_orphaned_sessions()   # ← blocks serving
await server.serve()                                  # ← clients can connect only after this
```

The scan itself is done by `find_orphaned_sessions()` (`orchestrator/session.py:866`), which performs:

1. `get_all_sessions()` — loads all `orchestrator.session.*` events
2. For **each** session → `replay("session", session_id)` — loads **every** event for that session into memory
3. Iterates all events to determine status via `_status_from_event()`

This is **O(S × E)**, where S is the number of sessions and E is the average number of events per session. The vast majority of sessions are already terminal (COMPLETED/CANCELLED/FAILED), yet every one of them is fully replayed on each startup.
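
To make the cost concrete, here is a toy in-memory model of the three steps above. It is not the project's code: `Event`, `replay`, and `find_orphaned` are hypothetical stand-ins, and the model only counts how many events a scan touches.

```python
from dataclasses import dataclass

TERMINAL = {"completed", "failed", "cancelled"}


@dataclass(frozen=True)
class Event:
    session_id: str
    kind: str  # "started", "progress", "completed", "failed" or "cancelled"


def replay(store: list[Event], session_id: str) -> list[Event]:
    """Hypothetical stand-in for replay(): return every event of one session."""
    return [e for e in store if e.session_id == session_id]


def find_orphaned(store: list[Event]) -> tuple[list[str], int]:
    """Return (non-terminal session ids, number of events replayed)."""
    session_ids = sorted({e.session_id for e in store if e.kind == "started"})
    orphans: list[str] = []
    touched = 0
    for sid in session_ids:
        events = replay(store, sid)  # full replay, even for finished sessions
        touched += len(events)
        if not any(e.kind in TERMINAL for e in events):
            orphans.append(sid)
    return orphans, touched


# Three finished sessions with 100 progress events each, plus one still running.
store: list[Event] = []
for i in range(3):
    sid = f"done-{i}"
    store += [Event(sid, "started")] + [Event(sid, "progress")] * 100 + [Event(sid, "completed")]
store += [Event("live", "started"), Event("live", "progress")]

orphans, touched = find_orphaned(store)
print(orphans, touched, len(store))
```

In this model the scan replays all 308 stored events, although only the 2 events of the `live` session belong to a session that could be orphaned.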

## Proposed optimizations

### 1. Defer orphan scan to after server starts serving (highest impact, simplest change)

Move `cancel_orphaned_sessions()` to a background task so the MCP handshake completes immediately:

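```python
import asyncio

async def _deferred_cleanup(repo):
    # Give clients a moment to finish the initialize handshake first.
    await asyncio.sleep(2)
    await repo.cancel_orphaned_sessions()

# Keep a reference so the task is not garbage-collected before it runs.
cleanup_task = asyncio.create_task(_deferred_cleanup(repo))
await server.serve()  # clients can connect immediately
```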
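
The same pattern can be tried in isolation. The sketch below is runnable on its own and uses a hypothetical `FakeRepo` in place of `SessionRepository`; it also logs failures, since an exception inside a fire-and-forget task is otherwise easy to miss.

```python
import asyncio
import logging

log = logging.getLogger("mcp.startup")


class FakeRepo:
    """Hypothetical stand-in for SessionRepository."""

    def __init__(self) -> None:
        self.scanned = False

    async def cancel_orphaned_sessions(self) -> int:
        await asyncio.sleep(0.05)  # pretend the scan is slow
        self.scanned = True
        return 0


async def deferred_cleanup(repo: FakeRepo, delay: float) -> None:
    try:
        await asyncio.sleep(delay)
        cancelled = await repo.cancel_orphaned_sessions()
        log.info("cancelled %d orphaned sessions", cancelled)
    except Exception:
        # Log instead of letting the failure disappear with the task.
        log.exception("orphan cleanup failed")


async def main() -> tuple[bool, bool]:
    repo = FakeRepo()
    # Keep a reference to the task for as long as it should run.
    task = asyncio.create_task(deferred_cleanup(repo, delay=0.01))
    # Stand-in for server.serve(): the server is "up" before the scan has run.
    scanned_when_serving = repo.scanned
    await task  # the demo waits here; a real server would simply keep serving
    return scanned_when_serving, repo.scanned


result = asyncio.run(main())
print(result)
```

One caveat: a background task only helps if the scan yields to the event loop. If the underlying queries block the loop, the handshake could still stall while the scan runs, and moving the work off the loop (for example with `asyncio.to_thread`) may be needed. Whether that applies here depends on how the EventStore issues its queries, which this report does not establish.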

### 2. SQL-level filtering to skip terminal sessions (avoids unnecessary replay)

Before calling `replay()` per session, use a single query to exclude sessions that already have a terminal event:

```sql
SELECT DISTINCT aggregate_id FROM events
WHERE event_type = 'orchestrator.session.started'
  AND aggregate_id NOT IN (
    SELECT aggregate_id FROM events
    WHERE event_type IN (
      'orchestrator.session.completed',
      'orchestrator.session.failed',
      'orchestrator.session.cancelled'
    )
  )
```

Only truly RUNNING/PAUSED sessions (typically 0–1) would need full replay.
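
The query can be checked against a throwaway in-memory database. This is a minimal sketch: the two-column `events` table is an assumed, simplified schema, and `orchestrator.session.paused` is a hypothetical event type used only to represent a non-terminal status change.

```python
import sqlite3

NON_TERMINAL_SQL = """
SELECT DISTINCT aggregate_id FROM events
WHERE event_type = 'orchestrator.session.started'
  AND aggregate_id NOT IN (
    SELECT aggregate_id FROM events
    WHERE event_type IN (
      'orchestrator.session.completed',
      'orchestrator.session.failed',
      'orchestrator.session.cancelled'
    )
  )
ORDER BY aggregate_id
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real events table has more columns.
conn.execute(
    "CREATE TABLE events (aggregate_id TEXT NOT NULL, event_type TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("s1", "orchestrator.session.started"),
        ("s1", "orchestrator.session.completed"),
        ("s2", "orchestrator.session.started"),
        ("s2", "orchestrator.session.failed"),
        ("s3", "orchestrator.session.started"),
        ("s3", "orchestrator.session.paused"),  # hypothetical event type
        ("s4", "orchestrator.session.started"),
    ],
)

# Only sessions without a terminal event remain as replay candidates.
candidates = [row[0] for row in conn.execute(NON_TERMINAL_SQL)]
conn.close()
print(candidates)
```

Note that `NOT IN` matches nothing if the subquery ever yields a NULL `aggregate_id`; the demo table rules that out with `NOT NULL`, and a `NOT EXISTS` form would avoid the issue if the real column is nullable.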

### 3. (Longer term) Snapshot table for session status

Maintain a lightweight read-model table updated on each status-changing event, eliminating the need for full event replay during orphan detection entirely.

## Current workaround

Manually trim old events and vacuum the database:

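```python
import os
import sqlite3

# Back up the database first: this permanently deletes events.
conn = sqlite3.connect(os.path.expanduser("~/.ouroboros/ouroboros.db"))
conn.execute("DELETE FROM events WHERE timestamp < '2025-01-01'")
conn.commit()  # VACUUM cannot run inside an open transaction
conn.execute("VACUUM")  # reclaim the freed pages on disk
conn.close()
```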
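
Because the trim is destructive, a slightly more careful variant takes a backup before deleting. This is a sketch under assumptions: `trim_events` and `count_events` are hypothetical helpers, the `events` schema in the demo is simplified, and `timestamp` is assumed to hold ISO-8601 text as the string comparison in the workaround suggests. The demo runs against a temporary database, not the real one.

```python
import os
import sqlite3
import tempfile


def trim_events(db_path: str, cutoff: str, backup_path: str) -> int:
    """Back up db_path, delete events older than cutoff, return rows deleted."""
    conn = sqlite3.connect(db_path)
    try:
        backup = sqlite3.connect(backup_path)
        try:
            conn.backup(backup)  # full copy before anything is deleted
        finally:
            backup.close()
        cur = conn.execute("DELETE FROM events WHERE timestamp < ?", (cutoff,))
        deleted = cur.rowcount
        conn.commit()  # VACUUM cannot run inside an open transaction
        conn.execute("VACUUM")
        return deleted
    finally:
        conn.close()


def count_events(path: str) -> int:
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    finally:
        conn.close()


# Demo on a throwaway database with an assumed, simplified schema.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "demo.db")
    bak = os.path.join(tmp, "demo.bak.db")
    conn = sqlite3.connect(db)
    conn.execute(
        "CREATE TABLE events (aggregate_id TEXT, event_type TEXT, timestamp TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("s1", "orchestrator.session.started", "2024-06-01T00:00:00"),
            ("s2", "orchestrator.session.started", "2025-03-01T00:00:00"),
        ],
    )
    conn.commit()
    conn.close()

    deleted = trim_events(db, "2025-01-01", bak)
    remaining = count_events(db)
    backed_up = count_events(bak)

print(deleted, remaining, backed_up)
```

Stop the MCP server before running anything like this against the live database, so that no events are appended while the file is being trimmed.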
