perf(#310): defer orphan scan and skip terminal sessions at MCP startup#311
perf(#310): defer orphan scan and skip terminal sessions at MCP startup#311
Conversation
…artup The MCP server startup was blocked by `cancel_orphaned_sessions()` which replays the full event history for every session to determine its status. With a large EventStore (e.g. 165K events / 1.4 GB), this took ~38 seconds — far exceeding MCP client connection timeouts. Two changes: 1. **Defer orphan cleanup to a background task** (`cli/commands/mcp.py`): The scan now runs via `asyncio.create_task()` after `server.serve()` begins, so the JSON-RPC handshake completes immediately. 2. **SQL-level pre-filter for terminal sessions** (`session.py` + `event_store.py`): New `get_non_terminal_session_ids()` method uses an index-only subquery to exclude sessions that already have a completed/failed/cancelled event. Only the (typically 0–1) truly active sessions are replayed. Measured improvement (165K events, 109 sessions): - Before: ~38 s to first JSON-RPC response - After: <1 s (orphan scan runs in background) Closes Q00#310 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
da30ee1| Triggered by: PR opened by @isac322
Branch: perf/defer-orphan-scan | 3 files, +94/-24 | CI: not checked in this environment
Issue #310 Requirements
| Requirement | Status |
|---|---|
| MCP startup must stop blocking the JSON-RPC initialize handshake on a full orphan scan | MET — _run_mcp_server() now initializes the store and schedules orphan cleanup in a background task before entering server.serve(), so startup no longer waits on cancel_orphaned_sessions() (src/ouroboros/cli/commands/mcp.py:141-145, src/ouroboros/cli/commands/mcp.py:210-227) |
| Orphan detection should stop replaying obviously terminal sessions | MET — find_orphaned_sessions() now prefilters candidates through get_non_terminal_session_ids() and only replays sessions that still lack a terminal event (src/ouroboros/orchestrator/session.py:895-900, src/ouroboros/persistence/event_store.py:405-456) |
| The optimization should remain runtime-agnostic | MET — the cleanup path still uses heartbeat-based liveness checks and does not reintroduce runtime-specific process-name coupling (src/ouroboros/orchestrator/session.py:947-959) |
Previous Review Follow-up
First review — no previous findings.
Blocking Findings
No verified merge-blocking issues found in the current diff.
Test Coverage
I could not execute the targeted test suites in this sandbox because the environment lacked a usable local test setup (pytest was not installed directly, and uv run then failed on external package resolution/network restrictions). That means this approval is code-reviewed but not runtime-verified here.
Coverage that would be worth confirming in CI:
- startup cleanup behavior still reports the expected stderr warnings/messages after the move to a background task (
tests/unit/cli/test_mcp_startup_cleanup.py) - non-terminal session filtering handles mixed histories correctly (started → paused/running vs started → completed/failed/cancelled) (
src/ouroboros/persistence/event_store.py:405-456) - deferred cleanup does not leave noisy background-task warnings on fast shutdown paths (
src/ouroboros/cli/commands/mcp.py:213-227)
Design Notes
This is a good trade: it removes the user-visible startup timeout from the hot path while still preserving the cleanup mechanism as best-effort maintenance. The SQL prefilter also narrows the expensive replay work to sessions that are plausibly still active, which addresses the actual scaling failure rather than just hiding it behind async scheduling.
Follow-up Recommendations
- Add/adjust unit tests for the new deferred-startup behavior so the contract matches the implementation, especially around stderr output timing and cleanup task execution.
- If startup races around orphan visibility become user-visible later, consider a short post-bind delay or an explicit “cleanup scheduled” signal rather than moving the scan back onto the handshake path.
- Longer-term, a session-status read model/snapshot table would remove the remaining replay cost entirely.
Files Reviewed
src/ouroboros/cli/commands/mcp.pysrc/ouroboros/orchestrator/session.pysrc/ouroboros/persistence/event_store.py
Reviewed by ouroboros-agent[bot] via Codex deep analysis
- Add `get_non_terminal_session_ids` mock to all session test fixtures - Add `await asyncio.sleep(0)` in MCP startup tests to let background cleanup task execute - Update `test_cleanup_failure_does_not_prevent_startup` to fail cleanup (not initialize), matching the new control flow - Fix ruff formatting in event_store.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
8d8d5c1| Triggered by: new push by @isac322
Branch: perf/defer-orphan-scan | 5 files, +135/-27 | CI: MyPy Type Check pass 43s https://github.com/Q00/ouroboros/actions/runs/23941675953/job/69829163759 ; Ruff Lint pass 13s https://github.com/Q00/ouroboros/actions/runs/23941675953/job/69829163778 ; Test Python 3.12 pass 1m9s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/69829163855 ; Test Python 3.13 pass 1m13s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/69829163866 ; Test Python 3.14 pass 1m28s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/6982916
Issue #310 Requirements
| Requirement | Status |
|---|---|
| Environment: ouroboros-ai 0.27.0, Claude Code CLI, Linux | CHECK REVIEW — See findings below against current diff. |
| DB: ~/.ouroboros/ouroboros.db — ~1.4 GB, 165K events, 109 sessions | CHECK REVIEW — See findings below against current diff. |
Symptom: claude mcp list → `plugin:ouroboros:ouroboros — ✗ Failed to conne |
CHECK REVIEW — See findings below against current diff. |
| Root cause: Server takes ~38s to produce first JSON-RPC response, exceeding | CHECK REVIEW — See findings below against current diff. |
get_all_sessions() — loads all orchestrator.session.* events |
CHECK REVIEW — See findings below against current diff. |
Previous Review Follow-up
| Previous Finding | Status |
|---|---|
| Prior bot review exists. | Re-checked against current head; see blocking findings and design notes. |
Blocking Findings
| # | File:Line | Severity | Confidence | Finding |
|---|---|---|---|---|
| 1 | src/ouroboros/cli/commands/mcp.py:226 | High | High | Deferring cancel_orphaned_sessions() into an uncoordinated background task reintroduces a correctness race that did not exist before: the server can now accept MCP requests before startup cleanup finishes, so a client that immediately resumes or otherwise reactivates a stale session can still be auto-cancelled if the orphan scan snapshots it first. |
| 2 | src/ouroboros/cli/commands/mcp.py:226 | Medium | High | The orphan-scan task handle is discarded and never awaited or cancelled on shutdown. If server.serve() exits quickly or raises, the process can leave a pending task behind or cancel it mid-write against the shared EventStore, which matches the “fast shutdown” risk the previous review called out. |
| 3 | tests/unit/cli/test_mcp_startup_cleanup.py:98 | Low | High | The new startup-cleanup tests only yield the loop once with await asyncio.sleep(0) and then assert on background-task side effects. That is not a deterministic way to wait for cancel_orphaned_sessions() to finish, so these tests can miss leaked-task behavior and become timing-sensitive as soon as the mocked cleanup path gains another await. |
| 4 | tests/unit/orchestrator/test_session.py:1044 | Low | High | Several orphan-detection tests that are meant to cover completed/failed/cancelled sessions no longer exercise the new SQL prefilter at all. They rely on the fixture default get_non_terminal_session_ids=[], so the method exits before replay and the assertions no longer validate the behavior this PR actually changed. |
| 5 | src/ouroboros/persistence/event_store.py:448 | Low | Medium | The new candidate query only considers aggregates that have an explicit orchestrator.session.started event. The previous enumeration walked all orchestrator.session.* lifecycle events, so any historical/imported stream missing started but still carrying later lifecycle events will now be silently invisible to orphan cleanup. |
Test Coverage
Missing or weakened coverage in this update:
- tests/unit/cli/test_mcp_startup_cleanup.py:76 does not verify fast-shutdown behavior for the deferred cleanup task, so the pending-task lifecycle at src/ouroboros/cli/commands/mcp.py:226 is untested.
- tests/unit/cli/test_mcp_startup_cleanup.py:97 and the similar cases at lines 138/169/200/247 use a single
asyncio.sleep(0)instead of synchronizing on the background task, which makes the assertions timing-dependent. - tests/unit/orchestrator/test_session.py:1044 and the neighboring completed/failed/cancelled cases no longer prove that
get_non_terminal_session_ids()excludes explicitly terminal sessions, because the fixture default short-circuits the new code path. - There is no regression test for the new startup race where a stale session is resumed immediately after MCP startup while deferred cleanup is still scanning; that is the main behavioral change introduced by src/ouroboros/cli/commands/mcp.py:210.
Design Notes
The SQL-side prefilter in EventStore.get_non_terminal_session_ids() is a good direction: it attacks the real replay-cost problem instead of only hiding it. The risky part is moving orphan cancellation from a pre-serve maintenance step into a fire-and-forget concurrent task. That improves initialize latency, but it also changes the contract from “cleanup before serving” to “cleanup eventually, racing with normal traffic,” and the implementation currently does not manage that task’s lifetime. Prior review had no findings, so there are no prior blockers to mark maintained/modified/withdrawn; this pass is flagging newly relevant concerns in the updated implementation.
Follow-up Recommendations
- Re-run the impacted test subset locally/CI after addressing the blocking findings.
- Add regression coverage for the changed control flow and newly introduced edge cases.
Files Reviewed
src/ouroboros/cli/commands/mcp.pysrc/ouroboros/orchestrator/session.pysrc/ouroboros/persistence/event_store.pytests/unit/cli/test_mcp_startup_cleanup.pytests/unit/orchestrator/test_session.py
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
The performance goal is absolutely the right one. Improving MCP startup latency on large event stores fits the current direction of the repo very well, and the SQL-side prefilter is especially promising because it attacks the real replay-cost problem instead of just masking it. That said, I would not merge the current shape of this PR. The reason is that the optimization now changes a correctness contract, not just a performance characteristic:
So my maintainer take is:
I would be much more comfortable with one of these shapes:
In other words: good performance direction, but the current implementation trades too much correctness clarity for latency. |
Summary
cancel_orphaned_sessions()to anasyncio.create_task()background task so the MCP JSON-RPC handshake completes immediatelyEventStore.get_non_terminal_session_ids()— a single SQL subquery that skips sessions with terminal events, avoiding expensive full-replay for completed/failed/cancelled sessionsMeasured improvement (165K events, 109 sessions, 1.4 GB DB)
server.serve()Changes
cli/commands/mcp.py— move orphan cleanup afterserver.serve()via background taskorchestrator/session.py— useget_non_terminal_session_ids()instead ofget_all_sessions()+ per-session replaypersistence/event_store.py— newget_non_terminal_session_ids()method with index-only subqueryCloses #310