Skip to content

perf(#310): defer orphan scan and skip terminal sessions at MCP startup#311

Open
isac322 wants to merge 2 commits intoQ00:mainfrom
isac322:perf/defer-orphan-scan
Open

perf(#310): defer orphan scan and skip terminal sessions at MCP startup#311
isac322 wants to merge 2 commits intoQ00:mainfrom
isac322:perf/defer-orphan-scan

Conversation

@isac322
Copy link
Copy Markdown

@isac322 isac322 commented Apr 3, 2026

Summary

  • Defer cancel_orphaned_sessions() to an asyncio.create_task() background task so the MCP JSON-RPC handshake completes immediately
  • Add EventStore.get_non_terminal_session_ids() — a single SQL subquery that skips sessions with terminal events, avoiding expensive full-replay for completed/failed/cancelled sessions

Measured improvement (165K events, 109 sessions, 1.4 GB DB)

Before After
Time to first JSON-RPC response ~38 s < 1 s
Orphan scan Blocks server.serve() Runs in background

Changes

  • cli/commands/mcp.py — move orphan cleanup after server.serve() via background task
  • orchestrator/session.py — use get_non_terminal_session_ids() instead of get_all_sessions() + per-session replay
  • persistence/event_store.py — new get_non_terminal_session_ids() method with index-only subquery

Closes #310

…artup

The MCP server startup was blocked by `cancel_orphaned_sessions()` which
replays the full event history for every session to determine its status.
With a large EventStore (e.g. 165K events / 1.4 GB), this took ~38 seconds
— far exceeding MCP client connection timeouts.

Two changes:

1. **Defer orphan cleanup to a background task** (`cli/commands/mcp.py`):
   The scan now runs via `asyncio.create_task()` after `server.serve()`
   begins, so the JSON-RPC handshake completes immediately.

2. **SQL-level pre-filter for terminal sessions** (`session.py` +
   `event_store.py`): New `get_non_terminal_session_ids()` method uses
   an index-only subquery to exclude sessions that already have a
   completed/failed/cancelled event. Only the (typically 0–1) truly
   active sessions are replayed.

Measured improvement (165K events, 109 sessions):
- Before: ~38 s to first JSON-RPC response
- After:  <1 s (orphan scan runs in background)

Closes Q00#310

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit da30ee1 | Triggered by: PR opened by @isac322

Branch: perf/defer-orphan-scan | 3 files, +94/-24 | CI: not checked in this environment

Issue #310 Requirements

Requirement Status
MCP startup must stop blocking the JSON-RPC initialize handshake on a full orphan scan MET_run_mcp_server() now initializes the store and schedules orphan cleanup in a background task before entering server.serve(), so startup no longer waits on cancel_orphaned_sessions() (src/ouroboros/cli/commands/mcp.py:141-145, src/ouroboros/cli/commands/mcp.py:210-227)
Orphan detection should stop replaying obviously terminal sessions METfind_orphaned_sessions() now prefilters candidates through get_non_terminal_session_ids() and only replays sessions that still lack a terminal event (src/ouroboros/orchestrator/session.py:895-900, src/ouroboros/persistence/event_store.py:405-456)
The optimization should remain runtime-agnostic MET — the cleanup path still uses heartbeat-based liveness checks and does not reintroduce runtime-specific process-name coupling (src/ouroboros/orchestrator/session.py:947-959)

Previous Review Follow-up

First review — no previous findings.

Blocking Findings

No verified merge-blocking issues found in the current diff.

Test Coverage

I could not execute the targeted test suites in this sandbox because the environment lacked a usable local test setup (pytest was not installed directly, and uv run then failed on external package resolution/network restrictions). That means this approval is code-reviewed but not runtime-verified here.

Coverage that would be worth confirming in CI:

  • startup cleanup behavior still reports the expected stderr warnings/messages after the move to a background task (tests/unit/cli/test_mcp_startup_cleanup.py)
  • non-terminal session filtering handles mixed histories correctly (started → paused/running vs started → completed/failed/cancelled) (src/ouroboros/persistence/event_store.py:405-456)
  • deferred cleanup does not leave noisy background-task warnings on fast shutdown paths (src/ouroboros/cli/commands/mcp.py:213-227)

Design Notes

This is a good trade: it removes the user-visible startup timeout from the hot path while still preserving the cleanup mechanism as best-effort maintenance. The SQL prefilter also narrows the expensive replay work to sessions that are plausibly still active, which addresses the actual scaling failure rather than just hiding it behind async scheduling.

Follow-up Recommendations

  • Add/adjust unit tests for the new deferred-startup behavior so the contract matches the implementation, especially around stderr output timing and cleanup task execution.
  • If startup races around orphan visibility become user-visible later, consider a short post-bind delay or an explicit “cleanup scheduled” signal rather than moving the scan back onto the handshake path.
  • Longer-term, a session-status read model/snapshot table would remove the remaining replay cost entirely.

Files Reviewed

  • src/ouroboros/cli/commands/mcp.py
  • src/ouroboros/orchestrator/session.py
  • src/ouroboros/persistence/event_store.py

Reviewed by ouroboros-agent[bot] via Codex deep analysis

- Add `get_non_terminal_session_ids` mock to all session test fixtures
- Add `await asyncio.sleep(0)` in MCP startup tests to let background
  cleanup task execute
- Update `test_cleanup_failure_does_not_prevent_startup` to fail cleanup
  (not initialize), matching the new control flow
- Fix ruff formatting in event_store.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 8d8d5c1 | Triggered by: new push by @isac322

Branch: perf/defer-orphan-scan | 5 files, +135/-27 | CI: MyPy Type Check pass 43s https://github.com/Q00/ouroboros/actions/runs/23941675953/job/69829163759 ; Ruff Lint pass 13s https://github.com/Q00/ouroboros/actions/runs/23941675953/job/69829163778 ; Test Python 3.12 pass 1m9s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/69829163855 ; Test Python 3.13 pass 1m13s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/69829163866 ; Test Python 3.14 pass 1m28s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/6982916

Issue #310 Requirements

Requirement Status
Environment: ouroboros-ai 0.27.0, Claude Code CLI, Linux CHECK REVIEW — See findings below against current diff.
DB: ~/.ouroboros/ouroboros.db — ~1.4 GB, 165K events, 109 sessions CHECK REVIEW — See findings below against current diff.
Symptom: claude mcp list → `plugin:ouroboros:ouroboros — ✗ Failed to conne CHECK REVIEW — See findings below against current diff.
Root cause: Server takes ~38s to produce first JSON-RPC response, exceeding CHECK REVIEW — See findings below against current diff.
get_all_sessions() — loads all orchestrator.session.* events CHECK REVIEW — See findings below against current diff.

Previous Review Follow-up

Previous Finding Status
Prior bot review exists. Re-checked against current head; see blocking findings and design notes.

Blocking Findings

# File:Line Severity Confidence Finding
1 src/ouroboros/cli/commands/mcp.py:226 High High Deferring cancel_orphaned_sessions() into an uncoordinated background task reintroduces a correctness race that did not exist before: the server can now accept MCP requests before startup cleanup finishes, so a client that immediately resumes or otherwise reactivates a stale session can still be auto-cancelled if the orphan scan snapshots it first.
2 src/ouroboros/cli/commands/mcp.py:226 Medium High The orphan-scan task handle is discarded and never awaited or cancelled on shutdown. If server.serve() exits quickly or raises, the process can leave a pending task behind or cancel it mid-write against the shared EventStore, which matches the “fast shutdown” risk the previous review called out.
3 tests/unit/cli/test_mcp_startup_cleanup.py:98 Low High The new startup-cleanup tests only yield the loop once with await asyncio.sleep(0) and then assert on background-task side effects. That is not a deterministic way to wait for cancel_orphaned_sessions() to finish, so these tests can miss leaked-task behavior and become timing-sensitive as soon as the mocked cleanup path gains another await.
4 tests/unit/orchestrator/test_session.py:1044 Low High Several orphan-detection tests that are meant to cover completed/failed/cancelled sessions no longer exercise the new SQL prefilter at all. They rely on the fixture default get_non_terminal_session_ids=[], so the method exits before replay and the assertions no longer validate the behavior this PR actually changed.
5 src/ouroboros/persistence/event_store.py:448 Low Medium The new candidate query only considers aggregates that have an explicit orchestrator.session.started event. The previous enumeration walked all orchestrator.session.* lifecycle events, so any historical/imported stream missing started but still carrying later lifecycle events will now be silently invisible to orphan cleanup.

Test Coverage

Missing or weakened coverage in this update:

Design Notes

The SQL-side prefilter in EventStore.get_non_terminal_session_ids() is a good direction: it attacks the real replay-cost problem instead of only hiding it. The risky part is moving orphan cancellation from a pre-serve maintenance step into a fire-and-forget concurrent task. That improves initialize latency, but it also changes the contract from “cleanup before serving” to “cleanup eventually, racing with normal traffic,” and the implementation currently does not manage that task’s lifetime. Prior review had no findings, so there are no prior blockers to mark maintained/modified/withdrawn; this pass is flagging newly relevant concerns in the updated implementation.

Follow-up Recommendations

  • Re-run the impacted test subset locally/CI after addressing the blocking findings.
  • Add regression coverage for the changed control flow and newly introduced edge cases.

Files Reviewed

  • src/ouroboros/cli/commands/mcp.py
  • src/ouroboros/orchestrator/session.py
  • src/ouroboros/persistence/event_store.py
  • tests/unit/cli/test_mcp_startup_cleanup.py
  • tests/unit/orchestrator/test_session.py

Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Copy link
Copy Markdown
Collaborator

The performance goal is absolutely the right one. Improving MCP startup latency on large event stores fits the current direction of the repo very well, and the SQL-side prefilter is especially promising because it attacks the real replay-cost problem instead of just masking it.

That said, I would not merge the current shape of this PR.

The reason is that the optimization now changes a correctness contract, not just a performance characteristic:

  1. Moving orphan cleanup into a fire-and-forget background task introduces a real startup race.
    The server can now begin serving requests before cleanup finishes, which means stale-session cancellation can race with legitimate immediate traffic. That is a materially different behavior from the previous "cleanup before serving" contract.

  2. The background task lifecycle is not managed.
    The task handle is discarded, so there is no coordinated shutdown / cancellation / await path. In this codebase, once a task shares mutable persistence state, I want its lifetime to be explicit.

  3. The updated tests do not fully prove the new control flow.
    Yielding once with await asyncio.sleep(0) is not a strong synchronization point for background work, and several of the session tests look weaker against the new SQL-prefilter path than they were against the old replay-based path.

So my maintainer take is:

  • keep the SQL prefilter direction,
  • but do not land the uncoordinated deferred-cleanup behavior as-is.

I would be much more comfortable with one of these shapes:

  • a coordinated startup task whose lifecycle is tracked explicitly and shut down cleanly,
  • a narrower pre-serve cleanup that only touches the prefitered candidate set,
  • or a more explicit "cleanup scheduled / cleanup complete" contract if serving-before-cleanup is truly intended.

In other words: good performance direction, but the current implementation trades too much correctness clarity for latency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP server startup blocked by synchronous orphan session scan — causes connection timeout with large EventStore

2 participants