perf(#310): defer orphan scan and skip terminal sessions at MCP startup by isac322 · Pull Request #311 · Q00/ouroboros

isac322 · 2026-04-03T09:24:43Z

Summary

Defer cancel_orphaned_sessions() to an asyncio.create_task() background task so the MCP JSON-RPC handshake completes immediately
Add EventStore.get_non_terminal_session_ids() — a single SQL subquery that skips sessions with terminal events, avoiding expensive full-replay for completed/failed/cancelled sessions

Measured improvement (165K events, 109 sessions, 1.4 GB DB)

	Before	After
Time to first JSON-RPC response	~38 s	< 1 s
Orphan scan	Blocks `server.serve()`	Runs in background

Changes

cli/commands/mcp.py — move orphan cleanup after server.serve() via background task
orchestrator/session.py — use get_non_terminal_session_ids() instead of get_all_sessions() + per-session replay
persistence/event_store.py — new get_non_terminal_session_ids() method with index-only subquery

Closes #310

…artup The MCP server startup was blocked by `cancel_orphaned_sessions()` which replays the full event history for every session to determine its status. With a large EventStore (e.g. 165K events / 1.4 GB), this took ~38 seconds — far exceeding MCP client connection timeouts. Two changes: 1. **Defer orphan cleanup to a background task** (`cli/commands/mcp.py`): The scan now runs via `asyncio.create_task()` after `server.serve()` begins, so the JSON-RPC handshake completes immediately. 2. **SQL-level pre-filter for terminal sessions** (`session.py` + `event_store.py`): New `get_non_terminal_session_ids()` method uses an index-only subquery to exclude sessions that already have a completed/failed/cancelled event. Only the (typically 0–1) truly active sessions are replayed. Measured improvement (165K events, 109 sessions): - Before: ~38 s to first JSON-RPC response - After: <1 s (orphan scan runs in background) Closes Q00#310 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ouroboros-agent

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit da30ee1 | Triggered by: PR opened by @isac322

Branch: perf/defer-orphan-scan | 3 files, +94/-24 | CI: not checked in this environment

Issue #310 Requirements

Requirement	Status
MCP startup must stop blocking the JSON-RPC initialize handshake on a full orphan scan	MET — `_run_mcp_server()` now initializes the store and schedules orphan cleanup in a background task before entering `server.serve()`, so startup no longer waits on `cancel_orphaned_sessions()` (`src/ouroboros/cli/commands/mcp.py:141-145`, `src/ouroboros/cli/commands/mcp.py:210-227`)
Orphan detection should stop replaying obviously terminal sessions	MET — `find_orphaned_sessions()` now prefilters candidates through `get_non_terminal_session_ids()` and only replays sessions that still lack a terminal event (`src/ouroboros/orchestrator/session.py:895-900`, `src/ouroboros/persistence/event_store.py:405-456`)
The optimization should remain runtime-agnostic	MET — the cleanup path still uses heartbeat-based liveness checks and does not reintroduce runtime-specific process-name coupling (`src/ouroboros/orchestrator/session.py:947-959`)

Previous Review Follow-up

First review — no previous findings.

Blocking Findings

No verified merge-blocking issues found in the current diff.

Test Coverage

I could not execute the targeted test suites in this sandbox because the environment lacked a usable local test setup (pytest was not installed directly, and uv run then failed on external package resolution/network restrictions). That means this approval is code-reviewed but not runtime-verified here.

Coverage that would be worth confirming in CI:

startup cleanup behavior still reports the expected stderr warnings/messages after the move to a background task (tests/unit/cli/test_mcp_startup_cleanup.py)
non-terminal session filtering handles mixed histories correctly (started → paused/running vs started → completed/failed/cancelled) (src/ouroboros/persistence/event_store.py:405-456)
deferred cleanup does not leave noisy background-task warnings on fast shutdown paths (src/ouroboros/cli/commands/mcp.py:213-227)

Design Notes

This is a good trade: it removes the user-visible startup timeout from the hot path while still preserving the cleanup mechanism as best-effort maintenance. The SQL prefilter also narrows the expensive replay work to sessions that are plausibly still active, which addresses the actual scaling failure rather than just hiding it behind async scheduling.

Follow-up Recommendations

Add/adjust unit tests for the new deferred-startup behavior so the contract matches the implementation, especially around stderr output timing and cleanup task execution.
If startup races around orphan visibility become user-visible later, consider a short post-bind delay or an explicit “cleanup scheduled” signal rather than moving the scan back onto the handshake path.
Longer-term, a session-status read model/snapshot table would remove the remaining replay cost entirely.

Files Reviewed

src/ouroboros/cli/commands/mcp.py
src/ouroboros/orchestrator/session.py
src/ouroboros/persistence/event_store.py

Reviewed by ouroboros-agent[bot] via Codex deep analysis

- Add `get_non_terminal_session_ids` mock to all session test fixtures - Add `await asyncio.sleep(0)` in MCP startup tests to let background cleanup task execute - Update `test_cleanup_failure_does_not_prevent_startup` to fail cleanup (not initialize), matching the new control flow - Fix ruff formatting in event_store.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ouroboros-agent

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 8d8d5c1 | Triggered by: new push by @isac322

Branch: perf/defer-orphan-scan | 5 files, +135/-27 | CI: MyPy Type Check pass 43s https://github.com/Q00/ouroboros/actions/runs/23941675953/job/69829163759 ; Ruff Lint pass 13s https://github.com/Q00/ouroboros/actions/runs/23941675953/job/69829163778 ; Test Python 3.12 pass 1m9s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/69829163855 ; Test Python 3.13 pass 1m13s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/69829163866 ; Test Python 3.14 pass 1m28s https://github.com/Q00/ouroboros/actions/runs/23941675959/job/6982916

Issue #310 Requirements

Requirement	Status
Environment: ouroboros-ai 0.27.0, Claude Code CLI, Linux	CHECK REVIEW — See findings below against current diff.
DB: ~/.ouroboros/ouroboros.db — ~1.4 GB, 165K events, 109 sessions	CHECK REVIEW — See findings below against current diff.
Symptom: `claude mcp list` → `plugin:ouroboros:ouroboros — ✗ Failed to conne	CHECK REVIEW — See findings below against current diff.
Root cause: Server takes ~38s to produce first JSON-RPC response, exceeding	CHECK REVIEW — See findings below against current diff.
`get_all_sessions()` — loads all `orchestrator.session.*` events	CHECK REVIEW — See findings below against current diff.

Previous Review Follow-up

Previous Finding	Status
Prior bot review exists.	Re-checked against current head; see blocking findings and design notes.

Blocking Findings

#	File:Line	Severity	Confidence	Finding
1	src/ouroboros/cli/commands/mcp.py:226	High	High	Deferring `cancel_orphaned_sessions()` into an uncoordinated background task reintroduces a correctness race that did not exist before: the server can now accept MCP requests before startup cleanup finishes, so a client that immediately resumes or otherwise reactivates a stale session can still be auto-cancelled if the orphan scan snapshots it first.
2	src/ouroboros/cli/commands/mcp.py:226	Medium	High	The orphan-scan task handle is discarded and never awaited or cancelled on shutdown. If `server.serve()` exits quickly or raises, the process can leave a pending task behind or cancel it mid-write against the shared `EventStore`, which matches the “fast shutdown” risk the previous review called out.
3	tests/unit/cli/test_mcp_startup_cleanup.py:98	Low	High	The new startup-cleanup tests only yield the loop once with `await asyncio.sleep(0)` and then assert on background-task side effects. That is not a deterministic way to wait for `cancel_orphaned_sessions()` to finish, so these tests can miss leaked-task behavior and become timing-sensitive as soon as the mocked cleanup path gains another await.
4	tests/unit/orchestrator/test_session.py:1044	Low	High	Several orphan-detection tests that are meant to cover completed/failed/cancelled sessions no longer exercise the new SQL prefilter at all. They rely on the fixture default `get_non_terminal_session_ids=[]`, so the method exits before replay and the assertions no longer validate the behavior this PR actually changed.
5	src/ouroboros/persistence/event_store.py:448	Low	Medium	The new candidate query only considers aggregates that have an explicit `orchestrator.session.started` event. The previous enumeration walked all `orchestrator.session.*` lifecycle events, so any historical/imported stream missing `started` but still carrying later lifecycle events will now be silently invisible to orphan cleanup.

Test Coverage

Missing or weakened coverage in this update:

tests/unit/cli/test_mcp_startup_cleanup.py:76 does not verify fast-shutdown behavior for the deferred cleanup task, so the pending-task lifecycle at src/ouroboros/cli/commands/mcp.py:226 is untested.
tests/unit/cli/test_mcp_startup_cleanup.py:97 and the similar cases at lines 138/169/200/247 use a single asyncio.sleep(0) instead of synchronizing on the background task, which makes the assertions timing-dependent.
tests/unit/orchestrator/test_session.py:1044 and the neighboring completed/failed/cancelled cases no longer prove that get_non_terminal_session_ids() excludes explicitly terminal sessions, because the fixture default short-circuits the new code path.
There is no regression test for the new startup race where a stale session is resumed immediately after MCP startup while deferred cleanup is still scanning; that is the main behavioral change introduced by src/ouroboros/cli/commands/mcp.py:210.

Design Notes

The SQL-side prefilter in EventStore.get_non_terminal_session_ids() is a good direction: it attacks the real replay-cost problem instead of only hiding it. The risky part is moving orphan cancellation from a pre-serve maintenance step into a fire-and-forget concurrent task. That improves initialize latency, but it also changes the contract from “cleanup before serving” to “cleanup eventually, racing with normal traffic,” and the implementation currently does not manage that task’s lifetime. Prior review had no findings, so there are no prior blockers to mark maintained/modified/withdrawn; this pass is flagging newly relevant concerns in the updated implementation.

Follow-up Recommendations

Re-run the impacted test subset locally/CI after addressing the blocking findings.
Add regression coverage for the changed control flow and newly introduced edge cases.

Files Reviewed

src/ouroboros/cli/commands/mcp.py
src/ouroboros/orchestrator/session.py
src/ouroboros/persistence/event_store.py
tests/unit/cli/test_mcp_startup_cleanup.py
tests/unit/orchestrator/test_session.py

Reviewed by ouroboros-agent[bot] via Codex deep analysis

shaun0927 · 2026-04-23T09:45:43Z

The performance goal is absolutely the right one. Improving MCP startup latency on large event stores fits the current direction of the repo very well, and the SQL-side prefilter is especially promising because it attacks the real replay-cost problem instead of just masking it.

That said, I would not merge the current shape of this PR.

The reason is that the optimization now changes a correctness contract, not just a performance characteristic:

Moving orphan cleanup into a fire-and-forget background task introduces a real startup race.
The server can now begin serving requests before cleanup finishes, which means stale-session cancellation can race with legitimate immediate traffic. That is a materially different behavior from the previous "cleanup before serving" contract.
The background task lifecycle is not managed.
The task handle is discarded, so there is no coordinated shutdown / cancellation / await path. In this codebase, once a task shares mutable persistence state, I want its lifetime to be explicit.
The updated tests do not fully prove the new control flow.
Yielding once with await asyncio.sleep(0) is not a strong synchronization point for background work, and several of the session tests look weaker against the new SQL-prefilter path than they were against the old replay-based path.

So my maintainer take is:

keep the SQL prefilter direction,
but do not land the uncoordinated deferred-cleanup behavior as-is.

I would be much more comfortable with one of these shapes:

a coordinated startup task whose lifecycle is tracked explicitly and shut down cleanly,
a narrower pre-serve cleanup that only touches the prefitered candidate set,
or a more explicit "cleanup scheduled / cleanup complete" contract if serving-before-cleanup is truly intended.

In other words: good performance direction, but the current implementation trades too much correctness clarity for latency.

ouroboros-agent Bot approved these changes Apr 3, 2026

View reviewed changes

ouroboros-agent Bot requested changes Apr 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(#310): defer orphan scan and skip terminal sessions at MCP startup#311

perf(#310): defer orphan scan and skip terminal sessions at MCP startup#311
isac322 wants to merge 2 commits intoQ00:mainfrom
isac322:perf/defer-orphan-scan

isac322 commented Apr 3, 2026

Uh oh!

ouroboros-agent Bot left a comment

Uh oh!

ouroboros-agent Bot left a comment

Uh oh!

shaun0927 commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

isac322 commented Apr 3, 2026

Summary

Measured improvement (165K events, 109 sessions, 1.4 GB DB)

Changes

Uh oh!

ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

Review — ouroboros-agent[bot]

Issue #310 Requirements

Previous Review Follow-up

Blocking Findings

Test Coverage

Design Notes

Follow-up Recommendations

Files Reviewed

Uh oh!

ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

Review — ouroboros-agent[bot]

Issue #310 Requirements

Previous Review Follow-up

Blocking Findings

Test Coverage

Design Notes

Follow-up Recommendations

Files Reviewed

Uh oh!

shaun0927 commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants