Conversation
CancelJob now requests session cancellation and also cancels the in-process job and runner tasks for linked session jobs. This lets cancellation propagate into runtime generators immediately so their subprocess cleanup handlers can terminate child CLI processes instead of waiting for another output event. Fixes #341 Services: shared Affected files: - src/ouroboros/mcp/job_manager.py - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
0dded05for PR #464
Review record:
326c9dd5-1e5b-4848-81c0-9c40b5408d64
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
The change is narrow and appropriate: cancel_job() now applies both cancellation mechanisms for linked-session jobs, which aligns the session-linked path with the existing in-process task lifecycle. The added integration test covers the previously missing regression path where a linked job’s subprocess would otherwise outlive cancellation.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Apply Ruff import ordering to the linked-session subprocess cancellation regression test so the PR passes lint. Affected files: - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
a11c19cfor PR #464
Review record:
0e2ba0ac-4f8b-42d2-923a-5828582cc338
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:435 | BLOCKING | cancel_job() now sets the session-level cancellation flag and then immediately force-cancels the local job task for linked-session jobs. The orchestrator only clears that in-memory flag when its own cancellation/terminal path runs; if the task is interrupted first, the flag can linger indefinitely. In practice that means a later resume_session() for the same session_id can be cancelled immediately by the stale registry entry, so this change fixes subprocess teardown at the cost of breaking same-process resume/cancellation semantics. |
Follow-up Findings
tests/integration/test_cancel_subprocess_termination.py:57[warning] The new integration test mutates the module-global cancellation registry by cancellingsession_id=\"orch_cancel_123\", but it never clears that flag in teardown. There is no suite-wide reset, and existing tests assert the registry is empty, so this introduces order-dependent failures/flakiness in the test suite.
Non-blocking Suggestions
None.
Design Notes
The intent of the patch is correct: linked-session jobs need both orchestration-level cancellation and local task/process teardown. The current implementation couples those two mechanisms without preserving the runner’s cleanup contract, so the lifecycle model is still inconsistent.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Clear the session cancellation registry when cancel_job also cancels the local job or runner task. This preserves same-process resume semantics while still terminating linked-session subprocesses immediately. Services: shared Affected files: - src/ouroboros/mcp/job_manager.py - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
5e3c68dfor PR #464
Review record:
469c00de-5652-4f7d-adeb-e4d043dac9d5
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:434 | BLOCKING | Cancelling linked jobs now force-cancels the local wrapper task and runner task immediately after setting the session cancellation flag, then clears that flag again. That fixes the subprocess leak, but for real execute_seed jobs it bypasses the orchestrator’s graceful cancellation path: the background wrapper in execution_handlers.py only catches Exception (not asyncio.CancelledError), and the runner execution paths also rely on _check_cancellation() / _handle_cancellation() to mark the session cancelled and unregister it. In Python 3.11+, asyncio.CancelledError inherits from BaseException, so this direct task.cancel() path can leave the underlying session aggregate non-terminal / uncleaned while the job itself reports CANCELLED. The new integration test only checks process termination and registry clearing, so this regression is currently untested. |
Non-blocking Suggestions
None.
Design Notes
The change addresses the immediate subprocess-lifecycle problem, but it mixes two cancellation mechanisms with different contracts: graceful session cancellation via the runner registry, and hard task cancellation via asyncio. Those need a single authoritative cleanup path, otherwise job state and session state can diverge.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
When cancel_job hard-stops a local linked-session job, mark the session aggregate cancelled and release the heartbeat lock before clearing the cancellation flag. This keeps job, session, and process cleanup state aligned. Services: shared Affected files: - src/ouroboros/mcp/job_manager.py - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
9ec9d1dfor PR #464
Review record:
4247b2b1-5ba4-4ad5-97ec-e3e55ceee14c
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:437 | BLOCKING | Linked-session cancellation now force-cancels the local runner task before the orchestrator can reach its cooperative cancellation path. That bypasses OrchestratorRunner._handle_cancellation(), which is the code that emits the required execution.terminal event for cancelled executions (events.py:624). As a result, execution-stream consumers that only watch execution.* will never observe the terminal cancelled transition, even though the session event exists. The new integration test only asserts the session cancellation event, so this regression is currently untested. |
Non-blocking Suggestions
None.
Design Notes
The change fixes subprocess termination, but it does so by splitting cancellation responsibilities between JobManager and OrchestratorRunner. That creates a lifecycle mismatch: JobManager now kills the task and partially reproduces runner cleanup, while the runner still owns the execution-stream terminal event. Cancellation should have a single owner, or the duplicated cleanup needs to include every externally visible side effect.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Mirror hard-stopped linked-session job cancellation into the execution event stream so consumers observing execution.* see the terminal cancelled state as well as the session event. Services: shared Affected files: - src/ouroboros/mcp/job_manager.py - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
a0a45e7for PR #464
Review record:
7c0d458c-ff60-4c5a-9a91-97113c2ba61b
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:439 | BLOCKING | The new linked-session cancellation path can emit a false cancellation after the job has already reached a real terminal state. task.cancel() is attempted whenever _tasks[job_id] exists, even if that wrapper task is already done, and local_task_cancelled is then used to unconditionally append orchestrator.session.cancelled and execution.terminal(status=\"cancelled\"). There is a race where get_snapshot() still sees the job as running while _run_job() has already completed/failed but has not yet removed itself from _tasks; a late cancel_job() call will then overwrite the observable terminal state for the linked session/execution. The runner’s own cancellation code explicitly re-checks for existing terminal state before marking cancelled, but this new path does not. |
Non-blocking Suggestions
None.
Design Notes
The change is directionally correct in trying to clean up linked background sessions when the wrapper task is cancelled, but it is currently bypassing the runner’s terminal-state safeguards and duplicating terminal-event responsibility in a racy way.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Only hard-cancel linked job tasks that are still running. This avoids emitting synthetic cancelled session and execution terminal events for jobs that have already completed or failed but are still present in the in-memory task registry. Services: shared Affected files: - src/ouroboros/mcp/job_manager.py - tests/unit/mcp/test_job_manager.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
776194ffor PR #464
Review record:
726def18-9148-4da3-b24b-f7545175590b
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:450 | BLOCKING | SessionRepository.mark_cancelled() returns a Result, but this new path ignores it and continues as if cancellation was persisted. If that write fails, cancel_job() still releases the lock, may append execution.terminal, clears the in-memory cancellation flag, and returns a cancelled-ish snapshot even though the authoritative session stream never recorded orchestrator.session.cancelled. The existing direct-cancel path treats this as an error (tests/unit/orchestrator/test_inflight_cancellation.py:515), so this introduces a silent inconsistency in production. |
| 2 | src/ouroboros/mcp/job_manager.py:457 | BLOCKING | cancel_job() now unconditionally calls heartbeat.release(session_id), but JobManager only holds a linked session_id; it does not acquire or own the session lock. release() simply unlinks the lock file with no ownership check (src/ouroboros/orchestrator/heartbeat.py:97), so cancelling a linked job can clear the lock for another still-running holder and allow concurrent work/resume against an active session. This is a changed contract boundary and needs an ownership/liveness guard before releasing the lock from here. |
Non-blocking Suggestions
None.
Design Notes
The PR is fixing a real gap by propagating cancellation from MCP jobs into linked orchestrator sessions, but it currently duplicates part of the runner’s cancellation responsibilities inside JobManager without preserving the same error-handling and lock-ownership guarantees. That makes the new path much more fragile than the existing orchestrator cancellation flow.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Check linked-session cancellation persistence before emitting terminal events, and only release heartbeat locks owned by the current process. Add regression coverage for persistence failures and lock ownership cleanup. Services: shared Affected files: - src/ouroboros/orchestrator/heartbeat.py - src/ouroboros/mcp/job_manager.py - tests/unit/mcp/test_job_manager.py - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
2b03891for PR #464
Review record:
d0d4e4ea-3bf7-41dd-909f-6285f060ceb4
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/orchestrator/heartbeat.py:110 | BLOCKING | release_if_owned_by_current_process() only compares the stored PID to os.getpid(), even though the lock format explicitly includes process start time to prevent PID-reuse false positives. If a stale lock file belongs to an older process that used the same PID, this helper will incorrectly delete another process's lock and make orphan detection think that session is dead. The ownership check needs to validate the recorded start time as well, like is_holder_alive() does. |
| 2 | src/ouroboros/mcp/job_manager.py:435 | BLOCKING | cancel_job() now both sets the in-memory cancellation flag and immediately writes orchestrator.session.cancelled / execution.terminal itself when it manages to cancel a local task. The runner already watches that same flag and has its own _handle_cancellation() path that persists the same terminal state. Because cancel_job() does not wait for the runner task to finish before appending its own terminal events, there is a race where both sides can emit cancellation for the same session/execution, producing duplicate terminal records in the event stream. In this event-sourced design, duplicate terminal events are a contract bug, not just noise. |
Non-blocking Suggestions
None.
Design Notes
The direction makes sense: linked MCP job cancellation should tear down the backing execution instead of only cancelling the wrapper task. The implementation is too eager, though: it duplicates runner-owned terminalization logic and introduces a weaker lock-ownership check than the existing heartbeat model.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Avoid racing the runner cancellation path by only setting the cooperative cancellation flag when no local job task is hard-cancelled. Strengthen heartbeat lock release ownership checks with process start time validation and clean up pending test tasks. Services: shared Affected files: - src/ouroboros/orchestrator/heartbeat.py - src/ouroboros/mcp/job_manager.py - tests/unit/mcp/test_job_manager.py - tests/integration/test_cancel_subprocess_termination.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
ada39fdfor PR #464
Review record:
3c2a9b2a-118a-4c89-b254-e9c92ba234b6
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:448 | BLOCKING | This branch marks the linked session/execution as cancelled purely because the local MCP job task was still alive. That is not enough to prove the orchestrator session is still non-terminal: execute_precreated_session() writes orchestrator.session.completed/failed and execution.terminal before _run_job() returns, so there is a real race where the runner has already persisted a terminal success/failure but the outer job task is still finishing. In that window, cancel_job() will append a contradictory orchestrator.session.cancelled plus execution.terminal(status=\"cancelled\"), effectively flipping a completed/failed execution to cancelled for downstream consumers. The runner’s own cancellation path explicitly reconstructs the session and skips mark_cancelled() when it is already terminal; this new path needs the same guard. |
Non-blocking Suggestions
None.
Design Notes
The overall direction is reasonable: cancelling linked background jobs should clean up the runner subprocess and release the heartbeat lock. The problem is that JobManager now duplicates part of the runner’s terminal-state persistence logic without reusing its safeguards, so the two cancellation paths can disagree about the authoritative session state.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Check linked session terminal events before hard-cancelling a job so late cancel requests cannot append contradictory cancelled session or execution terminal records after completion or failure. Services: shared Affected files: - src/ouroboros/mcp/job_manager.py - tests/unit/mcp/test_job_manager.py
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
0de8131for PR #464
Review record:
a126085f-d4d1-4855-a743-0efcab4eba33
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:464 | BLOCKING | The new linked-session path now force-cancels the local asyncio tasks instead of using the runner’s cooperative cancellation flow. In this codebase the runner releases important state only from its normal success/failure/cancellation paths (_unregister_session, task-workspace lock release, etc.); CancelledError is not caught in those execution methods, so task.cancel()/runner_task.cancel() can bypass that cleanup entirely. This patch only compensates by removing the heartbeat lock, which leaves other runner-owned state stale and can block later executions or cancellations. |
| 2 | src/ouroboros/mcp/job_manager.py:477 | BLOCKING | cancel_job() now persists orchestrator.session.cancelled and execution.terminal as soon as it cancels the local task, even if the linked session has never emitted orchestrator.session.started yet. ouroboros_start_execute_seed hands the caller a preallocated session_id/execution_id before the runner necessarily creates the session, so cancelling immediately after job creation can produce a “cancelled” session with no start event. SessionRepository.reconstruct_session() treats that as an invalid stream (No start event found), so this change can leave behind unreconstructable sessions and break resume/status consumers. |
Non-blocking Suggestions
None.
Design Notes
The intent is clear: make MCP job cancellation terminate linked subprocess work promptly. But the implementation crosses a contract boundary from “request cancellation” into “synthesize terminal orchestrator state,” and it does that without going through the runner’s existing lifecycle/cleanup machinery. That makes the new path fast, but it also creates state-consistency risks the current tests do not cover.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
7fbc946for PR #464
Review record:
a1c9c8b2-d2b1-4f55-8326-c005137dcfdf
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:462 | BLOCKING | cancel_job() now treats any job with a session_id link as cooperatively cancellable and returns after request_cancellation(), without cancelling the actual task. That breaks JobManager’s contract for linked jobs that have not started the orchestrator session yet, or that never consult the cancellation registry: the user sees cancel_requested, but the coroutine keeps running and can continue holding subprocesses/resources indefinitely. The new test at tests/unit/mcp/test_job_manager.py:202 codifies this non-cancelling behavior instead of catching it. |
| 2 | src/ouroboros/mcp/job_manager.py:449 | BLOCKING | The new “linked session already terminal” fast-path returns the existing snapshot without touching the job task. If the session reaches completed/failed/cancelled before the wrapper coroutine exits, cancel_job() becomes a no-op while the job remains running and the underlying task keeps executing. tests/unit/mcp/test_job_manager.py:163 demonstrates this exact situation with a sleeping runner and only passes because the test force-cancels the manager in finally; in production, job_wait/channel polling would keep treating the workflow as active and the task would leak. |
Non-blocking Suggestions
None.
Design Notes
The intent is clear: avoid synthesizing orchestrator terminal events during cancellation races. But the implementation moved cancellation authority from the job manager to optional cooperative behavior in the linked runner, which leaves the generic background-job lifecycle inconsistent. The job and linked session need separate state handling: suppress duplicate session terminalization if needed, but still guarantee the job task itself is stopped or resolved.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
3368bf5for PR #464
Review record:
0f524777-c0ae-4de9-95c9-e8a561f0ee9d
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:466 | BLOCKING | cancel_job() now unconditionally cancels the local job task for linked sessions, even when the linked orchestrator session is already terminal. If the runner behind that task is execute_precreated_session(), its new CancelledError path calls _handle_cancellation(), which skips mark_cancelled() for terminal sessions but still appends execution.terminal with status=\"cancelled\" ([runner.py:1373-1397]). That creates a real race where a session that already completed/failed can be rewritten as cancelled in the execution event stream just because the MCP job was cancelled a moment later. The local task cancellation needs to be gated on linked_session_terminal == False, or _handle_cancellation() must avoid emitting a cancelled terminal event when the session is already terminal. |
Non-blocking Suggestions
None.
Design Notes
The change is directionally right in trying to stop leaked local subprocesses, but the new job-layer cancellation now overlaps with the runner’s own session-layer cancellation contract. That boundary needs one clear owner, otherwise terminal state can diverge between the job aggregate and the execution/session aggregates.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
c77f31efor PR #464
Review record:
a14a8a42-680f-4dbc-b9a1-3686d236d80f
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:449 | BLOCKING | linked_session_started = session_result.is_ok treats any pre-created session as already running, but prepare_session() persists orchestrator.session.started before execute_precreated_session() has actually begun. If cancel_job() hits in that queued window, it now sets the in-memory cancellation flag, cancels both tasks, and skips clear_cancellation() because reconstruction succeeded. When the runner task never gets to its startup cancellation check/CancelledError handler, nothing clears that flag or persists a cancellation event, so the session remains logically RUNNING and any later retry/resume in the same process will be cancelled immediately from stale state. The new tests cover “no session record” and “already started”, but not this pre-created-but-never-started path. |
Non-blocking Suggestions
None.
Design Notes
The patch is moving in the right direction on linked-session cancellation races, but the current implementation still conflates “session exists in the event store” with “runner has started and can consume the cancellation signal”. That boundary needs to be explicit; otherwise queued sessions leak process-local cancellation state.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
ce90c9cfor PR #464
Review record:
d10cda27-2065-4d3e-ba3b-a9f5390b3320
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/orchestrator/runner.py:1585 | BLOCKING | Both new except asyncio.CancelledError branches now unconditionally convert task cancellation into _handle_cancellation(). _handle_cancellation() skips persisting a cancelled session when reconstruct_session() says the session is already terminal, but it still returns an OrchestratorResult with success=False and a cancellation message. That means a wrapper/task cancellation during teardown after a session has already completed or failed will now report a cancelled result to callers even though the persisted session state is terminal-success/terminal-failure. This is an API contract mismatch and the added test at tests/unit/orchestrator/test_inflight_cancellation.py:748 currently locks in the wrong behavior instead of catching it. |
Non-blocking Suggestions
None.
Design Notes
The PR is moving cancellation handling in the right direction for linked background jobs, but it currently conflates “task was cancelled” with “execution should be reported as cancelled.” Persistence and return-value semantics need to stay aligned, especially once the session is already terminal.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
74fbd0bfor PR #464
Review record:
49677daa-b421-4bd7-bd3d-1ac7ea0c048e
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/orchestrator/runner.py:1377 | BLOCKING | The new session_already_terminal fast-path returns before emitting any execution.terminal event. That breaks the execution-stream contract if cancellation lands after orchestrator.session.completed/failed has been persisted but before execution.terminal is appended in execute_precreated_session (runner.py:1787-1845). In that race, the session aggregate is terminal but single-stream consumers never see execution completion. |
| 2 | src/ouroboros/mcp/job_manager.py:441 | BLOCKING | cancel_job() now performs a best-effort session inspection with query_events() before it cancels the local tasks, but that read is unguarded. EventStore.query_events() raises PersistenceError on DB/engine failures, so a transient store error now aborts cancellation entirely and leaves the job running. Before this PR, the local task cancellation path did not depend on an extra event-store read. |
Non-blocking Suggestions
None.
Design Notes
The direction is reasonable: distinguish “started linked session” from “precreated but never started” and stop synthesizing cancellation state in the latter case. The main gap is that the runner now has two terminal-state channels (session.* and execution.terminal) but the new cancellation shortcut only preserves one of them, which reintroduces stream divergence.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
6a14959for PR #464
Review record:
32dcaeaa-da9f-4705-b0ad-dda268db5e5b
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:470 | BLOCKING | cancel_job() now cancels the local job task and runner task even when the linked session is already started and a cooperative cancellation request was just issued. For real orchestrator-backed jobs, that CancelledError is caught by execute_precreated_session() and translated into _handle_cancellation(), which persists orchestrator.session.cancelled and execution.terminal. That is the opposite of the intended behavior captured by the new tests for started linked sessions, and it will still synthesize terminal session state in production for the actual orchestrator runner path. |
| 2 | src/ouroboros/orchestrator/runner.py:1623 | BLOCKING | Both new except asyncio.CancelledError blocks treat any task cancellation as an external user cancellation and call _handle_cancellation() without checking whether cancellation was actually requested. That means parent task-group shutdown, timeouts, or unrelated task cancellation will now be recorded as CANCELLED sessions with terminal execution events, which is a user-visible state corruption rather than normal task teardown. The same issue exists in the later handler at line 1898. |
Non-blocking Suggestions
None.
Design Notes
The PR is trying to separate “stop local background job bookkeeping” from “mutate orchestrator session state”, but the new control flow still couples those paths through unconditional task.cancel() and broad CancelledError handling. As written, the implementation does not maintain a clear boundary between local task teardown and authoritative session termination.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
c121b03for PR #464
Review record:
67e6eb59-60ac-4523-acd2-48d11451104a
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:462 | BLOCKING | cancel_job() snapshots linked_session_started once, then may both request_cancellation() and clear_cancellation() in the same call. If the linked runner acquires its heartbeat after line 462 but before the clear at line 479, the cancellation signal is erased before execute_precreated_session() reaches its new checkpoints (runner.py:1573, 1624, 1903). In that race, the task is aborted via CancelledError, but the session is left without orchestrator.session.cancelled / execution.terminal events and can remain logically RUNNING. This is a real early-start race in the exact path the PR is trying to fix. |
Non-blocking Suggestions
None.
Design Notes
The PR is moving cancellation handling in the right direction by separating “linked session already started” from “local task never really started”, but the new logic still depends on a single heartbeat snapshot and an in-memory cancellation flag. That leaves a narrow but important race around runner startup.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
6e0173efor PR #464
Review record:
9af98898-ad61-4fca-b368-36eab290618e
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:466 | BLOCKING | cancel_job() now returns immediately whenever the linked session heartbeat is present, so it never cancels the local _tasks/_runner_tasks in the real “session already started” case. That means any linked background job that is blocked outside the runner’s message-loop checkpoints (for example, waiting on a subprocess or another long await) will keep running after the user cancels it. This is exactly the in-flight path used by execute_precreated_session(), which acquires the heartbeat up front in runner.py:1571; the new integration test only covers the pre-started/no-heartbeat case, so this regression is not caught. |
Non-blocking Suggestions
| 1 | src/ouroboros/orchestrator/heartbeat.py:110 | refactoring | release_if_owned_by_current_process() is added but not used anywhere in the changed paths. If ownership-safe lock release is part of the intended fix, wiring it into the unregister/cleanup paths would make the change meaningful and reduce future confusion. |
Design Notes
The PR moves cancellation handling in the right direction by distinguishing terminal, started, and pre-start states, but the current split still leaves the important in-flight started case on a signal-only path. That makes cancellation correctness depend on cooperative checkpoints rather than on actually stopping the local work that owns the subprocesses.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
7783b94for PR #464
Review record:
41991b1a-a419-4195-ae59-29856ae118a1
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/orchestrator/runner.py:1623 | BLOCKING | The new CancelledError paths now convert wrapper-task cancellation into _handle_cancellation() / _unregister_session() cleanup, but that cleanup still drops the session heartbeat with an unconditional release(session_id). If the same session is resumed in another process while this task is unwinding, this branch will delete the new owner's lock file and find_orphaned_sessions() can later misclassify that live session as orphaned and auto-cancel it. heartbeat.release_if_owned_by_current_process() was added in this PR, but it is never wired into the cleanup path. |
Non-blocking Suggestions
None.
Design Notes
The overall direction is sound: cancellation is now handled earlier and more consistently, and the job-manager changes avoid blindly terminalizing linked sessions. The remaining gap is lock ownership during cleanup; without owner-aware release, the new cancellation branches are not safe under concurrent resume/takeover scenarios.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
160e752for PR #464
Review record:
19527637-9785-42d4-944a-3c1b57d9fc3d
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:466 | BLOCKING | When a linked session is already running, cancel_job() now checks is_holder_alive() but still signals cancellation only via request_cancellation(), which updates the current process's in-memory registry. A runner in a different process will never observe that flag, so the linked execution can continue even though the job was cancelled locally. The runner’s own contract explicitly says cross-process cancellation requires a persisted orchestrator.session.cancelled event (_check_cancellation() falls back to the event store for that case in src/ouroboros/orchestrator/runner.py). This regresses the “runtime-agnostic” heartbeat-based path the PR is trying to support. |
Non-blocking Suggestions
None.
Design Notes
The lock-ownership cleanup in runner.py is a sound direction and avoids one process deleting another process's heartbeat. The main design gap is that job_manager.py now detects externally-running sessions but still uses a process-local cancellation mechanism, so the cancellation path is not actually runtime-agnostic end to end.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
21c1b18for PR #464
Review record:
e5d1b673-e4a5-4e9b-b882-f3788fbb21b0
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:438 | BLOCKING | If linked-session inspection throws, linked_session_terminal falls back to False, and the later cross-process branch still calls mark_cancelled() whenever a foreign lock is present. That means a transient read failure while the remote runner is finishing can append orchestrator.session.cancelled after a real completed/failed outcome, silently rewriting the session’s terminal state. Inspection failure needs to leave the state as “unknown” and skip persistent cancellation rather than assuming “not terminal.” |
Non-blocking Suggestions
None.
Design Notes
The direction makes sense: ownership-aware lock release and explicit task-cancellation handling close real race windows. The remaining problem is that JobManager.cancel_job() now mixes “could not inspect” with “safe to persist cancel,” which breaks the event-sourced terminal-state contract for linked sessions.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
ede22a5for PR #464
Review record:
6ca82dde-eff0-4338-bb18-55c6bd04aab1
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:477 | BLOCKING | Cross-process cancellation is silently dropped when linked-session inspection fails. In that branch cancel_job() still sets the in-memory cancellation flag, but it skips mark_cancelled() because linked_session_inspection_failed is true. External runners cannot see the in-memory registry, so the user gets a successful cancel response while the real linked session keeps running in the other process. |
| 2 | src/ouroboros/mcp/job_manager.py:502 | BLOCKING | Cancellation for a precreated-but-not-yet-started linked session is cleared before any future runner can observe it. cancel_job() calls request_cancellation(), then immediately clear_cancellation() whenever no lock holder is alive; later execute_precreated_session() only checks is_cancellation_requested() at startup, so a worker that starts the session after this point will run normally even though the job was already cancelled. |
Non-blocking Suggestions
None.
Design Notes
The PR is moving toward a clearer ownership model for linked-session cancellation, but the current implementation still treats the in-memory registry as if it were a cross-process signal. That makes the new race-handling logic unsafe in exactly the distributed cases it is trying to fix.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
62a8e1bfor PR #464
Review record:
4d3d94dc-719d-4b43-a8b2-91de007778a6
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:468 | BLOCKING | cancel_job() now persists orchestrator.session.cancelled for any linked session that exists, even when linked_session_started is false. That means canceling a background wrapper before the orchestrator ever acquires the session lock permanently terminalizes a precreated/dormant session. The later clear_cancellation() only removes the in-memory flag; it does not undo the persisted cancelled event, so the session is left in a terminal state even though no execution actually ran. |
| 2 | src/ouroboros/mcp/job_manager.py:468 | BLOCKING | When JobManager is the component that marks a linked session cancelled, it never mirrors that terminal state into the execution aggregate. The contract in create_execution_terminal_event() is that execution-only consumers rely on execution.terminal to observe completion/cancellation, but this path only writes orchestrator.session.cancelled. For linked jobs whose runner is just an arbitrary coroutine/subprocess wrapper, there is no other code left to append execution.terminal, so TUI/pollers watching the execution stream can hang indefinitely on a cancellation. |
Non-blocking Suggestions
None.
Design Notes
The PR is trying to separate “stop the local background job” from “cancel the orchestrator session,” but JobManager.cancel_job() is now owning part of the orchestrator terminal-state protocol itself. That introduces contract drift: session state, execution state, and local task teardown are no longer updated atomically by one component.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
6dda1f1for PR #464
Review record:
88e8f807-b2b2-4666-8fe9-3e3f9808a69b
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:474 | BLOCKING | cancel_job() persists orchestrator.session.cancelled and execution.terminal(status="cancelled") before it has actually stopped the linked runner tasks (task.cancel() happens later at lines 494-504). Cancellation in asyncio is cooperative, so a runner that delays or swallows CancelledError can keep working after consumers have already been told the execution is terminal. That is a real contract break for polling clients/TUI and can produce duplicate or contradictory terminal outcomes later. |
| 2 | src/ouroboros/mcp/job_manager.py:479 | BLOCKING | If mark_cancelled() returns an error, cancel_job() raises immediately and never reaches the task-cancellation block at lines 494-504. In that failure mode the API reports a cancellation error, but the background job keeps running even though the user already requested cancellation and the job was moved to CANCEL_REQUESTED. This needs to degrade gracefully and still stop the local tasks. |
Non-blocking Suggestions
None.
Design Notes
The change is aimed at the right race conditions, but JobManager.cancel_job() now owns part of the orchestrator terminal-state contract. That creates a split-brain risk: the job layer can declare a linked execution terminal before the runner has actually finished its own shutdown path.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
02ac4effor PR #464
Review record:
e5478bbf-05ac-4efc-b14e-9f5a37b8661d
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/orchestrator/runner.py:1575 | BLOCKING | The new pre-start cancellation guard only checks is_cancellation_requested(), which is process-local. Cross-process cancels are represented by a persisted orchestrator.session.cancelled event (_check_cancellation() already knows this), so a runner that starts after another process has already cancelled the session will miss the cancel here and still build prompts/start execution until the first periodic checkpoint. That means already-cancelled precreated sessions can still do real work, which is exactly the race this PR is trying to close. This needs to use the event-store-aware cancellation check before execution begins. |
Non-blocking Suggestions
None.
Design Notes
The direction is reasonable: cancellation is being made more explicit across MCP jobs, runner lifecycle, and execution-stream terminal events. The remaining issue is that the startup path still mixes two cancellation mechanisms and only honors the local one at the earliest decision point.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
3998f0cfor PR #464
Review record:
6a190a66-126b-4535-99d5-35ff5ea21714
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:469 | BLOCKING | cancel_job() now decides to persist orchestrator.session.cancelled whenever the linked session has any live lock holder, but that includes the current process as well. Because it only waits 5 seconds before forcing the terminal events (asyncio.wait(..., timeout=5) at lines 484-512), a locally owned runner that needs longer to unwind cancellation can still be running while the session/execution have already been marked cancelled. That produces a false terminal state and allows work to continue after clients have been told the execution is over. |
Non-blocking Suggestions
None.
Design Notes
The separation between signalling cancellation and mirroring terminal state is the right direction, but the new logic is still conflating “session is alive” with “session is owned by another process.” That boundary needs to be explicit before the manager can safely persist terminal cancellation on behalf of the runner.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
a702807for PR #464
Review record:
3b82b1d1-c734-44c4-962c-669bdc00a691
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:477 | BLOCKING | Cross-process cancellation is silently dropped when reconstruct_session() fails. should_persist_linked_cancel is gated on linked_session_exists, so a transient replay/storage failure means cancel_job() only sets the in-memory cancellation registry in the current process, skips mark_cancelled(), and still returns a cancelled job snapshot. A linked session running in another process will never observe that registry entry, so the real execution keeps running even though the API reported success. |
Non-blocking Suggestions
| 1 | tests/unit/mcp/test_job_manager.py:361 | Tests | The new coverage exercises cross-process cancellation and query_events() failure, but not the more important reconstruct_session() failure case that now controls whether cancellation is persisted at all. A regression test for that branch would make this behavior much safer to change. |
Design Notes
The PR is moving cancellation handling in the right direction by separating same-process task teardown from cross-process session terminalization and by avoiding lock release from the wrong process. The main weakness is that cancellation authority is now split across heartbeat state, in-memory registry state, and persisted session state, and one transient failure in the reconstruction path currently breaks the cross-process contract silently.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
2f9c323for PR #464
Review record:
f6bdebd1-f8c3-4f16-aeba-09b77b2bd5d5
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:517 | BLOCKING | cancel_job() now clears the in-memory cancellation flag whenever the linked session has not acquired a heartbeat lock yet. That makes cancellation non-durable for precreated sessions: if a different worker starts the same session after this method returns, execute_precreated_session() only checks the registry or a persisted orchestrator.session.cancelled event at startup (runner.py:1339, runner.py:1593). Because this branch neither persists cancellation nor leaves the registry set, the later startup will run normally even though the user already cancelled the job. This is a real cross-process race/regression in the new behavior. |
Non-blocking Suggestions
None.
Design Notes
The PR is moving in the right direction on local-vs-remote cancellation ownership, but the current solution relies too heavily on the heartbeat lock as a proxy for "execution has started". That works for already-running sessions, but it breaks the contract for precreated sessions because cancellation is no longer durable before the lock exists.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
518dca4for PR #464
Review record:
e8158344-8954-4082-ac4c-ec9865362d8f
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/job_manager.py:481 | BLOCKING | cancel_job() now cancels self._tasks[job_id], but that task is the _run_job() wrapper, not just the underlying runner. If cancellation lands after the runner has already returned and _run_job() is in its success/failure finalization path (job_manager.py), the CancelledError is raised from the else: block, which is not covered by _run_job()'s except asyncio.CancelledError. In that race, no terminal job event is appended, the monitor is torn down in finally, and the job can be left stuck in cancel_requested with no completion/cancel record. This regression only exists because the new linked-session path unconditionally cancels the wrapper task. |
Non-blocking Suggestions
None.
Design Notes
The heartbeat ownership checks and startup-cancellation handling are sensible improvements, but the job manager is now mixing two concerns: signalling cancellation to the underlying execution and cancelling its own bookkeeping wrapper. That coupling introduces a terminal-state race.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
36823f8for PR #464
Review record:
d9eec3e9-a979-4e66-8675-485f9817368f
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
The change set is coherent: it tightens cancellation handling across the MCP job manager and orchestrator runner, and the added tests cover the main startup/in-flight race cases introduced by the new logic. I did not find a scoped blocking defect in the modified paths.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Summary
ouroboros_cancel_jobcancel the in-process job task even when the job is linked to an orchestration session.CancelledErrorreaches runtime subprocess cleanup immediately.Changes
JobManager.cancel_job()now always cancels tracked job/runner tasks after persistingcancel_requested.request_cancellation(session_id)for runner/event-store visibility.cancel_job().Tests
uv run pytest tests/integration/test_cancel_subprocess_termination.pyuv run pytest tests/unit/mcp/test_job_manager.py::TestJobManager::test_cancel_job_cancels_non_session_task