fix(interview): enforce streak check on explicit 'done' path#428
fix(interview): enforce streak check on explicit 'done' path#428
Conversation
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
012905afor PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|---|---|---|
| 1 | tests/unit/mcp/tools/test_checklist_verify.py:143 | Tests | Add a regression test for a seed with exactly one acceptance criterion going through ChecklistVerifyHandler; the current suite only exercises multi-AC seeds, which is why the contract break above slips through. |
Design Notes
The checklist aggregation itself is clean and reasonably isolated, and reusing EvaluateHandler for ChecklistVerifyHandler is the right direction. The main issue is that the boundary between “single evaluate” and “checklist verify” is encoded by AC count inside EvaluateHandler, which leaks an implementation detail into a higher-level tool with a different contract.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
095ee49for PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The checklist aggregation itself is a reasonable additive design, and reusing EvaluateHandler from ChecklistVerifyHandler keeps the surface area small. The main problem is the routing boundary: the new parameter contract is inconsistent between the single-AC and multi-AC paths.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
Pushed Bug RecapPreviously, when a user typed
Fix (Option a — implicit streak advance)Explicit
This preserves the Socratic Clarity principle (stability verified across TestsAdded 3 new tests and replaced the prior "refuses when streak insufficient" test (which asserted the broken behavior):
All 109 tests in |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
40aa3aefor PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The checklist aggregation itself is a reasonable additive layer, and reusing EvaluateHandler from ChecklistVerifyHandler keeps the surface area small. The main design problem is that the routing boundary and execution strategy in EvaluateHandler do not match the new API contract: single-AC lists are mishandled, and the per-AC fan-out repeats repo-scoped mechanical checks concurrently.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
Labels: |
Q00
left a comment
There was a problem hiding this comment.
Requesting changes because the latest head is not scoped to the PR title anymore, and it still carries an unresolved checklist regression.
This PR is titled as an explicit 'done'-path fix, but the diff still includes the full checklist/evaluate bundle (src/ouroboros/evaluation/checklist.py, src/ouroboros/mcp/tools/evaluation_handlers.py, src/ouroboros/mcp/tools/definitions.py, and the new checklist test suites). That makes the review surface far larger than advertised.
Worse, the bundled checklist code still has the 1-item acceptance_criteria bug: in src/ouroboros/mcp/tools/evaluation_handlers.py, the plural list is normalized, but the single-AC path still falls back to acceptance_criterion / the generic default instead of honoring a 1-item list.
Please either re-scope this PR to the explicit 'done' fix only, or include the later normalization fix before merging.
The multi-AC routing gate used `len(acceptance_criteria) >= 2`, which silently dropped single-item lists back to the generic single-AC path. This broke ChecklistVerifyHandler for seeds with exactly one AC — the checklist metadata (multi_ac, ac_count, pass_rate) was lost. Change the gate to `>= 1` so any explicit acceptance_criteria list, regardless of length, takes the checklist aggregation path. This gives ChecklistVerifyHandler consistent per-AC results for all seed sizes. Also adds a regression test for single-AC seeds through ChecklistVerifyHandler, as suggested by the bot review. Addresses review feedback from Q00 on PR Q00#428. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
215398cfor PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The checklist aggregation itself is a sensible additive layer, and wiring a dedicated ouroboros_checklist_verify tool through the existing evaluator keeps the public surface coherent. The main architectural miss is that the implementation parallelizes the entire pipeline instead of only the AC-dependent portion.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Q00
left a comment
There was a problem hiding this comment.
Review Verdict: Changes Requested
Can't merge this as-is. The core bug the PR targets is real, and I can reproduce it on this branch. Also the PR is currently CONFLICTING against origin/main (mergeStateStatus: DIRTY).
1. High — Live re-score path double-bumps the streak, so one done can still bypass the 2-signal gate
src/ouroboros/mcp/tools/authoring_handlers.py:946,:956-959
The explicit-done path falls back to _score_interview_state() when the stored score is missing or degraded:
# :946
exit_score = await self._score_interview_state(llm_adapter, state)But _score_interview_state() (:556-586) already advances the streak via _update_completion_candidate_streak(state, score) (:581), which increments completion_candidate_streak when the score qualifies (:137):
# :130-141
def _update_completion_candidate_streak(state, score) -> bool:
qualifies = qualifies_for_seed_completion(score, is_brownfield=state.is_brownfield)
if qualifies:
state.completion_candidate_streak += 1
else:
state.completion_candidate_streak = 0
return qualifiesImmediately afterward, the explicit-done branch increments again:
# :956-958
if state.completion_candidate_streak < AUTO_COMPLETE_STREAK_REQUIRED:
state.completion_candidate_streak += 1
state.mark_updated()
if state.completion_candidate_streak >= AUTO_COMPLETE_STREAK_REQUIRED:
return await self._complete_interview_response(...)Repro conditions:
completion_candidate_streak = 0- no stored ambiguity (or stored score is degraded)
- live re-score returns a qualifying score
Observed:
result_ok= True
state_status= completed
streak_after_one_done= 2
completed_meta= True
complete_interview_calls= 1
One done → streak jumps 0 → 1 (inside _score_interview_state) → 2 (explicit-done branch) → auto-completion. That defeats the "two consecutive stability signals required" policy this PR is supposed to enforce.
Suggested fix — pick one:
- Track a
rescoredflag and skip the:956increment when_score_interview_state()was called. - Add
_score_interview_state(..., update_streak=False)and have the explicit-done branch own the single advance.
Must-add regression test — "no stored score / degraded + streak 0 + qualifying live score + single done → state must NOT complete; streak must be exactly 1". Without this the exact bug walks back in.
4. Medium — Shortfall response pops the pending question but tells the user they can "answer another question"
src/ouroboros/mcp/tools/authoring_handlers.py:933-934,:971-989
On explicit done the pending unanswered round is popped:
# :933-934
if state.rounds and state.rounds[-1].user_response is None:
state.rounds.pop()When the streak advances but still falls short, we persist and tell the user:
"Type 'done' once more to confirm (... more signal(s) needed),
or answer another question to update the score."
But the pending question was just removed. If the user takes the guidance at face value and sends a plain answer:
- If that was the only outstanding round,
state.roundsis now empty and the answer path at:1013-1019returns"Cannot record answer - no questions have been asked yet". - If earlier answered rounds exist,
last_question = state.rounds[-1].questionat:1021grabs a stale, already-answered question and attaches the new response to it.
Either branch contradicts the shortfall message.
Suggested fix — pick one:
- Don't pop the pending round on a shortfall path; keep it available so the user can still answer it.
- Immediately generate and persist a new closure/clarification question before returning the shortfall response.
- Reword the message to match actual state, e.g. "type
doneagain, or resume without an answer to get another question."
Merge hygiene
gh pr view 428 --json mergeable,mergeStateStatus currently reports CONFLICTING / DIRTY. Needs a rebase onto latest origin/main before re-review.
Maintainer recommendation
Hold merge. Two things before I re-review:
- Fix the double-advance in the explicit-done path so one
donecannot skip the streak gate (Finding 1), with the regression test pinned. - Fix the shortfall UX so the printed guidance matches actual state (Finding 4).
Once those are in and the branch is rebased clean, this is a good fix to land — #405 itself is a real foot-gun and the overall direction of this PR is correct.
215398c to
3c2531d
Compare
…ortfall UX Resolves the two blockers flagged on Q00#428: 1. High — streak double-bump on explicit 'done'. When the stored ambiguity snapshot was missing or degraded, the explicit-done branch called `_score_interview_state()`, which advanced the streak inside `_update_completion_candidate_streak()`, and then the branch advanced it again. Result: one 'done' walked streak 0 → 1 → 2 and auto- completed, bypassing the 2-signal gate. Fix: add `update_streak` kwarg to `_score_interview_state` (defaults to True so all other callers keep their existing semantics) and pass `update_streak=False` from the explicit-done branch so it owns the single advance. 2. Medium — shortfall path popped the pending round but told the user they could "answer another question". If the user followed the prompt with a plain answer, it either hit "no questions have been asked yet" or attached to a stale round. Fix: defer the pending-round pop to the branches that actually end the interview (completion and refusal stay off the UX critical path in their own ways). On the shortfall branch the pending round is preserved and the message tells the user they can "answer the pending question". When no pending round existed, the guidance gracefully degrades to "resume without an answer to receive another question." Also rebased onto origin/main — the scope-bloat checklist/evaluate commits flagged in the first review have since landed on main and were auto-dropped by the rebase (cherry-pick detection). The branch is now scoped strictly to the Q00#405 done-path fix. Regression tests: - `test_explicit_done_live_rescore_advances_streak_exactly_once` pins the streak=1 outcome on the exact repro conditions from the review (streak=0, no stored score, qualifying live rescore). - `test_explicit_done_shortfall_preserves_pending_round` asserts the pending round survives the shortfall branch and the printed guidance now matches actual state. Refs Q00#405 Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Thanks for the review — both blockers addressed plus the rebase. Force-pushed a rebased branch ( 1. Streak double-bump (Finding 1 — High). The explicit-done branch now owns the single streak advance:
2. Shortfall UX (Finding 4 — Medium). Moved the pending-round pop into the completion-only branch. The shortfall branch now preserves the pending round and the message reads "or answer the pending question to update the score." If no pending round exists (corner case — the completion signal itself was the only round), the wording degrades gracefully to "or resume without an answer to receive another question." Pinned by Merge hygiene. Rebased onto latest Full suite: |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
3c2531dfor PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The branch is now much better isolated: update_streak=False cleanly prevents the live-rescore double-bump, and preserving the pending round fixes the transcript/UX mismatch. The remaining problem is that the new control flow makes persistence part of the correctness contract, but the implementation still treats that write as best-effort.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
60e1d2efor PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The fix correctly separates “explicit done owns the increment” from the normal scoring path, and the shortfall persistence/UX changes are directionally sound. The main remaining issue is that the new flag is too broad: it disabled both incrementing and failure/reset semantics, but only the increment needed to be skipped.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
The explicit-done branch correctly stopped double-bumping qualifying scores, but it also stopped resetting a stale streak when the live rescore came back weak. That let an old completion signal survive a failing rescore and made the next qualifying signal count as if the failure never happened. Constraint: Explicit-done must own the qualifying increment without changing the standard between-round scorer behavior Constraint: Existing shortfall UX/persistence fixes on PR Q00#428 must remain intact Rejected: Add a second explicit-done-only reset branch after scoring | duplicates score-state logic and drifts from the scorer contract Rejected: Leave the stale streak in place and rely on later answers to clear it | breaks the consecutive-signal policy the PR is enforcing Confidence: high Scope-risk: narrow Reversibility: clean Directive: update_streak=False should suppress only the qualifying increment; failing or non-qualifying rescoring still needs to clear stale streak state Tested: ruff check src/ouroboros/mcp/tools/authoring_handlers.py tests/unit/mcp/tools/test_definitions.py; PYTHONPATH=src pytest tests/unit/mcp/tools/test_definitions.py -q; PYTHONPATH=src pytest tests/unit/mcp/tools/test_definitions.py -q -k "explicit_done_live_rescore_advances_streak_exactly_once or explicit_done_nonqualifying_live_rescore_resets_stale_streak or explicit_done_shortfall_preserves_pending_round" Not-tested: Full repo test suite outside interview-handler coverage
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
8765c43for PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The fix is well-scoped and the explicit-done ownership of streak advancement is cleaner than the previous behavior. The remaining gap is consistency in stale-streak invalidation: success/non-qualifying and failure paths should all enforce the same reset rule, or the two-signal completion contract becomes stateful in surprising ways.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
Hi @Q00 — the latest head on this branch ( 1. Double-bump is resolvedInstead of removing the manual
Net: exactly one streak bump per Regression test at 2. Scope is clean
No 3. RebaseBranch now rebases cleanly onto Verified locally
Ready for another look when you have a cycle. |
…rtfall persist Addresses the three outstanding non-blocking design notes on PR Q00#428: 1. `update_streak` flag was too broad (60e1d2e note). The single kwarg suppressed both the qualifying-score increment AND the failure/non- qualifying reset when passed False. Split into two orthogonal flags on `_score_interview_state`: `advance_streak: bool = True` controls the `+= 1` bump, `reset_on_failure: bool = True` controls the stale-streak invalidation. The explicit-done branch now calls with `advance_streak=False, reset_on_failure=True` so it owns the single qualifying bump while still honoring the shared stale-reset contract on weak/failed rescores. 2. Stale-streak invalidation consistency (8765c43 note). Introduced `_reset_stale_completion_streak(state)` as the single source of truth for the "stale streak invalidation" rule, reused by `_update_completion_candidate_streak`, the scorer-error path, and the explicit-done non-qualifying elif. The two-signal completion contract is now stateless across flows: a streak only survives a signal that was itself qualifying, regardless of which path observed it. 3. Shortfall persist is already load-bearing (3c2531d note). The earlier commit 60e1d2e made `engine.save_state(...)` on the shortfall branch return an error envelope on failure rather than best-effort-swallowed; this commit pins that contract with a new regression test explicitly asserting the error-surfacing behavior under the required task naming. New pinned regression tests in tests/unit/mcp/tools/test_definitions.py: - `test_explicit_done_non_qualifying_resets_stale_streak` — streak=1 from an earlier signal, next `done` with a non-qualifying rescore must reset streak to 0. - `test_shortfall_persist_failure_surfaces_error` — simulate persist raising an error, assert the response is an error and complete_interview / ask_next_question are never called. - `test_advance_streak_false_still_resets_on_failure` — unit-level: `advance_streak=False, reset_on_failure=True` with a scorer error AND with a non-qualifying live score both clear a stale streak. Updated the existing pinned regression `test_explicit_done_live_rescore_advances_streak_exactly_once` to assert the new kwargs (`advance_streak=False, reset_on_failure=True`) rather than the old `update_streak=False`. Verification (from worktree): - uv run ruff format --check . && uv run ruff check . -> clean - uv run mypy src/ouroboros/mcp/tools/authoring_handlers.py -> no issues - PYTHONPATH=src uv run pytest tests/unit/mcp/tools/ tests/unit/bigbang/test_interview.py -q -> 508 passed Refs Q00#405 Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Pushed
Pinned regression tests added (as requested in the design notes):
The existing pinned test Local verification from
|
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
e12a682for PR #428
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| # | File:Line | Category | Finding |
|---|
Design Notes
The patch is well-scoped and the explicit-done state machine is much clearer now: separating advance_streak from reset_on_failure and preserving pending rounds on shortfall are both solid improvements. The remaining issue is consistency: once stale-streak invalidation becomes part of the correctness contract, every branch that mutates it needs the same persistence guarantees.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
@shaun0927 Hi, can you make this rebase from main? |
Rebase of Q00#428 onto v0.29.2. Squashes the iterative review-response history (e12a682) into one commit on top of current main since main's authoring_handlers.py refactor invalidated the original commit graph. Behavior preserved from e12a682: - Explicit 'done' owns the single streak advance via _score_interview_state(advance_streak=False, reset_on_failure=True), preventing the live-rescore double-bump that let one 'done' bypass AUTO_COMPLETE_STREAK_REQUIRED. - Shortfall branch preserves the pending round and surfaces persist failures as Result.err (correctness contract, not best-effort). - _reset_stale_completion_streak() unifies stale-streak invalidation across the explicit-done success / non-qualifying / scorer-error paths. Pinned regression tests in tests/unit/mcp/tools/test_definitions.py: - test_explicit_done_live_rescore_advances_streak_exactly_once - test_explicit_done_shortfall_preserves_pending_round - test_explicit_done_non_qualifying_resets_stale_streak - test_shortfall_persist_failure_surfaces_error - test_advance_streak_false_still_resets_on_failure Fixes Q00#405
e12a682 to
36c83f1
Compare
|
Rebased onto v0.29.2 (origin/main Squashed the seven review-iteration commits into one because main's Local verification:
Ready for re-review @Q00. |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
36c83f1for PR #428
Review record:
592b348e-c7bb-4650-b8f5-920d2e1bd5e2
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| 1 | tests/unit/mcp/tools/test_definitions.py:1464 | Refactoring | The rebased test file now carries two pairs of near-duplicate regression tests for stale-streak reset and shortfall persist failure (test_explicit_done_nonqualifying_live_rescore_resets_stale_streak vs. test_explicit_done_non_qualifying_resets_stale_streak, and test_explicit_done_shortfall_returns_error_when_persist_fails vs. test_shortfall_persist_failure_surfaces_error). They all assert the same contract, so consolidating them would reduce maintenance noise without losing coverage. |
Design Notes
The handler change is internally consistent: explicit done now owns the qualifying streak increment, _score_interview_state() owns stale-streak invalidation through explicit flags, and the shortfall path preserves the pending round while treating persistence as load-bearing. The added regression coverage matches the prior review history and exercises the key state-machine edges.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
…it 'done' path Fixes Q00#405 The explicit 'done' path only checked qualifies_for_seed_completion() and never verified completion_candidate_streak, so a single lucky low-ambiguity score could close the interview before the two-signal contract was met. This adds the AUTO_COMPLETE_STREAK_REQUIRED gate to the done branch, mirrors the same stale-streak invalidation rule via a shared _reset_stale_completion_streak helper, and splits the scorer call into two orthogonal kwargs: - advance_streak (default True): a qualifying score bumps the streak counter. Disabled on the explicit-done branch which owns the increment itself, to avoid a double-bump. - reset_on_failure (default True): a scorer error or non-qualifying score clears any existing streak. The done branch keeps this on so a weak live rescore still invalidates a stale streak. Shortfall handling now persists the state even when the live rescore is below threshold, so a second 'done' attempt with additional answers can advance rather than repeatedly recomputing from a lost baseline. Failure to persist surfaces as an error result instead of silently losing the pending round. This is the rebased-onto-main version of the original 7-commit branch. Because `81a8385` refactored tests/unit/mcp/tools/test_definitions.py (removing pre-existing InterviewHandler tests from Q00#342 and others that this branch had inherited but did not own), the PR Q00#428 test additions are now landed in a dedicated tests/unit/mcp/tools/test_interview_done_streak.py file — keeping scope narrow to this PR's work and avoiding re-introducing tests that main intentionally removed. Validation: - uv run ruff format --check . / ruff check . — clean - uv run mypy src/ouroboros/mcp/tools/authoring_handlers.py — no issues - pytest tests/unit/mcp/tools/ — 428 passed (10 new streak/shortfall tests)
36c83f1 to
f84f597
Compare
|
Rebased onto Note on history: a straight linear rebase of the original 7 commits hit a semantic conflict — main's commit What's preserved from the prior review iterations:
Local validation on the rebased HEAD:
Happy to split differently if you'd rather keep the multi-commit history — just let me know. |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
f84f597for PR #428
Review record:
f1892007-1819-4d0f-9d8c-6c066e468537
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
| 1 | tests/unit/mcp/tools/test_interview_done_streak.py:356 | Test maintenance | test_explicit_done_nonqualifying_live_rescore_resets_stale_streak substantially overlaps with test_explicit_done_non_qualifying_resets_stale_streak at line 574, and test_explicit_done_shortfall_returns_error_when_persist_fails at line 493 overlaps with test_shortfall_persist_failure_surfaces_error at line 633. Keeping both pairs makes the rebased test file harder to maintain without adding much extra behavioral coverage. |
Design Notes
The production change is coherent: the explicit-done branch now owns the qualifying streak increment, while _score_interview_state() retains stale-streak invalidation through the split advance_streak / reset_on_failure controls. That keeps the completion contract consistent across normal answers, rescoring failures, and explicit completion attempts.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Addresses ouroboros-agent[bot] non-blocking note on f84f597. Two pairs of near-duplicate tests exercised the same code path with identical inputs and overlapping assertions. They originated from separate review iterations (8765c43 "design-note regression" and 60e1d2e "follow-up owner-review note") and coexisted in the prior commit-by-commit history, becoming visibly redundant after the surgical rebase consolidated them into one file. Dropped: - test_explicit_done_non_qualifying_resets_stale_streak — subset of test_explicit_done_nonqualifying_live_rescore_resets_stale_streak (same streak=1 stale state, same weak rescore, same streak->0 assertion; the retained variant adds seed_ready + save_state.awaited coverage). - test_shortfall_persist_failure_surfaces_error — near-copy of test_explicit_done_shortfall_returns_error_when_persist_fails (same qualifying score, same persist failure, same error-surfacing assertion; only the session_id string differed). 8 tests remain; coverage of the two-signal completion contract, stale-streak invalidation, and shortfall persistence error surfacing is unchanged.
|
Pushed Dropped:
Both came from different review iterations ( Coverage preserved: 8 tests remain, pinning the two-signal completion contract, stale-streak invalidation across both the stored-score and live-rescore branches, shortfall persistence error surfacing, and the orthogonal Local validation on
|
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
227f4cbfor PR #428
Review record:
19eed221-f429-4bf3-920a-88f61a15d72d
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/mcp/tools/authoring_handlers.py:1490 | BLOCKING | The new explicit-done flow treats stale-streak invalidation as a correctness contract, but the ambiguity-gate branch still swallows save_state() failures after _score_interview_state() has already cleared completion_candidate_streak / stored ambiguity. If that write fails, the user receives a normal "Cannot complete yet" response while the old qualifying streak/snapshot remain on disk; the next request reloads that stale state and can complete after only one new qualifying signal, violating the two-signal guarantee this PR is fixing. This needs the same hard-error handling the shortfall branch now has. |
Non-blocking Suggestions
None.
Design Notes
The production-side refactor is directionally right: separating scorer-owned increment from reset semantics makes the explicit-done state machine easier to reason about, and moving the pending-round pop into the completion-only branch fixes the earlier transcript/UX bug. The remaining issue is persistence consistency: once streak reset is part of the contract, every branch that mutates it needs the same durability guarantees.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
…rors Addresses ouroboros-agent[bot] BLOCKING finding on 227f4cb. The explicit-'done' ambiguity-gate branch calls _score_interview_state(advance_streak=False, reset_on_failure=True) which can clear a stale completion_candidate_streak in memory. The subsequent save_state(state) call previously discarded its Result, so a silent persist failure would leave the pre-reset streak on disk. The next request would reload the stale qualifying streak and complete the interview after only a single new qualifying signal — violating the two-signal contract this PR exists to enforce. The refuse branch now mirrors the shortfall branch: on save_state failure it logs mcp.tool.interview.save_failed_on_ambiguity_gate and returns Result.err(MCPToolError("Failed to persist stale-streak reset: ...", tool_name="ouroboros_interview")) instead of the normal "Cannot complete yet" response. Regression test: test_ambiguity_gate_persist_failure_surfaces_error_not_swallowed pins the contract with streak=1, weak live rescore (0.55), and a save_state returning Result.err(ValidationError). Asserts is_err, "persist" in message, tool_name correct, and that complete_interview is never invoked — the handler must not swallow the failure behind the normal refuse response. 9 tests in test_interview_done_streak.py; pytest tests/unit/mcp/tools/ -> 427 passed.
|
Pushed The bug. The ambiguity-gate refuse branch at The fix. The refuse branch now mirrors the shortfall branch's persistence contract: refuse_save_result = await engine.save_state(state)
if refuse_save_result.is_err:
log.error("mcp.tool.interview.save_failed_on_ambiguity_gate", ...)
return Result.err(
MCPToolError(
f"Failed to persist stale-streak reset: {refuse_save_result.error}",
tool_name="ouroboros_interview",
)
)Regression test. Validation on
Scope kept narrow — only the one new error path in |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
fc49925for PR #428
Review record:
49f28b74-d2e2-4736-aab2-65af33056daa
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
The rebased change is narrowly scoped and coherent: the explicit done path now owns the qualifying streak increment, stale-streak invalidation is centralized, and both load-bearing persistence points on the done shortfall/refusal branches are treated as correctness-critical instead of best-effort. The dedicated regression file covers the important state-machine edges introduced by the fix.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Summary
completion_candidate_streak >= AUTO_COMPLETE_STREAK_REQUIREDgate to the explicit "done" path inInterviewHandler, matching the auto-complete path behaviorcompletion_candidate_streak=2where they expect successful completion via "done"test_interview_handle_done_refuses_when_streak_insufficientto verify the fixFixes #405
Test plan
test_interview_handle_done_completes_without_new_question— "done" with qualifying score AND streak >= 2 completes successfullytest_interview_handle_done_refuses_when_streak_insufficient— "done" with qualifying score but streak == 0 is rejected with informative messagetest_interview_handle_done_refuses_when_component_floors_fail— "done" with failing component floors still rejectedtest_interview_handle_done_rescores_degraded_brownfield_snapshot— brownfield re-scoring path with sufficient streak workstest_interview_handle_asks_closure_question_on_first_strong_score— auto-complete streak=0 asks closure questiontest_interview_handle_auto_completes_on_second_strong_score— auto-complete streak=1->2 completestest_definitions.pypasstest_ambiguity.pypasstest_interview.pypass