feat(ce-doc-review): autofix tiers, per-finding walk-through, multi-round memory#601
feat(ce-doc-review): autofix tiers, per-finding walk-through, multi-round memory#601
Conversation
…ion overhaul Capture requirements brainstorm (38 requirements across 11 sections) and implementation plan (8 units, 4 phases) for overhauling ce-doc-review: three-tier autofix classification (safe_auto / gated_auto / manual) aligned with ce-code-review names, per-severity confidence gates, ported walk-through + bulk-preview + routing-question interaction, in-doc Open Questions deferral, multi-round decision memory, domain-agnostic learnings-researcher rewrite, and ce:compound problem_type enum expansion. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… / design_pattern / tooling_decision / convention Adds four knowledge-track values to the ce:compound frontmatter schema to absorb the best_practice overflow that had been serving as a catch-all for architecture, design, tooling, and convention learnings. best_practice remains valid but is now documented as the fallback for entries that no narrower knowledge-track value covers. Schema updates land in both ce-compound and ce-compound-refresh (duplicate files are kept in sync intentionally). Category mappings added for the four new values under docs/solutions/. Plugin AGENTS.md gets a new "Documented Solutions" discoverability section naming all knowledge-track and bug-track categories so agents reading project instructions know the full taxonomy. Migrates 10 existing best_practice entries to narrower values and resolves one correctness-gap schema violation (workflow/todo-status-lifecycle.md -> workflow_issue). Remaining best_practice entries are genuinely broad enough to stay in that bucket (e.g., ce-pipeline-end-to-end-learnings). bun test tests/frontmatter.test.ts passes. Part of the ce-doc-review autofix and interaction overhaul plan. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… knowledge The learnings-researcher was shaped around bug-hunting — its taxonomy, category table, output framing, and DO/DON'T guidelines all assumed the caller was looking for past defects. But the team captures more than bugs: architecture patterns, design patterns, tooling decisions, conventions, and workflow learnings are all first-class in docs/solutions/ now. Authors were retreating to best_practice as a catch-all because the agent didn't reliably surface non-bug entries. This rewrite treats all learning shapes equally: - Identity framing dropped "preventing repeated mistakes" for "find and distill applicable learnings" — bug avoidance is one flavor, not the premise - Accepts a structured <work-context> input (Activity / Concepts / Decisions / Domains) and still handles free-form text for callers that don't use the structured form - Replaced the hardcoded Feature-Type-to-Directory table with a dynamic probe of docs/solutions/ at invocation time so new subdirectories (architecture-patterns/, design-patterns/, conventions/, etc.) are picked up automatically - Expanded keyword extraction from 4 dimensions (Module, Technical terms, Problem indicators, Component) to 8 by adding Concepts, Decisions, Approaches, Domains — so design-pattern and convention queries have something to match on - Made Step 3b (critical-patterns.md read) conditional on file existence — the file does not exist in this repo and demanding it was emitting warnings - Reframed output to "Applicable Past Learnings" with a Type field so knowledge-track entries surface as themselves (architecture_pattern, design_pattern, etc.) instead of being flattened to a bug-shape - Reworked Integration Points to drop the ce-doc-review reference (it explicitly does not dispatch this agent) and stop framing planning-time as the agent's primary home - Updated DO/DON'T guidelines: don't discard candidates for missing bug-shaped fields, treat non-bug learnings as first-class, don't assume critical-patterns.md exists Preserved the grep-first filtering strategy intact (it was already efficient) and the overall 7-step flow. File grew from 246 to 283 lines due to the expanded taxonomy and structured input documentation. Part of the ce-doc-review autofix and interaction overhaul plan. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…tion, strawman rule, framing guidance The persona subagent template was producing output that pushed too much into 'present' (requires user judgment) because the classification rule was permissive about what counted as a 'reasonable alternative' — (a) real fix + (b) 'accept the defect' strawman + (c) 'document in release notes' strawman looked like three valid approaches, inflating the manual/present bucket. This upgrade sharpens classification and improves persona output quality across all seven document-review personas via one file edit. Changes: - autofix_class enum renamed and expanded: [auto, present] -> [safe_auto, gated_auto, manual]. Aligns names with ce-code-review's first three tiers so cross-skill vocabulary is consistent. Drops the advisory tier in favor of a presentation-layer FYI subsection handled by synthesis - Three-tier routing spelled out with one-sentence descriptions and examples: safe_auto for single-option fixes, gated_auto for concrete fixes that touch meaning/scope (the new default for 'I know the fix but the author should sign off'), manual for genuine judgment calls - Strawman-aware classification rule added: a 'do nothing / accept the defect' option is not a real alternative — it is the failure state the finding describes. Same for 'document in release notes' and similar sidesteps. Listed alternatives must be ones a competent implementer would genuinely weigh. Positive and negative example pair included - Strawman safeguard on safe_auto: if safe_auto is chosen via strawman-dismissal, the dismissed alternatives must be named in why_it_matters. Any genuine non-strawman alternative downgrades the finding to gated_auto (user confirms) rather than silent apply - Auto-promotion patterns consolidated into an explicit rule set: factually incorrect behavior, missing standard security/reliability controls, codebase-pattern-resolved fixes (with cited pattern), framework-native-API substitutions, mechanically-implied completeness - Framing guidance block added (ported from ce-code-review's subagent-template): observable-consequence-first phrasing, why-the-fix-works grounding, 2-4 sentence budget, weak/strong illustrative pair. Tuned for document-review's surface — 'observable consequence' rather than 'observable behavior' since doc problems surface as disagreements, misreads, and wrong decisions rather than runtime behavior - Persona exclusion rule added: personas must exclude '## Deferred / Open Questions' sections and their timestamped subsections from review scope. Prevents round-2 feedback loops where deferred entries get flagged as new findings or quoted as evidence for suppression checks. Unit 6 is going to create these sections as the document-native defer mechanic Template grows from 52 to 95 lines, still under the 150-line @ inclusion threshold so SKILL.md continues to use @-inclusion. findings-schema.json autofix_class enum and description updated to match. Part of the ce-doc-review autofix and interaction overhaul plan. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…s and FYI subsection Rewrites the synthesis pipeline to route findings by the three new tiers (safe_auto / gated_auto / manual) with per-severity confidence gates and a distinct FYI subsection for low-confidence manual findings. This is the load-bearing synthesis change that enables the interaction model (Unit 5) and multi-round memory (Unit 7) to follow. Pipeline changes: - Step 3.2: replace flat 0.50 gate with per-severity table. P0 0.50, P1 0.60, P2 0.65, P3 0.75. FYI floor at 0.40: below-gate manual findings above the floor survive in an FYI subsection rather than being dropped — preserves observational value without forcing decisions - Step 3.4: delete residual-concern promotion. Replace with cross-persona agreement boost (+0.10, capped at 1.0) applied to merged findings that multiple personas flagged. Mirrors ce-code-review stage 5 step 4. Residual concerns surface in Coverage only; below-gate findings are not promoted back into the review surface - Step 3.5: tier language updated for three tiers. Contradictions route to manual with both perspectives - Step 3.6: auto-promotion patterns consolidated into an explicit rule set: codebase-pattern-resolved, factually incorrect behavior, missing standard security/reliability controls, framework-native-API substitutions, mechanically-implied completeness additions. Promotion targets safe_auto or gated_auto depending on whether the addition is substantive. Strawman-downgrade safeguard: if safe_auto was chosen via strawman-dismissal per the subagent template, synthesis verifies the dismissed alternatives were genuinely strawmen and downgrades to gated_auto if any were plausible design choices - Step 3.7: routing rewritten for three tiers + FYI. safe_auto applies silently in Phase 4. gated_auto enters walk-through with Apply recommended. manual enters walk-through with user-judgment framing. FYI-subsection routes outside the walk-through entirely - State-machine narration added to the phase header so each transition (Raised -> Classified -> one of four buckets) is a named step in synthesis prose, not an implied carry-forward - R30 fix-landed matching predicate specified explicitly: fingerprint + evidence-substring overlap >50%. Section renames count as different locations so a rename-caused re-raise is treated as new, not as 'fix did not land' Phase 4 updates: - Headless envelope extended with distinct 'Gated-auto findings' and 'FYI observations' sections alongside existing 'Applied N safe_auto fixes' and 'Manual findings' sections. Backward-compatible: callers that only read the old sections continue to work; new sections surface alongside, not instead of, the existing buckets. Landing the envelope change here (Unit 4) rather than Unit 8 means ce-plan 5.3.8 headless invocations never observe an interstitial state where new tiers are produced but the envelope can't carry them - Interactive presentation split by tier and finding type. FYI observations get their own subsection. Applied safe_auto fixes list every change with section + reviewer attribution - Phase 5 simplified: handoff to Unit 5 walkthrough.md for routing question and per-finding interaction. Terminal question (Unit 7's three-option structure) fires after walk-through or bulk action Creates tests/fixtures/ce-doc-review/seeded-plan.md — a test fixture plan doc with known issues planted across tier shapes (3 safe_auto, 3 gated_auto, 5 manual, 5 FYI candidates, 3 drop-worthy P3s). Used for the validation gate: run the new pipeline against this fixture and verify bucket distribution matches expectations before shipping Phase 2. Part of the ce-doc-review autofix and interaction overhaul plan. Unit 4 of 8. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…and bulk preview Ports ce-code-review's interactive flow (post-PR #590) into ce-doc-review while adapting for document-specific semantics. Replaces the current binary 'refine / complete' Phase 5 question with a four-option routing question that dispatches to focused interactions sized to the user's intent. New references: - walkthrough.md — the per-finding walk-through that fires when the user picks routing option A. Each finding renders in two surfaces: a terminal markdown block carrying the explanation (heading, What's wrong, Proposed fix, Why it works) and a compact two-line question stem carrying the decision. Four per-finding options: Apply / Defer (append to Open Questions) / Skip / LFG-the-rest. Post-tie-break recommendation marked with '(recommended)'. N=1 and no-append adaptations included. Unified completion report covers walk-through, LFG, Append-to-Open-Questions, and zero-findings terminal paths - bulk-preview.md — the compact plan preview before every bulk action (routing B, routing C, walk-through LFG-the-rest). Grouped buckets (Applying / Appending to Open Questions / Skipping) with one-line plain-English summary per finding. Two options: Proceed or Cancel. Cancel semantics route back to the originating question (routing or per-finding). Proceed executes the plan and emits the unified completion report Differences from ce-code-review's equivalents: - No tracker-detection tuple or sink-availability logic — Defer is always an in-doc append, no external tracker - No [TRACKER] label substitution — bucket is always 'Appending to Open Questions' - No fixer-subagent batch dispatch — the orchestrator applies gated_auto/manual fixes inline (document edits are single-file markdown with no cross-file dependencies) - No advisory variant with Acknowledge — advisory-style findings surface in the FYI subsection at the presentation layer, outside the walk-through - No .context/ artifact-lookup paths — ce-doc-review personas don't write run artifacts - Observable-consequence phrasing tuned for document surface (readers, implementers, downstream decisions) rather than ce-code-review's runtime-behavior framing SKILL.md changes: - Added 'Interactive mode rules' section at the top with the AskUserQuestion pre-load directive and numbered-list fallback rule. Mirrors ce-code-review's approach for reliable question-tool loading on Claude Code - Updated the Phases 3-5 handoff to load walkthrough.md and bulk-preview.md after agent dispatch completes, lazily via backtick paths per the skill compliance checklist Part of the ce-doc-review autofix and interaction overhaul plan. Unit 5 of 8. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Implements the Defer action's document-native analogue to ce-code-review's tracker-ticket mechanic. When the user chooses Defer on a finding (from the walk-through or from the bulk-preview Append-to-Open-Questions path), an entry appends to a '## Deferred / Open Questions' section at the end of the document under review. Design decisions: - In-doc append instead of sibling follow-up file or external tracker: deferred findings stay attached to the document they came from, naturally discoverable by anyone reading the doc, no new infrastructure required. Trade-off: deferred findings visibly mutate the doc — but that's the point, an Open Questions section is a canonical place for 'worth remembering but not acting on now' in planning docs - Section location handling covers three cases: end-of-doc heading (most common — append inside), mid-doc heading (user positioned deliberately — still append at that location, don't duplicate at end), absent (create at end, or above trailing separators/footers, or after frontmatter for frontmatter-only docs) - Timestamped subsections (### From YYYY-MM-DD review) group multiple defers per review and keep sequential reviews distinguishable. Same-day reviews within a session share one subsection; cross-day reviews get distinct subsections - Entry format: title, severity, reviewer attribution, confidence, and the why_it_matters framing. Intentionally excludes suggested_fix and evidence — those live in the run artifact when applicable; the in-doc entry is a concern summary for the reader returning months later, not a full decision packet - Idempotence on title collisions within the same subsection prevents duplicate entries from LFG-the-rest re-routing or orchestrator retry paths - Concurrent-edit safety: re-read document from disk before every append to reduce collision window with user-in-editor writes. No persistent lock — interactive review needs observation-before-write, not coordination - Failure path via blocking sub-question (Retry / Record-only / Convert- to-Skip) ensures no silent failures. Default-to-Record-only when the sub-question goes unanswered keeps in-memory state consistent - Upstream availability signal cached at Phase 4 start — known-unwritable documents suppress the Defer option in the walk-through menu and routing question; mid-flow failures route through the per-finding failure path without flipping the session signal Referenced by references/walkthrough.md (per-finding Defer option) and references/bulk-preview.md (routing option C Proceed). Includes a worked example showing the appended-content format in context. Part of the ce-doc-review autofix and interaction overhaul plan. Unit 6 of 8. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…nd decision primer
Replaces the current binary 'refine / complete' Phase 5 question with a
three-option terminal question that separates 'apply decisions' from
're-review' — the most common user frustration the brainstorm surfaced
was that the old wording forced a re-review even when the user wanted
to apply fixes and move on.
Terminal question:
- 'Apply decisions and proceed to <next stage>' — default when fixes
were applied or decisions were made. <next stage> substitutes to
ce-plan for requirements docs and ce-work for plan docs
- 'Apply decisions and re-review' — opt-in re-review when the user
believes their edits warrant another pass
- 'Exit without further action' — user wants to stop for now
When fixes_applied_count == 0 (zero-actionable case or all-Skip walk-
through), the 're-review' option drops (re-review is useless when
nothing changed) and the 'Apply decisions and' prefix drops from the
primary option's label so it matches what the system is doing.
Caller-context handling is implicit rather than requiring a formal
nested:true argument. The agent interprets 'Proceed to <next stage>'
contextually from the visible conversation state — when invoked from
inside another skill's flow, it returns control to the caller; when
standalone, it dispatches the next stage skill. No caller-side change
to ce-brainstorm or ce-plan. If implicit handling proves unreliable,
an explicit nested:true flag can be added as a follow-up.
Multi-round decision memory:
- SKILL.md Phase 2 adds a new {decision_primer} template variable that
accumulates prior-round decisions across the session. Round 1 renders
an empty <prior-decisions> block; round 2+ renders applied and
rejected entries by round with section, title, reviewer, confidence,
and rejection reason. Skip and Defer both count as 'rejected' for
suppression purposes. Cross-session persistence is out of scope —
fresh invocations start a fresh round 1
- synthesis-and-presentation.md adds an R29 suppression step. For each
current-round finding, compare against the primer's rejected list
using the same fingerprint + evidence-substring predicate as R30.
On match, drop the current-round finding and log the suppression in
Coverage. Materially-different exception: if the section was edited
enough that the prior-round evidence quote no longer appears in the
current document, treat as new. Synthesis is the authoritative gate;
the persona instruction below is a soft hint
- subagent-template.md adds a <decision-primer-rules> block that
instructs personas to honor the primer: don't re-raise a finding whose
title and evidence pattern-match a prior-round rejected entry unless
materially different. Prior-round Applied findings are informational
— the orchestrator verifies those landed via R30. Round 1 runs with
no primer constraints
Part of the ce-doc-review autofix and interaction overhaul plan. Unit 7
of 8.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ugh, defer, enum expansion
Extends pipeline-review-contract.test.ts with structural assertions for
the ce-doc-review overhaul. Uses pattern matching on key tokens rather
than exact prose so future wording improvements don't ossify the tests.
New test groups:
- ce-doc-review contract (8 tests):
- findings-schema autofix_class enum uses [safe_auto, gated_auto,
manual] — aligned with ce-code-review's first three tier names,
no advisory tier, no legacy auto/present values
- subagent template carries framing guidance, strawman rule, strawman
safeguard, Deferred/Open-Questions exclusion, and decision-primer
slot + rules
- synthesis pipeline has per-severity gates (P0 0.50 / P1 0.60 /
P2 0.65 / P3 0.75), FYI floor at 0.40, cross-persona agreement
boost (+0.10), three-tier routing, R29 rejected-finding suppression,
R30 fix-landed matching predicate
- headless envelope surfaces new tiers distinctly (Applied N safe_auto
fixes / Gated-auto findings / Manual findings / FYI observations)
with preserved 'Review complete' terminal signal
- terminal question is three-option by default with label adaptation
in the zero-actionable case; next-stage substitution rules
documented
- SKILL.md has Interactive mode rules with AskUserQuestion pre-load
via ToolSearch, decision_primer variable in dispatch, lazy
references to walkthrough.md and bulk-preview.md
- walkthrough has routing question with front-loaded distinguishing
words, four per-finding options, (recommended) marker, no advisory
variant, no tracker-detection machinery; bulk-preview has Proceed/
Cancel with three bucket labels, no Acknowledging bucket
- open-questions-defer implements the append mechanic with timestamped
subsections, field-level entry format, failure-path sub-question
(Retry / Record-only / Convert-to-Skip), no tracker-detection logic
- ce-compound frontmatter schema expansion contract (3 tests):
- problem_type enum includes architecture_pattern, design_pattern,
tooling_decision, convention — with best_practice retained as
fallback
- ce-compound-refresh schema is byte-identical to canonical
ce-compound schema (kept in sync per AGENTS.md duplicate convention)
- yaml-schema.md documents category mappings for the four new values
- ce-learnings-researcher domain-agnostic contract (1 test):
- Agent prompt frames as domain-agnostic (not bug-focused), names
Architecture / Design / Tooling / Conventions as first-class
learning shapes, accepts structured <work-context> input, replaces
the hardcoded category table with a dynamic probe, reads
critical-patterns.md conditionally, and drops ce-doc-review from
Integration Points (agent is ce-plan-owned)
All 13 new tests pass. Full suite: 798 pass, 0 fail. bun run
release:validate: in sync.
Part of the ce-doc-review autofix and interaction overhaul plan. Unit 8
of 8 — final unit; this ships the contract tests that lock in the
three-tier / multi-round / interaction-model changes shipped in Units 3
through 7.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d897823f52
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Preserve evidence snippets in decision primer entries (thread r3106059451, P1) R29 rejected-finding suppression and R30 fix-landed verification both use fingerprint + evidence-substring overlap (>50%) as their matching predicate. The primer format in SKILL.md previously stored only section/title/reviewer/confidence/reason, leaving the orchestrator without prior-round evidence needed to compute the overlap check. Adds an 'Evidence:' line to each primer entry containing the first evidence quote truncated to ~120 chars (word-boundary-preserving, internal quotes escaped). Documented inline why the field is required. - Align review-output-template.md with new three-tier taxonomy (thread r3106059452, P2) The interactive presentation template still referenced the legacy 'auto/present' contract — summary line, 'Auto-fixes Applied' heading, 'Auto | Present' coverage columns, and rules text all leaked stale bucket semantics into user-facing output. Rewrites the template for the new taxonomy: 'Applied N safe_auto fixes' summary line, a Tier column in severity tables, a dedicated 'FYI Observations' section, and Coverage columns for SafeAuto | GatedAuto | Manual | FYI | Residual. Residual Concerns section clarified: no longer promoted via residual-concern promotion (dropped in Unit 4) — listed for transparency only. Adds a note up-front that this template covers Phase 4 interactive presentation specifically; the headless envelope lives in synthesis-and-presentation.md. All contract tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df79db5015
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Both findings fall in the same pattern the prior round surfaced: high-level rules written without the deterministic mechanics the orchestrator needs to actually implement them. Cross-invocation signal from resolve-pr-feedback flagged the theme. - Add deterministic recommended-action tie-break to synthesis (thread r3106064985, P1) walkthrough.md said 'recommended action has already been normalized by synthesis step 3.6' but step 3.6 only handled auto-promotion; there was no rule for picking Apply/Defer/Skip when contributing personas implied different actions. Without a tie-break, LFG outcomes are non-deterministic across runs and walk-through '(recommended)' marks drift. Adds synthesis step 3.5b 'Deterministic Recommended-Action Tie-Break' implementing 'Skip > Defer > Apply' (most conservative first). Scans contributing personas in that order; first match wins. Documents persona-to-action mapping (safe_auto/gated_auto -> Apply, manual tradeoffs -> Defer, residual-suppression-eligible -> Skip, contradiction-set with 'keep as-is' -> Skip). Silent-personas fallback to Apply. Records a conflict-context string on the merged finding for the walk-through's R15 conflict-context line. Updates walkthrough.md's reference from 'synthesis step 3.6' to the correct '3.5b'. Invariant: walk-through and bulk-preview read recommended_action and never recompute. LFG-the-rest and routing option B execute the recorded recommendation in bulk, so the same review artifact always produces the same bulk plan. - Strengthen dedup key in open-questions-defer.md (thread r3106064987, P2) Title-only idempotence silently drops legitimate deferred entries when two different concerns share a short title in the same review date. Loses user-visible backlog context. Compound key is now 'normalize(section) + normalize(title) + evidence_fingerprint', mirroring the R29/R30 matching predicate (fingerprint + evidence-substring overlap) so cross-round and intra-round dedup behave consistently. Evidence fingerprint is the first ~120 chars of the finding's first evidence quote (same slice used in the decision primer). Falls back to section+title when evidence is unavailable. All contract tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: afc699da8f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Both findings continue the pattern: specification gaps in deterministic orchestration mechanics. Different gaps each round — Codex keeps finding adjacent spots where high-level rules need concrete implementation detail. - Drop routing-level recommendation labeling (thread r3106084407, P2) walkthrough.md's routing question said any of A/B/C/D could be marked '(recommended)' based on synthesis, but synthesis only defines per-finding recommended_action — there was no rule for picking a routing-level recommendation. Rather than invent one (a mapping from finding-set shape to routing recommendation would pressure users toward automated paths in ways that conflict with the user-intent framing of the four options), remove the routing- level (recommended) labeling. Per-finding (recommended) inside the walk-through and bulk preview still applies — that's deterministic via step 3.5b. The routing choice itself is user-intent: engage / trust / triage / skim. - Persist dedup-key fields in the appended entry (thread r3106084409, P2) Previous round added a compound dedup key (section + title + evidence_fingerprint) but the deferred-entry format on disk only stored title/severity/reviewer/confidence/why_it_matters — so retries or same-day reruns couldn't reliably reconstruct the key from persisted data. Duplicates could still slip through. Adds an HTML-comment line on each appended entry carrying the machine-readable dedup fields: <!-- dedup-key: section=\"...\" title=\"...\" evidence=\"...\" -->. Keeps the visible content reader-friendly (section now surfaces in the header line alongside severity/reviewer/confidence; the HTML comment stays hidden in rendered markdown) while persisting exactly the fields Step 4's compound-key check needs. Legacy entries without the comment fall back to title-only comparison — imperfect but strictly better than duplicate-appending. All contract tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f65830da9b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1. The finding's `why_it_matters` cites a specific existing codebase pattern -- not just "best practice" or "convention," but a concrete pattern with a file, function, or usage reference | ||
| 2. The finding includes a concrete `suggested_fix` that follows that cited pattern | ||
| 3. There is no genuine tradeoff -- the codebase context resolves any ambiguity about which approach to use | ||
| If the contributing personas are all silent on action (e.g., a merged `manual` finding from personas that all flagged it as observation without recommendation) → `recommended_action: Apply` as the pragmatic default; the walk-through still lets the user pick any of the four options. |
There was a problem hiding this comment.
Do not default actionless manual findings to Apply
Step 3.5b sets recommended_action: Apply when contributing personas are “silent on action”, but manual findings are allowed to omit suggested_fix (3.7), and end-of-walk-through execution applies each Apply-set finding’s suggested_fix (walkthrough.md line 183). This means a merged manual observation with no concrete fix can still be auto-routed to Apply (including via LFG), yielding non-executable or empty apply actions. Gate Apply on suggested_fix presence (or default this branch to Defer/Skip) so every recommended action is executable.
Useful? React with 👍 / 👎.
| A. Review each finding one by one — accept the recommendation or choose another action | ||
| B. LFG. Apply the agent's best-judgment action per finding | ||
| C. Append findings to the doc's Open Questions section and proceed | ||
| D. Report only — take no further action |
There was a problem hiding this comment.
Suppress routing option C when append is unavailable
The routing question always includes option C (Append findings to the doc's Open Questions section and proceed), but open-questions-defer.md says this option must be suppressed when append_available: false (e.g., read-only document). In that environment, exposing C sends users into guaranteed append failures and per-finding fallback prompts instead of preventing the invalid path up front. Add the append-availability adaptation here so option C is omitted when append is known unavailable.
Useful? React with 👍 / 👎.
Summary
Overhaul
ce-doc-reviewto match the interaction quality ofce-code-review(post-PR #590). The current skill surfaces too many findings as "needs user judgment" when one clear fix exists, nitpicks at low confidence, and ends with a binary "refine / complete" question that forces re-review when the user wants to apply fixes and move on. This PR replaces binary classification with a three-tier system aligned withce-code-review's names, ports the per-finding walk-through + bulk preview + routing question interaction, adds in-doc Open Questions deferral, introduces multi-round decision memory, rewritesce-learnings-researcherto handle domain-agnostic institutional knowledge, and expands thece-compoundfrontmatterproblem_typeenum.Brainstorm:
docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md(38 requirements, R31–R35 dropped from scope per pipeline-separation decision).Plan:
docs/plans/2026-04-18-001-feat-ce-doc-review-autofix-and-interaction-overhaul-plan.md(8 units, 4 phases).What changed
Autofix classification (three tiers, ce-code-review-aligned)
autofix_classenum renamed from[auto, present]to[safe_auto, gated_auto, manual]:safe_auto— one clear correct fix, applied silentlygated_auto— concrete fix exists but touches document meaning; user confirms via walk-through (Apply marked(recommended))manual— requires user judgment; multiple valid approachesNo
advisorytier. Low-confidencemanualfindings (above a 0.40 FYI floor, below the severity gate) surface in a distinct FYI subsection at the presentation layer — observational value without forcing a decision.Per-severity confidence gates replace the flat 0.50 floor: P0 0.50 / P1 0.60 / P2 0.65 / P3 0.75. Residual-concern promotion (old step 3.4) removed; replaced with cross-persona agreement boost (+0.10) on merged findings multiple personas flagged.
Classification rule sharpened with a strawman-aware test: a "do nothing / accept the defect" alternative is the failure state, not a real alternative. When any genuine non-strawman alternative exists,
safe_autodowngrades togated_autoso the user sees the tradeoff before silent apply.Framing guidance block added to the shared subagent template (observable-consequence-first, why-the-fix-works grounding, 2-4 sentence budget, positive/negative example pair). One file edit propagates to all 7 personas.
Interaction model (ported from ce-code-review)
After
safe_autofixes apply, the new four-option routing question dispatches:Per-finding walk-through renders each finding in two surfaces — a terminal markdown block (explanation) and a compact two-line question stem (decision). Four per-finding options: Apply / Defer / Skip / LFG-the-rest, with the post-tie-break recommendation marked
(recommended). N=1 and no-append adaptations included.Bulk preview shows the plan before LFG, Append-to-Open-Questions, and walk-through LFG-the-rest executes. Two options: Proceed / Cancel.
Differences from ce-code-review: no tracker-detection tuple or sink-availability logic (Defer is always in-doc append); no
[TRACKER]label substitution; no fixer-subagent batch dispatch (the orchestrator applies edits inline); no advisory-variant Acknowledge option.In-doc Open Questions deferral (document-native Defer)
When the user picks Defer on a finding, an entry appends to a
## Deferred / Open Questionssection at the end of the document under review, under a timestamped subsection (### From YYYY-MM-DD review). Deferred findings stay attached to the document they came from — no external tracker, no sibling file. Failure path (read-only doc, write error, concurrent-edit collision) surfaces a blocking sub-question with Retry / Record-only / Convert-to-Skip options.Multi-round decision memory
Every review round after the first passes a cumulative decision primer to every persona. Rejected findings (Skip or Defer) from any prior round in the current session are carried in a
<prior-decisions>block in the subagent template. Synthesis-level R29 suppression drops re-raised rejected findings using the same fingerprint + evidence-substring predicate as R30 fix-landed verification. No diff is passed — fixed findings self-suppress because evidence is gone from the current doc; regressions surface as normal findings; rejected findings are handled by pattern-match suppression. Cross-session persistence is out of scope.Terminal question (three options, separates "apply" from "re-review")
The old binary "refine / complete" question replaced with:
Apply decisions and proceed to <next stage>— default when fixes applied;<next stage>isce-planfor requirements docs,ce-workfor plan docsApply decisions and re-review— opt-in when the user believes edits warrant another passExit without further actionZero-actionable case drops the re-review option and adapts the primary label to
Proceed to <next stage>. Caller-context (ce-brainstorm nested invocation, ce-plan 5.3.8, standalone) is handled implicitly by the agent reading conversation state — nonested:trueargument or caller-side change.learnings-researcher rewrite (domain-agnostic)
ce-learnings-researcherrewritten for domain-agnostic institutional knowledge. Identity framing dropped "preventing repeated mistakes" for "find and distill applicable learnings." Accepts a structured<work-context>input (Activity / Concepts / Decisions / Domains) while preserving free-form fallback. Hardcoded Feature-Type-to-Directory table replaced with dynamic probe ofdocs/solutions/at invocation time. Keyword extraction expanded from 4 dimensions (Module, Technical terms, Problem indicators, Component) to 8 (adds Concepts, Decisions, Approaches, Domains).critical-patterns.mdread is now conditional on file existence (the file doesn't exist in this repo). Integration Points dropsce-doc-review— the agent isce-plan-owned andce-doc-reviewexplicitly does not dispatch it (pipeline-separation principle perdocs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md).ce-compound problem_type enum expansion
Four knowledge-track values added:
architecture_pattern,design_pattern,tooling_decision,convention. Absorbs thebest_practiceoverflow that had been a catch-all. Category mappings added. Bothce-compoundandce-compound-refreshschemas updated in sync. PluginAGENTS.mdgets a new "Documented Solutions" discoverability section naming all knowledge-track and bug-track categories. 10 existingbest_practiceentries migrated to narrower values; onecorrectness-gapschema violation resolved (docs/solutions/workflow/todo-status-lifecycle.md→workflow_issue).Scope boundaries
document-review-betafork. Phased ship directly on the stable skill; headless envelope preserves backward compatibility with ce-brainstorm and ce-plan callers.Testing
bun test: 798 pass, 0 fail. 13 new contract tests cover the three-tier enum, framing guidance, strawman rule, Open-Questions exclusion, decision primer, per-severity gates, FYI floor, cross-persona boost, R29 suppression, R30 verification, headless envelope tier buckets, terminal question three-option structure, label adaptation, Interactive-mode pre-load, walk-through and bulk-preview required mechanics, Open-Questions-defer append flow and failure path, schema enum expansion, schema duplicate-sync, yaml-schema category mappings, and learnings-researcher domain-agnostic framing.bun run release:validate: in sync. compound-engineering currently has 50 agents, 43 skills, 0 MCP servers.tests/fixtures/ce-doc-review/seeded-plan.mdwith known issues planted across tier shapes — 3 safe_auto candidates, 3 gated_auto, 5 manual, 5 FYI candidates, 3 drop-worthy P3s. Intended for pre-merge pipeline validation: run the new synthesis against this fixture and verify the bucket distribution matches expectations. Not yet exercised in this PR — doing so requires the new pipeline to be live in the installed skill cache. Recommend running once the PR lands and before promoting this into the default document-review flow.Review
document-review(5 personas, 38 findings → 13 auto-applied, pushed back on 15 findings through 2 refinement rounds)document-reviewagainst itself (5 personas, 38 findings → 9 auto-applied, resolved 15 findings through 2 refinement rounds including scope trim, tier rename, strawman safeguard, Open-Questions exclusion, primer structure spec, R30 matching predicate, pipeline sequencing fix, envelope-promotion-to-Unit-4, ce-plan-side validation for Unit 2)docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md); drove decision to NOT dispatch learnings-researcher from ce-doc-reviewdocs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md); finding-lifecycle state machine modeled explicitly in the plan with each transition narrated in synthesis prosedocs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md)Commits (9)
Each unit shipped as its own commit so the progression is reviewable:
docs(ce-doc-review): brainstorm + planfeat(ce-compound): Unit 1 — enum expansion + 10 migrationsfeat(learnings-researcher): Unit 2 — domain-agnostic rewritefeat(ce-doc-review): Unit 3 — subagent template upgrade + findings-schema enumfeat(ce-doc-review): Unit 4 — synthesis pipeline with per-severity gates and headless envelopefeat(ce-doc-review): Unit 5 — routing question + walk-through + bulk previewfeat(ce-doc-review): Unit 6 — in-doc Open Questions deferralfeat(ce-doc-review): Unit 7 — terminal question + multi-round decision primertest(ce-doc-review): Unit 8 — contract tests + framing polishPost-deploy validation
ce-doc-reviewagainsttests/fixtures/ce-doc-review/seeded-plan.mdand verify bucket distribution: ≥2 of 3 seeded safe_auto applied silently, ≥2 of 3 seeded gated_auto in walk-through Apply bucket, all 3 drop-worthy P3s dropped by the 0.75 gate, ≥3 of 5 FYI candidates in the FYI subsection, zero false-auto on manual-shaped seeds.ce-doc-reviewagainst this PR's brainstorm doc (docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md) as a before/after comparison with the old pipeline; verify meaningful decision-count reduction without regressing previously-applied P0/P1 fixes.ce-planagainst any plan doc and verify the rewritten learnings-researcher produces reasonable results across code-bug queries and non-code queries (skill-design, workflow). If output quality regresses, iterate via prompt edits in follow-up PRs.Fixes #562 spiritually (document-review analog of the ce-review interactive judgment overhaul).
🤖 Generated with Claude Code