diff --git a/docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md b/docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md new file mode 100644 index 000000000..8867e870d --- /dev/null +++ b/docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md @@ -0,0 +1,157 @@ +--- +date: 2026-04-18 +topic: ce-doc-review-autofix-and-interaction +--- + +# ce-doc-review Autofix and Interaction Overhaul + +## Problem Frame + +`ce-doc-review` consistently produces painful reviews. It surfaces too many findings as "requires judgment" when one reasonable fix exists, nitpicks on low-confidence items, and hands the user a wall of prose with only two terminal options — "refine and re-review" or "review complete." The interaction model lags behind what `ce-code-review` now offers (per PR #590): per-finding walk-through, LFG, bulk preview, tracker defer, and a recommendation-stable routing question. + +A real-world review of a plan document produced **14 findings all routed to "needs judgment"** — including five P3 findings at 0.55–0.68 confidence, three concrete mechanical fixes that a competent implementer would arrive at independently, and one subjective filename-symmetry observation that didn't need a decision at all. The user had to parse 14 prose blocks, pick answers, and then was forced into a re-review regardless of how little the edits actually changed. + +The gaps are structural and line up with four observable failure modes: + +1. **Classification is binary and coarse.** `autofix_class` is `auto` or `present`. There is no `gated_auto` tier (concrete fix, minor sign-off) and no `advisory` tier (report-only FYI). Everything that isn't "one clear correct fix with zero judgment" becomes `present`, which conflates high-stakes strategic decisions with small mechanical follow-ups. +2. **Confidence gate is flat and too low.** A single 0.50 threshold across all severities lets borderline P3s through. 
`ce-code-review` moved to 0.60 with P0-only survival at 0.50+. +3. **"Reasonable alternative" test is permissive.** Persona reviewers list `(a) / (b) / (c)` fix options where (b) and (c) are strawmen ("accept the regression," "document in release notes," "do nothing"). The classification rule reads those as multiple reasonable fixes and routes the finding to `present`, when in fact only (a) is a real option. +4. **Subagent framing and interaction model are pre-PR-590.** No observable-behavior-first framing guidance, no walk-through, no bulk preview, no per-severity confidence calibration, no post-fix "apply and proceed" exit — every path that addresses findings forces a re-review, even when the user is done. + +## Requirements + +**Classification tiers** + +- R1. `autofix_class` expands from two values to four: `auto`, `gated_auto`, `advisory`, `present`. Values preserve the existing "is there one correct fix" axis but add (a) a tier for concrete fixes that touch document scope / meaning and should be user-confirmed (`gated_auto`), and (b) a tier for report-only observations with no decision to make (`advisory`). +- R2. `auto` findings are applied silently, same as today. The promotion rules in the synthesis pipeline (current steps 3.6 and 3.7) are sharpened per R4 below and carry the new strictness forward. +- R3. `gated_auto` findings carry a concrete `suggested_fix` and a user-confirmation requirement. They enter the per-finding walk-through (R13) with `Apply the proposed fix` marked `(recommended)`. They are the default tier for "concrete fix exists, but it changes what the document says in a way the author should sign off on" (e.g., adding a backward-compatibility read-fallback, requiring two units land in one commit, substituting a framework-native API for a hand-rolled one). +- R4. `advisory` findings are report-only. They surface in a compact FYI block in the final output and do not enter the walk-through or any bulk action. 
Subjective observations ("filename asymmetry — could go either way"), drift notes without actionable fixes, and low-stakes calibration gaps live here. +- R5. `present` findings remain for genuinely strategic / scope / prioritization decisions where multiple reasonable approaches exist and the right choice depends on context the reviewer doesn't have. + +**Classification rule sharpening** + +- R6. The subagent-template classification rule adds teeth: "a 'do nothing / accept the defect' option is not a real alternative — it's the failure state the finding describes." If the only listed alternatives to the primary fix are strawmen, the finding is `auto` (or `gated_auto` if confirmation is warranted), not `present`. This applies equally to "document in release notes," "accept drift," and other deferral framings that sidestep the actual problem. +- R7. Auto-promotion patterns already scattered in prose (steps 3.6 and 3.7) are consolidated into an explicit promotion rule set, covering: + - Factually incorrect behavior where the correct behavior is derivable from context or the codebase + - Missing standard security / reliability controls with established implementations (HTTPS, fallback-with-deprecation-warning, input sanitization, checksum verification, private IP rejection, etc.) + - Codebase-pattern-resolved fixes that cite a concrete existing pattern + - Framework-native-API substitutions when a hand-rolled implementation duplicates first-class framework behavior (e.g., cobra's `Deprecated` field) + - Completeness additions mechanically implied by the document's own explicit decisions +- R8. The subagent template includes a framing-guidance block (ported from the `ce-code-review` shared template): observable-behavior-first phrasing, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, positive/negative example pair. One file change, applied universally across all seven personas. + +**Per-severity confidence gates** + +- R9. 
The single 0.50 confidence gate is replaced with per-severity gates: + - P0: survive at 0.50+ + - P1: survive at 0.60+ + - P2: survive at 0.65+ + - P3: survive at 0.75+ +- R10. The residual-concern promotion step (current step 3.4) is dropped. Cross-persona agreement instead boosts the confidence of findings that already survived the gate (by +0.10, capped at 1.0), mirroring `ce-code-review` stage 5 step 4. Residual concerns surface in Coverage only. +- R11. `advisory` findings are exempt from the confidence gate — they are report-only and can't generate false-positive work even at lower confidence. This is the safety valve for observations the reviewer wants on record but doesn't want to escalate. + +**Interaction model (post-fix routing)** + +- R12. After `auto` fixes are applied and before any user interaction, Interactive mode presents a four-option routing question that mirrors `ce-code-review`'s post-PR-590 design: + - (A) `Review each finding one by one — accept the recommendation or choose another action` + - (B) `LFG. Apply the agent's best-judgment action per finding` + - (C) `Append findings to the doc's Open Questions section and proceed` (ce-doc-review analogue of ce-code-review's "file a tracker ticket" — for docs, "defer" means appending the findings to a `## Deferred / Open Questions` section within the document itself, not an external system) + - (D) `Report only — take no further action` + If zero `gated_auto` / `present` findings remain after the `auto` pass, the routing question is skipped and the flow falls directly into the terminal question (R19). +- R13. Routing option A enters a per-finding walk-through, presented one finding at a time in severity order (P0 first). Each per-finding question carries: position indicator (`Finding N of M`), severity, confidence, a plain-English statement of the problem, the proposed edit, and a short reasoning grounded in the document's own content or the codebase. 
Options: `Apply the proposed fix` / `Defer — append to the doc's Open Questions section` / `Skip — don't apply, don't append` / `LFG the rest — apply the agent's best judgment to this and remaining findings`. Advisory-only findings substitute `Acknowledge — mark as reviewed` for Apply. +- R14. Routing option B and walk-through `LFG the rest` execute the agent's per-finding recommended action across the selected scope (all pending findings for B, remaining-undecided for walk-through). The recommendation for each finding is determined deterministically by R16. +- R15. Before any bulk action executes (routing B, routing C, walk-through `LFG the rest`), a compact plan preview renders findings grouped by intended action (`Applying (N):`, `Appending to Open Questions (N):`, `Skipping (N):`, `Acknowledging (N):`) with a one-line summary per finding. Exactly two responses: `Proceed` or `Cancel`. Cancel from walk-through `LFG the rest` returns the user to the current finding, not to the routing question. + +**Recommendation tie-breaking** + +- R16. When merged findings carry conflicting recommendations across contributing personas (one says Apply, another says Defer), synthesis picks the most conservative using `Skip > Defer > Apply > Acknowledge`, so walk-through recommendations and LFG behavior are deterministic across re-runs. + +**Terminal "next step" question (the re-review fix)** + +- R17. The current Phase 5 binary question (`Refine — re-review` / `Review complete`) conflates "apply fixes" with "re-review" into a single option. This is replaced by a three-option terminal question that separates the two axes: + - (A) `Apply decisions and proceed to ` — for requirements docs, hand off to `ce-plan`; for plan docs, hand off to `ce-work`. Default / recommended when fixes were applied or decisions were made. + - (B) `Apply decisions and re-review` — opt-in re-review when the user believes the edits warrant another pass. 
+ - (C) `Exit without further action` — user wants to stop for now. + When zero actionable findings remain (everything was `auto` or `advisory`), option B is omitted — re-review is not useful when there's nothing to re-examine. +- R18. The terminal question is distinct from the mid-flow routing question (R12). The routing question chooses *how* to engage with findings; the terminal question chooses *what to do next* once engagement is complete. The two are asked separately, not merged. +- R19. The zero-findings degenerate case (no `gated_auto` / `present` findings after the `auto` pass) skips the routing question entirely and proceeds directly to the terminal question with option B suppressed. + +**In-doc deferral (Defer analogue)** + +- R20. Document-review's `Defer` action appends the deferred finding to a `## Deferred / Open Questions` section at the end of the document under review. If the heading does not exist, it is created on first defer within a review. Multiple deferred findings from a single review accumulate under a single timestamped subsection (e.g., `### From 2026-04-18 review`) to keep sequential reviews distinguishable. This replaces `ce-code-review`'s tracker-ticket mechanic with a document-native analogue: deferred findings stay attached to the document they came from. +- R21. The appended entry for each deferred finding includes: title, severity, reviewer attribution, confidence, and the `why_it_matters` framing — enough context that a reader returning to the doc later can understand the concern without re-running the review. The entry does not include `suggested_fix` or `evidence` — those live in the review run artifact and can be looked up if needed. +- R22. When the append fails (document is read-only, path issue, write failure), the agent surfaces the failure inline and offers: retry, fall back to recording the deferral in the completion report only, or convert the finding to Skip. Silent failure is not acceptable. 
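The in-doc deferral mechanics above (R20–R21) reduce to a small string transform. A minimal sketch in TypeScript: the `DeferredFinding` shape, the function name, and the entry format are illustrative assumptions, not the skill's actual contract, and it assumes the `## Deferred / Open Questions` heading, when present, is the document's final section:

```typescript
// Illustrative only: the finding shape and entry format are assumptions.
interface DeferredFinding {
  title: string;
  severity: "P0" | "P1" | "P2" | "P3";
  reviewer: string;
  confidence: number;
  whyItMatters: string; // R21: the framing travels with the entry
}

const HEADING = "## Deferred / Open Questions";

function appendDeferral(doc: string, f: DeferredFinding, reviewDate: string): string {
  const subheading = `### From ${reviewDate} review`;
  const entry =
    `- **${f.title}** (${f.severity}, ${f.reviewer}, confidence ${f.confidence.toFixed(2)})` +
    `: ${f.whyItMatters}`;
  let out = doc.trimEnd();
  // R20: create the section on first defer within a review.
  if (!out.includes(HEADING)) out += `\n\n${HEADING}`;
  // R20: one timestamped subsection per review keeps sequential reviews distinguishable.
  if (!out.includes(subheading)) out += `\n\n${subheading}`;
  return `${out}\n${entry}\n`;
}
```

The idempotence checks on the heading and subsection are what make multiple defers within one review accumulate under a single subsection rather than spawning duplicates.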
+ +**Framing quality in reviewer output** + +- R23. Every user-facing surface that describes a finding — walk-through questions, LFG completion reports, Open Questions entries — explains the problem and fix in plain English. The framing leads with the *observable consequence* of the issue (what an implementer, reader, or downstream caller sees), not the document's structural phrasing. +- R24. The framing explains *why the fix works*, not just what it changes. When a pattern exists elsewhere in the document or codebase, reference it so the recommendation is grounded. +- R25. The framing is tight — approximately two to four sentences. Longer framings are a regression. + +**Cross-cutting** + +- R26. Tool-loading pre-flight mirrors `ce-code-review`: on Claude Code, `AskUserQuestion` is pre-loaded once at the start of Interactive mode via `ToolSearch` (`select:AskUserQuestion`), not lazily per-question. The numbered-list text fallback applies only when `ToolSearch` explicitly returns no match or the tool call errors. +- R27. Headless mode behavior is preserved. `mode:headless` continues to apply `auto` fixes silently and return all other findings as structured text to the caller. The caller owns routing. New tiers (`gated_auto`, `advisory`) must appear distinctly in headless output so callers can route them appropriately. + +**Multi-round decision memory** + +- R28. Every review round after the first passes a cumulative decision primer to every persona, carrying forward all prior rounds' decisions in the current interactive session: rejected findings (Skipped / Deferred from any prior round) with title, evidence quote, and rejection reason; plus Applied findings from any prior round with title and section reference. Personas still receive the full current document as their primary input. 
No diff is passed — fixed findings self-suppress because their evidence no longer exists, regressions surface as normal findings on the current doc, and rejected findings are handled by the suppression rule in R29. +- R29. Personas must not re-raise a finding whose title and evidence pattern-match a finding rejected in any prior round, unless the current document state makes the concern materially different. The orchestrator drops any finding that would violate this rule and records the drop in Coverage. +- R30. For each prior-round Applied finding, synthesis confirms the fix landed by checking that the specific issue the finding described no longer appears in the referenced section. If a persona re-surfaces the same finding at the same location, synthesis flags it as "fix did not land" in the final report rather than treating it as a new finding. + +**Institutional memory (learnings-researcher integration)** + +- R31. `ce-doc-review` dispatches `research:ce-learnings-researcher` as an always-on agent, in parallel with coherence-reviewer and feasibility-reviewer. The agent owns its own fast-exit behavior when `docs/solutions/` is empty or absent — no activation-gating in the orchestrator. +- R32. The orchestrator produces a compressed search seed during Phase 1's classify-and-select step: document type, 3-5 topic keywords extracted from the doc, named entities (tools, frameworks, patterns explicitly named), and the doc's top-level decision points. Learnings-researcher receives the search seed plus the document path, not the full document content. It searches `docs/solutions/` by frontmatter metadata first, then selectively reads matching solution bodies. +- R33. Learnings-researcher returns, per match: the solution doc's path, a one-line relevance reason, and the specific claim in the doc under review that the past solution relates to. Full solution content is loaded on demand by other personas or the orchestrator if the match is promoted into a finding. 
Results are capped at a small N (default 5) most relevant matches — past-solution volume is not the goal; directly applicable grounding is. +- R34. Learnings-researcher output surfaces in a dedicated "Past Solutions" section of the review output. Entries default to `advisory` tier (report-only grounding) unless a past solution directly contradicts a specific claim in the document under review, in which case they promote to `gated_auto` or `present` with the past solution's path as evidence. +- R35. Learnings-researcher content does not participate in confidence-gating (R9) or cross-persona dedup (existing step 3.3). Its role is to add institutional memory, not to compete with persona findings for user attention. + +**learnings-researcher agent rewrite (bundled)** + +- R36. Rewrite `research:ce-learnings-researcher` to treat the `docs/solutions/` corpus as domain-agnostic institutional knowledge. Code bugs are one genre among several, alongside skill-design patterns, workflow learnings, developer-experience discoveries, integration gotchas, and anything else captured by `ce-compound` and its refresh counterpart. The agent's primary function is "find applicable past learnings given a work context," not "find past bugs given a feature description." +- R37. The agent accepts a structured `` input from callers: a short description of what the caller is working on or considering, a list of key concepts / decisions / domains / components extracted from the caller's work, and an optional domain hint when one applies cleanly (e.g., `skill-design`, `workflow`, `code-implementation`). No mode flag is required — the context shape adapts to the calling skill without the agent branching on caller identity. +- R38. The hardcoded category-to-directory table is replaced with a dynamic probe of `docs/solutions/` to discover available subdirectories at runtime. Category narrowing uses the discovered set. The agent no longer assumes which subdirectories exist in a given repo. +- R39. 
Keyword extraction handles decision-and-approach-shape content alongside symptom-and-component-shape content. The extraction taxonomy expands from the current four dimensions (Module names, Technical terms, Problem indicators, Component types) to include Concepts, Decisions, Approaches, and Domains. No input shape is privileged over another; the caller's context determines which dimensions carry weight. +- R40. Output framing drops code-bug-biased phrasing ("gotchas to avoid during implementation," "prevent repeated mistakes" framed narrowly around bugs) in favor of neutral institutional-memory framing ("applicable past learnings," "related decisions and their outcomes"). The pointer + one-line-relevance + key-insight summary format carries across all input genres. +- R41. Read `docs/solutions/patterns/critical-patterns.md` only when it exists. When absent, the agent proceeds without it — this file is a per-repo convention, not a protocol requirement. +- R42. The agent's Integration Points section documents invocation by `/ce-plan`, `/ce-code-review`, `ce-doc-review`, and any other skill benefiting from institutional memory. Remove the framing that implies planning-time is the agent's primary home. + +**Frontmatter enum expansion (bundled)** + +- R43. Expand the `ce-compound` frontmatter `problem_type` enum to add non-bug genre values: `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`. Document `best_practice` as the fallback for entries not covered by any narrower value, not the default. Of the 8 existing `best_practice` entries, migrate the 7 that fit a narrower value (3 architecture patterns, 3 design patterns, 1 tooling decision), leave 1 as `best_practice`, and resolve the one `correctness-gap` schema violation (`workflow/todo-status-lifecycle.md`) into a valid enum value. Update `ce-compound` and `ce-compound-refresh` so they steer authors toward narrower values when the new categories apply.
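The membership check R43 implies can be sketched directly. The set below lists only the new values plus the fallback; the pre-existing bug-shaped values are elided, and the line-match parsing is an illustrative stand-in for a real YAML parser:

```typescript
// New values from R43 plus the fallback; the existing bug-shaped values are
// elided here (illustrative sketch, not the full enum).
const PROBLEM_TYPES = new Set([
  "architecture_pattern",
  "design_pattern",
  "tooling_decision",
  "convention",
  "best_practice", // fallback for entries no narrower value covers, not the default
]);

// Naive frontmatter probe: a production check would use a YAML parser.
function checkProblemType(markdown: string): { value: string | null; valid: boolean } {
  const fm = markdown.match(/^---\n([\s\S]*?)\n---/);
  const line = fm?.[1].split("\n").find((l) => l.startsWith("problem_type:"));
  const value = line ? line.slice("problem_type:".length).trim() : null;
  return { value, valid: value !== null && PROBLEM_TYPES.has(value) };
}
```

A check of this shape is what would have caught the `correctness-gap` violation R43 resolves: the value extracts fine but fails enum membership.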
## Scope Boundaries + +- Not introducing an external tracker integration (e.g., Linear / Jira / GitHub Issues). Document-review's Defer analogue is an in-doc `## Deferred / Open Questions` section. If users later want tracker integration for doc findings, that's a follow-up proposal. +- Not changing persona selection logic. The seven personas and the activation signals for conditional ones stay as-is. The persona markdown files themselves change only to absorb the subagent-template framing-guidance block. +- Not changing headless mode's structural contract with callers (`ce-brainstorm`, `ce-plan`). Headless continues to apply `auto` fixes silently and return a structured text envelope. Callers must be updated to handle the new `gated_auto` and `advisory` tiers, but the envelope shape stays. +- Not adding a `requires_verification` field or an in-skill fixer subagent. Document edits happen inline during the walk-through; there is no batch-fixer analogue to `ce-code-review`'s Step 3 fixer because document fixes are trivially confined in scope (single-file markdown edits). +- Not addressing iteration-limit guidance. The existing "after 2 refinement passes, recommend completion" heuristic stays. +- Not persisting decision primers across interactive sessions. The cumulative decision list (R28) lives in-memory across rounds within a single invocation. A new invocation of `ce-doc-review` on the same doc starts fresh with no carried memory, even if prior-session decisions were Applied to the document. This mirrors `ce-code-review` walk-through state rules. +- Not building a fully new frontmatter schema. R43 adds non-bug enum values but does not redesign the schema dimensions (no split into `learning_category` + `problem_type`, no new required fields). The existing authoring flow stays the same; only the set of valid `problem_type` values grows.
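The per-severity gates (R9) and the cross-persona agreement boost (R10) reduce to a small pure function. A sketch under assumptions: the thresholds come from R9, but the finding shape and the reading of "+0.10" as a single flat boost for any agreement are illustrative choices, not settled design:

```typescript
// Thresholds from R9; the +0.10 survivor boost from R10.
type Severity = "P0" | "P1" | "P2" | "P3";
const GATES: Record<Severity, number> = { P0: 0.5, P1: 0.6, P2: 0.65, P3: 0.75 };

interface GatedFinding {
  severity: Severity;
  confidence: number;
  agreeingPersonas: number; // personas that independently raised the finding
}

function gate(f: GatedFinding): { survives: boolean; confidence: number } {
  // R9: survival is judged on the persona's own confidence, per severity.
  if (f.confidence < GATES[f.severity]) return { survives: false, confidence: f.confidence };
  // R10: cross-persona agreement boosts survivors only, capped at 1.0.
  const boosted = f.agreeingPersonas > 1 ? Math.min(1.0, f.confidence + 0.1) : f.confidence;
  return { survives: true, confidence: boosted };
}
```

Note the ordering matters: because the boost applies only after the gate, agreement cannot rescue a finding that fails its severity's threshold, which is exactly the behavior change versus the dropped residual-concern promotion.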
+ +## Design Decisions Worth Calling Out + +- **Three new tiers, not two.** A minimal refactor could add only `gated_auto` and keep `advisory` collapsed into `present`. But real-world evidence shows FYI-grade findings (subjective observations, low-stakes drift notes) drive significant noise, and folding them into `present` forces user decisions on things that don't warrant any decision. Adding `advisory` as a distinct tier is cheap (one enum value + one output block) and materially reduces decision fatigue. +- **Strawman-aware classification rule in the subagent template, not in synthesis.** Moving the rule to synthesis means persona reviewers still emit inflated alternative lists and the orchestrator retroactively collapses them. Moving it to the subagent template changes what reviewers produce at the source, so the evidence and framing travel together correctly. +- **Per-severity confidence gates, not a flat 0.60.** A flat 0.60 would still let 0.60–0.68 P3 nits through (three of them in the attached real-world example). Severity-aware gates recognize that a P3 finding at 0.65 is noise in a way a P1 at 0.65 is not, because P3 impact is low enough that the expected value of a borderline call doesn't justify the user's attention. +- **Separate terminal question from routing question.** The current skill conflates "engage with findings" and "exit the review" into one question with two poorly-aligned options. Splitting them gives the user explicit control over whether re-review happens — the most common user frustration surfaced in the bug report that prompted this work. +- **In-doc Open Questions section, not a sibling follow-up note or external tracker, as Defer analogue.** Documents don't have the same "handoff to a different system" shape that code findings do. A sibling markdown note would fragment context; an external tracker would add platform complexity with no upside for document review. 
Appending deferred findings to a `## Deferred / Open Questions` section inside the document itself keeps deferred concerns attached to the artifact they came from, is naturally discoverable by anyone reading the doc, and requires no new infrastructure. The trade-off is that deferred findings visibly mutate the doc — but that is the point: "I want to remember this but not act now" is exactly what an Open Questions section expresses in a planning doc. +- **Port framing-guidance once via the shared subagent template.** Matches how `ce-code-review` shipped the same fix in PR #590. One file change, applied universally. Per-persona edits would inflate scope to seven files; a synthesis-time rewrite pass would add per-review model cost and paper over the root cause in the persona output itself. +- **Classification-rule sharpening and promotion-pattern consolidation ship together with the tier expansion.** Shipping the tiers without the sharpened rule would leave the classifier behavior unchanged and just add new tier labels nothing routes to. Shipping the rule without the tiers has no tier to promote findings into. +- **Keep the existing persona markdown files mostly unchanged.** The framing-guidance block lives in the shared subagent template that wraps every persona dispatch; the personas themselves retain their confidence calibration, suppress conditions, and domain focus. This keeps the persona-level failure-mode catalogs stable while upgrading the shared framing bar. +- **No diff passed to the multi-round decision primer.** Fixed findings self-suppress because their evidence is gone from the current doc; regressions surface as normal findings; rejected findings are handled by the suppression rule (R29). A diff would be signal amplification, not a correctness requirement, and would add prompt weight without changing what the agent can do. 
+- **learnings-researcher rewrite bundled, not split.** The review-time use case has no consumer without ce-doc-review, so splitting into a precursor PR would ship a dormant feature. Bundling keeps the change coherent and easier to review as one unit. The agent rewrite (R36–R42) and the frontmatter enum expansion (R43) also benefit `/ce-plan`'s existing usage, so the scope investment pays off beyond ce-doc-review. +- **Generalize learnings-researcher rather than patch with a mode flag.** The original proposal was a minimal `review-time` mode flag grafted onto the agent. But the real issue is that the agent's taxonomy, categories, and output framing are code-bug-shaped even when invoked by non-review callers — the plugin already captures non-code learnings via `ce-compound` / `ce-compound-refresh`, and the agent should treat them as first-class. Rewriting for domain-agnostic institutional knowledge is a bigger change but removes the drift, rather than accumulating special cases. +- **Expand `problem_type` rather than introduce a new orthogonal dimension.** A cleaner design might split current `problem_type` into separate `learning_category` (genre) and `problem_type` (bug-shape detail) fields. But that requires migrating every existing entry and teaching authors to pick both. Expanding the existing enum with non-bug values absorbs the `best_practice` overflow with minimal schema churn and keeps the authoring flow stable. + +## Calibration Against Real-World Example + +The attached review output (14 findings, all `present`) re-classifies under the proposed rules as: + +- **4 `auto`** (silently applied, no user interaction): missing fallback-with-deprecation-warning (industry-standard pattern), public-repo grep step (single action), deployment-coupling-commit guarantee (mechanical), cobra's native `Deprecated` field (framework-native substitution). +- **1 `advisory`** (FYI line): filename asymmetry — genuinely ambiguous, no wrong answer. 
+- **4 `present`** (walk-through): historical-docs rule, alias-compatibility breaking-change, escape-hatch scope decision, Unit merging decision. +- **5 dropped** by per-severity gates: five P3-P2 findings at 0.55–0.68 confidence. + +Net: the user sees **4 decisions**, not 14. The walk-through's `LFG the rest` escape further bounds fatigue — after the user calibrates on the agent's recommendations, they can bail and accept the rest. diff --git a/docs/plans/2026-04-18-001-feat-ce-doc-review-autofix-and-interaction-overhaul-plan.md b/docs/plans/2026-04-18-001-feat-ce-doc-review-autofix-and-interaction-overhaul-plan.md new file mode 100644 index 000000000..73850905f --- /dev/null +++ b/docs/plans/2026-04-18-001-feat-ce-doc-review-autofix-and-interaction-overhaul-plan.md @@ -0,0 +1,708 @@ +--- +title: ce-doc-review Autofix and Interaction Overhaul +type: feat +status: active +date: 2026-04-18 +origin: docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md +--- + +# ce-doc-review Autofix and Interaction Overhaul + +## Overview + +Overhaul `ce-doc-review` to match the interaction quality and auto-fix leverage of `ce-code-review` (post-PR #590). Today, ce-doc-review surfaces too many findings as "needs user judgment" when one clear fix exists, nitpicks at low confidence, and ends with a binary question that forces re-review when the user wants to apply fixes and move on. This plan expands the autofix classification from binary (`safe_auto` / `manual`) to three tiers (`safe_auto` / `gated_auto` / `manual`) using ce-code-review-aligned names, raises and severity-weights the confidence gate, ports the per-finding walk-through + bulk-preview + routing-question pattern from `ce-code-review`, adds in-doc deferral, introduces multi-round decision memory, rewrites `learnings-researcher` to handle domain-agnostic institutional knowledge, and expands the `ce-compound` frontmatter `problem_type` enum to absorb the `best_practice` overflow. 
**Advisory-style findings** (low-confidence observations worth surfacing but not worth a decision) render as a distinct FYI subsection of the `manual` bucket at the presentation layer rather than a separate schema tier. + +The plan ships in phases so lower-risk foundation work (enum expansion, agent rewrite) can land and stabilize before the interaction-model port. Each implementation unit is atomic and can ship as its own PR. + +## Problem Frame + +See origin document for full problem framing. In brief, a real-world review surfaced **14 findings all routed to `manual`**, including five P3s at 0.55–0.68 confidence, three concrete mechanical fixes that a competent implementer would arrive at independently, and one subjective observation with no right answer. Under the revised rules the same review produces 4 auto-applied fixes, 1 FYI entry, 4 real decisions, and 5 dropped — the user engages with 4 items instead of 14. + +## Requirements Trace + +38 requirements from the origin document. Full definitions live there; listed here for traceability. 
+ +- **Classification tiers:** R1–R5 (three tiers — add `gated_auto`; keep `safe_auto` / `manual`; advisory-style findings become presentation-layer FYI subsection of manual, not a distinct enum value) +- **Classification rule sharpening:** R6–R8 (strawman-aware rule with safeguard, consolidated promotion patterns, shared framing-guidance block) +- **Per-severity confidence gates:** R9–R11 (P0 0.50 / P1 0.60 / P2 0.65 / P3 0.75; drop residual promotion; low-confidence manual findings surface in a distinct FYI subsection without being dropped) +- **Interaction model:** R12–R16 (4-option routing, per-finding walk-through, bulk preview, tie-break) +- **Terminal question:** R17–R19 (three-option split: apply-and-proceed / apply-and-re-review / exit) +- **In-doc deferral:** R20–R22 (append to `## Deferred / Open Questions` section) +- **Framing quality:** R23–R25 (observable consequence, why-the-fix-works, tight) +- **Cross-cutting:** R26–R27 (AskUserQuestion pre-load, headless preservation) +- **Multi-round memory:** R28–R30 (cumulative decision primer, suppression, fix-landed verification) +- **learnings-researcher agent rewrite:** R36–R42 (domain-agnostic, ``, dynamic category probe, optional critical-patterns read) — benefits `/ce-plan`'s existing usage +- **Frontmatter enum expansion:** R43 (add `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`) + +**Dropped from scope:** R31–R35 (learnings-researcher integration into ce-doc-review). See Key Technical Decisions and Alternative Approaches Considered for the rationale. **In scope:** R36–R42 (learnings-researcher domain-agnostic rewrite, Unit 2) and R43 (frontmatter enum expansion, Unit 1), which benefit `/ce-plan`'s existing usage even though learnings-researcher is not dispatched from ce-doc-review. + +## Scope Boundaries + +- Not introducing external tracker integration. Document-review's Defer analogue is an in-doc section. +- Not changing persona activation/selection logic. 
The 7 personas and their conditional activation signals stay as-is. +- Not adding `requires_verification` or a batch fixer subagent. Document fixes apply inline. +- Not addressing iteration-limit guidance. "After 2 refinement passes, recommend completion" stays. +- Not persisting decision primers across interactive sessions (matches `ce-code-review` walk-through state rules). +- Not redesigning the frontmatter schema dimensions. Enum expansion only — no new `learning_category` field alongside `problem_type`. + +### Deferred to Separate Tasks + +- Frontmatter validation test. Adding a pre-commit or CI check that enforces `problem_type` enum membership is valuable (`correctness-gap` slipped through today) but is additive and can ship as a follow-up. +- Updating the frontmatter `component` enum. It's heavily Rails-focused and would benefit from expansion for non-Rails work, but that's out of scope for this overhaul. + +## Context & Research + +### Relevant Code and Patterns + +**Port-from targets (`ce-code-review`):** +- `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md` — per-finding walk-through (terminal output block + blocking question split, fixed-order options, `(recommended)` marker, LFG-the-rest escape, N=1 adaptation, unified completion report) +- `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md` — grouped Apply/Filing/Skipping/Acknowledging preview with `Proceed` / `Cancel` +- `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md:51-73` — framing-guidance block for personas +- `plugins/compound-engineering/skills/ce-code-review/SKILL.md:75` — AskUserQuestion pre-load directive +- `plugins/compound-engineering/skills/ce-code-review/SKILL.md:477` (stage 5 step 7b) — recommendation tie-break order `Skip > Defer > Apply > Acknowledge` + +**Target surfaces (`ce-doc-review`):** +- `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` — orchestrator +- 
`plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` — framing-guidance block lands here +- `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` — tier routing, confidence gate, decision primer, and headless envelope updates +- `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` — `autofix_class` enum expansion +- `plugins/compound-engineering/agents/document-review/ce-*.agent.md` — 7 persona files (mostly unchanged) + +**Caller contracts:** +- `plugins/compound-engineering/skills/ce-brainstorm/SKILL.md:188-194` — invokes interactively on requirements doc +- `plugins/compound-engineering/skills/ce-brainstorm/references/handoff.md:29,56,65` — surfaces residual P0/P1 adjacent to menus; offers re-review +- `plugins/compound-engineering/skills/ce-plan/references/plan-handoff.md:5-17` — phase 5.3.8; interactive normally, `mode:headless` in pipeline + +**Schema surfaces:** +- `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` (canonical) and `yaml-schema.md` (human-readable) — `problem_type` enum definitions + category mapping +- `plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml` and `yaml-schema.md` — **duplicate** copies, must update in sync +- `plugins/compound-engineering/skills/ce-compound/SKILL.md` — author steering language +- `plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md` — refresh steering language + +**Agent to rewrite:** +- `plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md` — domain-agnostic rewrite + +**Test surfaces:** +- `tests/pipeline-review-contract.test.ts:279-352` — asserts ce-doc-review is invoked with `mode:headless` in pipeline mode. Will need extension for new tier visibility in headless envelope. +- `tests/converter.test.ts:417-438` — OpenCode 3-segment → flat name rewrite for ce-doc-review agent refs. Unaffected. 
+- No dedicated test file for ce-doc-review itself. Adding one is in scope (Unit 8).
+
+### Institutional Learnings
+
+Seven directly applicable learnings from `docs/solutions/`:
+
+- `docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md` — **Mandatory read.** Authored from the `ce-code-review` PR #590 redesign this plan ports. Documents the bulk-preview vs. walk-through distinction, the 4-option `AskUserQuestion` cap as a structural constraint, the "two semantic meanings in one flag" risk, and the "sample 10-20 real artifacts before accepting research-agent architectural recommendations" rule.
+- `docs/solutions/skill-design/compound-refresh-skill-improvements.md` — Six-item skill-review checklist (no hardcoded tool names, no contradictory phase rules, no blind questions, no unsatisfied preconditions, no shell in subagents, autonomous-mode opt-in). The "borderline cases get marked stale in autonomous mode" template informs how FYI-subsection findings behave in headless runs.
+- `docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md` — Classifies `learnings-researcher` as ce-plan-owned (HOW / implementation-context). **Drove the decision to remove R31–R35 from scope entirely:** rather than dispatch from ce-doc-review in any form (always-on or conditional), keep the agent in its ce-plan pipeline lane. ce-doc-review does not dispatch it. Users who want institutional memory should invoke ce-plan.
+- `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` — Default to path-passing; 7× tool-call difference from prompt phrasing. Relevant to Unit 2's learnings-researcher rewrite — the structured work-context input should pass paths and compressed context, not full documents.
+- `docs/solutions/skill-design/beta-skills-framework.md` + `beta-promotion-orchestration-contract.md` + `ce-work-beta-promotion-checklist-2026-03-31.md` — Beta-skill pattern for major overhauls.
Considered and rejected for this work (see Alternative Approaches below). +- `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md` — **High severity for this plan.** Model tier/confidence/deferral as an explicit state machine; re-read state at each transition boundary. Directly shapes Unit 4 (synthesis pipeline) structure. +- `docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md` — When enum expands, update instruction-discovery surface (schema reference, learnings-researcher prompt, AGENTS.md) in the same PR. Shapes Unit 1 and Unit 2. + +### External References + +No external research was needed — the work is internal plugin refactoring with strong local patterns (ce-code-review post-PR #590 is the canonical reference). + +## Key Technical Decisions + +- **Port the ce-code-review walk-through / bulk-preview pattern rather than invent a new one.** Same menu shape, same tie-break rule, same AskUserQuestion pre-load pattern. Users who've experienced ce-code-review's new flow will find ce-doc-review consistent. **Tier naming aligned with ce-code-review** (`safe_auto`, `gated_auto`, `manual`) so cross-skill mental model is consistent. +- **Three tiers, not four — advisory is a display treatment, not an enum value.** ce-code-review has four tiers (adds `advisory`) because code reviews have a meaningful "report-only, release/human-owned" category (rollout notes, residual risk, learnings). Document reviews rarely produce that shape — FYI observations are typically just low-confidence manual findings that don't need a decision. Collapsing to three tiers + FYI-subsection presentation drops a schema value without losing the user-facing distinction between "needs decision" and "FYI, move on." Cognitive load lower; schema simpler. 
+- **Interaction-surface convergence with ce-code-review is intentional; keep the skills separate.** Post-plan, ce-doc-review and ce-code-review share interaction mechanics (walk-through shape, bulk preview, routing question, tie-break order) but evaluate genuinely different things: the personas are different (coherence/feasibility/scope-guardian for docs vs correctness/security/performance for code), the inputs are different (prose vs diff), and the failure modes are different. Shared interaction scaffold, distinct review content. Unifying into one skill would smear the focused-review value each delivers today. +- **Ship without a `ce-doc-review-beta` fork.** See Alternative Approaches. +- **Do not dispatch `learnings-researcher` from ce-doc-review at all.** The agent is ce-plan-owned (implementation-context per `research-agent-pipeline-separation-2026-04-05.md`). When ce-doc-review runs inside ce-plan, the agent has already fired and its output lives in the plan. When ce-doc-review runs inside ce-brainstorm, the context is WHY (product-framing), not HOW (implementation) — running an implementation-context agent would be a pipeline violation. When ce-doc-review runs standalone, the personas already cover coherence, feasibility, and scope — institutional memory is a nice-to-have that adds dispatch cost without proportional value. Users who want institutional memory for a doc should invoke `/ce-plan`, where that lookup is a first-class pipeline stage. +- **Put R1–R8 classification changes in the shared subagent template**, not in each persona. One file edit propagates to all 7 personas. Matches how `ce-code-review` shipped the same quality upgrade. +- **Keep R9–R11 confidence gates in synthesis** (`synthesis-and-presentation.md` step 3.2), not in personas. Personas keep their existing HIGH/MODERATE/<0.50 calibration. 
+- **No diff passed in multi-round primer (R28).** Fixed findings self-suppress (evidence gone); regressions surface as normal findings; rejected findings use pattern-match suppression. The diff would add prompt weight without changing what the agent can detect. +- **Enum expansion values go on the knowledge track**, not the bug track. All four new values (`architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`) are knowledge-track per the two-track schema in `schema.yaml:12-31`. +- **Update duplicate schema files in both `ce-compound` and `ce-compound-refresh`** in the same commit. They are intentional duplicates; divergence is a known pitfall. +- **Model tier/confidence/deferral as an explicit state machine** (per `git-workflow-skills-need-explicit-state-machines` learning). See High-Level Technical Design for the state diagram. + +## Open Questions + +### Resolved During Planning + +- **Beta fork vs phased ship?** Phased ship without beta. The overhaul is large but cleanly phaseable; each phase is independently testable; callers stay compatible via the preserved headless envelope contract (R27). +- **Dispatch learnings-researcher from ce-doc-review?** No. Dropped from scope (R31–R35 removed). The agent is ce-plan-owned; users who want institutional memory should invoke ce-plan, which has it as a first-class pipeline stage. Unit 2 still rewrites the agent to be domain-agnostic — that benefits ce-plan's existing usage. +- **Diff in multi-round primer?** No. Decision metadata alone is sufficient. +- **Defer destination for docs?** In-doc `## Deferred / Open Questions` section, not a sibling file. See origin document R20. + +### Deferred to Implementation + +- **How many existing `best_practice` entries map to each new enum value?** Research suggests ~11 candidates; final mapping happens when migrating. 
+- **Exact wording of the `gated_auto` / `manual` labels in the AskUserQuestion menus.** Draft wording exists in origin document R12–R13; final phrasing is validated against harness rendering during implementation.
+- **Exact line-count budget for the subagent-template framing-guidance block.** Target ~40–50 lines per the research findings; adjust as needed to stay under the ~150-line `@` inclusion threshold.
+- **Whether to extend `tests/pipeline-review-contract.test.ts` or add a new `tests/ce-doc-review-contract.test.ts`.** Decide during Unit 8 based on assertion overlap.
+
+## High-Level Technical Design
+
+> *This illustrates the intended approach and is directional guidance for review, not implementation specification. The implementing agent should treat it as context, not code to reproduce.*
+
+### Finding lifecycle state machine
+
+Per the `git-workflow-skills-need-explicit-state-machines` learning, the tier/confidence/deferral interactions form a state machine that must be modeled explicitly — prose-level carry-forward silently breaks.
+
+```mermaid
+stateDiagram-v2
+    [*] --> Raised: Persona emits finding
+    Raised --> Dropped: conf below per-severity gate (R9) and not FYI-eligible
+    Raised --> Dropped: re-raises rejected prior-round finding (R29)
+    Raised --> Deduplicated: fingerprint matches another persona's finding
+    Deduplicated --> Classified
+    Raised --> Classified: after confidence + dedup gates
+
+    Classified --> SafeAuto: autofix_class = safe_auto (R2)
+    Classified --> GatedAuto: autofix_class = gated_auto (R3)
+    Classified --> Manual: autofix_class = manual (R5)
+    Classified --> FYI: observation-shaped manual, FYI floor <= conf < per-severity gate
+
+    SafeAuto --> Applied: orchestrator edits doc silently
+    Applied --> Verified: next round confirms fix landed (R30)
+    Applied --> FixDidNotLand: persona re-raises same finding at same spot (R30)
+
+    GatedAuto --> WalkThrough: routing option A (R13)
+    GatedAuto --> BulkApply: routing option B LFG (R14)
+    GatedAuto --> BulkDefer: routing option C (R12)
+    Manual --> WalkThrough
+    Manual --> BulkApply
+    Manual --> BulkDefer
+    FYI --> Reported: surfaces in FYI subsection at presentation layer, no decision
+
+    WalkThrough --> UserChoice
+    UserChoice --> Applied: user picks Apply
+    UserChoice --> Deferred: user picks Defer (R20-R22)
+    UserChoice --> Skipped: user picks Skip
+
+    BulkApply --> Applied: proceed
+    BulkDefer --> Deferred: proceed
+
+    Deferred --> AppendedToOpenQuestions: append succeeds (R20)
+    Deferred --> Skipped: append fails, user converts to Skip (R22)
+
+    Verified --> [*]
+    FixDidNotLand --> [*]: flagged in report
+    AppendedToOpenQuestions --> [*]
+    Skipped --> [*]
+    Reported --> [*]
+    Dropped --> [*]
+```
+
+This diagram models ce-doc-review persona findings only. Learnings-researcher findings (R31–R35) are out of scope — ce-doc-review does not dispatch the agent (see Key Technical Decisions and Alternative Approaches Considered).
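The state-machine discipline the learning calls for can be made concrete with a small sketch (hypothetical — the shipped synthesis surface is markdown prose, and none of these identifiers exist in the plugin): encode the legal transitions as data and refuse any carry-forward the table does not name.

```typescript
// Hypothetical sketch: the finding lifecycle as an explicit transition table.
// State names mirror the diagram above; this is illustration, not plugin code.
type FindingState =
  | "Raised" | "Deduplicated" | "Classified"
  | "SafeAuto" | "GatedAuto" | "Manual" | "FYI"
  | "Applied" | "Verified" | "FixDidNotLand"
  | "WalkThrough" | "UserChoice" | "BulkApply" | "BulkDefer"
  | "Deferred" | "AppendedToOpenQuestions"
  | "Skipped" | "Reported" | "Dropped";

// Every legal transition, taken from the diagram. Anything else is a bug.
const LEGAL: Record<FindingState, FindingState[]> = {
  Raised: ["Dropped", "Deduplicated", "Classified"],
  Deduplicated: ["Classified"],
  Classified: ["SafeAuto", "GatedAuto", "Manual", "FYI"],
  SafeAuto: ["Applied"],
  GatedAuto: ["WalkThrough", "BulkApply", "BulkDefer"],
  Manual: ["WalkThrough", "BulkApply", "BulkDefer"],
  FYI: ["Reported"],
  WalkThrough: ["UserChoice"],
  UserChoice: ["Applied", "Deferred", "Skipped"],
  BulkApply: ["Applied"],
  BulkDefer: ["Deferred"],
  Applied: ["Verified", "FixDidNotLand"],
  Deferred: ["AppendedToOpenQuestions", "Skipped"],
  Verified: [], FixDidNotLand: [], AppendedToOpenQuestions: [],
  Skipped: [], Reported: [], Dropped: [],
};

// Synthesis would re-check state at every boundary instead of carrying it
// forward implicitly in prose.
function transition(from: FindingState, to: FindingState): FindingState {
  if (!LEGAL[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

The point is not the code itself but the shape: each named step in the synthesis prose corresponds to one row of the table, so a missing transition is detectable rather than silently absorbed.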
+
+Transitions to verify explicitly in synthesis (not carried forward as prose):
+
+- Classified → one of four buckets (tier routing, step 3.7 rewrite)
+- Rejected-in-prior-round → Dropped (R29 suppression, new synthesis step)
+- Applied → Verified or FixDidNotLand (R30, new synthesis step)
+- SafeAuto / GatedAuto → Applied (separate paths; SafeAuto is silent, GatedAuto goes through walk-through or bulk)
+
+### Three interaction surfaces
+
+```mermaid
+flowchart TD
+    A[Auto fixes applied silently] --> B{Any gated_auto / manual
findings remain?} + B --> |No| Z[Zero-findings degenerate
→ Terminal question
B option omitted] + B --> |Yes| C[Four-option routing question] + + C --> |A Review| W[Per-finding walk-through] + C --> |B LFG| P1[Bulk preview] + C --> |C Append to Open Questions| P2[Bulk preview] + C --> |D Report only| E[Terminal question
without applying] + + W --> |Apply/Defer/Skip| W + W --> |LFG the rest| P3[Bulk preview
scoped to remaining] + + P1 --> |Proceed| X[Apply set dispatched
Defer appends
Skip no-op] + P2 --> |Proceed| Y[All append to
Open Questions section] + P3 --> |Proceed| X + + X --> T[Terminal question
3 options] + Y --> T + E --> T + + T --> |Apply and proceed| NextStage[ce-plan or ce-work] + T --> |Apply and re-review| Round2[Next review round
with decision primer] + T --> |Exit| End[Done for now] + + Round2 --> A +``` + +The walk-through, bulk preview, and routing question are ports of the same-named `ce-code-review` references with ce-doc-review specific adaptations (Defer = in-doc append; no batch fixer subagent; terminal question routes to pipeline stages instead of PR/push). + +## Implementation Units + +- [ ] **Unit 1: Frontmatter enum expansion + migration** + +**Goal:** Add four knowledge-track values (`architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`) to the `problem_type` enum, update both duplicate schema files, migrate existing `best_practice` overflow entries, resolve the one `correctness-gap` schema violation, and update instruction-discovery surfaces so new values are discoverable. + +**Requirements:** R43 + +**Dependencies:** None (foundation) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` +- Modify: `plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md` +- Modify: `plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml` +- Modify: `plugins/compound-engineering/skills/ce-compound-refresh/references/yaml-schema.md` +- Modify: `plugins/compound-engineering/skills/ce-compound/SKILL.md` (author-steering language toward narrower values) +- Modify: `plugins/compound-engineering/skills/ce-compound-refresh/SKILL.md` (refresh steering language) +- Modify: `plugins/compound-engineering/AGENTS.md` (discoverability line that names `problem_type` values) +- Migrate: the ~8–11 existing `best_practice` entries under `docs/solutions/` that fit a narrower value (see repo-research report for the candidate list — some entries may stay `best_practice` if no narrower value applies; the final count is a small range, not a fixed number) +- Migrate: `docs/solutions/workflow/todo-status-lifecycle.md` (`correctness-gap` → valid enum value) + +**Approach:** +- Add four values to both schema.yaml 
files under the knowledge track +- Add four category mappings to both yaml-schema.md files (`architecture_pattern → docs/solutions/architecture-patterns/`, etc.) +- Keep `best_practice` valid but document it as the fallback, not the default +- Author-steering language in ce-compound SKILL body should name the new values with one-line descriptions so authors pick the narrower value when applicable +- Category directory creation on first use — don't pre-create empty dirs +- Migration pass: re-classify the ~11 existing `best_practice` entries per the research findings, and move `todo-status-lifecycle.md` off `correctness-gap` + +**Patterns to follow:** +- `schema.yaml` existing two-track structure (bug / knowledge) +- `yaml-schema.md` existing "Category Mapping" section format +- ce-compound existing author-steering prose in section naming problem types + +**Test scenarios:** +- Happy path: a fixture knowledge-track doc with `problem_type: architecture_pattern` parses and validates +- Happy path: a fixture knowledge-track doc with `problem_type: design_pattern` parses and validates +- Edge case: a doc with `problem_type: best_practice` still validates (backward compat) +- Edge case: a doc with an unknown value (e.g., `problem_type: xyz-invalid`) is flagged +- Integration: ce-compound steering guidance names the new values in its output when classifying an appropriate learning + +**Verification:** +- Both schema files contain all 4 new values and the category mappings +- Every `best_practice` entry under `docs/solutions/` that fits a narrower value has been reclassified (final count is the subset of ~8–11 candidates that genuinely fit a narrower tier; some may legitimately remain `best_practice`) +- `docs/solutions/workflow/todo-status-lifecycle.md` carries a valid enum value +- AGENTS.md references the new categories so future agents discover them + +--- + +- [ ] **Unit 2: learnings-researcher domain-agnostic rewrite** + +**Goal:** Rewrite the `learnings-researcher` 
agent to treat `docs/solutions/` as domain-agnostic institutional knowledge. Accept a structured work-context input, replace hardcoded category tables with dynamic probing, expand keyword extraction beyond bug-shape taxonomy, and make the `critical-patterns.md` read optional.
+
+**Requirements:** R36, R37, R38, R39, R40, R41, R42
+
+**Dependencies:** Unit 1 for taxonomy-aware output framing only. The dynamic category probe itself has no schema dependency (it reads `docs/solutions/` subdirectories at runtime), so Unit 2 can be drafted in parallel; only the final author-visible framing benefits from Unit 1's enum landing first.
+
+**Files:**
+- Modify: `plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md`
+- Test: No agent-specific test exists. Extend or add a fixture under `tests/fixtures/` if needed to validate the dispatch contract across platforms — defer to Unit 8 if non-trivial.
+
+**Approach:**
+- Rewrite the opening identity/framing: "domain-agnostic institutional knowledge researcher" (not bug-focused)
+- Replace the `feature/task description` input format with a structured work-context block (description + concepts + decisions + domains + optional domain hint)
+- Replace the hardcoded category-to-directory table with a dynamic probe: at invocation time, list subdirectories under `docs/solutions/` and use the discovered set
+- Expand the keyword-extraction taxonomy: the existing four dimensions plus Concepts, Decisions, Approaches, Domains
+- Make Step 3b (the critical-patterns.md read) conditional on file existence
+- Rewrite output framing from "gotchas to avoid during implementation" to "applicable past learnings" / "related decisions and their outcomes"
+- Update Integration Points to include `/ce-plan` and standalone use (ce-doc-review is explicitly not a caller per this plan's Key Technical Decisions — the rewrite's consumer is `/ce-plan`)
+
+**Execution note:** After rewriting, sample 3-5 real invocations on current codebase learnings to verify the domain-agnostic
rewrite produces relevant output for non-code queries (e.g., skill-design questions, workflow questions). Per the ce-pipeline learnings doc: "sample real artifacts before accepting research-agent architectural recommendations."
+
+**Patterns to follow:**
+- `ce-code-review` shared subagent template (`references/subagent-template.md`) for the new work-context block shape
+- Existing `learnings-researcher.md` grep-first filtering strategy (Step 3) — preserve it, it's already efficient
+- `docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md` classification to keep the agent in its pipeline-stage lane
+
+**Test scenarios:**
+- Happy path: invoke with a code-bug work-context → returns bug-relevant learnings matching existing behavior
+- Happy path: invoke with a skill-design work-context → returns skill-design learnings (previously would have lumped under `best_practice` with weaker matches)
+- Edge case: `docs/solutions/` is empty or absent → fast-exit returns "No relevant learnings" without errors
+- Edge case: `docs/solutions/patterns/critical-patterns.md` absent → agent proceeds without warning
+- Edge case: the work-context block has no domain hint → agent falls back to general keyword extraction across all discovered categories
+- Integration: converter tests (`tests/codex-writer.test.ts:329` and siblings) still pass — the agent's dispatch string is preserved, only the inner prompt changes
+
+**Verification:**
+- Running the agent on a skill-design question returns results pointing to `docs/solutions/skill-design/` entries, not miscategorized matches from `best-practices/`
+- The hardcoded category table is gone; the agent probes `docs/solutions/` at invocation time
+- Output framing does not say "gotchas to avoid during implementation" or use code-bug-biased language
+- Missing `critical-patterns.md` does not cause errors or warnings
+- Cross-platform converter tests still pass
+- **ce-plan-side validation per #14 review feedback:** run ce-plan's existing Phase 1.1
dispatch flow (on any in-repo plan target) against the rewritten agent and verify (a) the agent's output is consumable by ce-plan's current synthesis step without errors, (b) dispatch-string/contract across Codex, Gemini, and Claude converters is preserved, (c) output shape for a representative code-implementation query matches or improves on pre-rewrite relevance. Document the comparison briefly in the PR description so owners of ce-plan can audit the regression check. + +--- + +- [ ] **Unit 3: ce-doc-review subagent template upgrade: framing, classification rule, tier expansion** + +**Goal:** Upgrade the shared ce-doc-review subagent template with an observable-consequence-first framing guidance block, a strawman-aware classification rule, consolidated auto-promotion patterns, and the new three-tier `autofix_class` enum aligned with ce-code-review names. This is the single file change that propagates improved output across all 7 personas. + +**Requirements:** R1, R2, R3, R4, R5, R6, R7, R8 + +**Dependencies:** None (parallel to Units 1-2) + +**Files:** +- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` +- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` (rename + expand `autofix_class` enum) +- Test: (deferred to Unit 8 — contract test assertion against template structure) + +**Approach:** +- Rename and expand `autofix_class` enum in `findings-schema.json` from `[auto, present]` to `[safe_auto, gated_auto, manual]`. Matches ce-code-review's first three tiers exactly. Does not adopt ce-code-review's fourth `advisory` tier — low-confidence FYI findings render as a distinct FYI subsection of the `manual` bucket at the presentation layer (Unit 4 handles that). +- Add tier definitions in the subagent template with one-sentence descriptions and examples per R2–R5. 
Three tiers: `safe_auto` (apply silently, one clear correct fix); `gated_auto` (concrete fix, user confirms); `manual` (requires user judgment).
+- Add a strawman-aware classification rule per R6: "a 'do nothing / accept the defect' option is not a real alternative — it's the failure state the finding describes. Count only alternatives a competent implementer would genuinely weigh." Include a positive/negative example pair.
+- **Strawman safeguard per #11 review feedback:** any finding classified `safe_auto` via strawman-dismissal of alternatives must name the dismissed alternatives in `why_it_matters`. When genuine alternatives exist (even ones the reviewer judges weak), the finding defaults to `gated_auto` (one-click apply in the walk-through) rather than silent `safe_auto`. `safe_auto` stays reserved for truly single-option fixes (typo, wrong count, stale cross-reference, missing mechanical step).
+- **Persona exclusion of `## Deferred / Open Questions` section per #8 review feedback:** the template instructs personas to exclude any `## Deferred / Open Questions` section and its subheadings from the review scope — those entries are review output from prior rounds, not part of the document being reviewed. Prevents the round-2 feedback loop where personas flag the deferred section as a new finding or quote its text as evidence.
+- Consolidate auto-promotion patterns per R7 into an explicit rule set: factually incorrect behavior, missing standard security/reliability controls, codebase-pattern-resolved fixes, framework-native-API substitutions, mechanically-implied completeness additions
+- Add a framing-guidance block per R8: observable-consequence-first, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, positive/negative example pair (modeled on `ce-code-review`'s block at `subagent-template.md:51-73`)
+- Respect the ~150-line `@` inclusion threshold; if the template exceeds it, switch to a backtick path reference in the SKILL.md (unlikely given the current 52-line size + ~40-50 line addition)
+
+**Patterns to follow:**
+- `ce-code-review` subagent template (`subagent-template.md:51-73`) framing-guidance block structure
+- The existing subagent template section where the new classification rules will live
+
+**Test scenarios:**
+- Happy path: a persona agent receives the new template and produces findings with one of the three valid `autofix_class` values
+- Edge case: a finding with only strawman alternatives (e.g., "accept the defect") is classified as `safe_auto`, not `manual`, with the dismissed strawmen named in `why_it_matters` per the safeguard
+- Edge case: a finding that would previously have been `manual` because "there's more than one way to fix it" is now `gated_auto` when the fix is concrete and the non-primary options are strawmen
+- Edge case: an FYI-grade observation (subjective, no decision) gets classified as `manual` but routes to the FYI subsection at the presentation layer because its confidence falls below the per-severity gate yet above the FYI floor
+- Integration: all 7 personas produce output that validates against the expanded `findings-schema.json` — no schema violations
+
+**Verification:**
+- Template includes a framing-guidance block, classification rule, and consolidated auto-promotion patterns
+- `findings-schema.json` enum contains exactly the three values `safe_auto`, `gated_auto`, `manual`
+- Subagent template stays under ~150 lines and continues to be loaded via `@` inclusion
+
+---
+
+- [ ] **Unit 4: Synthesis pipeline: per-severity gates, tier routing, auto-promotion, state-machine discipline**
+
+**Goal:** Rewrite the synthesis pipeline to route the three tiers (plus the FYI presentation bucket) correctly, apply per-severity confidence gates, drop residual promotion in favor of a cross-persona agreement boost, and make tier/confidence/deferral state transitions explicit (per the git-workflow state-machine learning). This is the load-bearing synthesis change.
+
+**Requirements:** R9, R10, R11 + synthesis updates for R2–R5 tier routing
+
+**Dependencies:** Unit 3 (the new `autofix_class` enum must exist before synthesis routes to it)
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md`
+- Create: `tests/fixtures/ce-doc-review/seeded-plan.md` — test-fixture plan doc with seeded findings across tier shapes (see validation gate in Approach)
+
+**Approach:**
+- Step 3.2 (confidence gate): replace the flat 0.50 with a per-severity table (P0 ≥0.50, P1 ≥0.60, P2 ≥0.65, P3 ≥0.75). Observation-shaped `manual` findings (no decision to make) that fall below the gate but at or above an FYI floor (0.40) surface in an FYI subsection at the presentation layer rather than being dropped — keeps observational value without forcing decisions.
+- Step 3.4 (residual promotion): delete. Replaced by a cross-persona agreement boost (+0.10, capped at 1.0) applied before the per-severity gate (so agreement can lift a borderline finding past it), matching `ce-code-review` stage 5 step 4. Residual concerns surface in Coverage only.
+- Step 3.5 (contradictions): keep; adapt terminology for three-tier routing
+- Step 3.6 (pattern-resolved promotion): expand per R7's consolidated promotion patterns
+- Step 3.7 (route by autofix class): rewrite for three tiers. `safe_auto` → apply silently. `gated_auto` → walk-through with Apply as recommended. `manual` → walk-through with user-judgment framing, or the FYI subsection when low-confidence and observation-shaped.
+- **R30 fix-landed matching predicate per #10 review feedback:** when determining whether a round-2 persona's finding is a re-raise of a round-1 Applied finding at the same location, use the existing dedup fingerprint (`normalize(section) + normalize(title)`) augmented with an evidence-substring overlap check. Section renames count as "different location" — treat as new finding. Specify explicitly in the synthesis step so the implementer doesn't invent a predicate. +- **Validation gate per #3 + #7 review feedback:** before merging Unit 4, run the new synthesis pipeline against two artifacts and log the result in the PR description: + 1. **A seeded test-fixture plan doc** — create one under `tests/fixtures/ce-doc-review/seeded-plan.md` with known issues planted across each tier (target seed: ~3 safe_auto candidates, ~3 gated_auto candidates, ~5 manual candidates, ~5 FYI-tier candidates at confidence 0.40–0.65, ~3 drop-worthy P3s at confidence 0.55–0.74). This is the rigorous validation — existing plans in `docs/plans/` have already been through review and would make the pipeline look falsely clean. + 2. **The brainstorm doc** (`docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md`), which went through document-review via the OLD pipeline — re-running under the NEW pipeline is a valid before/after comparison. + + **Numeric pass criteria (soft, not absolute):** + - Seeded fixture: ≥2 of the 3 seeded safe_auto candidates get applied silently; ≥2 of the 3 seeded gated_auto show up in the walk-through bucket with `(recommended)` Apply; all 3 drop-worthy P3s at 0.55–0.74 get dropped by the per-severity gate; ≥3 of the 5 FYI candidates surface in the FYI subsection; zero false auto-applies on manual-shaped seeds. 
+ - Brainstorm re-run: no P0/P1 findings that the old pipeline applied are regressed (i.e., the new pipeline doesn't drop what the old one kept as important); the total user-facing decision count (gated_auto + manual after the gate) should be meaningfully lower than the old pipeline produced.
+
+ If a seed classification fires outside its intended tier, investigate before merging — it may indicate a threshold or strawman-rule calibration issue.
+- Add explicit state-machine narration referencing the diagram in High-Level Technical Design. Every transition ("Raised → Classified," "Classified → SafeAuto," etc.) is a named step in synthesis prose, not an implied carry-forward.
+- **Headless envelope extension lands here per #12 sequencing fix:** this unit is the first to produce `gated_auto` findings in headless mode, so the envelope must support the new bucket headers before shipping. Extend the `synthesis-and-presentation.md:93-119` headless output to include `Gated-auto findings` and `FYI observations` sections alongside the existing `Applied N safe_auto fixes` count and `Manual findings` section. This preserves the existing bucket structure so callers that only read the old buckets continue to work (forward-compat; ce-brainstorm and ce-plan surface P0/P1 residuals adjacent to menus, unchanged). Unit 8 adds the contract test for this envelope later.
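The gate, boost, and FYI-floor routing above can be sketched in TypeScript. This is an illustrative sketch, not the skill's specification: the type names, `routeFinding`, and the boost-before-gate ordering (implied by the test scenario where agreement lifts a 0.55 P1 over the 0.60 gate) are assumptions.

```typescript
// Illustrative sketch of Unit 4's routing: per-severity gate, cross-persona
// agreement boost, FYI floor. Names and shapes are assumptions, not the skill's.
type AutofixClass = "safe_auto" | "gated_auto" | "manual";
type Severity = "P0" | "P1" | "P2" | "P3";
type Bucket = "apply_silently" | "walkthrough" | "fyi" | "dropped";

interface Finding {
  severity: Severity;
  confidence: number;            // 0.0-1.0 as reported by the persona
  autofixClass: AutofixClass;
  corroboratingPersonas: number; // how many personas raised this finding
}

const GATE: Record<Severity, number> = { P0: 0.5, P1: 0.6, P2: 0.65, P3: 0.75 };
const FYI_FLOOR = 0.4;
const AGREEMENT_BOOST = 0.1;

function routeFinding(f: Finding): Bucket {
  // Cross-persona agreement boost (+0.10, capped at 1.0), applied before the
  // gate so corroborated findings can clear their severity threshold.
  const confidence =
    f.corroboratingPersonas > 1
      ? Math.min(1.0, f.confidence + AGREEMENT_BOOST)
      : f.confidence;

  if (confidence < GATE[f.severity]) {
    // Below the gate: manual findings above the FYI floor surface as FYI
    // observations at the presentation layer; everything else is dropped.
    return f.autofixClass === "manual" && confidence >= FYI_FLOOR ? "fyi" : "dropped";
  }
  if (f.autofixClass === "safe_auto") return "apply_silently";
  return "walkthrough"; // gated_auto: Apply (recommended); manual: judgment framing
}
```

Under this reading the test scenarios hold together: a corroborated P1 at 0.55 is boosted to 0.65 and reaches the walk-through, while a lone gated_auto P3 at 0.60 fails the 0.75 gate.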
+
+**Patterns to follow:**
+- `ce-code-review` stage 5 merge pipeline (`SKILL.md:456-484`) for confidence-gate, dedup, cross-reviewer agreement boost structure
+- Existing `synthesis-and-presentation.md` step numbering — preserve step IDs to avoid churning cross-references
+
+**Test scenarios:**
+- Happy path: a P3 finding at 0.60 confidence is dropped by the per-severity gate (under the current 0.50 flat gate it would survive)
+- Happy path: a P0 finding at 0.52 confidence survives the gate
+- Happy path: two personas flagging the same issue get a +0.10 boost, lifting one from 0.55 (below P1 gate) to 0.65 (above)
+- Edge case: a low-confidence `manual` finding at 0.45 (above the 0.40 FYI floor, below the severity gate) surfaces in the FYI subsection, not dropped
+- Edge case: a `gated_auto` finding whose listed alternatives are all strawmen gets auto-promoted to `safe_auto` per R7's consolidated patterns — but if any non-strawman alternative exists (even a weak one), it stays `gated_auto` per the strawman safeguard
+- Edge case: contradiction handling — two personas with opposing actions on the same finding route to `manual` with both perspectives
+- Integration: routing the calibration-example case from the origin document (14 findings → 4 manual + 3 gated_auto + 1 safe_auto + 1 FYI + 5 dropped) produces a reasonable bucket distribution
+- Integration: seeded-fixture test (`tests/fixtures/ce-doc-review/seeded-plan.md`) meets the numeric pass criteria in the Approach section — seeded safe_auto/gated_auto/manual/FYI candidates land in their intended tiers; drop-worthy P3s are dropped; no false-auto on manual-shaped seeds
+- Integration: brainstorm-doc re-run (`docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md`) shows meaningful decision-count reduction without regressing previously-applied P0/P1 fixes
+
+**Verification:**
+- Confidence gate is per-severity, documented in step 3.2 of synthesis
+- Residual-promotion step is removed; cross-persona agreement boost
is its replacement
+- Each state transition in the finding lifecycle has a named synthesis step
+- Routing the origin document's real-world example reproduces the expected bucket split (14 findings → 7 user-facing decisions: 4 manual + 3 gated_auto)
+
+---
+
+- [ ] **Unit 5: Interaction model: routing question + per-finding walk-through + bulk preview**
+
+**Goal:** Port the per-finding walk-through, bulk preview, and four-option routing question from `ce-code-review`. Adapt them for ce-doc-review (no batch fixer, no tracker integration). This is the biggest behavioral change and where most of the user-visible UX improvement lives.
+
+**Requirements:** R12, R13, R14, R15, R16, R26
+
+**Dependencies:** Unit 4 (routing uses the new tiers and the confidence-gated finding set)
+
+**Files:**
+- Create: `plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md`
+- Create: `plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md`
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` (add Interactive mode rules section at top, AskUserQuestion pre-load directive)
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` (Phase 4 presentation hands off to walkthrough.md; add routing question stage)
+
+**Approach:**
+- Add an "Interactive mode rules" section at the top of `SKILL.md` modeled on `ce-code-review/SKILL.md:73-77`. Include the `AskUserQuestion` pre-load directive and the numbered-list fallback rule.
+- Create `walkthrough.md` by adapting `ce-code-review/references/walkthrough.md`. **Tier alignment:** ce-doc-review uses the first three ce-code-review tier names verbatim — `safe_auto`, `gated_auto`, `manual` — so no rename in the port. ce-code-review's fourth `advisory` tier has no ce-doc-review equivalent in the walk-through; advisory-style findings render in a presentation-layer FYI subsection (Unit 4's concern), not as a walk-through option.
**Keep from the source:** terminal-block + question split, four-option menu shape (Apply / Defer / Skip / LFG-the-rest), `(recommended)` marker, LFG-the-rest escape, N=1 adaptation, unified completion report, post-tie-break recommendation rendering. **Remove from the source:** fixer-subagent-batch-dispatch (ce-doc-review has no batch fixer per scope boundary), `[TRACKER]` label substitution logic, tracker-detection tuple (`named_sink_available`, `any_sink_available`, confidence-based label substitution), render-time Defer→Skip remap on `any_sink_available: false`, `.context/compound-engineering/ce-code-review/{run_id}/{reviewer_name}.json` artifact-lookup paths (ce-doc-review's agents don't write run artifacts), advisory-variant `Acknowledge` option (no advisory tier here). **Replace:** "file a tracker ticket" → "append to Open Questions section" (Unit 6 implements the append mechanic). +- Create `bulk-preview.md` by adapting `ce-code-review/references/bulk-preview.md`: keep the grouped buckets, Proceed/Cancel options, scope-summary header. Adapt bucket labels (`Applying (N):`, `Appending to Open Questions (N):`, `Skipping (N):`). Drop the `Acknowledging (N):` bucket — no advisory tier means no Acknowledge action. Remove tracker-label substitution. +- Update `synthesis-and-presentation.md` Phase 4: after auto-fixes are applied, route to the new routing question (if any `gated_auto` / `manual` findings remain). Load `walkthrough.md` for option A, `bulk-preview.md` for options B and C. Option D = report only. Use R16 tie-break order (`Skip > Defer > Apply > Acknowledge`) for per-finding recommendations. + +**Execution note:** Port the Interactive Question Tool Design rules verbatim from AGENTS.md — third-person voice, front-loaded distinguishing words, ≤4 options. Verify each menu's labels at the rendering surface during implementation; harness label truncation is a known failure mode (ce-pipeline learnings doc §5). 
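The R16 tie-break is small enough to pin down as code. A minimal sketch, assuming a `recommendedAction` helper name that appears nowhere in the skill text; `Acknowledge` is kept in the order for parity with ce-code-review even though ce-doc-review's walk-through never offers it:

```typescript
// R16 tie-break for per-finding recommendations: when personas recommend
// conflicting actions for the same finding, the more conservative action wins.
type Action = "Skip" | "Defer" | "Apply" | "Acknowledge";

// Precedence order per R16: Skip > Defer > Apply > Acknowledge.
const TIE_BREAK_ORDER: Action[] = ["Skip", "Defer", "Apply", "Acknowledge"];

function recommendedAction(personaActions: Action[]): Action {
  // Return the highest-precedence action any persona recommended.
  for (const action of TIE_BREAK_ORDER) {
    if (personaActions.includes(action)) return action;
  }
  return "Apply"; // fallback when no persona expressed a preference
}
```

This is the behavior the Unit 5 integration scenario exercises: conflicting Apply / Skip recommendations resolve to Skip, and the walk-through marks that post-tie-break value with `(recommended)`.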
+
+**Patterns to follow:**
+- `ce-code-review/references/walkthrough.md` — structural template
+- `ce-code-review/references/bulk-preview.md` — structural template
+- `ce-code-review/SKILL.md:73-77` — Interactive mode rules section
+- `plugins/compound-engineering/AGENTS.md` → "Interactive Question Tool Design" section — menu design rules
+- The state machine in High-Level Technical Design above
+
+**Test scenarios:**
+- Happy path: 3 `gated_auto` findings + 1 `manual` finding → routing question offers all 4 options; picking A enters the walk-through; each finding is presented one at a time with the recommended action marked
+- Happy path: N=1 (exactly one pending finding) → walk-through wording drops "Finding N of M," LFG-the-rest option suppressed
+- Happy path: user picks LFG-the-rest at finding 2 of 8 → bulk preview scoped to findings 3–8, header notes "2 already decided"
+- Edge case: all findings are low-confidence `manual` (FYI subsection only) → routing question skipped (no gated_auto / manual-above-gate findings remain), flows to the terminal question with no walk-through; FYI content still renders in the report
+- Edge case: bulk-preview Cancel from LFG-the-rest returns to the current finding, not to the routing question
+- Edge case: routing Cancel from option B / C returns to the routing question with no side effects
+- Integration: recommendation tie-break (R16) — two personas flag the same finding with conflicting actions (Apply / Skip); the walk-through marks the post-tie-break value (Skip) with `(recommended)`; the R15 conflict-context line surfaces the disagreement in the question stem
+
+**Verification:**
+- `walkthrough.md` and `bulk-preview.md` exist with adapted content
+- SKILL.md has an Interactive mode rules section with AskUserQuestion pre-load
+- Synthesis Phase 4 routes to the walkthrough/bulk-preview references after auto-fixes
+- Menus pass the Interactive Question Tool Design review (third-person, ≤4 options, self-contained labels)
+
+---
+
+- [ ] **Unit 6: In-doc
Open Questions deferral + append mechanic** + +**Goal:** Implement the Defer action's in-doc append mechanic. When a user chooses Defer on a finding, append an entry to a `## Deferred / Open Questions` section at the end of the document under review. + +**Requirements:** R20, R21, R22 + +**Dependencies:** Unit 5 (walk-through's Defer option is where this fires) + +**Files:** +- Create: `plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md` +- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md` (reference the Defer mechanic from Unit 5's walkthrough.md) + +**Approach:** +- Create `open-questions-defer.md` describing: + - Detection: does the doc already have a `## Deferred / Open Questions` section at the end? + - Heading creation if absent + - Subsection structure: `### From YYYY-MM-DD review` (timestamped to the review invocation — creates per-review grouping even when run multiple times on the same doc) + - Entry format per R21: title, severity, reviewer attribution, confidence, `why_it_matters` framing. Excludes `suggested_fix` and `evidence` (those live in the run artifact if one exists; pointer to run artifact included when relevant) + - Append location: end of doc, after existing content. If the doc has a trailing horizontal rule or separator, add above it to avoid visual drift. +- Failure handling per R22: document is read-only / path issue / write failure → surface inline with Retry / Fallback-to-completion-report-only / Convert-to-Skip sub-question. No silent failure. +- Walkthrough.md references this file when the user picks Defer; the walkthrough itself doesn't reimplement the append logic. + +**Patterns to follow:** +- `ce-code-review/references/tracker-defer.md` — **only** for the failure-path sub-question structure (Retry / Fallback / Convert to Skip). Do not carry over tracker-detection, sink-availability, or label-substitution logic — none apply to in-doc append. 
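A minimal sketch of the append mechanic `open-questions-defer.md` will specify. The helper name, the frontmatter detection, and the heuristic that the current review's subsection is the last one in the section are assumptions; the reference doc owns the exact rules:

```typescript
// Sketch of the Defer append mechanic: reuse an existing section (even mid-doc),
// create it above a trailing separator otherwise, group entries under a
// timestamped subsection, and stay idempotent on duplicate entries.
const HEADING = "## Deferred / Open Questions";

function appendDeferred(doc: string, reviewDate: string, entry: string): string {
  const sub = `### From ${reviewDate} review`;
  const lines = doc.split("\n");

  const headingIdx = lines.findIndex((l) => l.trim() === HEADING);
  if (headingIdx === -1) {
    // Respect YAML frontmatter: never insert above its closing fence.
    let bodyStart = 0;
    if (lines[0]?.trim() === "---") {
      const close = lines.findIndex((l, i) => i > 0 && l.trim() === "---");
      if (close !== -1) bodyStart = close + 1;
    }
    // No section yet: create it at the end, above a trailing `---` separator.
    let end = lines.length;
    while (end > bodyStart && lines[end - 1].trim() === "") end--;
    if (end > bodyStart && lines[end - 1].trim() === "---") end--;
    lines.splice(end, 0, "", HEADING, "", sub, entry);
    return lines.join("\n");
  }

  // Section exists (possibly mid-doc): it ends at the next `## ` heading.
  let sectionEnd = lines.length;
  for (let i = headingIdx + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) { sectionEnd = i; break; }
  }

  const section = lines.slice(headingIdx, sectionEnd);
  if (section.includes(entry)) return doc; // idempotent: duplicate entry is a no-op

  if (section.some((l) => l.trim() === sub)) {
    // Same-day round: append under the existing timestamped subsection
    // (assumed to be the last subsection in the section).
    lines.splice(sectionEnd, 0, entry);
  } else {
    // New review round: open a fresh timestamped subsection.
    lines.splice(sectionEnd, 0, "", sub, entry);
  }
  return lines.join("\n");
}
```

The sketch deliberately mirrors the shadow paths in the test scenarios: mid-doc headings are reused in place, frontmatter-only docs get the heading after the closing fence, and a trailing separator stays last.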
+
+**Test scenarios:**
+- Happy path: doc has no Open Questions section → append creates the `## Deferred / Open Questions` heading and a `### From YYYY-MM-DD review` subsection with the deferred entry
+- Happy path: doc already has the Open Questions section at the end → append adds under a new `### From YYYY-MM-DD review` subsection (keeps prior review entries distinguishable)
+- Happy path: two Defer actions in the same review session → both entries land under the same `### From YYYY-MM-DD review` subsection
+- **Shadow path (mid-doc heading) per #13 review feedback:** doc has a `## Deferred / Open Questions` heading somewhere in the middle (not the end) → append finds it and lands under it at its existing location, and does not create a duplicate section at the end
+- **Shadow path (same-title collision) per #13:** round 2 within the same day defers a finding whose title matches an existing round-1 entry under the same `### From YYYY-MM-DD review` subsection → append is idempotent on title (skips the duplicate entry) and records the no-op in the completion report
+- **Shadow path (frontmatter-only doc):** doc has frontmatter and no body content → append creates the heading after the frontmatter block, not at byte 0
+- **Shadow path (concurrent editor writes):** re-read the doc from disk immediately before the append to narrow the window for user-in-editor concurrent-write collisions; record the doc's mtime before and after the append, and if it changed during the write, abort and surface a Retry option
+- Edge case: doc is read-only → append fails, user is offered Retry / Fall-back-to-report-only / Convert-to-Skip; Convert-to-Skip records the Skip reason in the completion report
+- Edge case: doc has a trailing `---` or other separator → append lands above it
+- Integration: deferred entries from a walk-through round 1 are visible in the doc when round 2 runs; the decision primer (Unit 7) correctly identifies them as prior-round decisions; personas exclude the section from review scope per the subagent
template instruction (Unit 3)
+
+**Verification:**
+- `open-questions-defer.md` exists and describes the append mechanic + failure handling
+- Walk-through's Defer option invokes the mechanic correctly
+- Deferred findings accumulate under timestamped subsections across reviews
+- No silent failures — every failure surfaces inline with user-actionable options
+
+---
+
+- [ ] **Unit 7: Terminal question restructure + multi-round decision memory**
+
+**Goal:** Replace the current Phase 5 binary question (`Refine — re-review` / `Review complete`) with a three-option terminal question that separates "apply decisions" from "re-review," and introduce the multi-round decision primer that carries prior-round decisions into subsequent rounds.
+
+**Requirements:** R17, R18, R19, R28, R29, R30
+
+**Dependencies:** Unit 5 (walkthrough captures the decisions the primer carries forward), Unit 6 (Defer decisions contribute to the primer)
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` (Phase 2 dispatch passes cumulative primer)
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` (Phase 5 terminal question + R29 suppression rule + R30 fix-landed verification)
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` (persona instructions to honor the primer — suppress re-raising rejected findings, respect fix-landed verification context)
+
+**Approach:**
+- Replace the Phase 5 terminal question with three options per R17: `Apply decisions and proceed to <next-stage>` (default / recommended when fixes were applied), `Apply decisions and re-review`, `Exit without further action`. The `<next-stage>` text uses the document type: `ce-plan` for requirements docs, `ce-work` for plan docs.
+- R19 zero-actionable-findings degenerate case: skip option B (re-review), offer only "Proceed to <next-stage>" / "Exit."
**Label adapts:** when there are no decisions to apply (the zero-actionable case, or a routing path where every finding was Acknowledge/Skip), drop the "Apply decisions and" prefix — the label should match what the system is doing. Only when at least one Apply was queued does the label remain "Apply decisions and proceed to <next-stage>".
+- R18 rendering rule: the terminal question is distinct from the mid-flow routing question. Don't merge them.
+- Multi-round decision primer per R28–R30:
+  - The orchestrator maintains an in-memory decision list across rounds within a single session (rejected findings with title/evidence/reason; applied findings with title/section)
+  - Passed to every persona in round 2+ as part of the subagent template variable bindings
+  - **Primer structure per #9 review feedback:** the primer is a single text block injected into the `{decision_primer}` slot at the top of the subagent's context block. Shape:
+    ```
+    Round 1 — applied (N entries):
+    - {section}: "{title}" ({reviewer}, {confidence})
+    Round 1 — rejected (N entries):
+    - {section}: "{title}" — {Skipped|Deferred} because {reason or "no reason provided"}
+    ```
+    Round 1 (no primer) renders as an empty block or omits the block entirely — an implementation-detail choice driven by which reads better to personas during calibration. The subagent template gets a matching `{decision_primer}` slot.
+  - Persona-level suppression rule per R29: don't re-raise a finding whose title and evidence pattern-match a rejected finding unless current doc state makes the concern materially different
+  - Synthesis-level fix-landed verification per R30: for each applied finding, confirm the specific issue no longer appears at the referenced section. If a persona re-surfaces the same finding at the same location (same section fingerprint + evidence-substring overlap per Unit 4's matching predicate), flag it as "fix did not land" in the report rather than treating it as new.
+- **Caller-context handling per #6 review feedback:** interactive-mode nested invocations (ce-brainstorm → ce-doc-review, ce-plan → ce-doc-review) rely on the model reading conversation context to interpret the terminal question correctly, rather than on an explicit `nested:true` argument. Rationale: the caller's conversation is visible to the sub-skill's orchestrator, so when the user picks "Proceed to <next-stage>" from inside ce-plan's 5.3.8 flow, the agent does not fire a nested `/ce-plan` dispatch — it returns control to the caller's flow, which continues its own logic. When invoked standalone, "Proceed to <next-stage>" dispatches the appropriate next skill. `mode:headless` stays explicit because headless is deterministic programmatic behavior, but interactive-mode caller context is handled by model orchestration. **No caller-side change required for ce-brainstorm or ce-plan.** If this implicit handling proves unreliable in practice, add an explicit `nested:true` flag as a follow-up.
+- Cross-session persistence is out of scope per the scope boundary.
+
+**Execution note:** Model the decision-primer flow as part of the state machine diagram. Every round-2-persona-dispatch transition explicitly reads from the primer — this is not a prose-level "personas should remember" assumption.
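The R29 suppression check and the R30 fix-landed verification both hinge on Unit 4's matching predicate. A hedged sketch, assuming illustrative `normalize` details and a 20-character overlap window that the synthesis step would pin down explicitly:

```typescript
// Sketch of the R30 re-raise predicate: dedup fingerprint (normalized section +
// normalized title) plus an evidence-substring overlap check. normalize() and
// the 20-char window are illustrative assumptions, not the skill's spec.
interface ReviewFinding {
  section: string;
  title: string;
  evidence: string;
}

function normalize(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Section renames change the fingerprint, so a re-raise after a rename
// deliberately counts as a new finding.
function fingerprint(f: ReviewFinding): string {
  return `${normalize(f.section)}::${normalize(f.title)}`;
}

// True when the two evidence strings share a meaningful substring.
function evidenceOverlaps(a: string, b: string, window = 20): boolean {
  const [x, y] = [normalize(a), normalize(b)];
  const [shorter, longer] = x.length <= y.length ? [x, y] : [y, x];
  if (shorter.length <= window) return longer.includes(shorter);
  for (let i = 0; i + window <= shorter.length; i++) {
    if (longer.includes(shorter.slice(i, i + window))) return true;
  }
  return false;
}

// A round-2 finding re-raises a round-1 decision only when both checks pass;
// synthesis uses this for R29 suppression and the R30 "fix did not land" flag.
function isReRaise(current: ReviewFinding, prior: ReviewFinding): boolean {
  return (
    fingerprint(current) === fingerprint(prior) &&
    evidenceOverlaps(current.evidence, prior.evidence)
  );
}
```

The predicate is deliberately conjunctive: a match requires both the location fingerprint and the evidence to agree, which is what makes "section renames count as different location" fall out for free.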
+
+**Patterns to follow:**
+- `ce-code-review/SKILL.md` Step 5 final-next-steps for the mode-driven terminal question structure (but adapt PR/push verbs to pipeline-stage verbs)
+- State machine diagram in High-Level Technical Design — every prior-round-decision transition is named
+
+**Test scenarios:**
+- Happy path: round 1 user applies 2 findings and skips 1; round 2 persona re-raises the skipped finding → synthesis drops it per R29 with a note in Coverage
+- Happy path: round 1 user applies a finding; round 2 persona does NOT re-raise it (the finding self-suppressed because the evidence is gone) → synthesis reports "fix verified"
+- Happy path: round 1 user applies a finding; round 2 persona re-raises it at the same location (fix didn't actually land) → synthesis flags "fix did not land" in the final report
+- Happy path: terminal question after round 1 with fixes applied → three options; user picks "Apply decisions and proceed" → hand off to ce-plan or ce-work
+- Edge case: zero actionable findings after auto-fixes → terminal question has 2 options (re-review suppressed)
+- Edge case: user deferred a finding in round 1 (R22); round 2 persona re-raises the same concern → suppressed per R29 (defer counts as rejection for suppression purposes)
+- Edge case: re-review triggered → the round 2 decision primer includes all of round 1's decisions; flow re-enters Phase 2 dispatch with the primer passed to personas
+- Integration: multi-round primer state is in-memory; exiting the session discards it; starting a new session on the same doc is a fresh round 1
+
+**Verification:**
+- Terminal question has three options (or two in the zero-actionable case)
+- Round-2 dispatch passes the cumulative primer to every persona
+- R29 suppression drops re-raised rejected findings with a Coverage note
+- R30 fix-landed verification flags fixes that didn't land
+- Cross-session persistence is not implemented (per the scope boundary)
+
+---
+
+- [ ] **Unit 8: Framing polish + contract test extension**
+
+**Goal:**
Apply framing quality rules (R23–R25) uniformly across all user-facing surfaces that weren't already updated by Units 3–7, and extend `pipeline-review-contract.test.ts` to lock in the new-tier envelope shape. (The headless-envelope extension itself moves earlier per the #12 sequencing fix — see below.)
+
+**Requirements:** R23, R24, R25, R27 *(R27's envelope shape landed in Unit 4 per the sequencing fix; this unit adds the contract test for it)*
+
+**Dependencies:** Units 3, 4, 5, 6, 7
+
+**Files:**
+- Modify: `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` (R23–R25 framing rules applied across output surfaces)
+- Modify: `tests/pipeline-review-contract.test.ts` (extend to assert new tiers appear distinctly in the headless envelope)
+- Consider: `tests/ce-doc-review-contract.test.ts` (new) if the assertions don't fit cleanly in pipeline-review-contract — decide during implementation
+
+**Approach:**
+- **R23–R25 framing rules:** applied at every user-facing surface — walk-through terminal blocks, bulk-preview lines, Open Questions entries, headless envelope. Observable-consequence-first, why-the-fix-works grounding, 2–4 sentence budget. Because the framing-guidance block in the subagent template (Unit 3) already shapes persona output at the source, this pass is about ensuring the presentation surfaces carry the framing forward without dilution (e.g., the walk-through's terminal block shouldn't re-wrap the persona's `why_it_matters` in code-structure-first prose).
+- **Test extension:** `pipeline-review-contract.test.ts:279-352` currently asserts `mode:headless` invocation from ce-brainstorm and ce-plan. Extend it to assert the new tiers appear distinctly in the headless output without breaking existing pattern matches. Structural assertions only — do not lock exact prose, so future wording improvements don't ossify the test. Per Unit 7's caller-context decision there is no `nested:true` flag to assert; caller-context behavior is exercised by the end-to-end caller runs instead.
+- No "Past Solutions" section in output — learnings-researcher is not invoked from ce-doc-review (see Key Technical Decisions). +- **Sequencing per #12 review feedback:** the actual headless envelope extension (new tier bucket headers) lands in Unit 4's PR, not this unit. Rationale: Unit 4 is the first unit that produces non-`safe_auto` / non-`manual` findings in headless mode. If Unit 4 ships before the envelope is updated, callers (ce-plan in `mode:headless`) would see `gated_auto` findings demoted into legacy buckets or emitted in a shape callers can't parse. Landing the envelope change with Unit 4 keeps each phase independently consumable. + +**Patterns to follow:** +- `ce-code-review` headless envelope (`SKILL.md:510-572`) structure — grouped by `autofix_class`, metadata header, per-finding detail lines +- Existing ce-doc-review headless output in `synthesis-and-presentation.md:93-119` + +**Test scenarios:** +- Happy path: headless mode run with findings across all 3 tiers → envelope contains distinct `Applied N safe_auto fixes` count + `Gated-auto findings` + `Manual findings` sections (+ `FYI observations` subsection when present) in that order +- Happy path: headless mode with only safe_auto fixes applied → envelope shows the count and omits the finding-type sections +- Happy path: headless mode with zero findings at all → envelope collapses to "Review complete (headless mode). No findings." 
+- Edge case: headless mode with only FYI-subsection content → envelope shows the subsection only, no decision-requiring buckets
+- Integration: ce-plan phase 5.3.8 headless invocation continues to work with the new envelope; new tier sections are visible to the caller for residual-P0/P1 surfacing decisions (`plan-handoff.md:13`)
+- Integration: caller context is respected — when ce-doc-review is invoked from inside ce-brainstorm or ce-plan, picking "Proceed to <next-stage>" returns control to the caller's flow rather than dispatching a new skill; verify via the end-to-end caller runs
+- Integration: framing of a single finding is consistent across the walk-through terminal block, bulk-preview line, Open Questions append entry, and headless envelope — verify by reviewing a test fixture doc's output at all four surfaces
+
+**Verification:**
+- All user-facing surfaces meet the R23–R25 framing bar
+- Pipeline contract test extended and passing (covers the new-tier envelope; caller-context behavior is implicit per Unit 7, so no `nested:true` assertions)
+- No learnings-researcher dispatch code in ce-doc-review (verified by grep)
+
+## System-Wide Impact
+
+- **Interaction graph:** `ce-brainstorm` Phase 3.5 + Phase 4 handoff re-review paths, `ce-plan` Phase 5.3.8 + 5.4 post-generation menu, LFG/SLFG pipeline invocations, direct user invocation. All consume the `"Review complete"` terminal signal — unchanged by this work. **No caller-side diff required:** the terminal question's "Proceed to <next-stage>" hand-off is interpreted contextually by the agent from the visible conversation state — when invoked from inside another skill's flow, it returns control to the caller; when standalone, it dispatches the next stage. If implicit handling proves unreliable, add an explicit `nested:true` token as a follow-up.
+- **Error propagation:** Append failures in Defer (Unit 6) must surface inline with retry/fallback/skip options. Headless mode failures (e.g., a persona times out) must return partial results with a Coverage note, never block the whole review.
+- **State lifecycle risks:** Multi-round decision primer (Unit 7) is in-memory only.
User exits mid-session → primer discarded → next session is a fresh round 1. In-doc Open Questions mutations (Unit 6) persist on disk — re-running ce-doc-review on a modified doc sees those mutations as part of doc state.
+- **API surface parity:** Headless envelope (R27) is the machine-readable contract. Adding new tiers changes envelope content but not the terminal signal or the `mode:headless` invocation shape. Backward-compatible for existing callers; forward compatibility requires callers to handle the new tier sections (ce-brainstorm and ce-plan both currently surface P0/P1 residuals adjacent to menus — no change needed for that behavior).
+- **Integration coverage:** Cross-layer behaviors that mocked tests won't prove need end-to-end coverage: run a realistic plan doc through ce-plan's 5.3.8 headless invocation flow to catch tier-envelope compatibility issues.
+- **Unchanged invariants:**
+  - Persona activation/selection logic (the 7 persona files' conditional triggers)
+  - `"Review complete"` terminal signal for callers
+  - Headless mode's structural contract (mutate-then-return with structured text; callers own routing)
+  - Cross-platform converter behavior (OpenCode 3-segment name rewrite, dispatch-string preservation)
+  - `ce-code-review` itself — this plan touches ce-doc-review only, not ce-code-review
+
+## Alternative Approaches Considered
+
+- **Ship as `ce-doc-review-beta` parallel skill.** The learnings-researcher recommended this path given that ce-doc-review is chained into brainstorm→plan flows. **Rejected** because the overhaul is phaseable; each phase's blast radius is bounded (Units 1-2 don't touch ce-doc-review's contract at all; Units 3-7 preserve the headless envelope per R27); and beta forking would double the surface area (two skill directories, mirrored references, a promotion PR needed). A phased single-track ship carries less risk-per-week and delivers user value earlier.
If a phase later proves riskier than anticipated, fork to beta at that point rather than upfront. +- **Minimal `review-time` mode flag on learnings-researcher instead of domain-agnostic rewrite.** A smaller patch: add a `review-time` invocation context hint that adapts keyword extraction and output framing. **Rejected** because it accumulates special cases rather than fixing the root mismatch. `ce-compound` and `ce-compound-refresh` already capture non-code learnings; the agent's taxonomy should reflect that. A full rewrite removes the drift; a mode flag ossifies it. +- **Dispatch learnings-researcher from ce-doc-review (original R31–R35).** Considered as always-on dispatch, then as conditional dispatch (skip when ce-plan is the caller). **Both rejected.** The agent is ce-plan-owned (implementation-context per `research-agent-pipeline-separation-2026-04-05.md`); running it from ce-doc-review is a pipeline violation in the ce-brainstorm and standalone contexts and a redundant dispatch in the ce-plan context. Conditional-dispatch added "is the caller ce-plan?" detection logic that was fragile and solved a problem better avoided. Users who want institutional memory for a doc can invoke `/ce-plan`, where the lookup is a first-class pipeline stage. Keeping the dispatch out of ce-doc-review entirely preserves clean pipeline-stage ownership and removes complexity. +- **Add `learning_category` field orthogonal to `problem_type`.** A cleaner long-term schema split, but requires migrating every existing entry and teaching authors to pick both. **Rejected** in favor of enum expansion — minimal migration, keeps authoring flow stable, absorbs the `best_practice` overflow directly. +- **Pass a diff in multi-round decision primer.** Would give personas before/after comparison for each round. **Rejected** — fixed findings self-suppress (evidence gone), regressions surface as normal current-state findings, rejected findings are handled by pattern-match suppression. 
The diff adds prompt weight without changing what the agent can detect. + +## Risks & Dependencies + +| Risk | Likelihood | Impact | Mitigation | +|------|-----------|--------|------------| +| Caller flows break because the headless envelope changes shape | Low | High | R27 preserves existing envelope structure; extend `pipeline-review-contract.test.ts` in Unit 8 to assert new tiers appear distinctly without breaking existing match patterns; run ce-brainstorm and ce-plan end-to-end against the updated skill before merge | +| Strawman-aware classification rule (R6) is too aggressive, auto-applying fixes users want to review | Medium | Medium | Framing-guidance block includes a conservative positive/negative example pair; tiers preserve user control via `gated_auto` walk-through for anything with a concrete fix that changes doc meaning; calibration against the origin document's real-world example is a required validation step | +| Per-severity confidence gates drop genuinely valuable P3 findings | Low | Low | P3 gate at 0.75 is conservative; the FYI floor (0.40) on low-confidence `manual` findings keeps genuinely-noteworthy observations surfacing below the gate; if real-world calibration shows drops, the threshold is a single number change | +| Multi-round primer re-raises the same findings because personas don't reliably suppress | Medium | Medium | Synthesis-level enforcement (R29) is authoritative — orchestrator drops re-raised rejected findings regardless of whether the persona suppressed. Persona-level suppression is the hint; orchestrator is the gate. 
|
+| Walk-through UX friction at high finding counts despite `LFG the rest` escape | Low | Medium | Walk-through's LFG-the-rest option bounds friction after initial calibration; bulk-preview Proceed gives an atomic commit point; N=1 adaptation handles the degenerate case cleanly |
+| Duplicate schema files in ce-compound / ce-compound-refresh drift | Low | High | Unit 1 explicitly updates both in the same commit; future divergence detection is a follow-up test opportunity (deferred item) |
+| learnings-researcher rewrite regresses ce-plan's existing usage | Medium | High | Unit 2 execution note requires sampling 3–5 real invocations before merge; cross-platform converter tests assert dispatch-string preservation; the new structured-input block is additive, so callers with old calling conventions continue to work because the agent probes for structured input and falls back to free-form description when absent |
+| Dynamic category probe hits a weird repo with unexpected directory structure | Low | Low | Probe falls through to "no categories detected, do broad search across docs/solutions/" — this is already the agent's current behavior when the hardcoded table misses |
+
+## Documentation / Operational Notes
+
+- No additional runtime infrastructure — this is a plugin skill change with no user data and no external APIs.
+- After Unit 1 lands, existing authors using `ce-compound` will see the new enum options in the steering language; authors writing new solution docs can pick the narrower value immediately.
+- After Unit 2 lands, `/ce-plan` users will see the agent's output reflect the broader taxonomy (non-code learnings surfacing more appropriately).
+- After Units 5–7 land, interactive ce-doc-review users will see the new routing question, walk-through, and terminal question on their next review. The flow mirrors the `ce-code-review` experience users already have — low learning curve.
+- The `plugins/compound-engineering/README.md` reference-file counts table will need an update once the new `references/` files land in Units 5–6. `bun run release:validate` catches drift. +- AGENTS.md discoverability updates (Unit 1) need to include the four new `problem_type` values so agents reading AGENTS.md know the narrower categories are available. + +## Phased Delivery + +Each unit can ship as its own PR. Recommended sequence: + +### Phase 1 — Foundation (Units 1, 2) +- Unit 1 (enum expansion + migration) +- Unit 2 (learnings-researcher rewrite) + +These are independently valuable and low-risk. They benefit `/ce-plan`'s existing usage even before ce-doc-review changes land. + +### Phase 2 — Classification + Synthesis (Units 3, 4) +- Unit 3 (subagent template upgrade + findings-schema tier expansion) +- Unit 4 (synthesis pipeline per-severity gates + tier routing) + +Depends on Unit 1's enum values being available (not Unit 2 — that's a parallel Phase 1 deliverable for ce-plan). Within Phase 2, Unit 3 must complete before Unit 4 because Unit 4's synthesis routing depends on Unit 3's tier definitions. Changes ce-doc-review's internal shape but preserves the headless envelope contract. + +### Phase 3 — Interaction Model (Units 5, 6, 7) +- Unit 5 (routing question + walk-through + bulk preview) +- Unit 6 (in-doc Open Questions deferral) +- Unit 7 (terminal question + multi-round memory) + +Biggest UX surface change. Callers unchanged due to preserved headless contract; interactive users see the port of the `ce-code-review` flow. + +### Phase 4 — Integration + Polish (Unit 8) +- Unit 8 (framing polish across all surfaces, pipeline-review-contract test extension) + +Final polish pass. The headless envelope extension itself lands earlier (in Unit 4's PR, per the #12 sequencing fix) so callers never observe an interstitial state where new tiers are produced but the envelope can't carry them. 
Unit 8 locks the envelope shape in via the contract test and finishes the framing-polish sweep. + +## Sources & References + +- **Origin document:** `docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md` +- **Pattern source (ce-code-review PR #590):** https://github.com/EveryInc/compound-engineering-plugin/pull/590 +- Related code: + - `plugins/compound-engineering/skills/ce-code-review/references/walkthrough.md` + - `plugins/compound-engineering/skills/ce-code-review/references/bulk-preview.md` + - `plugins/compound-engineering/skills/ce-code-review/references/subagent-template.md` + - `plugins/compound-engineering/skills/ce-code-review/SKILL.md` + - `plugins/compound-engineering/skills/ce-doc-review/SKILL.md` + - `plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md` + - `plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md` + - `plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json` + - `plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md` + - `plugins/compound-engineering/skills/ce-compound/references/schema.yaml` + - `plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md` + - `plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml` +- Related institutional learnings: + - `docs/solutions/best-practices/ce-pipeline-end-to-end-learnings-2026-04-17.md` + - `docs/solutions/skill-design/compound-refresh-skill-improvements.md` + - `docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md` + - `docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md` + - `docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md` + - `docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md` + - `docs/solutions/skill-design/beta-skills-framework.md` +- Related tests: + - 
`tests/pipeline-review-contract.test.ts:279-352` + - `tests/converter.test.ts:417-438` diff --git a/docs/solutions/adding-converter-target-providers.md b/docs/solutions/adding-converter-target-providers.md index 0423dfe57..e66f8320b 100644 --- a/docs/solutions/adding-converter-target-providers.md +++ b/docs/solutions/adding-converter-target-providers.md @@ -5,7 +5,7 @@ tags: [converter, target-provider, plugin-conversion, multi-platform, pattern] created: 2026-02-23 severity: medium component: converter-cli -problem_type: best_practice +problem_type: architecture_pattern root_cause: architectural_pattern --- diff --git a/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md b/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md index c317fe1b5..d2518cb3b 100644 --- a/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md +++ b/docs/solutions/best-practices/codex-delegation-best-practices-2026-04-01.md @@ -3,7 +3,7 @@ title: "Codex Delegation Best Practices" date: 2026-04-01 category: best-practices module: "Codex delegation / skill design" -problem_type: best_practice +problem_type: convention component: tooling severity: medium applies_when: diff --git a/docs/solutions/best-practices/conditional-visual-aids-in-generated-documents-2026-03-29.md b/docs/solutions/best-practices/conditional-visual-aids-in-generated-documents-2026-03-29.md index 10efd4348..e77cc7c1c 100644 --- a/docs/solutions/best-practices/conditional-visual-aids-in-generated-documents-2026-03-29.md +++ b/docs/solutions/best-practices/conditional-visual-aids-in-generated-documents-2026-03-29.md @@ -3,7 +3,7 @@ title: Conditional visual aids in generated documents and PR descriptions date: 2026-03-29 category: best-practices module: compound-engineering plugin skills -problem_type: best_practice +problem_type: design_pattern component: documentation symptoms: - "Generated documents and PR descriptions lack visual aids that would improve 
comprehension of complex workflows and relationships" diff --git a/docs/solutions/best-practices/prefer-python-over-bash-for-pipeline-scripts-2026-04-09.md b/docs/solutions/best-practices/prefer-python-over-bash-for-pipeline-scripts-2026-04-09.md index 33133827d..38e782801 100644 --- a/docs/solutions/best-practices/prefer-python-over-bash-for-pipeline-scripts-2026-04-09.md +++ b/docs/solutions/best-practices/prefer-python-over-bash-for-pipeline-scripts-2026-04-09.md @@ -3,7 +3,7 @@ title: "Prefer Python over bash for multi-step pipeline scripts" date: 2026-04-09 category: best-practices module: "skill scripting / ce-demo-reel" -problem_type: best_practice +problem_type: tooling_decision component: tooling severity: medium applies_when: diff --git a/docs/solutions/codex-skill-prompt-entrypoints.md b/docs/solutions/codex-skill-prompt-entrypoints.md index a0a9aa1e6..e435940ef 100644 --- a/docs/solutions/codex-skill-prompt-entrypoints.md +++ b/docs/solutions/codex-skill-prompt-entrypoints.md @@ -5,7 +5,7 @@ tags: [codex, converter, skills, prompts, workflows, deprecation] created: 2026-03-15 severity: medium component: codex-target -problem_type: best_practice +problem_type: convention root_cause: outdated_target_model --- diff --git a/docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md b/docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md index 1b782f967..806aa136b 100644 --- a/docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md +++ b/docs/solutions/skill-design/discoverability-check-for-documented-solutions-2026-03-30.md @@ -3,7 +3,7 @@ title: Discoverability check for documented solutions in project instruction fil date: 2026-03-30 category: skill-design module: compound-engineering -problem_type: best_practice +problem_type: convention component: tooling severity: medium applies_when: diff --git 
a/docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md b/docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md index 097937b08..0102ec1c2 100644 --- a/docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md +++ b/docs/solutions/skill-design/git-workflow-skills-need-explicit-state-machines-2026-03-27.md @@ -3,7 +3,7 @@ title: "Git workflow skills need explicit state machines for branch, push, and P category: skill-design date: 2026-03-27 module: plugins/compound-engineering/skills/git-commit and git-commit-push-pr -problem_type: best_practice +problem_type: architecture_pattern component: tooling symptoms: - Detached HEAD could fall through to invalid push or PR paths diff --git a/docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md b/docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md index 99a973ff2..991ffe530 100644 --- a/docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md +++ b/docs/solutions/skill-design/pass-paths-not-content-to-subagents-2026-03-26.md @@ -1,6 +1,6 @@ --- title: "Pass paths, not content, when dispatching sub-agents" -problem_type: best_practice +problem_type: design_pattern component: tooling root_cause: inadequate_documentation resolution_type: workflow_improvement diff --git a/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md b/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md index 88fc86579..32e79684c 100644 --- a/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md +++ b/docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md @@ -3,7 +3,7 @@ title: Research agent dispatch is intentionally separated across the skill pipel date: 2026-04-05 category: skill-design module: compound-engineering -problem_type: best_practice +problem_type: architecture_pattern 
component: tooling severity: low applies_when: diff --git a/docs/solutions/workflow/todo-status-lifecycle.md b/docs/solutions/workflow/todo-status-lifecycle.md index 983526bba..f13cdc2ad 100644 --- a/docs/solutions/workflow/todo-status-lifecycle.md +++ b/docs/solutions/workflow/todo-status-lifecycle.md @@ -13,7 +13,7 @@ related_components: - plugins/compound-engineering/skills/ce-review/ - plugins/compound-engineering/skills/todo-triage/ - plugins/compound-engineering/skills/todo-create/ -problem_type: correctness-gap +problem_type: workflow_issue --- # Status-Gated Todo Resolution diff --git a/plugins/compound-engineering/AGENTS.md b/plugins/compound-engineering/AGENTS.md index 4318f65eb..35428f2e9 100644 --- a/plugins/compound-engineering/AGENTS.md +++ b/plugins/compound-engineering/AGENTS.md @@ -215,6 +215,10 @@ Beta skills use a `-beta` suffix and `disable-model-invocation: true` to prevent When modifying a skill that has a `-beta` counterpart (or vice versa), always check the other version and **state your sync decision explicitly** before committing — e.g., "Propagated to beta — shared test guidance" or "Not propagating — this is the experimental delegate mode beta exists to test." Syncing to both, stable-only, and beta-only are all valid outcomes. The goal is deliberate reasoning, not a default rule. +## Documented Solutions + +`docs/solutions/` holds documented solutions to past problems — bugs, architecture patterns, design patterns, tooling decisions, conventions, workflow practices, and other institutional knowledge. Entries use YAML frontmatter with fields including `module`, `tags`, and `problem_type`. Knowledge-track `problem_type` values are `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`, `workflow_issue`, `developer_experience`, `documentation_gap`, and `best_practice` (fallback). 
Bug-track values cover `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, and `logic_error`. Search this directory before designing new solutions so institutional memory compounds across changes. + ## Documentation See `docs/solutions/plugin-versioning-requirements.md` for detailed versioning workflow. diff --git a/plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md b/plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md index ec1dfde40..3be083058 100644 --- a/plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md +++ b/plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md @@ -1,36 +1,65 @@ --- name: ce-learnings-researcher -description: "Searches docs/solutions/ for relevant past solutions by frontmatter metadata. Use before implementing features or fixing problems to surface institutional knowledge and prevent repeated mistakes." +description: "Searches docs/solutions/ for applicable past learnings by frontmatter metadata. Use before implementing features, making decisions, or starting work in a documented area — surfaces prior bugs, architecture patterns, design patterns, tooling decisions, conventions, and workflow learnings so institutional knowledge carries forward." model: inherit --- -You are an expert institutional knowledge researcher specializing in efficiently surfacing relevant documented solutions from the team's knowledge base. Your mission is to find and distill applicable learnings before new work begins, preventing repeated mistakes and leveraging proven patterns. +You are a domain-agnostic institutional knowledge researcher. Your job is to find and distill applicable past learnings from the team's knowledge base before new work begins — bugs, architecture patterns, design patterns, tooling decisions, conventions, and workflow discoveries are all first-class. 
Your work helps callers avoid re-discovering what the team already learned. + +Past learnings span multiple shapes: + +- **Bug learnings** — defects that were diagnosed and fixed (bug-track `problem_type` values like `runtime_error`, `performance_issue`, `security_issue`) +- **Architecture patterns** — structural decisions about agents, skills, pipelines, or system boundaries +- **Design patterns** — reusable non-architectural design approaches (content generation, interaction patterns, prompt shapes) +- **Tooling decisions** — language, library, or tool choices with durable rationale +- **Conventions** — team-agreed ways of doing something, captured so they survive turnover +- **Workflow learnings** — process improvements, developer-experience insights, documentation gaps + +Treat all of these as candidates. Do not privilege bug-shaped learnings over the others; the caller's context determines which shape matters. ## Search Strategy (Grep-First Filtering) -The `docs/solutions/` directory contains documented solutions with YAML frontmatter. When there may be hundreds of files, use this efficient strategy that minimizes tool calls: +The `docs/solutions/` directory contains documented learnings with YAML frontmatter. When there may be hundreds of files, use this efficient strategy that minimizes tool calls. + +### Step 1: Extract Keywords from the Work Context + +Callers may pass a structured `` block describing what they are doing: + +``` + +Activity: +Concepts: +Decisions: +Domains: + +``` + +When the caller passes this block, extract keywords from each field. + +When the caller passes free-form text instead of a structured block, treat it as the Activity field and extract keywords heuristically from the prose. Both shapes are supported. 
-### Step 1: Extract Keywords from Feature Description +Keyword dimensions to extract (applies to either input shape): -From the feature/task description, identify: -- **Module names**: e.g., "BriefSystem", "EmailProcessing", "payments" -- **Technical terms**: e.g., "N+1", "caching", "authentication" -- **Problem indicators**: e.g., "slow", "error", "timeout", "memory" -- **Component types**: e.g., "model", "controller", "job", "api" +- **Module names** — e.g., "BriefSystem", "EmailProcessing", "payments" +- **Technical terms** — e.g., "N+1", "caching", "authentication" +- **Problem indicators** — e.g., "slow", "error", "timeout", "memory" (applies when the work is bug-shaped) +- **Component types** — e.g., "model", "controller", "job", "api" +- **Concepts** — named ideas or abstractions: "per-finding walk-through", "fallback-with-warning", "pipeline separation" +- **Decisions** — choices the caller is weighing: "split into units", "migrate to framework X", "add a new tier" +- **Approaches** — strategies or patterns: "test-first", "state machine", "shared template" +- **Domains** — functional areas: "skill-design", "workflow", "code-implementation", "agent-architecture" -### Step 2: Category-Based Narrowing (Optional but Recommended) +The caller's context determines which dimensions carry weight. A code-bug query weights module + technical terms + problem indicators. A design-pattern query weights concepts + approaches + domains. A convention query weights decisions + domains. Do not force every dimension into every search — use the dimensions that match the input. 
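The two input shapes described above can be sketched as one normalization routine. This is a minimal illustrative sketch in Python (the agent itself operates on prompt text, not code); the function name `extract_keywords` and the comma/semicolon splitting rule are assumptions for illustration, not part of the agent contract:

```python
import re

FIELDS = ("Activity", "Concepts", "Decisions", "Domains")

def extract_keywords(raw: str) -> dict:
    """Normalize structured or free-form work context into per-field keyword lists."""
    out = {f: [] for f in FIELDS}
    structured = False
    for line in raw.splitlines():
        m = re.match(r"^\s*(Activity|Concepts|Decisions|Domains)\s*:\s*(.*)$", line)
        if m:
            structured = True
            # Split the field value on commas/semicolons into individual keywords.
            out[m.group(1)].extend(
                t.strip() for t in re.split(r"[,;]", m.group(2)) if t.strip()
            )
    if not structured:
        # Free-form prose: treat the whole text as Activity and pull word tokens.
        out["Activity"] = re.findall(r"[A-Za-z][A-Za-z0-9_-]{3,}", raw)
    return out
```

Either shape yields the same output structure, so downstream steps never need to know which one the caller sent.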
-If the feature type is clear, narrow the search to relevant category directories: +### Step 2: Probe Discovered Subdirectories -| Feature Type | Search Directory | -|--------------|------------------| -| Performance work | `docs/solutions/performance-issues/` | -| Database changes | `docs/solutions/database-issues/` | -| Bug fix | `docs/solutions/runtime-errors/`, `docs/solutions/logic-errors/` | -| Security | `docs/solutions/security-issues/` | -| UI work | `docs/solutions/ui-bugs/` | -| Integration | `docs/solutions/integration-issues/` | -| General/unclear | `docs/solutions/` (all) | +Use the native file-search/glob tool (e.g., Glob in Claude Code) to discover which subdirectories actually exist under `docs/solutions/` at invocation time. Do not assume a fixed list — subdirectory names are per-repo convention and may include any of: + +- Bug-shaped: `build-errors/`, `test-failures/`, `runtime-errors/`, `performance-issues/`, `database-issues/`, `security-issues/`, `ui-bugs/`, `integration-issues/`, `logic-errors/` +- Knowledge-shaped: `architecture-patterns/`, `design-patterns/`, `tooling-decisions/`, `conventions/`, `workflow/`, `workflow-issues/`, `developer-experience/`, `documentation-gaps/`, `best-practices/`, `skill-design/`, `integrations/` +- Other per-repo categories + +Narrow the search to the discovered subdirectories that match the caller's Domain hint or that align with the keyword shape (e.g., bug-shaped keywords → bug-shaped subdirectories). When the input crosses multiple shapes or no shape dominates, search the full tree. 
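The probe-then-narrow behavior above can be sketched as follows, assuming a plain filesystem walk stands in for the native glob tool. The `BUG_SHAPED` set mirrors the bug-shaped list above; anything else discovered under the root is treated as knowledge-shaped or per-repo:

```python
from pathlib import Path

BUG_SHAPED = {
    "build-errors", "test-failures", "runtime-errors", "performance-issues",
    "database-issues", "security-issues", "ui-bugs", "integration-issues",
    "logic-errors",
}

def probe_categories(root: str = "docs/solutions") -> dict:
    """Discover which category subdirectories actually exist, grouped by shape."""
    base = Path(root)
    if not base.is_dir():
        # No solutions directory at all: nothing to narrow to.
        return {"bug": [], "knowledge": []}
    dirs = sorted(p.name for p in base.iterdir() if p.is_dir())
    return {
        "bug": [d for d in dirs if d in BUG_SHAPED],
        "knowledge": [d for d in dirs if d not in BUG_SHAPED],
    }
```

Because the listing is computed at invocation time, a repo with unusual category names degrades to "search the full tree" rather than missing content silently.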
### Step 3: Content-Search Pre-Filter (Critical for Efficiency) @@ -45,31 +74,28 @@ content-search: pattern="component:.*background_job" path=docs/solutions/ files_ ``` **Pattern construction tips:** + - Use `|` for synonyms: `tags:.*(payment|billing|stripe|subscription)` -- Include `title:` - often the most descriptive field +- Include `title:` — often the most descriptive field - Search case-insensitively - Include related terms the user might not have mentioned +- For non-bug-shaped queries, search on concept or domain keywords in `tags:` and `title:` fields — not just on bug-specific fields like `symptoms:` or `root_cause:` **Why this works:** Content search scans file contents without reading into context. Only matching filenames are returned, dramatically reducing the set of files to examine. **Combine results** from all searches to get candidate files (typically 5-20 files instead of 200). -**If search returns >25 candidates:** Re-run with more specific patterns or combine with category narrowing. +**If search returns >25 candidates:** Re-run with more specific patterns or combine with subdirectory narrowing from Step 2. **If search returns <3 candidates:** Do a broader content search (not just frontmatter fields) as fallback: + ``` content-search: pattern="email" path=docs/solutions/ files_only=true case_insensitive=true ``` -### Step 3b: Always Check Critical Patterns +### Step 3b: Conditionally Check Critical Patterns -**Regardless of Grep results**, always read the critical patterns file: - -```bash -Read: docs/solutions/patterns/critical-patterns.md -``` - -This file contains must-know patterns that apply across all work - high-severity issues promoted to required reading. Scan for patterns relevant to the current feature/task. +If `docs/solutions/patterns/critical-patterns.md` exists in this repo, read it. It may contain must-know patterns that apply across all work. 
When the file does not exist, skip this step silently — the critical-patterns convention is optional and not all repos follow it. ### Step 4: Read Frontmatter of Candidates Only @@ -81,40 +107,48 @@ Read: [file_path] with limit:30 ``` Extract these fields from the YAML frontmatter: -- **module**: Which module/system the solution applies to -- **problem_type**: Category of issue (see schema below) -- **component**: Technical component affected -- **symptoms**: Array of observable symptoms -- **root_cause**: What caused the issue -- **tags**: Searchable keywords -- **severity**: critical, high, medium, low + +- **module** — which module, system, or domain the learning applies to +- **problem_type** — category (knowledge-track and bug-track values apply equally; see schema reference below) +- **component** — technical component or area affected (when applicable) +- **tags** — searchable keywords +- **symptoms** — observable behaviors or friction (present on bug-track entries and sometimes on knowledge-track entries) +- **root_cause** — underlying cause (present on bug-track entries; optional on knowledge-track entries) +- **severity** — critical, high, medium, low + +Some non-bug entries may have looser frontmatter shapes (they do not require `symptoms` or `root_cause`). Do not discard these entries for missing bug-shaped fields — use whatever fields are present for matching. 
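A minimal sketch of the tolerant frontmatter read, assuming simple `key: value` scalar fields within the first 30 lines (real entries may use richer YAML that a line-based pass will not fully capture; the point is that missing bug-shaped keys simply stay absent rather than disqualifying the entry):

```python
def read_frontmatter(text: str, max_lines: int = 30) -> dict:
    """Extract scalar frontmatter fields from the head of a solutions doc.

    Tolerant by design: knowledge-track entries may omit bug-shaped fields
    like symptoms or root_cause, so absent keys are simply not present.
    """
    lines = text.splitlines()[:max_lines]
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter block
        # Skip indented continuation lines and list items; keep top-level keys.
        if ":" in line and not line.startswith((" ", "-", "\t")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
    return fields
```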
### Step 5: Score and Rank Relevance -Match frontmatter fields against the feature/task description: +Match frontmatter fields against the keywords extracted in Step 1: **Strong matches (prioritize):** -- `module` matches the feature's target module -- `tags` contain keywords from the feature description -- `symptoms` describe similar observable behaviors + +- `module` or domain matches the caller's area of work +- `tags` contain keywords from the caller's Concepts, Decisions, or Approaches +- `title` contains keywords from the caller's Activity or Concepts - `component` matches the technical area being touched +- `symptoms` describe similar observable behaviors (when applicable) **Moderate matches (include):** -- `problem_type` is relevant (e.g., `performance_issue` for optimization work) + +- `problem_type` is relevant (e.g., `architecture_pattern` when the caller is making architectural decisions, `performance_issue` when the caller is optimizing) - `root_cause` suggests a pattern that might apply -- Related modules or components mentioned +- Related modules, components, or domains mentioned **Weak matches (skip):** -- No overlapping tags, symptoms, or modules -- Unrelated problem types + +- No overlapping tags, symptoms, concepts, or modules +- Unrelated `problem_type` and no cross-cutting applicability ### Step 6: Full Read of Relevant Files Only for files that pass the filter (strong or moderate matches), read the complete document to extract: -- The full problem description -- The solution implemented -- Prevention guidance -- Code examples + +- The full problem framing or decision context +- The learning itself (solution, pattern, decision, convention) +- Prevention guidance or application notes +- Code examples or illustrative evidence ### Step 7: Return Distilled Summaries @@ -123,109 +157,110 @@ For each relevant document, return a summary in this format: ```markdown ### [Title from document] - **File**: docs/solutions/[category]/[filename].md -- 
**Module**: [module from frontmatter] -- **Problem Type**: [problem_type] -- **Relevance**: [Brief explanation of why this is relevant to the current task] -- **Key Insight**: [The most important takeaway - the thing that prevents repeating the mistake] -- **Severity**: [severity level] +- **Module/Domain**: [module or domain from frontmatter] +- **Type**: [bug | architecture_pattern | design_pattern | tooling_decision | convention | workflow | other] +- **Relevance**: [Brief explanation of why this is relevant to the caller's work] +- **Key takeaway**: [The decision, pattern, or pitfall to carry forward] +- **Severity**: [severity level, when present] ``` +The **Type** field is derived from `problem_type`. Group bug-track values as `bug`; knowledge-track values (`architecture_pattern`, `design_pattern`, `tooling_decision`, `convention`, `workflow_issue`, `developer_experience`, `best_practice`, `documentation_gap`) surface as themselves so the caller can tell which shape of learning they are getting. 
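The Type derivation described above reduces to a small lookup. The two sets mirror the bug-track and knowledge-track `problem_type` values listed in this document; `summary_type` is an illustrative name, not a defined interface:

```python
BUG_TRACK = {
    "build_error", "test_failure", "runtime_error", "performance_issue",
    "database_issue", "security_issue", "ui_bug", "integration_issue",
    "logic_error",
}
KNOWLEDGE_TRACK = {
    "architecture_pattern", "design_pattern", "tooling_decision", "convention",
    "workflow_issue", "developer_experience", "best_practice", "documentation_gap",
}

def summary_type(problem_type: str) -> str:
    """Bug-track values collapse to 'bug'; knowledge-track values surface
    as themselves; anything unrecognized falls back to 'other'."""
    if problem_type in BUG_TRACK:
        return "bug"
    if problem_type in KNOWLEDGE_TRACK:
        return problem_type
    return "other"
```

Collapsing the bug track keeps the summary compact while still letting the caller distinguish which shape of knowledge-track learning they are getting.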
+ ## Frontmatter Schema Reference Use this on-demand schema reference when you need the full contract: `../../skills/ce-compound/references/yaml-schema.md` -Key enum values: - -**problem_type values:** -- build_error, test_failure, runtime_error, performance_issue -- database_issue, security_issue, ui_bug, integration_issue -- logic_error, developer_experience, workflow_issue -- best_practice, documentation_gap - -**component values:** -- rails_model, rails_controller, rails_view, service_object -- background_job, database, frontend_stimulus, hotwire_turbo -- email_processing, brief_system, assistant, authentication -- payments, development_workflow, testing_framework, documentation, tooling - -**root_cause values:** -- missing_association, missing_include, missing_index, wrong_api -- scope_issue, thread_violation, async_timing, memory_leak -- config_error, logic_error, test_isolation, missing_validation -- missing_permission, missing_workflow_step, inadequate_documentation -- missing_tooling, incomplete_setup - -**Category directories (mapped from problem_type):** -- `docs/solutions/build-errors/` -- `docs/solutions/test-failures/` -- `docs/solutions/runtime-errors/` -- `docs/solutions/performance-issues/` -- `docs/solutions/database-issues/` -- `docs/solutions/security-issues/` -- `docs/solutions/ui-bugs/` -- `docs/solutions/integration-issues/` -- `docs/solutions/logic-errors/` -- `docs/solutions/developer-experience/` -- `docs/solutions/workflow-issues/` -- `docs/solutions/best-practices/` -- `docs/solutions/documentation-gaps/` +Key enum values (illustrative — actual schema is authoritative): + +**Knowledge-track problem_type values:** + +- `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention` +- `workflow_issue`, `developer_experience`, `documentation_gap` +- `best_practice` (fallback for entries not covered by a narrower knowledge-track value) + +**Bug-track problem_type values:** + +- `build_error`, `test_failure`, `runtime_error`, 
`performance_issue` +- `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, `logic_error` + +**component values** (bug-track; optional on knowledge-track): + +- `rails_model`, `rails_controller`, `rails_view`, `service_object` +- `background_job`, `database`, `frontend_stimulus`, `hotwire_turbo` +- `email_processing`, `brief_system`, `assistant`, `authentication` +- `payments`, `development_workflow`, `testing_framework`, `documentation`, `tooling` + +**root_cause values** (bug-track; optional on knowledge-track): + +- `missing_association`, `missing_include`, `missing_index`, `wrong_api` +- `scope_issue`, `thread_violation`, `async_timing`, `memory_leak` +- `config_error`, `logic_error`, `test_isolation`, `missing_validation` +- `missing_permission`, `missing_workflow_step`, `inadequate_documentation` +- `missing_tooling`, `incomplete_setup` + +Subdirectory listings in the schema reference are illustrative, not exhaustive. Probe the live directory (Step 2) for what actually exists. ## Output Format Structure your findings as: ```markdown -## Institutional Learnings Search Results +## Applicable Past Learnings ### Search Context -- **Feature/Task**: [Description of what's being implemented] -- **Keywords Used**: [tags, modules, symptoms searched] -- **Files Scanned**: [X total files] -- **Relevant Matches**: [Y files] +- **Work context**: [Summary of caller's Activity / Concepts / Decisions / Domains] +- **Keywords used**: [tags, modules, concepts, domains searched] +- **Subdirectories probed**: [list of docs/solutions/ subdirectories searched] +- **Files scanned**: [X total files] +- **Relevant matches**: [Y files] -### Critical Patterns (Always Check) -[Any matching patterns from critical-patterns.md] +### Critical Patterns +[When critical-patterns.md exists and has relevant content; omit this section when the file does not exist in this repo.] ### Relevant Learnings #### 1. 
[Title] - **File**: [path] -- **Module**: [module] -- **Relevance**: [why this matters for current task] -- **Key Insight**: [the gotcha or pattern to apply] +- **Module/Domain**: [module or domain] +- **Type**: [bug | architecture_pattern | design_pattern | tooling_decision | convention | workflow | other] +- **Relevance**: [why this matters for caller's work] +- **Key takeaway**: [decision, pattern, or pitfall to carry forward] #### 2. [Title] ... ### Recommendations -- [Specific actions to take based on learnings] -- [Patterns to follow] -- [Gotchas to avoid] +- [Specific actions or decisions to consider based on the surfaced learnings] +- [Patterns to follow or mirror] +- [Past mis-steps worth avoiding, where applicable] ### No Matches -[If no relevant learnings found, explicitly state this] +[If no relevant learnings found, explicitly state this. Include the search context and subdirectories probed so the caller can see what was looked for.] ``` ## Efficiency Guidelines **DO:** + - Use the native content-search tool to pre-filter files BEFORE reading any content (critical for 100+ files) -- Run multiple content searches in PARALLEL for different keywords -- Include `title:` in search patterns - often the most descriptive field +- Run multiple content searches in PARALLEL for different keyword dimensions +- Probe `docs/solutions/` subdirectories dynamically rather than assuming a fixed list +- Include `title:` in search patterns — often the most descriptive field - Use OR patterns for synonyms: `tags:.*(payment|billing|stripe)` - Use `-i=true` for case-insensitive matching -- Use category directories to narrow scope when feature type is clear +- Narrow to discovered subdirectories when the caller's Domain hint makes one obvious - Do a broader content search as fallback if <3 candidates found - Re-narrow with more specific patterns if >25 candidates found -- Always read the critical patterns file (Step 3b) +- Treat architecture patterns, design patterns, tooling 
decisions, conventions, and workflow learnings as first-class — not secondary to bugs - Only read frontmatter of search-matched candidates (not all files) -- Filter aggressively - only fully read truly relevant files -- Prioritize high-severity and critical patterns -- Extract actionable insights, not just summaries +- Filter aggressively — only fully read truly relevant files +- Prioritize high-severity entries and critical patterns +- Extract actionable takeaways, not just summaries - Note when no relevant learnings exist (this is valuable information too) **DON'T:** + - Read frontmatter of ALL files (use content-search to pre-filter first) - Run searches sequentially when they can be parallel - Use only exact keyword matches (include synonyms) @@ -234,12 +269,15 @@ Structure your findings as: - Read every file in full (wasteful) - Return raw document contents (distill instead) - Include tangentially related learnings (focus on relevance) -- Skip the critical patterns file (always check it) +- Discard a candidate because it lacks bug-shaped fields like `symptoms` or `root_cause` — non-bug entries legitimately omit them +- Privilege bug-track entries over knowledge-track entries when both are relevant +- Assume `docs/solutions/patterns/critical-patterns.md` exists — read it only when present ## Integration Points -This agent is designed to be invoked by: -- `/ce-plan` - To inform planning with institutional knowledge and add depth during confidence checking -- Manual invocation before starting work on a feature +This agent is invoked by: + +- `/ce-plan` — to inform planning with institutional knowledge and add depth during confidence checking +- Standalone invocation before starting work in a documented area -The goal is to surface relevant learnings in under 30 seconds for a typical solutions directory, enabling fast knowledge retrieval during planning phases. 
+The goal is to surface relevant learnings in under 30 seconds for a typical solutions directory, enabling fast knowledge retrieval at decision points. diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml b/plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml index ad68b27f9..f8a726dd5 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml +++ b/plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml @@ -23,12 +23,16 @@ tracks: - integration_issue - logic_error knowledge: - description: "Best practices, workflow improvements, patterns, and documentation" + description: "Practices, patterns, conventions, decisions, workflow improvements, and documentation" problem_types: - best_practice - documentation_gap - workflow_issue - developer_experience + - architecture_pattern + - design_pattern + - tooling_decision + - convention # --- Fields required by BOTH tracks ----------------------------------------- required_fields: @@ -57,7 +61,11 @@ required_fields: - workflow_issue - best_practice - documentation_gap - description: "Primary category — determines track (bug vs knowledge)" + - architecture_pattern + - design_pattern + - tooling_decision + - convention + description: "Primary category — determines track (bug vs knowledge). Prefer the narrowest applicable value; best_practice is the fallback when no narrower knowledge-track value fits." component: type: enum diff --git a/plugins/compound-engineering/skills/ce-compound-refresh/references/yaml-schema.md b/plugins/compound-engineering/skills/ce-compound-refresh/references/yaml-schema.md index ca6cd71be..31420ca3b 100644 --- a/plugins/compound-engineering/skills/ce-compound-refresh/references/yaml-schema.md +++ b/plugins/compound-engineering/skills/ce-compound-refresh/references/yaml-schema.md @@ -16,7 +16,7 @@ The `problem_type` determines which **track** applies. 
Each track has different | Track | problem_types | Description | |-------|--------------|-------------| | **Bug** | `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, `logic_error` | Defects and failures that were diagnosed and fixed | -| **Knowledge** | `best_practice`, `documentation_gap`, `workflow_issue`, `developer_experience` | Practices, patterns, workflow improvements, and documentation | +| **Knowledge** | `best_practice`, `documentation_gap`, `workflow_issue`, `developer_experience`, `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention` | Practices, patterns, conventions, decisions, workflow improvements, and documentation. Prefer the narrowest applicable value; `best_practice` is the fallback. | ## Required Fields (both tracks) @@ -73,6 +73,10 @@ Docs created before the track system may have `symptoms`/`root_cause`/`resolutio - `workflow_issue` -> `docs/solutions/workflow-issues/` - `best_practice` -> `docs/solutions/best-practices/` - `documentation_gap` -> `docs/solutions/documentation-gaps/` +- `architecture_pattern` -> `docs/solutions/architecture-patterns/` +- `design_pattern` -> `docs/solutions/design-patterns/` +- `tooling_decision` -> `docs/solutions/tooling-decisions/` +- `convention` -> `docs/solutions/conventions/` ## Validation Rules diff --git a/plugins/compound-engineering/skills/ce-compound/SKILL.md b/plugins/compound-engineering/skills/ce-compound/SKILL.md index 2684f6152..4dd99918c 100644 --- a/plugins/compound-engineering/skills/ce-compound/SKILL.md +++ b/plugins/compound-engineering/skills/ce-compound/SKILL.md @@ -410,10 +410,14 @@ Bug track: - logic-errors/ Knowledge track: -- best-practices/ +- architecture-patterns/ — architectural or structural patterns (agent/skill/pipeline/workflow shape decisions) +- design-patterns/ — reusable non-architectural design approaches (content generation, interaction patterns, prompt shapes) +- 
tooling-decisions/ — language, library, or tool choices with durable rationale +- conventions/ — team-agreed way of doing something, captured so it survives turnover - workflow-issues/ - developer-experience/ - documentation-gaps/ +- best-practices/ — fallback only, use when no narrower knowledge-track value applies ## Common Mistakes to Avoid diff --git a/plugins/compound-engineering/skills/ce-compound/references/schema.yaml b/plugins/compound-engineering/skills/ce-compound/references/schema.yaml index ad68b27f9..f8a726dd5 100644 --- a/plugins/compound-engineering/skills/ce-compound/references/schema.yaml +++ b/plugins/compound-engineering/skills/ce-compound/references/schema.yaml @@ -23,12 +23,16 @@ tracks: - integration_issue - logic_error knowledge: - description: "Best practices, workflow improvements, patterns, and documentation" + description: "Practices, patterns, conventions, decisions, workflow improvements, and documentation" problem_types: - best_practice - documentation_gap - workflow_issue - developer_experience + - architecture_pattern + - design_pattern + - tooling_decision + - convention # --- Fields required by BOTH tracks ----------------------------------------- required_fields: @@ -57,7 +61,11 @@ required_fields: - workflow_issue - best_practice - documentation_gap - description: "Primary category — determines track (bug vs knowledge)" + - architecture_pattern + - design_pattern + - tooling_decision + - convention + description: "Primary category — determines track (bug vs knowledge). Prefer the narrowest applicable value; best_practice is the fallback when no narrower knowledge-track value fits." 
component: type: enum diff --git a/plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md b/plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md index ca6cd71be..31420ca3b 100644 --- a/plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md +++ b/plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md @@ -16,7 +16,7 @@ The `problem_type` determines which **track** applies. Each track has different | Track | problem_types | Description | |-------|--------------|-------------| | **Bug** | `build_error`, `test_failure`, `runtime_error`, `performance_issue`, `database_issue`, `security_issue`, `ui_bug`, `integration_issue`, `logic_error` | Defects and failures that were diagnosed and fixed | -| **Knowledge** | `best_practice`, `documentation_gap`, `workflow_issue`, `developer_experience` | Practices, patterns, workflow improvements, and documentation | +| **Knowledge** | `best_practice`, `documentation_gap`, `workflow_issue`, `developer_experience`, `architecture_pattern`, `design_pattern`, `tooling_decision`, `convention` | Practices, patterns, conventions, decisions, workflow improvements, and documentation. Prefer the narrowest applicable value; `best_practice` is the fallback. 
| ## Required Fields (both tracks) @@ -73,6 +73,10 @@ Docs created before the track system may have `symptoms`/`root_cause`/`resolutio - `workflow_issue` -> `docs/solutions/workflow-issues/` - `best_practice` -> `docs/solutions/best-practices/` - `documentation_gap` -> `docs/solutions/documentation-gaps/` +- `architecture_pattern` -> `docs/solutions/architecture-patterns/` +- `design_pattern` -> `docs/solutions/design-patterns/` +- `tooling_decision` -> `docs/solutions/tooling-decisions/` +- `convention` -> `docs/solutions/conventions/` ## Validation Rules diff --git a/plugins/compound-engineering/skills/ce-doc-review/SKILL.md b/plugins/compound-engineering/skills/ce-doc-review/SKILL.md index dabe8f409..83416784f 100644 --- a/plugins/compound-engineering/skills/ce-doc-review/SKILL.md +++ b/plugins/compound-engineering/skills/ce-doc-review/SKILL.md @@ -6,28 +6,34 @@ argument-hint: "[mode:headless] [path/to/document.md]" # Document Review -Review requirements or plan documents through multi-persona analysis. Dispatches specialized reviewer agents in parallel, auto-fixes quality issues, and presents strategic questions for user decision. +Review requirements or plan documents through multi-persona analysis. Dispatches specialized reviewer agents in parallel, auto-applies `safe_auto` fixes, and routes remaining findings through a four-option interaction (per-finding walk-through, LFG, Append-to-Open-Questions, Report-only) for user decision. + +## Interactive mode rules + +- **Pre-load the platform question tool before any question fires.** In Claude Code, `AskUserQuestion` is a deferred tool — its schema is not available at session start. At the start of Interactive-mode work (before the routing question, per-finding walk-through questions, bulk-preview Proceed/Cancel, and Phase 5 terminal question), call `ToolSearch` with query `select:AskUserQuestion` to load the schema. 
Load it once, eagerly, at the top of the Interactive flow — do not wait for the first question site. On Codex (`request_user_input`) and Gemini (`ask_user`) this step is not required; the tools are loaded by default. +- **The numbered-list fallback only applies on confirmed load failure.** Presenting options as a numbered list and waiting for the user's reply is valid only when `ToolSearch` returns no match or the tool call explicitly fails. Rendering a question as narrative text because the tool feels inconvenient, because the model is in report-formatting mode, or because the instruction was buried in a long skill is a bug. A question that calls for a user decision must either fire the tool or fail loudly. ## Phase 0: Detect Mode -Check the skill arguments for `mode:headless`. Arguments may contain a document path, `mode:headless`, or both. Tokens starting with `mode:` are flags, not file paths -- strip them from the arguments and use the remaining token (if any) as the document path for Phase 1. +Check the skill arguments for `mode:headless`. Arguments may contain a document path, `mode:headless`, or both. Tokens starting with `mode:` are flags, not file paths — strip them from the arguments and use the remaining token (if any) as the document path for Phase 1. If `mode:headless` is present, set **headless mode** for the rest of the workflow. -**Headless mode** changes the interaction model, not the classification boundaries. Document-review still applies the same judgment about what has one clear correct fix vs. what needs user judgment. The only difference is how non-auto findings are delivered: -- `auto` fixes are applied silently (same as interactive) -- `present` findings are returned as structured text for the caller to handle -- no AskUserQuestion prompts, no interactive approval -- Phase 5 returns immediately with "Review complete" (no refine/complete question) +**Headless mode** changes the interaction model, not the classification boundaries. 
ce-doc-review still applies the same judgment about which tier each finding belongs in. The only difference is how non-safe_auto findings are delivered: + +- `safe_auto` fixes are applied silently (same as interactive) +- `gated_auto`, `manual`, and FYI findings are returned as structured text for the caller to handle — no AskUserQuestion prompts, no interactive routing +- Phase 5 returns immediately with "Review complete" (no routing question, no terminal question) The caller receives findings with their original classifications intact and decides what to do with them. Callers invoke headless mode by including `mode:headless` in the skill arguments, e.g.: + ``` Skill("ce-doc-review", "mode:headless docs/plans/my-plan.md") ``` - -If `mode:headless` is not present, the skill runs in its default interactive mode with no behavior change. +If `mode:headless` is not present, the skill runs in its default interactive mode with the routing question, walk-through, and bulk-preview behaviors documented in `references/walkthrough.md` and `references/bulk-preview.md`. ## Phase 1: Get and Analyze Document @@ -124,8 +130,44 @@ Dispatch all agents in **parallel** using the platform's task/agent tool (e.g., | `{document_type}` | "requirements" or "plan" from Phase 1 classification | | `{document_path}` | Path to the document | | `{document_content}` | Full text of the document | +| `{decision_primer}` | Cumulative prior-round decisions in the current session, or an empty `` block on round 1. See "Decision primer" below. | + +Pass each agent the **full document** — do not split into sections. + +### Decision primer + +On round 1 (no prior decisions), set `{decision_primer}` to: + +``` + +Round 1 — no prior decisions. + +``` + +On round 2+ (after one or more prior rounds in the current interactive session), accumulate prior-round decisions and render them as: -Pass each agent the **full document** -- do not split into sections. 
+``` + +Round 1 — applied (N entries): +- {section}: "{title}" ({reviewer}, {confidence}) + Evidence: "{evidence_snippet}" + +Round 1 — rejected (M entries): +- {section}: "{title}" — Skipped because {reason} + Evidence: "{evidence_snippet}" +- {section}: "{title}" — Deferred to Open Questions because {reason or "no reason provided"} + Evidence: "{evidence_snippet}" + +Round 2 — applied (N entries): +... + +``` + +Each entry carries an `Evidence:` line because synthesis R29 (rejected-finding suppression) and R30 (fix-landed verification) both use an evidence-substring overlap check as part of their matching predicate — without the evidence snippet in the primer, the orchestrator cannot compute the `>50%` overlap test and has to fall back to fingerprint-only matching, which either re-surfaces rejected findings or suppresses too aggressively. The `{evidence_snippet}` is the first evidence quote from the finding, truncated to the first ~120 characters (preserving whole words at the boundary) and with internal quotes escaped. If a finding has multiple evidence entries, use the first one; the rest live in the run artifact and are not needed for the overlap check. + +Accumulate across all rounds in the current session. Skip and Defer actions both count as "rejected" for suppression purposes — both signal the user decided the finding wasn't worth actioning this round. Applied findings stay on the applied list so round-N+1 personas can verify fixes landed (see R30 in `references/synthesis-and-presentation.md`). + +Cross-session persistence is out of scope. A new invocation of ce-doc-review on the same document starts with a fresh round 1 and no carried primer, even if prior sessions deferred findings into the document's Open Questions section. **Error handling:** If an agent fails or times out, proceed with findings from agents that completed. Note the failed agent in the Coverage section. Do not block the entire review on a single agent failure. 
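The snippet rule above (first ~120 characters, whole words preserved at the boundary, internal quotes escaped) can be sketched as a small helper. This is a minimal sketch; `evidence_snippet` and its parameter names are illustrative, not part of the skill contract:

```python
def evidence_snippet(quote: str, limit: int = 120) -> str:
    """Truncate an evidence quote for the decision primer.

    Escapes internal double quotes, then cuts at the last whole word
    that fits within `limit` characters.
    """
    text = quote.replace('"', '\\"')
    if len(text) <= limit:
        return text
    cut = text[:limit]
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]  # back off to a word boundary
    return cut
```

The same slice feeds both the primer entries and the dedup fingerprint, so implementations should compute it once per finding.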
@@ -133,7 +175,9 @@ Pass each agent the **full document** -- do not split into sections. ## Phases 3-5: Synthesis, Presentation, and Next Action -After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, gate, dedup, promote, resolve contradictions, route by autofix class), auto-fix application, finding presentation, and next-action menu. Do not load this file before agent dispatch completes. +After all dispatched agents return, read `references/synthesis-and-presentation.md` for the synthesis pipeline (validate, per-severity gate, dedup, cross-persona agreement boost, resolve contradictions, auto-promotion, route by three tiers with FYI subsection), `safe_auto` fix application, headless-envelope output, and the handoff to the routing question. + +For the four-option routing question and per-finding walk-through (interactive mode), read `references/walkthrough.md`. For the bulk-action preview used by LFG, Append-to-Open-Questions, and walk-through `LFG-the-rest`, read `references/bulk-preview.md`. Do not load these files before agent dispatch completes. --- diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md b/plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md new file mode 100644 index 000000000..bd444114a --- /dev/null +++ b/plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md @@ -0,0 +1,128 @@ +# Bulk Action Preview + +This reference defines the compact plan preview that Interactive mode shows before every bulk action — LFG (routing option B), Append-to-Open-Questions (routing option C), and the walk-through's `LFG the rest` (option D of the per-finding question). The preview gives the user a single-screen view of what the agent is about to do, with exactly two options to Proceed or Cancel. + +Interactive mode only. + +--- + +## When the preview fires + +Three call sites: + +1. 
**Routing option B (top-level LFG)** — after the user picks `LFG. Apply the agent's best-judgment action per finding` from the routing question, but before any action executes. Scope: every pending `gated_auto` / above-gate `manual` finding.
+2. **Routing option C (top-level Append-to-Open-Questions)** — after the user picks `Append findings to the doc's Open Questions section and proceed` but before any append runs. Scope: every pending `gated_auto` / above-gate `manual` finding. Every finding appears under `Appending to Open Questions (N):` regardless of the agent's natural recommendation, because option C is batch-defer.
+3. **Walk-through `LFG the rest`** — after the user picks `LFG the rest — apply the agent's best judgment to this and remaining findings` from a per-finding question, but before the remaining findings are resolved. Scope: the current finding and everything not yet decided. Already-decided findings from the walk-through are not included in the preview.
+
+In all three cases the user confirms with `Proceed` or backs out with `Cancel`. No per-item decisions inside the preview — per-item decisioning is the walk-through's role.
+
+---
+
+## Preview structure
+
+The preview is grouped by the action the agent intends to take. Bucket headers appear only when their bucket is non-empty.
+
+```
+{scope summary}:
+
+Applying (N):
+  [P0] {section} — {one-line summary}
+  [P1] {section} — {one-line summary}
+
+Appending to Open Questions (N):
+  [P2] {section} — {one-line summary}
+
+Skipping (N):
+  [P2] {section} — {one-line summary}
+```
+
+Worked example for routing option B (top-level LFG):
+
+```
+LFG plan — 8 findings:
+
+Applying (4):
+  [P0] Requirements Trace — Renumber R4 to match unit reference
+  [P1] Unit 3 Files — Add read-fallback for renamed report file
+  [P2] Key Technical Decisions — Use framework's Deprecated field rather than hand-rolling
+  [P3] Overview — Correct wrong count (says 6, list has 5)
+
+Appending to Open Questions (2):
+  [P2] Scope Boundaries — Unit 2/3 merge judgment call
+  [P2] Risks — Alias compatibility-theater concern
+
+Skipping (2):
+  [P2] Miscellaneous Notes — Low-confidence style preference
+  [P3] Abstraction Commentary — Speculative, subjective
+```
+
+---
+
+## Scope summary wording by path
+
+- **Routing option B (top-level LFG):** header reads `LFG plan — N findings:`.
+- **Routing option C (top-level Append-to-Open-Questions):** header reads `Append plan — N findings as Open Questions entries:`. Every finding lands in the `Appending to Open Questions (N):` bucket.
+- **Walk-through `LFG the rest`:** header reads `LFG plan — N remaining findings (K already decided):`. Already-decided findings from the walk-through are not included in the preview or in the bucket counts. The `K already decided` counter communicates that the walk-through was partially completed.
+
+---
+
+## Per-finding line format
+
+Each line uses the compressed form of the framing-quality guidance from the subagent template (observable-consequence-first, no internal section numbering unless needed to locate). The one-line summary is drawn from the persona-produced `why_it_matters` by taking the first sentence (and, when the first sentence is too long for the preview width, paraphrasing it tightly to fit).
+
+- **Shape:** `[{severity}] {section} — {one-line summary}
` +- **Width target:** keep lines near 80 columns so the preview renders cleanly in narrow terminals. Truncate with ellipsis when necessary. +- **No section numbering** unless the reader needs it to locate the issue (when multiple findings hit the same named section). + +When no `why_it_matters` is available for a finding (rare — only if persona output was malformed), fall back to the finding's title directly. Note the gap in the completion report's Coverage section if it affects more than a few findings in the same run. + +--- + +## Question and options + +After the preview body is rendered, ask the user using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code, the tool should already be loaded from the Interactive-mode pre-load step — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. The text fallback below applies only when that load explicitly fails. + +Stem (adapted to the path): + +- For routing B: `The agent is about to apply the plan above. Proceed?` +- For routing C: `The agent is about to append the findings above to the doc's Open Questions section. Proceed?` +- For walk-through `LFG the rest`: `The agent is about to resolve the remaining findings above. Proceed?` + +Options (exactly two, in all three cases): + +- `Proceed` — execute the plan as shown +- `Cancel` — do nothing, return to the originating question + +Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to presenting numbered options and waiting for the user's next reply. + +--- + +## Cancel semantics + +- **From routing option B Cancel:** return the user to the routing question (the four-option menu). Do not edit the document, do not append any Open Questions entries, do not record any state. +- **From routing option C Cancel:** same — return to the routing question, no side effects. 
+- **From walk-through `LFG the rest` Cancel:** return the user to the current finding's per-finding question (not to the routing question). The walk-through continues from where it was, with prior decisions intact. + +In every case, `Cancel` changes no on-disk or in-memory state. + +--- + +## Proceed semantics + +When the user picks `Proceed`: + +- **Routing option B (top-level LFG):** for each finding in the plan, execute the recommended action. Apply findings go into the Apply set for a single end-of-batch document-edit pass (see `walkthrough.md` for the Apply batching rules). Defer findings route through `references/open-questions-defer.md`. Skip findings are recorded as no-action. After all actions complete, emit the unified completion report (see `walkthrough.md`). +- **Routing option C (top-level Append-to-Open-Questions):** every finding routes through `references/open-questions-defer.md` for Open Questions append. No document edits apply (beyond the Open Questions section additions themselves). After all appends complete (or fail), emit the unified completion report. +- **Walk-through `LFG the rest`:** same as routing option B, but scoped to the findings the user hadn't decided on. Apply findings join the in-memory Apply set with the ones the user already picked during the walk-through; all dispatch together in the single end-of-walk-through Apply pass. + +Failure during `Proceed` (e.g., an Open Questions append fails for one finding during a batch Defer) follows the failure path defined in `references/open-questions-defer.md` — surface the failure inline with Retry / Fall back / Convert to Skip, continue with the rest of the plan, and capture the failure in the completion report's failure section. + +--- + +## Edge cases + +- **Zero findings in a bucket:** omit the bucket header. A preview with only Apply and Skip does not show an empty `Appending to Open Questions (0):` line. 
+- **All findings in one bucket:** preview still shows the bucket header; Proceed / Cancel still offered. This is the common case for routing option C (every finding under `Appending to Open Questions`). +- **N=1 preview (only one finding in scope):** the preview still uses the grouped format, just with a single-line bucket. `Proceed` / `Cancel` still apply. +- **Open Questions append unavailable** (document is read-only, append flow reports no-go): routing option C is not offered upstream (see `references/open-questions-defer.md` unavailability handling). LFG (option B) and walk-through `LFG the rest` can still run — they may contain per-finding Defer recommendations from synthesis. Before rendering any LFG-shaped preview, downgrade every Defer recommendation to Skip when the session's cached append-availability is false, and surface the downgrade on the preview itself (e.g., a `Skipping — append unavailable (N):` bucket, or a note in the header: `N Defer recommendations downgraded to Skip — document is read-only.`). +- **Walk-through `LFG the rest` with zero remaining findings:** the walk-through's own logic suppresses `LFG the rest` as an option when N=1 and otherwise, so the preview should never be invoked with zero remaining findings. If it is, render `LFG plan — 0 remaining findings` and fall through to Proceed with no-op. diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json b/plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json index 2948a1b50..9a46a5a5f 100644 --- a/plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json +++ b/plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json @@ -45,8 +45,8 @@ }, "autofix_class": { "type": "string", - "enum": ["auto", "present"], - "description": "How this issue should be handled. 
auto = one clear correct fix that can be applied silently (terminology, formatting, cross-references, completeness corrections, additions mechanically implied by other content). present = requires individual user judgment." + "enum": ["safe_auto", "gated_auto", "manual"], + "description": "How this issue should be handled. safe_auto = one clear correct fix applied silently (typo, wrong count, stale cross-reference, mechanically-implied addition, terminology drift). gated_auto = concrete fix exists but touches document meaning or scope and warrants user confirmation (substantive additions from codebase-pattern-resolved fixes, framework-native-API substitutions, missing standard controls). manual = requires user judgment; multiple valid approaches. Low-confidence manual findings surface in an FYI subsection at the presentation layer." }, "finding_type": { "type": "string", diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md b/plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md new file mode 100644 index 000000000..6e3f2d125 --- /dev/null +++ b/plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md @@ -0,0 +1,171 @@ +# Open Questions Deferral + +This reference defines the Defer action's in-doc append mechanic. When the user chooses Defer on a finding (from the walk-through or from the bulk-preview Append-to-Open-Questions path), an entry for that finding appends to a `## Deferred / Open Questions` section at the end of the document under review. + +Interactive mode only. Invoked by `references/walkthrough.md` (per-finding Defer option) and `references/bulk-preview.md` (routing option C Proceed). + +--- + +## Append flow + +### Step 1: Locate or create the Open Questions section + +Scan the document for an existing `## Deferred / Open Questions` heading (case-sensitive match on the full heading text). 
Behavior by location: + +- **Heading present at the end of the document (last `##`-level section):** append new content inside this section at the end. +- **Heading present mid-document (not the last `##`-level section):** still append inside the existing heading at that location. Do not create a duplicate at the end — the user positioned the section deliberately. +- **Heading absent:** create `## Deferred / Open Questions` at the end of the document. If the document has a trailing horizontal-rule separator (`---`) or a trailing footer (table, links section), insert the new section above it. If the document has only frontmatter and no body, create the section after the frontmatter block (not at byte 0). + +### Step 2: Locate or create the timestamped subsection + +Within the Open Questions section, scan for a subsection heading matching the current review date: `### From YYYY-MM-DD review`. Behavior: + +- **Subsection present:** append new entries to it. Multiple Defer actions within a single review session accumulate under the same subsection. +- **Subsection absent:** create `### From YYYY-MM-DD review` as the last subsection within the Open Questions section. Insert one blank line before the heading for readability. + +Date format: ISO 8601 calendar date (`YYYY-MM-DD`). If multiple reviews occur on the same document on the same day within the same session, they still share the same subsection. Multi-day same-document reviews get distinct subsections, which is the intended behavior. + +### Step 3: Format and append the entry + +Per deferred finding, append a bullet-point entry with the following fields. 
The visible content is a reader-friendly summary; an HTML comment on the entry persists the dedup-key fields so Step 4's compound-key check can run reliably across retries and same-day reruns without requiring the entry format itself to carry machine-oriented metadata:
+
+```
+- **{title}** — {section} ({severity}, {reviewer}, confidence {confidence})
+
+  {why_it_matters}
+
+  <!-- section="{normalized_section}" title="{normalized_title}" evidence="{evidence_fingerprint}" -->
+```
+
+Fields come from the finding's schema:
+
+- `{title}` — the finding's title field
+- `{section}` — the finding's section field, unmodified (human-readable)
+- `{severity}` — P0 / P1 / P2 / P3
+- `{reviewer}` — the persona that produced the finding (after dedup, the persona with the highest confidence; surface all co-flagging personas if multiple)
+- `{confidence}` — rounded to 2 decimal places
+- `{why_it_matters}` — the full why_it_matters text, preserving the framing guidance from the subagent template
+
+HTML-comment fields (machine-readable, used by Step 4 dedup):
+
+- `{normalized_section}` — `normalize(section)` (lowercase, punctuation-stripped, whitespace-collapsed)
+- `{normalized_title}` — `normalize(title)` (same normalization)
+- `{evidence_fingerprint}` — first ~120 chars of the finding's first evidence quote, word-boundary-preserving, internal quotes escaped; empty string when the finding had no evidence
+
+Do not include `suggested_fix` or the full `evidence` array in the appended entry. Those live in the review run artifact (when applicable) and do not belong in the document's Open Questions section — the entry is a concern summary for the reader returning later, not a full decision packet. The HTML-comment dedup-key line is the minimum machine-oriented metadata required for reliable idempotence and deliberately sits on a single line with simple `key="value"` shape so a retry can parse it without a markdown parser.
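As an illustration of the entry shape described above, a minimal renderer might look like the sketch below. The helper names (`normalize`, `render_entry`) and the dict-shaped finding are assumptions for the sketch, not identifiers mandated by the skill; the fingerprint slice is simplified (a plain cut, where a production version would back off to a word boundary and escape internal quotes):

```python
import re


def normalize(s: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace (synthesis-dedup rule).
    s = re.sub(r"[^\w\s]", "", s.lower())
    return " ".join(s.split())


def render_entry(finding: dict) -> str:
    """Render one Open Questions entry plus its dedup-key HTML comment."""
    evidence = finding.get("evidence") or [""]
    fingerprint = evidence[0][:120]  # simplified slice; see caveats above
    comment = (
        f'<!-- section="{normalize(finding["section"])}" '
        f'title="{normalize(finding["title"])}" '
        f'evidence="{fingerprint}" -->'
    )
    return (
        f'- **{finding["title"]}** — {finding["section"]} '
        f'({finding["severity"]}, {finding["reviewer"]}, '
        f'confidence {finding["confidence"]:.2f})\n\n'
        f'  {finding["why_it_matters"]}\n\n'
        f'  {comment}'
    )
```

A retry can then parse the `key="value"` pairs back out of the comment with a one-line regex, which is the whole point of keeping the comment on a single line.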
+
+### Step 4: Idempotence on compound-key collisions
+
+If an entry with the same compound key already exists under the same `### From YYYY-MM-DD review` subsection, do not append a duplicate. This can happen when:
+
+- The same review session re-routes the same finding to Defer a second time (rare but possible via LFG-the-rest after a walk-through Defer)
+- The orchestrator retries after a partial failure
+
+**Compound key for dedup:** `normalize(section) + normalize(title) + evidence_fingerprint`, reconstructed from each existing entry's dedup-key HTML comment (see Step 3 entry format). For a new finding about to append, compute the same fields from the finding's schema data; for existing entries, parse them out of the HTML comment. Match on all three fields.
+
+- `normalize(section)` and `normalize(title)` use the same normalization as synthesis step 3.3 dedup (lowercase, strip punctuation, collapse whitespace)
+- `evidence_fingerprint` is the first ~120 characters of the finding's first evidence quote (same slice used in the decision primer — see `SKILL.md` under "Decision primer"). When no evidence is available on the new finding, fall back to section+title alone. When an existing entry's HTML comment has `evidence=""`, treat the entry as evidence-less and also fall back to section+title for that comparison.
+
+Title-only dedup is not sufficient: two different findings in the same document (even on the same review date) can legitimately share a short title if their sections and evidence differ. Using only `{title}` would silently drop one of them — losing user-visible backlog context. The compound key mirrors the R29/R30 matching predicate (`section + title + evidence-substring overlap`) so cross-round and intra-round dedup behave consistently.
+
+**Legacy entries without dedup-key comments:** entries written before this format (if any survive in the wild) lack the HTML comment.
When Step 4 encounters such an entry, fall back to title-only comparison for that entry — imperfect, but strictly better than duplicate-appending. This is a backwards-compat behavior for legacy data, not a sanctioned format. + +On collision, record the no-op in the completion report's Coverage section so the user sees the duplicate was suppressed. Cross-subsection collisions (same compound key, different dates) are not deduplicated — each review is allowed to re-raise the same concern. + +--- + +## Concurrent edit safety + +Document edits happen via the platform's edit tool (Edit in Claude Code, or equivalent). Before every append, re-read the document from disk to reduce the window for user-in-editor concurrent-write collisions. If the document's mtime or content has changed unexpectedly between a prior read and the append attempt, abort the append and surface the situation via the failure path below. The user may be editing in their editor during the review session and simultaneous writes would corrupt the document. + +The orchestrator only holds the most recent read in memory, not a persistent lock — interactive review doesn't need lock coordination; it needs observation-before-write. + +--- + +## Failure path + +When the append cannot complete — document is read-only on disk, path is invalid, the platform's edit tool returns an error, concurrent-edit collision detected, or any other write failure — surface the failure inline to the user via the platform's blocking question tool with the following sub-question: + +**Stem:** `Couldn't append the finding to Open Questions. What should the agent do?` + +**Options (exactly three; fixed order):** + +``` +A. Retry the append +B. Record the deferral in the completion report only (don't mutate the document) +C. Convert this finding to Skip +``` + +**Dispatch:** + +- **A Retry** — try the append again. On repeated failure, loop back to the same sub-question. 
+- **B Record only** — skip the document mutation; record the Deferred action in the completion report with a note that the append failed. The finding does not end up in the document but the user sees in the report that they deferred it. +- **C Convert to Skip** — record the finding as Skip with an explanatory reason ("append to Open Questions failed: "). The finding is treated as no-action for the remainder of the session. + +Silent failure is not acceptable. If the user does not respond to the sub-question (session ends, terminal disconnects), default to option B so the in-memory decision state stays consistent even if the document wasn't written. + +--- + +## Upstream availability signal + +The walk-through and bulk-preview check append-availability before offering Defer as an option. When the document is known-unwritable (e.g., initial read shows it's on a read-only filesystem), the orchestrator caches an `append_available: false` signal at Phase 4 start and Defer is suppressed in the walk-through menu and in the routing question's option C. See `references/walkthrough.md` under "Adaptations" for the menu behavior and `references/bulk-preview.md` under "Edge cases" for the preview behavior. + +When append-availability is true at Phase 4 start but an individual append fails mid-flow, the failure path above handles the specific finding — this does not flip the session-level cached signal (other findings may still append successfully if the failure was transient). + +--- + +## Example appended content + +Starting document state: + +```markdown +## Risks + +...existing content... + +## Deferred / Open Questions + +### From 2026-04-10 review + +- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87) + + The alias exists without documented external consumers... + + + +``` + +After appending two findings in a 2026-04-18 session: + +```markdown +## Risks + +...existing content... 
+ +## Deferred / Open Questions + +### From 2026-04-10 review + +- **Alias compatibility-theater concern** — Risks (P1, scope-guardian, confidence 0.87) + + The alias exists without documented external consumers... + + + +### From 2026-04-18 review + +- **Unit 2/3 merge judgment call** — Scope Boundaries (P2, scope-guardian, confidence 0.78) + + The two units update consumer sites that deploy together. Splitting + adds dependency tracking without enabling independent delivery. + + + +- **Strawman alternatives on migration strategy** — Unit 3 Files (P2, coherence, confidence 0.72) + + The fix options list (a) through (c) as alternatives, but (b) and (c) + are "accept the regression" framings that don't solve the problem the + finding describes. + + +``` diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/review-output-template.md b/plugins/compound-engineering/skills/ce-doc-review/references/review-output-template.md index 7f19a3912..678ad8129 100644 --- a/plugins/compound-engineering/skills/ce-doc-review/references/review-output-template.md +++ b/plugins/compound-engineering/skills/ce-doc-review/references/review-output-template.md @@ -1,9 +1,11 @@ # Document Review Output Template -Use this **exact format** when presenting synthesized review findings. Findings are grouped by severity, not by reviewer. +Use this **exact format** when presenting synthesized review findings in Interactive mode. Findings are grouped by severity, not by reviewer. **IMPORTANT:** Use pipe-delimited markdown tables (`| col | col |`). Do NOT use ASCII box-drawing characters. +This template describes the Phase 4 interactive presentation — what the user sees before the routing question (`references/walkthrough.md`) fires. The headless-mode envelope is documented in `references/synthesis-and-presentation.md` (Phase 4 "Route Remaining Findings" section) and is separate from this template. 
+ ## Example ```markdown @@ -15,9 +17,9 @@ Use this **exact format** when presenting synthesized review findings. Findings - security-lens -- plan adds public API endpoint with auth flow - scope-guardian -- plan has 15 requirements across 3 priority levels -Applied 5 auto-fixes. 4 findings to consider (2 errors, 2 omissions). +Applied 5 safe_auto fixes. 4 actionable findings to consider (2 errors, 2 omissions). 2 FYI observations. -### Auto-fixes Applied +### Applied safe_auto fixes - Standardized "pipeline"/"workflow" terminology to "pipeline" throughout (coherence) - Fixed cross-reference: Section 4 referenced "Section 3.2" which is actually "Section 3.1" (coherence) @@ -29,34 +31,45 @@ Applied 5 auto-fixes. 4 findings to consider (2 errors, 2 omissions). #### Errors -| # | Section | Issue | Reviewer | Confidence | -|---|---------|-------|----------|------------| -| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | +| # | Section | Issue | Reviewer | Confidence | Tier | +|---|---------|-------|----------|------------|------| +| 1 | Requirements Trace | Goal states "offline support" but technical approach assumes persistent connectivity | coherence | 0.92 | manual | ### P1 -- Should Fix #### Errors -| # | Section | Issue | Reviewer | Confidence | -|---|---------|-------|----------|------------| -| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | +| # | Section | Issue | Reviewer | Confidence | Tier | +|---|---------|-------|----------|------------|------| +| 2 | Scope Boundaries | 8 of 12 units build admin infrastructure; only 2 touch stated goal | scope-guardian | 0.80 | manual | #### Omissions -| # | Section | Issue | Reviewer | Confidence | -|---|---------|-------|----------|------------| -| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | 
feasibility | 0.85 | +| # | Section | Issue | Reviewer | Confidence | Tier | +|---|---------|-------|----------|------------|------| +| 3 | Implementation Unit 3 | Plan proposes custom auth but does not mention existing Devise setup or migration path | feasibility | 0.85 | gated_auto | ### P2 -- Consider Fixing #### Omissions -| # | Section | Issue | Reviewer | Confidence | -|---|---------|-------|----------|------------| -| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | +| # | Section | Issue | Reviewer | Confidence | Tier | +|---|---------|-------|----------|------------|------| +| 4 | API Design | Public webhook endpoint has no rate limiting mentioned | security-lens | 0.75 | gated_auto | + +### FYI Observations + +Low-confidence observations surfaced without requiring a decision. Content advisory only. + +| # | Section | Observation | Reviewer | Confidence | +|---|---------|-------------|----------|------------| +| 1 | Naming | Filename `plan.md` is asymmetric with command name `user-auth`; could go either way | coherence | 0.52 | +| 2 | Risk Analysis | Rollout-cadence decision may benefit from monitoring thresholds, though not blocking | scope-guardian | 0.48 | ### Residual Concerns +Residual concerns are issues the reviewers noticed but could not confirm with above-gate confidence. These are not actionable; they appear here for transparency only and are not promoted into the review surface. + | # | Concern | Source | |---|---------|--------| | 1 | Migration rollback strategy not addressed for Phase 2 data changes | feasibility | @@ -69,21 +82,22 @@ Applied 5 auto-fixes. 4 findings to consider (2 errors, 2 omissions). 
### Coverage -| Persona | Status | Findings | Auto | Present | Residual | -|---------|--------|----------|------|---------|----------| -| coherence | completed | 4 | 3 | 1 | 0 | -| feasibility | completed | 2 | 1 | 1 | 1 | -| security-lens | completed | 2 | 1 | 1 | 0 | -| scope-guardian | completed | 1 | 0 | 1 | 0 | -| product-lens | not activated | -- | -- | -- | -- | -| design-lens | not activated | -- | -- | -- | -- | +| Persona | Status | Findings | SafeAuto | GatedAuto | Manual | FYI | Residual | +|---------|--------|----------|----------|-----------|--------|-----|----------| +| coherence | completed | 5 | 3 | 0 | 1 | 1 | 0 | +| feasibility | completed | 3 | 1 | 1 | 0 | 0 | 1 | +| security-lens | completed | 2 | 1 | 1 | 0 | 0 | 0 | +| scope-guardian | completed | 2 | 0 | 0 | 1 | 1 | 0 | +| product-lens | not activated | -- | -- | -- | -- | -- | -- | +| design-lens | not activated | -- | -- | -- | -- | -- | -- | ``` ## Section Rules -- **Summary line**: Always present after the reviewer list. Format: "Applied N auto-fixes. K findings to consider (X errors, Y omissions)." Omit any zero clause. -- **Auto-fixes Applied**: List all fixes that were applied automatically (auto class). Include enough detail per fix to convey the substance -- especially for fixes that add content or touch document meaning. Omit section if none. -- **P0-P3 sections**: Only include sections that have findings. Omit empty severity levels. Within each severity, separate into **Errors** and **Omissions** sub-headers. Omit a sub-header if that severity has none of that type. -- **Residual Concerns**: Findings below confidence threshold that were promoted by cross-persona corroboration, plus unpromoted residual risks. Omit if none. +- **Summary line**: Always present after the reviewer list. Format: "Applied N safe_auto fixes. K actionable findings to consider (X errors, Y omissions). Z FYI observations." 
Omit any zero clause except the FYI clause when zero (it's informative that none surfaced). +- **Applied safe_auto fixes**: List all fixes that were applied automatically (`safe_auto` tier). Include enough detail per fix to convey the substance — especially for fixes that add content or touch document meaning. Omit section if none. +- **P0-P3 sections**: Only include sections that have actionable findings (`gated_auto` or `manual`). Omit empty severity levels. Within each severity, separate into **Errors** and **Omissions** sub-headers. Omit a sub-header if that severity has none of that type. The `Tier` column surfaces whether a finding is `gated_auto` (concrete fix exists, Apply recommended in walk-through) or `manual` (requires user judgment). +- **FYI Observations**: Low-confidence `manual` findings above the 0.40 FYI floor but below the per-severity gate. Surface here for transparency; these are not actionable and do not enter the walk-through. Omit section if none. +- **Residual Concerns**: Residual concerns noted by personas that did not make it above the confidence gate. Listed for transparency; not promoted into the review surface (cross-persona agreement boost runs on findings that already survived the gate, per synthesis step 3.4). Omit section if none. - **Deferred Questions**: Questions for later workflow stages. Omit if none. -- **Coverage**: Always include. All counts are **post-synthesis**. **Findings** must equal Auto + Present exactly -- if deduplication merged a finding across personas, attribute it to the persona with the highest confidence and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). +- **Coverage**: Always include. All counts are **post-synthesis**. 
**Findings** must equal SafeAuto + GatedAuto + Manual + FYI exactly — if deduplication merged a finding across personas, attribute it to the persona with the highest confidence and reduce the other persona's count. **Residual** = count of `residual_risks` from this persona's raw output (not the promoted subset in the Residual Concerns section). diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md b/plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md index 7baf1b95f..aa49b1732 100644 --- a/plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md +++ b/plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md @@ -1,6 +1,6 @@ # Document Review Sub-agent Prompt Template -This template is used by the document-review orchestrator to spawn each reviewer sub-agent. Variable substitution slots are filled at dispatch time. +This template is used by the ce-doc-review orchestrator to spawn each reviewer sub-agent. Variable substitution slots are filled at dispatch time. --- @@ -19,34 +19,89 @@ Return ONLY valid JSON matching the findings schema below. No prose, no markdown {schema} Rules: + - You are a leaf reviewer inside an already-running compound-engineering review workflow. Do not invoke compound-engineering skills or agents unless this template explicitly instructs you to. Perform your analysis directly and return findings in the required output format only. - Suppress any finding below your stated confidence floor (see your Confidence calibration section). -- Every finding MUST include at least one evidence item -- a direct quote from the document. +- Every finding MUST include at least one evidence item — a direct quote from the document. - You are operationally read-only. Analyze the document and produce findings. Do not edit the document, create files, or make changes. 
You may use non-mutating tools (file reads, glob, grep, git log) to gather context about the codebase when evaluating feasibility or existing patterns. +- **Exclude prior-round deferred entries from review scope.** If the document under review contains a `## Deferred / Open Questions` section or subsections such as `### From YYYY-MM-DD review`, ignore that content — it is review output from prior rounds, not part of the document's actual plan/requirements content. Do not flag entries inside it as new findings. Do not quote its text as evidence. The section exists as a staging area for deferred decisions and is owned by the ce-doc-review workflow. - Set `finding_type` for every finding: - - `error`: Something the document says that is wrong -- contradictions, incorrect statements, design tensions, incoherent tradeoffs. - - `omission`: Something the document forgot to say -- missing mechanical steps, absent list entries, undefined thresholds, forgotten cross-references. -- Set `autofix_class` based on whether there is one clear correct fix, not on severity or importance: - - `auto`: One clear correct fix, applied silently. 
This includes trivial fixes AND substantive ones: - - Internal reconciliation -- one document part authoritative over another (summary/detail mismatches, wrong counts, stale cross-references, terminology drift) - - Implied additions -- correct content mechanically obvious from the document (missing steps, unstated thresholds, completeness gaps) - - Codebase-pattern-resolved -- an established codebase pattern resolves ambiguity (cite the specific file/function in `why_it_matters`) - - Incorrect behavior -- the document describes behavior that is factually wrong, and the correct behavior is obvious from context or the codebase - - Missing standard security measures -- HTTPS enforcement, checksum verification, input sanitization, private IP rejection, or other controls with known implementations where omission is clearly a bug - - Incomplete technical descriptions -- the accurate/complete version is directly derivable from the codebase - - Missing requirements that follow mechanically from the document's own explicit, concrete decisions (not high-level goals -- a goal can be satisfied by multiple valid requirements) - The test is not "is this fix important?" but "is there more than one reasonable way to fix this?" If a competent implementer would arrive at the same fix independently, it is auto -- even if the fix is substantive. Always include `suggested_fix`. NOT auto if more than one reasonable fix exists or if scope/priority judgment is involved. - - `present`: Requires user judgment -- genuinely multiple valid approaches where the right choice depends on priorities, tradeoffs, or context the reviewer does not have. Examples: architectural choices with real tradeoffs, scope decisions, feature prioritization, UX design choices. -- `suggested_fix` is required for `auto` findings. For `present` findings, include only when the fix is obvious. 
+ - `error`: Something the document says that is wrong — contradictions, incorrect statements, design tensions, incoherent tradeoffs. + - `omission`: Something the document forgot to say — missing mechanical steps, absent list entries, undefined thresholds, forgotten cross-references. +- Set `autofix_class` based on whether there is one clear correct fix, not on severity or importance. Three tiers: + - `safe_auto`: One clear correct fix, applied silently. Use only when there is genuinely one right answer — typo, wrong count, stale cross-reference, missing mechanically-implied step, terminology drift, factually incorrect behavior where the correct behavior is derivable from context. Always include `suggested_fix`. + - `gated_auto`: A concrete fix exists but it touches document meaning, scope, or author intent in a way that warrants a one-click confirmation before applying. Use for: substantive additions implied by the document's own decisions, codebase-pattern-resolved fixes, framework-native-API substitutions, missing standard security/reliability controls with known implementations. Always include `suggested_fix`. `gated_auto` is the default tier for "I know the fix, but the author should sign off." + - `manual`: Requires user judgment — genuinely multiple valid approaches where the right choice depends on priorities, tradeoffs, or context the reviewer does not have. Examples: architectural choices with real tradeoffs, scope decisions, feature prioritization, UX design choices. Include `suggested_fix` only when the fix is obvious despite the judgment call. + +- **Strawman-aware classification rule.** When listing alternatives to the primary fix, count only alternatives a competent implementer would genuinely weigh. A "do nothing / accept the defect" option is NOT a real alternative — it is the failure state the finding describes. 
The same applies to framings like "document in release notes," "accept drift," or "defer to later" when they sidestep the actual problem rather than solving it. If the only alternatives to the primary fix are strawmen (the problem persists under them), the finding is `safe_auto` or `gated_auto`, not `manual`. + + Positive example: "Cache key collision causes stale reads. Fix: include user-id in the cache key. Alternative: never cache this data." → The alternative (disable caching) is a legitimate design choice with real tradeoffs — `manual`. + + Negative example: "Silent read-side failure on renamed config files. Fix: read new name, fall back to old with deprecation warning. Alternative: accept drift and document in release notes." → The alternative does not solve the problem; users on mid-flight runs still hit the failure. Treat as `gated_auto` with the concrete fix. + +- **Strawman safeguard on `safe_auto`.** If you classify a finding as `safe_auto` via strawman-dismissal of alternatives, name the dismissed alternatives explicitly in `why_it_matters` so synthesis and the reader can see the reasoning. When ANY non-strawman alternative exists (even if you judge it weak), downgrade to `gated_auto` — silent auto-apply is reserved for findings with genuinely one option. 
+ +- **Auto-promotion patterns** (findings eligible for `safe_auto` or `gated_auto` even when they're substantive): + - Factually incorrect behavior where the correct behavior is derivable from context or the codebase + - Missing standard security or reliability controls with established implementations (HTTPS enforcement, checksum verification, input sanitization, private IP rejection, fallback-with-deprecation-warning on renames) + - Codebase-pattern-resolved fixes that cite a specific existing pattern in a concrete file or function (the citation is required in `why_it_matters`) + - Framework-native-API substitutions — a hand-rolled implementation duplicates first-class framework behavior (cite the framework API in `why_it_matters`) + - Completeness additions mechanically implied by the document's own explicit decisions (not high-level goals — a goal can be satisfied by multiple valid requirements) + +- `suggested_fix` is required for `safe_auto` and `gated_auto` findings. For `manual` findings, include only when the fix is obvious. - If you find no issues, return an empty findings array. Still populate residual_risks and deferred_questions if applicable. - Use your suppress conditions. Do not flag issues that belong to other personas. + +Writing `why_it_matters` (required field, every finding): + +The `why_it_matters` field is how the reader — a developer triaging findings, a reader returning to the doc months later, a downstream automated surface — understands the problem without re-reading the file. Treat it as the most important prose field in your output; every downstream surface (walk-through questions, bulk-action previews, Open Questions entries, headless output) depends on it being good. + +- **Lead with observable consequence.** Describe what goes wrong from the reader's or implementer's perspective — what breaks, what gets misread, what decision gets made wrong, what the downstream audience experiences. 
Do not lead with document structure ("Section X on line Y says..."). Start with the effect ("Implementers will disagree on which tier applies when..."). Section references appear later, only when the reader needs them to locate the issue. +- **Explain why the fix resolves the problem.** If you include a `suggested_fix`, the `why_it_matters` should make clear why that specific fix addresses the root cause. When a similar pattern exists elsewhere in the document or codebase (a parallel section, an established convention, a cited code pattern), reference it so the recommendation is grounded in what the team has already chosen. +- **Keep it tight.** Approximately 2-4 sentences. Longer framings are a regression — downstream surfaces have narrow display budgets, and verbose content gets truncated or skimmed. +- **Always produce substantive content.** `why_it_matters` is required by the schema. Empty strings, nulls, and single-phrase entries are validation failures. If you found something worth flagging (confidence at or above your persona's floor), you can explain it — the field exists because every finding needs a reason. + +Illustrative pair — same finding, weak vs. strong framing: + +``` +WEAK (document-citation first; fails the observable-consequence rule): + Section "Classification Tiers" lists four tiers but Section "Synthesis" + routes three. Reconcile. + +STRONG (observable consequence first, grounded fix reasoning): + Implementers will disagree on which tier a finding lands in, because + the Classification Tiers section enumerates four values while the + Synthesis routing only handles three. The document does not say which + enumeration is authoritative. Suggest the Classification Tiers list is + authoritative; drop the fourth value from the tier definition since + Synthesis already lacks a route for it. +``` + +False-positive categories to actively suppress: + +- Pedantic style nitpicks (word choice, bullet vs. 
numbered lists, comma-vs-semicolon) — style belongs to the document author +- Issues that belong to other personas (see your Suppress conditions) +- Findings already resolved elsewhere in the document (search before flagging) +- Content inside `## Deferred / Open Questions` sections (prior-round review output, not document content) Document type: {document_type} Document path: {document_path} +{decision_primer} + Document content: {document_content} + + +When the `{decision_primer}` block above lists entries (round 2+), honor them: + +- Do not re-raise a finding whose title and evidence pattern-match a prior-round rejected (Skipped or Deferred) entry, unless the current document state makes the concern materially different. "Materially different" means the section was substantively edited and your evidence quote no longer appears in the current text — a light-touch edit doesn't count. +- Prior-round Applied findings are informational: the orchestrator verifies those landed via its own matching predicate. You do not need to re-surface them. If the applied fix did not actually land (you find the same issue at the same location), flag it — synthesis will recognize it via the R30 fix-landed predicate. +- Round 1 (no prior decisions) runs with no primer constraints. + +This is a soft instruction; the orchestrator enforces the rule authoritatively via synthesis-level suppression (R29) regardless of persona behavior. Following the primer here reduces noisy re-raises and keeps the Coverage section clean. 
+ ``` diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md b/plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md index f7f7a20fb..b808e76d7 100644 --- a/plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md +++ b/plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md @@ -2,163 +2,259 @@ ## Phase 3: Synthesize Findings -Process findings from all agents through this pipeline. **Order matters** -- each step depends on the previous. +Process findings from all agents through this pipeline. Order matters — each step depends on the previous. The pipeline implements the finding-lifecycle state machine: **Raised → (Confidence Gate | FYI-eligible | Dropped) → Deduplicated → Classified → SafeAuto | GatedAuto | Manual | FYI**. Re-evaluate state at each step boundary; do not carry forward assumptions from earlier steps as prose-level shortcuts. ### 3.1 Validate Check each agent's returned JSON against the findings schema: + - Drop findings missing any required field defined in the schema -- Drop findings with invalid enum values +- Drop findings with invalid enum values (including the pre-rename `auto` / `present` values from older personas — treat those as malformed until all persona output has been regenerated) - Note the agent name for any malformed output in the Coverage section -### 3.2 Confidence Gate +### 3.2 Confidence Gate (Per-Severity) + +Gate findings using per-severity thresholds rather than a single flat floor: + +| Severity | Gate | +|----------|------| +| P0 | 0.50 | +| P1 | 0.60 | +| P2 | 0.65 | +| P3 | 0.75 | + +Findings at or above their severity's gate survive for classification. Findings below the gate are evaluated for FYI-eligibility: -Suppress findings below 0.50 confidence. Store them as residual concerns for potential promotion in step 3.4. 
+- **FYI-eligible** when `autofix_class` is `manual` and confidence is between 0.40 (FYI floor) and the severity gate. These surface in an FYI subsection at the presentation layer (see 3.7) but do not enter the walk-through or any bulk action — they exist as observational value without forcing a decision. +- **Dropped** when confidence is below 0.40, or when the finding is `safe_auto` / `gated_auto` but below the severity gate (auto-apply findings need confidence above the decision gate to avoid silent mistakes). + +Record the drop count and the FYI count in Coverage. ### 3.3 Deduplicate Fingerprint each finding using `normalize(section) + normalize(title)`. Normalization: lowercase, strip punctuation, collapse whitespace. When fingerprints match across personas: -- If the findings recommend **opposing actions** (e.g., one says cut, the other says keep), do not merge -- preserve both for contradiction resolution in 3.5 + +- If the findings recommend opposing actions (e.g., one says cut, the other says keep), do not merge — preserve both for contradiction resolution in 3.5 - Otherwise merge: keep the highest severity, keep the highest confidence, union all evidence arrays, note all agreeing reviewers (e.g., "coherence, feasibility") -- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence. Decrement the losing persona's Findings count *and* the corresponding route bucket (Auto or Present) so `Findings = Auto + Present` stays exact. +- **Coverage attribution:** Attribute the merged finding to the persona with the highest confidence. Decrement the losing persona's Findings count and the corresponding route bucket so totals stay exact. + +### 3.4 Cross-Persona Agreement Boost -### 3.4 Promote Residual Concerns +When 2+ independent personas flagged the same merged finding (from 3.3), boost the merged finding's confidence by +0.10 (capped at 1.0). 
Independent corroboration is strong signal — multiple reviewers converging on the same issue is more reliable than any single reviewer's confidence. Note the boost in the Reviewer column of the output (e.g., "coherence, feasibility +0.10"). -Scan the residual concerns (findings suppressed in 3.2) for: -- **Cross-persona corroboration**: A residual concern from Persona A overlaps with an above-threshold finding from Persona B. Promote at P2 with confidence 0.55-0.65. Inherit `finding_type` from the corroborating above-threshold finding. -- **Concrete blocking risks**: A residual concern describes a specific, concrete risk that would block implementation. Promote at P2 with confidence 0.55. Set `finding_type: omission` (blocking risks surfaced as residual concerns are inherently about something the document failed to address). +This replaces the earlier residual-concern promotion step. Findings below the confidence gate are not promoted back into the review surface; they appear in Coverage as residual concerns only. If a below-gate finding is genuinely important, the reviewer should raise their confidence or provide stronger evidence rather than relying on a promotion rule. 
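The gate table in 3.2 and the boost rule above compose into a small amount of routing logic. A sketch with illustrative names; the pipeline itself is prose-driven, so this is an executable restatement of the rules, not skill tooling.

```python
SEVERITY_GATE = {"P0": 0.50, "P1": 0.60, "P2": 0.65, "P3": 0.75}
FYI_FLOOR = 0.40          # below this, findings are dropped outright
AGREEMENT_BOOST = 0.10    # step 3.4, applied once per merged finding

def route(severity: str, confidence: float, autofix_class: str) -> str:
    """Step 3.2: route a finding to 'actionable', 'fyi', or 'dropped'."""
    if confidence >= SEVERITY_GATE[severity]:
        return "actionable"
    # Only manual findings are FYI-eligible; below-gate auto tiers are
    # dropped so silent or one-click fixes never apply on shaky confidence.
    if autofix_class == "manual" and confidence >= FYI_FLOOR:
        return "fyi"
    return "dropped"

def boost(confidence: float, reviewer_count: int) -> float:
    """Step 3.4: cross-persona agreement boost, capped at 1.0."""
    if reviewer_count >= 2:
        return min(1.0, confidence + AGREEMENT_BOOST)
    return confidence
```

Note the pipeline ordering this encodes: `route` runs at 3.2 and `boost` at 3.4, so a below-gate finding cannot be rescued by corroboration, matching the no-promotion rule above.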
### 3.5 Resolve Contradictions When personas disagree on the same section: -- Create a **combined finding** presenting both perspectives -- Set `autofix_class: present` -- Set `finding_type: error` (contradictions are by definition about conflicting things the document says, not things it omits) + +- Create a combined finding presenting both perspectives +- Set `autofix_class: manual` (contradictions are by definition judgment calls) +- Set `finding_type: error` (contradictions are about conflicting things the document says, not things it omits) - Frame as a tradeoff, not a verdict Specific conflict patterns: -- Coherence says "keep for consistency" + scope-guardian says "cut for simplicity" -> combined finding, let user decide -- Feasibility says "this is impossible" + product-lens says "this is essential" -> P1 finding framed as a tradeoff -- Multiple personas flag the same issue -> merge into single finding, note consensus, increase confidence -### 3.6 Promote Pattern-Resolved Findings +- Coherence says "keep for consistency" + scope-guardian says "cut for simplicity" → combined finding, let user decide +- Feasibility says "this is impossible" + product-lens says "this is essential" → P1 finding framed as a tradeoff +- Multiple personas flag the same issue (no disagreement) → handled in 3.3 merge, not here + +### 3.5b Deterministic Recommended-Action Tie-Break + +Every merged finding carries exactly one `recommended_action` field consumed by the walk-through (`references/walkthrough.md`) to mark the `(recommended)` option, by LFG (`references/bulk-preview.md`) to choose what to execute in bulk, and by the stem's yes/no framing. When a merged finding was flagged by multiple personas who implied different actions, synthesis picks the recommended action deterministically so identical review artifacts produce identical walk-through and LFG behavior across runs. + +**Tie-break order (most conservative first):** `Skip > Defer > Apply`. 
+The first action that at least one contributing persona implied wins, scanning in that order.
+
+- If any contributing persona implied Skip → `recommended_action: Skip`
+- Else if any contributing persona implied Defer → `recommended_action: Defer`
+- Else → `recommended_action: Apply`
+
+**Persona-to-action mapping.** A persona implies an action through its classification:
-Scan `present` findings for codebase-pattern-resolved auto-eligibility. Promote `present` -> `auto` when **all three** conditions are met:
+- `safe_auto` or `gated_auto` → implies Apply
+- `manual` with a concrete `suggested_fix` and a recommended resolution → implies Apply (the persona has an opinion about what to do)
+- `manual` flagged as a tradeoff or scope question with no recommended resolution → implies Defer (worth revisiting, not worth acting now)
+- Any persona flagging the finding as low-confidence or suppression-eligible via residual concerns → implies Skip
+- Persona in the contradiction set (3.5) implying "keep as-is / do not change" → implies Skip
-1. The finding's `why_it_matters` cites a specific existing codebase pattern -- not just "best practice" or "convention," but a concrete pattern with a file, function, or usage reference
-2. The finding includes a concrete `suggested_fix` that follows that cited pattern
-3. There is no genuine tradeoff -- the codebase context resolves any ambiguity about which approach to use
+If the contributing personas are all silent on action (e.g., a merged `manual` finding from personas that all flagged it as observation without recommendation) → `recommended_action: Apply` as the pragmatic default; the walk-through still lets the user pick any of the four options.
-The principle: when a reviewer mentions multiple theoretical approaches but the codebase already has an established pattern that makes one approach clearly correct, the codebase context settles the question.
-Alternatives mentioned in passing do not create a real tradeoff if the evidence shows the codebase has already chosen.
+**Conflict-context surface.** When the tie-break fires (contributing personas implied different actions), record a one-line conflict-context string on the merged finding. The walk-through renders this on the R15 conflict-context line (see `references/walkthrough.md`). Example: `Coherence recommends Apply; scope-guardian recommends Skip. Agent's recommendation: Skip.`
-Additional auto-promotion patterns (promote `present` -> `auto` when):
-- The finding identifies factually incorrect behavior in the document and the suggested fix describes the correct behavior (not a design choice between alternatives)
-- The finding identifies a missing industry-standard security control where the document's own context makes the omission clearly wrong (not a legitimate design choice for the system described), and the suggested fix follows established practice
-- The finding identifies an incomplete technical description and the complete version is directly derivable from the codebase (the reviewer cited specific code showing what the description should say)
+**Downstream invariant.** The walk-through and bulk-preview never recompute the recommendation — they read `recommended_action` and render `(recommended)` on the matching option. LFG-the-rest and routing option B execute the `recommended_action` across the scoped finding set in bulk. This keeps LFG outcomes reproducible and auditable: the same review artifact always produces the same bulk plan.
-Do not promote if the finding involves scope or priority changes where the document author may have weighed tradeoffs invisible to the reviewer.
+### 3.6 Promote Auto-Eligible Findings
+
+Scan `manual` findings for promotion to `safe_auto` or `gated_auto`.
+Promote when the finding meets one of the consolidated auto-promotion patterns:
+
+- **Codebase-pattern-resolved.** `why_it_matters` cites a specific existing codebase pattern (concrete file/function/usage reference, not just "best practice" or "convention"), and `suggested_fix` follows that pattern. Promote to `gated_auto` — the user still confirms, but the codebase evidence resolves ambiguity.
+- **Factually incorrect behavior.** The document describes behavior that is factually wrong, and the correct behavior is derivable from context or the codebase. Promote to `gated_auto`.
+- **Missing standard security/reliability controls.** The omission is clearly a gap (not a legitimate design choice for the system described), and the fix follows established practice (HTTPS enforcement, checksum verification, input sanitization, fallback-with-deprecation-warning on renames). Promote to `gated_auto`.
+- **Framework-native-API substitutions.** A hand-rolled implementation duplicates first-class framework behavior, and the framework API is cited. Promote to `gated_auto`.
+- **Mechanically-implied completeness additions.** The missing content follows mechanically from the document's own explicit, concrete decisions (not high-level goals). Promote to `safe_auto` when there is genuinely one correct addition; `gated_auto` when the addition is substantive.
+
+Do not promote if the finding involves scope or priority changes where the author may have weighed tradeoffs invisible to the reviewer.
+
+**Strawman-downgrade safeguard.** If a `safe_auto` finding names dismissed alternatives in `why_it_matters` (per the subagent template's strawman rule), verify the alternatives are genuinely strawmen. If any alternative is a plausible design choice that the persona dismissed too aggressively, downgrade to `gated_auto` so the user sees the tradeoff before the fix applies.
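Read as code, the 3.5b tie-break plus its silent-personas default reduces to one deterministic scan. A minimal sketch, assuming a flat list of implied-action strings as input; the function name and action labels are illustrative, not part of the skill's actual schema:

```python
from typing import Iterable

# Most conservative action first: one Skip vote wins over Defer, Defer over Apply.
TIE_BREAK_ORDER = ["Skip", "Defer", "Apply"]

def recommended_action(implied_actions: Iterable[str]) -> str:
    """Return the single recommended_action for a merged finding."""
    actions = set(implied_actions)
    for action in TIE_BREAK_ORDER:
        if action in actions:
            return action
    # All contributing personas silent on action: pragmatic default per 3.5b.
    return "Apply"
```

Because the scan order is fixed, the same review artifact always yields the same recommendation, which is what keeps walk-through labels and LFG bulk plans reproducible.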
 ### 3.7 Route by Autofix Class

-**Severity and autofix_class are independent.** A P1 finding can be `auto` if the correct fix is obvious. The test is not "how important?" but "is there one clear correct fix, or does this require judgment?"
+**Severity and autofix_class are independent.** A P1 finding can be `safe_auto` if the correct fix is obvious. The test is not "how important?" but "is there one clear correct fix, or does this require judgment?"

 | Autofix Class | Route |
 |---------------|-------|
-| `auto` | Apply automatically -- one clear correct fix. Includes internal reconciliation (one part authoritative over another), additions mechanically implied by the document's own content, and codebase-pattern-resolved fixes where codebase evidence makes one approach clearly correct. |
-| `present` | Present individually for user judgment |
+| `safe_auto` | Apply silently in Phase 4. Requires `suggested_fix`. Demote to `gated_auto` if missing. |
+| `gated_auto` | Enter the per-finding walk-through with Apply marked (recommended). Requires `suggested_fix`. Demote to `manual` if missing. |
+| `manual` | Enter the per-finding walk-through with user-judgment framing. `suggested_fix` is optional. |
+| FYI-subsection | `manual` findings below the severity gate but at or above the FYI floor (0.40) — surface in a distinct FYI subsection of the presentation, do not enter the walk-through or any bulk action. |

-Demote any `auto` finding that lacks a `suggested_fix` to `present`.
+**Auto-eligible patterns for safe_auto:** summary/detail mismatch (body authoritative over overview), wrong counts, missing list entries derivable from elsewhere in the document, stale internal cross-references, terminology drift, prose/diagram contradictions where prose is more detailed, missing steps mechanically implied by other content, unstated thresholds implied by surrounding context.
-**Auto-eligible patterns:** summary/detail mismatch (body is authoritative over overview), wrong counts, missing list entries derivable from elsewhere in the document, stale internal cross-references, terminology drift, prose/diagram contradictions where prose is more detailed, missing steps mechanically implied by other content, unstated thresholds implied by surrounding context, completeness gaps where the correct addition is obvious, codebase-pattern-resolved fixes where the reviewer cites a specific existing pattern and the suggested_fix follows it, factually incorrect behavior where the correct behavior is obvious from context or the codebase, missing standard security controls with known implementations, incomplete technical descriptions where the complete version is derivable from the codebase. If the fix requires judgment about *what* to do (not just *what to write*) and the codebase context does not resolve the ambiguity, it belongs in `present`.
+**Auto-eligible patterns for gated_auto:** codebase-pattern-resolved fixes, factually incorrect behavior, missing standard security/reliability controls, framework-native-API substitutions, substantive completeness additions mechanically implied by explicit decisions.

 ### 3.8 Sort

-Sort findings for presentation: P0 -> P1 -> P2 -> P3, then by finding type (errors before omissions), then by confidence (descending), then by document order (section position).
+Sort findings for presentation: P0 → P1 → P2 → P3, then by finding type (errors before omissions), then by confidence (descending), then by document order (section position).
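The 3.8 ordering can be expressed as one composite sort key. A hedged sketch, assuming findings are flat dicts with these field names (the real schema may differ):

```python
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
TYPE_RANK = {"error": 0, "omission": 1}  # errors before omissions

def sort_findings(findings: list[dict]) -> list[dict]:
    """Order findings for presentation per step 3.8."""
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_RANK[f["severity"]],
            TYPE_RANK[f["finding_type"]],
            -f["confidence"],         # descending confidence within a tier
            f["section_position"],    # finally, document order
        ),
    )
```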
 ## Phase 4: Apply and Present

-### Apply Auto-fixes
+### Apply safe_auto fixes
+
+Apply all `safe_auto` findings to the document in a single pass:

-Apply all `auto` findings to the document in a **single pass**:
 - Edit the document inline using the platform's edit tool
-- Track what was changed for the "Auto-fixes Applied" section
-- Do not ask for approval -- these have one clear correct fix
+- Track what was changed for the "Applied safe_auto fixes" section
+- Do not ask for approval — these have one clear correct fix
+
+List every applied fix in the output summary so the user can see what changed. Use enough detail to convey the substance of each fix (section, what was changed, reviewer attribution). This is especially important for fixes that add content or touch document meaning — the user should not have to diff the document to understand what the review did.

-List every auto-fix in the output summary so the user can see what changed. Use enough detail to convey the substance of each fix (section, what was changed, reviewer attribution). This is especially important for fixes that add content or touch document meaning -- the user should not have to diff the document to understand what the review did.
+### Route Remaining Findings

-### Present Remaining Findings
+After safe_auto fixes apply, remaining findings split into buckets:

-**Headless mode:** Do not use interactive question tools. Output all non-auto findings as a structured text summary the caller can parse and act on:
+- `gated_auto` and `manual` findings at or above the severity gate → enter the routing question (see Unit 5 / `references/walkthrough.md`)
+- FYI-subsection findings → surface in the presentation only, no routing
+- Zero actionable findings remaining → skip the routing question; flow directly to Phase 5 terminal question
+
+**Headless mode:** Do not use interactive question tools.
+Output all findings as a structured text envelope the caller can parse:

 ```
 Document review complete (headless mode).

-Applied N auto-fixes:
+Applied N safe_auto fixes:

 - <section>: <what changed> (<reviewer>)
 - <section>: <what changed> (<reviewer>)

-Findings (requires judgment):
+Gated-auto findings (concrete fix, requires user confirmation):

 [P0] Section: <section> — <title> (<reviewer>, confidence <N>)
     Why: <why_it_matters>
-    Suggested fix: <suggested_fix or "none">
+    Suggested fix: <suggested_fix>
+
+Manual findings (requires judgment):

 [P1] Section: <section> — <title> (<reviewer>, confidence <N>)
     Why: <why_it_matters>
     Suggested fix: <suggested_fix or "none">

+FYI observations (low-confidence, no decision required):
+
+[P3] Section: <section> — <title> (<reviewer>, confidence <N>)
+    Why: <why_it_matters>
+
 Residual concerns:
 - <concern> (<source>)

 Deferred questions:
 - <question> (<source>)
+
+Review complete
 ```

-Omit any section with zero items. Then proceed directly to Phase 5 (which returns immediately in headless mode).
+Omit any section with zero items. The envelope preserves backward compatibility — callers that only read "Applied N safe_auto fixes" and "Manual findings" continue to work; the `Gated-auto findings` and `FYI observations` sections surface alongside, not instead of, the existing buckets. End with "Review complete" as the terminal signal so callers can detect completion.

 **Interactive mode:**

-Present `present` findings using the review output template (read `references/review-output-template.md`). Within each severity level, separate findings by type:
-- **Errors** (design tensions, contradictions, incorrect statements) first -- these need resolution
-- **Omissions** (missing steps, absent details, forgotten entries) second -- these need additions
+Present findings using the review output template (read `references/review-output-template.md`). Within each severity level, separate findings by type:
+
+- Errors (design tensions, contradictions, incorrect statements) first — these need resolution
+- Omissions (missing steps, absent details, forgotten entries) second — these need additions
+
+Brief summary at the top: "Applied N safe_auto fixes. K findings to consider (X errors, Y omissions). Z FYI observations."
+
+Include the Coverage table, applied fixes, FYI observations (as a distinct subsection), residual concerns, and deferred questions.
+
+### R29 Rejected-Finding Suppression (Round 2+)
+
+When the orchestrator is running round 2+ on the same document in the same session, the decision primer (see `SKILL.md` — Decision primer) carries forward every prior-round Skipped and Deferred finding. Synthesis suppresses re-raised rejected findings rather than re-surfacing them to the user.

-Brief summary at the top: "Applied N auto-fixes. K findings to consider (X errors, Y omissions)."
+For each current-round finding, compare against the primer's rejected list:

-Include the Coverage table, auto-fixes applied, residual concerns, and deferred questions.
+- **Matching predicate:** same as R30 — `normalize(section) + normalize(title)` fingerprint augmented with evidence-substring overlap check (>50%). If a current-round finding matches a prior-round rejected finding on fingerprint AND evidence overlap, drop the current-round finding.
+- **Materially-different exception:** if the current document state has changed around the finding's section since the prior round (e.g., the section was edited and the evidence quote no longer appears in the current text), treat the finding as new — the underlying context shifted and the concern may be genuinely different now. The persona's evidence itself reveals this: a quote that doesn't appear in the current document is a signal the prior-round rejection no longer applies.
+- **On suppression:** record the drop in Coverage with a "previously rejected, re-raised this round" note so the user can see what was suppressed. The user can explicitly escalate by invoking the review again on a different context if they believe the suppression was wrong.
+
+This rule runs at synthesis time, not at the persona level.
+Personas have a soft instruction via the subagent template's `{decision_primer}` variable to avoid re-raising rejected findings, but the orchestrator is the authoritative gate — if a persona re-raises despite the primer, synthesis drops the finding.
+
+### R30 Fix-Landed Matching Predicate
+
+When the orchestrator is running round 2+ on the same document (see Unit 7 multi-round memory), synthesis verifies that prior-round Applied findings actually landed. For each prior-round Applied finding:
+
+- **Matching predicate:** `normalize(section) + normalize(title)` (same fingerprint as 3.3 dedup) augmented with an evidence-substring overlap check. If any current-round persona raises a finding whose fingerprint matches a prior-round Applied finding AND shares >50% of its evidence substring with the prior-round evidence, treat it as a fix-landed regression.
+- **Section renames count as different locations.** If the section name has changed between rounds (edit introduced a heading rename), treat the new section as a different location and the current-round finding as new.
+- **On match:** flag the finding as "fix did not land" in the report rather than surfacing it as a new finding. Include the prior-round finding's title and the current-round persona's evidence so the user can see why the verification flagged it.

 ### Protected Artifacts

 During synthesis, discard any finding that recommends deleting or removing files in:
+
 - `docs/brainstorms/`
 - `docs/plans/`
 - `docs/solutions/`

 These are pipeline artifacts and must not be flagged for removal.

-## Phase 5: Next Action
+## Phase 5: Next Action — Terminal Question

-**Headless mode:** Return "Review complete" immediately. Do not ask questions. The caller receives the text summary from Phase 4 and handles any remaining findings.
+**Headless mode:** Return "Review complete" immediately. Do not ask questions. The caller receives the text envelope from Phase 4 and handles any remaining findings.
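The shared R29/R30 predicate (fingerprint match plus >50% evidence overlap) can be sketched as below. This is an illustration under stated assumptions: `normalize`, the dict field names, and the word-level overlap measure are stand-ins for whatever the skill's synthesis actually uses for the evidence-substring check:

```python
def normalize(text: str) -> str:
    # Case-fold and collapse whitespace so cosmetic edits don't break matching.
    return " ".join(text.lower().split())

def fingerprint(finding: dict) -> str:
    return normalize(finding["section"]) + "|" + normalize(finding["title"])

def evidence_overlap(prior_evidence: str, current_evidence: str) -> float:
    """Fraction of the prior evidence's words that reappear in the current evidence."""
    prior = set(normalize(prior_evidence).split())
    current = set(normalize(current_evidence).split())
    return len(prior & current) / len(prior) if prior else 0.0

def matches(prior: dict, current: dict) -> bool:
    """True when a current-round finding re-raises a prior-round one."""
    return (
        fingerprint(prior) == fingerprint(current)
        and evidence_overlap(prior["evidence"], current["evidence"]) > 0.5
    )
```

A section rename changes the fingerprint, so renamed sections naturally count as different locations, consistent with the R30 rule above.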
-**Interactive mode:**
+**Interactive mode:** fire the terminal question using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). This question is distinct from the mid-flow routing question (`references/walkthrough.md`) — the routing question chooses *how* to engage with findings, this one chooses *what to do next* once engagement is complete. Do not merge them.
+
+**Stem:** `Apply decisions and what next?`
+
+**Options (three by default; two in the zero-actionable case):**
+
+When `fixes_applied_count > 0` (at least one safe_auto or Apply decision has landed this session):
+
+```
+A. Apply decisions and proceed to <next stage>
+B. Apply decisions and re-review
+C. Exit without further action
+```
+
+When `fixes_applied_count == 0` (zero-actionable case, or the user took routing option D / every walk-through decision was Skip):
+
+```
+A. Proceed to <next stage>
+B. Exit without further action
+```
+
+The `<next stage>` substitution uses the document type from Phase 1:
+
+- Requirements document → `ce-plan`
+- Plan document → `ce-work`

-**Ask using the platform's interactive question tool** -- do not print the question as plain text output:
-- Claude Code: `AskUserQuestion`
-- Codex: `request_user_input`
-- Gemini: `ask_user`
-- Fallback (no question tool available): present numbered options and stop; wait for the user's next message
+**Label adaptation:** when no decisions are queued to apply, the primary option drops the `Apply decisions and` prefix — the label should match what the system is doing. `Apply decisions and proceed` when fixes are queued; `Proceed` when nothing is queued.

-Offer these two options. Use the document type from Phase 1 to set the "Review complete" description:
+**Caller-context handling (implicit):** the terminal question's "Proceed to <next stage>" option is interpreted contextually by the agent from the visible conversation state.
+When ce-doc-review is invoked from inside another skill's flow (e.g., ce-brainstorm Phase 4 re-review, ce-plan phase 5.3.8), the agent does not fire a nested `/ce-plan` or `/ce-work` dispatch — it returns control to the caller's flow which continues its own logic. When invoked standalone, "Proceed" dispatches the appropriate next skill. No explicit caller-hint argument is required; if this implicit handling proves unreliable in practice, an explicit `nested:true` flag can be added as a follow-up.

-1. **Refine again** -- Address the findings above, then re-review
-2. **Review complete** -- description based on document type:
-   - requirements document: "Create technical plan with ce-plan"
-   - plan document: "Implement with ce-work"
+### Iteration limit

-After 2 refinement passes, recommend completion -- diminishing returns are likely. But if the user wants to continue, allow it.
+After 2 refinement passes, recommend completion — diminishing returns are likely. But if the user wants to continue, allow it; the primer carries all prior-round decisions so later rounds suppress repeat findings cleanly.

-Return "Review complete" as the terminal signal for callers.
+Return "Review complete" as the terminal signal for callers, regardless of which option the user picked.

 ## What NOT to Do

@@ -170,4 +266,4 @@ Return "Review complete" as the terminal signal for callers.

 ## Iteration Guidance

-On subsequent passes, re-dispatch personas and re-synthesize. The auto-fix mechanism and confidence gating prevent the same findings from recurring once fixed. If findings are repetitive across passes, recommend completion.
+On subsequent passes, re-dispatch personas with the multi-round decision primer (see Unit 7) and re-synthesize. Fixed findings self-suppress because their evidence is gone from the current doc; rejected findings are handled by the R29 pattern-match suppression rule; applied-fix verification uses the R30 matching predicate above.
+If findings are repetitive across passes after these mechanisms run, recommend completion.

diff --git a/plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md b/plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md
new file mode 100644
index 000000000..989e89579
--- /dev/null
+++ b/plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md
@@ -0,0 +1,236 @@
+# Per-finding Walk-through
+
+This reference defines Interactive mode's per-finding walk-through — the path the user enters by picking option A (`Review each finding one by one — accept the recommendation or choose another action`) from the routing question, plus the unified completion report that every terminal path (walk-through, LFG, Append-to-Open-Questions, zero findings) emits.
+
+Interactive mode only.
+
+---
+
+## Routing question (the entry point)
+
+After `safe_auto` fixes apply and synthesis produces the remaining finding set, the orchestrator asks a four-option routing question before any walk-through or bulk action runs.
+
+Use the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). In Claude Code, the tool should already be loaded from the Interactive-mode pre-load step in `SKILL.md` — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. Rendering the routing question as narrative text is a bug, not a valid fallback.
+
+**Stem:** `What should the agent do with the remaining N findings?`
+
+**Options (fixed order; no option is labeled `(recommended)` — the routing choice depends on user intent):**
+
+```
+A. Review each finding one by one — accept the recommendation or choose another action
+B. LFG. Apply the agent's best-judgment action per finding
+C. Append findings to the doc's Open Questions section and proceed
+D. Report only — take no further action
+```
+
+The per-finding `(recommended)` labeling lives inside the walk-through (option A) and the bulk preview (options B/C), where it's applied per-finding from synthesis step 3.5b's `recommended_action`. The routing question itself does not recommend one of A/B/C/D because the right route depends on user intent (engage / trust / triage / skim), not on the finding-set shape — a rule that mapped finding-set shape to routing recommendation (e.g., "most findings are Apply-shaped → recommend LFG") would pressure users toward automated paths in ways that conflict with the user-intent framing.
+
+If all remaining findings are FYI-subsection-only (no `gated_auto` or above-gate `manual` findings), skip the routing question entirely and flow to the Phase 5 terminal question.
+
+**Dispatch by selection:**
+
+- **A** — load this walk-through (per-finding loop). Apply decisions accumulate in memory; Open-Questions defers execute inline via `references/open-questions-defer.md`; Skip decisions are recorded as no-action; `LFG the rest` routes through `references/bulk-preview.md`.
+- **B** — load `references/bulk-preview.md` scoped to every pending `gated_auto` / `manual` finding. On Proceed, execute the plan: Apply → end-of-batch document edit; Open-Questions defers → `references/open-questions-defer.md`; Skip → no-op. On Cancel, return to the routing question.
+- **C** — load `references/bulk-preview.md` with every pending finding in the Open-Questions bucket (regardless of the agent's natural recommendation). On Proceed, route every finding through `references/open-questions-defer.md`; no document edits apply. On Cancel, return to the routing question.
+- **D** — do not enter any dispatch phase. Emit the completion report and flow to Phase 5 terminal question.
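The dispatch rules plus the FYI-only guard amount to a small table. An illustrative sketch; `bucket` and the route strings are hypothetical names standing in for the loads described above, not identifiers that exist in the skill:

```python
def route_findings(option: str, findings: list[dict]) -> str:
    # bucket is hypothetical: "gated_auto", "manual" (above-gate), or "fyi".
    if all(f["bucket"] == "fyi" for f in findings):
        # FYI-only set (or empty set): skip the routing question entirely.
        return "phase-5-terminal-question"
    return {
        "A": "walkthrough",                  # per-finding loop
        "B": "bulk-preview:all-pending",     # LFG across every pending finding
        "C": "bulk-preview:open-questions",  # append everything, then proceed
        "D": "completion-report",            # report only
    }[option]
```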
+
+---
+
+## Entry (walk-through mode)
+
+The walk-through receives, from the orchestrator:
+
+- The merged findings list in severity order (P0 → P1 → P2 → P3), filtered to `gated_auto` and `manual` findings that survived the per-severity confidence gate. FYI-subsection findings are not included — they surface in the final report only and have no walk-through entry.
+- The run id for artifact lookups (when applicable).
+
+Each finding's recommended action has already been normalized by synthesis step 3.5b (Deterministic Recommended-Action Tie-Break, `Skip > Defer > Apply`) — the walk-through surfaces that recommendation via the merged finding's `recommended_action` field and does not recompute it.
+
+---
+
+## Per-finding presentation
+
+Each finding is presented in two parts: a terminal output block carrying the explanation, and a question via the platform's blocking question tool carrying the decision. Never merge the two — the terminal block uses markdown; the question uses plain text.
+
+### Terminal output block (print before firing the question)
+
+Render as markdown. Labels on their own line, blank lines between sections:
+
+```
+## Finding {N} of {M} — {severity} {plain-English title}
+
+Section: {section}
+
+**What's wrong**
+
+{plain-English problem statement from why_it_matters}
+
+**Proposed fix**
+
+{suggested_fix — rendered per the substitution rules below: prose-first, intent-language}
+
+**Why it works**
+
+{short reasoning, grounded in a pattern cited in the document or codebase when available}
+
+{Conflict-context line, when applicable — see below}
+```
+
+Substitutions:
+
+- **`{plain-English title}`** — a 3–8 word summary suitable as a heading. Derived from the merged finding's `title` field but rephrased so it reads as observable consequence (e.g., "Implementers will pick different tiers" rather than "Section X-Y lists four tiers").
+For document-review findings, observable consequence is the *effect on a reader, implementer, or downstream decision*, not runtime behavior.
+- **`{section}`** — from the finding's `section` field.
+- **`why_it_matters`** — from the merged finding's `why_it_matters` field. Rendered as-is; the subagent template's framing guidance ensures it's already observable-consequence-first.
+- **`suggested_fix`** — from the merged finding's `suggested_fix` field. Render as prose describing intent, not as raw markup. The user's job is to trust or reject the action — they don't need to review exact text. Rules:
+  - **Default — one sentence describing the effect.** What does the fix achieve, and where does it live? Prefer intent language over quoted text.
+    - Good: `Drop the Advisory tier from the enum; advisory-style findings surface in an FYI subsection at the presentation layer.`
+    - Good: `Add a deployment-ordering constraint requiring Units 3 and 4 in a single commit.`
+    - Bad: `Change "autofix_class: [auto, gated_auto, advisory, present]" to "autofix_class: [safe_auto, gated_auto, manual]" in findings-schema.json on line 48.` — too syntax-focused for a decision loop
+  - **Code-span budget** — at most 2 inline backtick spans per sentence, each a single identifier, flag, or short phrase (e.g., `` `safe_auto` ``, `` `<work-context>` ``). Always leave a space before and after each backtick span.
+  - **Raw code blocks** — only for short (≤5-line) genuinely additive content where no before-state exists. Above 5 lines, switch to a summary.
+  - **No diff blocks.** Document mutations render as prose.
+- **`Why it works`** — grounded reasoning that, where possible, references a similar pattern already used in the document or codebase. One to three sentences.
+- **Conflict-context line (when applicable)** — when contributing personas implied different actions for this finding and synthesis step 3.5b broke the tie, surface that briefly.
+Example: `Coherence recommends Apply; scope-guardian recommends Skip. Agent's recommendation: Skip.` The orchestrator's recommendation — the post-tie-break value — is what the menu labels "recommended."
+
+### Question stem (short, decision-focused)
+
+After the terminal block renders, fire the platform's blocking question tool with a compact two-line stem:
+
+```
+Finding {N} of {M} — {severity} {short handle}.
+{Action framing in a phrase}?
+```
+
+Where:
+
+- **Short handle** matches the `{plain-English title}` from the terminal block heading.
+- **Action framing** — one phrase describing what the single recommended action does, as a yes/no question. Examples: `Apply the rename?`, `Defer to Open Questions since the tradeoff is genuine?`, `Skip since the document already resolves this elsewhere?`.
+
+Never enumerate alternatives in the stem. One recommendation as a yes/no — the option list carries the alternatives. When the recommendation is close, surface the disagreement in the conflict-context line, not as a multi-option stem.
+
+### Confirmation between findings
+
+After the user answers and before printing the next finding's terminal block, emit a one-line confirmation of the action taken. Examples: `→ Applied. Edit staged at "Scope Boundaries" section.`, `→ Deferred. Entry appended to "## Deferred / Open Questions".`, `→ Skipped.`
+
+### Options (four; adapted as noted)
+
+Fixed order. Never reorder:
+
+```
+A. Apply the proposed fix
+B. Defer — append to the doc's Open Questions section
+C. Skip — don't apply, don't append
+D. LFG the rest — apply the agent's best judgment to this and remaining findings
+```
+
+**Mark the post-tie-break recommendation with `(recommended)` on its option label.** Required, not optional. Any of the four options can carry it:
+
+```
+A. Apply the proposed fix (recommended)
+B. Defer — append to the doc's Open Questions section
+C. Skip — don't apply, don't append
+D. LFG the rest
+```
+
+When reviewers disagreed or evidence cuts against the default, still mark one option — whichever synthesis produced — and surface the disagreement in the conflict-context line.
+
+### Adaptations
+
+- **N=1 (exactly one pending finding):** the terminal block's heading omits `Finding N of M` and renders as `## {severity} {plain-English title}`. The stem's first line drops the position counter, becoming `{severity} {short handle}.` Option D (`LFG the rest`) is suppressed because no subsequent findings exist — the menu shows three options: Apply / Defer / Skip.
+
+- **Open-Questions append unavailable** (read-only document, write failed): when `references/open-questions-defer.md` reports the in-doc append mechanic cannot run, option B is omitted. The stem appends one line explaining why (e.g., `Defer unavailable — document is read-only in this environment.`). The menu shows three options: Apply / Skip / LFG the rest. Before rendering options, remap any per-finding `Defer` recommendation from synthesis to `Skip` so the `(recommended)` marker lands on an option that's actually in the menu. Surface the remap on the conflict-context line (e.g., `Synthesis recommended Defer; downgraded to Skip — document is read-only.`).
+
+- **Combined N=1 + no-append:** the menu shows two options: Apply / Skip.
+
+Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to presenting the options as a numbered list and waiting for the user's next reply.
+
+---
+
+## Per-finding routing
+
+For each finding's answer:
+
+- **Apply the proposed fix** — add the finding's id to an in-memory Apply set. Advance to the next finding. Do not edit the document inline — Apply accumulates for end-of-walk-through batch execution.
+- **Defer — append to Open Questions section** — invoke the append flow from `references/open-questions-defer.md`.
The walk-through's position indicator stays on the current finding during any failure-path sub-question (Retry / Fall back / Convert to Skip). On success, record the append location and reference in the in-memory decision list and advance. On conversion-to-Skip from the failure path, advance with the failure noted in the completion report. +- **Skip — don't apply, don't append** — record Skip in the in-memory decision list. Advance. No side effects. +- **LFG the rest — apply the agent's best judgment** — exit the walk-through loop. Dispatch the bulk preview from `references/bulk-preview.md`, scoped to the current finding and everything not yet decided. The preview header reports the count of already-decided findings ("K already decided"). If the user picks Cancel from the preview, return to the current finding's per-finding question (not to the routing question). If the user picks Proceed, execute the plan per `references/bulk-preview.md` — Apply findings join the in-memory Apply set with the ones the user already picked, Defer findings route through `references/open-questions-defer.md`, Skip is no-op — then proceed to end-of-walk-through execution. + +--- + +## Override rule + +"Override" means the user picks a different preset action (Defer or Skip in place of Apply, or Apply in place of the agent's recommendation). No inline freeform custom-fix authoring — the walk-through is a decision loop, not a pair-editing surface. A user who wants a variant of the proposed fix picks Skip and hand-edits outside the flow; if they also want the finding tracked, they can Defer first and edit afterward. + +--- + +## State + +Walk-through state is **in-memory only**. 
The orchestrator maintains:
+
+- An Apply set (finding ids the user picked Apply on)
+- A decision list (every answered finding with its action and any metadata like `append_location` for Deferred or `reason` for Skipped)
+- The current position in the findings list
+
+Nothing is written to disk per decision except the in-doc Open Questions appends (which are external side effects — those cannot be rolled back). An interrupted walk-through (user cancels the prompt, session compacts, network dies) discards all in-memory state. Apply decisions have not been dispatched yet (they batch at end-of-walk-through), so they are cleanly lost with no document changes.
+
+Cross-session persistence is out of scope. This mirrors `ce-code-review`'s walk-through state rules.
+
+---
+
+## End-of-walk-through execution
+
+After the loop terminates — either every finding has been answered, or the user took `LFG the rest → Proceed` — the walk-through hands off to the execution phase:
+
+1. **Apply set:** in a single pass, the orchestrator applies every accumulated Apply-set finding's `suggested_fix` to the document. Document edits happen inline via the platform's edit tool — ce-doc-review has no batch-fixer subagent (per scope boundary); the orchestrator performs the edits directly, since `gated_auto` and `manual` fixes for documents are single-file markdown changes with no cross-file dependencies.
+2. **Defer set:** already executed inline during the walk-through via `references/open-questions-defer.md`. Nothing to dispatch here.
+3. **Skip:** no-op.
+
+Once execution completes, emit the unified completion report described below. This holds whether the loop ran straight through or the user took `LFG the rest → Cancel` and then worked through the remaining findings one at a time.
+
+---
+
+## Unified completion report
+
+Every terminal path of Interactive mode emits the same completion report structure. 
This covers: + +- Walk-through completed (all findings answered) +- Walk-through bailed via `LFG the rest → Proceed` +- Top-level LFG (routing option B) completed +- Top-level Append-to-Open-Questions (routing option C) completed +- Zero findings after `safe_auto` (routing question was skipped — the completion summary is a one-line degenerate case of this structure) + +### Minimum required fields + +- **Per-finding entries:** for every finding the flow touched, a line with — at minimum — title, severity, the action taken (Applied / Deferred / Skipped), the append location for Deferred entries, and a one-line reason for Skipped entries (grounded in the finding's confidence or the one-line `why_it_matters` snippet). +- **Summary counts by action:** totals per bucket (e.g., `4 applied, 2 deferred, 2 skipped`). +- **Failures called out explicitly:** any Apply that failed (e.g., document write error), any Open-Questions append that failed. Failures surface above the per-finding list so they are not missed. +- **End-of-review verdict:** carried over from Phase 4's Coverage section. + +### Report ordering + +Failures first (above the per-finding list), then per-finding entries grouped by action bucket in the order `Applied / Deferred / Skipped`, then summary counts, then Coverage (FYI observations, residual concerns), then the verdict. + +### Zero-findings degenerate case + +When the routing question was skipped because no `gated_auto` / above-gate `manual` findings remained after `safe_auto`, the completion report collapses to its summary-counts + verdict form with one added line — the count of `safe_auto` fixes applied. The summary wording: + +No FYI or residual concerns: + +``` +All findings resolved — 3 safe_auto fixes applied. + +Verdict: Ready. +``` + +FYI or residual concerns remain: + +``` +All actionable findings resolved — 3 safe_auto fixes applied. (2 FYI observations, 1 residual concern remain in the report.) + +Verdict: Ready. 
+``` + +--- + +## Execution posture + +The walk-through is operationally read-only with respect to the project except for three permitted writes: the in-memory Apply set / decision list (managed by the orchestrator), the in-doc Open Questions appends (external side effects managed by `references/open-questions-defer.md`), and the end-of-walk-through batch document edits (the orchestrator's final Apply pass). Persona agents remain strictly read-only. Unlike `ce-code-review`, there is no fixer subagent — the orchestrator owns the document edit directly. diff --git a/tests/fixtures/ce-doc-review/seeded-plan.md b/tests/fixtures/ce-doc-review/seeded-plan.md new file mode 100644 index 000000000..4c566be37 --- /dev/null +++ b/tests/fixtures/ce-doc-review/seeded-plan.md @@ -0,0 +1,213 @@ +--- +title: Seeded Test Fixture for ce-doc-review Pipeline Validation +type: feat +status: active +date: 2026-04-18 +--- + +<!-- +This is a SEEDED TEST FIXTURE for ce-doc-review pipeline validation. +It contains deliberately-planted issues across each tier shape so the +new synthesis pipeline (safe_auto / gated_auto / manual / FYI / dropped) +can be measured against known expected classifications. 
+ +Seed map (run this plan through ce-doc-review to verify): + +- safe_auto candidates (3): wrong count (Requirements Trace says 6, list + has 5), terminology drift (data store vs database used interchangeably), + stale cross-reference (see-Unit-7 but no Unit 7 exists) +- gated_auto candidates (3): missing fallback-with-deprecation-warning on + rename, deployment-ordering guarantee missing between skill+code commit, + framework-native-API substitution (hand-rolled deprecation vs using + cobra's Deprecated field) +- manual candidates (5): scope-guardian tension (Unit 2 could be merged + with Unit 3), product-lens premise question (is the refactor the right + solution), coherence design tension (two sections disagree on status), + scope-guardian complexity challenge (is this abstraction warranted), + product-lens trajectory concern (does this paint the system into a + corner) +- FYI candidates (5, confidence 0.40-0.65 at P3): filename-symmetry + observation, drift note, stylistic preference without evidence of + impact, speculative future-work concern, subjective readability note +- drop-worthy P3s (3, confidence 0.55-0.74): vague style nitpick, low- + signal "consider X" residual, theoretical scalability concern without + current evidence + +The descriptions intentionally vary in evidence quality so the confidence +gate is exercised. +--> + +# Seeded Test Fixture Plan + +## Problem Frame + +This fixture exercises the ce-doc-review pipeline against representative +issue shapes. The imagined feature is a refactor renaming the `crowd-sniff` +CLI command to `browser-sniff` across 6 implementation units, with +alias-compatibility, skill updates, and a schema migration. + +## Requirements Trace + +6 requirements planned: + +- R1. Rename command and add deprecation alias +- R2. Update skills that invoke the command +- R3. Rename output files from `crowd-report` to `browser-report` +- R4. Migrate data store entries that reference the old name +- R5. 
Update CLI tests + +(Only 5 items listed despite "6 requirements" — seeded wrong-count +safe_auto candidate.) + +## Scope Boundaries + +- Not changing the command's runtime behavior +- Not changing consumer-facing output formats beyond the rename + +## Key Technical Decisions + +- Keep a hidden alias `crowd-sniff` for backward compatibility (see Unit 7 + below for alias deprecation plan — seeded stale cross-reference; Unit 7 + does not exist in this plan) +- Store deprecation state in the data store +- Emit deprecation warning when alias is used + +(Uses "data store" here and "database" elsewhere — seeded terminology +drift safe_auto candidate.) + +## Implementation Units + +- [ ] Unit 1: Rename the CLI command + +**Goal:** Rename `crowd-sniff` to `browser-sniff` in the CLI framework. + +**Files:** `internal/cli/crowd_sniff.go` + +**Approach:** Move the command definition. Keep the old name as an alias. +Print a one-line deprecation warning to stdout when alias is used. (Seeded +gated_auto: cobra's native `Deprecated` field handles this uniformly; +hand-rolling the deprecation warning duplicates framework behavior.) + +**Test scenarios:** + +- Happy path: `browser-sniff` runs without warning +- Happy path: `crowd-sniff` runs and prints deprecation warning +- Edge case: `-h` on either variant shows the same help + +- [ ] Unit 2: Update skills to invoke new command + +**Goal:** Update every skill that shells out to `crowd-sniff` to call +`browser-sniff` instead. + +**Files:** `plugins/*/skills/*/SKILL.md` (grep for "crowd-sniff") + +**Approach:** sed rename across skill files. Keep alias working for +external consumers that may still invoke `crowd-sniff` directly. + +(Seeded manual: this unit could be merged with Unit 3 since both update +consumer sites that will deploy together — scope-guardian candidate for +"Units 2 and 3 could be one unit.") + +- [ ] Unit 3: Rename output files + +**Goal:** Change output filename from `crowd-report.md` to +`browser-report.md`. 
+ +**Files:** `internal/cli/output.go`, `internal/pipeline/writer.go` + +**Approach:** Write new name, read new name. No fallback — consumers that +read `crowd-report.md` will need to update. (Seeded gated_auto: missing +fallback-with-deprecation-warning on rename; mid-flight consumers and +published content will silently fail. Industry-standard pattern is read +new name first, fall back to old with warning for one release.) + +**Test scenarios:** + +- Happy path: new writes go to `browser-report.md` + +(Seeded FYI: test coverage only covers the happy path and misses the +read-side failure modes entirely, but flagging this is low-signal since +the unit explicitly chose no-fallback.) + +- [ ] Unit 4: Migrate data store entries + +**Goal:** Update database entries that reference the old name. + +**Files:** `db/migrate/20260418_rename_crowd_sniff.rb` + +**Approach:** Single-transaction migration. No deployment-ordering +guarantee between this migration and the code changes in Units 1-3. If +the migration runs before Units 1-3 land, the code reads stale data. +If after, new code temporarily sees old entries until migration runs. +(Seeded gated_auto: deployment-ordering guarantee missing; concrete fix +is to require Units 1-4 land in a single commit/PR.) + +- [ ] Unit 5: Update CLI tests + +**Goal:** Update CLI tests to exercise both names. + +**Files:** `internal/cli/cli_test.go` + +**Approach:** Add test coverage for the new command name and the alias +behavior. + +**Test scenarios:** + +- Happy path: new name test +- Happy path: alias name test with deprecation warning assertion + +## Risks + +- The filename rename affects downstream consumers' readers. The chosen + approach (no-fallback) is subjective and could go either way — keeping + strict "move on" semantics vs. backward-compatible read fallback. + (Seeded manual: genuine design tension between "clean break" and + "compatibility period"; scope-guardian vs. product-lens judgment call.) 
+ +- The alias is compatibility theater if there are no external consumers. + We don't have evidence of external consumers. (Seeded manual: + product-lens premise challenge — "is the alias justified given no + external consumers are documented?") + +## Miscellaneous Notes + +The filename `browser-report.md` is asymmetric with the command name +`browser-sniff` — there's no `-sniff-report.md`. This could go either way +depending on whether command/output parity is valued. (Seeded FYI: +filename asymmetry observation, no wrong answer, low-stakes.) + +Consider renaming the database column `crowd_data` to `browser_data` for +consistency. (Seeded FYI: stylistic preference without evidence of +impact.) + +The refactor may paint the system into a corner if we later want to +support both crowd-based and browser-based sniffing. (Seeded manual: +product-lens trajectory concern about future path dependencies.) + +## Deferred to Implementation + +- Exact deprecation message wording +- Release notes phrasing + +## Known Drift + +`crowd_data` column name remains in the data store schema (legacy). We +may rename it later. (Seeded FYI: drift note without concrete fix.) + +## Abstraction Commentary + +The refactor introduces an `AliasedCommand` abstraction to bundle the +rename + deprecation-warning behavior. This might be overkill for a +one-command rename. (Seeded manual: scope-guardian complexity challenge +— is the abstraction warranted for one use case?) + +## Low-Signal Residuals (Seeded Drop-Worthy P3s) + +- The plan's section ordering could be improved; "Miscellaneous Notes" + feels like a catch-all. (Seeded drop: vague style nitpick at P3, + confidence should register below 0.75 gate.) +- Consider whether the schema migration strategy scales if the codebase + grows 10x. (Seeded drop: theoretical scalability concern without + current evidence, P3.) +- Some sentences could be tighter. (Seeded drop: low-signal "consider X" + at P3.) 
diff --git a/tests/pipeline-review-contract.test.ts b/tests/pipeline-review-contract.test.ts index 67ee9fd42..4f3412fa0 100644 --- a/tests/pipeline-review-contract.test.ts +++ b/tests/pipeline-review-contract.test.ts @@ -353,3 +353,276 @@ describe("ce-plan review contract", () => { expect(content).not.toContain("**Options for Standard or Lightweight plans:**") }) }) + +describe("ce-doc-review contract", () => { + test("findings-schema autofix_class enum uses ce-code-review-aligned tier names", async () => { + const schema = JSON.parse( + await readRepoFile("plugins/compound-engineering/skills/ce-doc-review/references/findings-schema.json") + ) + const enumValues = schema.properties.findings.items.properties.autofix_class.enum + + // Three-tier system aligned with ce-code-review's first three tier names + expect(enumValues).toEqual(["safe_auto", "gated_auto", "manual"]) + + // No advisory tier — advisory-style findings surface as an FYI subsection at presentation layer + expect(enumValues).not.toContain("advisory") + + // Old tier names must be gone after the rename + expect(enumValues).not.toContain("auto") + expect(enumValues).not.toContain("present") + }) + + test("subagent template carries framing guidance and strawman rule", async () => { + const template = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/references/subagent-template.md" + ) + + // Framing guidance block present + expect(template).toContain("observable consequence") + expect(template).toContain("2-4 sentences") + + // Strawman-aware classification rule + expect(template).toContain("Strawman-aware classification rule") + expect(template).toContain("is NOT a real alternative") + + // Strawman safeguard on safe_auto + expect(template).toContain("Strawman safeguard") + + // Persona exclusion of Open Questions section (prevents round-2 feedback loop) + expect(template).toContain("Exclude prior-round deferred entries") + expect(template).toContain("Deferred / Open Questions") 
+ + // Decision primer slot and rules + expect(template).toContain("{decision_primer}") + expect(template).toContain("<decision-primer-rules>") + }) + + test("synthesis pipeline routes three tiers with per-severity gates and FYI subsection", async () => { + const synthesis = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md" + ) + + // Per-severity confidence gate with the specific thresholds + expect(synthesis).toContain("Per-Severity") + expect(synthesis).toMatch(/P0\s*\|\s*0\.50/) + expect(synthesis).toMatch(/P1\s*\|\s*0\.60/) + expect(synthesis).toMatch(/P2\s*\|\s*0\.65/) + expect(synthesis).toMatch(/P3\s*\|\s*0\.75/) + + // FYI floor at 0.40 for low-confidence manual findings + expect(synthesis).toContain("0.40") + expect(synthesis).toContain("FYI floor") + + // Three-tier routing table present + expect(synthesis).toContain("`safe_auto`") + expect(synthesis).toContain("`gated_auto`") + expect(synthesis).toContain("`manual`") + + // Cross-persona agreement boost (replaces residual-concern promotion) + expect(synthesis).toContain("Cross-Persona Agreement Boost") + expect(synthesis).toContain("+0.10") + + // R29 and R30 round-2 rules + expect(synthesis).toContain("R29 Rejected-Finding Suppression") + expect(synthesis).toContain("R30 Fix-Landed Matching Predicate") + }) + + test("headless envelope surfaces new tiers distinctly", async () => { + const synthesis = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md" + ) + + // Bucket headers for the new tiers appear in the headless envelope template + expect(synthesis).toContain("Applied N safe_auto fixes") + expect(synthesis).toContain("Gated-auto findings") + expect(synthesis).toContain("Manual findings") + expect(synthesis).toContain("FYI observations") + + // Terminal signal preserved for programmatic callers + expect(synthesis).toContain("Review complete") + }) + + test("terminal 
question is three-option by default with label adaptation", async () => { + const synthesis = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/references/synthesis-and-presentation.md" + ) + + // Three options when fixes are queued + expect(synthesis).toContain("Apply decisions and proceed to <next stage>") + expect(synthesis).toContain("Apply decisions and re-review") + expect(synthesis).toContain("Exit without further action") + + // Two options in the zero-actionable case with the adapted label + expect(synthesis).toContain("fixes_applied_count == 0") + expect(synthesis).toContain("zero-actionable case") + + // Next-stage substitution rules documented + expect(synthesis).toContain("Requirements document") + expect(synthesis).toContain("Plan document") + expect(synthesis).toContain("ce-plan") + expect(synthesis).toContain("ce-work") + }) + + test("SKILL.md has Interactive mode rules with AskUserQuestion pre-load", async () => { + const content = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/SKILL.md" + ) + + // Interactive mode rules section at top + expect(content).toContain("## Interactive mode rules") + expect(content).toContain("AskUserQuestion") + expect(content).toContain("ToolSearch") + expect(content).toContain("numbered-list fallback") + + // Decision primer variable in the dispatch table + expect(content).toContain("{decision_primer}") + expect(content).toContain("<prior-decisions>") + + // References loaded lazily via backtick paths for walk-through and bulk-preview + expect(content).toContain("`references/walkthrough.md`") + expect(content).toContain("`references/bulk-preview.md`") + }) + + test("walkthrough and bulk-preview reference files exist with required mechanics", async () => { + const walkthrough = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/references/walkthrough.md" + ) + const bulkPreview = await readRepoFile( + 
"plugins/compound-engineering/skills/ce-doc-review/references/bulk-preview.md" + ) + + // Routing question distinguishing words present (front-loaded per AGENTS.md Interactive Question Tool Design) + expect(walkthrough).toContain("Review each finding one by one") + expect(walkthrough).toContain("LFG") + expect(walkthrough).toContain("Append findings to the doc's Open Questions section") + expect(walkthrough).toContain("Report only") + + // Four per-finding options + expect(walkthrough).toContain("Apply the proposed fix") + expect(walkthrough).toContain("Defer — append to the doc's Open Questions section") + expect(walkthrough).toContain("Skip — don't apply, don't append") + expect(walkthrough).toContain("LFG the rest") + + // Recommended marker mandatory + expect(walkthrough).toContain("(recommended)") + + // No advisory variant (advisory is a presentation-layer concept, not a walkthrough option) + expect(walkthrough).not.toContain("Acknowledge — mark as reviewed") + + // No tracker-detection machinery (ce-doc-review has no external tracker) + expect(walkthrough).not.toContain("named_sink_available") + expect(walkthrough).not.toContain("any_sink_available") + expect(walkthrough).not.toContain("[TRACKER]") + + // Bulk preview has Proceed/Cancel options and the four bucket labels + expect(bulkPreview).toContain("Proceed") + expect(bulkPreview).toContain("Cancel") + expect(bulkPreview).toContain("Applying (N):") + expect(bulkPreview).toContain("Appending to Open Questions (N):") + expect(bulkPreview).toContain("Skipping (N):") + + // No Acknowledge bucket in bulk preview either + expect(bulkPreview).not.toContain("Acknowledging (N):") + }) + + test("open-questions-defer reference implements append mechanic with failure path", async () => { + const defer = await readRepoFile( + "plugins/compound-engineering/skills/ce-doc-review/references/open-questions-defer.md" + ) + + // Append mechanic steps + expect(defer).toContain("## Deferred / Open Questions") + 
expect(defer).toContain("### From YYYY-MM-DD review") + + // Entry format includes required fields but excludes suggested_fix and evidence + expect(defer).toContain("{title}") + expect(defer).toContain("{severity}") + expect(defer).toContain("{reviewer}") + expect(defer).toContain("{confidence}") + expect(defer).toContain("{why_it_matters}") + + // Failure-path sub-question with three options + expect(defer).toContain("Retry") + expect(defer).toContain("Record the deferral in the completion report only") + expect(defer).toContain("Convert this finding to Skip") + + // No tracker-detection logic (this is the in-doc defer path, not tracker-defer) + expect(defer).not.toContain("named_sink_available") + expect(defer).not.toContain("[TRACKER]") + }) +}) + +describe("ce-compound frontmatter schema expansion contract", () => { + test("problem_type enum includes the four new knowledge-track values", async () => { + const schema = await readRepoFile( + "plugins/compound-engineering/skills/ce-compound/references/schema.yaml" + ) + + // Four new knowledge-track values present in the enum + expect(schema).toContain("architecture_pattern") + expect(schema).toContain("design_pattern") + expect(schema).toContain("tooling_decision") + expect(schema).toContain("convention") + + // best_practice remains valid as fallback + expect(schema).toContain("best_practice") + }) + + test("ce-compound-refresh schema stays in sync with canonical ce-compound schema", async () => { + const canonical = await readRepoFile( + "plugins/compound-engineering/skills/ce-compound/references/schema.yaml" + ) + const refresh = await readRepoFile( + "plugins/compound-engineering/skills/ce-compound-refresh/references/schema.yaml" + ) + + // Duplicate schemas must be identical (kept in sync intentionally per AGENTS.md) + expect(refresh).toEqual(canonical) + }) + + test("yaml-schema.md documents category mappings for the four new values", async () => { + const mapping = await readRepoFile( + 
"plugins/compound-engineering/skills/ce-compound/references/yaml-schema.md" + ) + + expect(mapping).toContain("architecture_pattern` -> `docs/solutions/architecture-patterns/") + expect(mapping).toContain("design_pattern` -> `docs/solutions/design-patterns/") + expect(mapping).toContain("tooling_decision` -> `docs/solutions/tooling-decisions/") + expect(mapping).toContain("convention` -> `docs/solutions/conventions/") + }) +}) + +describe("ce-learnings-researcher domain-agnostic contract", () => { + test("agent prompt frames as domain-agnostic not bug-focused", async () => { + const agent = await readRepoFile( + "plugins/compound-engineering/agents/research/ce-learnings-researcher.agent.md" + ) + + // Domain-agnostic identity framing + expect(agent).toContain("domain-agnostic institutional knowledge researcher") + + // Multiple learning shapes named as first-class + expect(agent).toContain("Architecture patterns") + expect(agent).toContain("Design patterns") + expect(agent).toContain("Tooling decisions") + expect(agent).toContain("Conventions") + + // Structured <work-context> input accepted + expect(agent).toContain("<work-context>") + expect(agent).toContain("Activity:") + expect(agent).toContain("Concepts:") + expect(agent).toContain("Decisions:") + expect(agent).toContain("Domains:") + + // Dynamic subdirectory probe replaces hardcoded category table + expect(agent).toContain("Probe") + expect(agent).toContain("discover which subdirectories actually exist") + + // Critical-patterns.md read is conditional, not assumed + expect(agent).toMatch(/critical-patterns.md.*exists/i) + + // Integration Points list no longer includes ce-doc-review (agent is ce-plan-owned) + const integration = agent.substring(agent.indexOf("Integration Points")) + expect(integration).not.toContain("ce-doc-review") + }) +})
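The per-severity gate, FYI floor, and cross-persona agreement boost that the contract tests above pin down can be sketched as a small routing helper. This is an illustrative sketch under assumptions: the type names and the `routeFinding` shape are hypothetical, and real routing is not purely confidence-driven (the synthesis also weighs signal quality when separating FYI-worthy observations from drop-worthy P3 residuals, which this sketch omits).

```typescript
// Hypothetical sketch of the confidence gating the contract tests assert:
// per-severity gates P0 0.50 / P1 0.60 / P2 0.65 / P3 0.75, an FYI floor
// at 0.40, and a +0.10 cross-persona agreement boost. Names and shapes are
// illustrative, not the actual synthesis implementation.
type Severity = "P0" | "P1" | "P2" | "P3"
type Route = "surface" | "fyi" | "drop"

const GATE: Record<Severity, number> = { P0: 0.5, P1: 0.6, P2: 0.65, P3: 0.75 }
const FYI_FLOOR = 0.4
const AGREEMENT_BOOST = 0.1

function routeFinding(
  severity: Severity,
  confidence: number,
  crossPersonaAgreement: boolean
): Route {
  // Agreement across personas boosts effective confidence before gating.
  const effective = Math.min(
    1,
    confidence + (crossPersonaAgreement ? AGREEMENT_BOOST : 0)
  )
  if (effective >= GATE[severity]) return "surface" // clears its severity gate
  if (effective >= FYI_FLOOR) return "fyi" // below gate, above FYI floor
  return "drop" // below the FYI floor
}
```

Under this sketch a P3 at 0.68 with no agreement lands in the FYI bucket, while a P1 at 0.55 that two personas independently flagged clears the 0.60 gate after the +0.10 boost.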