diff --git a/docs/plans/2026-03-27-kilroy-stabilization.md b/docs/plans/2026-03-27-kilroy-stabilization.md new file mode 100644 index 00000000..f2a93872 --- /dev/null +++ b/docs/plans/2026-03-27-kilroy-stabilization.md @@ -0,0 +1,840 @@ +# Kilroy Stabilization Plan + +Date: 2026-03-27 + +## Problem Statement + +Kilroy is an attractor-pattern graph runner that orchestrates multi-stage LLM workflows. +It works, but it's fragile, hard to set up, hard to debug, and fails in ways that are +difficult to diagnose. The root causes are architectural: + +1. **Three concerns in one layer.** Graph execution, LLM orchestration, and software-development + workflow opinions (git worktrees, artifact policies, per-node commits) are inseparable. You + can't run a simple graph without buying into the full stack. + +2. **Heavy where it should be light; light where it should be heavy.** Configuration requires + specifying everything (CXDB, provider backends, model catalogs, artifact policies) with no + sensible defaults. But validation is shallow — credentials are checked for presence not + validity, and the system fails 2 hours into a run for things it could catch in 10 seconds. + +3. **Invisible decisions.** Edge routing uses 4+ interacting mechanisms (conditions, + preferred_label, suggested_next_ids, retry_target chains). Provider selection, fidelity + resolution, failure classification, and implicit fan-out detection all happen silently. + When a run takes an unexpected path, there's no way to understand why without reading code. + +4. **File-based state assembly.** Run state is spread across progress.ndjson, checkpoint.json, + final.json, status.json per node, and CXDB events. Piecing together what happened requires + reading multiple files and mentally replaying the run. No queryable operational state exists. + +5. **Graphs encode too much.** Prompts, failure policies, verification scripts, status contracts, + model routing, and fidelity modes all live in DOT files. Graphs are hard to write, hard to + read, and easy to get subtly wrong. + +## Design Principles + +These guide every task in this plan: + +- **Fail fast.** Validate everything checkable before execution starts. 10 seconds, not 2 hours. +- **Explicit over implicit.** Every routing decision, provider selection, and state transition + should be visible in the graph or in decision logs. No silent fallbacks. +- **Layered architecture.** The graph runner is the core. Git, CXDB, artifact policies, and + software-development workflow patterns are composable layers on top, not baked in. +- **Zero-config start.** `kilroy run --graph pipeline.dot` should work with no config file. + Everything else is opt-in. +- **Observable decisions.** The system should answer "what happened and why" from a single + query, not a filesystem crawl. + +## Phasing + +Each phase delivers standalone value and builds on the previous one. Phase 0 can start today. +Script-based test graphs (no LLM involvement) validate each change to isolate engine behavior +from model variability. + +--- + +## Phase 0: Immediate Stabilization + +Goal: Make the current system more predictable without structural changes. Each task is +independently shippable. Use script-based test graphs to validate — nodes run shell commands +that produce deterministic output, so we're testing the engine plumbing, not LLM behavior. + +### 0.1 Create Script-Based Test Graphs + +**What:** Create a small suite of DOT graphs where every node runs a shell script instead of +an LLM. These become the validation harness for all subsequent Phase 0 work and beyond. + +**Context:** Currently there's no way to test the engine's traversal, routing, and checkpoint +behavior without involving LLMs. We need deterministic graphs that exercise the engine's core +codepaths: linear traversal, conditional routing, retry loops, failure handling. + +**Done when:** +- A `testdata/` directory contains at least 4 graphs: + - `linear.dot`: start → A → B → C → done (basic traversal) + - `linear_verify.dot`: start → implement → verify → done, with verify→implement on fail + (hill-climber pattern with a verify script that succeeds on Nth attempt) + - `conditional.dot`: start → process → (path_a | path_b based on outcome) → done + - `fail_fast.dot`: start → node_that_fails → done (tests failure handling) +- Each graph uses `type="command"` or `shape=parallelogram` nodes with shell scripts +- Shell scripts are in `testdata/scripts/`, produce deterministic outcomes via status.json +- All graphs pass `kilroy attractor validate` +- All graphs execute successfully with `kilroy attractor run` + +### 0.2 Make CXDB Truly Optional + +**What:** Remove the assertion that CXDB configuration is required. Users should be able to +run graphs without CXDB configured, running, or even installed. + +**Context:** `config.go:313` asserts CXDB fields are required ("required in v1"). The engine +already nil-checks CXDB access everywhere. The `--no-cxdb` CLI flag exists but config +validation blocks before it's consulted when no CXDB config is provided. This means even a +user who doesn't want CXDB has to provide dummy CXDB config values. + +**Done when:** +- A run config with no `cxdb` section is accepted without error +- `kilroy attractor run --graph testdata/linear.dot` works with a minimal config (repo path only) +- Running without CXDB produces no CXDB-related errors or warnings +- Running WITH CXDB still works as before +- Existing tests pass (`go test ./...`) + +### 0.3 Strict Outcome Validation + +**What:** Reject unrecognized status values from node outcomes instead of silently passing +them through. Typos in status names currently cause silent routing failures. + +**Context:** `runtime/status.go` ParseStageStatus accepts any non-empty string as a valid +status. If an LLM writes `"outcome": "proccess"` (typo), it's treated as a custom outcome. +No edge condition matches it, and routing silently falls through to default behavior. The +canonical values are: success, partial_success, retry, fail, skipped. + +**Done when:** +- Unknown status values produce a clear error with the invalid value and the list of valid values +- The canonical status set is documented in one place in code +- The test graph suite from 0.1 exercises both valid and invalid outcomes +- Existing graphs that intentionally use custom outcomes (if any in the test suite) are migrated + to use `preferred_label` for conditional routing instead +- Existing tests pass + +### 0.4 Decision Logging + +**What:** Add structured events to the progress stream for every routing decision the engine +makes. When a run takes an unexpected path, the decision log should explain exactly why. + +**Context:** `next_hop.go` evaluates edge conditions, matches preferred labels, applies weight +tiebreaking, and falls through retry_target chains — but most of these decisions are invisible. +`progress.go` captures stage start/end events but not the reasoning behind edge selection. +Provider selection and retry decisions are also unlogged. + +**Done when:** +- Edge condition evaluation logs: which edges were evaluated, what their conditions were, + whether each condition matched, and which edge was selected +- Provider/model selection logs: which provider and model were chosen for each node +- Retry decision logs: when a retry occurs, the attempt number, backoff delay, and why +- All new events appear in progress.ndjson with structured fields +- The test graphs from 0.1 produce readable decision trails that explain their routing +- Running a conditional graph and reading progress.ndjson clearly shows why each path was taken + +### 0.5 Real Credential Validation + +**What:** Actually test provider credentials during preflight by making lightweight API calls, +rather than just checking if environment variables are set. + +**Context:** `provider_preflight.go` has prompt-probe infrastructure that already makes test +API calls with retry logic. But it's disabled by default and runs in warn-only mode. The +common failure mode: a run starts, executes for hours, then fails at a node that needs a +provider whose API key is missing, invalid, or expired. + +**Done when:** +- Preflight makes a real API call to every provider/model combination declared in the graph +- Invalid or missing credentials fail preflight with a clear message naming the provider, + model, and what's wrong +- Preflight reports ALL credential issues at once, not one at a time +- A graph referencing a provider with a bad key fails in preflight, not at execution time +- Preflight can be skipped with an explicit flag for offline/CI use +- The fail_fast test graph from 0.1 validates this behavior + +### 0.6 Dry-Run Mode + +**What:** Add `--dry-run` that validates everything — graph structure, provider credentials, +model availability — and reports results without executing any nodes. + +**Context:** The current `--preflight` flag runs validation but doesn't do the full bootstrap +(repo checks, artifact policy resolution, effective provider routing). A dry-run should go +further: confirm that the run WOULD succeed by doing everything except actually executing +handlers. + +**Done when:** +- `kilroy attractor run --graph test.dot --dry-run` validates graph, resolves all providers, + tests all credentials, and reports a summary of what would happen +- No git branches are created, no worktrees made, no nodes executed +- Output shows: graph structure summary, provider assignments per node, credential status, + any warnings or errors +- Exit code reflects whether the run would succeed (0) or fail (non-zero) +- Works with the test graph suite from 0.1 + +### 0.7 Zero-Config Startup + +**What:** Make `kilroy attractor run --graph pipeline.dot` work with no config file at all. +Auto-detect providers from environment, use sensible defaults for everything else. + +**Context:** Currently you must write a YAML config file specifying CXDB, provider backends, +model catalogs, artifact policies, and more. A new user has to understand the entire system +architecture to even start. Most of this has one reasonable default. + +**Done when:** +- Running with only `--graph` and no `--config` succeeds when provider API keys are in the + environment (auto-detect: ANTHROPIC_API_KEY → anthropic provider, OPENAI_API_KEY → openai, etc.) +- No CXDB required (builds on 0.2) +- Default model selected per provider (the provider's current best general-purpose model) +- Default backend selected based on what's available (API key present → api; CLI binary found → cli) +- Repo path defaults to current directory if it's a git repo, or a temp workspace if not +- Config file remains available for overriding any default +- The test graph suite runs with zero config + +--- + +## Phase 0.8: CLI-Only Agent Backend (Deprecate API) + +Goal: Simplify the agent backend to CLI-only. Use the appropriate CLI tool for each provider +instead of maintaining both API and CLI execution paths. + +### Design Context + +Team feedback: use `opencode` for all providers that don't have a dedicated CLI agent. +The provider mapping becomes: + +- Claude → `claude` CLI (with `--dangerously-skip-permissions`) +- OpenAI → `codex` CLI (with `--sandbox workspace-write`) +- Google → `gemini` CLI (with `--yolo`) +- Everything else → `opencode` CLI (multi-provider, open-source) + +This eliminates the entire API backend: the `internal/agent/` package (session management, +tool registry, profiles, output truncation), the `internal/llm/` package (provider adapters, +unified client), and the complexity of maintaining two fundamentally different execution +models. The codergen handler becomes: "invoke a CLI binary with a prompt, capture output." + +This also resolves the CLI-vs-API convergence gap — CLI converges better for structural +reasons (purpose-built agents with native tools), so going all-CLI means convergence +improves everywhere. It sidesteps Anthropic ToS concerns since we're using official/community +CLI tooling as intended. + +### 0.8.1 Add opencode as a Provider Backend + +**What:** Add opencode CLI support alongside the existing claude/codex/gemini CLI backends. +opencode becomes the default backend for any provider that doesn't have a dedicated CLI. + +**Context:** `internal/providerspec/builtin.go` defines CLI invocation templates per provider. +Adding opencode follows the same pattern. opencode supports multiple LLM providers through +its own configuration. + +**Done when:** +- opencode is a recognized provider backend in the provider spec system +- Graphs can specify `llm_provider=opencode` or have it selected as default +- The invocation template handles opencode's flags for headless/non-interactive mode +- A test graph using opencode validates the integration +- opencode availability is checked in preflight + +### 0.8.2 Deprecate API Backend + +**What:** Mark the API backend as deprecated. New graphs should use CLI backends only. +The API codepath remains functional but is no longer the recommended path. + +**Context:** The API backend lives in `internal/agent/` (session, tool registry, profiles) +and `internal/llm/` (provider adapters). Deprecation means: warning when API backend is +selected, documentation updated, no new features added to the API path. + +**Done when:** +- Using `backend: api` in config produces a deprecation warning +- Documentation recommends CLI backends for all providers +- The warning names the recommended CLI alternative for the provider +- Existing API-backend graphs continue to work (no breaking change) + +--- + +## Phase 1: Run Database + +Goal: Introduce SQLite as the single source of truth for run-level operational state. Replace +the file-based state assembly (progress.ndjson, checkpoint.json, final.json, per-node +status.json) with queryable structured data. + +### Design Context + +CXDB is a conversation-history store optimized for turn-level LLM interaction data. It's good +at recording what happened inside a node's LLM session. It is not designed for run-level +operational state — it can't query across runs, filter by outcome, or aggregate decisions. + +SQLite fills the operational gap: which nodes ran, what they returned, which edges were taken, +why routing decisions were made. CXDB remains the opt-in deep-dive tool for inspecting LLM +conversations within a node. + +The database lives at the XDG state path (the codebase already uses `XDG_STATE_HOME` for +state). Default: `~/.local/state/kilroy/runs.db`. Shared across runs with isolation by run_id. + +**Schema management:** Numbered migration files applied automatically on DB open. Migrations +are append-only (add columns and tables, never remove or rename). Each migration is a single +SQL file checked into the repo. The migration runner and the schema contract must be clearly +documented in AGENTS.md or equivalent so that future work (human or agent) follows the pattern. + +### 1.1 Schema Design and Migration Infrastructure + +**What:** Design the initial schema and build the migration runner. The schema must support +the queries that matter: "what's the status of run X?", "why did node Y take edge Z?", +"which runs failed this week?", "what provider was used for node W?" + +**Context:** Today this information is scattered across files that must be parsed and +cross-referenced. The schema should make these queries trivial. + +**Done when:** +- A migration runner exists that auto-applies numbered SQL files on DB open +- The initial schema is documented with purpose annotations per table and column +- The schema supports: runs, nodes (with attempts), outcomes, edge decisions, provider selections +- A Go package provides typed read/write access to the schema +- Migration tests verify forward migration from empty DB +- AGENTS.md or equivalent documents the migration pattern and rules + +### 1.2 Engine Writes to Run DB + +**What:** Wire the engine to write operational events to the run DB alongside (not instead of) +existing file-based output. This is additive — nothing breaks. + +**Context:** The engine currently writes progress events via `appendProgress()` in engine.go. +The DB writes should happen at the same points, with the same information, but into structured +tables instead of ndjson lines. + +**Done when:** +- Every run creates a row in the runs table +- Every node execution creates rows in nodes and outcomes tables +- Every edge decision creates a row in decisions table +- Existing file-based output (progress.ndjson, checkpoint.json, final.json) still works +- Test graphs from Phase 0 produce correct DB entries +- DB contents match what progress.ndjson reports (cross-validation) + +### 1.3 Status Command Reads from Run DB + +**What:** Update `kilroy attractor status` to query the run DB as its primary data source, +with filesystem fallback for older runs. + +**Context:** The status command currently parses progress.ndjson, checkpoint.json, and final.json +to assemble run state. This is fragile and slow for large runs. The DB makes these queries +instant and reliable. + +**Done when:** +- `kilroy attractor status --run ` shows run state from the DB +- `kilroy attractor status --list` shows all known runs with their status +- `kilroy attractor status --run --decisions` shows the edge decision log +- Falls back to file-based parsing if run predates the DB +- Works with test graphs from Phase 0 + +--- + +## Phase 2: Lifecycle Hooks + +Goal: Decouple git, CXDB, artifact policies, and other concerns from the engine core. Replace +hardcoded behavior with a hook system where the engine fires lifecycle events and registered +hooks respond. + +### Design Context + +Currently the engine directly calls gitutil functions, CXDB sinks, and artifact policy logic. +This means you can't run a graph without git, and every engine change risks breaking these +subsystems. Lifecycle hooks make these composable: a minimal setup has zero hooks (engine just +runs), a full coding-workflow setup registers hooks for git, CXDB, filesystem snapshots, etc. + +Hook points: `run.before`, `run.after`, `node.before`, `node.after`, `node.on_fail`, +`node.on_retry`, `run.on_blocked`. These are the natural lifecycle boundaries where +cross-cutting concerns attach. + +### 2.1 Define Hook Interface and Registration + +**What:** Design and implement the hook interface, registration mechanism, and the lifecycle +event points in the engine where hooks fire. + +**Context:** The engine currently has these concerns hardcoded in its main loop: git branch +creation (run.before), git worktree setup (run.before), CXDB context creation (run.before), +per-node git commits (node.after), CXDB event emission (node.after), checkpoint writes +(node.after), stall detection (node.during). + +**Done when:** +- A hook interface exists with methods for each lifecycle point +- Hooks can be registered on the engine before a run starts +- The engine calls registered hooks at the correct lifecycle points +- A no-op hook can be registered and fires correctly (verified by test) +- Multiple hooks can be registered and fire in order +- Hook errors are logged but don't crash the engine (configurable: warn vs fail) + +### 2.2 Extract Git Operations into Hooks + +**What:** Move git branch creation, worktree management, per-node commits, and SHA tracking +out of the engine core and into a git lifecycle hook. The engine doesn't know about git. The +git hook provides worktree isolation, per-node checkpointing, and branch management as an +opt-in capability. + +**Context:** Team feedback: "Don't make kilroy understand worktrees." The engine should work +in whatever directory it's given. Git operations (worktree creation, per-node commits, SHA +tracking) are valuable but should be composable — register the git hook and you get them, +don't register it and the engine still works fine in a plain directory. + +The recommended workflow becomes: the caller creates a worktree (or the git hook does it on +`run.before`), kilroy executes inside it, the git hook commits per node and tracks state. +But none of that is required — kilroy can run in a temp directory with no git at all. + +Git operations are currently spread across engine.go (~15 callsites), +parallel_handlers.go (branch isolation), and resume.go (SHA validation). + +**Done when:** +- The engine core has zero direct gitutil imports +- A GitHook implements the hook interface and provides all current git behavior +- Runs WITH the GitHook behave identically to current behavior (worktrees, commits, SHAs) +- Runs WITHOUT the GitHook succeed in any directory (no git required) +- Parallel branch isolation works both ways: git hook uses worktrees, no-git uses temp dirs +- The test graph suite from Phase 0 works both with and without the GitHook + +### 2.3 Extract CXDB into Hooks + +**What:** Move CXDB context creation, event emission, and turn appending out of the engine +core and into a CXDB lifecycle hook. + +**Context:** CXDB integration is already mostly nil-safe, but the plumbing (sink creation, +context ID management, turn appending) is still wired directly into the engine. Extracting it +into a hook makes the nil-checking unnecessary — if no CXDB hook is registered, no CXDB code +runs. + +**Done when:** +- The engine core has zero direct CXDB imports +- A CXDBHook implements the hook interface +- Runs with the CXDBHook behave identically to current CXDB behavior +- Runs without the CXDBHook have zero CXDB overhead +- Existing CXDB integration tests pass + +### 2.4 Make Checkpoint/Resume Optional via Hooks + +**What:** Checkpoint and resume become hook-provided capabilities, not engine-mandated +behavior. Some workflows (exploration, report generation, multi-repo searches) don't benefit +from checkpointing. Making it optional removes overhead and complexity for those cases. + +**Context:** Currently the engine always writes checkpoint.json after each node and always +supports resume. For workflows that don't modify a repo or produce incremental artifacts, +this is pure overhead. With the hook system, checkpoint becomes a registered hook that can +be omitted. + +**Done when:** +- Checkpoint writes happen in a hook, not in the engine core +- Runs without a checkpoint hook complete successfully (no checkpoint files written) +- Runs with the checkpoint hook behave identically to current behavior +- Resume requires the checkpoint hook to have been active during the original run +- A clear error message if resume is attempted on a run that had no checkpointing + +### 2.5 Filesystem Change Tracking Hook + +**What:** Provide a hook that captures what filesystem changes each node made during execution. +This is useful for understanding what an agent did without reading its full conversation log. + +**Context:** Currently there's no structured record of what files a node created, modified, +or deleted. The git hook captures this implicitly (via commits), but without git there's no +visibility. A dedicated hook can snapshot before/after and produce a diff. + +**Done when:** +- A filesystem tracking hook can be registered that records file changes per node +- Changes are recorded in the run DB (or a structured output) as: path, operation (create/modify/delete), size +- The hook works independently of git (doesn't require worktrees or commits) +- Change records are visible via `kilroy attractor status --run --node ` + +### 2.6 Extract Artifact Policy into Hooks + +**What:** Move artifact policy resolution, checkpoint exclusion globs, and managed root +configuration out of the engine core and into a hook. + +**Context:** Artifact policies are resolved once per run and applied as configuration to node +execution. Currently this is tightly coupled to the engine's setup phase and config parsing. + +**Done when:** +- Artifact policy is applied via a hook, not hardcoded in the engine +- Runs without the hook use no artifact policy (simple default behavior) +- Runs with the hook behave identically to current behavior +- Configuration for artifact profiles still works through the run config + +--- + +## Phase 2.5: Run Supervisor + +Goal: Introduce a supervisor layer that sits above the graph runner and manages run lifecycle, +health monitoring, blocker detection, and user notification. This is the "babysitter" that +watches runs and knows when to surface issues vs when to stay quiet. + +### Design Context + +Currently the engine tries to be both executor and supervisor. The stall watchdog, loop_restart +circuit breakers, and failure classification heuristics are all supervisor concerns jammed into +the traversal loop. This makes the engine complex and the supervision logic hard to tune. + +The supervisor sits outside the engine. It monitors runs by querying the run DB (Phase 1) and +intervenes through lifecycle hooks (Phase 2). It can manage multiple concurrent runs. Its +policies are configurable: "if stuck for 10 minutes, try a different model; if stuck for 30 +minutes, notify me; if the same node fails 3 times with the same error, stop and ask." + +This is also the layer that fulfills the "quickly determine if user input is needed" goal. +The supervisor watches early run progress and surfaces blockers (missing auth, missing data, +infrastructure failures) to the user as fast as possible, then stays quiet while the run +proceeds unattended. + +### 2.5.1 Supervisor Core and Health Monitoring + +**What:** Build the supervisor as a wrapper around the engine that monitors run health by +polling the run DB and applying configurable policies. + +**Context:** The engine's current stall watchdog (hardcoded thresholds, embedded in the main +loop) is a primitive version of this. The supervisor replaces it with a clean separation: +the engine runs, the supervisor observes and reacts. + +**Done when:** +- A supervisor can start a run and monitor its progress via the run DB +- Configurable health policies: stall detection, repeated failure detection, progress thresholds +- The supervisor can detect and report: stalled nodes, repeated failures on the same node, + runs that haven't made progress in N minutes +- Health status is queryable: `kilroy attractor status --run ` shows supervisor assessment +- The supervisor works with the script-based test graphs (e.g., a graph with a deliberately + slow node triggers stall detection) + +### 2.5.2 Blocker Detection and User Notification + +**What:** The supervisor identifies situations that require user attention and surfaces them +clearly, distinguishing "stuck and retrying" from "truly blocked." + +**Context:** The current system doesn't distinguish between transient failures (worth retrying) +and blockers (need human input). A rate-limited API call looks the same to the user as a +missing database. The supervisor should classify blockers and notify the user with actionable +context. + +**Done when:** +- The supervisor classifies run states: healthy, degraded (retrying), blocked (needs user), + failed (terminal) +- Blocked state includes: what's blocking, which node, what the user can do about it +- A notification mechanism exists (at minimum: CLI output, structured event in DB; extensible + to webhooks, etc.) +- The "quickly determine if user input needed" pattern works: supervisor watches early nodes + and surfaces prerequisite failures (auth, missing resources) within the first few minutes +- Test graphs exercise the blocked detection (e.g., a node that requires a file that doesn't exist) + +### 2.5.3 Self-Introspection + +**What:** The system should be able to explain its own state, decisions, and plan in +human-readable terms. Not just "what happened" but "what is the system's model of what's +happening and why." + +**Context:** Team feedback: "needs to introspect itself." This means the supervisor can +answer questions like: +- "Why are you stuck?" → "Node X has failed 3 times with the same build error. The failure + is deterministic — retrying won't help." +- "What's your plan?" → "I'm on node 4 of 7. Next is verify, then review." +- "What have you tried?" → "Attempt 1: error A. Attempt 2: same. Attempt 3: different error B." + +This is built on the run DB (Phase 1) for data and the supervisor for interpretation. + +**Done when:** +- `kilroy attractor status --run --explain` produces a human-readable narrative of + the run's current state, recent decisions, and what's expected next +- The explanation includes: current node, progress through graph, recent failures with + reasons, what the system plans to do next +- For stuck/failed runs: the explanation diagnoses why (deterministic failure, missing + resource, repeated error) and suggests what the user can do +- The explanation is generated from run DB data, not from parsing log files + +### 2.5.4 Supervisor Intervention Policies + +**What:** Allow the supervisor to take corrective action based on configurable policies, using +lifecycle hooks as the intervention mechanism. + +**Context:** Today, model escalation and retry decisions are hardcoded in the engine. The +supervisor should own these decisions: "this node failed 3 times with model X, try model Y" +or "this run has been stuck for an hour, pause and notify." The engine provides the hooks; +the supervisor decides when and how to use them. + +**Done when:** +- The supervisor can pause and resume a run +- The supervisor can request a model/provider change for a failing node (via hook or context update) +- Policies are configurable (not hardcoded): retry limits, escalation models, stall thresholds, + notification preferences +- A default policy set works out of the box (sane defaults, not overwhelming configuration) +- The engine's current hardcoded stall watchdog and escalation logic can be removed in favor + of supervisor-driven policies + +--- + +## Phase 3: Explicit Routing + +Goal: Replace the multi-mechanism routing system with a single, predictable mechanism. +Every routing decision should be visible in the graph as an explicit edge with an explicit +condition. + +### Design Context + +The current routing system has 4+ interacting mechanisms evaluated in priority order: +1. Condition expressions on edges +2. preferred_label matching +3. suggested_next_ids targeting +4. retry_target chains (node → node fallback → graph → graph fallback) + +Plus implicit fan-out detection (BFS for convergence points), goal gate enforcement at exit, +and failure classification that affects routing eligibility. A graph author cannot predict +which mechanism will fire without mentally simulating the engine. + +The target: **edges have conditions, conditions are evaluated in declaration order, first +match wins.** Retry routing, failure handling, and default paths are all expressed as explicit +edges in the graph. The engine evaluates and logs; the graph declares. + +### 3.1 Consolidate Routing to Condition-Based Edges + +**What:** Make condition-based edge evaluation the single routing mechanism. Remove +preferred_label matching, suggested_next_ids, and implicit weight-based tiebreaking from +the engine's routing logic. + +**Context:** `next_hop.go` implements the 5-step priority cascade. The condition evaluator +in `cond/cond.go` already handles the important cases. The other mechanisms exist as fallbacks +for cases where conditions aren't specified, but they make routing unpredictable. + +**Done when:** +- Edge routing evaluates conditions in edge declaration order, selects first match +- An unconditional edge (no condition attribute) matches anything (serves as default/fallback) +- preferred_label and suggested_next_ids are no longer consulted for routing +- The decision log clearly shows: which edges were candidates, which conditions matched, + which edge was selected +- Test graphs from Phase 0 validate the new routing behavior +- A migration guide documents how to convert existing graphs (retry_target → explicit edges) + +### 3.2 Remove Implicit Fan-Out Detection + +**What:** Remove the engine's BFS-based convergence detection that silently enables or +disables parallel execution based on graph topology. + +**Context:** When a node has multiple eligible outgoing edges, the engine runs `findJoinNode()` +to detect topological convergence. If found, it dispatches parallel branches. If not, it +silently falls back to single-edge selection. This means the same graph structure can produce +either parallel or sequential execution depending on subtle topology differences. + +**Done when:** +- Parallel execution only happens at explicit parallel nodes (shape=component) +- Nodes with multiple outgoing edges where multiple conditions match produce a clear error + or warning, not silent single-edge fallback +- The engine never silently converts sequential routing into parallel execution +- Validation warns when multiple unconditional edges leave a non-parallel node +- Test graphs verify both parallel (explicit) and multi-edge (error) cases + +### 3.3 Retire retry_target Chains + +**What:** Remove the retry_target and fallback_retry_target attributes from nodes and graphs. +Replace with explicit retry edges in the graph. + +**Context:** retry_target is an invisible routing mechanism: when a node fails and no +fail-condition edge matches, the engine checks node.retry_target, then node.fallback_retry_target, +then graph.retry_target, then graph.fallback_retry_target. This 4-level chain is invisible in +the graph and interacts with failure classification in non-obvious ways. + +**Done when:** +- retry_target and fallback_retry_target attributes are no longer consulted by the engine +- Retry routing is expressed as edges: `node_a -> retry_node [condition="outcome=fail"]` +- Validation warns if a node has outgoing edges but no edge handles `outcome=fail` +- A migration guide documents the conversion pattern +- Existing test graphs are migrated to explicit retry edges + +### 3.4 Retire Goal Gate Mechanism + +**What:** Remove the goal_gate enforcement that happens at the exit node. Replace with +explicit edges — if a critical node must succeed before the pipeline can exit, express that +as graph structure, not as a metadata annotation checked at a distance. + +**Context:** goal_gate=true on a node means the engine tracks it and checks at exit time +whether it reached SUCCESS. If not, it routes to the node's retry_target. This is another +invisible routing layer: the graph author marks a node as critical, but the enforcement +happens at the exit node, not at the marked node. With explicit routing (Phase 3.1), the +graph can express "don't proceed unless this succeeded" directly as edge conditions. + +**Done when:** +- goal_gate attribute is no longer consulted by the engine +- Critical-path enforcement is expressed as edge conditions on the relevant nodes +- Validation no longer requires goal_gate nodes to have retry_targets +- A migration guide documents the conversion pattern + +### 3.5 Prompt File Support + +**What:** Support `prompt_file` attribute on nodes to reference external prompt files. Prompts +can be inline (current behavior) or external (file reference) — author's choice per node. + +**Context:** Production graphs embed multi-paragraph prompts as DOT string attributes. This +makes graphs hard to read and hard to diff. Moving prompts to separate files improves +readability without losing the ability to have simple inline prompts for small nodes. The +graph remains the single source of truth for structure; prompt files are referenced by path +and resolved at parse time. + +**Done when:** +- Nodes can use `prompt_file="path/to/prompt.md"` instead of or alongside `prompt="..."` +- prompt_file paths are resolved relative to the graph file's directory +- Validation checks that referenced prompt files exist +- Variable expansion ($goal, etc.) works in prompt files the same as inline prompts +- Inline `prompt` and `prompt_file` can coexist (prompt_file takes precedence, or they + concatenate — decide during implementation) +- Existing graphs with inline prompts continue to work unchanged + +### 3.6 Simplify Fidelity Model + +**What:** Collapse the 6 fidelity modes (full, truncate, compact, summary:low/medium/high) +and 4-level precedence chain into a simpler model that the engine manages automatically. +Graph authors shouldn't need to think about token budgets. + +**Context:** The fidelity system controls how much conversation context carries between nodes. +It has 6 modes, a 4-level precedence chain (edge → node → graph → default), and a separate +thread_id resolution chain. This is infrastructure-level complexity exposed as an authoring +concern. In practice, there are two meaningful modes: "continue the conversation" (the agent +keeps its context) and "fresh start" (new context with a summary of what happened). + +**Done when:** +- The graph-authoring surface is reduced to at most 2-3 fidelity options with clear semantics +- The engine handles context budgeting automatically (compaction, summarization) without + graph author involvement +- The precedence chain is simplified (node attribute or graph default, no edge-level override) +- Thread management is handled by the agent handler, not the engine +- Existing graphs with fidelity attributes still work (backward compat during transition) + +--- + +## Phase 4: Clean Handler Interface + +Goal: Establish a clean boundary between the engine (graph traversal, routing, lifecycle) and +handlers (the work each node does). Handlers own their dependencies. The engine provides +minimal context. + +### Design Context + +Currently the codergen handler receives the full engine, the graph, the logs root, provider +runtimes, the model catalog, and the CXDB sink. This couples handlers to engine internals. +The target: handlers receive node attributes and a run context (key-value store), and return +an outcome. How they invoke LLMs, manage tools, or interpret results is their concern. + +The parallel/fan-in boundary also needs cleanup: the engine owns the concurrency lifecycle +(recognizing parallel structure, managing worker pools, evaluating join policies), while +hooks handle isolation (git worktrees, temp directories) and handlers do the actual branch work. + +### 4.1 Define Minimal Handler Interface + +**What:** Design the handler interface that handlers implement. It should provide enough +context for handlers to do their work without coupling them to engine internals. + +**Context:** The current handler interface varies by handler type. Codergen handlers receive +substantially more context than tool handlers. The goal is a uniform interface that works for +all handler types while allowing handlers to bring their own dependencies via closure or +constructor injection. + +**Done when:** +- A handler interface is defined with a single Execute method +- The method receives: node attributes (read-only), run context (read/write), workspace path, + and a logger/event emitter +- The method returns: an Outcome (status, failure_reason, context_updates) +- The three core handler types (agent, command, gate) implement this interface +- The engine dispatches to handlers through this interface only +- Test graphs validate all three handler types through the new interface + +### 4.2 Reform Status Contract + +**What:** Replace the file-based status contract (LLM writes JSON to a specific path) with +handler-interpreted outcomes. The handler extracts success/fail from the agent's behavior, +not from a file the agent was instructed to write. + +**Context:** Currently, every codergen node's prompt must include instructions to write +status.json to `$KILROY_STAGE_STATUS_PATH` with specific fields. LLMs get the format wrong, +write to the wrong path, or forget entirely. This is the most fragile part of the pipeline — +a typo in a JSON field name silently breaks routing. + +With a reformed contract, the handler owns interpretation: for CLI agents, exit code + output +parsing. For API agents, the final message content. For command/tool nodes, exit code. The +LLM doesn't need to know about status.json at all. The handler reports a structured outcome +to the engine through the handler interface, which writes it to the run DB. + +**Done when:** +- The agent handler can determine success/fail without the LLM writing a status file +- Command handlers use exit code (0 = success, non-zero = fail) +- Agent handlers interpret the agent's final output or behavior to determine outcome +- Status file writing is optional (an agent CAN still write one to provide richer outcome + data like preferred_label or context_updates, but it's not required for basic routing) +- Prompts no longer need the KILROY_STAGE_STATUS_PATH boilerplate +- The run DB records the outcome regardless of how it was determined + +### 4.3 Clean Parallel/Fan-In Boundary + +**What:** Separate the engine's concurrency management from the handler-level isolation and +merge strategies. The engine manages worker pools and join policies. Hooks manage isolation. +Handlers do the branch work. + +**Context:** The parallel handler currently creates git worktrees, clones engine instances, +manages worker pools, evaluates join policies, and handles fan-in winner selection — all in +one handler. The engine-level concerns (concurrency, join policy) should be in the engine. +The isolation concerns (worktrees, temp dirs) should be in hooks. The winner selection should +be in the fan-in handler. + +**Done when:** +- The engine recognizes parallel nodes and manages concurrent branch dispatch +- Join policy evaluation (wait_all, first_success, k_of_n, quorum) is engine-level +- Error policy (continue, fail_fast, ignore) is engine-level +- Branch isolation is provided by lifecycle hooks (git hook creates worktrees; no-git hook + creates temp directories) +- Fan-in winner selection remains handler-level +- Parallel execution works with and without git hooks +- Test graphs with parallel nodes validate the new boundary + +--- + +## Test Strategy + +Every phase is validated with the script-based test graph suite from 0.1. No LLM involvement +in engine validation. + +For each completed task: +1. `go test ./...` passes (existing tests) +2. All test graphs in the suite execute correctly +3. The specific behavior change is demonstrated (e.g., decision log shows edge evaluation + for 0.4, bad credentials fail preflight for 0.5) + +LLM-backed validation (real graphs with real models) happens after Phase 0 is complete and +the engine plumbing is proven solid with deterministic graphs. + +## Progressive Delivery + +This plan is designed to build while the plane is flying: + +- **Phase 0** (8 tasks) — each independently shippable, immediately improves the current system; + includes CLI-only backend shift (opencode for non-claude/codex providers, deprecate API) +- **Phase 1** (3 tasks) — additive; the DB writes alongside existing files, nothing breaks +- **Phase 2** (6 tasks) — decouples concerns; git removed from engine (caller manages worktrees), + CXDB/artifacts into hooks, checkpoint optional +- **Phase 2.5** (4 tasks) — adds the supervisor; run monitoring, blocker detection, + self-introspection, intervention policies +- **Phase 3** (6 tasks) — changes graph semantics and authoring model; requires graph migration +- **Phase 4** (3 tasks) — cleans interfaces; handler contract, status reform, parallel boundary + +At any point, we can pause and use the system as-is. Each completed phase leaves the system +strictly better than before. + +--- + +## Future Work (Beyond This Plan) + +These are known gaps that this plan doesn't address but should be tackled eventually: + +### Remove API Backend Entirely + +Phase 0.8 deprecates the API backend in favor of CLI-only (claude, codex, gemini, opencode). +Once the deprecation has been in place and CLI-only is proven reliable, the API backend code +(`internal/agent/`, `internal/llm/`) can be removed entirely. This is a significant code +reduction that simplifies maintenance and testing. + +### Configuration Surface Simplification + +Zero-config startup (Phase 0.7) addresses the new-user experience, but the config file +itself remains a combinatorial explosion for advanced users. The config model should be +reviewed and simplified: collapse redundant options, remove options that have only one +reasonable value, group related settings, and provide clear documentation of what each +option does and when you'd change it. + +### Graph Authoring Tooling + +Writing DOT files by hand is error-prone. Future tooling could include: a graph visualizer +that shows the execution topology, a graph editor with validation feedback, graph templates +for common patterns (linear+verify, fan-out consensus), and prompt scaffolding that generates +the boilerplate for new nodes. diff --git a/internal/attractor/engine/config.go b/internal/attractor/engine/config.go index f1037dea..1a67bb38 100644 --- a/internal/attractor/engine/config.go +++ b/internal/attractor/engine/config.go @@ -309,9 +309,6 @@ func validateConfig(cfg *RunConfigFile) error { if strings.TrimSpace(cfg.Repo.Path) == "" { return fmt.Errorf("repo.path is required") } - if strings.TrimSpace(cfg.CXDB.BinaryAddr) == "" || strings.TrimSpace(cfg.CXDB.HTTPBaseURL) == "" { - return fmt.Errorf("cxdb.binary_addr and cxdb.http_base_url are required in v1") - } if cfg.CXDB.Autostart.WaitTimeoutMS < 0 { return fmt.Errorf("cxdb.autostart.wait_timeout_ms must be >= 0") } diff --git a/internal/attractor/engine/phase0_tool_graph_test.go b/internal/attractor/engine/phase0_tool_graph_test.go new file mode 100644 index 00000000..72512849 --- /dev/null +++ b/internal/attractor/engine/phase0_tool_graph_test.go @@ -0,0 +1,283 @@ +// Integration tests for script-based (tool-node-only) graphs. +// Validates engine traversal, routing, and failure handling without LLM involvement. + +package engine + +import ( + "context" + "encoding/json" + "os" + "path/filepath" + "strings" + "testing" + "time" + + "github.com/danshapiro/kilroy/internal/attractor/runtime" +) + +func TestToolGraph_Linear(t *testing.T) { + t.Setenv("XDG_STATE_HOME", t.TempDir()) + repo := initTestRepo(t) + logsRoot := t.TempDir() + pinned := writePinnedCatalog(t) + + dot := []byte(`digraph linear { + graph [goal="Test linear traversal"] + start [shape=Mdiamond] + step_a [shape=parallelogram, tool_command="echo step_a_done"] + step_b [shape=parallelogram, tool_command="echo step_b_done"] + step_c [shape=parallelogram, tool_command="echo step_c_done"] + done [shape=Msquare] + start -> step_a -> step_b -> step_c -> done +}`) + cfg := minimalToolGraphConfig(repo, pinned) + ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) + defer cancel() + + res, err := RunWithConfig(ctx, dot, cfg, RunOptions{ + RunID: "linear-test", + LogsRoot: logsRoot, + DisableCXDB: true, + }) + if err != nil { + t.Fatalf("RunWithConfig: %v", err) + } + if res.FinalStatus != runtime.FinalSuccess { + t.Fatalf("expected success, got %q", res.FinalStatus) + } + + // Verify all three steps executed + for _, nodeID := range []string{"step_a", "step_b", "step_c"} { + stdout := filepath.Join(logsRoot, nodeID, "stdout.log") + data, err := os.ReadFile(stdout) + if err != nil { + t.Fatalf("read %s stdout: %v", nodeID, err) + } + if !strings.Contains(string(data), nodeID+"_done") { + t.Fatalf("%s stdout: got %q, want to contain %q", nodeID, string(data), nodeID+"_done") + } + } +} + +func TestToolGraph_LinearVerify_HillClimber(t *testing.T) { + t.Setenv("XDG_STATE_HOME", t.TempDir()) + repo := initTestRepo(t) + logsRoot := t.TempDir() + pinned := writePinnedCatalog(t) + + // Create a counter file in the repo — verify succeeds on 3rd attempt + counterFile := filepath.Join(repo, "attempt_counter") + if err := os.WriteFile(counterFile, []byte("0"), 0o644); err != nil { + t.Fatal(err) + } + commitAll(t, repo, "add counter") + + dot := []byte(`digraph hill_climber { + graph [goal="Test hill-climber verify loop"] + start [shape=Mdiamond] + implement [ + shape=parallelogram, + tool_command="count=$(cat attempt_counter); count=$((count + 1)); echo $count > attempt_counter; echo implemented_attempt_$count" + ] + verify [ + shape=parallelogram, + tool_command="count=$(cat attempt_counter); if [ $count -ge 3 ]; then echo 'all checks pass'; exit 0; else echo 'checks failed on attempt '$count; exit 1; fi" + ] + done [shape=Msquare] + start -> implement + implement -> verify + verify -> done [condition="outcome=success"] + verify -> implement +}`) + cfg := minimalToolGraphConfig(repo, pinned) + ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) + defer cancel() + + res, err := RunWithConfig(ctx, dot, cfg, RunOptions{ + RunID: "hill-climber-test", + LogsRoot: logsRoot, + DisableCXDB: true, + }) + if err != nil { + t.Fatalf("RunWithConfig: %v", err) + } + if res.FinalStatus != runtime.FinalSuccess { + t.Fatalf("expected success, got %q", res.FinalStatus) + } +} + +func TestToolGraph_Conditional(t *testing.T) { + t.Setenv("XDG_STATE_HOME", t.TempDir()) + repo := initTestRepo(t) + logsRoot := t.TempDir() + pinned := writePinnedCatalog(t) + + // process writes a marker file; path_a and path_b check for it + dot := []byte(`digraph conditional { + graph [goal="Test conditional routing"] + start [shape=Mdiamond] + process [shape=parallelogram, tool_command="echo processed"] + check [shape=diamond] + path_a [shape=parallelogram, tool_command="echo took_path_a"] + done [shape=Msquare] + start -> process -> check + check -> path_a + path_a -> done +}`) + cfg := minimalToolGraphConfig(repo, pinned) + ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) + defer cancel() + + res, err := RunWithConfig(ctx, dot, cfg, RunOptions{ + RunID: "conditional-test", + LogsRoot: logsRoot, + DisableCXDB: true, + }) + if err != nil { + t.Fatalf("RunWithConfig: %v", err) + } + if res.FinalStatus != runtime.FinalSuccess { + t.Fatalf("expected success, got %q", res.FinalStatus) + } + + // Verify path_a was taken + stdout := filepath.Join(logsRoot, "path_a", "stdout.log") + data, err := os.ReadFile(stdout) + if err != nil { + t.Fatalf("read path_a stdout: %v", err) + } + if !strings.Contains(string(data), "took_path_a") { + t.Fatalf("path_a stdout: got %q, want to contain 'took_path_a'", string(data)) + } +} + +func TestToolGraph_FailFast(t *testing.T) { + t.Setenv("XDG_STATE_HOME", t.TempDir()) + repo := initTestRepo(t) + logsRoot := t.TempDir() + pinned := writePinnedCatalog(t) + + // A node that fails with no outgoing edge causes the run to fail. + dot := []byte(`digraph fail_fast { + graph [goal="Test failure handling"] + start [shape=Mdiamond] + failing_step [shape=parallelogram, tool_command="echo 'something went wrong' >&2; exit 1"] + done [shape=Msquare] + start -> failing_step -> done +}`) + cfg := minimalToolGraphConfig(repo, pinned) + ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) + defer cancel() + + res, err := RunWithConfig(ctx, dot, cfg, RunOptions{ + RunID: "fail-fast-test", + LogsRoot: logsRoot, + DisableCXDB: true, + }) + if err != nil { + t.Fatalf("RunWithConfig: %v", err) + } + + // The node fails, but the unconditional edge routes to done. + // Verify the node itself recorded a failure. + statusPath := filepath.Join(logsRoot, "failing_step", "status.json") + data, err := os.ReadFile(statusPath) + if err != nil { + t.Fatalf("read status.json: %v", err) + } + var out runtime.Outcome + if err := json.Unmarshal(data, &out); err != nil { + t.Fatalf("parse status.json: %v", err) + } + if out.Status != runtime.StatusFail { + t.Fatalf("expected fail status, got %q", out.Status) + } + _ = res +} + +func TestToolGraph_NoCXDBConfig(t *testing.T) { + t.Setenv("XDG_STATE_HOME", t.TempDir()) + repo := initTestRepo(t) + logsRoot := t.TempDir() + pinned := writePinnedCatalog(t) + + dot := []byte(`digraph no_cxdb { + graph [goal="Test running without CXDB config"] + start [shape=Mdiamond] + step [shape=parallelogram, tool_command="echo works_without_cxdb"] + done [shape=Msquare] + start -> step -> done +}`) + // Config with NO CXDB section at all + cfg := &RunConfigFile{} + cfg.Version = 1 + cfg.Repo.Path = repo + cfg.ModelDB.OpenRouterModelInfoPath = pinned + cfg.ModelDB.OpenRouterModelInfoUpdatePolicy = "pinned" + cfg.LLM.CLIProfile = "test_shim" + + ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) + defer cancel() + + res, err := RunWithConfig(ctx, dot, cfg, RunOptions{ + RunID: "no-cxdb-test", + LogsRoot: logsRoot, + // Note: NOT setting DisableCXDB — relying on empty config being treated as disabled + }) + if err != nil { + t.Fatalf("RunWithConfig: %v", err) + } + if res.FinalStatus != runtime.FinalSuccess { + t.Fatalf("expected success, got %q", res.FinalStatus) + } +} + +func TestToolGraph_WorkspaceLifecycle(t *testing.T) { + t.Setenv("XDG_STATE_HOME", t.TempDir()) + repo := initTestRepo(t) + logsRoot := t.TempDir() + pinned := writePinnedCatalog(t) + + dot := []byte(`digraph workspace { + graph [goal="Test workspace setup and cleanup in flow"] + start [shape=Mdiamond] + setup [shape=parallelogram, tool_command="mkdir -p workspace && echo setup_done > workspace/state.txt"] + work [shape=parallelogram, tool_command="cat workspace/state.txt && echo work_done >> workspace/state.txt"] + verify [shape=parallelogram, tool_command="grep -q work_done workspace/state.txt"] + done [shape=Msquare] + start -> setup -> work -> verify -> done +}`) + cfg := minimalToolGraphConfig(repo, pinned) + ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second) + defer cancel() + + res, err := RunWithConfig(ctx, dot, cfg, RunOptions{ + RunID: "workspace-test", + LogsRoot: logsRoot, + DisableCXDB: true, + }) + if err != nil { + t.Fatalf("RunWithConfig: %v", err) + } + if res.FinalStatus != runtime.FinalSuccess { + t.Fatalf("expected success, got %q", res.FinalStatus) + } +} + +// minimalToolGraphConfig returns a RunConfigFile suitable for tool-node-only graphs. +func minimalToolGraphConfig(repoPath, pinnedCatalogPath string) *RunConfigFile { + cfg := &RunConfigFile{} + cfg.Version = 1 + cfg.Repo.Path = repoPath + cfg.ModelDB.OpenRouterModelInfoPath = pinnedCatalogPath + cfg.ModelDB.OpenRouterModelInfoUpdatePolicy = "pinned" + cfg.LLM.CLIProfile = "test_shim" + return cfg +} + +// commitAll stages and commits all changes in the repo. +func commitAll(t *testing.T, repo, msg string) { + t.Helper() + runCmd(t, repo, "git", "add", "-A") + runCmd(t, repo, "git", "commit", "-m", msg) +} diff --git a/internal/attractor/engine/run_with_config.go b/internal/attractor/engine/run_with_config.go index 9e0ca63f..ebac12bb 100644 --- a/internal/attractor/engine/run_with_config.go +++ b/internal/attractor/engine/run_with_config.go @@ -330,8 +330,8 @@ func bootstrapRunWithConfig(ctx context.Context, dotSource []byte, cfg *RunConfi bin *cxdb.BinaryClient startup *CXDBStartupInfo ) - if !overrides.DisableCXDB { - // CXDB is required in v1 and must be reachable. + cxdbConfigured := strings.TrimSpace(cfg.CXDB.BinaryAddr) != "" && strings.TrimSpace(cfg.CXDB.HTTPBaseURL) != "" + if !overrides.DisableCXDB && cxdbConfigured { var cxdbStartup *CXDBStartupInfo cxdbClient, bin, cxdbStartup, err = ensureCXDBReady(ctx, cfg, opts.LogsRoot, opts.RunID) if err != nil {