Findings: autonomous-readiness: plan→vote→implement→log→tune loop for arbitrary goals (8, health: adequate) — system review 2026-05-31 #3164

@williamzujkowski

Findings catalog — autonomous-readiness: plan→vote→implement→log→tune loop for arbitrary goals

From the 2026-05-31 full-codebase review (epic #3143). Domain health: adequate. This issue is the durable, individually-trackable list of findings for this domain; thematic work is tracked under epic #3143 (related phase: #3151).

Findings

  • [HIGH][architecture] Plan→Vote→Implement loop is solid; Tune phase decoupled from execution
    • Evidence: packages/nexus-agents/src/pipeline/dev-pipeline.ts:203-237; consensus-plan.ts:468-495
    • Fix: Wire improvement_review MCP tool outputs (ImprovementSignal[]) directly into the pipeline task decomposition phase: detected signals (routing floor breaches, fitness drops, failure concentration) should auto-create PipelineTask objects and feed into the next cycle's decompose() stage, not just file GitHub issues. Currently, improvement signals only produce issue URLs, with no feedback loop into the pipeline.
  • [HIGH][architecture] Outcome recording is CLI-specific but tuning hooks are missing for strategic routing changes
    • Evidence: packages/nexus-agents/src/orchestration/outcomes/outcome-store.ts:85-102; packages/nexus-agents/src/cli-adapters/composite-router.ts (missing file)
    • Fix: OutcomeStore records outcomes with family/vendor enrichment, and queryByModelWithFamilyFallback() enables warm-starting from a cold start. But there is no mechanism to apply recorded learnings back to routing config (adapting budget constraints, thresholds, or CLI affinity based on observed performance). Add a 'routing tune' stage that reads weather_report, detects pathological patterns (e.g., 'gemini always times out on security category'), and emits routing-policy changes that are applied to the next orchestrate() invocation.
  • [MED][modularity] Belief memory (hindsight) is fire-and-forget with no integration to voting or plan refinement
    • Evidence: packages/nexus-agents/src/pipeline/dev-pipeline.ts:268-334
    • Fix: dev-pipeline applies hindsight records to IHindsightBeliefMemory after execution (line 325), but the consensus_vote and planning stages do NOT consume this memory to inform voting weights or plan reasoning. Hindsight should flow backward: before the next vote, the architect should see 'prior plan approach X failed 3 times last week; consider Y instead'. Add an optional 'hindsight context' parameter to executeConsensusPlan() and plan() stages that retrieves relevant belief updates.
  • [MED][user-journey] Research stage is isolated; research_discover output never feeds into planning/voting
    • Status: ✅ RESOLVED via #3372 ("Pass structured ResearchContext (not just text) to plan/vote — #3258 Option B follow-up"; structured ResearchContext → plan/vote, PR #3806 "feat(research): structured ResearchContext — direct tool calls in the research stage (#3372 increment 1)") and #3234 ("research: MobiMem routing patterns never see research-derived task categories — experience patterns trained on CLI sequences on..."; research-maturity → routing outcomes, PR #3816 "feat(routing): record research-maturity on routing outcomes + measurement surface (#3234)"). Note: the Evidence cites a stale _vendor/ checkout path (see #3695, "chore: audit 2026-05-31 system-review issues for stale _vendor/ checkout paths"); the in-tree path is packages/nexus-agents/src/pipeline/. Reconciled 2026-06-09 under #3696 ("chore: reconcile the 12 system-review finding catalogs — tick resolved sub-findings / split residuals").
    • Evidence: packages/nexus-agents/src/pipeline/central-hub-vision.test.ts:14-19 (documents vision but not implemented); agent-executor.ts (research stage calls research_discover but output is not wired to plan prompts)
    • Fix: The research stage calls research_discover to populate the research context, but the plan() and vote() stages receive only raw research text—not the structured metadata (techniques_extracted, quality_signals, verdict_notes) that would help voters understand research confidence. Pass a ResearchContext object (not just string) through the pipeline containing technique tags, adoption status, and quality signals so voting can weight recommendations by research maturity.
  • [MED][mission-gap] No explicit feedback loop between vote rejection and improvement discovery
    • Evidence: packages/nexus-agents/src/mcp/tools/consensus-vote.ts (records vote outcomes); improvement-review.ts (detects fitness/routing signals); no consumer links the two
    • Fix: When consensus_vote rejects a plan, the rejection reason (DRY_VIOLATION, OVER_ENGINEERING, etc. per ADR 0016) should seed the next improvement_review cycle as domain-specific signals ('DRY violations are common for this task type'; 'OVER_ENGINEERING detected; simplify scope'). Currently rejection reasons are local to the proposal. Add a rejection-signal analyzer that feeds vote feedback into the observability layer.
  • [MED][architecture] Policy gates in V2 Pipeline OS spec exist but no wiring to autonomy loop or real policy decisions
    • Status: ✅ RESOLVED via #3704 ("feat(orchestration): enforce policy at the consensus→execute seam + converge audit sink"), #3705 ("feat(pipeline): activate #3177 stage-boundary policy enforcement (default WARN, #3703)"), and #3710 ("observability: converge policy audit emission (emitPolicyEvent vs emitPolicyEvents) onto one durable sink"), plus #3727 ("observability: durable policy audit (#3710) records violations only — no allow-baseline for the would-block rate") for the would-block-rate denominator. Reconciled 2026-06-09 under #3696 ("chore: reconcile the 12 system-review finding catalogs — tick resolved sub-findings / split residuals").
    • Evidence: docs/v2/04-v2-architecture-pipeline-os.md:40-52; PolicyGateSpec type defined but no consumer in run_pipeline/run_graph_workflow that enforces learned policies
    • Fix: V2 architecture declares PolicyGateSpec between stages, but the actual policy-decision enforcement is structural (gates exist as node types) not learned (gates do not learn from outcomes or fitness signals). Implement adaptive gates: a policy-gate stage should read the OutcomeStore + FitnessAudit, apply a learned policy (e.g., 'if prior similar task failed on security, add security expert to decomposition'), and conditionally proceed or route to remediation. Gates should be data-driven.
  • [MED][modularity] Task routing (CompositeRouter) learns from outcomes but has no persistence or distributed sync
    • Evidence: packages/nexus-agents/src/orchestration/outcomes/outcome-store.ts:1-40 (in-memory, max 10k entries); getOutcomeStore() is process singleton (no distributed state); CLI-adapters routing uses computeQualityReward() on every call (O(N) scan per executeTask per orchestrate invocation)
    • Fix: OutcomeStore is in-memory and process-local, so autonomous multi-agent swarms or remote orchestrate() calls cannot cache or share routing decisions. Add an optional persistent outcome store backend (SQLite, Redis, or append-only JSONL) with a cache layer in CompositeRouter so that routing does not rescan the N most recent outcomes for every task. The lack of persistence currently blocks distributed autonomy and scaling.
  • [MED][correctness] No bounded-iteration safeguard or cost-control loop back from execution to plan approval
    • Evidence: packages/nexus-agents/src/pipeline/dev-pipeline.ts:143-145 (MAX_VOTE_ITERATIONS=3, MAX_QA_ITERATIONS=3 hardcoded); pipeline-tool.ts:87 (dryRun stops after vote but cost/token tracking is not enforced)
    • Fix: Loops have max iterations (vote ≤3, QA ≤3) but no per-task cost accounting or global budget enforcement. If a task is estimated to cost $50 (buildDryRunReport) and actual execution is tracking at $200, the pipeline should interrupt and route to escalation. Add a cost-enforcement stage after each execute/validate that checks actual spend vs. plan estimate and decides proceed/refine/reject based on budget constraints from CompositeRouter.
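
The cost-enforcement check proposed in the last finding could look roughly like the sketch below. It is a minimal illustration, not code from the repository: BudgetPolicy, SpendSnapshot, checkBudget, and the 1.5x / 3x thresholds are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical types: none of these names exist in nexus-agents.
type BudgetDecision = "proceed" | "refine" | "reject";

interface BudgetPolicy {
  /** Overrun ratio (actual / estimate) above which the plan should be refined. */
  refineAboveRatio: number;
  /** Overrun ratio above which execution is interrupted and escalated. */
  rejectAboveRatio: number;
}

interface SpendSnapshot {
  estimatedUsd: number; // assumed to come from the dry-run estimate
  actualUsd: number; // spend accumulated so far in this run
}

// Compare actual spend against the plan estimate and pick the next step.
function checkBudget(spend: SpendSnapshot, policy: BudgetPolicy): BudgetDecision {
  if (spend.estimatedUsd <= 0) {
    // Without a usable estimate there is no ratio to compute; be conservative.
    return spend.actualUsd > 0 ? "refine" : "proceed";
  }
  const ratio = spend.actualUsd / spend.estimatedUsd;
  if (ratio > policy.rejectAboveRatio) return "reject";
  if (ratio > policy.refineAboveRatio) return "refine";
  return "proceed";
}

// The $50 estimate vs. $200 actual case from the finding: ratio 4 exceeds 3.
const examplePolicy: BudgetPolicy = { refineAboveRatio: 1.5, rejectAboveRatio: 3 };
const exampleDecision = checkBudget({ estimatedUsd: 50, actualUsd: 200 }, examplePolicy);
// exampleDecision is "reject"
```

A stage like this would run after each execute/validate step. Where the thresholds should come from (for example, budget constraints held by CompositeRouter) is left open here.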

Composability notes

The primitives (OutcomeStore, executeConsensusPlan, runDevPipeline, improvement_review, weather_report) are individually well-designed and modular. However, the COMPOSITION of these into a closed-loop autonomous cycle is incomplete. Specifically: (1) Improvement signals are produced (improvement_review surfaces issues) but not consumed by the pipeline to auto-generate next-cycle tasks. (2) Hindsight/belief memory flows one direction (outcomes → beliefs) but not backward (beliefs → voting). (3) Research outputs are string-only, not structured metadata, so research quality signals cannot inform voting weights. (4) Policy gates are architectural placeholders in V2 spec but not wired to actual learned policies from outcomes. (5) Routing learner (CompositeRouter) is ephemeral and cannot be distributed or persisted. To achieve true reusable building-block status, each primitive must declare its dependencies (e.g., executeConsensusPlan requires weather_report context, runDevPipeline optionally consumes improvement signals) and the pipeline orchestrator must wire these dependencies before execution. Currently each tool is callable standalone but their integration into a feedback loop is manual/implicit.
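
One way to picture "declare dependencies, then wire before execution" is the sketch below. PrimitiveSpec, wireOrder, and the context keys (weatherContext, approvedPlan, outcomes) are hypothetical; the primitives do not expose such descriptors today. Apart from executeConsensusPlan needing weather_report context, the dependency edges are assumptions made for illustration. Optional inputs are not modeled.

```typescript
// Hypothetical descriptor shape; not part of the nexus-agents codebase.
interface PrimitiveSpec {
  name: string;
  requires: string[]; // context keys that must exist before the call
  provides: string[]; // context keys produced by the call
}

// Returns primitive names in an order where every required key has been
// provided by an earlier primitive (or by `initial`). Throws if some
// requirement can never be satisfied.
function wireOrder(specs: PrimitiveSpec[], initial: string[] = []): string[] {
  const available = new Set<string>(initial);
  const pending = [...specs];
  const order: string[] = [];
  while (pending.length > 0) {
    const idx = pending.findIndex((s) => s.requires.every((k) => available.has(k)));
    if (idx === -1) {
      const stuck = pending.map((s) => s.name).join(", ");
      throw new Error(`Unsatisfied dependencies for: ${stuck}`);
    }
    const next = pending[idx] as PrimitiveSpec;
    pending.splice(idx, 1);
    for (const key of next.provides) available.add(key);
    order.push(next.name);
  }
  return order;
}

// Specs listed out of order on purpose; the edges are illustrative assumptions.
const exampleSpecs: PrimitiveSpec[] = [
  { name: "runDevPipeline", requires: ["approvedPlan"], provides: ["outcomes"] },
  { name: "executeConsensusPlan", requires: ["weatherContext"], provides: ["approvedPlan"] },
  { name: "weather_report", requires: [], provides: ["weatherContext"] },
];
const exampleOrder = wireOrder(exampleSpecs);
// exampleOrder is ["weather_report", "executeConsensusPlan", "runDevPipeline"]
```

With explicit requires/provides declarations, a missing link in the loop surfaces as an error at wiring time instead of remaining an implicit, manual integration step.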

Mission gaps

  • Autonomous loop fails to close: improvement signals (bugs, routing failures, fitness drops) are detected but do NOT auto-create tasks for the next cycle. Signals are filed as GitHub issues (human-driven) but the system cannot independently self-improve by decomposing them.
  • Tuning phase is missing: outcomes are logged and aggregated (weather_report, fitness_score) but there is NO automatic adjustment of orchestration parameters (routing thresholds, budget constraints, policy gates, CLI affinity) in response to observed performance.
  • Distributed autonomy is blocked: OutcomeStore, the composite router, and all learner state are in-memory and process-local. Swarms of remote agents cannot share routing decisions or outcome history.
  • No scope narrowing for arbitrary goals: the pipeline is task-driven (task → plan → implement) but has no automated scope-tightening when plans are rejected or over budget. Vote rejection feedback does not automatically trigger a scope-analysis stage.
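
To make the first gap concrete, here is a minimal sketch of turning improvement signals into next-cycle tasks. Only the type names ImprovementSignal and PipelineTask appear in the findings; every field, the severity scale, the thresholds, and signalsToTasks itself are hypothetical, and the real types may look quite different.

```typescript
// Hypothetical shapes: the actual ImprovementSignal / PipelineTask may differ.
type SignalKind = "routing_floor_breach" | "fitness_drop" | "failure_concentration";

interface ImprovementSignal {
  kind: SignalKind;
  subject: string; // e.g. a CLI name or task category
  severity: number; // assumed scale 0..1, higher is worse
  issueUrl?: string; // the issue that is filed for a human today
}

interface PipelineTask {
  id: string;
  title: string;
  priority: "high" | "medium";
  sourceIssueUrl?: string;
}

const TITLE_BY_KIND: Record<SignalKind, string> = {
  routing_floor_breach: "Investigate routing floor breach",
  fitness_drop: "Investigate fitness drop",
  failure_concentration: "Investigate failure concentration",
};

// Convert signals into tasks: drop weak signals, emit one task per
// (kind, subject) pair, and map severity onto a coarse priority.
function signalsToTasks(signals: ImprovementSignal[], minSeverity = 0.3): PipelineTask[] {
  const seen = new Set<string>();
  const tasks: PipelineTask[] = [];
  for (const s of signals) {
    if (s.severity < minSeverity) continue;
    const id = `${s.kind}:${s.subject}`;
    if (seen.has(id)) continue; // the first occurrence wins
    seen.add(id);
    tasks.push({
      id,
      title: `${TITLE_BY_KIND[s.kind]}: ${s.subject}`,
      priority: s.severity >= 0.7 ? "high" : "medium",
      ...(s.issueUrl !== undefined ? { sourceIssueUrl: s.issueUrl } : {}),
    });
  }
  return tasks;
}

const exampleTasks = signalsToTasks([
  { kind: "fitness_drop", subject: "security", severity: 0.8 },
  { kind: "fitness_drop", subject: "security", severity: 0.9 }, // duplicate, skipped
  { kind: "routing_floor_breach", subject: "gemini", severity: 0.1 }, // below threshold
  { kind: "failure_concentration", subject: "docs", severity: 0.4 },
]);
// exampleTasks holds two tasks: fitness_drop:security (high) and
// failure_concentration:docs (medium).
```

The resulting tasks would be handed to the next cycle's decompose() stage, alongside or instead of filing issues. Deduplication and the severity cutoff keep a noisy review from flooding the pipeline.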

Part of epic #3143. Full review record: docs/archive/system-review-2026-05-31.md.

Labels: enhancement (New feature or request); p2 (Priority 2 - Medium impact, moderate changes needed)
