fix(watchdog,session): wipe stale task state on re-run; guard watchdog nudges to current task; promote session agent log to WARNING#23
Open
yoni-bagelman-thenvoi wants to merge 257 commits into
feat(state): FSM + gated handoffs (cb-phase) — RFC Phase 2
…chdog feat(watchdog): mechanical signals + cycle caps — RFC Phase 3
feat(state): universal rehydration — RFC Phase 4
The WS4 deterministic stall-detection path was dead on arrival (caught by the pre-E2E sweep; all three sweeps triangulated it):

1. watchdog._mark_blocked_via_fsm called fsm.transition() without the keyword-only `store=` arg. That raised a TypeError, which the guard swallowed, so the blocked transition and its audit row never happened.
2. fsm.VALID_TRANSITIONS had no `watchdog` caller edge, so even with store= the call raised InvalidTransitionError. The RFC itself was inconsistent (the WS2 table omitted watchdog; WS4 required it).

Fix: pass store=self._store, and add the `(any non-terminal, watchdog) → blocked` wildcard in fsm._is_allowed (mirroring conductor→abandoned). Drop the now-dead "Phase 2 not merged" ImportError branch and the stale logger.

Tests: de-mock test_fsm_transition_called_when_present so it exercises the REAL FSM (asserts the subtask is durably blocked and audit-logged); the mock was masking the bug. Fix test_cycle_cap_marks_blocked_after_no_progress, which asserted the deferred-suffix that only appeared because of the bug. Update the RFC transition table to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
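The two failure modes above can be reproduced in a few lines. This is a minimal sketch, not the project's fsm module: `MemoryStore`, the `TERMINAL_STATES` set, and the two explicit edges are assumptions made for illustration; only the keyword-only `store=` argument and the watchdog wildcard come from the commit message.

```python
from dataclasses import dataclass, field

# Assumption: which states count as terminal is not stated in the commit.
TERMINAL_STATES = {"merged", "abandoned"}

# Illustrative subset of explicit (from_state, to_state, caller_role) edges.
VALID_TRANSITIONS = {
    ("in_progress", "verify_pending", "coder"),
    ("verify_pending", "review_pending", "coder"),
}


class InvalidTransitionError(Exception):
    """Raised when the requested edge is not allowed for this caller."""


@dataclass
class MemoryStore:
    """Hypothetical in-memory stand-in for the durable state store."""
    states: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)


def _is_allowed(from_state, to_state, caller_role):
    # Wildcard edge: the watchdog may block any non-terminal subtask.
    if caller_role == "watchdog" and to_state == "blocked":
        return from_state not in TERMINAL_STATES
    return (from_state, to_state, caller_role) in VALID_TRANSITIONS


def transition(subtask_id, to_state, caller_role, *, store):
    # `store` is keyword-only: omitting it raises TypeError at call time.
    from_state = store.states[subtask_id]
    if not _is_allowed(from_state, to_state, caller_role):
        raise InvalidTransitionError(f"{from_state} -> {to_state} by {caller_role}")
    store.states[subtask_id] = to_state
    store.audit.append((subtask_id, from_state, to_state, caller_role))
```

A caller that wraps `transition(...)` in a broad `except Exception` would hide both the TypeError and the InvalidTransitionError, which is how a mocked test can stay green while the real path never writes anything.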
…-fsm-edge fix(watchdog): make the stall→blocked FSM transition actually fire
…_stderr Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-and-cap fix(deps): cap click <9 and modernize CliRunner tests off removed mix_stderr
…ydration on real git+sqlite Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on-gate test(rails): integration gate for deterministic rails (real git + sqlite)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d-cap feat(fsm): deterministic per-subtask review-round cap
…estration A CTO-facing, mechanism-level account covering: - the /codeband command and jam integration (the three Monitors, synthesized push in lieu of TeamCreate, room-ownership, the hard-won fixes, and where it deviates from stock jam) - the deterministic-orchestration hardening of codeband (the five workstreams, the decide/enforce split, deviations from stock codeband and band-of-devs, and the dormant→P5-activation status) - the 5-minute onboarding skill and plugin distribution - the broader "protocoled patterns as skills" library this is an instance of Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…empt-cap feat(fsm): deterministic per-subtask verify-attempt cap
cb-phase verify now emits stable, machine-greppable rejection tags with a concrete next step and a distinct exit code per failure mode:

  REJECTED [dirty_tree]     (exit 2)  <n> uncommitted files; commit or stash
  REJECTED [no_pr]          (exit 3)  no open PR for branch <b>; push + open PR
  REJECTED [verify_failed]  (exit 4)  verify command exit + last ~20 lines
  BLOCKED [cap_reached]     (exit 5)  verify-attempt cap; escalated to human

The tags feed the verify-gate activation's telemetry later, so they are part of the contract. Dormant: no prompt calls cb-phase yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
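One way to keep tags and exit codes in lockstep is a single enum that both the emitter and any telemetry consumer import. The sketch below is illustrative only: the `VerifyOutcome` enum, `TAG_RE`, and `classify` are hypothetical names, and only the four tag/exit-code pairs are taken from the commit message.

```python
import re
from enum import Enum


class VerifyOutcome(Enum):
    # (verdict, tag name, exit code) as listed in the commit message.
    DIRTY_TREE = ("REJECTED", "dirty_tree", 2)
    NO_PR = ("REJECTED", "no_pr", 3)
    VERIFY_FAILED = ("REJECTED", "verify_failed", 4)
    CAP_REACHED = ("BLOCKED", "cap_reached", 5)

    @property
    def tag(self):
        verdict, name, _ = self.value
        return f"{verdict} [{name}]"

    @property
    def exit_code(self):
        return self.value[2]


# Matches the stable prefix only; the free-text detail after it may change.
TAG_RE = re.compile(r"^(REJECTED|BLOCKED) \[([a-z_]+)\]")


def classify(line):
    """Return the VerifyOutcome for an output line, or None if untagged."""
    match = TAG_RE.match(line)
    if match is None:
        return None
    for outcome in VerifyOutcome:
        if outcome.value[:2] == (match.group(1), match.group(2)):
            return outcome
    return None
```

Matching on the bracketed tag rather than the trailing prose is what lets the human-readable next step be reworded without breaking consumers.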
Add `cb-phase review <subtask> --approve|--reject`, routing the verdict through fsm.transition with caller_role=reviewer:

  --approve: review_pending → review_passed
  --reject:  review_pending → review_failed (one failed review round)

Legal ONLY from review_pending; the FSM raises (and writes nothing) from any other state. This is the structural bind that makes the verify gate non-bypassable: review_passed is reachable only from review_pending, itself reachable only via the verify gate, so no path to "approved" skips verify.

Extends the (A) integration gate (real git + sqlite): approve/reject legal from review_pending; illegal from in_progress / verify_pending / blocked / merged.

Dormant: no prompt calls cb-phase review yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
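The "no path skips verify" claim is a graph property, so it can be checked mechanically. The sketch below assumes a simplified edge set reconstructed from the state names in these commit messages (the real transition table is not shown here): it searches for a path to review_passed while forbidding verify_pending, and finds none.

```python
# Assumed, simplified edge set; the real FSM table may contain more states.
EDGES = {
    "in_progress": {"verify_pending"},
    "verify_pending": {"review_pending"},
    "review_pending": {"review_passed", "review_failed"},
    "review_failed": {"in_progress"},
    "review_passed": set(),
}


def reachable(start, goal, banned=frozenset()):
    """Depth-first search from start to goal that never enters a banned state."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen or node in banned:
            continue
        seen.add(node)
        stack.extend(EDGES.get(node, ()))
    return False


# With the verify gate available, approval is reachable.
can_approve = reachable("in_progress", "review_passed")
# With the verify gate removed from the graph, it is not.
can_bypass = reachable("in_progress", "review_passed", banned={"verify_pending"})
```

The same check can serve as a regression test on the real table: if a future edge makes `can_bypass` true, the gate has become bypassable.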
When a subtask is in `blocked` — from ANY source (the watchdog's own stall cap, the cb-phase verify-attempt cap, or the FSM review-round cap) — the watchdog posts a Band @mention to the owner/CC participant, carrying the subtask id and the durable blocked reason from the transition log. This is the primary, auditable signal; the CC-side Monitors remain the fail-safe. Escalate-once per subtask. DORMANT by default: with no owner_id supplied (the runner does not pass one pre-activation) the blocked-escalation patrol no-ops, preserving today's plain room post. Guarded so a store/notify failure never breaks the patrol loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(p5): Stage 1b wiring — verify errors, reviewer-verdict command, owner escalation
…ted-prompts chore(proposed): stage integrated prompts + knowledge
…eview_failed _cmd_verify now accepts subtasks in in_progress (first submit), review_failed (rework), or verify_pending (retry) states, walking only legal FSM edges to reach verify_pending before running gates. Review-round cap rejection from review_failed escalates to blocked with a structured BLOCKED [review_cap_reached] message. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace all five agent prompts (coder, code_reviewer, planner, plan_reviewer, conductor) with their integrated versions that include verify-gate and cb-phase workflow instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add coding-standards, testing, and security guides to src/codeband/knowledge/ and include knowledge/*.md in package-data so they ship in the built wheel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add load_knowledge() to prompts.py and inject craft standards into each runner: full suite (coding-standards, testing, security) for coders and code reviewers; testing-only for planners and plan reviewers. Knowledge is appended after roster, before recovery context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ests WS5: Update 3 pinned prompt assertions to match integrated prompt text, add 4 knowledge injection tests, and add 10 verify-gate integration tests covering in_progress/review_failed entry paths, cap escalation, count durability, and invalid entry state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, _walk_to_verify_pending treated any InvalidTransitionError on review_failed → in_progress as a review-cap rejection and force-escalated to blocked. Now the review_round count is checked proactively before the transition attempt; non-cap errors surface as normal rejections without mutating state to blocked. Adds regression test: test_non_cap_error_does_not_block. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
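The difference between the old and new behaviour is whether the cap is inferred from an exception or checked up front. A minimal sketch, assuming a hypothetical `MAX_REVIEW_ROUNDS` value and a `transition` callable that stands in for the FSM call:

```python
class InvalidTransitionError(Exception):
    """Stand-in for the FSM's rejection error."""


MAX_REVIEW_ROUNDS = 3  # Assumption: the real cap value is not given here.


def rework_from_review_failed(review_rounds, transition):
    """Return the resulting state label for a rework attempt.

    The cap is checked proactively, so only a genuine cap hit escalates
    to "blocked". Any other FSM rejection is surfaced as "rejected" and
    leaves the stored state untouched.
    """
    if review_rounds >= MAX_REVIEW_ROUNDS:
        return "blocked"
    try:
        transition("review_failed", "in_progress")
    except InvalidTransitionError:
        return "rejected"
    return "in_progress"
```

Catching the error and assuming it means "cap reached" conflates two causes; checking the count first keeps the escalation tied to the one condition that justifies it.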
…-integration-verify-gate Prompt integration + verify-gate activation
…-from-initiator feat(watchdog): escalate blocked subtasks to the task initiator
…t sdk) (#104) * feat(transport): opt-in jam delivery behind CODEBAND_DELIVERY (default sdk) Adds a second message-delivery transport for swarm agents that is structurally immune to the mark-processed-422 cursor-pin, selectable per run by CODEBAND_DELIVERY (env) / agents.delivery (yaml), defaulting to the current SDK path. The jam path is opt-in and fully dormant (never imported) when off. Instead of the SDK ExecutionContext's WebSocket + /next server cursor (which wedges on a swallowed mark_processed 422), the jam path pulls inbound messages from the local jam daemon over its wire-stable Unix-socket Control contract — a durable per-peer queue with non-fatal acks and no head-of-line cursor: - codeband/transport/jam_control.py: async HTTP/JSON-over-UDS client for jamd's Control routes (adopt/inbox/ack/send/reply/ping). ack() never raises on a rejection (the swallowed-422 case) — it returns an AckOutcome; the message stays queued and other messages keep flowing. - codeband/transport/jam_runtime.py: JamAgent, exposing the same .run()/.stop() contract as thenvoi.Agent. A dispatcher polls inbox and fans messages to per-room workers (one reused ExecutionContext each, serial within a room, concurrent across rooms — mirrors the SDK, no cross-room head-of-line). Each worker reproduces the SDK ExecutionContext semantics that matter: self-message filter, MessageRetryTracker budget (attempt recorded before processing), context hydration, then the SAME DefaultPreprocessor + adapter.on_event so the brain sees an identical AgentInput. Outbound stays on the SDK REST tools (unchanged). Onboarding adopts the existing Band agent as a generic Pull peer over the socket; a jam-mode startup preflight fails fast if jamd is down. The brain (FSM/gates/cb-phase/StateStore/watchdog/pool/auth/preflight/doctor) and the wedge-recovery machinery (#102/#103/watchdog heal rung) are untouched — they still cover the sdk fallback. Flipping back is flag-only, no code revert. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(transport): async jam preflight + close UDS client on run() exit Addresses adversarial code-review round 1: - high: _jam_delivery_preflight called asyncio.run() but run_local/run_agent await it from inside a running event loop → RuntimeError. Made the preflight async and await it at both call sites; the regression test is now async so it exercises the real running-loop startup path. - medium (leak): JamAgent.run()'s finally now closes the control client via stop() on clean (transport-fatal) return as well as cancellation, and a close() alias is added for distributed run_agent's teardown — the httpx UDS client no longer leaks. Added a test that a transport-fatal run() exit closes the client. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(transport)+doctor: document jam SDK-internals coupling + add tripwire check Makes the jam-delivery runtime's coupling to band-sdk internals explicit so it surfaces at upgrade time, not at first jam run: - jam_runtime.py module docstring gains an 'SDK-INTERNALS COUPLING — RE-VERIFY ON ANY band-sdk BUMP' note listing the internal/private surfaces it depends on (ExecutionContext._ensure_fresh_context, ExecutionContext, the adapter _thenvoi_agent_id attr, DefaultPreprocessor, MessageRetryTracker, the streaming payload models, ThenvoiLink/MessageEvent), most-fragile first. - doctor.check_jam_delivery_sdk_coupling imports each of those symbols and reports a clear, actionable result: OK when present; FAIL if CODEBAND_DELIVERY=jam is selected and one moved (the active path is broken); WARN otherwise (opt-in path won't work, sdk path unaffected — exit code not tripped for sdk users). Additive only; the default sdk path is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
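The round-1 "high" finding is standard asyncio behaviour and can be reproduced without any project code. In this sketch, `preflight_ok`, `startup_broken`, and `startup_fixed` are hypothetical stand-ins for the preflight and its call sites:

```python
import asyncio


async def preflight_ok():
    """Stand-in for an async preflight check."""
    await asyncio.sleep(0)
    return True


async def startup_broken():
    # Calling asyncio.run() while a loop is already running raises RuntimeError.
    coro = preflight_ok()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


async def startup_fixed():
    # Inside a running loop, await the coroutine directly.
    return await preflight_ok()
```

A synchronous test would never hit the RuntimeError, because no loop is running when it calls the preflight; that is why the regression test has to be async to exercise the real startup path.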
…th (#105) * feat(transport): durable per-agent processed-message dedupe for jam path A JamAgent restart after a failed ack would find the message still queued in jamd (anti-wedge property) and re-process it — double-send. This adds a SQLite-backed durable record keyed by (scope, message_id) that survives the restart and skips re-delivery. Write point: immediately after adapter.on_event succeeds and BEFORE the ack, so a failed ack + restart finds the record. Check point: in _process before record_attempt, so a durable skip never consumes a retry slot. Both points are non-fatal (log + continue) to preserve anti-wedge semantics and avoid blocking delivery on a SQLite outage. In-memory _handled/_inflight guards are untouched (fast path for no-restart case). No new tables need ALTER TABLE migrations — CREATE TABLE IF NOT EXISTS in _SCHEMA is idempotent on existing DBs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * style: ruff format the two new-code files Changed files jam_runtime.py and test_jam_delivery.py were unformatted by ruff standards; fix introduced by the durable-dedupe commit. No logic change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
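The write-before-ack ordering described above can be sketched with the standard sqlite3 module. The table name, the `ProcessedLog` class, and the `deliver` helper are hypothetical; only the (scope, message_id) key, the CREATE TABLE IF NOT EXISTS schema, and the ordering of check, handle, record, ack come from the commit message. The non-fatal error handling around the store is omitted for brevity.

```python
import sqlite3

# Idempotent schema: safe to run against an existing database.
_SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_messages (
    scope TEXT NOT NULL,
    message_id TEXT NOT NULL,
    PRIMARY KEY (scope, message_id)
)
"""


class ProcessedLog:
    """Durable record of handled messages, keyed by (scope, message_id)."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(_SCHEMA)
        self._db.commit()

    def seen(self, scope, message_id):
        row = self._db.execute(
            "SELECT 1 FROM processed_messages WHERE scope = ? AND message_id = ?",
            (scope, message_id),
        ).fetchone()
        return row is not None

    def record(self, scope, message_id):
        self._db.execute(
            "INSERT OR IGNORE INTO processed_messages (scope, message_id) VALUES (?, ?)",
            (scope, message_id),
        )
        self._db.commit()


def deliver(log, scope, message_id, handle, ack):
    """Check, handle, record, then ack; a failed ack leaves the record in place."""
    if log.seen(scope, message_id):
        return "skipped"
    handle(message_id)
    log.record(scope, message_id)  # written BEFORE the ack
    ack(message_id)                # may be rejected; message stays queued
    return "processed"
```

Because the record is committed before the ack, a restart after a rejected ack finds the row and skips the redelivered message instead of handling it twice.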
…ify-optional fix: make fresh-init verify opt-in
…x-cleanup fix: recognize verifier agents during setup cleanup
…rePR docs: align orchestration drift notes
Finding 1: CODEBAND_FALLBACK_* child-process leakage. _resolve_claude/codex_auth() stores the stripped API key in os.environ under a renamed key. The Codex CLI subprocess inherits os.environ, so every spawned Codex process (and any shell-access bash tool) could read the original API key. Add _clear_auth_fallbacks() and call it in `cb run` immediately before agent spawn (whether preflight ran or not). Tests: four subprocess-boundary tests in TestClearAuthFallbacks verify the fallback is absent in real child processes after clearing.

Finding 2: send_task() ignores the $WORKSPACE state-path resolver. kickoff.py manually computed project_dir / workspace_path, bypassing resolve_workspace_path(), which honors the $WORKSPACE env var that Docker images set. Docker containers would write the task row under a project-relative path while agents read from $WORKSPACE, splitting state. Fix: use resolve_workspace_path() in send_task(). Adds a $WORKSPACE regression test for cb task / send_task() (register-task had one; this path was unguarded).

Finding 3: Docker entrypoint prefers OPENAI_API_KEY over the ChatGPT subscription. entrypoint.sh checked OPENAI_API_KEY first, using it even when a mounted ChatGPT subscription auth also existed, which is the opposite of cb run's _resolve_codex_auth() subscription-first policy. Swap the condition order so the subscription wins when both are present. Adds two meta-tests asserting the condition order in the script file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
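Finding 1 rests on the fact that a child process inherits the parent's environment by default. The sketch below shows the leak and the fix at a real subprocess boundary. `clear_auth_fallbacks` and `child_sees` are hypothetical helpers, and `CODEBAND_FALLBACK_DEMO_KEY` is an invented variable name; only the `CODEBAND_FALLBACK_` prefix comes from the commit message.

```python
import os
import subprocess
import sys

FALLBACK_PREFIX = "CODEBAND_FALLBACK_"


def clear_auth_fallbacks(environ=None):
    """Remove every fallback variable and return the names that were removed."""
    environ = os.environ if environ is None else environ
    removed = [name for name in environ if name.startswith(FALLBACK_PREFIX)]
    for name in removed:
        del environ[name]
    return removed


def child_sees(name):
    """Spawn a real child interpreter and report what it reads for `name`."""
    code = "import os, sys; sys.stdout.write(str(os.environ.get(sys.argv[1])))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Checking in an actual child process, rather than inspecting os.environ in the parent, tests the property that matters: what a spawned agent or shell tool can read.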
The sweep flagged 8 ruff findings (F401 unused imports, F841 unused locals) in test files. All were in tests, none touch production logic. Fixes tests that advertise `ruff check src/ tests/` as a passing check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sarial-sweep-2026-06-18 fix(security): adversarial sweep — three security findings + ruff cleanup
…lection

A1 (F13): cb approve now exits nonzero and skips room notification when record_approval_grant returns [] (unbound PR, no request marker). Previously it posted APPROVED unconditionally even when no durable grant was recorded, letting the Conductor dispatch Mergemaster on a phantom approval.

A2 (F14): add a --no-notify flag so /codeband coordinator flows can record the durable grant (cb approve --no-notify <N>) and then post the notification as their own jam identity (jam send --as $HANDLE) rather than falling back to BAND_API_KEY (the human key). The coordinator's notification wording ("Durable merge grant recorded for PR #N — please proceed") is distinct from a bare chat approval; conductor.md now teaches the Conductor to treat it as a valid approval signal and route/nudge Mergemaster.

B-narrow (B2): /codeband's Step 3 now onboards with jam --session "$JAM_SESSION" (== TEAM, stable per-repo slug) instead of relying on jam daemon status. The Python peer-resolution in Step 6 is replaced: instead of picking "first running=true peer from jam list" (which can bind to an unrelated Lyra session), it calls jam --session "$JAM_SESSION" status to get the correct handle deterministically. JAM_SESSION is exported from Step 3 so Step 6 picks it up. The mechanism is swappable to fresh-per-run by changing the session key value alone, without re-architecting peer selection.

Tests: +3 new (no notification when no grant, --no-notify skips send, positive notify path); fix the existing slash-approve test whose mock now returns a real grant line; update prompt_role_consistency for --no-notify. record_approval_grant, StateStore.record_merge_approval, and the merge FSM/SHA gate are untouched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…bution-a1-a2-b-narrow fix(approval): close F13 ghost-approval, F14 wrong-author, B2 peer selection
Mints a fresh session agent at cb run startup, enrolls it in the active task room, and deletes it on clean exit. Stale agents from crashed runs (dead-pid or old-heartbeat) are swept at startup. CODEBAND_SESSION_AGENT_KEY is set in the process env so runner.py's heartbeat loop fires and send_room_message posts as the session agent identity rather than the human key. Key additions: - session_agent.py: delete_session_agent() (clean-exit cleanup) and enroll_session_agent_in_room() (late enrollment for existing rooms) - cli/__init__.py: _provision_coordinator_identity() async helper + wiring in the run command (try/finally for guaranteed cleanup) - Tests: 12 new tests covering mint/enroll/delete, sweep marker removal, crash-recovery (dead-pid stale condition), and full cb run wiring Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ession-agent-identity feat(coordinator): wire codeband-session-* identity into cb run
Description-only drift now logs a warning and reuses the existing agent credentials. Previously setup-agents deleted and immediately re-registered drifted agents, which rotated live credentials. That was dangerous because active swarms could lose access and Band.ai could reject same-name re-registration before the old name was released.
…le absent The old no-match fallback told the Conductor to check get_participants() if no peer description matched, but it did not make that check an ordered requirement before declaring a role exhausted. That left room for the Conductor to treat absence from lookup_peers(not_in_chat=room_id) as absence from the platform, even though already-added agents are intentionally excluded from that result. The new instruction makes the participant lookup mandatory before any absent/exhausted conclusion, requires the role to be missing from both lookup_peers and current room participants before stopping, and explicitly says that a role present in participants but missing from lookup_peers is the normal post-add state and should proceed as successful recruitment.
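The ordering rule in the new instruction can be written as a small decision function. This is a sketch of the logic only, with a hypothetical `role_status` helper and plain sets standing in for the results of lookup_peers(not_in_chat=room_id) and get_participants():

```python
def role_status(role, peers_not_in_room, room_participants):
    """Classify a role using BOTH lookups before concluding it is absent.

    peers_not_in_room: roles returned by the peer lookup, which excludes
        agents that are already in the room.
    room_participants: roles currently present in the room.
    """
    if role in peers_not_in_room:
        return "recruit"        # available and not yet added
    if role in room_participants:
        return "already_added"  # normal post-add state: proceed
    return "absent"             # missing from both lookups: stop
```

The middle branch is the case the old wording left open: a role that vanishes from the peer lookup immediately after being added is a success, not exhaustion.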
…o-description-drift-delete fix(setup-agents): skip destructive delete for description-only drift
…-visibility fix(conductor): enforce room-participant fallback before declaring role absent
…g nudges to current task; promote session agent log to WARNING Promote session agent registration to WARNING so the operator-visible coordinator identity is easier to spot in logs. Clear task-room state on same-repo /codeband reruns before registering the next task, preserving non-routing workspace artifacts. Resolve the current task from the active room pointer before watchdog subtask patrols, skip stale-task subtasks, and update fixtures to write the canonical pointer.
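The watchdog guard amounts to filtering subtasks by the task the active room pointer resolves to. A minimal sketch, assuming a hypothetical `Subtask` record with a `task_id` field, and assuming that an unresolved pointer means nothing is patrolled (the actual behaviour in that case is not stated here):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    """Hypothetical subtask row; field names are illustrative."""
    subtask_id: str
    task_id: str
    state: str


def subtasks_to_patrol(subtasks, current_task_id):
    """Keep only subtasks that belong to the current task.

    Assumption: with no resolvable current task, nothing is patrolled,
    so a stale row can never receive a nudge.
    """
    if current_task_id is None:
        return []
    return [s for s in subtasks if s.task_id == current_task_id]
```

Resolving the task id once per patrol, before iterating, keeps every nudge in that pass scoped to the same run.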
Summary
Clears task-room state on same-repo /codeband re-runs before registering the next task, so stale subtask→PR bindings never bleed into a new run. Preserves non-routing workspace artifacts.

These are the safety prerequisites identified in the df#10 dogfood run before df#11 can proceed.
Test plan
pytest tests/test_watchdog_upgrade.py tests/test_task_scoped_identity.py tests/test_watchdog_acceptance_advance_rung.py tests/test_watchdog_backstop_rung.py tests/test_rails_integration.py - 150 pass
pytest -q - 1478 pass (0 failures)

🤖 Generated with Claude Code