
fix(watchdog,session): wipe stale task state on re-run; guard watchdog nudges to current task; promote session agent log to WARNING #23

Open
yoni-bagelman-thenvoi wants to merge 257 commits into band-ai:main from yoni-bagelman-thenvoi:fix/df11-prereqs


@yoni-bagelman-thenvoi


Summary

  • B1: Clear task-room state on same-repo /codeband re-runs before registering the next task, so stale subtask→PR bindings never bleed into a new run. Preserves non-routing workspace artifacts.
  • B3: Promote session agent registration from INFO to WARNING so the coordinator identity is visible in operator-filtered logs (WARNING+ mode).
  • B4: Resolve the current task from the active room pointer before watchdog subtask patrols; skip stale-task subtasks. Prevents watchdog from nudging Mergemaster to merge a PR bound to a different (prior) task.
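A minimal sketch of the B4 guard, assuming the active room pointer can be read as a mapping from room id to current task id and that each subtask row carries the id of the task it belongs to. `Subtask`, `current_task_id`, and `subtasks_to_patrol` are hypothetical names, not codeband's API, and "no pointer means patrol nothing" is an assumed policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Subtask:
    # Hypothetical row shape: which task a subtask (and its PR binding) belongs to.
    subtask_id: str
    task_id: str
    pr_number: Optional[int] = None


def current_task_id(room_pointer: Dict[str, str], room_id: str) -> Optional[str]:
    """Resolve the task the active room currently points at (assumed pointer shape)."""
    return room_pointer.get(room_id)


def subtasks_to_patrol(
    subtasks: List[Subtask], room_pointer: Dict[str, str], room_id: str
) -> List[Subtask]:
    """Keep only subtasks bound to the current task; stale-task subtasks are skipped."""
    task_id = current_task_id(room_pointer, room_id)
    if task_id is None:
        # Assumption: with no active pointer, patrol nothing rather than guess a task.
        return []
    return [s for s in subtasks if s.task_id == task_id]
```

With this filter in front of the patrol, a subtask left over from a prior task (and bound to that task's PR) never reaches the code that nudges Mergemaster.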

These are the safety prerequisites identified in the df#10 dogfood run; they must land before df#11 can proceed.

Test plan

  • pytest tests/test_watchdog_upgrade.py tests/test_task_scoped_identity.py tests/test_watchdog_acceptance_advance_rung.py tests/test_watchdog_backstop_rung.py tests/test_rails_integration.py — 150 pass
  • Full suite pytest -q — 1478 pass (0 failures)

🤖 Generated with Claude Code

yoni-bagelman-thenvoi and others added 30 commits May 31, 2026 13:48
feat(state): FSM + gated handoffs (cb-phase) — RFC Phase 2
…chdog

feat(watchdog): mechanical signals + cycle caps — RFC Phase 3
feat(state): universal rehydration — RFC Phase 4
The WS4 deterministic stall-detection path was dead-on-arrival (caught by
the pre-E2E sweep; all three sweeps triangulated it):

1. watchdog._mark_blocked_via_fsm called fsm.transition() without the
   keyword-only `store=` arg → TypeError, swallowed by the guard → the
   blocked transition + audit row never happened.
2. fsm.VALID_TRANSITIONS had no `watchdog` caller edge, so even with store=
   the call raised InvalidTransitionError. The RFC itself was inconsistent
   (WS2 table omitted watchdog; WS4 required it).

Fix: pass store=self._store, and add the `(any non-terminal, watchdog) →
blocked` wildcard in fsm._is_allowed (mirroring conductor→abandoned). Drop
the now-dead "Phase 2 not merged" ImportError branch + stale logger.

Tests: de-mock test_fsm_transition_called_when_present so it exercises the
REAL FSM (asserts the subtask is durably blocked + audit-logged) — the mock
was masking the bug. Fix test_cycle_cap_marks_blocked_after_no_progress,
which asserted the deferred-suffix that only appeared because of the bug.
Update the RFC transition table to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-fsm-edge

fix(watchdog): make the stall→blocked FSM transition actually fire
…_stderr

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-and-cap

fix(deps): cap click <9 and modernize CliRunner tests off removed mix_stderr
…ydration on real git+sqlite

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on-gate

test(rails): integration gate for deterministic rails (real git + sqlite)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d-cap

feat(fsm): deterministic per-subtask review-round cap
…estration

A CTO-facing, mechanism-level account covering:
- the /codeband command and jam integration (the three Monitors,
  synthesized push in lieu of TeamCreate, room-ownership, the hard-won
  fixes, and where it deviates from stock jam)
- the deterministic-orchestration hardening of codeband (the five
  workstreams, the decide/enforce split, deviations from stock codeband
  and band-of-devs, and the dormant→P5-activation status)
- the 5-minute onboarding skill and plugin distribution
- the broader "protocoled patterns as skills" library this is an instance of

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…empt-cap

feat(fsm): deterministic per-subtask verify-attempt cap
cb-phase verify now emits stable, machine-greppable rejection tags with a
concrete next step and a distinct exit code per failure mode:

  REJECTED [dirty_tree] (exit 2)    — <n> uncommitted files; commit or stash
  REJECTED [no_pr] (exit 3)         — no open PR for branch <b>; push + open PR
  REJECTED [verify_failed] (exit 4) — verify command exit + last ~20 lines
  BLOCKED  [cap_reached] (exit 5)   — verify-attempt cap; escalated to human

The tags feed the verify-gate activation's telemetry later, so they are part
of the contract. Dormant: no prompt calls cb-phase yet.
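One way the tag and exit-code contract above might be encoded so that `cb-phase` and later telemetry read the same strings. `VerifyOutcome`, `format_rejection`, and `parse_tag` are hypothetical helpers; only the tags and exit codes come from the commit message, and the `: ` separator in the rendered line is an assumption.

```python
import re
from enum import Enum
from typing import Optional, Tuple


class VerifyOutcome(Enum):
    # (verdict, tag, exit code) as listed in the commit message.
    DIRTY_TREE = ("REJECTED", "dirty_tree", 2)
    NO_PR = ("REJECTED", "no_pr", 3)
    VERIFY_FAILED = ("REJECTED", "verify_failed", 4)
    CAP_REACHED = ("BLOCKED", "cap_reached", 5)

    def __init__(self, verdict: str, tag: str, exit_code: int) -> None:
        self.verdict = verdict
        self.tag = tag
        self.exit_code = exit_code


def format_rejection(outcome: VerifyOutcome, detail: str) -> str:
    """Render one stable, greppable line: VERDICT [tag] (exit N): detail."""
    return f"{outcome.verdict} [{outcome.tag}] (exit {outcome.exit_code}): {detail}"


_TAG_RE = re.compile(r"^(REJECTED|BLOCKED)\s+\[(\w+)\]")


def parse_tag(line: str) -> Optional[Tuple[str, str]]:
    """Recover (verdict, tag) from a rendered line, e.g. for telemetry."""
    m = _TAG_RE.match(line)
    return (m.group(1), m.group(2)) if m else None
```

Keeping the verdict, tag, and exit code on one enum member means a new failure mode cannot get a tag without also getting a distinct exit code.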

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `cb-phase review <subtask> --approve|--reject`, routing the verdict
through fsm.transition with caller_role=reviewer:
  --approve → review_pending → review_passed
  --reject  → review_pending → review_failed (one failed review round)

Legal ONLY from review_pending; the FSM raises (and writes nothing) from any
other state. This is the structural bind that makes the verify gate
non-bypassable: review_passed is reachable only from review_pending, itself
reachable only via the verify gate — so no path to "approved" skips verify.

Extends the (A) integration gate (real git + sqlite): approve/reject legal
from review_pending; illegal from in_progress / verify_pending / blocked /
merged. Dormant: no prompt calls cb-phase review yet.
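The review-verdict bind reduces to one precondition that is checked before any write. A minimal sketch, with a plain dict standing in for the StateStore; `apply_review_verdict` is a hypothetical name, not the `cb-phase review` implementation.

```python
from typing import Dict


class InvalidTransitionError(Exception):
    pass


def apply_review_verdict(state: Dict[str, str], subtask_id: str, approve: bool) -> str:
    """Route a reviewer verdict; legal only from review_pending."""
    current = state[subtask_id]
    if current != "review_pending":
        # Raise before mutating anything, so an illegal call writes nothing.
        raise InvalidTransitionError(f"review verdict illegal from {current!r}")
    state[subtask_id] = "review_passed" if approve else "review_failed"
    return state[subtask_id]
```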

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When a subtask is in `blocked` — from ANY source (the watchdog's own stall
cap, the cb-phase verify-attempt cap, or the FSM review-round cap) — the
watchdog posts a Band @mention to the owner/CC participant, carrying the
subtask id and the durable blocked reason from the transition log. This is the
primary, auditable signal; the CC-side Monitors remain the fail-safe.

Escalate-once per subtask. DORMANT by default: with no owner_id supplied (the
runner does not pass one pre-activation) the blocked-escalation patrol no-ops,
preserving today's plain room post. Guarded so a store/notify failure never
breaks the patrol loop.
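A sketch of escalate-once with the dormant default, assuming a `notify(owner_id, text)` callable and a mapping of blocked subtask ids to their durable reasons. `BlockedEscalator` and the message wording are hypothetical. Leaving a subtask unmarked when `notify` fails, so the next patrol retries it, is a design assumption the commit message does not spell out.

```python
import logging
from typing import Callable, Dict, List, Optional, Set

log = logging.getLogger("watchdog-escalation-sketch")


class BlockedEscalator:
    def __init__(self, owner_id: Optional[str], notify: Callable[[str, str], None]) -> None:
        self.owner_id = owner_id
        self.notify = notify
        self._escalated: Set[str] = set()  # escalate-once bookkeeping

    def patrol(self, blocked: Dict[str, str]) -> List[str]:
        """Escalate each newly blocked subtask once; return the ids escalated this pass."""
        if self.owner_id is None:
            return []  # dormant: no owner supplied, so no-op
        sent: List[str] = []
        for subtask_id, reason in blocked.items():
            if subtask_id in self._escalated:
                continue
            try:
                self.notify(self.owner_id, f"@{self.owner_id} subtask {subtask_id} is blocked: {reason}")
            except Exception:
                # A notify failure must never break the patrol loop.
                log.warning("escalation failed for %s", subtask_id, exc_info=True)
                continue
            self._escalated.add(subtask_id)
            sent.append(subtask_id)
        return sent
```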

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(p5): Stage 1b wiring — verify errors, reviewer-verdict command, owner escalation
…ted-prompts

chore(proposed): stage integrated prompts + knowledge
…eview_failed

_cmd_verify now accepts subtasks in in_progress (first submit),
review_failed (rework), or verify_pending (retry) states, walking
only legal FSM edges to reach verify_pending before running gates.
Review-round cap rejection from review_failed escalates to blocked
with a structured BLOCKED [review_cap_reached] message.
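A sketch of the entry-state walk, assuming the edges `in_progress → verify_pending` and `review_failed → in_progress → verify_pending` (the latter is the edge named in a later commit in this PR). It checks the review-round count up front, as that later fix does, rather than inferring the cap from a failed transition. `walk_to_verify_pending`, `RejectedEntry`, and the `>=` cap comparison are illustrative assumptions.

```python
from typing import Dict, List

# Legal edge paths into verify_pending, keyed by entry state.
PATHS_TO_VERIFY: Dict[str, List[str]] = {
    "in_progress": ["verify_pending"],                   # first submit
    "review_failed": ["in_progress", "verify_pending"],  # rework
    "verify_pending": [],                                # retry: already there
}


class RejectedEntry(Exception):
    pass


def walk_to_verify_pending(
    state: Dict[str, str], subtask_id: str, review_rounds: int, review_cap: int
) -> str:
    current = state[subtask_id]
    if current not in PATHS_TO_VERIFY:
        # Invalid entry state: reject without touching state.
        raise RejectedEntry(f"cannot enter verify from {current!r}")
    if current == "review_failed" and review_rounds >= review_cap:
        # Proactive cap check: only a real cap hit escalates to blocked.
        state[subtask_id] = "blocked"
        return "BLOCKED [review_cap_reached]"
    for next_state in PATHS_TO_VERIFY[current]:
        state[subtask_id] = next_state  # one legal edge at a time
    return state[subtask_id]
```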

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace all five agent prompts (coder, code_reviewer, planner,
plan_reviewer, conductor) with their integrated versions that
include verify-gate and cb-phase workflow instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add coding-standards, testing, and security guides to
src/codeband/knowledge/ and include knowledge/*.md in package-data
so they ship in the built wheel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add load_knowledge() to prompts.py and inject craft standards into
each runner: full suite (coding-standards, testing, security) for
coders and code reviewers; testing-only for planners and plan
reviewers. Knowledge is appended after roster, before recovery
context.
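A sketch of the per-role knowledge injection described above, assuming one `<name>.md` file per guide in a knowledge directory. The signature of this `load_knowledge`, the role-map shape, and `assemble_prompt` are hypothetical, not the functions in `prompts.py`. Roles not named in the commit message (such as the conductor) get no knowledge here.

```python
from pathlib import Path
from typing import Dict, Tuple

FULL_SUITE = ("coding-standards", "testing", "security")
# Role -> knowledge guides, per the commit message; other roles get none in this sketch.
KNOWLEDGE_BY_ROLE: Dict[str, Tuple[str, ...]] = {
    "coder": FULL_SUITE,
    "code_reviewer": FULL_SUITE,
    "planner": ("testing",),
    "plan_reviewer": ("testing",),
}


def load_knowledge(role: str, knowledge_dir: Path) -> str:
    """Hypothetical signature: concatenate the role's guides from knowledge_dir."""
    names = KNOWLEDGE_BY_ROLE.get(role, ())
    return "\n\n".join((knowledge_dir / f"{name}.md").read_text(encoding="utf-8") for name in names)


def assemble_prompt(base: str, roster: str, knowledge: str, recovery: str) -> str:
    # Order: roster, then knowledge, then recovery context; empty parts are dropped.
    return "\n\n".join(part for part in (base, roster, knowledge, recovery) if part)
```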

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ests

WS5: Update 3 pinned prompt assertions to match integrated prompt
text, add 4 knowledge injection tests, and add 10 verify-gate
integration tests covering in_progress/review_failed entry paths,
cap escalation, count durability, and invalid entry state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, _walk_to_verify_pending treated any InvalidTransitionError
on review_failed → in_progress as a review-cap rejection and
force-escalated to blocked. Now the review_round count is checked
proactively before the transition attempt; non-cap errors surface as
normal rejections without mutating state to blocked.

Adds regression test: test_non_cap_error_does_not_block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-integration-verify-gate

Prompt integration + verify-gate activation
…-from-initiator

feat(watchdog): escalate blocked subtasks to the task initiator
yoni-bagelman-thenvoi and others added 30 commits June 17, 2026 21:22
…t sdk) (#104)

* feat(transport): opt-in jam delivery behind CODEBAND_DELIVERY (default sdk)

Adds a second message-delivery transport for swarm agents that is structurally
immune to the mark-processed-422 cursor-pin, selectable per run by
CODEBAND_DELIVERY (env) / agents.delivery (yaml), defaulting to the current SDK
path. The jam path is opt-in and fully dormant (never imported) when off.

Instead of the SDK ExecutionContext's WebSocket + /next server cursor (which
wedges on a swallowed mark_processed 422), the jam path pulls inbound messages
from the local jam daemon over its wire-stable Unix-socket Control contract — a
durable per-peer queue with non-fatal acks and no head-of-line cursor:

- codeband/transport/jam_control.py: async HTTP/JSON-over-UDS client for jamd's
  Control routes (adopt/inbox/ack/send/reply/ping). ack() never raises on a
  rejection (the swallowed-422 case) — it returns an AckOutcome; the message
  stays queued and other messages keep flowing.
- codeband/transport/jam_runtime.py: JamAgent, exposing the same .run()/.stop()
  contract as thenvoi.Agent. A dispatcher polls inbox and fans messages to
  per-room workers (one reused ExecutionContext each, serial within a room,
  concurrent across rooms — mirrors the SDK, no cross-room head-of-line). Each
  worker reproduces the SDK ExecutionContext semantics that matter: self-message
  filter, MessageRetryTracker budget (attempt recorded before processing),
  context hydration, then the SAME DefaultPreprocessor + adapter.on_event so the
  brain sees an identical AgentInput. Outbound stays on the SDK REST tools
  (unchanged). Onboarding adopts the existing Band agent as a generic Pull peer
  over the socket; a jam-mode startup preflight fails fast if jamd is down.
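The dispatcher and per-room worker shape can be sketched with `asyncio` queues: one queue and one worker task per room, so messages in a room are handled in order while rooms proceed concurrently. This is a simplified illustration, not `JamAgent`. `Message`, `RoomDispatcher`, and the `handle` coroutine are hypothetical, and acks, retries, and context hydration are left out.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict

log = logging.getLogger("jam-dispatch-sketch")


@dataclass
class Message:
    room_id: str
    message_id: str
    sender_id: str
    body: str


class RoomDispatcher:
    """One worker per room: serial within a room, concurrent across rooms."""

    def __init__(self, self_id: str, handle: Callable[[Message], Awaitable[None]]) -> None:
        self.self_id = self_id
        self.handle = handle
        self._queues: Dict[str, asyncio.Queue] = {}
        self._workers: Dict[str, asyncio.Task] = {}

    def dispatch(self, msg: Message) -> None:
        """Route a polled message to its room's queue (must run inside the event loop)."""
        if msg.sender_id == self.self_id:
            return  # self-message filter
        queue = self._queues.get(msg.room_id)
        if queue is None:
            queue = self._queues[msg.room_id] = asyncio.Queue()
            self._workers[msg.room_id] = asyncio.create_task(self._worker(queue))
        queue.put_nowait(msg)

    async def _worker(self, queue: asyncio.Queue) -> None:
        while True:
            msg = await queue.get()
            try:
                await self.handle(msg)
            except Exception:
                # A failing message must not kill the room's worker.
                log.warning("handler failed for %s", msg.message_id, exc_info=True)
            finally:
                queue.task_done()

    async def drain(self) -> None:
        """Wait for every queued message, then stop the workers."""
        await asyncio.gather(*(q.join() for q in self._queues.values()))
        for task in self._workers.values():
            task.cancel()
        await asyncio.gather(*self._workers.values(), return_exceptions=True)
```

Because a slow message only blocks its own room's queue, there is no cross-room head-of-line blocking.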

The brain (FSM/gates/cb-phase/StateStore/watchdog/pool/auth/preflight/doctor)
and the wedge-recovery machinery (#102/#103/watchdog heal rung) are untouched —
they still cover the sdk fallback. Flipping back is flag-only, no code revert.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(transport): async jam preflight + close UDS client on run() exit

Addresses adversarial code-review round 1:
- high: _jam_delivery_preflight called asyncio.run() but run_local/run_agent
  await it from inside a running event loop → RuntimeError. Made the preflight
  async and await it at both call sites; the regression test is now async so it
  exercises the real running-loop startup path.
- medium (leak): JamAgent.run()'s finally now closes the control client via
  stop() on clean (transport-fatal) return as well as cancellation, and a
  close() alias is added for distributed run_agent's teardown — the httpx UDS
  client no longer leaks. Added a test that a transport-fatal run() exit closes
  the client.
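The high-severity finding is standard `asyncio` behavior: `asyncio.run()` refuses to start when an event loop is already running in the thread. A minimal reproduction and the async fix; `ping_daemon` and the two startup functions are hypothetical stand-ins for the jam preflight and its call sites.

```python
import asyncio


async def ping_daemon() -> bool:
    """Stand-in for an async preflight probe (hypothetical)."""
    await asyncio.sleep(0)
    return True


def preflight_sync() -> bool:
    """Buggy shape: a sync wrapper that starts its own event loop."""
    coro = ping_daemon()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # never started; close it to avoid a "never awaited" warning
        raise


async def startup_buggy() -> str:
    # Called from inside a running loop, as run_local/run_agent do.
    try:
        preflight_sync()
        return "ok"
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


async def startup_fixed() -> bool:
    # Fixed shape: the preflight is async and simply awaited.
    return await ping_daemon()
```

A synchronous regression test would call `preflight_sync()` with no loop running and pass, which is why the real test had to become async to exercise the failing path.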

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(transport)+doctor: document jam SDK-internals coupling + add tripwire check

Makes the jam-delivery runtime's coupling to band-sdk internals explicit so it
surfaces at upgrade time, not at first jam run:

- jam_runtime.py module docstring gains an 'SDK-INTERNALS COUPLING — RE-VERIFY ON
  ANY band-sdk BUMP' note listing the internal/private surfaces it depends on
  (ExecutionContext._ensure_fresh_context, ExecutionContext, the adapter
  _thenvoi_agent_id attr, DefaultPreprocessor, MessageRetryTracker, the streaming
  payload models, ThenvoiLink/MessageEvent), most-fragile first.
- doctor.check_jam_delivery_sdk_coupling imports each of those symbols and reports
  a clear, actionable result: OK when present; FAIL if CODEBAND_DELIVERY=jam is
  selected and one moved (the active path is broken); WARN otherwise (opt-in path
  won't work, sdk path unaffected — exit code not tripped for sdk users).
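A sketch of an import-and-resolve tripwire with the OK / FAIL / WARN policy described above. The real check targets band-sdk symbols such as `ExecutionContext._ensure_fresh_context`, whose module paths are not given here, so the usage runs against stdlib modules. `check_sdk_coupling` and `_has_symbol` are hypothetical names.

```python
import importlib
from typing import Iterable, List, Tuple


def _has_symbol(module_name: str, dotted_attr: str) -> bool:
    """True if module_name imports and dotted_attr (e.g. 'Cls.method') resolves on it."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in dotted_attr.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


def check_sdk_coupling(symbols: Iterable[Tuple[str, str]], delivery: str) -> Tuple[str, List[str]]:
    missing = [f"{m}.{a}" for m, a in symbols if not _has_symbol(m, a)]
    if not missing:
        return "OK", []
    # Fail hard only when the jam path is actually selected; otherwise just warn.
    return ("FAIL" if delivery == "jam" else "WARN"), missing
```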

Additive only; the default sdk path is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…th (#105)

* feat(transport): durable per-agent processed-message dedupe for jam path

A JamAgent restart after a failed ack would find the message still queued
in jamd (anti-wedge property) and re-process it — double-send. This adds a
SQLite-backed durable record keyed by (scope, message_id) that survives the
restart and skips re-delivery.

Write point: immediately after adapter.on_event succeeds and BEFORE the ack,
so a failed ack + restart finds the record. Check point: in _process before
record_attempt, so a durable skip never consumes a retry slot. Both points
are non-fatal (log + continue) to preserve anti-wedge semantics and avoid
blocking delivery on a SQLite outage.

In-memory _handled/_inflight guards are untouched (fast path for no-restart
case). No new tables need ALTER TABLE migrations — CREATE TABLE IF NOT EXISTS
in _SCHEMA is idempotent on existing DBs.
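A sketch of the durable dedupe ordering using the stdlib `sqlite3` module: check before the retry counter, record after the handler succeeds and before the ack, and treat SQLite errors as non-fatal. The table name, `ProcessedStore`, and `process` are hypothetical; `handle`, `ack`, and the `attempts` dict stand in for `adapter.on_event`, the jamd ack, and `MessageRetryTracker`. Whether a durable skip should also re-ack is left out of this sketch.

```python
import logging
import sqlite3
from typing import Callable, Dict

log = logging.getLogger("dedupe-sketch")

_SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_messages (
    scope TEXT NOT NULL,
    message_id TEXT NOT NULL,
    PRIMARY KEY (scope, message_id)
)
"""


class ProcessedStore:
    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(_SCHEMA)  # idempotent on an existing DB

    def seen(self, scope: str, message_id: str) -> bool:
        try:
            row = self.conn.execute(
                "SELECT 1 FROM processed_messages WHERE scope = ? AND message_id = ?",
                (scope, message_id),
            ).fetchone()
        except sqlite3.Error:
            log.warning("dedupe lookup failed; processing normally", exc_info=True)
            return False
        return row is not None

    def record(self, scope: str, message_id: str) -> None:
        try:
            with self.conn:  # commits on success
                self.conn.execute(
                    "INSERT OR IGNORE INTO processed_messages (scope, message_id) VALUES (?, ?)",
                    (scope, message_id),
                )
        except sqlite3.Error:
            log.warning("dedupe record failed; continuing", exc_info=True)


def process(store: ProcessedStore, scope: str, message_id: str,
            handle: Callable[[str], None], ack: Callable[[str], bool],
            attempts: Dict[str, int]) -> str:
    if store.seen(scope, message_id):
        return "skipped"  # checked before the retry counter, so no slot is consumed
    attempts[message_id] = attempts.get(message_id, 0) + 1  # attempt recorded first
    handle(message_id)
    store.record(scope, message_id)  # durable write BEFORE the ack
    ack(message_id)  # may fail; a restart will now skip this message
    return "processed"
```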

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: ruff format the two new-code files

The durable-dedupe commit left jam_runtime.py and test_jam_delivery.py out of
ruff format; this reformats them. No logic change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ify-optional

fix: make fresh-init verify opt-in
…x-cleanup

fix: recognize verifier agents during setup cleanup
…rePR

docs: align orchestration drift notes
Finding 1 — CODEBAND_FALLBACK_* child-process leakage:
_resolve_claude/codex_auth() stores the stripped API key in os.environ
under a renamed key. Codex CLI subprocess inherits os.environ, so every
spawned Codex process (and any shell-access bash tool) could read the
original API key. Add _clear_auth_fallbacks() and call it in `cb run`
immediately before agent spawn (whether preflight ran or not).
Tests: four subprocess-boundary tests in TestClearAuthFallbacks verify
the fallback is absent in real child processes after clearing.
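A sketch of clearing the fallback variables and then checking the subprocess boundary. The `CODEBAND_FALLBACK_` prefix comes from the finding; the concrete variable name in the usage (`CODEBAND_FALLBACK_OPENAI_API_KEY`) and the helper names are hypothetical.

```python
import os
import subprocess
import sys
from typing import List, MutableMapping

FALLBACK_PREFIX = "CODEBAND_FALLBACK_"


def clear_auth_fallbacks(environ: MutableMapping[str, str] = os.environ) -> List[str]:
    """Delete every CODEBAND_FALLBACK_* variable so spawned children cannot inherit it."""
    removed = [key for key in environ if key.startswith(FALLBACK_PREFIX)]
    for key in removed:  # collect first, then delete: never mutate while iterating
        del environ[key]
    return removed


def child_sees(var: str) -> str:
    """Ask a real child process what it inherited for `var`."""
    code = f"import os; print(os.environ.get({var!r}, 'absent'))"
    out = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True, check=True)
    return out.stdout.strip()
```

Checking in a real child process, rather than only inspecting `os.environ`, tests the property that matters: what a spawned Codex process or shell tool can actually read.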

Finding 2 — send_task() ignores $WORKSPACE state-path resolver:
kickoff.py manually computed project_dir / workspace_path, bypassing
resolve_workspace_path() which honors the $WORKSPACE env var that Docker
images set. Docker containers would write the task row under a
project-relative path while agents read from $WORKSPACE — split state.
Fix: use resolve_workspace_path() in send_task(). Adds $WORKSPACE
regression test for cb task / send_task() (register-task had one; this
path was unguarded).
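A sketch of the resolution rule behind Finding 2: `$WORKSPACE` wins when set, otherwise a project-relative path is used, and both the writer (`send_task`) and the agents go through the same function. The signature, the `.codeband` default, and `task_state_path` are assumptions, not codeband's `resolve_workspace_path`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_workspace_path(project_dir: Path, environ: Optional[Mapping[str, str]] = None) -> Path:
    env = os.environ if environ is None else environ
    override = env.get("WORKSPACE")
    if override:
        return Path(override)  # Docker images set $WORKSPACE
    return project_dir / ".codeband"  # hypothetical project-relative default


def task_state_path(project_dir: Path, environ: Optional[Mapping[str, str]] = None) -> Path:
    """Writers and readers share this one resolver, so state cannot split."""
    return resolve_workspace_path(project_dir, environ) / "state.db"
```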

Finding 3 — Docker entrypoint prefers OPENAI_API_KEY over ChatGPT sub:
entrypoint.sh checked OPENAI_API_KEY first, using it even when a
mounted ChatGPT subscription auth also existed — the opposite of cb
run's _resolve_codex_auth() subscription-first policy. Swap condition
order so subscription wins when both are present. Adds two meta-tests
asserting the condition order in the script file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sweep flagged 8 ruff findings (F401 unused imports, F841 unused
locals) in test files. All were in tests, none touch production logic.
With these fixed, `ruff check src/ tests/` actually passes, as the tests advertise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sarial-sweep-2026-06-18

fix(security): adversarial sweep — three security findings + ruff cleanup
…lection

A1 (F13): cb approve now exits nonzero and skips room notification when
record_approval_grant returns [] (unbound PR, no request marker). Previously
it posted APPROVED unconditionally even when no durable grant was recorded,
letting the Conductor dispatch Mergemaster on a phantom approval.

A2 (F14): add --no-notify flag so /codeband coordinator flows can record
the durable grant (cb approve --no-notify <N>) then post the notification
as their own jam identity (jam send --as $HANDLE) rather than falling back
to BAND_API_KEY (the human key). The coordinator's notification wording
— "Durable merge grant recorded for PR #N — please proceed" — is distinct
from a bare chat approval; conductor.md now teaches the Conductor to treat
it as a valid approval signal and route/nudge Mergemaster.
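A sketch of the A1 and A2 control flow: no durable grant means a nonzero exit and no room post, and `--no-notify` records the grant without posting so the coordinator can announce it under its own identity. `cb_approve` and its callables are hypothetical; exit code 1 is an assumption (the commit only says "nonzero"), and `record_grant` stands in for `record_approval_grant`.

```python
import sys
from typing import Callable, List


def cb_approve(pr: int, record_grant: Callable[[int], List[str]],
               notify: Callable[[str], None], no_notify: bool = False) -> int:
    grants = record_grant(pr)
    if not grants:
        # No durable grant (unbound PR, no request marker): fail and stay silent.
        print(f"no durable merge grant recorded for PR #{pr}", file=sys.stderr)
        return 1
    if not no_notify:
        notify(f"APPROVED PR #{pr}")
    # With no_notify, the coordinator posts its own message under its jam identity.
    return 0
```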

B-narrow (B2): /codeband's Step 3 now onboards with
jam --session "$JAM_SESSION" (== TEAM, stable per-repo slug) instead of
relying on jam daemon status. The Python peer-resolution in Step 6 is
replaced: instead of picking "first running=true peer from jam list" (which
can bind to an unrelated Lyra session), it calls
jam --session "$JAM_SESSION" status to get the correct handle
deterministically. JAM_SESSION is exported from Step 3 so Step 6 picks it
up. The mechanism is swappable to fresh-per-run by changing the session key
value alone, without re-architecting peer selection.

Tests: +3 new (no-notification when no grant, --no-notify skips send,
positive notify path); fix existing slash-approve test whose mock now
returns a real grant line; update prompt_role_consistency for --no-notify.
record_approval_grant, StateStore.record_merge_approval, and the merge
FSM/SHA gate are untouched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…bution-a1-a2-b-narrow

fix(approval): close F13 ghost-approval, F14 wrong-author, B2 peer selection
Mints a fresh session agent at cb run startup, enrolls it in the active
task room, and deletes it on clean exit. Stale agents from crashed runs
(dead-pid or old-heartbeat) are swept at startup. CODEBAND_SESSION_AGENT_KEY
is set in the process env so runner.py's heartbeat loop fires and
send_room_message posts as the session agent identity rather than the human key.

Key additions:
- session_agent.py: delete_session_agent() (clean-exit cleanup) and
  enroll_session_agent_in_room() (late enrollment for existing rooms)
- cli/__init__.py: _provision_coordinator_identity() async helper + wiring
  in the run command (try/finally for guaranteed cleanup)
- Tests: 12 new tests covering mint/enroll/delete, sweep marker removal,
  crash-recovery (dead-pid stale condition), and full cb run wiring
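A sketch of the stale-agent test (dead pid or old heartbeat) and of try/finally lifecycle cleanup. `pid_alive` uses the POSIX signal-0 probe (`os.kill(pid, 0)`), which is not a liveness check on Windows. `is_stale`, `session_agent`, the agent dict shape, and unsetting `CODEBAND_SESSION_AGENT_KEY` on exit are assumptions for illustration.

```python
import os
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without sending anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def is_stale(pid: int, last_heartbeat: float, now: float, max_age_s: float) -> bool:
    """Stale if the owning process is gone or its heartbeat is too old."""
    return not pid_alive(pid) or (now - last_heartbeat) > max_age_s


@contextmanager
def session_agent(mint: Callable[[], Dict[str, str]],
                  delete: Callable[[Dict[str, str]], None]) -> Iterator[Dict[str, str]]:
    agent = mint()
    os.environ["CODEBAND_SESSION_AGENT_KEY"] = agent["key"]
    try:
        yield agent
    finally:
        delete(agent)  # runs on clean exit and on exceptions
        os.environ.pop("CODEBAND_SESSION_AGENT_KEY", None)
```

The finally block covers clean exits and exceptions; a hard crash (SIGKILL, power loss) skips it, which is why the startup sweep is still needed.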

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ession-agent-identity

feat(coordinator): wire codeband-session-* identity into cb run
Description-only drift now logs a warning and reuses the existing agent credentials. Previously setup-agents deleted and immediately re-registered drifted agents, which rotated live credentials. That was dangerous because active swarms could lose access and Band.ai could reject same-name re-registration before the old name was released.
…le absent

The old no-match fallback told the Conductor to check get_participants() if no peer description matched, but it did not make that check an ordered requirement before declaring a role exhausted. That left room for the Conductor to treat absence from lookup_peers(not_in_chat=room_id) as absence from the platform, even though already-added agents are intentionally excluded from that result.

The new instruction makes the participant lookup mandatory before any absent/exhausted conclusion, requires the role to be missing from both lookup_peers and current room participants before stopping, and explicitly says that a role present in participants but missing from lookup_peers is the normal post-add state and should proceed as successful recruitment.
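The ordered rule the new instruction enforces can be written as a small decision function. This is an illustration of the prompt's logic, not code the Conductor runs. `recruitment_status` is hypothetical, and its two inputs stand for the roles found via `lookup_peers(not_in_chat=room_id)` and via `get_participants()`.

```python
from typing import Iterable


def recruitment_status(role: str, lookup_peer_roles: Iterable[str],
                       participant_roles: Iterable[str]) -> str:
    if role in set(participant_roles):
        # Normal post-add state: lookup_peers(not_in_chat=...) excludes it.
        return "recruited"
    if role in set(lookup_peer_roles):
        return "available"  # can still be added to the room
    return "absent"  # missing from BOTH sources: only now may it be declared exhausted
```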
…o-description-drift-delete

fix(setup-agents): skip destructive delete for description-only drift
…-visibility

fix(conductor): enforce room-participant fallback before declaring role absent
…g nudges to current task; promote session agent log to WARNING

Promote session agent registration to WARNING so the operator-visible coordinator identity is easier to spot in logs.

Clear task-room state on same-repo /codeband reruns before registering the next task, preserving non-routing workspace artifacts.

Resolve the current task from the active room pointer before watchdog subtask patrols, skip stale-task subtasks, and update fixtures to write the canonical pointer.