feat(orchestration): introduce orchestration system#408
Open
Fodesu wants to merge 58 commits into memohai:main from
Conversation
Add internal/orchestrationfacts: a kernel-side consumer that subscribes to the global attempt-fact stream and validates each envelope against Postgres. The consumer never writes; it just classifies envelopes (accepted, stale, orphan, mismatch, invalid) and surfaces structured outcomes through a hook so tests can assert without scraping logs. Wire it through cmd/agent: the constructor returns nil and the FX graph still builds when the bus or queries are missing, which keeps in-process-bus deployments unchanged.
Add a useRunEventStream composable that holds an EventSource against the run watch endpoint, restarts on selection changes, reconnects with gentle backoff after the server closes the stream, and exposes the current connection status. Drive the orchestration inspector off this stream: while the SSE connection is open the heavy poll loop is skipped entirely, and a debounced refetch is scheduled on every event so the live view stays in sync with the kernel without spamming Postgres. The fallback timer is slowed to 5s and is intended only as a safety net for when the bus is unavailable.
Add a partial index on orchestration_human_checkpoints(timeout_at) for open rows, plus ListTimedOutOpen and MarkOrchestrationHumanCheckpointTimedOut queries. Drop the runtime guard that rejected non-zero TimeoutAt and extend the run recovery loop to flip expired checkpoints to timed_out using the configured default action so blocked tasks resume on their own. Update integration coverage: the previous "rejects timeout until supported" test now drives a full timeout cycle end-to-end. Also wire a NotifyOrchestrationVerificationReady query so future verifyd wakeups can piggyback on Postgres LISTEN/NOTIFY.
Drop the legacy DAG components (run-graph.vue / task-node.vue) and the @vue-flow/core dependency now that the inspector renders its own layered DAG inline. Trim model.ts of helpers no longer referenced by the page. Refresh i18n: add named statuses for every lifecycle state the inspector renders, route the run header through statusMeta() so the display string matches localized chips, and use dedicated keys for checkpoint resolve toasts so the run-refresh string stays scoped to its own surface.
Replace the original task-oriented orchestration plan with a staged implementation plan keyed off RFC.md. The new structure tracks Stage 0 (Postgres kernel, already shipped), Stage 1 (NATS bus + outbox + facts + SSE, landed by this branch), and the upcoming Stages 2-5 (blackboard, EnvSession, accounting, experience layer). It also flags the small Stage 1.5 follow-ups (kernel-side fact-driven state transitions, real JetStream integration test, ops doc) so reviewers can see what is intentionally deferred.
Stage 2 of the RFC introduces a reconstructable runtime view that workers,
verifiers, and the orchestrator coordinate through. This change lays the
foundation:
- Value envelope (schema_version, writer_type, writer_id, attempt_id,
claim_epoch, updated_at, persistence_class, payload) so every entry
carries enough metadata for ownership/CAS checks and the future
rebuild-from-Postgres path.
- Key model with Scope (run/task), namespace constants, canonical
bb.{scope}.{owner}.{namespace}[.path...] string form, and prefix
validation.
- Store interface with Get/Put/CompareAndSwap/Delete/List plus an
InMemoryStore implementation used by tests and single-process setups.
- Sentinel errors for not-found, revision conflict, stale writer,
unauthorised writer, CAS-required namespace, and closed store.
The package is self-contained and not yet wired into the kernel; later
commits add the JetStream KV backend, the writer ownership / CAS layer,
and InputManifest dispatch integration.
Adds a Writer wrapper around the blackboard Store that codifies the PLAN.md Stage 2.2 ownership table:
- Workers may only write under their own task scope and may not touch the verifier namespace.
- Verifiers may only write the verifier namespace at run or task scope.
- Orchestrator may write any scope.

result.* writes always require CompareAndSwap. The Writer rejects bare Put on result.* with ErrCASRequired, and CompareAndSwap fences stale writers whose ClaimEpoch is older than the value already in the store (ErrStaleWriter), even when the revision matches.

WriterIdentity bundles the writer_type / writer_id / task_id / attempt_id / claim_epoch the authorisation layer needs and is validated up front by NewWriter. A small Reader wrapper mirrors Writer for symmetry in dependency injection. Backends are still free to expose Store directly.

Tests cover orchestrator/worker/verifier authorisation, foreign-task rejection, cross-namespace rejection, CAS-required Put rejection, stale claim_epoch fencing, revision conflict propagation, and the Reader round-trip.
Implements Store on top of a NATS JetStream KV bucket. The bucket name
defaults to MEMOH_ORCH_BLACKBOARD; the entire orchestration runtime view
lives in one bucket with run/task scopes namespaced inside via the
canonical bb.{scope}.{owner}.{namespace}[.path...] key form.
Revisions are passed straight through to JetStream so CompareAndSwap
maps to the bucket's optimistic-concurrency primitives without any
client-side emulation. Insert (expected==0) maps to KV Create; non-zero
expected maps to KV Update; both surface CAS conflicts as
ErrRevisionConflict regardless of whether the underlying error is
ErrKeyExists or APIError(StreamWrongLastSequence).
List opens a filtered watcher over prefix and prefix.>, drains until
JetStream signals initial-sync complete, and rebuilds canonical Keys
from the raw KV key strings. Delete leaves a NATS tombstone so the
entry is hidden from List and surfaced as ErrNotFound on Get.
A small factory selects in-memory or JetStream based on an empty URL,
mirroring the orchestrationbus factory shape so callers can stay on the
in-memory backend in tests and stand-alone CLI tools.
Round-trip, CAS, list, and delete tests run against an actual NATS
server when TEST_NATS_URL is set (mirrors the TEST_POSTGRES_DSN gating
already used by other Postgres-backed tests). Buckets are deleted on
test cleanup so devenv state stays clean.
…spatch
Wires the kernel into the Stage 2 blackboard runtime view.
- Service.SetBlackboardStore registers an optional Store. When unset,
every blackboard call site short-circuits and behavior is identical
to the pre-Stage-2 kernel; tests and callers that have not opted in
see no change.
- On task completion the kernel publishes
bb.task.{task_id}.result.summary as the orchestrator-class writer.
CAS keeps repeated commits idempotent without threading the worker's
claim epoch through every kernel call site. Postgres remains the
authoritative copy and blackboard publish failures are logged only.
- On task dispatch the kernel snapshots the live revisions of run-scope
context plus the dependencies' result.* keys into
orchestration_input_manifests.captured_blackboard_revisions, so
verifier replay can later reconstruct the same view.
cmd/agent provides the JetStream- or in-memory-backed Store via the
existing [nats] config block and wires it into the kernel through an
fx.Invoke. Single-process deployments and tests stay on the in-memory
backend through the empty-URL factory fallback.
Integration coverage: TestIntegrationBlackboardCaptureAndPublish drives
a full producer/consumer DAG through the kernel and asserts both the
publish path (root + producer result entries appear in the store) and
the capture path (consumer's manifest carries the producer's blackboard
revision).
Adds the recovery primitive and admin entry point that close out the
Stage 2 contract. After a JetStream KV bucket loss, an operator can
invoke this against a quiesced run to repopulate the runtime view from
the authoritative Postgres copy without manual intervention or replay.
Service.RebuildBlackboard validates run access via the standard caller
identity, then walks two Postgres sources:
- orchestration_runs.goal/input/control_policy/source_metadata writes
bb.run.{run_id}.context.goal as the run-scope snapshot a worker would
see at dispatch.
- orchestration_task_results joined with the producing attempt writes
bb.task.{task_id}.result.summary entries, matching the payload shape
the live publish path produces during CompleteAttempt.
Each write goes through the orchestrator-class Writer with CAS against
the current revision, so concurrent live writes still reject the
rebuild attempt instead of silently overwriting fresher state. Per-key
errors are counted in the result and logged but do not abort the
rebuild; callers can compare write_errors vs the totals to decide
whether to retry. When no Store has been wired the call is a no-op
that returns blackboard_configured=false so HTTP callers can detect the
deployment shape from the response rather than guessing.
The new HTTP route POST /orchestration/runs/{run_id}/blackboard/rebuild
is declared on the existing orchestration handler interface, gets a
fake binding in the handler tests, and ships through swag + the
TypeScript SDK regenerated by openapi-ts.
Integration coverage:
- TestIntegrationRebuildBlackboardRecoversAfterStoreWipe wipes the
in-memory store mid-run, calls RebuildBlackboard, and asserts both
run-context and task-result entries reappear with the orchestrator
writer identity.
- TestIntegrationRebuildBlackboardWithoutStoreReturnsConfiguredFalse
pins the no-store contract so deploys without NATS keep returning
the same negative signal.
…ointer
Walk PLAN.md through what actually shipped: namespace layout, writer ownership, frozen InputManifest capture at dispatch, post-commit result.summary publish, and the Postgres-backed RebuildBlackboard with its admin endpoint and SDK binding. Call out the two open follow-ups (JetStream wipe-and-recover blackbox harness, artifact/verifier rebuild) without conflating them with the contract that just landed, and roll the Next Step pointer forward to Stage 3 env sessions since Stage 2 CAS now unblocks moving control transitions onto the bus.
Lay down the five tables that anchor the env session runtime per RFC.md:
- env_resources describes leasable templates (container image, browser context, etc.), tenant-scoped with a capacity.
- env_sessions tracks concrete leased instances and carries lease_epoch fencing analogous to attempt claim_epoch.
- env_lease_reservations carries the admission ticket through reserve/commit/abort so the scheduler can queue when capacity is saturated.
- env_bindings maps a session to the task/attempt currently using it (with a partial unique index that lets a session outlive a single attempt only when held for HITL resume).
- env_snapshots records point-in-time captures keyed by session for verifier replay and drift detection.

Indexes back the lookups the runtime layer will need next: per-resource status scans for capacity decisions, lease expiry scans for the reclaim loop, attempt/run lookups for binding inspection, and the prioritised pending queue on reservations. The active-binding partial unique index encodes the invariant that an active or held session has at most one binding at a time.

The incremental migration in 0082_add_orchestration_env_runtime is mirrored in 0001_init.up.sql per the project rule. Both up and down were exercised end-to-end against a fresh Postgres database (memoh-server migrate up / down / up). sqlc generated Go bindings for each new table and a focused query set covering the create / get / list / update lifecycle plus the lease expiry sweep and the pending reservation queue.

No runtime code uses these tables yet. S3-B introduces the internal/env package and the EnvManager interface that maps these rows into reserve/commit/abort/bind/snapshot/release semantics.
Add the package that wraps the Stage 3 env_* tables behind a thin
runtime surface. The Manager owns the durable state machine —
register a Resource, AcquireSession (which writes a reservation row,
inserts a session row in 'reserved' state, calls a Backend.Allocate
out of band, then commits both), CreateBinding / HoldBinding /
ResumeBinding / ReleaseBinding for task→session attachment, and
CaptureSnapshot / RenewSessionLease / ReleaseSession /
ReclaimExpiredSessions for the rest of the lifecycle. Every state
transition validates a (lease_token, lease_epoch) tuple against the
session row, mirroring the orchestration kernel's claim_epoch model
so a stale holder cannot mutate state behind the kernel's back.
Backend is the small driver-facing contract (Kind / Allocate /
Snapshot / Release). Backends register against a BackendRegistry by
kind; the manager refuses to allocate against an unregistered kind
(ErrBackendUnavailable) so a misconfigured deployment fails loud at
the dispatch boundary rather than silently falling back to an in-memory
shim. NoopBackend ships in this commit for tests and for
single-process deployments that have no real env runtime wired in
yet — Stage 3-C and 3-D add the container and browser drivers.
Notable design choices documented in code:
- Acquire is two-phase against Postgres (reservation+session locked
in step 1, backend.Allocate runs out of band, commit lands in step
2) so a slow backend never holds row locks. A crash between steps
leaves rows the reclaim sweep can finalise.
- ResumeBinding bumps lease_epoch and rotates lease_token — the
fence that separates a held HITL attempt from its resumed
successor.
- Capacity is enforced with a SELECT count + CHECK at the
reservation step. Stage 4 admission queueing will swap that for
fairness scheduling; today the call returns ErrCapacityExceeded
and the caller decides whether to back off.
- Bindings have a partial unique index on (session_id) WHERE status
IN ('active','held'), so the schema enforces the invariant that an
active or held session owns at most one binding at a time.
Renamed the package directory to internal/orchestrationenv after
discovering env/ is .gitignored at the repo root for python venv
conventions; the import path matches the Stage 1 / Stage 2 lineage
(orchestrationbus, orchestrationblackboard).
Tests cover happy-path acquire/release, capacity exceeded, stale
lease rejection on release, hold/resume epoch+token rotation, snapshot
capture and listing, expired-session reclaim, and duplicate binding
rejection. Each test creates its own postgres database and runs the
full migration set so coverage exercises the real schema, including
the deferred root_task_id FK on orchestration_runs.
Wire the orchestrationenv.Backend implementation for KindContainer
resources. The backend depends on a small Runtime interface
(PullImage / CreateContainer / StartContainer / StopContainer /
DeleteContainer / CommitSnapshot) that mirrors the subset of
internal/container.Service used at the env-session layer. Keeping the
dependency narrow means this package stays unit-testable with a fake
runtime — the cmd/agent wiring (Stage 3-E) is what binds it to
containerd, docker, or apple drivers in production.
Behaviour highlights:
- Allocate respects an image_pull_policy (always / if_not_present /
never) before CreateContainer, then starts the workload. A start
failure triggers best-effort DeleteContainer so a half-created
container does not leak the capacity slot.
- Container IDs and storage keys are derived deterministically from
the env_session_id ("envs-<id>"/"envs-rw-<id>") so retries reattach
to the same container/storage and operators can correlate
containers back to env sessions at a glance via the
memoh.orchestration_env.* labels stamped onto every container.
- Snapshot maps onto SnapshotService.CommitSnapshot. Backends that
return container.ErrNotSupported (apple virtualization today) are
surfaced as runtime-ref-only results with an "unsupported": true
flag so the env manager can still record metadata for inspector
views; real failures propagate to the caller.
- Release stops the container, tolerates stop failures (the runtime
may have already exited), and surfaces delete failures so the
manager can decide whether to retry. Container.ErrNotFound at
either step is treated as success since the runtime intent
(gone) is satisfied.
ResourceConfig fields the backend honours today: image / image_ref,
storage_driver / snapshotter, cmd, env, workdir, user. Stage 3-E will
extend this as the kernel adds mount and network requirements.
Tests cover the happy path, missing-image rejection, start-failure
cleanup, snapshot success and unsupported and error paths, release
under stop failure / delete failure / not-found, and the never-pull
policy.
Wire the orchestrationenv.Backend implementation for KindBrowser resources. Allocate POSTs to the existing apps/browser (Bun/Elysia/Playwright) gateway's /session endpoint and persists the returned ws_endpoint, gateway_session_id, and session_token into the env session's runtime handle so the worker can drive Playwright remotely through the gateway. Release closes the gateway session via DELETE /session/:id?token=...

The package depends on a small Gateway interface so unit tests stay independent of the real Bun service. NewHTTPGateway ships the production client and matches the on-the-wire schema documented in apps/browser/src/modules/session.ts; tests round-trip against a real httptest.Server to confirm the request/response shapes match.

Snapshot is intentionally minimal in this stage. The browser gateway exposes no native snapshot endpoint — capturing browser state means playing Playwright scripts against the live session, which is a worker-side action rather than a backend primitive. The backend returns a stable bookkeeping ref (gateway_session_id, ws_endpoint, snapshot kind) marked unsupported=true so the env manager still records snapshot rows for inspector views; Stage 3-I replaces this with real cookie / storage / screenshot capture once drift detection is designed.

Resource config knobs honoured today: core (chromium / firefox), ttl_ms, context_config (passed through verbatim). bot_id sent to the gateway is derived from env_session_id with a configurable "envs-" prefix so env-managed sessions stay distinguishable from real bot sessions in gateway logs and per-bot quotas.

Tests cover happy-path allocate, default-core fallback, gateway error propagation, snapshot bookkeeping, release with present and empty handles, release error propagation, plus HTTP gateway round-trip and HTTP error status handling.
…anifests
Add two JSONB columns that Stage 3-E uses to drive env session
reservation. orchestration_tasks.env_preconditions captures what the
planner declared for the task ("does this task need a container or
browser session, and which kind?"). orchestration_input_manifests.
captured_env_preconditions captures what the kernel actually resolved
at dispatch time, so a verifier replay can rebuild the same lease
context the worker saw.
Both columns default to '{"required": false}' so the column stays
NOT NULL without breaking existing rows or creating a NULL special
case in readers. defaultEnvPreconditionsJSON feeds that sentinel into
every CreateOrchestrationTask and CreateOrchestrationInputManifest
call site (root task, planner-emitted child tasks, dispatch manifest,
plus all integration test seeds), so behaviour is unchanged in this
commit — the columns exist, sqlc has binding code for them, and
nothing reads them yet.
Subsequent S3-E commits add the planner contract, the dispatch hook
that calls envManager.AcquireSession / CreateBinding, the manifest
hash inclusion, the cmd/agent FX wiring, and the admin HTTP CRUD
surface for env resources.
Promote env_preconditions from a deferred Stage 3 idea to a mandatory
field on the planner contract. Every child task the start-run planner
or replanner emits must now declare whether it depends on a leasable
runtime environment. Required=false marks pure-LLM steps and remains
the common case; Required=true demands kind ∈ {container, browser}
and a non-empty resource_name so the kernel can later resolve the
operator-managed env_resource and reserve a session before dispatch.
Domain side:
- internal/orchestration/types.go gains an EnvPreconditions struct
(with Required/Kind/ResourceName/Mode/EffectClass/Metadata) plus
EnvPreconditionsKind* and EnvPreconditionsEffect* constants that
mirror the orchestration_env_resources / action ledger CHECK
vocabulary. PlannedTaskSpec and Task now carry it as a first-class
field, and toTask projects the env_preconditions JSONB column back
into the struct.
- runtime.go threads the planner-supplied value through the
materialise path (replacing the default sentinel from S3-E.1),
through the replacement_plan payload, and through
plannedChildTasksFromSpecs / plannedChildTasksFromReplacementPlan.
decodeEnvPreconditionsObject and normalizeEnvPreconditions
centralise the structural validation: required=false strips every
optional field, required=true enforces kind/resource_name and
rejects unknown keys.
Planner contract side:
- internal/orchestrationexec/start_run_planner.go now requires
env_preconditions on every emitted child task, with
requiredPlannerEnvPreconditions / decodePlannerEnvPreconditions
performing the same validation as the kernel-side decoder.
- The system prompts for both the start-run planner and the
replanner explicitly describe the env_preconditions schema and
remind the model to mark required=true only for tasks that touch
an external runtime.
Tests:
- New llm_runtime_test.go cases cover the happy path, the
env-bound container shape, and three rejection paths (missing
field, invalid kind, required=true without resource_name).
- TestDecodeStartRunPlannerPayloadValidatesChildTasks and
TestDecodeReplanPlannerPayloadUsesStrictChildTaskSchema now pass
env_preconditions through so they exercise the new decoder.
- The blackbox harness LLM stub emits env_preconditions=false on
every fake child task it returns, keeping the runtime smoke
tests green under the tighter contract.
Replan payloads that pre-date Stage 3-E may still omit the field;
plannedChildTaskEnvPreconditions falls back to required=false in that
case so historic replans stay replayable without a data migration.
…mpletion
The kernel now drives the env session runtime end to end. When a planner emits env_preconditions.required=true the dispatcher resolves the resource_name on the EnvManager, acquires a session, creates a primary binding for the attempt, and persists the captured envelope into captured_env_preconditions on the dispatch manifest. CompleteAttempt releases both the binding and the underlying session once the attempt reaches a terminal state, and dispatch aborts run a best-effort compensation against the env manager so a failed dispatch does not strand a session. The reclaim sweep remains the authoritative backstop for any state that loses the race.

EnvManager is exposed as a primitive-typed interface in internal/orchestration so the kernel does not depend on the orchestrationenv package, which lets cmd/agent (S3-E.4) plug in the real Manager without a circular import and lets unit tests substitute a fake. The dispatch path threads an internal envCapture struct through the manifest so a verifier replay can fence stale callers without re-resolving the planner-supplied resource. Tasks with required=false take a fast path that never touches the env manager, keeping pure-LLM dispatches byte-identical with the column default.

Integration coverage exercises both branches against a real database with a fake EnvManager, asserting the call sequence, the lease fencing tuple, the manifest capture, and the post-completion release.
…sweeps
The orchestration kernel now sees a real environment session manager when it runs as part of the unified server. cmd/agent constructs the env backend registry from the same container service and browser gateway the rest of the server uses, instantiates orchestrationenv.Manager from the shared Postgres pool, and adapts it through KernelAdapter so the kernel keeps consuming the primitive orchestration.EnvManager interface declared in S3-E.2.

A periodic reclaim loop sweeps expired sessions on a 30-second cadence so that lease TTLs alone do not strand env_resources after a worker crashes between dispatch and release. The default lease TTL stays at the kernel's 30-minute floor and operators can override it via MEMOH_ORCHESTRATION_ENV_LEASE_TTL_SECONDS without rebuilding.

KernelAdapter lives in internal/orchestrationenv so the kernel never has to import the env package. The container backend opts in via a runtime type assertion against the existing ctr.Service value: deployments whose container service does not satisfy the wider env runtime surface skip container env resources rather than refusing to boot, and deployments without a browser gateway skip the browser backend the same way. Missing backends surface as ErrBackendUnavailable on dispatch, which keeps the misconfiguration visible at the call site instead of papering over it.
Expose tenant-scoped env resource CRUD so operators can manage runtime templates before planner dispatch depends on them.
Classify orchestration actions and attach env session plus snapshot references so env-backed attempts are auditable from dispatch through release.
Hold env bindings across resume_held_env checkpoints and reattach them to the next attempt with a rotated lease so HITL resumes keep the same runtime safely.
Flatten planner and completion tool calls so LLM outputs fail in controlled validation paths, and keep browser automation modeled as an env with context/exclusive modes.
Add the orchestration sidebar, env resource pages, image management views, and updated run details so the frontend matches the expanded orchestration APIs.
Reuse the chat tool-call presentation in the run inspector so orchestration actions stream into a familiar, compact Act tab.
Use orchestration intent terminology across storage, runtime, API contracts, and generated clients so control intents are no longer modeled as planner-only lifecycle state.
Replace the monolithic orchestration run view with Vue Flow based DAG and timeline views, including lane-local time compression and clearer status metadata.
Resolve conflicts between the display/browser cleanup on main and orchestration's run graph and runtime work, including migration renumbering and regenerated API clients.
Use theme-aware neutrals for orchestration chrome while keeping selected tasks readable with a subtle primary accent.
Move orchestration control-plane loops into a dedicated orchestrator process so the API server no longer owns NATS-backed runtime coordination.
Summary
#391
This PR adds the first full orchestration control plane for Memoh.
It introduces a task-oriented runtime on top of the existing User Bot flow: users can start orchestration runs, the system can decompose work into DAG tasks, dispatch task attempts to workers, verify results, recover expired work, and replan failed branches.
Main changes:
workerd and verifyd runtime paths for task execution and verification.
Why
The existing User Bot runtime is good for conversational work and short tool loops, but it is not enough for long-running task execution. Complex tasks need durable state, task decomposition, verification, retries, cancellation, recovery, and replanning.
This PR separates those concerns:
Test plan
TEST_POSTGRES_DSN="$TEST_DB" go test ./cmd/workerd ./cmd/verifyd ./internal/orchestration ./internal/orchestrationexec -count=1