Releases: j-zuilkowski/merlin

Merlin v2.4.0

12 Jun 12:23

Release gate status: gates #1-#16 completed.

This release publishes the v2.4.0 evidence-backed build. The attached evidence report summarizes the full release battery, and the repository now includes the public release notes and README screenshot assets on main.

Electronics/KiCad boundary: the electronics domain is released as evidence-gated workflow infrastructure. It includes deterministic KiCad generation, routing, DRC/SPICE/fab gates, and visual KiCad evidence. It is not a blanket fabrication-ready claim for every generated board; high-stakes signoff remains explicitly gated.

Screenshots

Merlin multi-project workspace with the electronics domain active

Provider configuration in Merlin Settings

Provider slot routing with the electronics domain selected

Generated KiCad schematic opened in KiCad Schematic Editor

Generated KiCad PCB opened in KiCad PCB Editor

Generated KiCad board opened in KiCad 3D Viewer

Generated routed board layer composite

Attached Assets

Release assets include REPORT.md, RELEASE-RUN.md, the Merlin UI screenshots, and the KiCad screenshots.

v2.2.5 — Repetition-stall escalation rung + E2E robustness

20 May 03:57

Patch release. The escalation feature shipped this round adds a new
capability-failure rung — EscalationReason.repetitionStall — that
detects a model emitting the same response verbatim (now including
identical tool-call signatures) across a 6-turn window and routes
straight to the designated stronger provider, skipping refinement. The
fingerprint is conservative: a productive model varies either its
narration or its tool-call args, so only a genuine loop trips it.
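
The stall check described above can be sketched as a small sliding-window detector. This is a minimal illustration, not Merlin's implementation: `TurnFingerprint` and `RepetitionStallDetector` are hypothetical names, and the sketch assumes the rung trips only when all six turns in the window carry an identical fingerprint.

```swift
// Hypothetical types for illustration; the release notes name only
// EscalationReason.repetitionStall.
struct TurnFingerprint: Equatable {
    let narration: String
    let toolCallSignatures: [String]   // e.g. "read_file(path=a.swift)"
}

struct RepetitionStallDetector {
    let window: Int
    private var recent: [TurnFingerprint] = []

    init(window: Int = 6) { self.window = window }

    /// Records one turn and returns true once the last `window` turns are identical.
    mutating func record(_ turn: TurnFingerprint) -> Bool {
        recent.append(turn)
        if recent.count > window { recent.removeFirst() }
        guard recent.count == window, let first = recent.first else { return false }
        return recent.allSatisfy { $0 == first }
    }
}
```

Because the fingerprint includes the tool-call signatures, a model that repeats its narration but changes its tool arguments does not trip the detector.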

Five other defects fixed alongside, each caught by S1's end-to-end run:

  • EvalShell.run had no timeout. A transient filesystem stall once
    hung the proving suite for 40 minutes; now bounded by a watchdog and
    SIGKILLed on timeout.
  • LiveShellRunner deadlocked on its pipe (read after wait) AND had no
    timeout, so xcodebuild test hung the critic for the full 1800 s
    test window. It now drains the pipe on a background queue with a
    300 s deadline.
  • Fixture extraction no longer chdir's into ~/Documents via git -C;
    uses --git-dir from a temp cwd, sidestepping the TCC getcwd wedge
    on a freshly-rebuilt ad-hoc-signed test host.
  • cannotDecompose on a preflight overflow now routes only to a
    provider whose usableInputTokens actually fits minContextRequired,
    instead of the strongest capability target regardless of budget.
  • consecutiveCriticFailures bumps only when the escalation truly
    gives up, not on every routed-provider retry exhaustion — fixes the
    circuit-breaker double-counting.
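
The pipe deadlock and missing timeout from the first two fixes follow a common pattern: a child that fills the pipe buffer blocks on write, and a parent that reads only after waiting never gets there. A minimal sketch of the drain-plus-deadline approach is below. `runWithDeadline`, `ShellResult`, and `OutputBox` are hypothetical names, not Merlin's EvalShell or LiveShellRunner API, and the sketch assumes the child does not leave grandchildren holding the pipe open.

```swift
import Foundation

// Hypothetical helper types for illustration only.
final class OutputBox: @unchecked Sendable { var data = Data() }

struct ShellResult {
    let output: String
    let timedOut: Bool
}

func runWithDeadline(_ executable: String, _ arguments: [String],
                     deadline: TimeInterval) throws -> ShellResult {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: executable)
    process.arguments = arguments
    let pipe = Pipe()
    process.standardOutput = pipe
    process.standardError = pipe

    let exited = DispatchSemaphore(value: 0)
    process.terminationHandler = { _ in exited.signal() }
    try process.run()

    // Drain concurrently so the child never blocks on a full pipe buffer.
    let box = OutputBox()
    let drained = DispatchSemaphore(value: 0)
    let handle = pipe.fileHandleForReading
    DispatchQueue.global().async {
        box.data = handle.readDataToEndOfFile()
        drained.signal()
    }

    var timedOut = false
    if exited.wait(timeout: .now() + deadline) == .timedOut {
        timedOut = true
        _ = kill(process.processIdentifier, SIGKILL)  // hard stop on deadline
        exited.wait()
    }
    drained.wait()  // the read ends once the child's end of the pipe closes
    return ShellResult(output: String(decoding: box.data, as: UTF8.self),
                       timedOut: timedOut)
}
```

For example, `try runWithDeadline("/bin/sleep", ["30"], deadline: 0.5)` returns with `timedOut` set after about half a second instead of blocking for 30 seconds.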

Plus a documented local-signing strategy: MerlinTests-Live test
invocations now use the project's Merlin Dev Signing identity so the
macOS TCC Full Disk Access grant survives rebuilds; compile gates and
CI keep CODE_SIGNING_ALLOWED=NO as before. See CLAUDE.md and
merlin-eval/HANDOFF.md for the runbook.

Verified against the full proving suite: S1 passes legitimately in
1240 s (preflightOverflow → DeepSeek handoff fixes TaskBoard, its
xcodebuild test green at the end). All 1828 unit tests pass; both
schemes compile clean.

v2.2.4 — Context-overflow fix, tool detection, vision launchpad

16 May 03:13

Merlin v2.2.4

Summary

v2.2.4 makes the provider context-overflow class of failures structurally
impossible, adds first-use detection of missing external tools, lets you target a
specific loaded local model per role slot, and introduces vision.md as the first
artifact of the Project Discipline pipeline.

What's new

  • Context-overflow HTTP 400s are fixed at the source. Three layers, end to end:
    tool output (run_shell, read_file) is capped before it can enter the model
    context (phase 284); the per-request budget is discovered from the active model's
    real context window — queried live for local runners and OpenRouter, learned from
    the first 400 and persisted for commercial providers (phase 285); and every LLM
    request on every engine path — planner, critic, subagents, summariser, memory,
    KAG, vision — is sized to fit the provider window before it is sent, not just the
    main turn loop (phase 286).
  • Local model picker. When a local runner has several models loaded, each can be
    assigned to a role slot directly from the chat HUD and the slot picker (phase 283).
  • Missing-tool detection. When a feature needs an external CLI tool that is not
    installed, Merlin detects it on first use and offers a one-click brew install for
    the Homebrew-safe tools, or shows the install command/URL for the rest — instead of
    a raw "command not found" (phase 287).
  • Vision launchpad. vision.md is now the first artifact of the discipline
    pipeline — vision → architecture → phase → code. project:init seeds it,
    project:adopt incorporates an existing one, project:revise grows and promotes
    ideas from it (phase 288).
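
The first layer of the context-overflow fix, capping tool output before it enters the model context, can be illustrated with a short sketch. The notes do not say how the cap is applied, so the head-and-tail strategy, the character-based limit, and the `capToolOutput` name are all assumptions.

```swift
/// Caps tool output (hypothetical helper). Keeps the head and tail of the
/// text and replaces the middle with a marker stating how much was elided.
func capToolOutput(_ text: String, maxCharacters: Int) -> String {
    precondition(maxCharacters > 0, "cap must be positive")
    guard text.count > maxCharacters else { return text }
    let headCount = maxCharacters / 2
    let tailCount = maxCharacters - headCount
    let elided = text.count - maxCharacters
    return String(text.prefix(headCount))
        + "\n[... \(elided) characters elided ...]\n"
        + String(text.suffix(tailCount))
}
```

The result is slightly longer than the cap because of the marker line; a real implementation would presumably budget for that and count tokens rather than characters.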

Internal changes

  • New types: ToolOutput, ContextBudgetResolver / ContextBudgetStore,
    PreflightGuard, ToolRequirement / ToolRequirements / ToolRequirementChecker.
  • All 14 provider.complete send sites now route through PreflightGuard.
  • Learned context windows persist to ProviderConfig.budget in providers.json,
    the same field a manually-entered budget uses.

Migration

None. No configuration changes are required; context-budget discovery and tool
detection are automatic.

v2.2.3 — Built-in Skill Installation Fix

15 May 21:43

Merlin v2.2.3 — Built-in Skill Installation Fix

Released: 2026-05-15

Summary

v2.2.3 fixes built-in skill installation. The Merlin/Skills/Builtin/ directory is now
bundled inside the app, so a fresh install ships every skill and installs them to
~/.merlin/skills/ on first launch — on any machine, not just the machine the app was
built on.

What's new

  • All 13 built-in skills now ship inside the app bundle: the 8 core skills (commit,
    debug, explain, plan, refactor, review, summarise, test) and the 5
    project:* discipline skills (project:init, project:phase, project:revise,
    project:release, project:adopt).
  • installBuiltinSkills() copies any missing skill to ~/.merlin/skills/ at launch;
    skills already present — including ones you have customised — are left untouched.
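
The copy-if-missing behaviour can be sketched as follows. This is an illustration under assumptions: `installMissingSkills(from:to:)` is a hypothetical stand-in, and the real installBuiltinSkills() may differ in signature and details.

```swift
import Foundation

/// Copies each skill folder from `bundled` into `installed` unless a folder
/// with the same name already exists there. Returns the names that were copied.
func installMissingSkills(from bundled: URL, to installed: URL) throws -> [String] {
    let fm = FileManager.default
    try fm.createDirectory(at: installed, withIntermediateDirectories: true)
    var copied: [String] = []
    for source in try fm.contentsOfDirectory(at: bundled, includingPropertiesForKeys: nil) {
        let destination = installed.appendingPathComponent(source.lastPathComponent)
        // An existing folder, customised or not, is left untouched.
        if fm.fileExists(atPath: destination.path) { continue }
        try fm.copyItem(at: source, to: destination)
        copied.append(source.lastPathComponent)
    }
    return copied.sorted()
}
```

Skipping at folder granularity is what keeps a customised skill intact: nothing inside an existing skill folder is ever overwritten.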

Internal changes

  • project.yml adds Merlin/Skills/Builtin as a folder-reference resource on the
    Merlin target, so the directory is copied into Merlin.app/Contents/Resources/Builtin/.
    Previously the directory was excluded from the target and never bundled —
    installBuiltinSkills() only resolved its input via a build-machine #filePath
    fallback, so a distributed build installed no skills at all.
  • The 5 project:* SKILL.md files are now version-controlled in
    Merlin/Skills/Builtin/ rather than living only in ~/.merlin/ and in phase files.

Migration

  • No user data migration required. installBuiltinSkills() skips any skill folder that
    already exists in ~/.merlin/skills/, so existing and customised skills are preserved.

v2.2.2 — Project Discipline: CI Readiness & Regression Fixes

15 May 20:27

Merlin v2.2.2 — Project Discipline: CI Readiness & Regression Fixes

Released: 2026-05-15

Summary

v2.2.2 makes the v2.2 Project Discipline subsystem real and the test suite green on a
headless runner. It wires the discipline engine and pending-attention chip into the
running app, gates environment-dependent engine tests behind an opt-in so GitHub CI
passes, and fixes two genuine engine regressions found in code review. It also adds a
full external-dependency inventory.

What's new

  • The Project Discipline subsystem is now wired into the running app: DisciplineEngine
    is constructed in AppState, the pending-attention chip/panel appear in ChatView,
    the SessionStart hook surfaces findings, and a scan runs after each turn.
  • Live-environment test gate: engine tests that need a real LLM endpoint are gated
    behind RUN_LIVE_TESTS=1 (skipUnlessLiveEnvironment()), so CI and headless sandboxes
    run green; developers opt in for full coverage.
  • Requirements.md — a complete external-dependency inventory (toolchain, providers,
    local runners, models, LoRA, KiCad, doc tools, services, MCP, frameworks) with a
    source link for every dependency.

Internal changes

  • Fixed the pending-attention chip showing stale data — the view model now reads through
    the shared DisciplineEngine instead of a separate queue instance.
  • Fixed an unbounded context-overrun retry: EscalationHandler now consumes its
    per-turn budget on every escalation attempt, closing a loop that retried ~199 times
    without a terminal event.
  • Fixed parseSteps silently dropping a planner step (and a downstream crash):
    ComplexityTier now decodes high_stakes / highStakes / high-stakes and falls
    back to .standard for unknown values.
  • Removed the dead TelemetryRecorder / TelemetrySink / TelemetryEmitter.sink test
    seam; telemetry tests use the file-based resetForTesting / flushForTesting API via
    a shared readTelemetryEvents(fromFile:) helper.
  • CI workflow: the build step now uses set -o pipefail so a failed build fails the job.
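
The tolerant ComplexityTier decoding described above can be sketched with a custom `Decodable` initialiser. Only the two cases named in these notes are modelled here, which is an assumption; the real enum may have more cases and a different normalisation.

```swift
import Foundation

// Illustrative subset of the enum; additional cases are likely in the real type.
enum ComplexityTier: String, Decodable {
    case standard
    case highStakes

    init(from decoder: Decoder) throws {
        let raw = try decoder.singleValueContainer().decode(String.self)
        // Normalise "high_stakes", "highStakes" and "high-stakes" to one key.
        let key = raw.lowercased().filter { $0 != "_" && $0 != "-" }
        switch key {
        case "highstakes": self = .highStakes
        default: self = .standard   // unknown values fall back instead of throwing
        }
    }
}
```

Falling back rather than throwing is what prevents one unrecognised tier string from discarding an entire planner step.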

Migration

  • No user data migration required.
  • The v2.2.1 tag remains at the Phase 273b commit as an unreleased intermediate;
    v2.2.2 is the published successor to v2.2.0.

v2.2.0 — Project Discipline Subsystem

15 May 20:27

Merlin v2.2.0 — Project Discipline Subsystem

Released: 2026-05-14

What's New

Project Discipline Subsystem (v2.2.0) — 25 phase pairs (241a–265b) building the
construction-discipline layer directly into Merlin.

Adapter System (241–242)

  • AdapterRegistry + ProjectAdapter — per-language/per-toolchain configuration consumed
    by every discipline component. Seed adapters for Swift/Xcode and Rust/Cargo.
  • .merlin/project.toml + ProjectConfigLoader — per-project adapter selection and
    decaying-baseline configuration.

Phase Validation (243)

  • PhaseScanner — reads phases/ and cross-checks declared surfaces against the current
    codebase. Four-colour drift report: green / yellow / red / orange.

Pending Attention Queue (244)

  • PendingAttentionQueue — persisted, deduplicated queue of discipline findings.
    Finding, FindingCategory, Severity types.

DisciplineEngine (245)

  • DisciplineEngine actor — central coordinator. Runs all scanners, accumulates findings,
    integrates with the hook engine. Circuit breaker: 3 consecutive failures disable the
    engine for the session.
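
A minimal sketch of the circuit-breaker rule follows. `ScanCircuitBreaker` is a hypothetical name, and the sketch assumes a success resets the failure count and that an open breaker stays open until a new session creates a fresh instance.

```swift
struct ScanCircuitBreaker {
    let threshold: Int
    private(set) var consecutiveFailures = 0
    private(set) var isOpen = false

    init(threshold: Int = 3) { self.threshold = threshold }

    mutating func recordSuccess() {
        guard !isOpen else { return }   // once open, stays open for the session
        consecutiveFailures = 0
    }

    mutating func recordFailure() {
        consecutiveFailures += 1
        if consecutiveFailures >= threshold { isOpen = true }
    }
}
```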

Hook Integration (246–248)

  • SessionStart hook event + system-reminder injection — top-3 findings surfaced at
    session open.
  • UserPromptSubmit discipline check — flags unscoped feature requests without phase files.
  • GitHookInstaller — post-commit and pre-push hook installer / uninstaller.

Manual Coverage (249–250)

  • ManualCoverageScanner — enumerates user-facing surfaces via adapter regex patterns;
    reads <!-- covers: ... --> doc blocks; returns gaps.
  • ManualBaselineManager + ManualSectionTemplateWriter — decaying baseline enforcement;
    template section writer for uncovered surfaces.
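
The gap calculation can be sketched as set subtraction between the enumerated surfaces and the names found in `<!-- covers: ... -->` blocks. The comma-separated payload format and both function names are assumptions made for this illustration.

```swift
import Foundation

/// Collects the names listed in `<!-- covers: a, b -->` blocks.
func coveredSurfaces(in markdown: String) -> Set<String> {
    var covered = Set<String>()
    // Every chunk after the first begins right after an opening marker.
    for chunk in markdown.components(separatedBy: "<!-- covers:").dropFirst() {
        guard let end = chunk.range(of: "-->") else { continue }
        for name in chunk[..<end.lowerBound].split(separator: ",") {
            let trimmed = name.trimmingCharacters(in: .whitespacesAndNewlines)
            if !trimmed.isEmpty { covered.insert(trimmed) }
        }
    }
    return covered
}

/// Returns the surfaces that no covers block mentions, in their original order.
func coverageGaps(surfaces: [String], manual: String) -> [String] {
    let covered = coveredSurfaces(in: manual)
    return surfaces.filter { !covered.contains($0) }
}
```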

Doc Reference Graph (251)

  • DocReferenceGraph automatic mode — greps doc files for symbol-shaped identifiers;
    cross-checks against source symbol index; returns stale references.

API & Guide Generation (252–253)

  • APIDocGenerator — drives DocC (Swift) or rustdoc (Rust) for API doc regeneration.
  • DevGuideGenerator — regenerates mechanical sections of developer-guide.md from
    the adapter; preserves prose outside <!-- dev-guide:begin/end --> markers.

WHY-Comment Enforcement (254–255)

  • WhyCommentScanner — trigger-pattern scanning with ±3-line comment check.
    rationale-not-needed: annotation suppresses individual triggers.
  • WHYCommentGate + OverrideAnnotationParser — pre-commit gate blocks on missing
    WHY comments; parses override annotations.
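
The ±3-line check can be sketched as a scan that, for each line matching a trigger, looks for a justifying comment within three lines on either side. The trigger strings, the `WHY` comment marker, and the function name are assumptions; only the ±3 window and the `rationale-not-needed:` annotation come from the notes.

```swift
import Foundation

struct MissingWhy: Equatable {
    let line: Int        // 1-based line number of the trigger
    let trigger: String
}

func scanForMissingWhy(source: String, triggers: [String], radius: Int = 3) -> [MissingWhy] {
    let lines = source.split(separator: "\n", omittingEmptySubsequences: false).map(String.init)
    var findings: [MissingWhy] = []
    for (index, line) in lines.enumerated() {
        guard let trigger = triggers.first(where: { line.contains($0) }) else { continue }
        let lower = max(0, index - radius)
        let upper = min(lines.count - 1, index + radius)
        // A nearby "//" comment with WHY or the suppression annotation justifies the trigger.
        let justified = lines[lower...upper].contains { neighbour in
            guard let comment = neighbour.range(of: "//") else { return false }
            let text = neighbour[comment.upperBound...]
            return text.contains("WHY") || text.contains("rationale-not-needed:")
        }
        if !justified { findings.append(MissingWhy(line: index + 1, trigger: trigger)) }
    }
    return findings
}
```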

Prose Readability (256–257)

  • ProseReadabilityChecker — Vale integration; dry-run mode for tests.
  • ValeStyleWriter — writes Merlin Vale style files (readability, accept, passive-voice,
    weasel).
  • ProseGate — pre-commit gate blocks doc files exceeding target Flesch-Kincaid grade.

Override Audit (258)

  • OverrideAuditLog — JSONL override log; weekly review adds
    overrideAuditAccumulation finding when any category exceeds 5 overrides/week.
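
The weekly threshold can be sketched as a count per category over a JSONL log. The field names (`category`, `timestamp`), the epoch-seconds encoding, and the rolling seven-day window are assumptions; the notes specify only the JSONL format and the limit of 5 overrides per week.

```swift
import Foundation

// Hypothetical log entry shape.
struct OverrideEntry: Decodable {
    let category: String
    let timestamp: TimeInterval   // seconds since 1970 (assumed encoding)
}

/// Returns the categories with more than `limit` overrides in the 7 days before `now`.
func categoriesOverLimit(jsonl: String, now: Date, limit: Int = 5) -> [String] {
    let weekAgo = now.addingTimeInterval(-7 * 24 * 60 * 60).timeIntervalSince1970
    let decoder = JSONDecoder()
    var counts: [String: Int] = [:]
    for line in jsonl.split(separator: "\n") {
        // Malformed lines are skipped rather than failing the whole review.
        guard let entry = try? decoder.decode(OverrideEntry.self, from: Data(line.utf8)),
              entry.timestamp >= weekAgo else { continue }
        counts[entry.category, default: 0] += 1
    }
    return counts.filter { $0.value > limit }.keys.sorted()
}
```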

Project Skills (259–263)

  • /project:init — scaffold a new project with full discipline support.
  • /project:phase — build an NNa/NNb phase pair with structured questioning.
  • /project:revise — scan for drift, present findings, apply patches.
  • /project:release — consolidated release gate with 14-check checklist.
  • /project:adopt — apply discipline to an existing project; first target: Merlin itself.

Discipline UI (264)

  • PendingAttentionViewModel: @MainActor ObservableObject backed by the queue.
  • PendingAttentionChipView — compact count chip in the chat toolbar.
  • PendingAttentionPanelView — expandable panel with per-finding dismiss affordances.

Known Issues

  • DocReferenceGraph automatic mode has a false-positive rate on short identifiers (< 4
    characters). Mitigated by minimum length heuristic; explicit mode (future) will be more
    precise.
  • ProseReadabilityChecker requires vale to be installed as a dev tool. Graceful
    degradation: checker returns grade 0 (always passes) when vale is not found.
  • WhyCommentScanner does not yet scan Rust test files — restricted to *.swift and
    *.rs in non-test directories.
  • Skill files (259–263) require the ~/.merlin/skills/ directory to be writable. On
    sandboxed deployments the skills cannot be installed.

Upgrade Notes

From v2.1.0: No breaking changes to existing v2.1.0 APIs. The v2.2 subsystem is additive.

To activate the Project Discipline Subsystem on your project:

  1. Run /project:adopt in a Merlin session with your project open.
  2. Follow the adoption report recommendations.
  3. Run /project:revise to start working through the backlog.

The discipline subsystem is opt-in at the project level (.merlin/project.toml must exist).
Sessions on projects without .merlin/project.toml are unaffected.

Build number: 17 (was 16 in v2.1.0)

v2.1.0 — Budget-Aware Execution

14 May 22:52

Release v2.1.0 - Budget-Aware Execution

Summary

Budget-Aware Execution. Merlin now sizes every request to the active provider's input window,
decomposes oversized work, and stops cleanly on unrecoverable overflow. This works regardless
of provider, model, or context size.

What's new

  • Per-provider ProviderBudget registered as configuration data.
  • Pre-flight estimator gates every LLM call.
  • Working-set caps for system prompt, RAG, recent turns, and tool bursts.
  • Adaptive RAG injection sized to the active budget.
  • Enriched PlanStep with token budget, success criteria, critic mode, and minimum context.
  • PlannerEngine.refineStep(...) as the single decomposition entry point.
  • EscalationHandler as the single bounded retry and escalation policy. No recursion anywhere.
  • Critic gating by skill frontmatter, per-step policy, and deterministic short-circuit.
  • Decompose-first overflow handling with cross-provider routing as the last-resort fallback.
  • New telemetry: engine.preflight.*, engine.escalation.*, planner.refine.*,
    engine.rag.selected, critic.stage1.short_circuit.
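
A pre-flight gate of this kind can be sketched as an estimate compared against the provider budget. The default values match the conservative default listed under Migration; everything else is an assumption, including the `usableInputTokens` formula, the four-bytes-per-token estimate, and the `PreflightDecision` type.

```swift
// Illustrative budget type; field names follow the notes, the formula does not.
struct ProviderBudget {
    var maxInputTokens = 32_000
    var reservedOutputTokens = 4_096
    // Assumption: usable input is what remains after reserving room for output.
    var usableInputTokens: Int { maxInputTokens - reservedOutputTokens }
}

enum PreflightDecision: Equatable {
    case send
    case decompose(overBy: Int)
}

/// Rough estimate of about four UTF-8 bytes per token; a real estimator would use a tokenizer.
func estimateTokens(_ text: String) -> Int { (text.utf8.count + 3) / 4 }

func preflight(prompt: String, budget: ProviderBudget) -> PreflightDecision {
    let estimate = estimateTokens(prompt)
    let usable = budget.usableInputTokens
    return estimate <= usable ? .send : .decompose(overBy: estimate - usable)
}
```

Returning a decompose decision instead of an error mirrors the decompose-first policy: oversized work is split before any cross-provider routing is considered.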

Internal changes

  • PlanStep.successCriteria now uses [StepCriterion]. The decoder still accepts the legacy
    single-string form, so existing serialized plans continue to load.
  • AgenticEngine no longer uses contextLengthRetryCount, maxContextOverrunRecoveryAttempts,
    or contextOverrunRecoveryDirective. Recovery now flows through EscalationHandler.
  • New .cleanStop case on AgentEvent. Existing UI consumers can keep falling through to the
    .systemNote rendering path until a dedicated affordance ships.

Migration

  • Existing skills without critic: frontmatter continue to use the heuristic unchanged.
  • Existing config without ProviderBudget falls through to the conservative default
    (maxInputTokens: 32_000, reservedOutputTokens: 4_096).
  • No user data migration is required.

v2.0.0 — Electronics Domain, Multi-Domain Sessions, Memory Backend

14 May 15:15

Merlin 2.0.0

New in this release

  • Electronics / KiCad Domain — full electronics workflow via merlin-kicad-mcp: schematic ingestion, KiCad project generation, FreeRouting autoroute, ERC/DRC/SPICE/fab verification gates, BOM and order workflows. High-stakes signoff boundaries block irreversible manufacturing actions.
  • Multi-Domain Sessions — activate multiple domains simultaneously (e.g. software + electronics); DomainRegistry scopes tool sets and task types per session.
  • Local Memory Backend — project-scoped vector search via MemoryBackendPlugin with search(query:topK:projectPath:) overload.
  • Session Hardening: LiveSession.lifecycleTasks startup sequence, isClosed double-teardown guard, AuthMemory chmod 0600.
  • Provider Reliability — per-provider ephemeral URLSession, 4-attempt retry with 5/10/20s backoff, context-length auto-recovery.
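
The retry schedule under Provider Reliability (4 attempts with 5/10/20 s backoff) can be sketched as a loop with one initial try and one retry per delay. This is a synchronous illustration with an injectable pause so it can be exercised without waiting; `withRetry` is a hypothetical name, and the real provider code is presumably asynchronous.

```swift
/// Runs `operation` up to `delays.count + 1` times, pausing between attempts.
/// Rethrows the last error once the schedule is exhausted.
func withRetry<T>(delays: [Double] = [5, 10, 20],
                  pause: (Double) -> Void,
                  operation: (Int) throws -> T) throws -> T {
    var attempt = 1
    while true {
        do {
            return try operation(attempt)
        } catch {
            // Three delays allow three retries, so four attempts in total.
            guard attempt <= delays.count else { throw error }
            pause(delays[attempt - 1])
            attempt += 1
        }
    }
}
```

In production the pause would be a real sleep, for example `Thread.sleep(forTimeInterval:)`, or `Task.sleep` in an async variant.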

Bug fixes (phases 219b / 220b / 221b)

  • ContextLengthRecoveryTests: fixed wrong systemNote format check and case-sensitivity issues.
  • MCPHTTPTransport: JSON decode errors now throw typed MCPTransportError.decodeError instead of escaping as raw NSError.
  • MCPSSETransportTests: fixed raw-string-literal \n syntax bug.
  • DomainRegistry.taskTypes(): now mirrors activeDomain() non-software preference; fixed test inconsistency.

v1.9.1 — Native tool call collapse, window resize fix

13 May 15:30

v1.9.1 — Native tool call collapse, resize fix, prompt compression

UI fixes

  • Tool call rows now use native <details>/<summary> HTML elements — no JavaScript onclick handlers, arrow indicator via CSS ::before
  • Fixed duplicate bubble bug: removed addMessage fallback from appendChunk JS that created phantom second bubbles during streaming
  • Fixed window resize reflow: dispatches JS resize event on WKWebView frame change; added width: 100% to CSS body
  • Fixed content order: tool groups render above assistant text in the bubble

Prompt compression (three-layer, phases 205–207)

  • Mid-loop compaction: ContextManager tracks tokens after every tool result; compacts automatically at 40,000 tokens inside the execute loop (before the next LLM call) to keep per-turn cost linear
  • LLM summarisation — mid-loop compaction now calls the active provider once to produce a short narrative digest of removed exchanges rather than inserting a static truncation marker
  • Instruction distillation — compact built-in core system prompt (~80 tokens vs ~350); optional CLAUDE.md compression via Settings → Agent → Prompt Compression (cached on SHA256 hash, re-distils only on file change)

Config

Enable CLAUDE.md distillation: prompt_compression_enabled = true in ~/.merlin/config.toml or Settings → Agent → Prompt Compression.

v1.9.0 — Performance optimizations

11 May 18:10

What's new

Stable system prompt prefix cache — The stable portion of the system prompt is now cached and reused across loop iterations. llama.cpp's KV prefix cache gets a consistent byte-identical prefix every turn, eliminating redundant prefill work. Invalidates automatically when CLAUDE.md, memories, standing instructions, permission mode, or working directory change.

Async batch tool dispatch — All tool calls from a single LLM response are now dispatched in one parallel batch via ToolRouter's existing TaskGroup, rather than sequentially one at a time. Reading 4 files now takes roughly as long as reading the slowest one.
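
Batch dispatch with a task group can be sketched as below. `dispatchBatch` and the string-based call and result types are hypothetical stand-ins for ToolRouter's real API; only the use of a TaskGroup comes from the notes.

```swift
/// Runs every call concurrently and returns the outputs in the original call order.
func dispatchBatch(_ calls: [String],
                   run: @escaping @Sendable (String) async -> String) async -> [String] {
    await withTaskGroup(of: (Int, String).self, returning: [String].self) { group in
        for (index, call) in calls.enumerated() {
            group.addTask { (index, await run(call)) }
        }
        // Child tasks finish in any order, so place each result by its index.
        var results = [String](repeating: "", count: calls.count)
        for await (index, output) in group {
            results[index] = output
        }
        return results
    }
}
```

Tagging each result with its index keeps the outputs aligned with the tool calls the model issued, even though completion order is not deterministic.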

Parallel worker execution — spawn_agent calls in one response now launch all subagents concurrently instead of sequentially. PlannerEngine now annotates plan steps with parallel_safe, and independent steps are grouped into parallel batches rather than forced into sequential continuation turns.
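
Grouping annotated steps into batches can be sketched as a single pass over the plan. `SketchStep` and `parallelBatches` are hypothetical names, and the sketch assumes that adjacent steps marked `parallel_safe` are independent of each other, which the real planner may verify differently.

```swift
// Hypothetical stand-in for a plan step carrying the parallel_safe annotation.
struct SketchStep {
    let id: String
    let parallelSafe: Bool
}

/// Groups adjacent parallel-safe steps into one batch; every other step runs alone.
func parallelBatches(for steps: [SketchStep]) -> [[String]] {
    var result: [[String]] = []
    var current: [String] = []
    for step in steps {
        if step.parallelSafe {
            current.append(step.id)
        } else {
            if !current.isEmpty {
                result.append(current)
                current = []
            }
            result.append([step.id])   // a sequential step is its own batch
        }
    }
    if !current.isEmpty { result.append(current) }
    return result
}
```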