Releases: j-zuilkowski/merlin
Merlin v2.4.0
Release gate status: gates #1-#16 completed.
This release publishes the v2.4.0 evidence-backed build. The attached evidence report summarizes the full release battery, and the repository now includes the public release notes and README screenshot assets on main.
Electronics/KiCad boundary: the electronics domain is released as evidence-gated workflow infrastructure. It includes deterministic KiCad generation, routing, DRC/SPICE/fab gates, and visual KiCad evidence. It is not a blanket fabrication-ready claim for every generated board; high-stakes signoff remains explicitly gated.
Screenshots
Documentation
Attached Assets
Release assets include REPORT.md, RELEASE-RUN.md, the Merlin UI screenshots, and the KiCad screenshots.
v2.2.5 — Repetition-stall escalation rung + E2E robustness
Patch release. The escalation feature shipped this round adds a new
capability-failure rung — EscalationReason.repetitionStall — that
detects a model emitting the same response verbatim (now including
identical tool-call signatures) across a 6-turn window and routes
straight to the designated stronger provider, skipping refinement. The
fingerprint is conservative: a productive model varies either its
narration or its tool-call args, so only a genuine loop trips it.
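The fingerprint idea above can be sketched roughly as follows. This is a minimal illustration, not Merlin's implementation: `TurnFingerprint` and `RepetitionStallDetector` are hypothetical names, and the sketch assumes "across a 6-turn window" means six consecutive identical turns.

```swift
/// Hypothetical fingerprint of one model turn: narration plus tool-call signatures.
struct TurnFingerprint: Hashable {
    let narration: String
    let toolCallSignatures: [String]   // e.g. "run_shell(ls)"; format is assumed
}

/// Hypothetical detector: trips only when the last `window` turns are all identical.
struct RepetitionStallDetector {
    let window: Int
    private var recent: [TurnFingerprint] = []

    init(window: Int = 6) {
        self.window = window
    }

    /// Records a turn and returns true when a stall is detected.
    mutating func record(_ turn: TurnFingerprint) -> Bool {
        recent.append(turn)
        if recent.count > window {
            recent.removeFirst(recent.count - window)
        }
        guard recent.count == window, let first = recent.first else { return false }
        return recent.allSatisfy { $0 == first }
    }
}
```

Because the fingerprint covers both narration and tool-call signatures, a turn that changes either one resets the run of identical turns.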
Five other defects fixed alongside, each caught by S1's end-to-end run:
- EvalShell.run had no timeout. A transient filesystem stall once hung the proving suite for 40 minutes; now bounded by a watchdog and SIGKILLed on timeout.
- LiveShellRunner deadlocked on its pipe (read after wait) AND had no timeout, so xcodebuild test hung the critic for the full 1800 s test window. Drains on a background queue with a 300 s deadline.
- Fixture extraction no longer chdir's into ~/Documents via git -C; uses --git-dir from a temp cwd, sidestepping the TCC getcwd wedge on a freshly-rebuilt ad-hoc-signed test host.
- cannotDecompose on a preflight overflow now routes only to a provider whose usableInputTokens actually fits minContextRequired, instead of the strongest capability target regardless of budget.
- consecutiveCriticFailures bumps only when the escalation truly gives up, not on every routed-provider retry exhaustion; this fixes the circuit-breaker double-counting.
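The routing fix for cannotDecompose can be illustrated with a small sketch. Only `usableInputTokens` and `minContextRequired` come from the notes; `ProviderCandidate`, `capabilityRank`, and the function name are assumptions made for the example.

```swift
/// Hypothetical provider description; only `usableInputTokens` is named in the notes.
struct ProviderCandidate {
    let name: String
    let capabilityRank: Int       // assumed: higher means stronger
    let usableInputTokens: Int
}

/// Returns the strongest provider whose window fits the step, or nil if none fits.
func routeOnPreflightOverflow(
    candidates: [ProviderCandidate],
    minContextRequired: Int
) -> ProviderCandidate? {
    candidates
        .filter { $0.usableInputTokens >= minContextRequired }
        .max { $0.capabilityRank < $1.capabilityRank }
}
```

Filtering on the window first means a stronger provider with too small a budget is never chosen over a weaker one that can actually hold the request.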
Plus a documented local-signing strategy: MerlinTests-Live test
invocations now use the project's Merlin Dev Signing identity so the
macOS TCC Full Disk Access grant survives rebuilds; compile gates and
CI keep CODE_SIGNING_ALLOWED=NO as before. See CLAUDE.md and
merlin-eval/HANDOFF.md for the runbook.
Verified against the full proving suite: S1 passes legitimately in
1240 s (preflightOverflow → DeepSeek handoff fixes TaskBoard, its
xcodebuild test green at the end). All 1828 unit tests pass; both
schemes compile clean.
v2.2.4 — Context-overflow fix, tool detection, vision launchpad
Merlin v2.2.4
Summary
v2.2.4 makes the provider context-overflow class of failures structurally
impossible, adds first-use detection of missing external tools, lets you target a
specific loaded local model per role slot, and introduces vision.md as the first
artifact of the Project Discipline pipeline.
What's new
- Context-overflow HTTP 400s are fixed at the source. Three layers, end to end: tool output (run_shell, read_file) is capped before it can enter the model context (phase 284); the per-request budget is discovered from the active model's real context window, queried live for local runners and OpenRouter, and learned from the first 400 and persisted for commercial providers (phase 285); and every LLM request on every engine path (planner, critic, subagents, summariser, memory, KAG, vision) is sized to fit the provider window before it is sent, not just the main turn loop (phase 286).
- Local model picker. When a local runner has several models loaded, each can be assigned to a role slot directly from the chat HUD and the slot picker (phase 283).
- Missing-tool detection. When a feature needs an external CLI tool that is not installed, Merlin detects it on first use and offers a one-click brew install for the Homebrew-safe tools, or shows the install command/URL for the rest, instead of a raw "command not found" (phase 287).
- Vision launchpad. vision.md is now the first artifact of the discipline pipeline: vision → architecture → phase → code. project:init seeds it, project:adopt incorporates an existing one, project:revise grows and promotes ideas from it (phase 288).
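As an illustration of the first layer (capping tool output before it reaches the model context), here is a minimal sketch. The function name, the character-based limit, and the marker text are all assumptions; the notes do not say how Merlin measures or marks truncation.

```swift
/// Hypothetical cap applied to tool output before it enters the model context.
/// The character limit and the marker text are assumptions for illustration.
func capToolOutput(_ output: String, maxCharacters: Int = 8_000) -> String {
    guard output.count > maxCharacters else { return output }
    let kept = String(output.prefix(maxCharacters))
    let dropped = output.count - maxCharacters
    return kept + "\n[output truncated: \(dropped) characters omitted]"
}
```

Leaving an explicit marker lets the model see that output was cut, rather than treating the capped text as complete.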
Internal changes
- New types: ToolOutput, ContextBudgetResolver/ContextBudgetStore, PreflightGuard, ToolRequirement/ToolRequirements/ToolRequirementChecker.
- All 14 provider.complete send sites now route through PreflightGuard.
- Learned context windows persist to ProviderConfig.budget in providers.json, the same field a manually-entered budget uses.
Migration
None. No configuration changes are required; context-budget discovery and tool
detection are automatic.
v2.2.3 — Built-in Skill Installation Fix
Merlin v2.2.3 — Built-in Skill Installation Fix
Released: 2026-05-15
Summary
v2.2.3 fixes built-in skill installation. The Merlin/Skills/Builtin/ directory is now
bundled inside the app, so a fresh install ships every skill and installs them to
~/.merlin/skills/ on first launch — on any machine, not just the machine the app was
built on.
What's new
- All 13 built-in skills now ship inside the app bundle: the 8 core skills (commit, debug, explain, plan, refactor, review, summarise, test) and the 5 project:* discipline skills (project:init, project:phase, project:revise, project:release, project:adopt).
- installBuiltinSkills() copies any missing skill to ~/.merlin/skills/ at launch; skills already present, including ones you have customised, are left untouched.
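The copy-if-missing behaviour can be sketched as below. This is an assumption-laden illustration, not the body of installBuiltinSkills(): the function name, its parameters, and its return value are hypothetical.

```swift
import Foundation

/// Hypothetical sketch: copies each skill folder from `bundled` into `installed`
/// unless a folder with the same name already exists there.
/// Returns the names of the skills that were newly copied.
func installMissingSkills(from bundled: URL, to installed: URL) throws -> [String] {
    let fm = FileManager.default
    try fm.createDirectory(at: installed, withIntermediateDirectories: true)
    var copied: [String] = []
    for name in try fm.contentsOfDirectory(atPath: bundled.path).sorted() {
        let destination = installed.appendingPathComponent(name)
        // Existing (possibly customised) skills are left untouched.
        if fm.fileExists(atPath: destination.path) { continue }
        try fm.copyItem(at: bundled.appendingPathComponent(name), to: destination)
        copied.append(name)
    }
    return copied
}
```

Checking for the destination folder rather than comparing contents is what keeps user edits safe: a customised skill is never overwritten by the bundled copy.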
Internal changes
- project.yml adds Merlin/Skills/Builtin as a folder-reference resource on the Merlin target, so the directory is copied into Merlin.app/Contents/Resources/Builtin/. Previously the directory was excluded from the target and never bundled: installBuiltinSkills() only resolved its input via a build-machine #filePath fallback, so a distributed build installed no skills at all.
- The 5 project:* SKILL.md files are now version-controlled in Merlin/Skills/Builtin/ rather than living only in ~/.merlin/ and in phase files.
Migration
- No user data migration required. installBuiltinSkills() skips any skill folder that already exists in ~/.merlin/skills/, so existing and customised skills are preserved.
v2.2.2 — Project Discipline: CI Readiness & Regression Fixes
Merlin v2.2.2 — Project Discipline: CI Readiness & Regression Fixes
Released: 2026-05-15
Summary
v2.2.2 makes the v2.2 Project Discipline subsystem real and the test suite green on a
headless runner. It wires the discipline engine and pending-attention chip into the
running app, gates environment-dependent engine tests behind an opt-in so GitHub CI
passes, and fixes two genuine engine regressions found in code review. It also adds a
full external-dependency inventory.
What's new
- The Project Discipline subsystem is now wired into the running app: DisciplineEngine is constructed in AppState, the pending-attention chip/panel appear in ChatView, the SessionStart hook surfaces findings, and a scan runs after each turn.
- Live-environment test gate: engine tests that need a real LLM endpoint are gated behind RUN_LIVE_TESTS=1 (skipUnlessLiveEnvironment()), so CI and headless sandboxes run green; developers opt in for full coverage.
- Requirements.md: a complete external-dependency inventory (toolchain, providers, local runners, models, LoRA, KiCad, doc tools, services, MCP, frameworks) with a source link for every dependency.
Internal changes
- Fixed the pending-attention chip showing stale data: the view model now reads through the shared DisciplineEngine instead of a separate queue instance.
- Fixed an unbounded context-overrun retry: EscalationHandler now consumes its per-turn budget on every escalation attempt, closing a loop that retried ~199 times without a terminal event.
- Fixed parseSteps silently dropping a planner step (and a downstream crash): ComplexityTier now decodes high_stakes/highStakes/high-stakes and falls back to .standard for unknown values.
- Removed the dead TelemetryRecorder/TelemetrySink/TelemetryEmitter.sink test seam; telemetry tests use the file-based resetForTesting/flushForTesting API via a shared readTelemetryEvents(fromFile:) helper.
- CI workflow: the build step now uses set -o pipefail so a failed build fails the job.
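The lenient ComplexityTier decoding can be sketched as a custom Decodable initialiser. Only `.standard` and the three high-stakes spellings come from the notes; the real enum presumably has more cases, and the two-case shape here is an assumption.

```swift
import Foundation

/// Sketch of a lenient decoder. The real enum likely has additional cases;
/// only `standard` and the high-stakes spellings are taken from the notes.
enum ComplexityTier: Equatable, Decodable {
    case standard
    case highStakes

    init(from decoder: Decoder) throws {
        let raw = try decoder.singleValueContainer().decode(String.self)
        switch raw {
        case "high_stakes", "highStakes", "high-stakes":
            self = .highStakes
        default:
            // Unknown values fall back instead of failing the whole step.
            self = .standard
        }
    }
}
```

With a fallback in place, one unrecognised tier string no longer causes the surrounding planner step to fail decoding.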
Migration
- No user data migration required.
- The v2.2.1 tag remains at the Phase 273b commit as an unreleased intermediate; v2.2.2 is the published successor to v2.2.0.
v2.2.0 — Project Discipline Subsystem
Merlin v2.2.0 — Project Discipline Subsystem
Released: 2026-05-14
What's New
Project Discipline Subsystem (v2.2.0) — 25 phase pairs (241a–265b) building the
construction-discipline layer directly into Merlin.
Adapter System (241–242)
- AdapterRegistry + ProjectAdapter: per-language/per-toolchain configuration consumed by every discipline component. Seed adapters for Swift/Xcode and Rust/Cargo.
- .merlin/project.toml + ProjectConfigLoader: per-project adapter selection and decaying-baseline configuration.
Phase Validation (243)
- PhaseScanner: reads phases/ and cross-checks declared surfaces against the current codebase. Four-colour drift report: green / yellow / red / orange.
Pending Attention Queue (244)
- PendingAttentionQueue: persisted, deduplicated queue of discipline findings. Finding, FindingCategory, Severity types.
DisciplineEngine (245)
- DisciplineEngine actor: central coordinator. Runs all scanners, accumulates findings, integrates with the hook engine. Circuit breaker: 3 consecutive failures disable the engine for the session.
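The circuit-breaker rule (three consecutive failures disable the engine for the session) can be sketched as a small value type. `ScanCircuitBreaker` and its members are hypothetical names; the notes describe DisciplineEngine as an actor, which this sketch does not attempt to reproduce.

```swift
/// Hypothetical breaker: `threshold` consecutive failures disable the engine
/// for the rest of the session. A success resets the count only while enabled.
struct ScanCircuitBreaker {
    let threshold: Int
    private(set) var consecutiveFailures = 0
    private(set) var isDisabled = false

    init(threshold: Int = 3) {
        self.threshold = threshold
    }

    mutating func recordSuccess() {
        guard !isDisabled else { return }
        consecutiveFailures = 0
    }

    mutating func recordFailure() {
        guard !isDisabled else { return }
        consecutiveFailures += 1
        if consecutiveFailures >= threshold { isDisabled = true }
    }
}
```

The breaker is deliberately one-way in this sketch: once tripped it stays tripped, matching "for the session".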
Hook Integration (246–248)
- SessionStart hook event + system-reminder injection: top-3 findings surfaced at session open.
- UserPromptSubmit discipline check: flags unscoped feature requests without phase files.
- GitHookInstaller: post-commit and pre-push hook installer / uninstaller.
Manual Coverage (249–250)
- ManualCoverageScanner: enumerates user-facing surfaces via adapter regex patterns; reads <!-- covers: ... --> doc blocks; returns gaps.
- ManualBaselineManager + ManualSectionTemplateWriter: decaying baseline enforcement; template section writer for uncovered surfaces.
Doc Reference Graph (251)
- DocReferenceGraph automatic mode: greps doc files for symbol-shaped identifiers; cross-checks against source symbol index; returns stale references.
API & Guide Generation (252–253)
- APIDocGenerator: drives DocC (Swift) or rustdoc (Rust) for API doc regeneration.
- DevGuideGenerator: regenerates mechanical sections of developer-guide.md from the adapter; preserves prose outside <!-- dev-guide:begin/end --> markers.
WHY-Comment Enforcement (254–255)
- WhyCommentScanner: trigger-pattern scanning with ±3-line comment check. A rationale-not-needed: annotation suppresses individual triggers.
- WHYCommentGate + OverrideAnnotationParser: pre-commit gate blocks on missing WHY comments; parses override annotations.
Prose Readability (256–257)
- ProseReadabilityChecker: Vale integration; dry-run mode for tests.
- ValeStyleWriter: writes Merlin Vale style files (readability, accept, passive-voice, weasel).
- ProseGate: pre-commit gate blocks doc files exceeding target Flesch-Kincaid grade.
Override Audit (258)
- OverrideAuditLog: JSONL override log; weekly review adds an overrideAuditAccumulation finding when any category exceeds 5 overrides/week.
Project Skills (259–263)
- /project:init scaffolds a new project with full discipline support.
- /project:phase builds an NNa/NNb phase pair with structured questioning.
- /project:revise scans for drift, presents findings, and applies patches.
- /project:release is a consolidated release gate with a 14-check checklist.
- /project:adopt applies discipline to an existing project; first target: Merlin itself.
Discipline UI (264)
- PendingAttentionViewModel: @MainActor ObservableObject backed by the queue.
- PendingAttentionChipView: compact count chip in the chat toolbar.
- PendingAttentionPanelView: expandable panel with per-finding dismiss affordances.
Known Issues
- DocReferenceGraph automatic mode produces false positives on short identifiers (< 4 characters). Mitigated by a minimum-length heuristic; explicit mode (future) will be more precise.
- ProseReadabilityChecker requires vale to be installed as a dev tool. Graceful degradation: the checker returns grade 0 (always passes) when vale is not found.
- WhyCommentScanner does not yet scan Rust test files; it is restricted to *.swift and *.rs in non-test directories.
- Skill files (259–263) require the ~/.merlin/skills/ directory to be writable. On sandboxed deployments the skills cannot be installed.
Upgrade Notes
From v2.1.0: No breaking changes to existing v2.1.0 APIs. The v2.2 subsystem is additive.
To activate the Project Discipline Subsystem on your project:
- Run /project:adopt in a Merlin session with your project open.
- Follow the adoption report recommendations.
- Run /project:revise to start working through the backlog.
The discipline subsystem is opt-in at the project level (.merlin/project.toml must exist).
Sessions on projects without .merlin/project.toml are unaffected.
Build number: 17 (was 16 in v2.1.0)
v2.1.0 — Budget-Aware Execution
Release v2.1.0 - Budget-Aware Execution
Summary
Budget-Aware Execution. Merlin now sizes every request to the active provider's input window,
decomposes oversized work, and stops cleanly on unrecoverable overflow. Works regardless of
provider/model/context.
What's new
- Per-provider ProviderBudget registered as configuration data.
- Pre-flight estimator gates every LLM call.
- Working-set caps for system prompt, RAG, recent turns, and tool bursts.
- Adaptive RAG injection sized to the active budget.
- Enriched PlanStep with token budget, success criteria, critic mode, and minimum context.
- PlannerEngine.refineStep(...) as the single decomposition entry point.
- EscalationHandler as the single bounded retry and escalation policy. No recursion anywhere.
- Critic gating by skill frontmatter, per-step policy, and deterministic short-circuit.
- Decompose-first overflow handling with cross-provider routing as the last-resort fallback.
- New telemetry: engine.preflight.*, engine.escalation.*, planner.refine.*, engine.rag.selected, critic.stage1.short_circuit.
Internal changes
- PlanStep.successCriteria now uses [StepCriterion]. The decoder still accepts the legacy single-string form, so existing serialized plans continue to load.
- AgenticEngine no longer uses contextLengthRetryCount, maxContextOverrunRecoveryAttempts, or contextOverrunRecoveryDirective. Recovery now flows through EscalationHandler.
- New .cleanStop case on AgentEvent. Existing UI consumers can keep falling through to the .systemNote rendering path until a dedicated affordance ships.
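A backward-compatible decoder of the kind described for PlanStep.successCriteria might look like the sketch below. The field `description` on `StepCriterion` and the reduced `PlanStep` shape are assumptions; the notes name the types but not their members.

```swift
import Foundation

/// Hypothetical shape; the notes name `StepCriterion` but not its fields.
struct StepCriterion: Decodable, Equatable {
    let description: String
}

/// Reduced to the one field under discussion; the real type carries more.
struct PlanStep: Decodable {
    let successCriteria: [StepCriterion]

    private enum CodingKeys: String, CodingKey {
        case successCriteria
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        if let legacy = try? container.decode(String.self, forKey: .successCriteria) {
            // Legacy form: a single string becomes a one-element array.
            successCriteria = [StepCriterion(description: legacy)]
        } else {
            successCriteria = try container.decode([StepCriterion].self, forKey: .successCriteria)
        }
    }
}
```

Trying the legacy form first and falling back to the array keeps old serialized plans loadable without a migration step.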
Migration
- Existing skills without critic: frontmatter continue to use the heuristic unchanged.
- Existing config without ProviderBudget falls through to the conservative default (maxInputTokens: 32_000, reservedOutputTokens: 4_096).
- No user data migration is required.
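To make the default-budget fallback concrete, here is a rough sketch. The two field names and their default values come from the notes; everything else is an assumption, in particular that the usable input window is the input limit minus the output reservation, and the four-characters-per-token estimate.

```swift
/// Field names and default values follow the notes; the rest is illustrative.
struct ProviderBudget {
    var maxInputTokens: Int
    var reservedOutputTokens: Int

    static let conservativeDefault = ProviderBudget(maxInputTokens: 32_000, reservedOutputTokens: 4_096)

    /// Assumption: the output reservation is subtracted from the input window.
    var usableInputTokens: Int {
        max(0, maxInputTokens - reservedOutputTokens)
    }
}

/// Falls back to the conservative default when no budget is configured.
func resolveBudget(configured: ProviderBudget?) -> ProviderBudget {
    configured ?? .conservativeDefault
}

/// Crude estimate (about four bytes per token); not Merlin's estimator.
func estimateTokens(_ text: String) -> Int {
    (text.utf8.count + 3) / 4
}

/// Pre-flight gate: true when the prompt is estimated to fit the budget.
func fitsPreflight(_ prompt: String, budget: ProviderBudget) -> Bool {
    estimateTokens(prompt) <= budget.usableInputTokens
}
```

Under these assumptions the default leaves 27,904 usable input tokens, so an unconfigured provider is treated cautiously rather than optimistically.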
v2.0.0 — Electronics Domain, Multi-Domain Sessions, Memory Backend
Merlin 2.0.0
New in this release
- Electronics / KiCad Domain. Full electronics workflow via merlin-kicad-mcp: schematic ingestion, KiCad project generation, FreeRouting autoroute, ERC/DRC/SPICE/fab verification gates, BOM and order workflows. High-stakes signoff boundaries block irreversible manufacturing actions.
- Multi-Domain Sessions. Activate multiple domains simultaneously (e.g. software + electronics); DomainRegistry scopes tool sets and task types per session.
- Local Memory Backend. Project-scoped vector search via MemoryBackendPlugin with a search(query:topK:projectPath:) overload.
- Session Hardening. LiveSession.lifecycleTasks startup sequence, isClosed double-teardown guard, AuthMemory chmod 0600.
- Provider Reliability. Per-provider ephemeral URLSession, 4-attempt retry with 5/10/20s backoff, context-length auto-recovery.
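The retry schedule (four attempts with 5, 10, and 20 second waits between them) can be sketched as follows. The function names are hypothetical, and the sleep is injected as a closure so the sketch can be exercised without real waiting; Merlin's actual retry code is not shown in these notes.

```swift
/// Delay in seconds before retry number `retry` (1-based), doubling from `base`.
func backoffDelay(beforeRetry retry: Int, base: Double = 5) -> Double {
    base * Double(1 << (retry - 1))
}

/// Runs `operation` up to `maxAttempts` times, calling `sleep` between failures.
/// The attempt number (starting at 1) is passed to `operation`.
func withRetry<T>(
    maxAttempts: Int = 4,
    sleep: (Double) -> Void,
    operation: (Int) throws -> T
) throws -> T {
    var attempt = 1
    while true {
        do {
            return try operation(attempt)
        } catch {
            // Out of attempts: surface the last error to the caller.
            if attempt >= maxAttempts { throw error }
            sleep(backoffDelay(beforeRetry: attempt))
            attempt += 1
        }
    }
}
```

Four attempts imply three waits, which is why the schedule lists three delays rather than four.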
Bug fixes (phases 219b / 220b / 221b)
- ContextLengthRecoveryTests: fixed wrong systemNote format check and case-sensitivity issues.
- MCPHTTPTransport: JSON decode errors now throw typed MCPTransportError.decodeError instead of escaping as raw NSError.
- MCPSSETransportTests: fixed raw-string-literal \n syntax bug.
- DomainRegistry.taskTypes(): now mirrors activeDomain() non-software preference; fixed test inconsistency.
v1.9.1 — Native tool call collapse, window resize fix
v1.9.1 — Native tool call collapse, resize fix, prompt compression
UI fixes
- Tool call rows now use native <details>/<summary> HTML elements: no JavaScript onclick handlers, arrow indicator via CSS ::before
- Fixed duplicate bubble bug: removed addMessage fallback from appendChunk JS that created phantom second bubbles during streaming
- Fixed window resize reflow: dispatches JS resize event on WKWebView frame change; added width: 100% to CSS body
- Fixed content order: tool groups render above assistant text in the bubble
Prompt compression (three-layer, phases 205–207)
- Mid-loop compaction. ContextManager tracks tokens after every tool result; compacts automatically at 40,000 tokens inside the execute loop (before the next LLM call) to keep per-turn cost linear
- LLM summarisation. Mid-loop compaction now calls the active provider once to produce a short narrative digest of removed exchanges rather than inserting a static truncation marker
- Instruction distillation. Compact built-in core system prompt (~80 tokens vs ~350); optional CLAUDE.md compression via Settings → Agent → Prompt Compression (cached on SHA256 hash, re-distils only on file change)
Config
Enable CLAUDE.md distillation: prompt_compression_enabled = true in ~/.merlin/config.toml or Settings → Agent → Prompt Compression.
v1.9.0 — Performance optimizations
What's new
Stable system prompt prefix cache — The stable portion of the system prompt is now cached and reused across loop iterations. llama.cpp's KV prefix cache gets a consistent byte-identical prefix every turn, eliminating redundant prefill work. Invalidates automatically when CLAUDE.md, memories, standing instructions, permission mode, or working directory change.
Async batch tool dispatch — All tool calls from a single LLM response are now dispatched in one parallel batch via ToolRouter's existing TaskGroup, rather than sequentially one at a time. Reading 4 files now takes the time of 1.
Parallel worker execution — spawn_agent calls in one response now launch all subagents concurrently instead of sequentially. PlannerEngine now annotates plan steps with parallel_safe, and independent steps are grouped into parallel batches rather than forced into sequential continuation turns.
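One simple way to turn parallel_safe annotations into batches is sketched below. It assumes that consecutive parallel-safe steps may share a batch and that any other step runs alone, which is only one plausible reading; `Step` and `parallelBatches(for:)` are hypothetical names, not PlannerEngine API.

```swift
/// Hypothetical step model; only the `parallel_safe` annotation comes from the notes.
struct Step {
    let id: String
    let parallelSafe: Bool
}

/// Groups consecutive parallel-safe steps into one batch; any other step runs alone.
func parallelBatches(for steps: [Step]) -> [[Step]] {
    var batches: [[Step]] = []
    var current: [Step] = []
    for step in steps {
        if step.parallelSafe {
            current.append(step)
        } else {
            // Flush the pending parallel batch before the sequential step.
            if !current.isEmpty {
                batches.append(current)
                current = []
            }
            batches.append([step])
        }
    }
    if !current.isEmpty { batches.append(current) }
    return batches
}
```

Each inner array could then be handed to a task group, while the outer array preserves the original ordering between batches.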






