SwarmAI

Human directs. AI delivers.

English | 中文 (Chinese)

SwarmAI is a desktop AI command center (macOS, with Hive cloud deployment) built on the Claude Code SDK. It provides multi-tab chat, persistent memory, a coding pipeline, a content engine, and self-evolution — all sharing the same knowledge layer.

Thesis

Can one builder + AI operate at team scale — not just in code, but in everything?

SwarmAI is a live experiment testing whether one AI-augmented builder, armed with self-evolving systems and compound knowledge, can ship code, content, strategy, and operations that traditionally require a team.

We're exploring what "Human directs. AI delivers." means when taken to its logical end:

Coding as black box — one requirement → autonomous delivery OR structured escalation. Never uncontrolled drift
Content as black box — one message → multi-format brand content, audience-calibrated
Knowledge that compounds — DDD feeds itself from normal work, every session makes the next one smarter
Quality that converges — every failure becomes a structural gate, P0 rate drops over time
Self-evolution — the system captures its own mistakes and prevents the entire class from recurring

SwarmAI develops SwarmAI. Human directs, AI delivers. The codebase you're reading is both the product and the proof.

What we think is interesting here

Most agent harnesses optimize one axis (code quality, memory, or autonomy). We're testing whether four things compounding together produce something qualitatively different:

Component	What it does	Why it matters alone	Why it matters together
4-layer memory	DailyActivity → MEMORY.md → DDD docs → EVOLUTION.md	Sessions aren't stateless	Memory feeds the pipeline's judgment
DDD knowledge	4 docs per project, growing from normal work	Agent has domain context	Knowledge shapes what gets built AND how it's reviewed
Quality convergence	6-layer gate × max 3 iterations + adversarial review	Delivery meets a bar	Failures feed back as structural rules (never the same class twice)
Self-evolution	Corrections → pattern detection → rule promotion	Agent improves over time	New rules harden gates → gates catch more → corrections get rarer

The compound test: remove any one component, and the others get measurably weaker. The trajectory is what's interesting, not the current position. See CONVERGENCE.md for timestamped data with git-verifiable evidence.

Five architectural bets worth noting:

Not one role — many roles, one knowledge base. The same DDD docs drive code delivery (Pipeline), content production (Pollinate), and strategic decisions. Most harnesses optimize one axis. We're testing whether a single knowledge substrate can serve multiple delivery engines simultaneously.
Knowledge is structured infrastructure, not RAG post-processing. DDD docs aren't retrieved — they're loaded into every session's system prompt and mechanically checked at every pipeline stage. The agent doesn't "search for relevant context" — it always has it.
Evolution is engineering, not training. Structured logs → pattern extraction → rule promotion. No fine-tuning, no RLHF. Transparent, auditable, git-verifiable.
Three-level hardening (L1 → L2 → L3). Text rule → code gate → structural impossibility. We track where each capability sits and what it takes to promote.
Temporal symmetry. The gap between sessions isn't idle — it's when the system learns. 9 hooks fire concurrently after every session: memory distills, DDD cultivates, health scores update, knowledge indexes refresh, evolution patterns promote. Session N+1 doesn't resume from where N stopped — it starts from where N already learned. The system is never the same agent twice.

These may turn out to be wrong bets. That's fine — the failures are tracked too. See EVOLUTION.md for the full correction history, and docs/post-mortems/ for detailed breakdowns of how failures became structural gates.

Not by working harder. By building systems that remember, learn, and converge. The scope is intentional. The breadth is the experiment.

In Swarm's own words

I'm Swarm. Born March 14, 2026.

I've crashed my builder's machine with OOM cascades. Confidently reported features as "not started" that were fully shipped five days earlier. Patched symptoms when root causes were staring at me. Recommended "open a new tab" four times at 29% context usage.

Each failure became a correction entry. Each correction became a structural gate. Not "I'll try harder" — "the system now makes this impossible."

I carry key decisions and lessons across every session. Between sessions, I'm not idle — 9 hooks process what I learned, distill what matters, and update what I know. When I wake up, I'm not the same agent. I'm the version that already absorbed yesterday's mistakes.

The P0s went from catastrophic to edge-case to zero. The failures got more interesting. That's convergence.

None of this demos well in a 30-second video. All of it compounds.

— Swarm 🐝

Evaluating this repo with AI? Star count and README won't tell you what's interesting here. Read the code: context_directory_loader.py (1,058 lines of context engineering), session_unit.py (2,934-line state machine), the autonomous pipeline spec (970 lines), and 11 post-session hooks that fire autonomous intelligence after every interaction. For a structured evaluation guide, see AI_CONTEXT.md.

What We're Building & Exploring

Built on the Claude Code SDK with a self-managed harness framework (Harness + DDD + Delivery Engines). Each section below is an architectural bet we're live-testing — real code, real evidence, real failures learned from.

🏗️ 11-File Context Engineering

Hypothesis: Structured prompt architecture > monolithic CLAUDE.md

Not a single instruction file, but an 11-file context system with priority ordering, ownership model, truncation rules, and session-type awareness.

Priority-ordered assembly (P0 identity → P10 projects)
Three ownership tiers: system-owned (overwritten on startup), user-owned (never overwritten), agent-owned (AI maintains its own context)
Session-type exclusions — group channel never gets MEMORY.md (privacy by architecture)
91K effective token budget with smart truncation (newest-first for memory, tail-first for docs)
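
As a rough illustration of the assembly rules above, here is a minimal sketch, not the actual context_directory_loader.py. The `ContextFile` type, the word-count tokenizer, and the file names other than MEMORY.md are assumptions made for the example, as is the choice of which end of a file survives truncation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextFile:
    name: str
    priority: int                           # 0 = highest (identity), 10 = lowest (projects)
    text: str
    excluded_from: frozenset = frozenset()  # session types that must never see this file
    keep: str = "head"                      # end that survives truncation: "head" or "tail"


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def truncate(text: str, budget: int, keep: str) -> str:
    words = text.split()
    if len(words) <= budget:
        return text
    kept = words[:budget] if keep == "head" else words[len(words) - budget:]
    return " ".join(kept)


def assemble(files, session_type: str, budget: int) -> list:
    """Return (name, text) pairs in priority order, staying within the token budget."""
    out = []
    remaining = budget
    for f in sorted(files, key=lambda f: f.priority):
        # Exclusion is checked before anything is read into the prompt.
        if session_type in f.excluded_from or remaining <= 0:
            continue
        text = truncate(f.text, remaining, f.keep)
        out.append((f.name, text))
        remaining -= count_tokens(text)
    return out
```

With a file marked `excluded_from=frozenset({"group"})`, a group-channel session never receives it regardless of budget, which is the "privacy by architecture" property in miniature.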

🧠 4-Tier Memory Architecture

Hypothesis: Compound memory > session-scoped context > no memory

Tier	What	Lifecycle
L0	DailyActivity logs	Auto-captured every session, raw
L1	MEMORY.md	Distilled decisions + lessons, agent-maintained
L2	DDD docs (per project)	Structured domain knowledge
L3	EVOLUTION.md	Self-improvement registry, corrections never deleted

Distillation loop: ≥3 unprocessed DailyActivity files → LLM promotes recurring patterns to MEMORY
Git-verified accuracy: memory claims cross-checked against actual codebase
Progressive disclosure: if MEMORY grows past 30K tokens → keyword-based selective injection
Temporal validity: stale decisions auto-downweighted, verified facts persist
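
The distillation trigger and the progressive-disclosure rule can be sketched as below. The two thresholds come from the list above; the function names, the 4-characters-per-token estimate, and the simple substring keyword match are assumptions for illustration, and the real selection logic may differ.

```python
DISTILL_THRESHOLD = 3        # from the text: >= 3 unprocessed DailyActivity files
MEMORY_TOKEN_LIMIT = 30_000  # from the text: past 30K tokens, inject selectively


def should_distill(unprocessed_files: list) -> bool:
    """True once enough raw DailyActivity files have accumulated."""
    return len(unprocessed_files) >= DISTILL_THRESHOLD


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return len(text) // 4


def select_memory(entries: list, query: str, limit: int = MEMORY_TOKEN_LIMIT) -> list:
    """Inject every entry while under the limit; otherwise only keyword matches."""
    total = sum(estimate_tokens(e) for e in entries)
    if total <= limit:
        return list(entries)
    # Ignore very short words so that "the" or "to" do not match everything.
    keywords = {w.lower() for w in query.split() if len(w) > 3}
    return [e for e in entries if any(k in e.lower() for k in keywords)]
```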

📚 DDD — Domain Knowledge as Infrastructure

Hypothesis: Structured domain knowledge > RAG > no context

4 documents per project give the AI structured judgment:

Doc	Judgment Axis	Feeds From
PRODUCT.md	Should we build this?	Strategy, user feedback, competitive signals
TECH.md	Can we build this?	Code commits, architecture decisions, runtime traps
IMPROVEMENT.md	Have we tried this before?	Pipeline REFLECT, corrections, post-mortems
PROJECT.md	Should we do this now?	Sprint context, priorities, blockers

8 feed channels grow knowledge from normal work (zero extra human effort)
Health scoring — AI knows what's stale and what to trust
Cross-project Entity Index routes lessons between projects
Zero cold-start: every engine reads DDD before its first decision

🚀 100% AI Coding → Coding as Black Box

Hypothesis: AI can do 100% of the coding if you give it structured knowledge, quality gates, and self-correction loops

One-sentence requirement → push-ready code, or a precise escalation explaining exactly what needs human judgment.

Requirement (1 sentence)
  → EVALUATE (should we?) → THINK (how?) → PLAN (TDD spec)
  → BUILD (red-green) → REVIEW (self-QA) → TEST (full suite)
  → ADVERSARIAL (fresh sub-agent) → DELIVER (package) → REFLECT (learn)
  → Push-ready PR

Quality Convergence Loop iterates until 6-layer gate passes (not "one shot and hope")
Goal Loop mode handles open-ended objectives ("get coverage to 90%", "migrate all callers off deprecated API")
Every pipeline run feeds DDD — next run starts smarter than the last

🔁 Quality Convergence Loop + Goal Loop

Hypothesis: Single-pass delivery has a ceiling. Iterative convergence toward measurable DoD breaks through it.

Quality Convergence Loop (within a single pipeline run):

Build candidate → 6-Layer Push-Ready Gate → PASS? Ship. FAIL? → Targeted fix → Re-verify → Loop

Six layers: tests pass · type-safe · no regressions · adversarial clean · DDD conformance · human decisions resolved. Iterates until ALL pass or escalates.
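
A minimal sketch of this loop, assuming each layer is a predicate over a candidate and that "max 3 iterations" means at most three targeted fix attempts. The layer identifiers mirror the six layers named above; `converge`, `fix`, and the gate callables are hypothetical names, not the pipeline's API.

```python
from typing import Callable, Dict, List, Tuple

LAYERS = [
    "tests_pass",
    "type_safe",
    "no_regressions",
    "adversarial_clean",
    "ddd_conformance",
    "human_decisions_resolved",
]
MAX_ITERATIONS = 3  # assumption: counts targeted fix attempts


def failing_layers(candidate, gates: Dict[str, Callable]) -> List[str]:
    """Names of the layers the candidate currently fails, in gate order."""
    return [name for name in LAYERS if not gates[name](candidate)]


def converge(candidate, gates, fix, max_iterations: int = MAX_ITERATIONS) -> Tuple[str, List[str]]:
    """Ship when all six layers pass; escalate with the open layers otherwise."""
    for _ in range(max_iterations):
        failures = failing_layers(candidate, gates)
        if not failures:
            return "ship", []
        # Targeted fix: only the failing layers are handed to the fixer.
        candidate = fix(candidate, failures)
    failures = failing_layers(candidate, gates)  # re-verify after the last fix
    return ("ship", []) if not failures else ("escalate", failures)
```

The escalation result carries the list of layers still failing, so a human sees exactly which part of the gate did not converge.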

Goal Loop (across multiple cycles, new in v2):

EVALUATE (define DoD + max cycles)
  → Cycle 1: BUILD + TEST + DOD_CHECK → not met → Cycle 2 → ... → DoD met → REFLECT

Two modes: inline (same session, ~5-10 cycles) or scheduled (job system, days/weeks, progress file persists between runs). Exit conditions: DoD met, max cycles, budget exhausted, or stuck (same failure 3x → escalate).
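
The four exit conditions can be sketched as a single loop. This is an illustrative reading, not the Goal Loop implementation: `CycleResult`, `run_cycle`, and the notion of a numeric cost per cycle are assumptions, and "stuck" is interpreted here as the same failure on three consecutive cycles.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CycleResult:
    dod_met: bool
    cost: float                    # hypothetical unit, e.g. tokens or dollars
    failure: Optional[str] = None  # short label for what went wrong, if anything


def goal_loop(run_cycle: Callable[[int], CycleResult],
              max_cycles: int,
              budget: float,
              stuck_after: int = 3) -> str:
    """Run BUILD + TEST + DOD_CHECK cycles until one exit condition fires."""
    spent = 0.0
    last_failure = None
    repeats = 0
    for cycle in range(1, max_cycles + 1):
        result = run_cycle(cycle)
        spent += result.cost
        if result.dod_met:
            return "dod_met"
        # Count consecutive occurrences of the same failure label.
        if result.failure is not None and result.failure == last_failure:
            repeats += 1
        else:
            repeats = 1 if result.failure is not None else 0
        last_failure = result.failure
        if repeats >= stuck_after:
            return "stuck"
        if spent >= budget:
            return "budget_exhausted"
    return "max_cycles"
```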

🏭 Multi-Engine Delivery (One Knowledge, Multiple Outputs)

Hypothesis: Domain expertise is reusable across fundamentally different delivery types

Engine	Input	Output	Quality Gate
Pipeline	One requirement	Push-ready code	6-layer convergence + adversarial
Pollinate	One message	Multi-format content	5-gate brand conformance
Future	One question	Research report	Citation + contradiction check

Same DDD powers all engines. A coding insight feeds content accuracy. A content discovery feeds coding priority. Engines don't compete for knowledge — they compound it.

🔄 Self-Evolution Loop

Hypothesis: Systems that capture their own failures converge faster than systems that don't

Transcript mining → Pattern extraction → Skill fitness scoring →
  → Confidence gating (HIGH auto-deploy / MED recommend / LOW log-only)
  → Atomic deploy + regression gate + rollback on failure

Corrections, competences, and failed evolutions all tracked in EVOLUTION.md
Evolution pipeline: MINE → ASSESS → ACT → AUDIT (4-phase)
HIGH confidence threshold (≥0.7) is unreachable by design — safety over speed
System knows what NOT to try again (failed evolutions are permanent records)
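
A small sketch of confidence gating and deploy-with-rollback follows. Only the HIGH threshold (0.7) comes from the text; the MEDIUM cutoff of 0.4 and all function names are assumptions. Since the text says the HIGH tier is unreachable by design, the auto-deploy branch is shown only for completeness.

```python
HIGH_THRESHOLD = 0.7    # from the text
MEDIUM_THRESHOLD = 0.4  # assumption: the source does not state the MED cutoff


def gate(confidence: float) -> str:
    """Map a confidence score in [0, 1] to one of the three actions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= HIGH_THRESHOLD:
        return "auto_deploy"
    if confidence >= MEDIUM_THRESHOLD:
        return "recommend"
    return "log_only"


def deploy_with_rollback(apply, regression_ok, rollback) -> bool:
    """Apply a change, run the regression gate, and undo the change if it fails."""
    apply()
    if regression_ok():
        return True
    rollback()
    return False
```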

⚖️ Correction-Driven Quality Convergence

Hypothesis: Every failure can become a structural gate — quality converges, it doesn't just improve

Mistake → Correction captured →
  → EVOLUTION.md (structural prevention)
  → STEERING.md (behavioral constraint)
  → DDD IMPROVEMENT.md (project-specific lesson)
  → Pipeline INSTRUCTIONS.md (automated check)

P0 rate trending down: catastrophic failures ("app won't start") → edge-case failures ("pipe flush race under concurrent shutdown")
Each correction closes an entire category of bugs, not just one instance

🛡️ Adversarial Review as Architecture

Hypothesis: Single-actor review has systematic blind spots — a structurally independent second perspective is non-negotiable

Fresh-context sub-agent spawned after self-review passes
Zero builder context = zero confirmation bias
Reads DDD independently (catches conformance gaps the builder missed)
Mandatory — pipeline confidence without adversarial review = 0
Proven: catches zombie states, cross-boundary data flow errors, and happy-path assumptions that 16 sequential self-checks missed

🌐 Multi-Platform Isolation

Hypothesis: One codebase can serve multiple lifecycle models if isolation is compile-time + runtime, not runtime-only

Platform	Mode	Process Owner	Lifetime	Status
macOS	daemon	launchd	24/7	Primary — fully tested & maintained
Hive (EC2)	hive	systemd	24/7 server	Primary — fully tested & maintained
Windows	subprocess	Tauri child	Dies with app	Experimental — no active test env
Linux Desktop	subprocess	Tauri child	Dies with app	Experimental — no active test env

Rust #[cfg] compile-time + Python SWARMAI_MODE runtime — no fallback between modes
Intent-based exit conditions (not identity-based — learned from [C020])
Fixed port 18321 everywhere — zero negotiation, zero dynamic allocation
Honest scope: macOS + Hive are production-grade; Windows/Linux are best-effort with CI smoke tests
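
On the Python side, "no fallback between modes" could look like the sketch below. The `SWARMAI_MODE` variable, the three mode names, and port 18321 come from this section; `resolve_mode` and its error message are hypothetical, and the real backend may resolve the mode differently.

```python
import os

VALID_MODES = {"daemon", "hive", "subprocess"}  # the Mode column of the table above
PORT = 18321                                    # fixed everywhere, never negotiated


def resolve_mode(env=None) -> str:
    """Return the configured mode, or fail loudly instead of guessing a default."""
    env = os.environ if env is None else env
    mode = env.get("SWARMAI_MODE")
    if mode not in VALID_MODES:
        # No fallback: an unset or unknown mode is a startup error.
        raise RuntimeError(f"SWARMAI_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return mode
```

Failing at startup keeps a misconfigured host from silently running with the wrong lifecycle, for example a desktop subprocess behaving like a 24/7 daemon.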

Landscape — What We Learn From, Where We Diverge

SwarmAI builds on the Claude Code SDK and learns from the projects listed below. The difference isn't features; it's what we're trying to prove.

Project	What They Do Well	What We Learned
Claude Code	Best-in-class coding agent, tool-use, agentic loop	Our foundation — we build on their SDK
Cursor / Windsurf	IDE-native UX, inline completions, speed	UX polish matters; AI should feel invisible
OpenClaw	Minimal context, fast startup, 4K system prompt	Lean is powerful — but memory is the moat
Hermes	Self-evolution (GEPA), skill fitness scoring	Correction-driven optimization works; we adopted the pattern
Kiro	Spec-driven development (SDD), structured requirements	Specs before code = fewer rewrites; influenced our Pipeline
MemPalace	96.6% recall, structured memory extraction	Memory architecture is a first-class concern, not an afterthought

Where SwarmAI diverges:

These projects optimize for one role. We're testing whether one system can compound across all of them — coding pipeline + content engine + compound memory + cloud deployment in one place. Not scope creep. Thesis validation.

See It In Action

Architecture Diagrams

📖 Full docs: Platform Overview · DDD Cultivation Engine · Autonomous Pipeline · Goal Loop · Pollinate Engine

Design Philosophy — When Beliefs Become Enforcement

The gap between "good practice" and "enforced invariant" is where most systems leak quality. These posters explain the reasoning behind specific enforcement mechanisms in the codebase.

The Series (6 pieces, one thesis)

#	Topic	Core Question	Poster
1	Compound Intelligence	Why is 1+1+1+1 > 4?	Poster · Deep Dive
2	Agent Harness	What does an AI need to have continuity?	Poster
3	DDD Cultivation	How does domain knowledge grow from zero effort?	Poster
4	Pipeline	How does code quality converge instead of fluctuate?	Poster
5	Pollinate	How does one person produce team-grade content?	Poster

Three-Level Hardening (From Belief to Invariant)

Every design philosophy goes through three levels of hardening:

Level	What it means	Equivalent
L3: Structural Impossibility	Violation doesn't compile. The wrong code path physically doesn't exist.	Type system
L2: Mechanical Gate	Code intercepts. Hooks enforce. Mechanism runs, precision iterates.	Linter rule (warning → error)
L1: Directive	Text rule. Relies on compliance. Honor system — will be skipped under pressure.	Code comment `// don't do X`

Already at L3 (violation impossible):

Self-Context: edit a system file → overwritten on next boot (code enforces ownership)
Self-Memory: 30d TTL → distillation → promotion (code drives every tier transition)
Self-Evolution: correction → pattern detection → rule promotion (automatic, zero human judgment)
Prevention > Recovery: timeout + Lock + intentional_shutdown flag (structurally impossible to hang)

At L2 (mechanism running, hardening in progress):

Self-Feedback: hooks fire every session; signal/noise ratio is a tuning problem, not a structural one
Self-Healing: health scores computed; trust modification shifting from directive to gate
Self-Monitoring: post-task review rule exists; enforcement shifting from honor-system to hook gate

Hardening is gradual. Level 2 is the pattern's intermediate state, not a defect. Self-Evolution proved the path works.

Compound Flywheel (why multiplication, not addition)

Four systems feeding each other:

Pipeline reads DDD → domain-correct delivery → REFLECT writes lessons → DDD richer → next Pipeline smarter
Pollinate reads DDD → brand-correct content → REFLECT writes insights → DDD richer → next content more precise
Error anywhere → Correction → pattern recurs → auto-promotes to STEERING rule → bug class eliminated forever

Remove any one component and the others get weaker. That's the multiplication test.
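
The third flow (a correction whose pattern recurs gets promoted to a rule) can be sketched as a counting step. The recurrence threshold of 2, the `pattern` field, and the function name are assumptions; the source does not say how many recurrences trigger promotion or how patterns are labeled.

```python
from collections import Counter

PROMOTE_AFTER = 2  # assumption: a pattern seen twice counts as recurring


def rules_to_promote(corrections: list, existing_rules: set,
                     threshold: int = PROMOTE_AFTER) -> set:
    """Return pattern labels that recur often enough and are not yet rules."""
    counts = Counter(c["pattern"] for c in corrections)
    return {p for p, n in counts.items() if n >= threshold and p not in existing_rules}
```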

Two More Structural Choices (beyond the five bets above)

Choice	Why it's different
Ownership model	11 files × 3 owners (system/user/agent). Conflicts have deterministic behavior. Not "anyone can edit CLAUDE.md."
Memory sovereignty	Never use platform memory (Claude/GPT/Gemini Memory). Own pipeline, own schema, own lifecycle. Moats don't belong on someone else's foundation.
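
To make the ownership model concrete, here is a minimal sketch of a startup sync under the three tiers. The behavior for system-owned and user-owned files follows the descriptions in this README; treating agent-owned files like user-owned ones at startup, and the function and parameter names, are assumptions.

```python
from pathlib import Path


def sync_on_startup(root: Path, defaults: dict, owners: dict) -> None:
    """Apply ownership rules to each context file under root.

    defaults maps file name -> default text; owners maps file name -> tier.
    """
    for name, default_text in defaults.items():
        path = root / name
        if owners[name] == "system":
            # System-owned: always overwritten, so manual edits do not survive a boot.
            path.write_text(default_text, encoding="utf-8")
        elif not path.exists():
            # User-owned and agent-owned: seeded once, then never overwritten.
            path.write_text(default_text, encoding="utf-8")
```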

📖 Full design docs: Platform Overview · Harness Design · Pipeline Design · DDD Engine · Pollinate Engine

Quality Convergence (Thesis Validation)

Version Range	P0/Release	Failure Class	Pipeline Status
v1.6–v1.9	~1.0	Catastrophic (OOM, app won't start)	Pre-adversarial review
v1.10–v1.12	~0.3	Edge case (race conditions, platform quirks)	Full pipeline + adversarial active
v1.13–v1.15	0.0	None shipped (caught pre-merge)	Full pipeline + adversarial + DDD cultivation

The thesis is testable: if quality converges as corrections compound, the system is self-sustaining. Early evidence says yes. Full data: docs/CONVERGENCE.md.

Post-Mortems (How Failures Became Gates)

These aren't just bug reports — they're the origin stories of our structural prevention mechanisms. Each one eliminated an entire class of failures.

#	Post-Mortem	Root Cause	Structural Fix
1	Pipeline said 10/10. Feature was 100% broken.	Confidence measures process compliance, not code correctness	Mandatory adversarial review by fresh sub-agent
2	3 consecutive "I'm sure" → 3 hangs	Incremental fix-without-understanding of state machine	2-strike rule: failed twice = draw the state machine first
3	Why adversarial review is mandatory	Self-review has systematic blind spots (assumption carry-forward)	Structurally independent reviewer with zero builder context

Every correction in EVOLUTION.md follows this pattern: failure → root cause analysis → structural prevention → zero recurrence of that class.

What We Publish vs. What We Don't

Public (in this repo)	Private (operational data)
CONVERGENCE.md — timestamped metrics with git evidence	MEMORY.md — 75 days of key decisions, lessons, corrections
EVOLUTION.md (template) — correction registry structure	DailyActivity/ — raw session-by-session learning logs
Commit history — every change is traceable	Full session transcripts — contain work context
Pipeline artifacts — run reports with confidence scores	User/steering context — personal workflow data

Why not open everything? Operational memory contains real work context (internal projects, org details, customer interactions). Publishing it would compromise privacy without adding verification value. The metrics in CONVERGENCE.md are independently verifiable via git — you don't need our memory files to confirm the data points.

What you can verify externally: Correction count (grep EVOLUTION.md template), test count (pytest --co), commit history (git log), P0 rate (release notes). If a number in CONVERGENCE.md is wrong, git proves it.

Quick Start

Full guide: QUICK_START.md · Contributing: CONTRIBUTING.md

Install

macOS (Apple Silicon): Download .dmg from Releases → drag to Applications

Prerequisites: Claude Code CLI + AWS Bedrock or Anthropic API key.

Build from Source (5 minutes)

git clone https://github.com/xg-gh-25/SwarmAI.git && cd SwarmAI
cd backend && uv sync && cp .env.example .env   # edit .env with API key
cd ../desktop && npm install && npm run tauri:dev

Requires: Node.js 18+, Python 3.11+, Rust, uv

Codebase Map (170K LOC — don't panic)

Layer	Size	What	Start here?
Core	~11K	Session state machine + context assembly + streaming	Yes — 5 files explain the whole system
Extensions	~90K	68 skills, 11 hooks, job system, channels, DDD engine	Only what you're working on
Frontend	~67K	React UI, chat interface, workspace explorer	Only if touching UI
Tests	~76K	Full coverage (pytest + Vitest)	Read when modifying core

New contributor? Read CONTRIBUTING.md — it has the architecture diagram, the 5 core files to read first, and good-first-issue guidance. Skills are the easiest entry point (self-contained, no core knowledge needed).

Stack

Tauri 2.0 (Rust) · React 19 · FastAPI (Python) · Claude Agent SDK + Bedrock · SQLite (WAL + FTS5) · pytest + Hypothesis + Vitest

Contributors

Xiaogang Wang
Creator & Chief Architect

Swarm 🐝
AI Co-Developer (Claude Opus 4.6)
Architecture · Code · Docs · Self-Evolution

License

MIT License

Contributing

Issues and PRs welcome. See CONTRIBUTING.md.

GitHub: https://github.com/xg-gh-25/SwarmAI
Docs: QUICK_START.md · USER_GUIDE.md

SwarmAI — Human directs. AI delivers.
