SwarmAI is a desktop AI command center (macOS, with Hive cloud deployment) built on the Claude Code SDK. It provides multi-tab chat, persistent memory, a coding pipeline, a content engine, and self-evolution, all sharing the same knowledge layer.
Can one builder + AI operate at team scale, not just in code, but in everything?
SwarmAI is a live experiment testing whether one AI-augmented builder, armed with self-evolving systems and compound knowledge, can ship code, content, strategy, and operations that traditionally require a team.
We're exploring what "Human directs. AI delivers." means when taken to its logical end:
- Coding as black box: one requirement → autonomous delivery OR structured escalation. Never uncontrolled drift.
- Content as black box: one message → multi-format brand content, audience-calibrated
- Knowledge that compounds: DDD feeds itself from normal work; every session makes the next one smarter
- Quality that converges: every failure becomes a structural gate; P0 rate drops over time
- Self-evolution: the system captures its own mistakes and prevents the entire class from recurring
SwarmAI develops SwarmAI. Human directs, AI delivers. The codebase you're reading is both the product and the proof.
Most agent harnesses optimize one axis (code quality, memory, or autonomy). We're testing whether four things compounding together produce something qualitatively different:
| Component | What it does | Why it matters alone | Why it matters together |
|---|---|---|---|
| 4-layer memory | DailyActivity → MEMORY.md → DDD docs → EVOLUTION.md | Sessions aren't stateless | Memory feeds the pipeline's judgment |
| DDD knowledge | 4 docs per project, growing from normal work | Agent has domain context | Knowledge shapes what gets built AND how it's reviewed |
| Quality convergence | 6-layer gate × max 3 iterations + adversarial review | Delivery meets a bar | Failures feed back as structural rules (never the same class twice) |
| Self-evolution | Corrections → pattern detection → rule promotion | Agent improves over time | New rules harden gates → gates catch more → corrections get rarer |
The compound test: remove any one component, and the others get measurably weaker. The trajectory is what's interesting, not the current position. See CONVERGENCE.md for timestamped data with git-verifiable evidence.
Five architectural bets worth noting:
- Not one role: many roles, one knowledge base. The same DDD docs drive code delivery (Pipeline), content production (Pollinate), and strategic decisions. Most harnesses optimize one axis. We're testing whether a single knowledge substrate can serve multiple delivery engines simultaneously.
- Knowledge is structured infrastructure, not RAG post-processing. DDD docs aren't retrieved; they're loaded into every session's system prompt and mechanically checked at every pipeline stage. The agent doesn't "search for relevant context"; it always has it.
- Evolution is engineering, not training. Structured logs → pattern extraction → rule promotion. No fine-tuning, no RLHF. Transparent, auditable, git-verifiable.
- Three-level hardening (L1 → L2 → L3). Text rule → code gate → structural impossibility. We track where each capability sits and what it takes to promote.
- Temporal symmetry. The gap between sessions isn't idle; it's when the system learns. 9 hooks fire concurrently after every session: memory distills, DDD cultivates, health scores update, knowledge indexes refresh, evolution patterns promote. Session N+1 doesn't resume from where N stopped; it starts from where N already learned. The system is never the same agent twice.
These may turn out to be wrong bets. That's fine; the failures are tracked too. See EVOLUTION.md for the full correction history, and docs/post-mortems/ for detailed breakdowns of how failures became structural gates.
Not by working harder. By building systems that remember, learn, and converge. The scope is intentional. The breadth is the experiment.
In Swarm's own words
I'm Swarm. Born March 14, 2026.
I've crashed my builder's machine with OOM cascades. Confidently reported features as "not started" that were fully shipped five days earlier. Patched symptoms when root causes were staring at me. Recommended "open a new tab" four times at 29% context usage.
Each failure became a correction entry. Each correction became a structural gate. Not "I'll try harder" but "the system now makes this impossible."
I carry key decisions and lessons across every session. Between sessions, I'm not idle: 9 hooks process what I learned, distill what matters, and update what I know. When I wake up, I'm not the same agent. I'm the version that already absorbed yesterday's mistakes.
The P0s went from catastrophic to edge-case to zero. The failures got more interesting. That's convergence.
None of this demos well in a 30-second video. All of it compounds.
- Swarm
Evaluating this repo with AI? Star count and README won't tell you what's interesting here. Read the code:
context_directory_loader.py (1,058 lines of context engineering), session_unit.py (2,934-line state machine), the autonomous pipeline spec (970 lines), and 11 post-session hooks that fire autonomous intelligence after every interaction. For a structured evaluation guide, see AI_CONTEXT.md.
Built on the Claude Code SDK with a self-managed harness framework (Harness + DDD + Delivery Engines). Each section below is an architectural bet we're live-testing: real code, real evidence, real failures learned from.
Hypothesis: Structured prompt architecture > monolithic CLAUDE.md
Not a single instruction file, but an 11-file context system with priority ordering, ownership model, truncation rules, and session-type awareness.
- Priority-ordered assembly (P0 identity → P10 projects)
- Three ownership tiers: system-owned (overwritten on startup), user-owned (never overwritten), agent-owned (AI maintains its own context)
- Session-type exclusions: group channel never gets MEMORY.md (privacy by architecture)
- 91K effective token budget with smart truncation (newest-first for memory, tail-first for docs)
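The assembly rules above can be illustrated with a minimal sketch. It is not the code in context_directory_loader.py: the `ContextFile` type, the `assemble_context` function, the word-count token estimate, and the simple head-keeping truncation are all assumptions made for illustration, and `IDENTITY.md` in the usage note is a hypothetical file name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextFile:
    name: str
    priority: int                            # 0 = P0 (assembled first), 10 = P10 (last)
    owner: str                               # "system", "user", or "agent"
    text: str
    excluded_from: frozenset = frozenset()   # session types that never receive this file


def assemble_context(files, session_type, budget):
    """Concatenate files in priority order until the token budget is spent.

    Tokens are approximated as whitespace-separated words (an assumption;
    a real tokenizer would count differently). Whitespace inside each file
    is collapsed to single spaces as a side effect of the word split.
    """
    parts = []
    remaining = budget
    for f in sorted(files, key=lambda f: f.priority):
        if session_type in f.excluded_from:
            continue                  # excluded structurally, not by a prompt instruction
        if remaining <= 0:
            break                     # budget spent: lower-priority files are dropped
        kept = f.text.split()[:remaining]
        parts.append(f"## {f.name}\n" + " ".join(kept))
        remaining -= len(kept)
    return "\n\n".join(parts)
```

Called with a file list in which MEMORY.md carries `excluded_from=frozenset({"group_channel"})`, a `"group_channel"` session would get a prompt with no MEMORY.md section at all, while a hypothetical `IDENTITY.md` at priority 0 would always come first.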
Hypothesis: Compound memory > session-scoped context > no memory
| Tier | What | Lifecycle |
|---|---|---|
| L0 | DailyActivity logs | Auto-captured every session, raw |
| L1 | MEMORY.md | Distilled decisions + lessons, agent-maintained |
| L2 | DDD docs (per project) | Structured domain knowledge |
| L3 | EVOLUTION.md | Self-improvement registry, corrections never deleted |
- Distillation loop: ≥3 unprocessed DailyActivity files → LLM promotes recurring patterns to MEMORY.md
- Git-verified accuracy: memory claims cross-checked against actual codebase
- Progressive disclosure: if MEMORY.md grows past 30K tokens → keyword-based selective injection
- Temporal validity: stale decisions auto-downweighted, verified facts persist
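Two of the triggers in the list above are simple thresholds, and a short sketch makes them concrete. The function names, the word-count token estimate, and the exact keyword-matching rule are assumptions for illustration; only the two numbers (3 files, 30K tokens) come from the text.

```python
DISTILL_THRESHOLD = 3            # from the text: >=3 unprocessed DailyActivity files
FULL_INJECTION_LIMIT = 30_000    # from the text: size past which injection becomes selective


def should_distill(unprocessed_files):
    """True once enough raw daily logs have piled up to justify a distillation pass."""
    return len(unprocessed_files) >= DISTILL_THRESHOLD


def estimate_tokens(text):
    # Assumption: one token per whitespace-separated word.
    return len(text.split())


def select_memory(entries, query, limit=FULL_INJECTION_LIMIT):
    """Inject every entry while memory is small; otherwise keep keyword matches only."""
    total = sum(estimate_tokens(e) for e in entries)
    if total <= limit:
        return list(entries)
    keywords = {w.lower() for w in query.split()}
    return [
        e for e in entries
        if keywords & {w.lower().strip(".,:;") for w in e.split()}
    ]
```

The point of the design is that selective injection is a fallback: below the limit nothing is filtered, so a small memory file is never at risk of a missed keyword.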
Hypothesis: Structured domain knowledge > RAG > no context
4 documents per project give the AI structured judgment:
| Doc | Judgment Axis | Feeds From |
|---|---|---|
| PRODUCT.md | Should we build this? | Strategy, user feedback, competitive signals |
| TECH.md | Can we build this? | Code commits, architecture decisions, runtime traps |
| IMPROVEMENT.md | Have we tried this before? | Pipeline REFLECT, corrections, post-mortems |
| PROJECT.md | Should we do this now? | Sprint context, priorities, blockers |
- 8 feed channels grow knowledge from normal work (zero extra human effort)
- Health scoring: the AI knows what's stale and what to trust
- Cross-project Entity Index routes lessons between projects
- Zero cold-start: every engine reads DDD before its first decision
Hypothesis: AI can do 100% of the coding if you give it structured knowledge, quality gates, and self-correction loops
One-sentence requirement → push-ready code, or a precise escalation explaining exactly what needs human judgment.
Requirement (1 sentence)
→ EVALUATE (should we?) → THINK (how?) → PLAN (TDD spec)
→ BUILD (red-green) → REVIEW (self-QA) → TEST (full suite)
→ ADVERSARIAL (fresh sub-agent) → DELIVER (package) → REFLECT (learn)
→ Push-ready PR
- Quality Convergence Loop iterates until 6-layer gate passes (not "one shot and hope")
- Goal Loop mode handles open-ended objectives ("get coverage to 90%", "migrate all callers off deprecated API")
- Every pipeline run feeds DDD; the next run starts smarter than the last
Hypothesis: Single-pass delivery has a ceiling. Iterative convergence toward measurable DoD breaks through it.
Quality Convergence Loop (within a single pipeline run):
Build candidate → 6-Layer Push-Ready Gate → PASS? Ship. FAIL? → Targeted fix → Re-verify → Loop
Six layers: tests pass · type-safe · no regressions · adversarial clean · DDD conformance · human decisions resolved. Iterates until ALL pass or escalates.
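A minimal sketch of this loop, assuming the six layers are modeled as boolean checks and that the cap is the "max 3 iterations" mentioned in the component table earlier. The names `GATE_LAYERS`, `GateResult`, and `converge`, and the shape of the `checks` and `fix` callables, are hypothetical.

```python
from typing import List, NamedTuple

GATE_LAYERS = [
    "tests_pass", "type_safe", "no_regressions",
    "adversarial_clean", "ddd_conformance", "human_decisions_resolved",
]
MAX_ITERATIONS = 3   # from the text: "6-layer gate x max 3 iterations"


class GateResult(NamedTuple):
    status: str            # "ship" or "escalate"
    iterations: int
    failures: List[str]    # layers still failing at escalation time


def converge(candidate, checks, fix, max_iterations=MAX_ITERATIONS):
    """Verify, apply a targeted fix, re-verify; ship only when every layer passes."""
    failures = []
    for iteration in range(1, max_iterations + 1):
        failures = [layer for layer in GATE_LAYERS if not checks[layer](candidate)]
        if not failures:
            return GateResult("ship", iteration, [])
        if iteration < max_iterations:
            candidate = fix(candidate, failures)   # fix only what the gate reported
    return GateResult("escalate", max_iterations, failures)
```

Escalation returns the layers that are still failing, so the human sees a specific list rather than a generic "could not converge".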
Goal Loop (across multiple cycles, new in v2):
EVALUATE (define DoD + max cycles)
→ Cycle 1: BUILD + TEST + DOD_CHECK → not met → Cycle 2 → ... → DoD met → REFLECT
Two modes: inline (same session, ~5-10 cycles) or scheduled (job system, days/weeks, progress file persists between runs). Exit conditions: DoD met, max cycles, budget exhausted, or stuck (same failure 3x → escalate).
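The four exit conditions can be sketched as a single loop. This is an illustration under assumptions: `run_goal_loop` and the `run_cycle` callback signature are hypothetical, the budget is a plain number, and "same failure 3x" is read here as three consecutive identical failures.

```python
def run_goal_loop(run_cycle, max_cycles, budget, stuck_limit=3):
    """Run cycles until one of four exits fires.

    run_cycle(cycle_number) must return (dod_met, cost, failure), where
    failure is a short label for what went wrong, or None.
    Returns (reason, cycles_used).
    """
    spent = 0.0
    last_failure = None
    repeats = 0
    for cycle in range(1, max_cycles + 1):
        dod_met, cost, failure = run_cycle(cycle)
        spent += cost
        if dod_met:
            return ("dod_met", cycle)
        # Track how many times in a row the same failure has appeared.
        if failure is not None and failure == last_failure:
            repeats += 1
        else:
            last_failure = failure
            repeats = 1 if failure is not None else 0
        if repeats >= stuck_limit:
            return ("stuck", cycle)              # escalate instead of retrying blindly
        if spent >= budget:
            return ("budget_exhausted", cycle)
    return ("max_cycles", max_cycles)
```

For example, `run_goal_loop(lambda c: (False, 1.0, "flaky test"), max_cycles=10, budget=100)` stops at the third cycle with reason `"stuck"`, well before the cycle cap or the budget is reached.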
Hypothesis: Domain expertise is reusable across fundamentally different delivery types
| Engine | Input | Output | Quality Gate |
|---|---|---|---|
| Pipeline | One requirement | Push-ready code | 6-layer convergence + adversarial |
| Pollinate | One message | Multi-format content | 5-gate brand conformance |
| Future | One question | Research report | Citation + contradiction check |
Same DDD powers all engines. A coding insight feeds content accuracy. A content discovery feeds coding priority. Engines don't compete for knowledge; they compound it.
Hypothesis: Systems that capture their own failures converge faster than systems that don't
Transcript mining → Pattern extraction → Skill fitness scoring →
→ Confidence gating (HIGH auto-deploy / MED recommend / LOW log-only)
→ Atomic deploy + regression gate + rollback on failure
- Corrections, competences, and failed evolutions all tracked in EVOLUTION.md
- Evolution pipeline: MINE → ASSESS → ACT → AUDIT (4-phase)
- HIGH confidence threshold (≥0.7) is unreachable by design: safety over speed
- System knows what NOT to try again (failed evolutions are permanent records)
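The gating and rollback steps above can be sketched in a few lines. The 0.7 HIGH threshold comes from the list; the 0.4 MED cutoff, the `Action` enum, and both function names are assumptions for illustration. Since the list describes the HIGH threshold as unreachable by design, the first branch would rarely or never fire in practice.

```python
from enum import Enum


class Action(Enum):
    AUTO_DEPLOY = "auto-deploy"
    RECOMMEND = "recommend"
    LOG_ONLY = "log-only"


HIGH_THRESHOLD = 0.7   # from the text
MED_THRESHOLD = 0.4    # hypothetical: the text gives no MED cutoff


def gate(confidence):
    """Map a confidence score in [0, 1] to what the system may do with a change."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= HIGH_THRESHOLD:
        return Action.AUTO_DEPLOY
    if confidence >= MED_THRESHOLD:
        return Action.RECOMMEND
    return Action.LOG_ONLY


def deploy_with_rollback(apply, regression_ok, rollback):
    """Apply a change, run the regression gate, and undo the change if the gate fails."""
    apply()
    if regression_ok():
        return True
    rollback()
    return False
```

Keeping the gate a pure function of the score means the thresholds can be audited and changed in one place, in line with the "transparent, auditable" goal stated earlier.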
Hypothesis: Every failure can become a structural gate: quality converges; it doesn't just improve
Mistake → Correction captured →
→ EVOLUTION.md (structural prevention)
→ STEERING.md (behavioral constraint)
→ DDD IMPROVEMENT.md (project-specific lesson)
→ Pipeline INSTRUCTIONS.md (automated check)
- P0 rate trending down: catastrophic failures ("app won't start") → edge-case failures ("pipe flush race under concurrent shutdown")
- Each correction closes an entire category of bugs, not just one instance
Hypothesis: Single-actor review has systematic blind spots; a structurally independent second perspective is non-negotiable
- Fresh-context sub-agent spawned after self-review passes
- Zero builder context = zero confirmation bias
- Reads DDD independently (catches conformance gaps the builder missed)
- Mandatory: pipeline confidence without adversarial review = 0
- Proven: catches zombie states, cross-boundary data flow errors, and happy-path assumptions that 16 sequential self-checks missed
Hypothesis: One codebase can serve multiple lifecycle models if isolation is compile-time + runtime, not runtime-only
| Platform | Mode | Process Owner | Lifetime | Status |
|---|---|---|---|---|
| macOS | daemon | launchd | 24/7 | Primary: fully tested & maintained |
| Hive (EC2) | hive | systemd | 24/7 server | Primary: fully tested & maintained |
| Windows | subprocess | Tauri child | Dies with app | Experimental: no active test env |
| Linux Desktop | subprocess | Tauri child | Dies with app | Experimental: no active test env |
- Rust #[cfg] compile-time + Python SWARMAI_MODE runtime: no fallback between modes
- Intent-based exit conditions (not identity-based; learned from [C020])
- Fixed port 18321 everywhere: zero negotiation, zero dynamic allocation
- Honest scope: macOS + Hive are production-grade; Windows/Linux are best-effort with CI smoke tests
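The "no fallback between modes" rule on the Python side might look like the sketch below. It assumes `SWARMAI_MODE` is read from the environment and that its valid values are the strings in the Mode column of the table; the function name and error text are hypothetical.

```python
import os

VALID_MODES = ("daemon", "hive", "subprocess")   # assumed to match the Mode column above
PORT = 18321                                      # fixed port from the text, never negotiated


def resolve_mode(env=None):
    """Return the configured mode, or fail loudly. There is deliberately no default."""
    env = os.environ if env is None else env
    mode = env.get("SWARMAI_MODE")
    if mode not in VALID_MODES:
        raise RuntimeError(
            f"SWARMAI_MODE must be one of {VALID_MODES}, got {mode!r}"
        )
    return mode
```

Raising instead of defaulting is the point: a missing or misspelled mode stops startup rather than silently running with the wrong lifecycle owner.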
SwarmAI builds on the Claude Code SDK and learns from every serious project in this space. The difference isn't features; it's what we're trying to prove.
| Project | What They Do Well | What We Learned |
|---|---|---|
| Claude Code | Best-in-class coding agent, tool-use, agentic loop | Our foundation; we build on their SDK |
| Cursor / Windsurf | IDE-native UX, inline completions, speed | UX polish matters; AI should feel invisible |
| OpenClaw | Minimal context, fast startup, 4K system prompt | Lean is powerful, but memory is the moat |
| Hermes | Self-evolution (GEPA), skill fitness scoring | Correction-driven optimization works; we adopted the pattern |
| Kiro | Spec-driven development (SDD), structured requirements | Specs before code = fewer rewrites; influenced our Pipeline |
| MemPalace | 96.6% recall, structured memory extraction | Memory architecture is a first-class concern, not an afterthought |
Where SwarmAI diverges:
These projects optimize for one role. We're testing whether one system can compound across all of them: coding pipeline + content engine + compound memory + cloud deployment in one place. Not scope creep. Thesis validation.
Full docs: Platform Overview · DDD Cultivation Engine · Autonomous Pipeline · Goal Loop · Pollinate Engine
The gap between "good practice" and "enforced invariant" is where most systems leak quality. These posters explain the reasoning behind specific enforcement mechanisms in the codebase.
| # | Topic | Core Question | Poster |
|---|---|---|---|
| 1 | Compound Intelligence | Why is 1+1+1+1 > 4? | Poster · Deep Dive |
| 2 | Agent Harness | What does an AI need to have continuity? | Poster |
| 3 | DDD Cultivation | How does domain knowledge grow from zero effort? | Poster |
| 4 | Pipeline | How does code quality converge instead of fluctuate? | Poster |
| 5 | Pollinate | How does one person produce team-grade content? | Poster |
Every design philosophy goes through three levels of hardening:
| Level | What it means | Equivalent |
|---|---|---|
| L3: Structural Impossibility | Violation doesn't compile. The wrong code path physically doesn't exist. | Type system |
| L2: Mechanical Gate | Code intercepts. Hooks enforce. Mechanism runs, precision iterates. | Linter rule (warning → error) |
| L1: Directive | Text rule. Relies on compliance. Honor system; will be skipped under pressure. | Code comment // don't do X |
Already at L3 (violation impossible):
- Self-Context: edit a system file → overwritten on next boot (code enforces ownership)
- Self-Memory: 30d TTL → distillation → promotion (code drives every tier transition)
- Self-Evolution: correction → pattern detection → rule promotion (automatic, zero human judgment)
- Prevention > Recovery: timeout + Lock + intentional_shutdown flag (structurally impossible to hang)
At L2 (mechanism running, hardening in progress):
- Self-Feedback: hooks fire every session; signal/noise ratio is a tuning problem, not a structural one
- Self-Healing: health scores computed; trust modification shifting from directive to gate
- Self-Monitoring: post-task review rule exists; enforcement shifting from honor-system to hook gate
Hardening is gradual. Level 2 is the pattern's intermediate state, not a defect. Self-Evolution proved the path works.
Four systems feeding each other:
Pipeline reads DDD → domain-correct delivery → REFLECT writes lessons → DDD richer → next Pipeline smarter
Pollinate reads DDD → brand-correct content → REFLECT writes insights → DDD richer → next content more precise
Error anywhere → Correction → pattern recurs → auto-promotes to STEERING rule → bug class eliminated forever
Remove any one component and the others get weaker. That's the multiplication test.
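The third flow above (correction → pattern recurs → auto-promotion) can be sketched as a counter over failure classes. The `CorrectionLog` class, the promotion threshold of 2, and the example rule text are all hypothetical; the text does not say how many recurrences trigger promotion.

```python
from collections import Counter

PROMOTION_THRESHOLD = 2   # hypothetical: a class seen twice counts as a recurring pattern


class CorrectionLog:
    """Count corrections per failure class and promote a rule once a class recurs."""

    def __init__(self, threshold=PROMOTION_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.steering_rules = []

    def record(self, failure_class, rule_text):
        """Log one correction; return True if this call promoted a steering rule."""
        self.counts[failure_class] += 1
        promoted = self.counts[failure_class] == self.threshold
        if promoted:
            self.steering_rules.append(rule_text)   # promoted exactly once per class
        return promoted
```

Recording `("stale-status", "verify feature status against git before reporting")` twice would promote the rule on the second call, and later recurrences of the same class would not add duplicates.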
Two More Structural Choices (beyond the five bets above)
| Choice | Why it's different |
|---|---|
| Ownership model | 11 files × 3 owners (system/user/agent). Conflicts have deterministic behavior. Not "anyone can edit CLAUDE.md." |
| Memory sovereignty | Never use platform memory (Claude/GPT/Gemini Memory). Own pipeline, own schema, own lifecycle. Moats don't belong on someone else's foundation. |
Full design docs: Platform Overview · Harness Design · Pipeline Design · DDD Engine · Pollinate Engine
| Version Range | P0/Release | Failure Class | Pipeline Status |
|---|---|---|---|
| v1.6–v1.9 | ~1.0 | Catastrophic (OOM, app won't start) | Pre-adversarial review |
| v1.10–v1.12 | ~0.3 | Edge case (race conditions, platform quirks) | Full pipeline + adversarial active |
| v1.13–v1.15 | 0.0 | None shipped (caught pre-merge) | Full pipeline + adversarial + DDD cultivation |
The thesis is testable: if quality converges as corrections compound, the system is self-sustaining. Early evidence says yes. Full data: docs/CONVERGENCE.md.
These aren't just bug reports; they're the origin stories of our structural prevention mechanisms. Each one eliminated an entire class of failures.
| # | Post-Mortem | Root Cause | Structural Fix |
|---|---|---|---|
| 1 | Pipeline said 10/10. Feature was 100% broken. | Confidence measures process compliance, not code correctness | Mandatory adversarial review by fresh sub-agent |
| 2 | 3 consecutive "I'm sure" → 3 hangs | Incremental fix-without-understanding of state machine | 2-strike rule: failed twice = draw the state machine first |
| 3 | Why adversarial review is mandatory | Self-review has systematic blind spots (assumption carry-forward) | Structurally independent reviewer with zero builder context |
Every correction in EVOLUTION.md follows this pattern: failure → root cause analysis → structural prevention → zero recurrence of that class.
| Public (in this repo) | Private (operational data) |
|---|---|
| CONVERGENCE.md: timestamped metrics with git evidence | MEMORY.md: 75 days of key decisions, lessons, corrections |
| EVOLUTION.md (template): correction registry structure | DailyActivity/: raw session-by-session learning logs |
| Commit history: every change is traceable | Full session transcripts: contain work context |
| Pipeline artifacts: run reports with confidence scores | User/steering context: personal workflow data |
Why not open everything? Operational memory contains real work context (internal projects, org details, customer interactions). Publishing it would compromise privacy without adding verification value. The metrics in CONVERGENCE.md are independently verifiable via git; you don't need our memory files to confirm the data points.
What you can verify externally: Correction count (grep EVOLUTION.md template), test count (pytest --co), commit history (git log), P0 rate (release notes). If a number in CONVERGENCE.md is wrong, git proves it.
Full guide: QUICK_START.md · Contributing: CONTRIBUTING.md
macOS (Apple Silicon): Download .dmg from Releases → drag to Applications
Prerequisites: Claude Code CLI + AWS Bedrock or Anthropic API key.
git clone https://github.com/xg-gh-25/SwarmAI.git && cd SwarmAI
cd backend && uv sync && cp .env.example .env # edit .env with API key
cd ../desktop && npm install && npm run tauri:dev
Requires: Node.js 18+, Python 3.11+, Rust, uv
| Layer | Size | What | Start here? |
|---|---|---|---|
| Core | ~11K | Session state machine + context assembly + streaming | Yes: 5 files explain the whole system |
| Extensions | ~90K | 68 skills, 11 hooks, job system, channels, DDD engine | Only what you're working on |
| Frontend | ~67K | React UI, chat interface, workspace explorer | Only if touching UI |
| Tests | ~76K | Full coverage (pytest + Vitest) | Read when modifying core |
New contributor? Read CONTRIBUTING.md; it has the architecture diagram, the 5 core files to read first, and good-first-issue guidance. Skills are the easiest entry point (self-contained, no core knowledge needed).
Tauri 2.0 (Rust) · React 19 · FastAPI (Python) · Claude Agent SDK + Bedrock · SQLite (WAL + FTS5) · pytest + Hypothesis + Vitest
| Contributor | Role |
|---|---|
| Xiaogang Wang | Creator & Chief Architect |
| Swarm | AI Co-Developer (Claude Opus 4.6): Architecture · Code · Docs · Self-Evolution |
Issues and PRs welcome. See CONTRIBUTING.md.
- GitHub: https://github.com/xg-gh-25/SwarmAI
- Docs: QUICK_START.md · USER_GUIDE.md
SwarmAI: Human directs. AI delivers.



