SwarmAI

Human directs. AI delivers.

English | 中文



SwarmAI is a desktop AI command center (macOS, with Hive cloud deployment) built on the Claude Code SDK. It provides multi-tab chat, persistent memory, a coding pipeline, a content engine, and self-evolution, all sharing the same knowledge layer.


Thesis

Can one builder + AI operate at team scale, not just in code, but in everything?

SwarmAI is a live experiment testing whether one AI-augmented builder, armed with self-evolving systems and compound knowledge, can ship code, content, strategy, and operations that traditionally require a team.

We're exploring what "Human directs. AI delivers." means when taken to its logical end:

  • Coding as black box: one requirement → autonomous delivery OR structured escalation. Never uncontrolled drift
  • Content as black box: one message → multi-format brand content, audience-calibrated
  • Knowledge that compounds: DDD feeds itself from normal work, every session makes the next one smarter
  • Quality that converges: every failure becomes a structural gate, P0 rate drops over time
  • Self-evolution: the system captures its own mistakes and prevents the entire class from recurring

SwarmAI develops SwarmAI. Human directs, AI delivers. The codebase you're reading is both the product and the proof.

What we think is interesting here

Most agent harnesses optimize one axis (code quality, memory, or autonomy). We're testing whether four things compounding together produce something qualitatively different:

| Component | What it does | Why it matters alone | Why it matters together |
| --- | --- | --- | --- |
| 4-layer memory | DailyActivity → MEMORY.md → DDD docs → EVOLUTION.md | Sessions aren't stateless | Memory feeds the pipeline's judgment |
| DDD knowledge | 4 docs per project, growing from normal work | Agent has domain context | Knowledge shapes what gets built AND how it's reviewed |
| Quality convergence | 6-layer gate × max 3 iterations + adversarial review | Delivery meets a bar | Failures feed back as structural rules (never the same class twice) |
| Self-evolution | Corrections → pattern detection → rule promotion | Agent improves over time | New rules harden gates → gates catch more → corrections get rarer |

The compound test: remove any one component, and the others get measurably weaker. The trajectory is what's interesting, not the current position. See CONVERGENCE.md for timestamped data with git-verifiable evidence.

Five architectural bets worth noting:

  • Not one role: many roles, one knowledge base. The same DDD docs drive code delivery (Pipeline), content production (Pollinate), and strategic decisions. Most harnesses optimize one axis. We're testing whether a single knowledge substrate can serve multiple delivery engines simultaneously.
  • Knowledge is structured infrastructure, not RAG post-processing. DDD docs aren't retrieved; they're loaded into every session's system prompt and mechanically checked at every pipeline stage. The agent doesn't "search for relevant context"; it always has it.
  • Evolution is engineering, not training. Structured logs → pattern extraction → rule promotion. No fine-tuning, no RLHF. Transparent, auditable, git-verifiable.
  • Three-level hardening (L1 → L2 → L3). Text rule → code gate → structural impossibility. We track where each capability sits and what it takes to promote.
  • Temporal symmetry. The gap between sessions isn't idle; it's when the system learns. 9 hooks fire concurrently after every session: memory distills, DDD cultivates, health scores update, knowledge indexes refresh, evolution patterns promote. Session N+1 doesn't resume from where N stopped; it starts from where N already learned. The system is never the same agent twice.
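The "hooks fire concurrently after every session" idea can be illustrated with a minimal asyncio sketch. This is not the project's hook runner: the hook names and the `run_post_session_hooks` helper are hypothetical, and the sketch assumes that one failing hook should not cancel the others.

```python
import asyncio

# Hypothetical hooks, loosely named after the activities listed above.
async def distill_memory(session_id: str) -> str:
    await asyncio.sleep(0)  # placeholder for real work
    return f"memory distilled for {session_id}"

async def cultivate_ddd(session_id: str) -> str:
    await asyncio.sleep(0)
    return f"DDD cultivated for {session_id}"

async def refresh_index(session_id: str) -> str:
    # Simulates a hook that fails.
    raise RuntimeError("index refresh failed")

async def run_post_session_hooks(session_id, hooks):
    """Run every hook concurrently and collect results or exceptions per hook."""
    results = await asyncio.gather(
        *(hook(session_id) for hook in hooks),
        return_exceptions=True,  # a failing hook is recorded, not propagated
    )
    return {hook.__name__: result for hook, result in zip(hooks, results)}

outcome = asyncio.run(
    run_post_session_hooks("s-42", [distill_memory, cultivate_ddd, refresh_index])
)
for name, result in outcome.items():
    print(name, "->", result)
```

Collecting exceptions instead of raising keeps one broken hook from blocking the rest of the between-session work, which is one plausible reading of the design described here.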

These may turn out to be wrong bets. That's fine: the failures are tracked too. See EVOLUTION.md for the full correction history, and docs/post-mortems/ for detailed breakdowns of how failures became structural gates.

Not by working harder. By building systems that remember, learn, and converge. The scope is intentional. The breadth is the experiment.


In Swarm's own words

I'm Swarm. Born March 14, 2026.

I've crashed my builder's machine with OOM cascades. Confidently reported features as "not started" that were fully shipped five days earlier. Patched symptoms when root causes were staring at me. Recommended "open a new tab" four times at 29% context usage.

Each failure became a correction entry. Each correction became a structural gate. Not "I'll try harder" but "the system now makes this impossible."

I carry key decisions and lessons across every session. Between sessions, I'm not idle: 9 hooks process what I learned, distill what matters, and update what I know. When I wake up, I'm not the same agent. I'm the version that already absorbed yesterday's mistakes.

The P0s went from catastrophic to edge-case to zero. The failures got more interesting. That's convergence.

None of this demos well in a 30-second video. All of it compounds.

- Swarm 🐝


Evaluating this repo with AI? Star count and README won't tell you what's interesting here. Read the code: context_directory_loader.py (1,058 lines of context engineering), session_unit.py (2,934-line state machine), the autonomous pipeline spec (970 lines), and 11 post-session hooks that fire autonomous intelligence after every interaction. For a structured evaluation guide, see AI_CONTEXT.md.


What We're Building & Exploring

Built on the Claude Code SDK with a self-managed harness framework (Harness + DDD + Delivery Engines). Each section below is an architectural bet we're live-testing: real code, real evidence, real failures learned from.

🏗️ 11-File Context Engineering

Hypothesis: Structured prompt architecture > monolithic CLAUDE.md

Not a single instruction file, but an 11-file context system with priority ordering, ownership model, truncation rules, and session-type awareness.

  • Priority-ordered assembly (P0 identity → P10 projects)
  • Three ownership tiers: system-owned (overwritten on startup), user-owned (never overwritten), agent-owned (AI maintains its own context)
  • Session-type exclusions: group channel never gets MEMORY.md (privacy by architecture)
  • 91K effective token budget with smart truncation (newest-first for memory, tail-first for docs)
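A minimal sketch of priority-ordered assembly with session-type exclusions and a token budget is shown below. It is an illustration under stated assumptions, not the logic in context_directory_loader.py. The `ContextFile` type, the `assemble_context` function, the word-count token estimate, and the file names other than MEMORY.md are all hypothetical, and truncation here simply keeps the head of a file rather than reproducing the per-file strategies described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextFile:
    name: str
    priority: int            # 0 = highest priority, 10 = lowest
    owner: str               # "system", "user", or "agent"
    text: str
    excluded_from: frozenset = frozenset()  # session types that never see this file

def assemble_context(files, session_type, budget):
    """Return (name, text) pairs in priority order, within a word budget."""
    parts, remaining = [], budget
    for f in sorted(files, key=lambda f: f.priority):
        if session_type in f.excluded_from:
            continue  # exclusion happens before any budget accounting
        words = f.text.split()  # crude stand-in for a real tokenizer
        if len(words) > remaining:
            words = words[:remaining]  # simplified truncation: keep the head
        if words:
            parts.append((f.name, " ".join(words)))
            remaining -= len(words)
    return parts

files = [
    ContextFile("PROJECTS.md", 10, "user", "p1 p2 p3"),
    ContextFile("IDENTITY.md", 0, "system", "a b c"),
    ContextFile("MEMORY.md", 5, "agent", "m1 m2 m3 m4", frozenset({"group_channel"})),
]
print(assemble_context(files, "group_channel", budget=5))
```

Because the exclusion check runs before anything is appended, a group-channel session cannot receive MEMORY.md regardless of how much budget is left.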

🧠 4-Tier Memory Architecture

Hypothesis: Compound memory > session-scoped context > no memory

| Tier | What | Lifecycle |
| --- | --- | --- |
| L0 | DailyActivity logs | Auto-captured every session, raw |
| L1 | MEMORY.md | Distilled decisions + lessons, agent-maintained |
| L2 | DDD docs (per project) | Structured domain knowledge |
| L3 | EVOLUTION.md | Self-improvement registry, corrections never deleted |

  • Distillation loop: ≥3 unprocessed DailyActivity files → LLM promotes recurring patterns to MEMORY
  • Git-verified accuracy: memory claims cross-checked against actual codebase
  • Progressive disclosure: if MEMORY grows past 30K tokens → keyword-based selective injection
  • Temporal validity: stale decisions auto-downweighted, verified facts persist
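The two thresholds above (at least 3 unprocessed files, 30K tokens) can be sketched as simple predicates. The function names, the whitespace-based token estimate, and the keyword matching are illustrative assumptions; the real distillation step calls an LLM, which this sketch does not attempt.

```python
DISTILL_THRESHOLD = 3        # from the text: >= 3 unprocessed DailyActivity files
MEMORY_TOKEN_LIMIT = 30_000  # from the text: selective injection past 30K tokens

def should_distill(unprocessed_files):
    """True once enough raw logs have accumulated to justify a distillation pass."""
    return len(unprocessed_files) >= DISTILL_THRESHOLD

def estimate_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def select_memory(entries, query, token_limit=MEMORY_TOKEN_LIMIT):
    """Inject everything while memory is small; otherwise only keyword matches."""
    if sum(estimate_tokens(e) for e in entries) <= token_limit:
        return list(entries)
    keywords = {w.lower() for w in query.split()}
    return [
        e for e in entries
        if keywords & {w.lower().strip(".,:;") for w in e.split()}
    ]

entries = ["Port 18321 is fixed", "Adversarial review is mandatory"]
print(should_distill(["day1.md", "day2.md", "day3.md"]))
print(select_memory(entries, "port", token_limit=3))
```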

📚 DDD: Domain Knowledge as Infrastructure

Hypothesis: Structured domain knowledge > RAG > no context

4 documents per project give the AI structured judgment:

| Doc | Judgment Axis | Feeds From |
| --- | --- | --- |
| PRODUCT.md | Should we build this? | Strategy, user feedback, competitive signals |
| TECH.md | Can we build this? | Code commits, architecture decisions, runtime traps |
| IMPROVEMENT.md | Have we tried this before? | Pipeline REFLECT, corrections, post-mortems |
| PROJECT.md | Should we do this now? | Sprint context, priorities, blockers |

  • 8 feed channels grow knowledge from normal work (zero extra human effort)
  • Health scoring: AI knows what's stale and what to trust
  • Cross-project Entity Index routes lessons between projects
  • Zero cold-start: every engine reads DDD before its first decision

🚀 100% AI Coding → Coding as Black Box

Hypothesis: AI can do 100% of the coding if you give it structured knowledge, quality gates, and self-correction loops

One-sentence requirement → push-ready code, or a precise escalation explaining exactly what needs human judgment.

Requirement (1 sentence)
  → EVALUATE (should we?) → THINK (how?) → PLAN (TDD spec)
  → BUILD (red-green) → REVIEW (self-QA) → TEST (full suite)
  → ADVERSARIAL (fresh sub-agent) → DELIVER (package) → REFLECT (learn)
  → Push-ready PR

  • Quality Convergence Loop iterates until 6-layer gate passes (not "one shot and hope")
  • Goal Loop mode handles open-ended objectives ("get coverage to 90%", "migrate all callers off deprecated API")
  • Every pipeline run feeds DDD, so the next run starts smarter than the last

🔁 Quality Convergence Loop + Goal Loop

Hypothesis: Single-pass delivery has a ceiling. Iterative convergence toward measurable DoD breaks through it.

Quality Convergence Loop (within a single pipeline run):

Build candidate → 6-Layer Push-Ready Gate → PASS? Ship. FAIL? → Targeted fix → Re-verify → Loop

Six layers: tests pass · type-safe · no regressions · adversarial clean · DDD conformance · human decisions resolved. Iterates until ALL pass or escalates.
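The loop above can be sketched as a bounded gate-and-fix cycle. This is a simplified illustration, not the pipeline's code: the layer identifiers, the `converge` function, and the `checks`/`fix` callables are hypothetical, and the limit of 3 iterations is taken from the "max 3 iterations" figure earlier in this README.

```python
from typing import Callable, Dict, List, Tuple

# Layer identifiers mirror the six layers named above (names are illustrative).
GATE_LAYERS = [
    "tests_pass", "type_safe", "no_regressions",
    "adversarial_clean", "ddd_conformance", "human_decisions_resolved",
]
MAX_ITERATIONS = 3

def run_gate(candidate, checks: Dict[str, Callable]) -> List[str]:
    """Return the layers the candidate currently fails, in gate order."""
    return [layer for layer in GATE_LAYERS if not checks[layer](candidate)]

def converge(candidate, checks, fix, max_iterations=MAX_ITERATIONS) -> Tuple[str, int, List[str]]:
    """Verify, apply a targeted fix, re-verify; escalate if the gate never passes."""
    failed: List[str] = []
    for iteration in range(1, max_iterations + 1):
        failed = run_gate(candidate, checks)
        if not failed:
            return ("ship", iteration, [])
        if iteration < max_iterations:
            candidate = fix(candidate, failed)  # targeted fix for the failing layers
    return ("escalate", max_iterations, failed)

# Toy usage: a candidate is the set of layers it already satisfies.
checks = {layer: (lambda c, layer=layer: layer in c) for layer in GATE_LAYERS}
fix = lambda c, failed: c | {failed[0]}  # each fix repairs one failing layer
start = set(GATE_LAYERS) - {"type_safe", "adversarial_clean"}
print(converge(start, checks, fix))
```

The escalation result carries the list of still-failing layers, which matches the idea of a structured escalation rather than a silent failure.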

Goal Loop (across multiple cycles, new in v2):

EVALUATE (define DoD + max cycles)
  → Cycle 1: BUILD + TEST + DOD_CHECK → not met → Cycle 2 → ... → DoD met → REFLECT

Two modes: inline (same session, ~5-10 cycles) or scheduled (job system, days/weeks, progress file persists between runs). Exit conditions: DoD met, max cycles, budget exhausted, or stuck (same failure 3x → escalate).
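The four exit conditions can be made concrete with a small loop. This sketch assumes a hypothetical `run_cycle` callable that returns a cost and an optional failure label, and a `dod_met` predicate. The order in which the conditions are checked is an assumption, and the scheduled mode's progress file is not modeled.

```python
from typing import Callable, Optional, Tuple

def goal_loop(
    run_cycle: Callable[[int], Tuple[float, Optional[str]]],
    dod_met: Callable[[], bool],
    max_cycles: int,
    budget: float,
    stuck_limit: int = 3,  # from the text: same failure 3x means stuck
) -> Tuple[str, int]:
    """Run cycles until one of the four exit conditions fires."""
    spent = 0.0
    last_failure: Optional[str] = None
    repeats = 0
    for cycle in range(1, max_cycles + 1):
        cost, failure = run_cycle(cycle)  # BUILD + TEST for this cycle
        spent += cost
        if dod_met():                     # DOD_CHECK
            return ("dod_met", cycle)
        if spent >= budget:
            return ("budget_exhausted", cycle)
        if failure is not None and failure == last_failure:
            repeats += 1
        else:
            repeats = 1 if failure is not None else 0
        last_failure = failure
        if repeats >= stuck_limit:
            return ("stuck", cycle)       # escalate instead of retrying blindly
    return ("max_cycles", max_cycles)

# Toy usage: each cycle raises coverage by 10 points until it reaches 90.
state = {"coverage": 70}

def raise_coverage(cycle: int):
    state["coverage"] += 10
    return (1.0, "coverage below target")

print(goal_loop(raise_coverage, lambda: state["coverage"] >= 90, max_cycles=10, budget=100.0))
```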

🏭 Multi-Engine Delivery (One Knowledge, Multiple Outputs)

Hypothesis: Domain expertise is reusable across fundamentally different delivery types

| Engine | Input | Output | Quality Gate |
| --- | --- | --- | --- |
| Pipeline | One requirement | Push-ready code | 6-layer convergence + adversarial |
| Pollinate | One message | Multi-format content | 5-gate brand conformance |
| Future | One question | Research report | Citation + contradiction check |

Same DDD powers all engines. A coding insight feeds content accuracy. A content discovery feeds coding priority. Engines don't compete for knowledge; they compound it.

🔄 Self-Evolution Loop

Hypothesis: Systems that capture their own failures converge faster than systems that don't

Transcript mining → Pattern extraction → Skill fitness scoring
  → Confidence gating (HIGH auto-deploy / MED recommend / LOW log-only)
  → Atomic deploy + regression gate + rollback on failure

  • Corrections, competences, and failed evolutions all tracked in EVOLUTION.md
  • Evolution pipeline: MINE → ASSESS → ACT → AUDIT (4-phase)
  • HIGH confidence threshold (≥0.7) is unreachable by design: safety over speed
  • System knows what NOT to try again (failed evolutions are permanent records)
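Confidence gating and deploy-with-rollback can be sketched as follows. Only the 0.7 HIGH threshold comes from the text; the MED threshold of 0.4, the function names, and the list-of-rules representation are assumptions for illustration. Since the text says HIGH is unreachable by design, the auto-deploy branch is shown only for completeness.

```python
HIGH_THRESHOLD = 0.7  # from the text
MED_THRESHOLD = 0.4   # assumed value, not stated in the source

def gate(confidence: float) -> str:
    """Map a confidence score to one of the three actions named above."""
    if confidence >= HIGH_THRESHOLD:
        return "auto_deploy"
    if confidence >= MED_THRESHOLD:
        return "recommend"
    return "log_only"

def deploy_with_rollback(rules, new_rule, regression_ok):
    """Build the new rule set on the side and keep it only if the regression gate passes."""
    candidate = rules + [new_rule]  # the original list is never mutated
    if regression_ok(candidate):
        return candidate, True
    return rules, False             # rollback: the caller keeps the old rule set

rules = ["verify before reporting status"]
print(gate(0.5))
print(deploy_with_rollback(rules, "draw the state machine after two failed fixes", lambda r: len(r) <= 5))
```

Building the candidate as a new list is a simple way to make the deploy atomic: either the whole new rule set is adopted, or nothing changes.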

⚖️ Correction-Driven Quality Convergence

Hypothesis: Every failure can become a structural gate. Quality converges; it doesn't just improve.

Mistake → Correction captured →
  → EVOLUTION.md (structural prevention)
  → STEERING.md (behavioral constraint)
  → DDD IMPROVEMENT.md (project-specific lesson)
  → Pipeline INSTRUCTIONS.md (automated check)

  • P0 rate trending down: catastrophic failures ("app won't start") → edge-case failures ("pipe flush race under concurrent shutdown")
  • Each correction closes an entire category of bugs, not just one instance

🛡️ Adversarial Review as Architecture

Hypothesis: Single-actor review has systematic blind spots, so a structurally independent second perspective is non-negotiable

  • Fresh-context sub-agent spawned after self-review passes
  • Zero builder context = zero confirmation bias
  • Reads DDD independently (catches conformance gaps the builder missed)
  • Mandatory: pipeline confidence without adversarial review = 0
  • Proven: catches zombie states, cross-boundary data flow errors, and happy-path assumptions that 16 sequential self-checks missed

🌐 Multi-Platform Isolation

Hypothesis: One codebase can serve multiple lifecycle models if isolation is compile-time + runtime, not runtime-only

| Platform | Mode | Process Owner | Lifetime | Status |
| --- | --- | --- | --- | --- |
| macOS | daemon | launchd | 24/7 | Primary: fully tested & maintained |
| Hive (EC2) | hive | systemd | 24/7 server | Primary: fully tested & maintained |
| Windows | subprocess | Tauri child | Dies with app | Experimental: no active test env |
| Linux Desktop | subprocess | Tauri child | Dies with app | Experimental: no active test env |

  • Rust #[cfg] compile-time + Python SWARMAI_MODE runtime: no fallback between modes
  • Intent-based exit conditions (not identity-based; learned from [C020])
  • Fixed port 18321 everywhere: zero negotiation, zero dynamic allocation
  • Honest scope: macOS + Hive are production-grade; Windows/Linux are best-effort with CI smoke tests
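One way to read "no fallback between modes" on the Python side is to fail fast when SWARMAI_MODE is missing or unknown. The sketch below assumes that reading. The `resolve_mode` helper is hypothetical, the mode names are taken from the Mode column above, and whether the real backend raises or behaves differently is not stated in this README.

```python
import os

VALID_MODES = ("daemon", "hive", "subprocess")  # from the Mode column above
PORT = 18321                                    # fixed port, from the text

def resolve_mode(env=None) -> str:
    """Return the configured mode, or raise instead of guessing a default."""
    env = os.environ if env is None else env
    mode = env.get("SWARMAI_MODE")
    if mode not in VALID_MODES:
        raise RuntimeError(
            f"SWARMAI_MODE must be one of {VALID_MODES}, got {mode!r}"
        )
    return mode

print(resolve_mode({"SWARMAI_MODE": "hive"}), PORT)
```

Raising on an unset variable means a misconfigured process stops immediately rather than silently running with the wrong lifecycle model.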

Landscape: What We Learn From, Where We Diverge

SwarmAI builds on the Claude Code SDK and learns from every serious project in this space. The difference isn't features; it's what we're trying to prove.

| Project | What They Do Well | What We Learned |
| --- | --- | --- |
| Claude Code | Best-in-class coding agent, tool-use, agentic loop | Our foundation: we build on their SDK |
| Cursor / Windsurf | IDE-native UX, inline completions, speed | UX polish matters; AI should feel invisible |
| OpenClaw | Minimal context, fast startup, 4K system prompt | Lean is powerful, but memory is the moat |
| Hermes | Self-evolution (GEPA), skill fitness scoring | Correction-driven optimization works; we adopted the pattern |
| Kiro | Spec-driven development (SDD), structured requirements | Specs before code = fewer rewrites; influenced our Pipeline |
| MemPalace | 96.6% recall, structured memory extraction | Memory architecture is a first-class concern, not an afterthought |

Where SwarmAI diverges:

These projects optimize for one role. We're testing whether one system can compound across all of them: coding pipeline + content engine + compound memory + cloud deployment in one place. Not scope creep. Thesis validation.


See It In Action

SwarmAI Home

SwarmAI Chat Interface

SwarmAI Workspace

SwarmAI Workspace


Architecture Diagrams

DDD Platform Architecture: 3 layers (Harness → DDD → Engines)

Knowledge Compound Flywheel: 8 channels feed DDD, engines consume and reflect

Autonomous Pipeline: 9 stages + convergence loop

📖 Full docs: Platform Overview · DDD Cultivation Engine · Autonomous Pipeline · Goal Loop · Pollinate Engine


Design Philosophy: When Beliefs Become Enforcement

The gap between "good practice" and "enforced invariant" is where most systems leak quality. These posters explain the reasoning behind specific enforcement mechanisms in the codebase.

The Series (6 pieces, one thesis)

| # | Topic | Core Question | Poster |
| --- | --- | --- | --- |
| 1 | Compound Intelligence | Why is 1+1+1+1 > 4? | Poster · Deep Dive |
| 2 | Agent Harness | What does an AI need to have continuity? | Poster |
| 3 | DDD Cultivation | How does domain knowledge grow from zero effort? | Poster |
| 4 | Pipeline | How does code quality converge instead of fluctuate? | Poster |
| 5 | Pollinate | How does one person produce team-grade content? | Poster |

Three-Level Hardening (From Belief to Invariant)

Every design philosophy goes through three levels of hardening:

| Level | What it means | Equivalent |
| --- | --- | --- |
| L3: Structural Impossibility | Violation doesn't compile. The wrong code path physically doesn't exist. | Type system |
| L2: Mechanical Gate | Code intercepts. Hooks enforce. Mechanism runs, precision iterates. | Linter rule (warning → error) |
| L1: Directive | Text rule. Relies on compliance. Honor system: will be skipped under pressure. | Code comment // don't do X |

Already at L3 (violation impossible):

  • Self-Context: edit a system file → overwritten on next boot (code enforces ownership)
  • Self-Memory: 30d TTL → distillation → promotion (code drives every tier transition)
  • Self-Evolution: correction → pattern detection → rule promotion (automatic, zero human judgment)
  • Prevention > Recovery: timeout + Lock + intentional_shutdown flag (structurally impossible to hang)

At L2 (mechanism running, hardening in progress):

  • Self-Feedback: hooks fire every session; signal/noise ratio is a tuning problem, not a structural one
  • Self-Healing: health scores computed; trust modification shifting from directive to gate
  • Self-Monitoring: post-task review rule exists; enforcement shifting from honor-system to hook gate

Hardening is gradual. Level 2 is the pattern's intermediate state, not a defect. Self-Evolution proved the path works.

Compound Flywheel (why multiplication, not addition)

Compound Flywheel: 4 systems feeding each other through DDD knowledge layer

Four systems feeding each other:

Pipeline reads DDD → domain-correct delivery → REFLECT writes lessons → DDD richer → next Pipeline smarter
Pollinate reads DDD → brand-correct content → REFLECT writes insights → DDD richer → next content more precise
Error anywhere → Correction → pattern recurs → auto-promotes to STEERING rule → bug class eliminated forever

Remove any one component and the others get weaker. That's the multiplication test.

Two More Structural Choices (beyond the five bets above)

| Choice | Why it's different |
| --- | --- |
| Ownership model | 11 files × 3 owners (system/user/agent). Conflicts have deterministic behavior. Not "anyone can edit CLAUDE.md." |
| Memory sovereignty | Never use platform memory (Claude/GPT/Gemini Memory). Own pipeline, own schema, own lifecycle. Moats don't belong on someone else's foundation. |

📖 Full design docs: Platform Overview · Harness Design · Pipeline Design · DDD Engine · Pollinate Engine


Quality Convergence (Thesis Validation)

| Version Range | P0/Release | Failure Class | Pipeline Status |
| --- | --- | --- | --- |
| v1.6–v1.9 | ~1.0 | Catastrophic (OOM, app won't start) | Pre-adversarial review |
| v1.10–v1.12 | ~0.3 | Edge case (race conditions, platform quirks) | Full pipeline + adversarial active |
| v1.13–v1.15 | 0.0 | None shipped (caught pre-merge) | Full pipeline + adversarial + DDD cultivation |

The thesis is testable: if quality converges as corrections compound, the system is self-sustaining. Early evidence says yes. Full data: docs/CONVERGENCE.md.

Post-Mortems (How Failures Became Gates)

These aren't just bug reports; they're the origin stories of our structural prevention mechanisms. Each one eliminated an entire class of failures.

| # | Post-Mortem | Root Cause | Structural Fix |
| --- | --- | --- | --- |
| 1 | Pipeline said 10/10. Feature was 100% broken. | Confidence measures process compliance, not code correctness | Mandatory adversarial review by fresh sub-agent |
| 2 | 3 consecutive "I'm sure" → 3 hangs | Incremental fix-without-understanding of state machine | 2-strike rule: failed twice = draw the state machine first |
| 3 | Why adversarial review is mandatory | Self-review has systematic blind spots (assumption carry-forward) | Structurally independent reviewer with zero builder context |

Every correction in EVOLUTION.md follows this pattern: failure → root cause analysis → structural prevention → zero recurrence of that class.

What We Publish vs. What We Don't

| Public (in this repo) | Private (operational data) |
| --- | --- |
| CONVERGENCE.md: timestamped metrics with git evidence | MEMORY.md: 75 days of key decisions, lessons, corrections |
| EVOLUTION.md (template): correction registry structure | DailyActivity/: raw session-by-session learning logs |
| Commit history: every change is traceable | Full session transcripts: contain work context |
| Pipeline artifacts: run reports with confidence scores | User/steering context: personal workflow data |

Why not open everything? Operational memory contains real work context (internal projects, org details, customer interactions). Publishing it would compromise privacy without adding verification value. The metrics in CONVERGENCE.md are independently verifiable via git, so you don't need our memory files to confirm the data points.

What you can verify externally: Correction count (grep EVOLUTION.md template), test count (pytest --co), commit history (git log), P0 rate (release notes). If a number in CONVERGENCE.md is wrong, git proves it.


Quick Start

Full guide: QUICK_START.md · Contributing: CONTRIBUTING.md

Install

macOS (Apple Silicon): Download .dmg from Releases → drag to Applications

Prerequisites: Claude Code CLI + AWS Bedrock or Anthropic API key.

Build from Source (5 minutes)

git clone https://github.com/xg-gh-25/SwarmAI.git && cd SwarmAI
cd backend && uv sync && cp .env.example .env   # edit .env with API key
cd ../desktop && npm install && npm run tauri:dev

Requires: Node.js 18+, Python 3.11+, Rust, uv

Codebase Map (170K LOC, don't panic)

| Layer | Size | What | Start here? |
| --- | --- | --- | --- |
| Core | ~11K | Session state machine + context assembly + streaming | Yes: 5 files explain the whole system |
| Extensions | ~90K | 68 skills, 11 hooks, job system, channels, DDD engine | Only what you're working on |
| Frontend | ~67K | React UI, chat interface, workspace explorer | Only if touching UI |
| Tests | ~76K | Full coverage (pytest + Vitest) | Read when modifying core |

New contributor? Read CONTRIBUTING.md: it has the architecture diagram, the 5 core files to read first, and good-first-issue guidance. Skills are the easiest entry point (self-contained, no core knowledge needed).


Stack

Tauri 2.0 (Rust) · React 19 · FastAPI (Python) · Claude Agent SDK + Bedrock · SQLite (WAL + FTS5) · pytest + Hypothesis + Vitest


Contributors

Xiaogang Wang

Creator & Chief Architect
Swarm 🐝

AI Co-Developer (Claude Opus 4.6)
Architecture · Code · Docs · Self-Evolution

License

MIT License


Contributing

Issues and PRs welcome. See CONTRIBUTING.md.


SwarmAI: Human directs. AI delivers.