Your fast-reference for building, debugging, and extending the MUXI Runtime without losing critical context.
This repository is part of the larger MUXI ecosystem.
Complete architectural overview: See ARCHITECTURE.md in the main MUXI repo.
Deep codebase mental model: See mental-model.md for detailed architecture, data flows, module internals, and agent context.
This repo (runtime): The formation execution environment, a FastAPI-based Python runtime packaged as a SIF container.
- Use sub-agents: `file-analyzer` for reading files, `code-analyzer` for code search/analysis, `test-runner` for any tests.
- Maintain required LLM config: formations must declare `llm.models` with a `text` entry (e.g., `openai/gpt-4o-mini`). The system aborts if it is missing.
- Respect the formation loading order: Observability → LLM configuration → Memory systems → Document processing → Background services → Agents.
- Follow the absolute rules: no partial implementations, no dead code, no duplication, consistent naming, no over-engineering, keep concerns separated, prevent resource leaks, and study existing code before writing new logic.
- Treat tests seriously: no mocks, prefer real integrations, single-focus assertions, deterministic data, and run the right test tier (unit/integration/e2e). Every e2e run uses the scripted runner.
- Keep workflow hygiene: plan carefully, review and simplify your plan before execution, optimize for the smallest viable change, strip whitespace on blank lines, prefer `ast-grep` over raw regex, and use `trash` instead of `rm`.
- Start CodeRabbit early (`coderabbit --prompt-only`), watch its feedback, and resolve every flagged issue before moving on.
- Honor the reflection protocol after complex engagements: offer to update CLAUDE.md when the task warrants it.
- Never use emojis.
- Always obtain the current date and time from the system using the `date` command before writing any frontmatter or using dates in documents.
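
The required-`text` rule for `llm.models` can be sketched as a small validation helper. This is only an illustration under stated assumptions, not the runtime's actual loader: the function name `resolve_llm_models` is hypothetical, and it assumes `llm.models` is a list of single-key mappings and that optional `vision` and `audio` capabilities fall back to the `text` model.

```python
from typing import Any


def resolve_llm_models(formation: dict[str, Any]) -> dict[str, str]:
    """Return a capability -> model mapping, failing fast when `text` is missing.

    Hypothetical helper: assumes `llm.models` looks like
    [{"text": "openai/gpt-4o-mini"}, {"vision": "..."}].
    """
    entries = (formation.get("llm") or {}).get("models") or []
    models: dict[str, str] = {}
    for entry in entries:
        models.update(entry)

    if "text" not in models:
        # No default model exists, so abort instead of guessing one.
        raise ValueError("Missing required LLM capability 'text'")

    # Assumed behavior: optional capabilities fall back to the text model.
    for capability in ("vision", "audio"):
        models.setdefault(capability, models["text"])
    return models
```

Calling it with a formation that declares only a `text` model returns that model for all three capabilities; a formation without `text` raises immediately, which matches the fail-fast stance for critical configuration.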
- Formation-first: YAML formations describe the entire AI system; runtime turns them into a live orchestration.
- System flow:

      Formation Engine → Overlord Orchestrator → Agent Pool
             ↓                     ↓                 ↓
         Validation           Coordination       Execution
             ↓                     ↓                 ↓
       Memory Systems  ←   Services Layer  →  Tool Integration

- Critical patterns:
- Provider-agnostic LLM abstraction via OneLLM.
- SOP-driven orchestration for complex workflows.
- Three-tier memory: Buffer (FIFO + vector) → Persistent (DB) → Vector (FAISSx).
- Unified services: MCP, agent-to-agent, multimodal, scheduler, observability.
- Memobase partitions guarantee multi-user isolation.
runtime/
├── src/muxi/runtime/ # Core runtime package (shares muxi namespace with SDK)
│ ├── formation/ # Formation engine, overlord, agents, workflows
│ ├── services/ # Memory, MCP, observability, scheduler
│ └── datatypes/ # Type definitions
├── tests/ # Unit + integration tests
├── e2e/ # End-to-end tests (200+ across 12 areas)
├── contributing/ # Contributor documentation
└── migrations/ # Database migrations
- Language & style: target Python 3.10+, adopt async I/O where throughput improves, format with Black (line length 100) and isort (`profile=black`), lint with Ruff & Flake8 (line length 120), keep naming snake_case/PascalCase as appropriate.
- Testing discipline: map fast logic to `tests/unit`, service seams to `tests/integration`, end-to-end flows to `e2e/tests`; rely on deterministic data or fixtures from `tests/fixtures`; structure tests with arrange/act/assert and meaningful failure output.
- Workflow basics: branch naming `feature/<topic>` or `fix/<issue>`, commits are descriptive, PRs stay small with rationale, logs/screenshots, links, and pytest evidence.
- Code review loop: keep CodeRabbit running, revisit its output after significant edits, and iterate until the report is clean.
- File operations: delegate file reads to the file-analyzer agent for summarized context.
- Code analysis: use the code-analyzer agent for tracing logic, vulnerability hunts, or repo-wide searches.
- Testing: pipe every test run through the test-runner agent using the provided script. Ensure every test reports a summary, and for e2e runs append the exact block below after the logs:

      ========================================
      ### Test Result:
      🎉 SUCCESS: ...
      ✓ ...
      ✓ ...
      ✓ ...
      ========================================
      ### Chat transcript:
      User: ...
      System: ...
      User: ...
      System: ...

- LLM configuration:

      llm:
        models:
          - text: "openai/gpt-4o-mini"  # REQUIRED
          - vision: "..."               # optional, falls back to text
          - audio: "..."                # optional, falls back to text

  No default model exists; missing `text` fails fast.
- Formation load order: Observability → LLM configuration → Memory systems → Document processing → Background services → Agents.
- Error handling: fail fast on critical config, log-and-continue for optional features, degrade gracefully on external outages, surface user-friendly feedback via the resilience layer.
- Multilingual stance: rely on LLM intent detection instead of regex; avoid language-specific heuristics and hardcoded command strings. Example: replace `re.match(r'^(help|assist)'...)` checks with LLM-driven intent detection so guidance works in any language.
- SOP execution pipeline: `user_request → SOP search (FAISS) → Pass SOP to decomposer → Execute workflow`. SOP search is semantic; full SOP content feeds the decomposer; execution mode varies by SOP type (template vs guide).
- Intent-based routing:

      async def chat(self, message: str, user_id: str):
          intent = await self.intent_detector.analyze(message)
          sops = await self.sop_coordinator.search(message)
          if sops:
              agents = self.select_agents_for_sop(sops[0])
          else:
              agent = self.select_agent(intent)
          # Simplified sketch: the lines below show the single-agent path only.
          context = await self.memory.get_context(user_id)
          response = await agent.process(message, context)

- Memory tiers:
- Working memory: always on, size-configurable.
- Buffer memory: FIFO plus vector recall for recent context.
- Persistent memory: PostgreSQL/SQLite backing.
- Vector memory: FAISSx for semantic retrieval.
- Multi-user isolation: enforced through Memobase partitioning.
- ID hierarchy:

      user_id (user isolation)
      └── session_id (chat grouping)
          └── request_id (single interaction with all clarifications)

  - `user_id`: top-level isolation, lowercase, "0" in single-user mode.
  - `session_id`: groups related requests into a conversation, scopes buffer memory filtering.
  - `request_id`: tracks ONE complete interaction including all clarifications; used as key for `clarification:{request_id}`.
- Clarification coordination (intentional two-level lookup):
  - Overlord: `_pending_clarification[session_id]` → returns `request_id`
  - UnifiedClarificationSystem: `clarification:{request_id}` → clarification state
  - ⚠️ DO NOT attempt to "fix" this two-level lookup; it is intentional and correct.
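
The two-level clarification lookup can be pictured with two plain dictionaries. The sketch below is a hypothetical illustration, not the Overlord or UnifiedClarificationSystem code: the module-level stores, the function names, and the shape of the clarification state are assumptions made for the example.

```python
from typing import Any, Optional

# Level 1 (Overlord side): session_id -> request_id awaiting clarification.
pending_clarification: dict[str, str] = {}
# Level 2 (clarification system side): "clarification:{request_id}" -> state.
clarification_store: dict[str, dict[str, Any]] = {}


def start_clarification(session_id: str, request_id: str, question: str) -> None:
    """Record that `request_id` in this session is waiting for an answer."""
    pending_clarification[session_id] = request_id
    clarification_store[f"clarification:{request_id}"] = {
        "question": question,
        "answers": [],
    }


def find_clarification(session_id: str) -> Optional[dict[str, Any]]:
    """Resolve session -> request, then request -> clarification state."""
    request_id = pending_clarification.get(session_id)
    if request_id is None:
        return None
    return clarification_store.get(f"clarification:{request_id}")
```

Keeping the state keyed by `request_id` means every clarification round for one interaction lands under the same key, while the session-level map only answers which request is currently waiting.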
- Event types: 349 typed events across 5 categories (SystemEvents, ConversationEvents, ServerEvents, APIEvents, ErrorEvents) covering complete request lifecycle.
- Validation requirement: 100% validation mandatory; run `python3 scripts/validate_events.py` before committing any `observe()` changes.
- Event naming conventions:
  - Past tense for completion: `_COMPLETED`, `_FAILED`, `_SELECTED`
  - Present tense for progress: `_PROCESSING`, `_STARTED`, `_PLANNING`
  - Component prefix: `OVERLORD_*`, `AGENT_*`, `WORKFLOW_*`, `MEMORY_*`, `MCP_*`
- Adding events: prefer reusing existing events with enhanced metadata over creating new types; add new events only when semantically distinct.
- Metadata enhancement: enrich events with performance metrics, quality scores, and diagnostic fields.
- Event lifecycle: emit `_STARTED` for initiation, `_COMPLETED` for success, `_FAILED` for errors; this ensures complete traceability.
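
The naming and lifecycle conventions above lend themselves to a mechanical check. The following is a minimal sketch and not the logic of `scripts/validate_events.py`: the `ExampleEvents` enum, its members, and both helper functions are hypothetical names chosen for illustration.

```python
from enum import Enum

COMPONENT_PREFIXES = ("OVERLORD_", "AGENT_", "WORKFLOW_", "MEMORY_", "MCP_")
STATE_SUFFIXES = (
    "_COMPLETED", "_FAILED", "_SELECTED",    # completion
    "_PROCESSING", "_STARTED", "_PLANNING",  # progress
)
LIFECYCLE_SUFFIXES = ("_STARTED", "_COMPLETED", "_FAILED")


class ExampleEvents(str, Enum):
    """Hypothetical events; the real definitions live in datatypes/observability.py."""

    WORKFLOW_STEP_STARTED = "workflow.step.started"
    WORKFLOW_STEP_COMPLETED = "workflow.step.completed"
    AGENT_SELECTED = "agent.selected"


def follows_convention(name: str) -> bool:
    """True when the name has a component prefix and a state suffix."""
    return name.startswith(COMPONENT_PREFIXES) and name.endswith(STATE_SUFFIXES)


def missing_lifecycle_events(names: set[str], operation: str) -> list[str]:
    """List the lifecycle events an operation still lacks."""
    return [
        operation + suffix
        for suffix in LIFECYCLE_SUFFIXES
        if operation + suffix not in names
    ]
```

With the example enum, `missing_lifecycle_events` reports that `WORKFLOW_STEP` has no `_FAILED` event, which is the kind of traceability gap the lifecycle rule is meant to prevent.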
- Use real services (OpenAI, Anthropic, live MCP, actual embeddings); mocks are disallowed.
- Tests should spotlight the targeted feature and succeed when that feature works—even if ancillary services are missing.
- Design tests to expose real defects with verbose diagnostics; never submit cheater tests.
- Feature-day orientation: Days 1-3 foundation/memory/multimodal, Days 4-6 MCP/file generation/knowledge, Days 7-12 advanced workflow & resilience.
- Test Structure: All e2e tests in `e2e/tests/[area_number]_[area_name]/`; 12 areas covering foundation → scheduling.
- Three Test Patterns:
- Pattern 1 (Runtime Modification): Modifies formation at runtime—suitable for behavior tests.
- Pattern 2 (Shared Directory): Uses shared formation directory—suitable for tests with common config.
- Pattern 3 (Separate Formations): Each test has isolated formation—suitable for complex/specialized tests.
- Pattern Selection: Choose based on test requirements, not standardization; standalone scripts > complex abstractions for timing-sensitive tests (e.g., scheduler).
- Async Cleanup: Always use async cleanup utilities for tests creating fire-and-forget tasks; prevents RecursionError spam.
- Formation Setup: Co-locate formations with tests; use symlinks for `.key` and `secrets.enc` files.
- Service Dependencies: Document required services (PostgreSQL, FAISSx, webhook server); tests may require specific ports/configs.
- Simplicity Principle: Don't over-abstract—complex base classes can introduce bugs; working tests > standardized tests.
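
The async cleanup rule above amounts to cancelling whatever fire-and-forget tasks are still pending before the event loop closes. The sketch below shows one way to do that with the standard `asyncio` API; `cancel_pending_tasks` is a hypothetical name, and the repository's own cleanup utilities may work differently.

```python
import asyncio


async def cancel_pending_tasks() -> int:
    """Cancel every unfinished task except the caller and wait for them to stop."""
    current = asyncio.current_task()
    pending = [
        task for task in asyncio.all_tasks()
        if task is not current and not task.done()
    ]
    for task in pending:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising it here.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def main() -> int:
    async def background_job() -> None:
        await asyncio.sleep(3600)  # stands in for a fire-and-forget task

    asyncio.create_task(background_job())
    await asyncio.sleep(0)  # give the background task a chance to start
    return await cancel_pending_tasks()
```

Running `asyncio.run(main())` cancels the single background job before the loop shuts down, so nothing is left half-finished when the test process exits.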
- Missing required LLM capability 'text': ensure formation includes a `text` model under `llm.models`.
- Intent detection failing: verify formation LLM entry, credentials, and model capability coverage.
- Workflow not triggering: confirm `auto_decomposition: true`, validate complexity threshold (default 7.0), and ensure no agent override is forcing a bypass.
- Event validation failing: run `python3 scripts/validate_events.py` to identify missing event types; all `observe()` calls must use enum-defined events.
- E2E test 'str' object is not callable: check if base class abstraction interferes; consider standalone pattern without base class.
- RecursionError spam in tests: missing async cleanup; use the `ensure_async_cleanup()` utility from test helpers.
- Formation not loading in tests: verify symlinks to `.key` and `secrets.enc` use correct relative paths.
- Scheduler tests timing out/failing: avoid RUNTIME pattern; use standalone scripts with direct formation loading for precise timing control.
- 'list' object has no attribute 'get' during config load: knowledge config format mismatch. Use dict format (`knowledge: {enabled: true, sources: [...]}`) or remove the field entirely if not needed. Empty list `knowledge: []` is also valid.
- Docker AMD64 image too large: PyTorch defaults to CUDA on AMD64 (4GB+ NVIDIA libs). Fix: Install CPU-only PyTorch first via `--index-url https://download.pytorch.org/whl/cpu`.
- SIF build fails with "no space left": Add disk cleanup step before Apptainer install in CI workflow.
- GitHub release upload fails (>2GB): Check Docker image size—likely CUDA bloat. SIF files must be under 2GB.
- Artifact generation returns 0 artifacts (no error logged): The `_invoke_tool` catch-all `except Exception` silently converts failures to `{"error": ..., "status": "error"}` dicts that lack `_artifact`, so the artifact extractor skips them. Check server logs for the "using artifact service" event; if "Executed planned step" appears at the same millisecond, the subprocess failed instantly.
- Artifact temp directory vanishes between runs: macOS (and some container runtimes) clean `/var/folders` temp dirs. The `_create_execution_dir` method now re-creates the base dir on every call; do not rely on the `__init__` mkdir alone.
- Artifact sandbox `ulimit -v` kills matplotlib/numpy on Linux: `ulimit -v` limits virtual address space (not RSS). matplotlib + numpy + scipy map many shared libraries that easily exceed 512MB virtual. Current limit is 2GB. If you see subprocess hangs that resolve exactly at the timeout, check the memory limit first.
- SIF chart generation times out but works in Docker: Likely the `ulimit -v` virtual memory limit (Linux-only code path in `artifact_service.py`). Test by running the artifact service directly inside the SIF: `singularity exec ... python -c "from muxi.runtime.formation.artifacts.artifact_service import ArtifactService; ..."`.
- `docker system prune -af` breaks SIF deployments: This removes the runtime-runner image which contains `/opt/muxi-tools` (node, npx, git, curl, fonts). Rebuild with `docker build -t ghcr.io/muxi-ai/runtime-runner:latest .` from the runtime-runner repo.
- LLM ignores generate_file constraints: Claude Haiku may ignore soft phrasing like "Do NOT import requests". Use stronger language: "NEVER", "WILL BE REJECTED", "offline sandbox". Even so, LLMs will generate networking code when the task implies fetching data; allow networking imports rather than fighting the pattern.
- generate_file uses fpdf and fails on Unicode: fpdf's default font (Helvetica) cannot encode non-ASCII characters. The constraint prompt now says "prefer reportlab over fpdf". If fpdf is used, the code must strip or replace non-ASCII characters.
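
The "Artifact generation returns 0 artifacts" entry above is easier to reason about with a toy reproduction. The sketch below is hypothetical and only mirrors the behavior described there: `invoke_tool` and `extract_artifacts` are stand-in names, not the runtime's `_invoke_tool` or its extractor.

```python
from typing import Any, Callable


def invoke_tool(tool: Callable[[], dict[str, Any]]) -> dict[str, Any]:
    """Mimic a catch-all wrapper: any failure becomes a plain error dict."""
    try:
        return tool()
    except Exception as exc:  # deliberately broad, to mirror the described behavior
        return {"error": str(exc), "status": "error"}


def extract_artifacts(results: list[Any]) -> list[Any]:
    """Keep only results that are dicts carrying an `_artifact` key."""
    return [
        result["_artifact"]
        for result in results
        if isinstance(result, dict) and "_artifact" in result
    ]


def good_tool() -> dict[str, Any]:
    return {"status": "ok", "_artifact": {"filename": "chart.png"}}


def failing_tool() -> dict[str, Any]:
    raise RuntimeError("subprocess exited with code 1")


results = [invoke_tool(good_tool), invoke_tool(failing_tool)]
artifacts = extract_artifacts(results)
```

Here `artifacts` holds only the chart entry. The failed call is still present in `results` as an error dict, but it never reaches the extractor's output, which is why the symptom is a missing artifact rather than a visible exception.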
- Adding services: implement in `src/muxi/runtime/services/`, wire into formation loading, register with the overlord, and update schemas when configuration is exposed.
- Orchestration edits: modify `overlord.py`, sync workflow integrations, preserve SOP compatibility, test with real formations.
- Memory updates: touch the relevant tier, maintain partitioning, validate Memobase behavior, and confirm extraction paths remain intact.
- E2E test development:
- Review existing tests in the area to understand patterns.
- Choose appropriate pattern (runtime/shared/separate) based on test needs.
- Use symlinks for `.key` and `secrets.enc` files with correct relative paths.
- Include async cleanup for proper resource management (fire-and-forget tasks).
- Test with actual services—no mocks.
- Verify formation path resolution works correctly.
- Prefer simplicity: standalone script > complex base class if test is timing-sensitive.
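
For the symlink step in the checklist above, the relative target must be computed from the directory that holds the link, not from the current working directory. The helper below is a hypothetical sketch using only the standard library; the function name and the directory layout in the usage note are assumptions.

```python
import os
from pathlib import Path


def link_secret_files(shared_dir: Path, formation_dir: Path) -> list[Path]:
    """Symlink `.key` and `secrets.enc` from shared_dir into formation_dir.

    The link target is stored as a path relative to formation_dir, so the
    links keep working when the repository is checked out elsewhere.
    """
    links: list[Path] = []
    for name in (".key", "secrets.enc"):
        target = shared_dir / name
        link = formation_dir / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace stale links or copies
        relative_target = os.path.relpath(target, start=formation_dir)
        link.symlink_to(relative_target)
        links.append(link)
    return links
```

For example, linking from a shared directory three levels up yields a target such as `../../../shared/.key`, which resolves correctly from the formation directory regardless of where the test is launched.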
- `src/muxi/runtime/formation/formation.py`: formation lifecycle management.
- `src/muxi/runtime/formation/overlord/overlord.py`: central orchestration logic.
- `src/muxi/runtime/formation/workflow/`: SOP execution pipeline.
- `src/muxi/runtime/formation/resilience/`: error recovery and user messaging.
- `src/muxi/runtime/services/`: runtime services catalog.
- `src/muxi/runtime/datatypes/observability.py`: event type definitions.
- `scripts/validate_events.py`: event validation utility.
- `e2e/tests/`: 12 test areas covering all runtime functionality.
- Formation schema: see agentformation.org.
- `generate_file` is a built-in tool, not an MCP. Registered in `agent.py`, routes to `artifact_service.py`.
- Architecture: LLM generates Python code via `_infer_tool_parameters()` → code validated via AST (`_validate_code`) → executed via `subprocess.run` in an isolated temp directory → output files tracked and wrapped as `MuxiArtifact`.
- Allowed imports are whitelisted in `ALLOWED_IMPORTS` (`artifact_service.py`). Includes data science, visualization, document generation, and networking libraries.
- Constraints are injected into the `_infer_tool_parameters` system prompt for `generate_file` calls: library preferences, backend settings, encoding notes.
- Sandbox limits: `MAX_EXECUTION_TIME = 60s`, `MAX_MEMORY_MB = 2048` (virtual, via `ulimit -v` on Linux only).
- Artifact extraction: After planning execution, `my_results` is scanned for dicts containing the `_artifact` key. Only results from `generate_file` with a successful `MuxiArtifact` will be extracted. Errors are caught silently; check logs for "using artifact service" followed immediately by "Executed planned step" (same timestamp = instant failure).
- SIF considerations: The subprocess inherits the Singularity environment. Ensure `/tmp` is bind-mounted and writable. The `ulimit -v` code path only runs on Linux; macOS uses direct `subprocess.run` without memory limits.
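
The AST validation step in the architecture above can be illustrated with a small import check. This is a sketch of the general technique only, not the runtime's `_validate_code`: the function name, the example allow-list, and the decision to compare top-level package names are all assumptions.

```python
import ast

# Hypothetical subset; the real list is ALLOWED_IMPORTS in artifact_service.py.
ALLOWED_IMPORTS_EXAMPLE = {"json", "math", "matplotlib", "numpy", "reportlab"}


def find_disallowed_imports(code: str, allowed: set[str]) -> list[str]:
    """Return top-level package names imported by `code` that are not allowed."""
    tree = ast.parse(code)  # raises SyntaxError for invalid source
    disallowed: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no module name; treat them as disallowed.
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            root = module.split(".")[0]
            if root not in allowed and root not in disallowed:
                disallowed.append(root)
    return disallowed
```

A static check like this only sees `import` statements. Dynamic imports through `__import__` or `importlib` would need separate handling, so it complements the subprocess sandbox rather than replacing it.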
Implementation of the open Agent Skills specification:
- Skills directory: `formation/skills/{skill-name}/SKILL.md`
- Progressive disclosure: metadata at startup, full content on activation
- Executor container for script execution (ZeroMQ-based)
Group-based permission filtering via muxi-enterprise package:
- YAML-based group definitions with inheritance
- Agent/MCP filtering by group membership
- Runtime patching (zero changes to OSS code)
- Validate declared model capabilities vs assigned responsibilities.
- Improve performance via model instance caching.
- Build richer fallback chains for capability gaps.
- Ship tooling for configuration migrations.
- Feedback is expected: challenge assumptions, flag better approaches, call out missing conventions.
- Communicate succinctly; offer plans when detail is needed, otherwise keep responses tight.
- Skip flattery; stick to factual, skeptical, collaborative dialogue.
- Ask clarifying questions instead of guessing intent.
- No partial implementations or "temporary" simplifications.
- Eliminate dead code; reuse existing helpers/constants before introducing new ones.
- Preserve naming consistency across modules, classes, functions, and variables.
- Avoid over-engineering—choose the simplest workable solution.
- Keep concerns separated; no mixing validation, persistence, and presentation layers.
- Prevent resource leaks: close DB connections, cancel timers, remove listeners.
- Read existing code paths to understand conventions before modifying or extending.
- After any multi-step or feedback-heavy task, ask: “Would you like me to reflect on our interaction and suggest potential improvements to the active CLAUDE.md file?”
- If the user declines or stays silent, conclude normally.
- If the user agrees:
- Review relevant feedback and compare it with the active CLAUDE.md guidance.
- List the in-scope CLAUDE.md files (global + workspace) influencing the task.
- Propose concrete, actionable edits (use `replace_in_file` blocks when feasible).
- Await approval before applying adjustments, then return to task completion flow.
- Secrets live beside formation YAMLs in encrypted `secrets.enc` files (with a `secrets` template showing required keys); avoid environment variables for runtime config.
- Unit tests: `python -m pytest tests/unit/ -v` (standard pytest).
- E2E tests: Each e2e test is a standalone script, not a pytest test. Do NOT run them with `pytest`. Use the dedicated runners from the `e2e/` directory:
  - Full suite: `cd e2e && python run_all_tests.py` runs every test sequentially with per-area timeouts, early-kill on success, and crash-signal retries. Saves report to `e2e/results/test_report.json`.
  - Random sample (sniff-test): `cd e2e && python run_random_tests.py N` picks N random tests from the full pool and runs them. Defaults to 10. Saves report to `e2e/results/random_test_report.json`.
  - Single test: `cd e2e/tests/<area> && python test_<name>.py` runs one test directly as a script.
- Both runners handle `PYTHONPATH`, timeouts, early-kill after SUCCESS markers, and retry on SIGSEGV/SIGABRT crashes.
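
The crash-signal retry mentioned above relies on how `subprocess` reports a child killed by a signal: on POSIX the return code is the negated signal number. The sketch below illustrates that idea only; it is not the code of `run_all_tests.py`, and `crashed_by_signal`, `run_with_retry`, and the retry count are hypothetical.

```python
import signal
import subprocess
import sys

CRASH_SIGNALS = {int(signal.SIGSEGV), int(signal.SIGABRT)}


def crashed_by_signal(returncode: int) -> bool:
    """On POSIX, a child killed by signal N exits with return code -N."""
    return returncode < 0 and -returncode in CRASH_SIGNALS


def run_with_retry(cmd: list[str], timeout: float, max_retries: int = 1) -> int:
    """Run cmd, retrying only when it died from SIGSEGV or SIGABRT."""
    returncode = 0
    for _attempt in range(max_retries + 1):
        # subprocess.run raises TimeoutExpired if the timeout elapses.
        returncode = subprocess.run(cmd, timeout=timeout).returncode
        if not crashed_by_signal(returncode):
            break
    return returncode
```

A call such as `run_with_retry([sys.executable, "test_example.py"], timeout=300)` (the script name is a placeholder) would return an ordinary test failure unchanged and rerun the script only after a crash signal.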