AGENTS.md -- AReaL Agent Operations Guide

Quick reference

Tech stack: Python 3.12+ | PyTorch | FSDP2 / Megatron / Archon | SGLang / vLLM

# Environment
uv sync --extra cuda            # dependencies (or `uv sync` without CUDA)
source .venv/bin/activate        # activate venv BEFORE pre-commit or git commit if venv exists
pre-commit install               # formatting hooks (Ruff, mdformat, clang-format, nbstripout, autoflake)
pre-commit run --all-files       # lint + format everything

# Tests
uv run pytest tests/test_<topic>.py

# CLI docs
uv run python docs/generate_cli_docs.py

Hard rules -- never violate:

No wildcard imports (from x import *).
No hardcoded secrets, paths, or endpoints.
No skipping pre-commit hooks.
No guessing cluster configs or rebuilding CUDA/driver stacks.
Integration tests require multi-node hardware -- explain skips explicitly.

Always do:

Read relevant files before modifying code.
Run pre-commit run --all-files before committing.
Follow existing code patterns in the same module.
Add tests for new functionality.
Use the question tool (structured options) when asking the user for decisions or clarifications.

Ask first before:

Modifying config structures in areal/api/cli_args.py.
Adding new dependencies.
Changing launcher or scheduler logic.
Deleting or renaming public APIs.

When unsure, leave a TODO(agent) comment and note the constraint in your response.

Repository map

areal/                     Core Python package
|-- api/                   Config dataclasses, contracts, IO structs
|-- dataset/               Stateful dataset loaders (GSM8K, Geometry3K, CLEVR, ...)
|-- engine/                Training backends (FSDP2, Megatron) + inference adapters
|-- experimental/          Prototype engines/workflows (Archon MoE engine)
|-- infra/                 Launchers (Local/Ray/Slurm), schedulers, utilities
|-- models/                Model adapters (Megatron-Core, Transformers, custom heads)
|-- reward/                Built-in reward functions + math parsers
|-- tests/                 Unit/integration test suites
|-- trainer/               High-level orchestrators (PPOTrainer, SFTTrainer)
|-- utils/                 Cross-cutting helpers (logging, data, checkpoints, RL ops)
+-- workflow/              RolloutWorkflow implementations (RLVR, multi-turn, vision)

docs/                      Jupyter Book docs (https://inclusionai.github.io/AReaL/)
examples/                  Training scripts and launcher recipes

Code style & patterns

Composition over inheritance -- keep hierarchies <= 2 levels; prefer delegation.

Type	Pattern	Example
Config dataclass	`XxxConfig`	`GRPOConfig`, `FSDPEngineConfig`
Engine class	`XxxEngine`	`FSDPEngine`, `ArchonEngine`
Workflow class	`XxxWorkflow`	`RLVRWorkflow`, `MultiTurnWorkflow`
Reward function	`xxx_reward_fn`	`gsm8k_reward_fn`, `geometry3k_reward_fn`

Logging: areal.utils.logging.getLogger(name) with PascalCase names -- never print or logging.__name__. Per-rank format: [{Component} Rank {N}]. Register new loggers with color in areal/utils/logging.py.

Performance:

No GPU-CPU sync in hot paths (.item(), .tolist(), print(tensor)).
Batch ops over Python loops on tensor elements.
Explicit dtype/device; torch.Size assertions for shape validation.

Typing & imports: explicit type hints; reuse areal/api/cli_args.py dataclasses; no wildcard imports; heavy optional deps inside functions.

Async: rollout workflows must stay non-blocking (await + aiofiles); no sync I/O in arun_episode.

Domain experts & skills

Fire the appropriate expert subagent or load a skill based on what you're working on. Experts are read-only consultants with deep domain knowledge; skills are step-by-step implementation guides.

Working on...	Fire subagent	Load skill
FSDP engine code	`fsdp-expert`	--
Archon engine / new model	`archon-expert`	`add-archon-model`
Megatron engine code	`megatron-expert`	--
RL algorithms / PPO / GRPO	`algorithm-expert`	--
Launcher / scheduler / infra	`launcher-expert`	`debug-distributed`
New reward function	--	`add-reward`
New dataset loader	--	`add-dataset`
New rollout workflow	--	`add-workflow`
Unit tests	--	`add-unit-tests`
Distributed debugging	--	`debug-distributed`

How to fire an expert: task(subagent_type="fsdp-expert", load_skills=[], run_in_background=true, prompt="...")

Core concepts

Trainer orchestrator (areal/trainer/, PPOTrainer, SFTTrainer): manages the training loop, dataset loading, and workflow execution. Entry point: examples/math/gsm8k_rl.py.

Rollout workflows (areal/workflow/, RolloutWorkflow.arun_episode): define how episodes are generated. Use add-workflow skill for step-by-step guide.

Engines: Inference engines handle async generation via engine.agenerate() and manage weight updates. Training engines consume rollout tensors, compute PPO/GRPO updates, and broadcast weight versions (FSDP2, Megatron, or Archon).

Weight versioning: async workflows require version alignment via WeightUpdateMeta (areal/api/engine_api.py). Critical for correctness across distributed training.

Observability: emit metrics via stats_tracker.get(), persist artifacts under dump_dir, checkpoint via areal/utils/saver.py / recover.py.

Launcher / scheduler: training requires cluster setup (local / Ray / Slurm) via configs in areal/infra/launcher/. See launcher-expert for deployment guidance.

API & config rules

Applies to: areal/api/**

Field ordering: required -> common optional -> rare optional -> internal (_ prefix).
Validation: __post_init__ with ValueError and clear message.
Backward compat: add fields with defaults; deprecate before removing; avoid type changes.
CLI: use Literal for enum choices; all public configs need docstrings with constraints.

Distributed code rules

Applies to: areal/engine/**, areal/experimental/**

Never create global process groups at module level; always pass process_group explicitly.
dist.get_rank(group) not dist.get_rank() when group matters.
DeviceMesh dimensions must match ArchonParallelDims: dp_shard, tp, cp, ep, etp.
All-reduce: all ranks must call. Broadcast: explicit src. Barrier: debugging only.

Issue	Cause	Fix
Hang	Mismatched collective calls	All ranks call same op
Wrong results	Incorrect `ReduceOp`	Check SUM vs MEAN
OOM	Unsharded tensor on wrong device	Verify DTensor placements

Debug env vars: TORCH_DISTRIBUTED_DEBUG=DETAIL, NCCL_DEBUG=INFO, CUDA_LAUNCH_BLOCKING=1. See /debug-distributed skill for comprehensive guide.

Testing rules

Applies to: **/tests/**, test_*.py

Marker	When
`@pytest.mark.slow`	> 10s (excluded from default CI)
`@pytest.mark.slow` + `@pytest.mark.ci`	Slow but must run in CI
`@pytest.mark.asyncio`	Async tests

Naming: test_<what>_<condition>_<expected>() with Arrange/Act/Assert.
GPU: skip gracefully (@pytest.mark.skipif(not CUDA_AVAILABLE, reason="...")).
Distributed mocking: torch.distributed.fake_pg; don't mock FSDP/DTensor internals.
Assertions: torch.testing.assert_close() with explicit rtol/atol; prefer tmp_path, monkeypatch.

Suite	Command	GPU
Unit	`pytest tests/test_*.py`	No
GRPO	`pytest tests/grpo/`	Yes
FSDP	`pytest tests/test_fsdp_*.py`	Yes
Distributed	`pytest tests/torchrun/`	Multi-GPU

Collaboration & review

Branches: kebab-case (feature/multi-turn-metrics, bugfix/fsdp-weight-sync).
Commits: Conventional Commits (feat:, fix:, docs:), ~72 char subject, imperative voice. Squash WIP before PR.
Pre-merge: full pre-commit stack; doc-only edits need at least mdformat --check.
PRs: tie to issue, highlight risk areas, list test commands executed, note skipped suites with reasons.

Command / Skill	Purpose
`/create-pr`	Rebase, squash, create PR
`commit-conventions`	Commit message conventions (auto-triggers on commit)
`/review-pr`	Dynamic agent-allocated PR review

Reference material

Docs portal: https://inclusionai.github.io/AReaL/
Quickstart: docs/tutorial/quickstart.md
Architecture: docs/tutorial/gsm8k_grpo.md
Customization: docs/customization/*.md
Algorithms: docs/algorithms/*.md
Best practices: docs/best_practices/*.md
CLI reference: docs/cli_reference.md
Agent workflow: docs/customization/agent.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AGENTS.md -- AReaL Agent Operations Guide

Quick reference

Repository map

Code style & patterns

Domain experts & skills

Core concepts

API & config rules

Distributed code rules

Testing rules

Collaboration & review

Reference material

FilesExpand file tree

AGENTS.md

Latest commit

History

AGENTS.md

File metadata and controls

AGENTS.md -- AReaL Agent Operations Guide

Quick reference

Repository map

Code style & patterns

Domain experts & skills

Core concepts

API & config rules

Distributed code rules

Testing rules

Collaboration & review

Reference material