AGENTS.md -- AReaL Agent Operations Guide

Quick reference

Tech stack: Python 3.12+ | PyTorch | FSDP2 / Megatron / Archon | SGLang / vLLM

# Environment
uv sync --extra cuda            # dependencies (or `uv sync` without CUDA)
source .venv/bin/activate        # activate venv BEFORE pre-commit or git commit if venv exists
pre-commit install               # formatting hooks (Ruff, mdformat, clang-format, nbstripout, autoflake)
pre-commit run --all-files       # lint + format everything

# Tests
uv run pytest tests/test_<topic>.py

# CLI docs
uv run python docs/generate_cli_docs.py

Hard rules -- never violate:

  • No wildcard imports (from x import *).
  • No hardcoded secrets, paths, or endpoints.
  • No skipping pre-commit hooks.
  • No guessing cluster configs or rebuilding CUDA/driver stacks.
  • Integration tests require multi-node hardware -- explain skips explicitly.

Always do:

  • Read relevant files before modifying code.
  • Run pre-commit run --all-files before committing.
  • Follow existing code patterns in the same module.
  • Add tests for new functionality.
  • Use the question tool (structured options) when asking the user for decisions or clarifications.

Ask first before:

  • Modifying config structures in areal/api/cli_args.py.
  • Adding new dependencies.
  • Changing launcher or scheduler logic.
  • Deleting or renaming public APIs.

When unsure, leave a TODO(agent) comment and note the constraint in your response.


Repository map

areal/                     Core Python package
|-- api/                   Config dataclasses, contracts, IO structs
|-- dataset/               Stateful dataset loaders (GSM8K, Geometry3K, CLEVR, ...)
|-- engine/                Training backends (FSDP2, Megatron) + inference adapters
|-- experimental/          Prototype engines/workflows (Archon MoE engine)
|-- infra/                 Launchers (Local/Ray/Slurm), schedulers, utilities
|-- models/                Model adapters (Megatron-Core, Transformers, custom heads)
|-- reward/                Built-in reward functions + math parsers
|-- tests/                 Unit/integration test suites
|-- trainer/               High-level orchestrators (PPOTrainer, SFTTrainer)
|-- utils/                 Cross-cutting helpers (logging, data, checkpoints, RL ops)
+-- workflow/              RolloutWorkflow implementations (RLVR, multi-turn, vision)

docs/                      Jupyter Book docs (https://inclusionai.github.io/AReaL/)
examples/                  Training scripts and launcher recipes

Code style & patterns

  • Composition over inheritance -- keep hierarchies <= 2 levels; prefer delegation.

| Type             | Pattern       | Example                              |
|------------------|---------------|--------------------------------------|
| Config dataclass | XxxConfig     | GRPOConfig, FSDPEngineConfig         |
| Engine class     | XxxEngine     | FSDPEngine, ArchonEngine             |
| Workflow class   | XxxWorkflow   | RLVRWorkflow, MultiTurnWorkflow      |
| Reward function  | xxx_reward_fn | gsm8k_reward_fn, geometry3k_reward_fn |

Logging: areal.utils.logging.getLogger(name) with PascalCase names -- never print or logging.getLogger(__name__). Per-rank format: [{Component} Rank {N}]. Register new loggers with color in areal/utils/logging.py.
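The per-rank format can be sketched with the standard library alone. This is a minimal illustration, not the repository's logger: the helper name get_rank_logger and the adapter class are hypothetical, and stdlib logging stands in for areal.utils.logging so the snippet runs without AReaL installed. In repository code, call areal.utils.logging.getLogger instead.

```python
import logging


class _RankPrefixAdapter(logging.LoggerAdapter):
    """Prefix every message with '[{Component} Rank {N}]'."""

    def process(self, msg, kwargs):
        prefix = f"[{self.extra['component']} Rank {self.extra['rank']}]"
        return f"{prefix} {msg}", kwargs


def get_rank_logger(component: str, rank: int) -> logging.LoggerAdapter:
    # Hypothetical helper: `component` is a PascalCase name such as "FSDPEngine".
    # Stdlib logging is used here only as a stand-in for areal.utils.logging.
    base = logging.getLogger(component)
    return _RankPrefixAdapter(base, {"component": component, "rank": rank})


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = get_rank_logger("FSDPEngine", rank=0)
    log.info("loaded %d parameter shards", 8)
```

Passing the values as logging arguments (rather than formatting an f-string up front) keeps string formatting lazy when the log level is disabled.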

Performance:

  • No GPU-CPU sync in hot paths (.item(), .tolist(), print(tensor)).
  • Batch ops over Python loops on tensor elements.
  • Explicit dtype/device; torch.Size assertions for shape validation.

Typing & imports: explicit type hints; reuse areal/api/cli_args.py dataclasses; no wildcard imports; heavy optional deps inside functions.

Async: rollout workflows must stay non-blocking (await + aiofiles); no sync I/O in arun_episode.
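A minimal sketch of keeping file I/O off the event loop. The guide names aiofiles; to stay dependency-free this example uses asyncio.to_thread from the standard library as a stand-in, which achieves the same goal of not blocking other episodes. The function names (dump_episode, run_episodes) and the JSON payload are hypothetical and are not part of the RolloutWorkflow API.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def _write_json(path: Path, payload: dict) -> None:
    # Plain blocking write; only ever called from a worker thread.
    path.write_text(json.dumps(payload))


async def dump_episode(path: Path, payload: dict) -> None:
    # Offload the blocking write so the event loop can keep
    # scheduling other in-flight episodes.
    await asyncio.to_thread(_write_json, path, payload)


async def run_episodes(out_dir: Path, n: int) -> list[Path]:
    paths = [out_dir / f"episode_{i}.json" for i in range(n)]
    # Episodes are dumped concurrently rather than one after another.
    await asyncio.gather(
        *(dump_episode(p, {"episode": i}) for i, p in enumerate(paths))
    )
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = asyncio.run(run_episodes(Path(tmp), 3))
        print([p.name for p in written])
```

The anti-pattern this avoids is calling open(...).write(...) directly inside an async method, which stalls every coroutine sharing the loop until the write returns.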


Domain experts & skills

Fire the appropriate expert subagent or load a skill based on what you're working on. Experts are read-only consultants with deep domain knowledge; skills are step-by-step implementation guides.

| Working on...                | Fire subagent    | Load skill        |
|------------------------------|------------------|-------------------|
| FSDP engine code             | fsdp-expert      | --                |
| Archon engine / new model    | archon-expert    | add-archon-model  |
| Megatron engine code         | megatron-expert  | --                |
| RL algorithms / PPO / GRPO   | algorithm-expert | --                |
| Launcher / scheduler / infra | launcher-expert  | debug-distributed |
| New reward function          | --               | add-reward        |
| New dataset loader           | --               | add-dataset       |
| New rollout workflow         | --               | add-workflow      |
| Unit tests                   | --               | add-unit-tests    |
| Distributed debugging        | --               | debug-distributed |

How to fire an expert: task(subagent_type="fsdp-expert", load_skills=[], run_in_background=true, prompt="...")


Core concepts

Trainer orchestrator (areal/trainer/, PPOTrainer, SFTTrainer): manages the training loop, dataset loading, and workflow execution. Entry point: examples/math/gsm8k_rl.py.

Rollout workflows (areal/workflow/, RolloutWorkflow.arun_episode): define how episodes are generated. Use the add-workflow skill for a step-by-step guide.

Engines: Inference engines handle async generation via engine.agenerate() and manage weight updates. Training engines consume rollout tensors, compute PPO/GRPO updates, and broadcast weight versions (FSDP2, Megatron, or Archon).

Weight versioning: async workflows require version alignment via WeightUpdateMeta (areal/api/engine_api.py). Critical for correctness across distributed training.
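To make the idea of version alignment concrete, here is a conceptual sketch only. It does not use WeightUpdateMeta, whose fields are not described in this guide; the Rollout dataclass, the split_by_staleness helper, and the max_lag policy are all assumptions introduced for illustration. The point is simply that each sample carries the weight version that produced it, so the trainer can tell which samples are aligned with its current weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Hypothetical rollout record (not an AReaL type)."""

    tokens: list[int]
    weight_version: int  # version of the policy weights that generated it


def split_by_staleness(
    rollouts: list[Rollout], current_version: int, max_lag: int
) -> tuple[list[Rollout], list[Rollout]]:
    """Partition rollouts into (aligned, misaligned) by version lag.

    A rollout is aligned when 0 <= current_version - weight_version <= max_lag.
    A negative lag means the sample claims a version the trainer has not
    reached yet, which is treated as misaligned as well.
    """
    aligned: list[Rollout] = []
    misaligned: list[Rollout] = []
    for r in rollouts:
        lag = current_version - r.weight_version
        (aligned if 0 <= lag <= max_lag else misaligned).append(r)
    return aligned, misaligned


if __name__ == "__main__":
    batch = [Rollout([1, 2], 5), Rollout([3], 3), Rollout([4], 6)]
    ok, bad = split_by_staleness(batch, current_version=5, max_lag=1)
    print(len(ok), len(bad))
```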

Observability: emit metrics via stats_tracker.get(), persist artifacts under dump_dir, checkpoint via areal/utils/saver.py / recover.py.

Launcher / scheduler: training requires cluster setup (local / Ray / Slurm) via configs in areal/infra/launcher/. See launcher-expert for deployment guidance.


API & config rules

Applies to: areal/api/**

  • Field ordering: required -> common optional -> rare optional -> internal (_ prefix).
  • Validation: __post_init__ with ValueError and clear message.
  • Backward compat: add fields with defaults; deprecate before removing; avoid type changes.
  • CLI: use Literal for enum choices; all public configs need docstrings with constraints.
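The four rules above can be sketched in one small dataclass. ExampleOptimizerConfig and its fields are hypothetical and do not exist in areal/api/cli_args.py; the example only shows the field ordering, __post_init__ validation, defaulted fields, and Literal choices in one place.

```python
from dataclasses import dataclass, field
from typing import Literal, get_args

Schedule = Literal["constant", "cosine", "linear"]


@dataclass
class ExampleOptimizerConfig:
    """Hypothetical optimizer settings (illustration only).

    Constraints: lr > 0; schedule is one of the Schedule choices;
    0 <= warmup_ratio < 1.
    """

    # Required
    lr: float
    # Common optional
    schedule: Schedule = "cosine"
    # Rare optional (added with a default, so existing configs keep working)
    warmup_ratio: float = 0.0
    # Internal
    _validated: bool = field(default=False, repr=False)

    def __post_init__(self) -> None:
        if self.lr <= 0:
            raise ValueError(f"lr must be > 0, got {self.lr}")
        # Literal is not enforced at runtime, so check it explicitly.
        if self.schedule not in get_args(Schedule):
            raise ValueError(
                f"schedule must be one of {get_args(Schedule)}, got {self.schedule!r}"
            )
        if not 0.0 <= self.warmup_ratio < 1.0:
            raise ValueError(
                f"warmup_ratio must be in [0, 1), got {self.warmup_ratio}"
            )
        self._validated = True


if __name__ == "__main__":
    print(ExampleOptimizerConfig(lr=1e-3))
```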

Distributed code rules

Applies to: areal/engine/**, areal/experimental/**

  • Never create global process groups at module level; always pass process_group explicitly.
  • dist.get_rank(group) not dist.get_rank() when group matters.
  • DeviceMesh dimensions must match ArchonParallelDims: dp_shard, tp, cp, ep, etp.
  • All-reduce: all ranks must call. Broadcast: explicit src. Barrier: debugging only.

| Issue         | Cause                            | Fix                      |
|---------------|----------------------------------|--------------------------|
| Hang          | Mismatched collective calls      | All ranks call same op   |
| Wrong results | Incorrect ReduceOp               | Check SUM vs MEAN        |
| OOM           | Unsharded tensor on wrong device | Verify DTensor placements |

Debug env vars: TORCH_DISTRIBUTED_DEBUG=DETAIL, NCCL_DEBUG=INFO, CUDA_LAUNCH_BLOCKING=1. See the /debug-distributed skill for a comprehensive guide.


Testing rules

Applies to: **/tests/**, test_*.py

| Marker                                | When                            |
|---------------------------------------|---------------------------------|
| @pytest.mark.slow                     | > 10s (excluded from default CI) |
| @pytest.mark.slow + @pytest.mark.ci   | Slow but must run in CI         |
| @pytest.mark.asyncio                  | Async tests                     |

  • Naming: test_<what>_<condition>_<expected>() with Arrange/Act/Assert.
  • GPU: skip gracefully (@pytest.mark.skipif(not CUDA_AVAILABLE, reason="...")).
  • Distributed mocking: torch.distributed.fake_pg; don't mock FSDP/DTensor internals.
  • Assertions: torch.testing.assert_close() with explicit rtol/atol; prefer tmp_path, monkeypatch.

| Suite       | Command                      | GPU       |
|-------------|------------------------------|-----------|
| Unit        | pytest tests/test_*.py       | No        |
| GRPO        | pytest tests/grpo/           | Yes       |
| FSDP        | pytest tests/test_fsdp_*.py  | Yes       |
| Distributed | pytest tests/torchrun/       | Multi-GPU |
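A small sketch of the naming and Arrange/Act/Assert conventions, kept CPU-only and stdlib-only so it runs anywhere. The reward function exact_match_reward_fn and its two-argument signature are hypothetical (they follow the xxx_reward_fn naming pattern but are not an AReaL API), and math.isclose with explicit tolerances stands in for torch.testing.assert_close.

```python
import math


def exact_match_reward_fn(prediction: str, answer: str) -> float:
    """Hypothetical reward: 1.0 if the stripped strings match, else 0.0."""
    return 1.0 if prediction.strip() == answer.strip() else 0.0


def test_exact_match_reward_fn_whitespace_padded_returns_one():
    # Arrange
    prediction, answer = " 42 ", "42"
    # Act
    reward = exact_match_reward_fn(prediction, answer)
    # Assert (explicit tolerances, mirroring the rtol/atol rule)
    assert math.isclose(reward, 1.0, rel_tol=0.0, abs_tol=1e-9)


def test_exact_match_reward_fn_wrong_answer_returns_zero():
    # Arrange
    prediction, answer = "41", "42"
    # Act
    reward = exact_match_reward_fn(prediction, answer)
    # Assert
    assert math.isclose(reward, 0.0, rel_tol=0.0, abs_tol=1e-9)


if __name__ == "__main__":
    test_exact_match_reward_fn_whitespace_padded_returns_one()
    test_exact_match_reward_fn_wrong_answer_returns_zero()
    print("ok")
```

Each test name reads as what is tested, under which condition, with which expected result, so a failing name in the pytest summary already describes the regression.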

Collaboration & review

  • Branches: kebab-case (feature/multi-turn-metrics, bugfix/fsdp-weight-sync).
  • Commits: Conventional Commits (feat:, fix:, docs:), ~72 char subject, imperative voice. Squash WIP before PR.
  • Pre-merge: full pre-commit stack; doc-only edits need at least mdformat --check.
  • PRs: tie to issue, highlight risk areas, list test commands executed, note skipped suites with reasons.

| Command / Skill    | Purpose                                             |
|--------------------|-----------------------------------------------------|
| /create-pr         | Rebase, squash, create PR                           |
| commit-conventions | Commit message conventions (auto-triggers on commit) |
| /review-pr         | Dynamic agent-allocated PR review                   |
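The commit-subject rules above can be checked mechanically. The following is a rough sketch, not the repository's commit-conventions skill: check_commit_subject is a hypothetical helper, and the list of accepted types beyond feat, fix, and docs is an assumption based on commonly used Conventional Commits types. It does not attempt to verify imperative voice.

```python
import re

# Assumption: feat/fix/docs come from the guide; the other types are
# commonly used Conventional Commits types, not confirmed for this repo.
_TYPES = (
    "feat", "fix", "docs", "refactor", "test",
    "chore", "perf", "ci", "build", "style", "revert",
)
_SUBJECT_RE = re.compile(
    rf"^(?:{'|'.join(_TYPES)})(?:\([\w./-]+\))?!?: \S.*$"
)
MAX_SUBJECT_LEN = 72


def check_commit_subject(subject: str) -> list[str]:
    """Return a list of problems; an empty list means the subject passes."""
    problems: list[str] = []
    if not _SUBJECT_RE.match(subject):
        problems.append("subject must look like '<type>[(scope)]: <summary>'")
    if len(subject) > MAX_SUBJECT_LEN:
        problems.append(
            f"subject is {len(subject)} chars; keep it to ~{MAX_SUBJECT_LEN}"
        )
    return problems


if __name__ == "__main__":
    for s in ("feat: add multi-turn metrics", "Added some stuff"):
        print(repr(s), check_commit_subject(s))
```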

Reference material

  • Docs portal: https://inclusionai.github.io/AReaL/
  • Quickstart: docs/tutorial/quickstart.md
  • Architecture: docs/tutorial/gsm8k_grpo.md
  • Customization: docs/customization/*.md
  • Algorithms: docs/algorithms/*.md
  • Best practices: docs/best_practices/*.md
  • CLI reference: docs/cli_reference.md
  • Agent workflow: docs/customization/agent.md