Triton-agent / Apeinx OpEvolver
A lightweight operator self-optimization agent for low-cost, high-quality Triton kernel development.
Triton-agent = operator registry + operator contract + PyTorch correctness baseline
+ Triton candidate-variant generation + compilation + verification + performance profiling
+ MicroRL/Bandit tuning + promotion/rollback + Evidence Replay
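The pipeline above can be sketched as a single loop. This is illustrative pseudocode only: every helper here (`compile_kernel`, `verify`, `profile`, etc.) is a hypothetical stand-in, not the actual `triton_agent` API.

```python
import random

random.seed(0)  # deterministic for the example

# Hypothetical stand-ins for the real pipeline stages; the actual
# compiler / verifier / profiler / reward modules do the real work.

def compile_kernel(cand):
    # Pretend very large blocks fail to compile.
    return cand if cand["BLOCK"] <= 256 else None

def verify(kernel):
    # Stand-in for the numerical check against the PyTorch baseline.
    return True

def profile(kernel):
    # Stand-in for measured p50 latency (lower is better).
    return 1.0 / kernel["BLOCK"] + random.random() * 1e-3

def optimize(search_space, max_candidates=8):
    best = None
    for _ in range(max_candidates):
        cand = {"BLOCK": random.choice(search_space)}   # candidate generation
        kernel = compile_kernel(cand)
        if kernel is None or not verify(kernel):
            continue                                    # repair / skip on failure
        reward = -profile(kernel)                       # reward = -latency
        if best is None or reward > best[0]:
            best = (reward, cand)
    return best  # the promotion gate then decides promote vs rollback

best = optimize([32, 64, 128, 256, 512], max_candidates=20)
```

In the real agent the reward feeds back into the MicroRL/Bandit tuner, and every episode is persisted for Evidence Replay.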
```bash
# 1. Create virtual environment
python -m venv venv
source venv/bin/activate   # Linux / WSL
# venv\Scripts\activate    # Windows

# 2. Install core (no GPU needed for development)
pip install -e ".[dev]"

# 3. Install with GPU support (Linux/WSL with NVIDIA GPU only)
pip install -e ".[gpu]"

# 4. Run unit tests to verify install
pytest test/ -q
```
| Command | Description |
|---|---|
| `triton-agent init <op>` | Register an operator from a `contract.yaml` directory |
| `triton-agent optimize <op> --shape B=8,T=2048,D=4096 --dtype fp16` | Run the full optimization pipeline |
| `triton-agent verify-correctness <op> --shape B=8,T=2048,D=4096` | Check kernel output against the PyTorch reference |
| Command | Description |
|---|---|
| `triton-agent leaderboard <op>` | View best configs for an operator |
| `triton-agent replay <episode_path>` | Replay a past optimization episode |
| `triton-agent compare <op> --baseline torch --variant best` | Compare the best variant vs the baseline |
| `triton-agent cross-compare [ops...]` | Compare best configs across multiple operators |
| Command | Description |
|---|---|
| `triton-agent experiment run --level smoke` | 13 ops, 1 shape each, correctness + latency |
| `triton-agent experiment run --level core` | 6 ops x 3 shapes x 4 strategies |
| `triton-agent experiment run --level ablation` | 3 ops x 5 shapes x 7 strategies x 3 seeds |
| `triton-agent experiment run --level replay` | Cold start vs warm start reproducibility |
| `triton-agent experiment ablation --name all` | 5 targeted ablations (strategy / profile / reward / replay) |
| `triton-agent experiment report reports/<file>.json` | Generate a Markdown report from benchmark JSON |
| Flag | Values | Default |
|---|---|---|
| `--strategy` | `grid`, `best_of_n`, `ucb`, `thompson`, `epsilon`, `reinforce`, `grpo` | `auto` (based on search space size) |
| `--max-candidates` | 1–1024 | 64 |
| `--retry` | 0–10 | 3 |
| `--shape` | e.g. `B=8,T=2048,D=4096` | `B=8,T=2048,D=4096` |
| `--dtype` | `fp16`, `bf16`, `fp32` | `fp16` |
| `--device` | `cuda` | `cuda` |
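The `--shape` flag takes comma-separated `NAME=INT` pairs. A minimal parser for that format might look like the following (illustrative sketch; the actual CLI parsing in `cli.py` may differ):

```python
def parse_shape(spec: str) -> dict[str, int]:
    """Parse a --shape value like 'B=8,T=2048,D=4096' into a dict."""
    shape = {}
    for pair in spec.split(","):
        name, _, value = pair.partition("=")
        if not name or not value.isdigit():
            raise ValueError(f"bad shape entry: {pair!r}")
        shape[name.strip()] = int(value)
    return shape

dims = parse_shape("B=8,T=2048,D=4096")  # {'B': 8, 'T': 2048, 'D': 4096}
```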
Run these on a Linux machine with an NVIDIA GPU.
Level 1 — Smoke (correctness gate)

```bash
triton-agent experiment run --level smoke
triton-agent experiment report reports/smoke_benchmark_*.json
```

Verifies that all 13 operators pass correctness checks. Output: `reports/smoke_benchmark.md`.
Level 2 — Core (performance)

```bash
triton-agent experiment run --level core
triton-agent experiment report reports/core_benchmark_*.json
```

6 core operators x 3 shapes x 4 strategies. Output: `reports/core_benchmark.md`.
Level 3 — Ablation (evidence)

```bash
triton-agent experiment run --level ablation
triton-agent experiment report reports/ablation_benchmark_*.json

# Or targeted ablations:
triton-agent experiment ablation --name strategy   # Which strategy needs fewest trials?
triton-agent experiment ablation --name profile    # Correctness-only vs +latency vs full
triton-agent experiment ablation --name reward     # Reward function component ablation
triton-agent experiment ablation --name replay     # Cold start vs warm start vs transfer
```

Proves: strategy efficiency, MicroRL value, profile-guided selection, reward design, and reproducibility.
| Operator | Description |
|---|---|
| `rmsnorm` | Root Mean Square Layer Normalization |
| `rope` | Rotary Position Embedding |
| `fused_bias_gelu` | Fused bias + GELU activation |
| Operator | Description |
|---|---|
| `swiglu` | SiLU-gated linear unit |
| `quant_dequant` | Quantize-dequantize (INT8 simulation) |
| `layernorm` | Standard Layer Normalization |
| Operator | Description |
|---|---|
| `kv_append` | KV cache token append |
| `rope_kv_append` | Fused RoPE + KV cache write |
| `matmul_epilogue` | Matmul + bias + GELU fusion |
P3 — High-difficulty kernels

| Operator | Description | Note |
|---|---|---|
| `quant_matmul` | INT8 quantized matrix multiplication | |
| `paged_kv` | Paged KV cache (vLLM-style block tables) | |
| `paged_attention` | PagedAttention decode kernel | experimental |
| `flash_attn_like` | FlashAttention-like tiled attention | experimental |
| Strategy | Type | When to use |
|---|---|---|
| `grid` | Exhaustive | Search space < 64 combos |
| `best_of_n` | Random sampling | Quick exploration |
| `ucb` | Bandit | Balanced explore/exploit |
| `thompson` | Bandit | Noisy reward environments |
| `epsilon` | Bandit | Simple adaptive baseline |
| `reinforce` | Policy gradient | Medium/large search spaces |
| `grpo` | Policy gradient | Large search spaces, group-relative advantage |
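As an illustration of how the bandit strategies in the table work, here is a minimal UCB1 selector over a discrete config space. This is a sketch, not the actual `microrl/bandit.py` implementation; the arms and reward function are made up.

```python
import math

class UCB1:
    """Minimal UCB1 bandit over a discrete set of kernel configs (sketch)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)
        self.values = [0.0] * len(self.arms)   # running mean reward per arm
        self.total = 0

    def select(self):
        # Play each arm once before applying the UCB formula.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        # Pick the arm maximizing mean + exploration bonus.
        return max(
            range(len(self.arms)),
            key=lambda i: self.values[i]
            + math.sqrt(2 * math.log(self.total) / self.counts[i]),
        )

    def update(self, i, reward):
        self.counts[i] += 1
        self.total += 1
        # Incremental mean update.
        self.values[i] += (reward - self.values[i]) / self.counts[i]

bandit = UCB1([{"BLOCK": 64}, {"BLOCK": 128}, {"BLOCK": 256}])
for _ in range(30):
    i = bandit.select()
    # Pretend larger blocks earn higher reward (stand-in for -latency).
    bandit.update(i, bandit.arms[i]["BLOCK"] / 256)
```

After a few rounds the selector concentrates its trials on the best arm while still occasionally revisiting the others, which is why bandits need fewer trials than exhaustive grid search on larger spaces.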
```
triton_agent/
├── cli.py                    # CLI entry point (click)
├── agent/                    # Agent loop
│   ├── planner.py            # Strategy dispatch (grid / bandit / RL)
│   ├── generator.py          # Candidate generation (grid / best-of-N / bandit)
│   ├── repairer.py           # Auto-repair compile/verify failures
│   ├── selector.py           # Best-of-N selection by reward
│   └── promoter.py           # Promotion gate (min_speedup + max_variance)
├── core/                     # Infrastructure
│   ├── contract.py           # YAML contract parser + 8 dataclasses
│   ├── spec.py               # OpState / OpAction / CandidateResult
│   ├── registry.py           # Operator registry singleton
│   ├── compiler.py           # Triton JIT compile wrapper
│   ├── cuda_ext.py           # CUDA extension fallback compiler
│   ├── compile_cache.py      # SQLite compile result cache
│   ├── verifier.py           # Numerical correctness checker
│   ├── profiler.py           # Latency p50/p90/p99 profiler
│   ├── adaptive_profile.py   # Adaptive warmup + early-stop profiler
│   ├── reward.py             # 6-factor reward scoring
│   ├── storage.py            # EpisodeStore (JSONL) + LeaderboardStore (SQLite)
│   ├── replay.py             # Replay / compare / rollback
│   ├── eval_engine.py        # Parallel eval + regression detection
│   ├── checkpoint.py         # Save/resume optimization state
│   └── trend.py              # Leaderboard trend analysis
├── microrl/                  # Lightweight RL layer
│   ├── bandit.py             # UCB / Thompson / Epsilon-Greedy
│   ├── reinforce_lite.py     # REINFORCE policy gradient
│   ├── grpo_lite.py          # Group Relative Policy Optimization
│   └── trainer.py            # Multi-strategy trainer + ShapeConfigStore
├── ops/                      # 13 operator families (contract + ref + templates + verify + bench)
├── experiments/              # Benchmark framework
│   ├── config.py             # Experiment matrix definitions
│   ├── runner.py             # Trial execution engine
│   ├── ablation.py           # Targeted ablation scripts
│   └── report.py             # Markdown report generator
├── integrations/             # PyTorch / torch.compile / vLLM / SGLang adapters
├── benchmarks/               # Multi-shape benchmark suites
└── test/                     # 95 unit + integration tests
```
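The promotion gate in `promoter.py` is summarized above as `min_speedup + max_variance`. A plausible sketch of such a check follows; the field names, thresholds, and `should_promote` function are hypothetical, not the real `CandidateResult` from `core/spec.py`:

```python
from dataclasses import dataclass

@dataclass
class ProfileResult:
    # Hypothetical fields for illustration.
    baseline_p50_us: float   # PyTorch baseline median latency
    variant_p50_us: float    # Triton variant median latency
    variant_p99_us: float    # Triton variant tail latency

def should_promote(r: ProfileResult,
                   min_speedup: float = 1.1,
                   max_variance: float = 0.2) -> bool:
    """Promote only if the variant is clearly faster AND stable."""
    speedup = r.baseline_p50_us / r.variant_p50_us
    # Relative spread between tail and median latency as a stability proxy.
    variance = (r.variant_p99_us - r.variant_p50_us) / r.variant_p50_us
    return speedup >= min_speedup and variance <= max_variance

ok = should_promote(ProfileResult(100.0, 70.0, 80.0))    # fast and stable
bad = should_promote(ProfileResult(100.0, 70.0, 120.0))  # fast but jittery
```

Gating on variance as well as speedup is what lets the agent roll back configs that look fast on the median but regress in the tail.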
1. Create directory `triton_agent/ops/<name>/templates/`
2. Write `contract.yaml` — define inputs/outputs/tolerance/search_space
3. Write `reference.py` — PyTorch reference implementation + `generate_test_inputs()`
4. Write at least one Triton kernel template in `templates/triton_v1.py`
5. Write `verify.py` — compare against the reference
6. Write `benchmark.py` — measure latency

The agent auto-discovers operators under `ops/`. No registration code is needed.
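For example, a `reference.py` for the `rmsnorm` operator might look like the following. This is a sketch: NumPy stands in for PyTorch so the example is self-contained, while the real reference implementations in `ops/<name>/reference.py` use torch, and the exact function signatures are assumptions.

```python
import numpy as np

def reference(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: x / sqrt(mean(x^2) + eps) * weight, over the last dim."""
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def generate_test_inputs(B: int = 8, T: int = 16, D: int = 64, seed: int = 0):
    """Produce deterministic inputs matching the contract's shape names."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((B, T, D)).astype(np.float32)
    weight = np.ones(D, dtype=np.float32)
    return x, weight

x, w = generate_test_inputs()
y = reference(x, w)   # same shape as x; each row normalized to ~unit RMS
```

`verify.py` then only has to compare a Triton kernel's output against `reference()` on the inputs from `generate_test_inputs()`, within the tolerance declared in `contract.yaml`.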
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest test/ -v

# Run a single test file
pytest test/test_microrl.py -v

# Coverage
pytest test/ --cov=triton_agent --cov-report=html
```
All Python dependencies are installed into the project's venv. GPU dependencies (`torch`, `triton`) are optional extras and install only on Linux.
| Environment | Dependencies |
|---|---|
| Development (any OS) | Python >= 3.10, click, pyyaml, numpy, rich, pytest |
| GPU (Linux only) | torch >= 2.0, triton >= 2.1 |
| CUDA fallback | CUDA Toolkit, nvcc |
MIT