Build the most powerful step function, then run it in loops to get work done.
Secondary goal: educate. Break everything into small chunks so anyone can learn how it's built and modify it for their own needs.
If we perfect the step function, the loop is trivial:
for task in dag.topological_order() {
    step(task)
}
All complexity lives in step(). Everything else is orchestration.
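To make the orchestration concrete, here is a minimal sketch of the outer loop using Kahn's algorithm for the ordering. The `Dag` type, its `add_task` method, and `run_loop` are hypothetical names for illustration; only `topological_order` and `step` are mentioned in the text, and their real signatures may differ.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Hypothetical task graph: each task id maps to the ids it depends on.
#[derive(Default)]
pub struct Dag {
    deps: BTreeMap<String, BTreeSet<String>>,
}

impl Dag {
    pub fn add_task(&mut self, id: &str, depends_on: &[&str]) {
        for dep in depends_on {
            // Make sure every dependency is itself a node in the graph.
            self.deps.entry(dep.to_string()).or_default();
        }
        self.deps
            .entry(id.to_string())
            .or_default()
            .extend(depends_on.iter().map(|d| d.to_string()));
    }

    /// Kahn's algorithm. Returns None if the graph contains a cycle.
    pub fn topological_order(&self) -> Option<Vec<String>> {
        // Number of unfinished dependencies per task.
        let mut remaining: BTreeMap<&str, usize> = BTreeMap::new();
        for (id, deps) in &self.deps {
            remaining.insert(id.as_str(), deps.len());
        }
        let mut ready: VecDeque<&str> = VecDeque::new();
        for (id, count) in &remaining {
            if *count == 0 {
                ready.push_back(*id);
            }
        }
        let mut order = Vec::new();
        while let Some(done) = ready.pop_front() {
            order.push(done.to_string());
            for (id, deps) in &self.deps {
                if deps.contains(done) {
                    let count = remaining
                        .get_mut(id.as_str())
                        .expect("every id was inserted above");
                    *count -= 1;
                    if *count == 0 {
                        ready.push_back(id.as_str());
                    }
                }
            }
        }
        if order.len() == self.deps.len() {
            Some(order)
        } else {
            None
        }
    }
}

/// Runs `step` once per task in dependency order; returns false on a cycle.
pub fn run_loop(dag: &Dag, mut step: impl FnMut(&str)) -> bool {
    match dag.topological_order() {
        Some(order) => {
            for task in &order {
                step(task.as_str());
            }
            true
        }
        None => false,
    }
}
```

The loop itself stays a few lines long; everything interesting happens in whatever closure is passed as `step`.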
A step is a single atomic unit of work:
step(state, task) → (state', result)
Inputs:
- state - codebase, task files, context
- task - what to accomplish
Outputs:
- state' - modified codebase, updated task status
- result - Success | Failed
What happens inside:
- Set up isolated environment (worktree)
- Agent works (Claude with controlled tools)
- Verify (tests, build)
- Transition (commit, merge into session branch)
- Return result
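The phases above can be sketched as a single function. This is an illustrative outline under stated assumptions, not rloop's actual code: the `Environment` trait, its method names, the fields of `State`, and the `FakeEnv` test double are all hypothetical stand-ins for the worktree, agent, and verification machinery.

```rust
/// Outcome of a step, as described in the text.
#[derive(Debug, Clone, PartialEq)]
pub enum StepResult {
    Success,
    Failed,
}

/// Hypothetical, simplified state: completed task ids plus failure learnings.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct State {
    pub completed: Vec<String>,
    pub learnings: Vec<String>,
}

/// Hypothetical interface over the side effects a step performs.
pub trait Environment {
    fn setup(&mut self, task: &str); // create the isolated worktree
    fn run_agent(&mut self, task: &str); // let the agent work
    fn verify(&mut self, task: &str) -> bool; // tests, build
    fn merge(&mut self, task: &str); // commit, merge into session branch
}

/// step(state, task) -> (state', result)
pub fn step<E: Environment>(env: &mut E, mut state: State, task: &str) -> (State, StepResult) {
    env.setup(task);
    env.run_agent(task);
    if env.verify(task) {
        env.merge(task);
        state.completed.push(task.to_string());
        (state, StepResult::Success)
    } else {
        // Nothing is merged on failure; only the learning is kept.
        state.learnings.push(format!("{}: verification failed", task));
        (state, StepResult::Failed)
    }
}

/// Test double that records which phases ran.
pub struct FakeEnv {
    pub passes: bool,
    pub phases: Vec<&'static str>,
}

impl Environment for FakeEnv {
    fn setup(&mut self, _task: &str) {
        self.phases.push("setup");
    }
    fn run_agent(&mut self, _task: &str) {
        self.phases.push("agent");
    }
    fn verify(&mut self, _task: &str) -> bool {
        self.phases.push("verify");
        self.passes
    }
    fn merge(&mut self, _task: &str) {
        self.phases.push("merge");
    }
}
```

Hiding the side effects behind a trait is one possible design choice; it lets the control flow of a step be tested without a real worktree or model call.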
rloop is a Markov Decision Process:
MDP = (S, A, T, R, π)
S = State space (codebase × task_state × context)
A = Action space (a step execution)
T = Transition function: S × A → S'
R = Reward function: S × A → {Success, Failed}
π = Policy: task → step_config
State (S): everything observable before a step:
- Codebase (files, git history)
- Task state (completed or not, from task files)
- Previous attempts (failure learnings accumulated in memory, session logs on disk)
Action (A): a step execution. Within the step, the agent takes many micro-actions (edits, tool calls), but from the MDP's view, the whole step is one action.
Transition (T): deterministic given the step outcome:
- Success → task marked completed, code merged into session branch, state updated
- Failed → task remains incomplete, learnings captured, retry possible
Reward (R): a sparse signal at the end of the step:
- Success - verification passes (or complete is called with no verification configured), task done
- Failed - verification failed after max retries, or agent gave up
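The reward rule above can be written as a small function. This is a sketch under assumptions: the name `evaluate`, the `run_command` callback, and the reading of "max retries" as one initial attempt plus up to `max_retries` further attempts are all hypothetical and not confirmed by the text.

```rust
#[derive(Debug, PartialEq)]
pub enum Reward {
    Success,
    Failed,
}

/// `verification` holds the commands to run, or None if none are configured.
/// `run_command` is a hypothetical callback that returns true when a command passes.
pub fn evaluate<F>(
    verification: Option<&[String]>,
    max_retries: u32,
    called_complete: bool,
    mut run_command: F,
) -> Reward
where
    F: FnMut(&str) -> bool,
{
    let commands = match verification {
        // No verification configured: the agent's `complete` call decides.
        None => {
            return if called_complete {
                Reward::Success
            } else {
                Reward::Failed
            }
        }
        Some(commands) => commands,
    };
    // Assumption: one initial attempt plus up to `max_retries` retries.
    for _attempt in 0..=max_retries {
        // `all` stops at the first failing command.
        if commands.iter().all(|cmd| run_command(cmd.as_str())) {
            return Reward::Success;
        }
        // A real loop would hand the failure output back to the agent here.
    }
    Reward::Failed
}
```

The signal stays binary: partial progress inside a failed attempt earns nothing, which is what makes the reward sparse.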
Policy (π): how we configure each step:
struct StepConfig {
    model: Model,                      // haiku, sonnet, opus
    system_prompt: String,             // task framing
    tools: Vec<Tool>,                  // allowed tools
    max_turns: u32,                    // budget
    max_retries: u32,                  // verification retry limit
    verification: Option<Vec<String>>, // commands to run (per-task)
}
The policy maps tasks to configs. Initially the mapping comes from task frontmatter plus project config; later it is learned.
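A minimal sketch of the initial, hand-written policy might look like the following. The `Frontmatter` struct and the rule "task frontmatter overrides project defaults" are assumptions for illustration; `system_prompt` and `tools` are left out of this trimmed-down `StepConfig` to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Model {
    Haiku,
    Sonnet,
    Opus,
}

/// Trimmed-down version of the StepConfig shown above.
#[derive(Debug, Clone, PartialEq)]
pub struct StepConfig {
    pub model: Model,
    pub max_turns: u32,
    pub max_retries: u32,
    pub verification: Option<Vec<String>>,
}

/// Hypothetical per-task overrides parsed from task frontmatter.
#[derive(Debug, Clone, Default)]
pub struct Frontmatter {
    pub model: Option<Model>,
    pub max_turns: Option<u32>,
    pub max_retries: Option<u32>,
    pub verification: Option<Vec<String>>,
}

/// Policy: task -> step_config. Frontmatter wins; project config fills the gaps.
pub fn policy(project: &StepConfig, task: &Frontmatter) -> StepConfig {
    StepConfig {
        model: task.model.unwrap_or(project.model),
        max_turns: task.max_turns.unwrap_or(project.max_turns),
        max_retries: task.max_retries.unwrap_or(project.max_retries),
        // Verification is per-task, so it is taken from the task only.
        verification: task.verification.clone(),
    }
}
```

A learned policy could later replace the body of `policy` without changing its signature.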
Every step produces a trajectory of the micro-actions taken inside it:
τ = [(s₀, a₀), (s₁, a₁), ..., (sₙ, aₙ), reward]
We log everything:
- Full session events (Claude messages, tool calls, results)
- rloop lifecycle events (worktree create, merge, cleanup)
- Config used (model, prompt, tools)
- Metrics (tokens, duration, tool call counts, context percentage)
- Outcome (Success/Failed)
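One possible shape for a logged trajectory is sketched below. The `Event` variants, field names, and the `tool_call_counts` helper are hypothetical; the text only says which categories of information are logged, not how they are stored.

```rust
use std::collections::BTreeMap;

/// Hypothetical event kinds, mirroring the categories listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Message { role: String, text: String }, // Claude messages
    ToolCall { tool: String },              // tool calls
    Lifecycle { name: String },             // rloop lifecycle events
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Success,
    Failed,
}

/// One record per step: the config used, the events, and the outcome.
#[derive(Debug, Clone)]
pub struct Trajectory {
    pub task_id: String,
    pub model: String,
    pub events: Vec<Event>,
    pub outcome: Outcome,
}

impl Trajectory {
    /// Derives one of the metrics (tool call counts) from the raw events.
    pub fn tool_call_counts(&self) -> BTreeMap<&str, u32> {
        let mut counts = BTreeMap::new();
        for event in &self.events {
            if let Event::ToolCall { tool } = event {
                *counts.entry(tool.as_str()).or_insert(0) += 1;
            }
        }
        counts
    }
}
```

Keeping the raw events and deriving metrics from them means new metrics can be computed later from old logs.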
From trajectories, we improve the system:
┌────────────────────────────────────────────────────────────────┐
│  OUTER LOOP (System Improvement)                               │
│                                                                │
│  Session Logs + Rewards → Analysis → Better Policy             │
│                                                                │
│  Tune: prompts, model selection, tool sets, task specs         │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│  INNER LOOP (Task Execution)                                   │
│                                                                │
│  step(state, task) → (state', result)                          │
│                                                                │
└────────────────────────────────────────────────────────────────┘
This is the key: Every step is a data point for making the next step better.
Humans don't block the loop. Agents work continuously.
When an agent needs human input:
- Queue the request (question, assumption, approval)
- Return Blocked
- Move to next task
Humans review asynchronously:
- Batch process queue items
- Zero wait latency between items
- Answers unblock tasks for future steps
The queue is not in v1. For now, human interaction means reviewing session branches and improving task specs between runs.
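Although the queue is a later feature, its control flow is simple enough to sketch. Everything here is hypothetical: `HumanQueue`, `Request`, `run_tasks`, and the `needs_input` callback are illustrative names, and `Blocked` is added as a third result alongside Success and Failed.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, PartialEq)]
pub enum RequestKind {
    Question,
    Assumption,
    Approval,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub task_id: String,
    pub kind: RequestKind,
    pub text: String,
}

#[derive(Debug, PartialEq)]
pub enum StepResult {
    Success,
    Failed,
    Blocked,
}

#[derive(Default)]
pub struct HumanQueue {
    pending: VecDeque<Request>,
}

impl HumanQueue {
    /// Agent side: queue the request and return Blocked without waiting.
    pub fn ask(&mut self, request: Request) -> StepResult {
        self.pending.push_back(request);
        StepResult::Blocked
    }

    /// Human side: take every pending item in one batch.
    pub fn drain_batch(&mut self) -> Vec<Request> {
        self.pending.drain(..).collect()
    }

    pub fn len(&self) -> usize {
        self.pending.len()
    }
}

/// The loop never waits: a blocked task is recorded and the next one starts.
pub fn run_tasks(
    tasks: &[&str],
    queue: &mut HumanQueue,
    needs_input: impl Fn(&str) -> bool,
) -> Vec<(String, StepResult)> {
    let mut results = Vec::new();
    for task in tasks {
        let result = if needs_input(*task) {
            queue.ask(Request {
                task_id: task.to_string(),
                kind: RequestKind::Question,
                text: "placeholder question".to_string(),
            })
        } else {
            StepResult::Success // stand-in for actually running the step
        };
        results.push((task.to_string(), result));
    }
    results
}
```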
One step = one task attempt. Complete isolation:
- Fresh worktree
- Fresh context (no accumulated state across tasks)
- Clean verification
- Clear outcome
We control everything going into the step:
- Isolated worktree (agent can't escape working directory)
- Project-level settings loaded (CLAUDE.md, skills, agents)
- No global/user-level ambient context
- Measured context (token tracking from first message)
- Agent has self-awareness of context usage
Full visibility into the step:
- Every Claude message logged
- Every tool call and result logged
- Every rloop lifecycle event logged
- Token counts accumulated per message
- Context percentage tracked in real-time
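Token accumulation and the context percentage could be tracked with something as small as the sketch below. `ContextMeter` and its methods are hypothetical, and treating context usage as a running sum of per-message token counts against a fixed window size is an assumption; the real accounting may differ.

```rust
/// Hypothetical tracker for how much of the context window a step has used.
pub struct ContextMeter {
    window_tokens: u64,
    used_tokens: u64,
}

impl ContextMeter {
    /// `window_tokens` is the model's context window size, supplied by the caller.
    pub fn new(window_tokens: u64) -> Self {
        assert!(window_tokens > 0, "window size must be positive");
        ContextMeter {
            window_tokens,
            used_tokens: 0,
        }
    }

    /// Called once per logged message with that message's token count.
    pub fn record_message(&mut self, tokens: u64) {
        self.used_tokens = self.used_tokens.saturating_add(tokens);
    }

    pub fn used_tokens(&self) -> u64 {
        self.used_tokens
    }

    /// Percentage of the window consumed so far.
    pub fn percent_used(&self) -> f64 {
        self.used_tokens as f64 / self.window_tokens as f64 * 100.0
    }
}
```

Exposing `percent_used` to the agent is one way to give it the self-awareness of context usage mentioned earlier.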
We don't just run steps, we learn from them:
- Successful patterns → reinforce
- Failure modes → analyze and fix
- Task spec tuning based on outcomes
- Model selection based on task type success rates
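As an illustration of model selection based on task type success rates, the sketch below keeps a tally per (task type, model) pair and picks the model with the highest observed rate. `Stats` and its methods are hypothetical, and the rule is deliberately naive: it ignores sample size, cost, and latency.

```rust
use std::collections::BTreeMap;

/// Hypothetical tally: (task_type, model) -> (successes, attempts).
#[derive(Default)]
pub struct Stats {
    records: BTreeMap<(String, String), (u32, u32)>,
}

impl Stats {
    /// Records the outcome of one step.
    pub fn record(&mut self, task_type: &str, model: &str, success: bool) {
        let entry = self
            .records
            .entry((task_type.to_string(), model.to_string()))
            .or_insert((0, 0));
        if success {
            entry.0 += 1;
        }
        entry.1 += 1;
    }

    /// Returns the model with the highest success rate for this task type,
    /// or None if the task type has never been attempted.
    pub fn best_model(&self, task_type: &str) -> Option<&str> {
        let mut best: Option<(&str, f64)> = None;
        for ((recorded_type, model), (successes, attempts)) in &self.records {
            if recorded_type.as_str() != task_type || *attempts == 0 {
                continue;
            }
            let rate = f64::from(*successes) / f64::from(*attempts);
            if best.map_or(true, |(_, best_rate)| rate > best_rate) {
                best = Some((model.as_str(), rate));
            }
        }
        best.map(|(model, _)| model)
    }
}
```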
- Step is solid - Consistent, reproducible, observable
- Rewards are clear - Success/Failed, no ambiguity
- Learning is possible - Full trajectories logged for analysis
- Tasks are replayable - Same specs can be run multiple times to compare results
- System improves - Better task specs and configs from trajectory analysis
Once the core loop is solid:
- Parallel DAG execution (concurrent independent tasks)
- Human queue (ask_question, log_assumption, propose_task)
- Task creation by agents (discovered work becomes new tasks)
- Daemon mode (long-running, concurrent sessions)
- Community: weekly features, streams, blog posts
- Enable others to fork and build their own loops