rloop Vision

Goal

Build the most powerful step function, then run it in loops to get work done.

Secondary goal: educate. Break everything into small chunks so anyone can learn how it's built and modify it for their own needs.

The Step is Everything

Core Insight

If we perfect the step function, the loop is trivial:

for task in dag.topological_order() {
    (state, result) = step(state, task);
}

All complexity lives in step(). Everything else is orchestration.
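
A minimal sketch of that loop, assuming tasks are identified by index and each task lists the tasks it depends on. `Dag`, `deps`, and `run_all` are hypothetical names chosen for illustration, not rloop's actual API. The ordering uses Kahn's algorithm, which is one common way to compute a topological order; the source does not say which algorithm rloop uses.

```rust
use std::collections::VecDeque;

/// Hypothetical task graph: `deps[i]` lists the tasks that task `i` depends on.
pub struct Dag {
    pub deps: Vec<Vec<usize>>,
}

impl Dag {
    /// Kahn's algorithm. Returns `None` if the graph contains a cycle.
    pub fn topological_order(&self) -> Option<Vec<usize>> {
        let n = self.deps.len();
        // Number of unfinished dependencies per task.
        let mut remaining: Vec<usize> = self.deps.iter().map(|d| d.len()).collect();
        // Reverse edges: which tasks are waiting on each task.
        let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); n];
        for (task, deps) in self.deps.iter().enumerate() {
            for &dep in deps {
                dependents[dep].push(task);
            }
        }
        let mut ready: VecDeque<usize> = (0..n).filter(|&t| remaining[t] == 0).collect();
        let mut order = Vec::with_capacity(n);
        while let Some(task) = ready.pop_front() {
            order.push(task);
            for &next in &dependents[task] {
                remaining[next] -= 1;
                if remaining[next] == 0 {
                    ready.push_back(next);
                }
            }
        }
        if order.len() == n {
            Some(order)
        } else {
            None
        }
    }
}

/// The outer loop: call `step` once per task, in dependency order.
/// Returns `None` without running anything if the graph has a cycle.
pub fn run_all(dag: &Dag, mut step: impl FnMut(usize)) -> Option<()> {
    for task in dag.topological_order()? {
        step(task);
    }
    Some(())
}
```

The loop itself stays three lines long; everything interesting is delegated to the `step` closure.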

The Step

A step is a single atomic unit of work:

step(state, task) → (state', result)

Inputs:

state - codebase, task files, context
task - what to accomplish

Outputs:

state' - modified codebase, updated task status
result - Success | Failed

What happens inside:

Set up isolated environment (worktree)
Agent works (Claude with controlled tools)
Verify (tests, build)
Transition (commit, merge into session branch)
Return result
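
The phases above can be sketched as a function that runs them in order and stops at the first failure. This is an illustration under assumptions, not rloop's implementation: the `Phases` trait, the simplified `State`, and the `Scripted` stand-in are all hypothetical, and the real phases (worktree setup, the Claude session, verification, merge) are reduced to hooks that either succeed or return a reason.

```rust
/// Outcome of a step, using the two results named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepResult {
    Success,
    Failed,
}

/// Simplified state (assumption): only the list of completed task ids.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct State {
    pub completed: Vec<String>,
}

/// Hypothetical hooks, one per phase. Each returns Err(reason) on failure.
pub trait Phases {
    fn setup_worktree(&mut self, task: &str) -> Result<(), String>;
    fn run_agent(&mut self, task: &str) -> Result<(), String>;
    fn verify(&mut self, task: &str) -> Result<(), String>;
    fn commit_and_merge(&mut self, task: &str) -> Result<(), String>;
}

/// Runs the phases in order; `?` stops at the first failing phase.
fn attempt(task: &str, phases: &mut impl Phases) -> Result<(), String> {
    phases.setup_worktree(task)?;
    phases.run_agent(task)?;
    phases.verify(task)?;
    phases.commit_and_merge(task)
}

/// step(state, task) -> (state', result)
pub fn step(mut state: State, task: &str, phases: &mut impl Phases) -> (State, StepResult) {
    match attempt(task, phases) {
        Ok(()) => {
            state.completed.push(task.to_string());
            (state, StepResult::Success)
        }
        // The task stays incomplete; a real system would record the reason.
        Err(_reason) => (state, StepResult::Failed),
    }
}

/// Demonstration-only stand-in: logs each phase and fails at `fail_at`, if set.
pub struct Scripted {
    pub fail_at: Option<&'static str>,
    pub log: Vec<&'static str>,
}

impl Scripted {
    fn phase(&mut self, name: &'static str) -> Result<(), String> {
        self.log.push(name);
        if self.fail_at == Some(name) {
            Err(format!("{} failed", name))
        } else {
            Ok(())
        }
    }
}

impl Phases for Scripted {
    fn setup_worktree(&mut self, _task: &str) -> Result<(), String> {
        self.phase("setup")
    }
    fn run_agent(&mut self, _task: &str) -> Result<(), String> {
        self.phase("agent")
    }
    fn verify(&mut self, _task: &str) -> Result<(), String> {
        self.phase("verify")
    }
    fn commit_and_merge(&mut self, _task: &str) -> Result<(), String> {
        self.phase("merge")
    }
}
```

Taking `State` by value and returning the new one mirrors the `(state, task) → (state', result)` signature: a failed step hands back the state unchanged.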

MDP Formulation

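rloop models task execution as a Markov Decision Process (MDP). The tuple below also lists the policy π, which is not part of a standard MDP definition but is the part rloop tunes: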

MDP = (S, A, T, R, π)

S = State space (codebase × task_state × context)
A = Action space (a step execution)
T = Transition function: S × A → S'
R = Reward function: S × A → {Success, Failed}
π = Policy: task → step_config

State (S)

Everything observable before a step:

Codebase (files, git history)
Task state (completed or not, from task files)
Previous attempts (failure learnings accumulated in memory, session logs on disk)

Action (A)

A step execution. Within the step, the agent takes many micro-actions (edits, tool calls), but from the MDP's view, the whole step is one action.

Transition (T)

Deterministic given the step outcome:

Success → task marked completed, code merged into session branch, state updated
Failed → task remains incomplete, learnings captured, retry possible
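
A minimal sketch of this transition on the task-state component only. `TaskState`, `TaskOutcome`, the `attempts` counter, and the `learning` payload on `Failed` are assumptions made for illustration; the source only names the two outcomes and says that learnings are captured.

```rust
/// Hypothetical outcome type: `Failed` carries the captured learning.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TaskOutcome {
    Success,
    Failed { learning: String },
}

/// Hypothetical per-task state tracked across attempts.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct TaskState {
    pub completed: bool,
    pub attempts: u32,
    pub learnings: Vec<String>,
}

/// Deterministic transition: the next state depends only on the
/// current state and the step outcome.
pub fn transition(mut state: TaskState, outcome: TaskOutcome) -> TaskState {
    state.attempts += 1;
    match outcome {
        TaskOutcome::Success => state.completed = true,
        TaskOutcome::Failed { learning } => state.learnings.push(learning),
    }
    state
}
```

Because a failed attempt leaves `completed` false and only appends a learning, running the same task again is always possible.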

Reward (R)

Sparse signal at end of step:

Success - verification passes (or the agent calls complete with no verification configured); the task is done
Failed - verification still fails after max retries, or the agent gives up
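
The reward rule can be sketched as a small function. This is an illustration under assumptions: `reward`, `run`, and `retry` are hypothetical names, the callbacks stand in for running a verification command and for letting the agent attempt a fix, and "max retries" is read here as the number of additional attempts allowed after the first verification run.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reward {
    Success,
    Failed,
}

/// `run` executes one verification command and reports whether it passed.
/// `retry` lets the agent attempt a fix; it returns false if the agent gives up.
pub fn reward(
    verification: Option<&[String]>,
    max_retries: u32,
    mut run: impl FnMut(&str) -> bool,
    mut retry: impl FnMut() -> bool,
) -> Reward {
    let commands = match verification {
        // No verification configured: completion counts as success.
        None => return Reward::Success,
        Some(commands) => commands,
    };
    let mut retries = 0;
    loop {
        // All commands must pass; `all` stops at the first failing command.
        if commands.iter().all(|c| run(c.as_str())) {
            return Reward::Success;
        }
        // Out of retries, or the agent gave up.
        if retries == max_retries || !retry() {
            return Reward::Failed;
        }
        retries += 1;
    }
}
```

The signal stays sparse: the function returns exactly one value at the end, regardless of how many verification runs happened along the way.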

Policy (π)

How we configure each step:

struct StepConfig {
    model: Model,              // haiku, sonnet, opus
    system_prompt: String,     // task framing
    tools: Vec<Tool>,          // allowed tools
    max_turns: u32,            // budget
    max_retries: u32,          // verification retry limit
    verification: Option<Vec<String>>, // commands to run (per-task)
}

The policy maps tasks to configs. Initially from task frontmatter + project config, later learned.
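
A minimal sketch of the initial, non-learned policy, assuming task frontmatter supplies optional overrides on top of project-level defaults. The `Frontmatter` type and the override-then-default rule are assumptions; the `system_prompt` and `tools` fields are left out to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Model {
    Haiku,
    Sonnet,
    Opus,
}

/// Trimmed-down version of the StepConfig shown above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StepConfig {
    pub model: Model,
    pub max_turns: u32,
    pub max_retries: u32,
    pub verification: Option<Vec<String>>,
}

/// Hypothetical per-task overrides parsed from task frontmatter.
#[derive(Debug, Clone, Default)]
pub struct Frontmatter {
    pub model: Option<Model>,
    pub max_turns: Option<u32>,
    pub max_retries: Option<u32>,
    pub verification: Option<Vec<String>>,
}

/// Policy: each field comes from the task if set, else from the project config.
pub fn policy(task: &Frontmatter, project: &StepConfig) -> StepConfig {
    StepConfig {
        model: task.model.unwrap_or(project.model),
        max_turns: task.max_turns.unwrap_or(project.max_turns),
        max_retries: task.max_retries.unwrap_or(project.max_retries),
        verification: task
            .verification
            .clone()
            .or_else(|| project.verification.clone()),
    }
}
```

A learned policy could later replace the body of `policy` without changing its signature, since callers only depend on the task-to-config mapping.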

The Learning Loop

Every step produces a trajectory:

τ = [(s₀, a₀), (s₁, a₁), ..., (sₙ, aₙ), reward]

We log everything:

Full session events (Claude messages, tool calls, results)
rloop lifecycle events (worktree create, merge, cleanup)
Config used (model, prompt, tools)
Metrics (tokens, duration, tool call counts, context percentage)
Outcome (Success/Failed)

From trajectories, we improve the system:

┌─────────────────────────────────────────────────────────────────┐
│                    OUTER LOOP (System Improvement)              │
│                                                                 │
│   Session Logs  +  Rewards  →  Analysis  →  Better Policy      │
│                                                                 │
│   Tune: prompts, model selection, tool sets, task specs        │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                    INNER LOOP (Task Execution)                  │
│                                                                 │
│              step(state, task) → (state', result)              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

This is the key: every step is a data point for making the next step better.

Human-on-Loop (Future)

Humans don't block the loop. Agents work continuously.

When an agent needs human input:

Queue the request (question, assumption, approval)
Return Blocked
Move to next task

Humans review asynchronously:

Batch process queue items
Zero wait latency between items
Answers unblock tasks for future steps
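
A sketch of how such a queue might look, given that this feature is still in the future. Everything here is hypothetical: the `HumanQueue` type, the `Request` variants (taken from the three request kinds listed above), and a `Blocked` result alongside `Success` and `Failed`.

```rust
use std::collections::VecDeque;

/// The three kinds of request named above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Request {
    Question(String),
    Assumption(String),
    Approval(String),
}

/// Hypothetical step result extended with `Blocked`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueuedResult {
    Success,
    Failed,
    Blocked,
}

#[derive(Default)]
pub struct HumanQueue {
    pending: VecDeque<(String, Request)>,
}

impl HumanQueue {
    /// Agent side: record the request and return Blocked immediately,
    /// so the loop can move on to the next task without waiting.
    pub fn ask(&mut self, task: &str, request: Request) -> QueuedResult {
        self.pending.push_back((task.to_string(), request));
        QueuedResult::Blocked
    }

    /// Human side: drain all pending requests in one batch.
    /// Returns (task id, answer) pairs for the tasks that are now unblocked.
    pub fn review(
        &mut self,
        mut answer: impl FnMut(&str, &Request) -> String,
    ) -> Vec<(String, String)> {
        self.pending
            .drain(..)
            .map(|(task, request)| {
                let reply = answer(task.as_str(), &request);
                (task, reply)
            })
            .collect()
    }
}
```

Neither side ever waits on the other: `ask` returns at once, and `review` processes whatever has accumulated since the last batch.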

Not in v1. For now, human interaction means reviewing session branches and improving task specs between runs.

Design Principles

1. The Step is Atomic

One step = one task attempt. Complete isolation:

Fresh worktree
Fresh context (no accumulated state across tasks)
Clean verification
Clear outcome

2. Controlled Environment

We control everything going into the step:

Isolated worktree (agent can't escape working directory)
Project-level settings loaded (CLAUDE.md, skills, agents)
No global/user-level ambient context
Measured context (token tracking from first message)
Agent has self-awareness of context usage

3. Observable Everything

Full visibility into the step:

Every Claude message logged
Every tool call and result logged
Every rloop lifecycle event logged
Token counts accumulated per message
Context percentage tracked in real-time
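
A minimal sketch of the token accounting, assuming context percentage is simply accumulated tokens divided by a fixed context window size. `ContextMeter` and its methods are hypothetical names, and how rloop actually derives token counts from Claude's responses is not stated in the source.

```rust
/// Hypothetical running meter for context usage within one step.
pub struct ContextMeter {
    window_tokens: u64,
    used_tokens: u64,
}

impl ContextMeter {
    /// `window_tokens` is the model's context window size (must be non-zero).
    pub fn new(window_tokens: u64) -> Self {
        assert!(window_tokens > 0, "context window must be non-zero");
        Self {
            window_tokens,
            used_tokens: 0,
        }
    }

    /// Adds one message's token count and returns the updated percentage.
    pub fn record(&mut self, message_tokens: u64) -> f64 {
        self.used_tokens = self.used_tokens.saturating_add(message_tokens);
        self.percent()
    }

    /// Percentage of the context window used so far.
    pub fn percent(&self) -> f64 {
        100.0 * self.used_tokens as f64 / self.window_tokens as f64
    }
}
```

Returning the percentage from `record` makes it easy to log the value with every message and to surface it to the agent, which is one way to give it awareness of its own context usage.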

4. Rewards Drive Improvement

We don't just run steps, we learn from them:

Successful patterns → reinforce
Failure modes → analyze and fix
Task spec tuning based on outcomes
Model selection based on task type success rates
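
As one illustration of the last item, model selection from logged outcomes could start as a plain success-rate comparison. The `Record` shape, the `best_model` function, and the `min_samples` threshold are assumptions; the source does not specify how the analysis is done.

```rust
use std::collections::HashMap;

/// Hypothetical summary of one logged step.
pub struct Record {
    pub task_type: String,
    pub model: String,
    pub success: bool,
}

/// Returns the model with the highest success rate for `task_type`,
/// ignoring models with fewer than `min_samples` recorded steps.
pub fn best_model(records: &[Record], task_type: &str, min_samples: u32) -> Option<String> {
    // model -> (successes, total)
    let mut stats: HashMap<&str, (u32, u32)> = HashMap::new();
    for r in records.iter().filter(|r| r.task_type == task_type) {
        let entry = stats.entry(r.model.as_str()).or_insert((0, 0));
        if r.success {
            entry.0 += 1;
        }
        entry.1 += 1;
    }
    stats
        .into_iter()
        .filter(|(_, (_, total))| *total >= min_samples)
        .max_by(|(ma, (sa, ta)), (mb, (sb, tb))| {
            // Compare sa/ta with sb/tb by cross-multiplication to avoid floats.
            (u64::from(*sa) * u64::from(*tb))
                .cmp(&(u64::from(*sb) * u64::from(*ta)))
                // On a tie, prefer the alphabetically first model name.
                .then_with(|| mb.cmp(ma))
        })
        .map(|(model, _)| model.to_string())
}
```

The `min_samples` threshold guards against promoting a model on the strength of one lucky run; a real analysis would likely also weigh cost and duration.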

Success Criteria

Step is solid - Consistent, reproducible, observable
Rewards are clear - Success/Failed, no ambiguity
Learning is possible - Full trajectories logged for analysis
Tasks are replayable - Same specs can be run multiple times to compare results
System improves - Better task specs and configs from trajectory analysis

Beyond v1

Once the core loop is solid:

Parallel DAG execution (concurrent independent tasks)
Human queue (ask_question, log_assumption, propose_task)
Task creation by agents (discovered work becomes new tasks)
Daemon mode (long-running, concurrent sessions)
Community: weekly features, streams, blog posts
Enable others to fork and build their own loops
