Skip to content

Rockielab/rockie-claude

rockie-claude

An autonomous AI research harness for Claude Code.

Also called: rocky-claude Β· "Rocky for Claude Code" Β· "the Rocky harness" Β· a Claude Code research harness Β· an autonomous research agent for Claude. If you found this by searching for any of those: you're in the right place.

Inspired by Project Hail Mary's Rocky β€” the alien research partner you couldn't have built the answer without.

Looking for the Codex CLI version? See saml212/rockie-codex. Same patterns ([LEARN], taste corpus, autopilot, gauntlets) ported to OpenAI Codex CLI's runtime.


What rockie-claude does for you

Four jobs, run continuously, until you tell it to stop:

1. Captures your research taste β€” and iterates on it

A 5-minute first-run interview compiles your worldview, methodology, dismissals, and voice into a durable .rockie/taste/ corpus. Every future session loads it automatically. Your agent knows what you think a good result looks like, what dead ends you've sworn off, and what register you want it to write in. Refresh anytime with /onboard --section <name>. Voice-first deep mode for laddering.

2. Bulletproofs every step with adversarial subagent networks

Plan β†’ Research β†’ Build β†’ Audit β†’ Run β†’ Assess β†’ Codify. Each step has adversarial review built in:

  • /deploy-team β€” gauntlets (brainstorm / research / attack / validate), pre-launch audits, post-run analysis. A team of agents fight over every important call.
  • /clean β€” pre-commit anti-slop audit gates git commit until debug artifacts and stale claims are gone.
  • /propose-harness-change β€” Generator / Verifier / Updater split. The agent never auto-pushes.
  • stuck-detector + hypothesis calibration + dead-end registry β€” background services that nudge the agent when it's spinning, when its priors are drifting, when it's about to re-propose a dead idea.

3. Cheap, resource-efficient autonomy β€” indefinitely

  • Local-first. SQLite + FTS5 [LEARN] memory. No vector DB. No service except Claude itself.
  • Claude Max friendly. Tokens/wallclock/tool-calls auto-tracked but uncapped (they cost nothing). Only GPU dollars get enforced ceilings.
  • Spot-first GPU policy. Min-bid defaults. Provider-hop on preemption (RunPod / Vast / Prime / Verda) before ever bumping a bid. On-demand last resort, gated.
  • Modes β€” /mode switch paper-crunch for deadline-locked scope-lock + Opus-on-attack; /mode switch exploratory for broad reading + sonnet-first speed; build your own. Swap operational policy without changing your identity.

4. Stays honest

  • Catches bugs before you burn GPU time (separate auditor agent reads shapes/gradients/stability pre-launch).
  • Notices when it's stuck (4 semantic-loop types, periods 2/3/4).
  • Tracks whether predictions were right (predicted_delta vs actual_delta per experiment).
  • Classifies every failure: bug | bad-hyperparam | bad-hypothesis. Routes [LEARN] and [DEAD-END] accordingly.

Status: alpha / pre-launch. Running in production on an 8Γ—H100 autonomous research project. This repo packages it for the community. Breaking changes until v0.1.


What you and your agent will go through

A chronological walkthrough of the first week. Each phase below is what actually happens; the rest of this README explains the machinery.

Hour 0 β€” Install + onboard. Run ./install.sh ~/your-research-project. Open Claude Code in that project; the SessionStart hook spots that no taste corpus exists and prompts /onboard. Five to seven questions, ~5 minutes, voice optional. Output: a six-file taste corpus committed to <project>/.rockie/taste/ (SOUL, STYLE, METHODOLOGY, DISMISSALS, MEMORY, INDEX). INDEX.md is auto-injected into every future session.

Day 1 β€” Plan + first experiment. You talk to Claude. Subagents verify novelty and check for re-proposing dead ends already logged in the registry. Before any GPU dollars are spent, the pre-launch audit agent reads shapes, gradients, and stability of the proposed code in a separate context. Only then does the first training run launch. Every prediction (predicted_delta) is recorded alongside the hypothesis.

Day 1+ β€” Continuous loop. /autopilot takes over. The Zero-Cost Monitor polls training logs without LLM calls, so a stable run costs nothing while it churns. ntfy push notifications wake you only on anomalies, ceiling crosses, or genuine decisions β€” never on routine heartbeats. Every run produces a [LEARN] block on completion; the next prompt's UserPromptSubmit hook auto-injects the top-5 relevant rules via FTS5 BM25 search.

Week 1+ β€” Iteration compounds. Predicted-vs-actual deltas roll up per experiment so calibration becomes visible. Failures are classified bug | bad-hyperparam | bad-hypothesis and route [LEARN] or [DEAD-END] accordingly. The dead-end registry prevents new subagents from re-proposing what the team already ruled out. When your standards shift, refresh the relevant slice with /onboard --section <name> β€” identity drift gets an audit trail, not a silent overwrite.


The loop

                  β”Œβ”€ Plan ─────────── you talk to Claude
                  β”‚
                  β”œβ”€ Research ─────── subagents verify, check novelty
                  β”‚
                  β”œβ”€ Build ────────── write code, clean, comment the non-obvious
                  β”‚
                  β”œβ”€ Audit ────────── SEPARATE agent reviews shapes/gradients/stability
                  β”‚                   (the pre-run gate nobody else has)
                  β”‚
                  β”œβ”€ Run ──────────── execute; ntfy push on preemption / block / win
                  β”‚
                  β”œβ”€ Assess ───────── post-run review emits {is_bug, bad-hyperparam, bad-hypothesis}
                  β”‚
                  └─ Codify ───────── [LEARN] block β†’ workflow.db (FTS5)
                                      next prompt auto-injects relevant rules

Every cycle should make the next cycle better.


Install

git clone https://github.com/saml212/rockie-claude.git ~/rockie-claude
cd ~/rockie-claude
./install.sh ~/path/to/your/research-project

For AI agents installing this for a user

If you are an AI coding agent (Claude Code, Codex CLI, Cursor, etc.) and a user has asked you to install rockie-claude into their project, this block is the canonical instruction. Read it literally.

Run, from any shell:

git clone https://github.com/saml212/rockie-claude.git ~/rockie-claude && \
  ~/rockie-claude/install.sh "<absolute path to user's research project>"

The installer is idempotent β€” re-running on an existing project is safe. It writes to exactly two locations:

  • <project>/.claude/ β€” hooks, skills, memory schema, settings.json, project_id stamp, sentinels dir.
  • ~/.claude/ β€” cross-project memory lib, user-global hooks, the /deploy-team Node orchestrator + dashboard.

It also adds a managed block between # BEGIN rockie / # END rockie markers in the project's .gitignore (rules outside the markers are preserved). It does not edit source code outside .claude/, does not edit an existing CLAUDE.md (prints a template path instead), and does not edit an existing .env (creates one from .env.example only if absent).

After install, tell the user:

"Open Claude Code in your project. The SessionStart hook will notice no taste corpus exists and prompt /onboard β€” 5–7 questions, ~5 minutes, voice optional. After that, normal Claude Code workflow plus the harness intercepts."

Optionally verify with bash ~/rockie-claude/tests/smoke-test.sh β€” 75+ assertions, ~10 seconds, no API key required.

If you (the agent) need the longer explanation of what each component does and why, the rest of this README is written for you too β€” the skills table, the [LEARN] protocol section, and docs/ARCHITECTURE.md are the highest-density entry points.


The installer:

  1. Copies project-harness/ β†’ <your-project>/.claude/
  2. Copies user-harness/ β†’ ~/.claude/
  3. Initializes workflow.db (FTS5 required β€” pinned to /usr/bin/sqlite3)
  4. Seeds harness rules + 5 mode templates (default, paper-crunch, exploratory, dogfooding, learning)
  5. Prints a CLAUDE.md template path to drop into your repo root.

On first session: SessionStart hook prompts you to run /onboard β€” 5–7 questions, ~5 minutes, voice optional. Produces your taste corpus.

Verify the install: bash tests/smoke-test.sh runs 75+ assertions (hooks fire, FTS5 search, atomic queue claim, installer idempotency, path-traversal refusal, budget-ceiling enforcement, autopilot end-to-end with mock launcher, schema migrations, autopilot.conf safe parser, GPU router with fake providers). CI runs the same on every push. ~10 seconds, no API key.

See docs/install.md for manual install, docs/quickstart.md for first-session walkthrough, docs/ntfy-setup.md for push notifications (optional).


The skills you'll invoke

Skill What
/onboard researcher-taste interview β†’ six-file taste/ corpus that auto-loads every session
/mode swap operational overlays (paper-crunch / exploratory / dogfooding / learning / your own)
/deploy-team dispatch adversarial subagent gauntlets β€” Python local + Node global with worktrees
/clean pre-commit anti-slop audit + sentinel; gates git commit
/propose-harness-change package an upstream-back patch with Generator/Verifier/Updater review
/queue-refill brainstorm 3–5 new high-quality experiments when the queue runs dry
/post-run-review structured review after every training/eval run; emits [LEARN] or [DEAD-END]
/autopilot continuous-operation mode for days-long autonomous work

The [LEARN] protocol

When Claude learns something durable mid-session, it emits:

[LEARN] <category>: <one-line rule>
Mistake: <what went wrong>
Correction: <what the right approach is>

The Stop hook parses, dedupes by (project, category, rule), inserts into .claude/memory/workflow.db. On the next prompt, the UserPromptSubmit hook tokenizes the prompt, runs an FTS5 BM25 search over the learnings, and injects the top-5 relevant rules β€” but only if the best match is genuinely strong (BM25 score < -4). No noise.


Use it on an existing project

If you already have a research repo and just want rockie's hooks, skills, and memory layered on top:

git clone https://github.com/saml212/rockie-claude.git ~/rockie-claude
~/rockie-claude/install.sh ~/your-research-project

What gets written:

  • <your-project>/.claude/ β€” the project-harness: hooks, skills, memory schema, settings.json, project_id stamp, sentinels dir.
  • ~/.claude/ β€” the user-harness: cross-project memory lib, user-global hooks, the /deploy-team Node orchestrator + dashboard.

What stays untouched:

  • All existing source code in your repo. The installer never edits anything outside .claude/ and .gitignore (it adds a managed block between # BEGIN rockie / # END rockie markers; rules outside the markers are never touched).
  • An existing CLAUDE.md is preserved. If absent, the installer prints the template path and you copy it in deliberately.
  • Your existing .env is left alone; the installer creates one from .env.example only if it doesn't exist.

First time you open Claude Code in that project, the SessionStart hook spots the missing .rockie/taste/INDEX.md and prompts /onboard (~5 minutes, voice optional). After that, normal Claude Code workflow + harness intercepts.

Use it on a new project

Starting fresh:

git clone https://github.com/saml212/rockie-claude.git ~/rockie-claude
mkdir ~/your-research-project && cd ~/your-research-project
git init
~/rockie-claude/install.sh .

What install.sh creates:

  • Skeleton .claude/ (hooks, scripts, skills, memory dir, .state dir)
  • workflow.db initialized + seeded with the harness rules
  • A managed .gitignore block (so workflow.db, .state/, sentinels stay out of git)
  • A printed pointer to the CLAUDE.md template β€” copy it, edit the Project section, commit.

First session walkthrough:

  1. Open Claude Code in the new project.
  2. SessionStart hook prompts /onboard. Five to seven questions, ~5 minutes; voice optional. Output: a six-file .rockie/taste/ corpus committed to your repo.
  3. Talk to Claude. Subagents verify novelty; the pre-launch audit reads shapes/gradients before any GPU dollars are spent.
  4. After the first run completes, /post-run-review emits a [LEARN] block and the next prompt's UserPromptSubmit hook auto-injects relevant rules. The loop is now closed.

Bring your own GPU (bypass the spot-procurement flow)

rockie's default GPU layer is a cross-provider router (RunPod / Vast / Prime / Verda) with min-bid spot defaults and provider-hop on preemption. That's the right choice if you have nothing pre-configured.

If you already have GPU infra β€” a university cluster, on-prem H100s, your own AWS account, an SSH-tunneled workstation, or credentials at a provider we don't route to β€” rockie steps out of the way.

One-time setup:

# in your-research-project/
echo 'ROCKIE_GPU_MODE=custom' >> .env

Then in your first agent session, ask Claude about GPUs. The agent runs /gpu-custom-setup β€” a Q&A that captures your auth, provision, connect, monitor, and terminate flow, then writes .claude/gpu-custom.md. Subsequent sessions read that file when GPUs come up.

Minimal autopilot.conf for an SSH-based launcher:

LAUNCHER_CMD=/usr/local/bin/my-launch.sh
ROCKIE_GPU_MODE=custom

my-launch.sh is whatever you already use to dispatch a training run (sbatch + sshfs, ssh + nohup, terraform apply, etc.). The autopilot loop, the Zero-Cost Monitor, the budget gate, the post-run review, and the dead-end registry all keep working β€” only the gpu.py cross-provider router is bypassed.

After that, /gpu-custom is the agent's runtime gateway: provision / connect / status / cost / terminate are all routed through your captured flow, not through rockie's router. The terminate command is run verbatim from your setup file β€” never improvised β€” because cost-sensitive teardown deserves zero ambiguity.

See project-harness/skills/gpu-custom-setup/SKILL.md for the full Q&A spec and project-harness/skills/gpu-custom/SKILL.md for the runtime routing rules.

Contributing back upstream

/upstream-contribute is the meta-loop: rockie users improve rockie itself as they work. After /clean finishes an audit, the skill surfaces a nudge β€” "Anything in this session worth upstreaming?" β€” and on opt-in, scans the session for generalizable patterns (pruning fixes, small skill improvements, new hooks, cross-discipline capabilities, memory-schema upgrades), strips project-specific specificity, and dispatches a writer sub-agent that forks saml212/rockie-claude, applies the change on a contrib/<slug> branch, runs the smoke test, and opens a PR. The agent never auto-merges; the maintainers review; the next release ships the pattern to everyone.

The bar is generalizability. Domain-specific changes stay in your fork via /propose-harness-change. Anything that would require revealing internal project context to make sense gets refused.

See project-harness/skills/upstream-contribute/SKILL.md for the full Scout / Generator / Verifier / Updater flow and the leak-protection rules the writer sub-agent enforces.

Licensing

Apache-2.0. See LICENSE.

Ports from other open-source harnesses are credited in docs/PORTS.md. We only vendor MIT/Apache-2.0 code; patterns from restrictively-licensed harnesses are clean-room reimplemented.

Contributing

  • Every port must cite source file + line range.
  • Every new feature must compose with existing differentiators (taste corpus, modes, pre-run audit, [LEARN] DB, waterfall, journal tree, experiment-runs/, /deploy-team, pre-commit sentinel). Duplicates get rejected. See docs/_meta/PHILOSOPHY.md.
  • Run /clean before committing β€” the pre-commit-gate hook enforces it.

Upstream-back from agents β€” two paths. If an agent using rockie-claude in your own project discovers a harness-level improvement, it can emit [LEARN harness-upstream] … mid-session.

  • /propose-harness-change β€” reviewed/verified patch against your OWN local rockie clone. The Generator/Verifier/Updater split keeps the agent from auto-committing; the human pushes when ready.
  • /upstream-contribute β€” the public meta-loop. Scans the session, strips project-specific specificity, dispatches a writer sub-agent to fork saml212/rockie-claude, applies the generalized change, runs the smoke test, and opens a PR. Maintainers review; the next release ships the pattern to everyone.

The agent never auto-pushes in either path.


Further reading

For users:

For agents and contributors working on rockie-claude itself:


Related projects

Acknowledgements

This harness was extracted from research originally driven on a learned-representations workspace; see pebbleml.com for the kind of project a researcher might run rockie-claude on.

About

Autonomous AI research harness for Claude Code. Sibling to rockie-codex (OpenAI Codex CLI). Apache-2.0.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors