
feat: ProgramBench reverse-engineering environment #1351

Draft

sethkarten wants to merge 29 commits into main from feat/programbench-env

Conversation


@sethkarten sethkarten commented May 12, 2026

Summary

Adds a ProgramBench reverse-engineering environment for the verifiers eval framework.

Status: DRAFT — do not merge without Seth's explicit sign-off.

What's implemented

  • `environments/programbench/programbench.py` — full eval environment
  • Per-language Docker images (Go, C, C++, Rust) + pytest setup in sandbox
  • Binary upload from HuggingFace with `chmod 111` (execute-only) — agent can run but not decompile
  • Hidden test suite revealed only at scoring time
  • JUnit XML pytest scoring → fractional reward
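The JUnit-XML-to-fractional-reward step could look roughly like the sketch below. This is an assumption about the scoring rule (passed tests over total tests), not a copy of the logic in `programbench.py`; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def fractional_reward(junit_xml: str) -> float:
    """Hypothetical sketch: fraction of pytest tests passed, from JUnit XML."""
    root = ET.fromstring(junit_xml)
    # pytest --junitxml emits either a <testsuites> wrapper or a bare <testsuite>
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    total = sum(int(s.get("tests", 0)) for s in suites)
    failed = sum(int(s.get("failures", 0)) + int(s.get("errors", 0)) for s in suites)
    skipped = sum(int(s.get("skipped", 0)) for s in suites)
    if total == 0:
        return 0.0
    return (total - failed - skipped) / total
```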

Harness: mini-SWE-agent (paper-faithful)

Switched from verifiers RLM to `MiniSWEAgent` harness, matching the scaffold used in the ProgramBench paper.

Information leakage fix

Agent now gets only: README/docs, binary file type/size, and the binary path. All disassembly hints removed from prompt.
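A leak-free prompt builder along these lines would realize that contract. Everything here (function name, wording) is illustrative, not the actual `_build_instruction` in the PR; the point is that only documentation and binary metadata reach the agent.

```python
def build_instruction(readme: str, docs: str, binary_path: str,
                      file_type: str, size_bytes: int) -> str:
    """Hypothetical sketch of the minimal prompt: README/docs plus binary
    file type/size/path only -- no nm/strings/objdump output, no language hint."""
    parts = [
        "Reconstruct the program's source code so it passes the hidden tests.",
        f"Reference binary: {binary_path} ({file_type}, {size_bytes} bytes).",
        "The binary is execute-only (chmod 111): you may run it, not read it.",
    ]
    if readme:
        parts.append("README:\n" + readme)
    if docs:
        parts.append("Docs:\n" + docs)
    return "\n\n".join(parts)
```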

Infrastructure fixes

  • Worker heartbeat timeout: 30s → 1800s
  • `command_timeout` capped at 900s (Prime Intellect API limit)
  • `AGENT_TIMEOUT_SECONDS` = command_timeout − 60s (agent self-exits before hard kill)
  • `environment_timeout` raised from 120s → 600s (large Rust crates need long per-command timeout)
  • C sandbox lifetime: 10 min → 20 min (setup + apt-get takes ~5 min, leaving 15 min for agent)
  • Toolchain PATH injected into mini-swe-agent environment via `/etc/profile.d/pb_toolchain.sh`
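The timeout cascade above composes as in this sketch (constants are from this PR; the clamping logic is an assumption about how they fit together):

```python
# Timeout cascade sketch -- values from the PR description, composition assumed.
API_MAX_COMMAND_TIMEOUT = 900   # Prime Intellect execute_command cap (s)
ENVIRONMENT_TIMEOUT = 600       # per-command timeout inside the sandbox (s)
HEARTBEAT_TIMEOUT = 1800        # worker heartbeat timeout (s)

def derive_timeouts(timeout_minutes: int) -> dict:
    """Clamp the per-language budget to the API cap, then give the agent a
    60s self-exit margin so artifacts are collected before the hard kill."""
    command_timeout = min(timeout_minutes * 60, API_MAX_COMMAND_TIMEOUT)
    return {
        "command_timeout": command_timeout,
        "agent_timeout_seconds": command_timeout - 60,
    }
```

With a 20-minute sandbox this yields `command_timeout=900` and `agent_timeout_seconds=840`, matching the 840s figure in the agent-timeout note below.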

Eval results (gpt-4.1-mini via Prime Intellect, 191/195 tasks, 1 rollout)

| Language | n | avg reward | solved (>0) | errors |
|---|---:|---:|---:|---:|
| C | 29 | 0.151 | 21/29 (72%) | 0 |
| C++ | 11 | 0.084 | 4/11 (36%) | 0 |
| Go | 46 | 0.194 | 27/46 (59%) | 1 |
| Rust | 105 | 0.260 | 63/105 (60%) | 12 |
| All | 191 | 0.217 | 115/191 (60%) | 13 |

4/195 tasks dropped at harness shutdown (last batch); 191 results written.

Error note: All 13 errors are exit 124 from per-command timeout (environment.timeout=120s killing `cargo build`/`go build`). Fixed in `9b45becd` (environment_timeout 120s→600s). A clean rerun would recover those 13 tasks.

Agent-timeout note (C5b validation): Tasks that hit `AGENT_TIMEOUT_SECONDS=840s` exit with code 124 and are scored as 0. This is expected — the agent ran its full budget without solving. Distinct from the per-command errors above.

Stop conditions: `command_completed: 93.2%, has_error: 6.8%`

Dataset

195/200 tasks on HuggingFace (`PrimeIntellect/programbench-processed`):

  • Rust: 106, Go: 46, C: 32, C++: 11
  • 180/195 have binaries; 15 without binary work from docs only

Test plan

  • C smoke test: reward=0.41, stop=command_completed
  • Go smoke test: reward=0.0, pipeline ran end-to-end
  • Rust smoke test: reward=0.158, stop=command_completed
  • C++ smoke test: completes (json-tui TUI timeout is task-specific)
  • C5b validation (5 tasks): confirms C sandbox timeout fix, no SandboxNotRunningError
  • Full 191-task eval: avg=0.217, 60% solve rate
  • Clean rerun of 13 errored tasks with environment_timeout=600s (pending Seth sign-off)
  • Seth signs off before marking ready

🤖 Generated with Claude Code

Seth and others added 13 commits May 11, 2026 17:46
Verifiers.v1 environment where an RLM agent reconstructs source code
from compiled binaries, scored by fraction of pytest tests passed.

Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard langs; custom
  Rust image with pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on
  Debian 12 (PEP 668 restriction)
- Test archives hidden from agent at setup; uploaded at scoring time
  following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries
  so agents that omit PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; full dataset on
  PrimeIntellect/programbench-processed (HF private)

Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on
antonmedv__fx.86d0d34, confirming end-to-end pipeline works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ures

Some go.mod files specify newer patch versions (e.g. ariga/atlas requires
go1.24.11) which causes Go to try fetching the toolchain at build time,
failing in the subprocess environment. GOTOOLCHAIN=local forces use of
the installed Go version instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When `go build ./...` fails with no Go files at root (e.g. gdu, pixterm,
hostctl where main is under cmd/), grep for `package main` declarations
and build from the containing directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
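The fallback described above could be sketched like this (assumptions: GNU grep is available, and the helper name is hypothetical; the real logic lives in the environment's build step):

```shell
# When `go build ./...` fails with "no Go files" at the repo root, find the
# directory declaring `package main` and build from there instead.
find_main_dir() {
  grep -rl --include='*.go' '^package main' "$1" | head -n 1 | xargs dirname
}
# usage sketch: (cd "$(find_main_dir .)" && go build .)
```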
test_branches stores filenames as e.g. '1b991a57d4e9.tar' but actual
HF files are '1b991a57d4e9.tar.gz'. Appending .tar.gz directly produced
a double-extension path that 404'd. Normalise the stem first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
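The stem normalisation could look like the following sketch (function name hypothetical; the fix in the PR may differ in detail):

```python
def normalize_archive_name(name: str) -> str:
    """test_branches may store '<hash>.tar' while the HF repo holds
    '<hash>.tar.gz'. Strip any existing .tar/.tar.gz suffix before
    appending the real extension, avoiding double-extension 404s."""
    for suffix in (".tar.gz", ".tar"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name + ".tar.gz"
```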
Matches the version installed on the build node and satisfies go.mod
constraints that require >= 1.26 (e.g. cheat/cheat).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
primeintellect/programbench-toolchain needs Docker Hub creds to push.
Using rust:latest as fallback — the setup script installs pytest/tmux
inside the container so the pipeline still works, just without pre-warmed
cargo registry (slower for full evals).

TODO: push custom image and revert to primeintellect/programbench-toolchain:latest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /usr/local/cargo/bin to PATH (rust:latest puts cargo there, not /root/.cargo/bin)
- Update CARGO_HOME to /usr/local/cargo to match rust:latest layout
- Fix Dockerfile warmup block: heredoc-in-RUN fails Docker parser; use COPY instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
30s is too short for long-running agentic sandbox tasks; the worker event
loop blocks on concurrent prime_sandboxes API calls and misses heartbeats,
causing spurious restarts and lost in-flight results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the
ZMQ client, crashing the whole eval run. Record 0-reward outputs for the
affected group and continue so the eval can complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the
ZMQ client, crashing the whole eval run. Skip the failed group and log
the error; it will be retried on the next --resume invocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
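The skip-and-continue behaviour could be sketched as below. All names here are hypothetical stand-ins for the actual loop in `verifiers/envs/environment.py`; the point is that a sandbox-level `RuntimeError` fails one group, not the run.

```python
import logging

def score_groups(groups: dict, evaluate) -> dict:
    """Evaluate each group; on RuntimeError (sandbox ReadError / tunnel
    timeout) log and skip, leaving the group for the next --resume pass."""
    results = {}
    for group_id, group in groups.items():
        try:
            results[group_id] = evaluate(group)
        except RuntimeError as exc:
            logging.error("group %s failed, will retry on --resume: %s",
                          group_id, exc)
    return results
```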
@sethkarten (Author) commented

Eval run results (132/193 tasks, gpt-4.1-mini)

Overall avg reward: 0.103

| Language | n | Non-zero | Avg reward | Max |
|---|---:|---:|---:|---:|
| C | 23 | 7 (30%) | 0.131 | 0.937 |
| C++ | 10 | 0 (0%) | 0.000 | 0.000 |
| Go | 34 | 11 (32%) | 0.089 | 0.778 |
| Rust | 65 | 12 (18%) | 0.117 | 1.000 |

Reward distribution: 102 zeros, 17 in (0, 0.5), 7 in [0.5, 1), 6 perfect 1.0

61/193 tasks incomplete due to prime_sandboxes container lifetime limit (~12-17 min per container on box cmp0f64qu0044alfqcjbax5ou). Tasks involving slow Rust/C++ compilation consistently exceed the container window. Fix requires longer-lived sandbox containers or running on a different host.

C++ 0% rate: needs investigation

All 10 C++ tasks scored 0. This may be a pipeline issue (g++ invocation, missing headers in gcc:13-bookworm, or test harness mismatch) or simply model failure. Worth a targeted smoke test.

Fixes committed this session

  • environments/programbench/programbench.py: Rust PATH fix (/usr/local/cargo/bin + CARGO_HOME=/usr/local/cargo)
  • verifiers/serve/server/env_router.py + env_server.py: worker heartbeat timeout 30s → 1800s
  • verifiers/envs/environment.py: catch per-group exceptions, skip+retry instead of crashing eval
  • environments/programbench/docker/: Dockerfile heredoc fix, warmup crate added

Results file: environments/programbench/outputs/evals/programbench--openai--gpt-4.1-mini/cce96e04/results.jsonl

Seth and others added 16 commits May 12, 2026 08:40
…leakage

Switch from verifiers RLM to mini-SWE-agent harness, matching the ProgramBench
paper scaffold. Remove all leaking fields from the agent prompt: nm_output,
strings_output, objdump_head, compile_hint, and explicit language hint. Apply
chmod 111 (execute-only) to the reference binary so the agent can run but not
read/decompile it. Retained fields in info{} (prefixed _) for internal logging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mini-SWE-agent bash -lc doesn't inherit the Docker image's ENV PATH, so
Go/Cargo weren't accessible. Move toolchain env vars from get_env_vars()
into the per-task program.env dict, which gets merged into the command
process environment before mini-swe-agent's run script executes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PATH

mini-swe-agent runs bash -lc which sources /etc/profile and may reset PATH,
dropping Docker's ENV-injected toolchain dirs. Write a profile.d snippet in
setup so the bash login shell always has Go/Cargo/Rust in PATH.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
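The profile.d injection described above might look like this sketch (the file path `/etc/profile.d/pb_toolchain.sh` is from the PR; the exact PATH entries are an assumption based on the toolchain images mentioned elsewhere in this thread):

```shell
# Write a profile.d snippet so `bash -lc` (mini-swe-agent's shell) always has
# Go/Cargo on PATH, even after /etc/profile resets Docker's ENV-injected PATH.
write_toolchain_profile() {
  cat > "$1" <<'EOF'
export PATH=/usr/local/go/bin:/usr/local/cargo/bin:$PATH
export CARGO_HOME=/usr/local/cargo
EOF
}
# in the real setup: write_toolchain_profile /etc/profile.d/pb_toolchain.sh
```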
…i-swe-agent

DEFAULT_CLI_SANDBOX has command_timeout=900s which kills mini-swe-agent
before it can finish apt-get install + agent turns + compile (takes 15-17min).
Set command_timeout = timeout_minutes * 60 per language in sandbox_config()
so the full sandbox lifetime is available for the CLI command.

Also revert temporary logger.warning diagnostics back to logger.info.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The execute_command API enforces a 900s max timeout. timeout_min*60 for Go
(1200s) and Rust (2700s) exceeded this, causing HTTP 400 errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…meout hard kill

The sandbox API caps command_timeout at 900s. Without AGENT_TIMEOUT_SECONDS,
mini-swe-agent runs up to its 3600s default and gets killed at 900s with an error.
Set AGENT_TIMEOUT_SECONDS = command_timeout - 60 so the agent exits cleanly
~60s before the hard wall, allowing the harness to collect artifacts normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
120s per-command timeout kills large Rust cargo build runs (exit 124).
600s matches the sandbox command_timeout safety margin and lets
complex crates compile without hitting the timeout wall.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
10 min was too short: setup (apt-get + mini-swe-agent install) takes ~3-5 min,
leaving only 5 min for the agent. Tasks ran for 12+ min and hit
SandboxNotRunningError (pod killed at sandbox lifetime boundary).
20 min matches Go/C++ and leaves adequate headroom.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use `is not None` for dataset_name/dataset_split (consistent with
  other optional params; `or` was semantically wrong for str | None)
- Check exit codes on mkdir and profile.d write in setup (log warnings)
- Extract _hf_download() helper to deduplicate two near-identical
  hf_hub_download try-except blocks in _upload_binary/_setup_tests
- Fix state ordering in _setup_tests: set _pb_test_branch only after
  successful extraction (was set before upload/extract for hide=False)
- Add error handling for upload/extract in hide_tests_from_agent=False path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes from agent review of PR #1351:
- vf.ensure_keys(["HF_TOKEN","OPENAI_API_KEY"]) — add missing key, use vf. prefix
- _TEST_TIMEOUT per-language map; _run_tests now uses it (was fixed 300s)
- Drop unused _compile_hint from info dict
- Fill README.md placeholders

Pre-commit hooks:
- lint-agent: calls `claude --print` with .agents/skills/lint-agent/SKILL.md
- oo-agent: calls `claude --print` with .agents/skills/oo-agent/SKILL.md
  (only on environments/, set SKIP_OO_CHECK=1 to bypass)
Skill logic lives in markdown, not committed Python.

Also splits _evaluate into _compile + _run_tests (lint-agent threshold fix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin rust:latest → rust:1.92 for reproducibility (paper §4 constants)
- Add strace, ltrace, and no-wrap prohibition to SYSTEM_PROMPT (paper §3)
- Use max(test_branches) instead of [0] to pick most comprehensive branch
- Emit almost (≥95% tests) and resolved (100% tests) as state signals (paper §4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…est exposure

Drop nm_output, strings_output, objdump_head from info entirely — these are
reverse-engineering artifacts that have no place in the task dict. example_io
(expected I/O pairs) was already excluded from _build_instruction; add a comment
making this explicit.

Add logger.warning when hide_tests_from_agent=False so anyone running eval with
tests visible gets a loud reminder that this violates paper §3.

The agent's full context is now: readme + docs (documentation) + binary path.
No source code, no test content, no analysis artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The verifiers runtime injects OPENAI_BASE_URL + OPENAI_API_KEY into the
container pointing at its own proxy. OPENAI_API_KEY in the outer environment
is not needed when running via prime eval run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
huggingface_hub's get_token() checks env var, then ~/.cache/huggingface/token,
then huggingface-cli login cache. Passing token=None to hf_hub_download lets
it use that same chain instead of only os.environ.get('HF_TOKEN').

On remote nodes: either export HF_TOKEN=... or run huggingface-cli login once.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
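The lookup chain that `token=None` delegates to can be mimicked in stdlib Python; this is an illustrative re-implementation of the behaviour described above, not the actual `huggingface_hub.get_token()` source.

```python
import os
from pathlib import Path

def resolve_hf_token(env=None, cache_path="~/.cache/huggingface/token"):
    """Sketch of the chain: HF_TOKEN env var first, then the
    huggingface-cli login cache file, else None."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if token:
        return token
    cache = Path(cache_path).expanduser()
    if cache.is_file():
        return cache.read_text().strip()
    return None
```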
HF_TOKEN is required for private HuggingFace dataset + test archives.
Usage: cp .env.example .env, fill in token, then source .env before eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper allows up to 1000 steps per rollout. mini-SWE-agent defaults to 0
(unlimited). Passing agent.step_limit=1000 via extra_config_specs caps
steps at the paper spec. Wall clock (AGENT_TIMEOUT_SECONDS) remains the
binding constraint at our resource level but step limit is now explicit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>