feat: ProgramBench reverse-engineering environment (#1351)
Draft
sethkarten wants to merge 29 commits into
Conversation
Verifiers.v1 environment where an RLM agent reconstructs source code from compiled binaries, scored by the fraction of pytest tests passed.

Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard languages; a custom Rust image with a pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on Debian 12 (PEP 668 restriction)
- Test archives hidden from the agent at setup; uploaded at scoring time, following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries, so agents that omit the PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; the full dataset is on PrimeIntellect/programbench-processed (HF private)

Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on antonmedv__fx.86d0d34, confirming the end-to-end pipeline works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…2a, fasttext, brotli, halite, blake3)
…ces fragile shell heredocs)
…ures Some go.mod files specify newer patch versions (e.g. ariga/atlas requires go1.24.11) which causes Go to try fetching the toolchain at build time, failing in the subprocess environment. GOTOOLCHAIN=local forces use of the installed Go version instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
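The GOTOOLCHAIN=local fix can be sketched as a small wrapper around the build subprocess; the function names here are illustrative, not the environment's actual API:

```python
import os
import subprocess

def go_env() -> dict[str, str]:
    """Environment for Go builds. GOTOOLCHAIN=local forces the
    installed Go version instead of letting a newer patch pin in
    go.mod (e.g. go1.24.11) trigger a toolchain download, which
    fails in the offline subprocess environment."""
    return dict(os.environ, GOTOOLCHAIN="local")

def go_build(workdir: str) -> subprocess.CompletedProcess:
    # Build every package with the local toolchain only.
    return subprocess.run(
        ["go", "build", "./..."],
        cwd=workdir, env=go_env(),
        capture_output=True, text=True,
    )
```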
When `go build ./...` fails with no Go files at root (e.g. gdu, pixterm, hostctl where main is under cmd/), grep for `package main` declarations and build from the containing directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_branches stores filenames as e.g. '1b991a57d4e9.tar' but actual HF files are '1b991a57d4e9.tar.gz'. Appending .tar.gz directly produced a double-extension path that 404'd. Normalise the stem first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
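The normalisation can be sketched as stripping any existing archive suffix before appending the real one (function name is illustrative):

```python
def normalise_archive_name(name: str) -> str:
    """test_branches stores stems like '1b991a57d4e9.tar' while the
    actual HF files are '1b991a57d4e9.tar.gz'. Strip any existing
    .tar/.tar.gz suffix first so appending .tar.gz never yields a
    double-extension path that 404s."""
    for suffix in (".tar.gz", ".tar"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name + ".tar.gz"
```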
Matches the version installed on the build node and satisfies go.mod constraints that require >= 1.26 (e.g. cheat/cheat). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
primeintellect/programbench-toolchain needs Docker Hub creds to push. Using rust:latest as fallback — the setup script installs pytest/tmux inside the container so the pipeline still works, just without pre-warmed cargo registry (slower for full evals). TODO: push custom image and revert to primeintellect/programbench-toolchain:latest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /usr/local/cargo/bin to PATH (rust:latest puts cargo there, not /root/.cargo/bin)
- Update CARGO_HOME to /usr/local/cargo to match the rust:latest layout
- Fix the Dockerfile warmup block: a heredoc inside RUN fails the Docker parser; use COPY instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
30s is too short for long-running agentic sandbox tasks; the worker event loop blocks on concurrent prime_sandboxes API calls and misses heartbeats, causing spurious restarts and lost in-flight results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the ZMQ client, crashing the whole eval run. Record 0-reward outputs for the affected group and continue so the eval can complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the ZMQ client, crashing the whole eval run. Skip the failed group and log the error; it will be retried on the next --resume invocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Eval run results (132/193 tasks, gpt-4.1-mini)

Overall avg reward: 0.103
Reward distribution: 102 zeros, 17 in (0, 0.5), 7 in [0.5, 1), 6 perfect 1.0
61/193 tasks incomplete due to the prime_sandboxes container lifetime limit (~12-17 min per container on box)

C++ 0% rate: needs investigation
All 10 C++ tasks scored 0. This may be a pipeline issue (g++ invocation, missing headers in gcc:13-bookworm, or a test harness mismatch) or simply model failure. Worth a targeted smoke test.

Fixes committed this session

Results file:
…leakage
Switch from verifiers RLM to mini-SWE-agent harness, matching the ProgramBench
paper scaffold. Remove all leaking fields from the agent prompt: nm_output,
strings_output, objdump_head, compile_hint, and explicit language hint. Apply
chmod 111 (execute-only) to the reference binary so the agent can run but not
read/decompile it. The retained fields stay in info{} (prefixed with _) for internal logging.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mini-SWE-agent bash -lc doesn't inherit the Docker image's ENV PATH, so Go/Cargo weren't accessible. Move toolchain env vars from get_env_vars() into the per-task program.env dict, which gets merged into the command process environment before mini-swe-agent's run script executes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
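The merge into the per-task environment can be sketched as below; the exact toolchain paths are assumptions pieced together from the commit messages, not the committed constants:

```python
# Toolchain locations assumed from the golang/rust images described
# above; the exact PATH entries are illustrative.
TOOLCHAIN_ENV = {
    "PATH": "/usr/local/go/bin:/usr/local/cargo/bin:/usr/local/bin:/usr/bin:/bin",
    "CARGO_HOME": "/usr/local/cargo",
    "GOTOOLCHAIN": "local",
}

def build_program_env(task_env: dict) -> dict:
    """Merge toolchain vars into the per-task program.env dict.
    bash -lc does not inherit the Docker image's ENV PATH, so the
    toolchain dirs must travel with the command's own process
    environment; per-task entries win on conflict."""
    merged = dict(TOOLCHAIN_ENV)
    merged.update(task_env)
    return merged
```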
… PATH mini-swe-agent runs bash -lc which sources /etc/profile and may reset PATH, dropping Docker's ENV-injected toolchain dirs. Write a profile.d snippet in setup so the bash login shell always has Go/Cargo/Rust in PATH. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i-swe-agent DEFAULT_CLI_SANDBOX has command_timeout=900s which kills mini-swe-agent before it can finish apt-get install + agent turns + compile (takes 15-17min). Set command_timeout = timeout_minutes * 60 per language in sandbox_config() so the full sandbox lifetime is available for the CLI command. Also revert temporary logger.warning diagnostics back to logger.info. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The execute_command API enforces a 900s max timeout. timeout_min*60 for Go (1200s) and Rust (2700s) exceeded this, causing HTTP 400 errors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…meout hard kill The sandbox API caps command_timeout at 900s. Without AGENT_TIMEOUT_SECONDS, mini-swe-agent runs up to its 3600s default and gets killed at 900s with an error. Set AGENT_TIMEOUT_SECONDS = command_timeout - 60 so the agent exits cleanly ~60s before the hard wall, allowing the harness to collect artifacts normally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
120s per-command timeout kills large Rust cargo build runs (exit 124). 600s matches the sandbox command_timeout safety margin and lets complex crates compile without hitting the timeout wall. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
10 min was too short: setup (apt-get + mini-swe-agent install) takes ~3-5 min, leaving only 5 min for the agent. Tasks ran for 12+ min and hit SandboxNotRunningError (pod killed at sandbox lifetime boundary). 20 min matches Go/C++ and leaves adequate headroom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use `is not None` for dataset_name/dataset_split (consistent with other optional params; `or` was semantically wrong for str | None)
- Check exit codes on mkdir and the profile.d write in setup (log warnings)
- Extract a _hf_download() helper to deduplicate two near-identical hf_hub_download try-except blocks in _upload_binary/_setup_tests
- Fix state ordering in _setup_tests: set _pb_test_branch only after successful extraction (was set before upload/extract for hide=False)
- Add error handling for upload/extract in the hide_tests_from_agent=False path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes from agent review of PR #1351:
- vf.ensure_keys(["HF_TOKEN","OPENAI_API_KEY"]) — add the missing key, use the vf. prefix
- _TEST_TIMEOUT per-language map; _run_tests now uses it (was a fixed 300s)
- Drop the unused _compile_hint from the info dict
- Fill README.md placeholders

Pre-commit hooks:
- lint-agent: calls `claude --print` with .agents/skills/lint-agent/SKILL.md
- oo-agent: calls `claude --print` with .agents/skills/oo-agent/SKILL.md (only on environments/; set SKIP_OO_CHECK=1 to bypass)

Skill logic lives in markdown, not committed Python. Also splits _evaluate into _compile + _run_tests (lint-agent threshold fix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin rust:latest → rust:1.92 for reproducibility (paper §4 constants)
- Add strace, ltrace, and a no-wrap prohibition to SYSTEM_PROMPT (paper §3)
- Use max(test_branches) instead of [0] to pick the most comprehensive branch
- Emit almost (≥95% tests) and resolved (100% tests) as state signals (paper §4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…est exposure Drop nm_output, strings_output, objdump_head from info entirely — these are reverse-engineering artifacts that have no place in the task dict. example_io (expected I/O pairs) was already excluded from _build_instruction; add a comment making this explicit. Add logger.warning when hide_tests_from_agent=False so anyone running eval with tests visible gets a loud reminder that this violates paper §3. The agent's full context is now: readme + docs (documentation) + binary path. No source code, no test content, no analysis artifacts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The verifiers runtime injects OPENAI_BASE_URL + OPENAI_API_KEY into the container pointing at its own proxy. OPENAI_API_KEY in the outer environment is not needed when running via prime eval run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
huggingface_hub's get_token() checks env var, then ~/.cache/huggingface/token,
then huggingface-cli login cache. Passing token=None to hf_hub_download lets
it use that same chain instead of only os.environ.get('HF_TOKEN').
On remote nodes: either export HF_TOKEN=... or run huggingface-cli login once.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HF_TOKEN is required for private HuggingFace dataset + test archives. Usage: cp .env.example .env, fill in token, then source .env before eval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper allows up to 1000 steps per rollout. mini-SWE-agent defaults to 0 (unlimited). Passing agent.step_limit=1000 via extra_config_specs caps steps at the paper spec. Wall clock (AGENT_TIMEOUT_SECONDS) remains the binding constraint at our resource level but step limit is now explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
ProgramBench reverse-engineering environment for verifiers eval framework.
Status: DRAFT — do not merge without Seth's explicit sign-off.
What's implemented
Harness: mini-SWE-agent (paper-faithful)
Switched from verifiers RLM to `MiniSWEAgent` harness, matching the scaffold used in the ProgramBench paper.
Information leakage fix
Agent now gets only: README/docs, binary file type/size, and the binary path. All disassembly hints removed from prompt.
Infrastructure fixes
Eval results (gpt-4.1-mini via Prime Intellect, 191/195 tasks, 1 rollout)
4/195 tasks dropped at harness shutdown (last batch); 191 results written.
Error note: All 13 errors are exit 124 from per-command timeout (environment.timeout=120s killing `cargo build`/`go build`). Fixed in `9b45becd` (environment_timeout 120s→600s). A clean rerun would recover those 13 tasks.
Agent-timeout note (C5b validation): Tasks that hit `AGENT_TIMEOUT_SECONDS=840s` exit with code 124 and are scored as 0. This is expected — the agent ran its full budget without solving. Distinct from the per-command errors above.
Stop conditions: `command_completed: 93.2%, has_error: 6.8%`
Dataset
195/200 tasks on HuggingFace (`PrimeIntellect/programbench-processed`):
Test plan
🤖 Generated with Claude Code