feat: ProgramBench reverse-engineering environment (#1351)
Draft
sethkarten wants to merge 29 commits into
Conversation
Verifiers.v1 environment where an RLM agent reconstructs source code from compiled binaries, scored by the fraction of pytest tests passed.

Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard languages; a custom Rust image with a pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on Debian 12 (PEP 668 restriction)
- Test archives hidden from the agent at setup; uploaded at scoring time, following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries, so agents that omit the PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; the full dataset is on PrimeIntellect/programbench-processed (HF private)

Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on antonmedv__fx.86d0d34, confirming the end-to-end pipeline works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…2a, fasttext, brotli, halite, blake3)
…ces fragile shell heredocs)
…ures Some go.mod files specify newer patch versions (e.g. ariga/atlas requires go1.24.11) which causes Go to try fetching the toolchain at build time, failing in the subprocess environment. GOTOOLCHAIN=local forces use of the installed Go version instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
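The GOTOOLCHAIN=local fix can be sketched as a small wrapper around the build subprocess; the function names here are illustrative, not the environment's actual API:

```python
import os
import subprocess

def go_env() -> dict[str, str]:
    """Environment for Go builds. GOTOOLCHAIN=local forces the
    installed Go version instead of letting a newer patch pin in
    go.mod (e.g. go1.24.11) trigger a toolchain download, which
    fails in the offline subprocess environment."""
    return dict(os.environ, GOTOOLCHAIN="local")

def go_build(workdir: str) -> subprocess.CompletedProcess:
    # Build every package with the local toolchain only.
    return subprocess.run(
        ["go", "build", "./..."],
        cwd=workdir, env=go_env(),
        capture_output=True, text=True,
    )
```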
When `go build ./...` fails with no Go files at root (e.g. gdu, pixterm, hostctl where main is under cmd/), grep for `package main` declarations and build from the containing directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_branches stores filenames as e.g. '1b991a57d4e9.tar' but actual HF files are '1b991a57d4e9.tar.gz'. Appending .tar.gz directly produced a double-extension path that 404'd. Normalise the stem first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
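The normalisation can be sketched as stripping any existing archive suffix before appending the real one (function name is illustrative):

```python
def normalise_archive_name(name: str) -> str:
    """test_branches stores stems like '1b991a57d4e9.tar' while the
    actual HF files are '1b991a57d4e9.tar.gz'. Strip any existing
    .tar/.tar.gz suffix first so appending .tar.gz never yields a
    double-extension path that 404s."""
    for suffix in (".tar.gz", ".tar"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name + ".tar.gz"
```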
Matches the version installed on the build node and satisfies go.mod constraints that require >= 1.26 (e.g. cheat/cheat). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
primeintellect/programbench-toolchain needs Docker Hub creds to push. Using rust:latest as fallback — the setup script installs pytest/tmux inside the container so the pipeline still works, just without pre-warmed cargo registry (slower for full evals). TODO: push custom image and revert to primeintellect/programbench-toolchain:latest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /usr/local/cargo/bin to PATH (rust:latest puts cargo there, not /root/.cargo/bin)
- Update CARGO_HOME to /usr/local/cargo to match the rust:latest layout
- Fix the Dockerfile warmup block: a heredoc inside RUN fails the Docker parser; use COPY instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
30s is too short for long-running agentic sandbox tasks; the worker event loop blocks on concurrent prime_sandboxes API calls and misses heartbeats, causing spurious restarts and lost in-flight results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the ZMQ client, crashing the whole eval run. Record 0-reward outputs for the affected group and continue so the eval can complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the ZMQ client, crashing the whole eval run. Skip the failed group and log the error; it will be retried on the next --resume invocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author
Eval run results (132/193 tasks, gpt-4.1-mini)

Overall avg reward: 0.103
Reward distribution: 102 zeros, 17 in (0, 0.5), 7 in [0.5, 1), 6 perfect 1.0
61/193 tasks incomplete due to the prime_sandboxes container lifetime limit (~12-17 min per container on box)

C++ 0% rate: needs investigation
All 10 C++ tasks scored 0. This may be a pipeline issue (g++ invocation, missing headers in gcc:13-bookworm, or a test harness mismatch) or simply model failure. Worth a targeted smoke test.

Fixes committed this session

Results file:
…leakage
Switch from verifiers RLM to mini-SWE-agent harness, matching the ProgramBench
paper scaffold. Remove all leaking fields from the agent prompt: nm_output,
strings_output, objdump_head, compile_hint, and explicit language hint. Apply
chmod 111 (execute-only) to the reference binary so the agent can run but not
read/decompile it. The retained fields stay in info{} (prefixed with _) for internal logging.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mini-SWE-agent bash -lc doesn't inherit the Docker image's ENV PATH, so Go/Cargo weren't accessible. Move toolchain env vars from get_env_vars() into the per-task program.env dict, which gets merged into the command process environment before mini-swe-agent's run script executes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
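The merge into the per-task environment can be sketched as below; the exact toolchain paths are assumptions pieced together from the commit messages, not the committed constants:

```python
# Toolchain locations assumed from the golang/rust images described
# above; the exact PATH entries are illustrative.
TOOLCHAIN_ENV = {
    "PATH": "/usr/local/go/bin:/usr/local/cargo/bin:/usr/local/bin:/usr/bin:/bin",
    "CARGO_HOME": "/usr/local/cargo",
    "GOTOOLCHAIN": "local",
}

def build_program_env(task_env: dict) -> dict:
    """Merge toolchain vars into the per-task program.env dict.
    bash -lc does not inherit the Docker image's ENV PATH, so the
    toolchain dirs must travel with the command's own process
    environment; per-task entries win on conflict."""
    merged = dict(TOOLCHAIN_ENV)
    merged.update(task_env)
    return merged
```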
… PATH mini-swe-agent runs bash -lc which sources /etc/profile and may reset PATH, dropping Docker's ENV-injected toolchain dirs. Write a profile.d snippet in setup so the bash login shell always has Go/Cargo/Rust in PATH. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i-swe-agent DEFAULT_CLI_SANDBOX has command_timeout=900s which kills mini-swe-agent before it can finish apt-get install + agent turns + compile (takes 15-17min). Set command_timeout = timeout_minutes * 60 per language in sandbox_config() so the full sandbox lifetime is available for the CLI command. Also revert temporary logger.warning diagnostics back to logger.info. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The execute_command API enforces a 900s max timeout. timeout_min*60 for Go (1200s) and Rust (2700s) exceeded this, causing HTTP 400 errors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…meout hard kill The sandbox API caps command_timeout at 900s. Without AGENT_TIMEOUT_SECONDS, mini-swe-agent runs up to its 3600s default and gets killed at 900s with an error. Set AGENT_TIMEOUT_SECONDS = command_timeout - 60 so the agent exits cleanly ~60s before the hard wall, allowing the harness to collect artifacts normally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
120s per-command timeout kills large Rust cargo build runs (exit 124). 600s matches the sandbox command_timeout safety margin and lets complex crates compile without hitting the timeout wall. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
10 min was too short: setup (apt-get + mini-swe-agent install) takes ~3-5 min, leaving only 5 min for the agent. Tasks ran for 12+ min and hit SandboxNotRunningError (pod killed at sandbox lifetime boundary). 20 min matches Go/C++ and leaves adequate headroom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use `is not None` for dataset_name/dataset_split (consistent with other optional params; `or` was semantically wrong for str | None)
- Check exit codes on mkdir and the profile.d write in setup (log warnings)
- Extract a _hf_download() helper to deduplicate two near-identical hf_hub_download try-except blocks in _upload_binary/_setup_tests
- Fix state ordering in _setup_tests: set _pb_test_branch only after successful extraction (was set before upload/extract for hide=False)
- Add error handling for upload/extract in the hide_tests_from_agent=False path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes from agent review of PR #1351:
- vf.ensure_keys(["HF_TOKEN","OPENAI_API_KEY"]) — add the missing key, use the vf. prefix
- _TEST_TIMEOUT per-language map; _run_tests now uses it (was a fixed 300s)
- Drop the unused _compile_hint from the info dict
- Fill README.md placeholders

Pre-commit hooks:
- lint-agent: calls `claude --print` with .agents/skills/lint-agent/SKILL.md
- oo-agent: calls `claude --print` with .agents/skills/oo-agent/SKILL.md (only on environments/; set SKIP_OO_CHECK=1 to bypass)

Skill logic lives in markdown, not committed Python. Also splits _evaluate into _compile + _run_tests (lint-agent threshold fix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin rust:latest → rust:1.92 for reproducibility (paper §4 constants)
- Add strace, ltrace, and a no-wrap prohibition to SYSTEM_PROMPT (paper §3)
- Use max(test_branches) instead of [0] to pick the most comprehensive branch
- Emit almost (≥95% tests) and resolved (100% tests) as state signals (paper §4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…est exposure Drop nm_output, strings_output, objdump_head from info entirely — these are reverse-engineering artifacts that have no place in the task dict. example_io (expected I/O pairs) was already excluded from _build_instruction; add a comment making this explicit. Add logger.warning when hide_tests_from_agent=False so anyone running eval with tests visible gets a loud reminder that this violates paper §3. The agent's full context is now: readme + docs (documentation) + binary path. No source code, no test content, no analysis artifacts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The verifiers runtime injects OPENAI_BASE_URL + OPENAI_API_KEY into the container pointing at its own proxy. OPENAI_API_KEY in the outer environment is not needed when running via prime eval run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
huggingface_hub's get_token() checks env var, then ~/.cache/huggingface/token,
then huggingface-cli login cache. Passing token=None to hf_hub_download lets
it use that same chain instead of only os.environ.get('HF_TOKEN').
On remote nodes: either export HF_TOKEN=... or run huggingface-cli login once.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HF_TOKEN is required for private HuggingFace dataset + test archives. Usage: cp .env.example .env, fill in token, then source .env before eval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper allows up to 1000 steps per rollout. mini-SWE-agent defaults to 0 (unlimited). Passing agent.step_limit=1000 via extra_config_specs caps steps at the paper spec. Wall clock (AGENT_TIMEOUT_SECONDS) remains the binding constraint at our resource level but step limit is now explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
ProgramBench reverse-engineering environment for verifiers eval framework.
Status: DRAFT — do not merge without Seth's explicit sign-off.
What's implemented
Harness: mini-SWE-agent (paper-faithful)
Switched from verifiers RLM to `MiniSWEAgent` harness, matching the scaffold used in the ProgramBench paper.
Information leakage fix
Agent now gets only: README/docs, binary file type/size, and the binary path. All disassembly hints removed from prompt.
Infrastructure fixes
Eval results (gpt-4.1-mini via Prime Intellect, 191/195 tasks, 1 rollout)
4/195 tasks dropped at harness shutdown (last batch); 191 results written.
Error note: All 13 errors are exit 124 from per-command timeout (environment.timeout=120s killing `cargo build`/`go build`). Fixed in `9b45becd` (environment_timeout 120s→600s). A clean rerun would recover those 13 tasks.
Agent-timeout note (C5b validation): Tasks that hit `AGENT_TIMEOUT_SECONDS=840s` exit with code 124 and are scored as 0. This is expected — the agent ran its full budget without solving. Distinct from the per-command errors above.
Stop conditions: `command_completed: 93.2%, has_error: 6.8%`
Dataset
195/200 tasks on HuggingFace (`PrimeIntellect/programbench-processed`):
Test plan
🤖 Generated with Claude Code