Add MLE-Bench competition environment by poofeth · Pull Request #1341 · PrimeIntellect-ai/verifiers

poofeth · 2026-05-11T14:02:59Z

Summary

Adds an installable mle-bench environment for the Prime MLE-Bench bounty.

This environment:

represents MLE-Bench competitions as v1 task rows
defaults to the low/lite split and includes the official low-complexity competition IDs
optionally enriches prompts with descriptions from the upstream mlebench registry when that package is installed
uses an OpenCode harness with sandboxed task metadata
follows the benchmark submission contract: agents must create /home/submission/submission.csv
validates submissions through /home/validate_submission.sh /home/submission/submission.csv when run in an MLE-Bench image with prepared data
avoids downloading Kaggle data or requiring credentials during import/local unit tests

/claim https://algora.io/PrimeIntellect-ai/bounties/ZC7X91RCWXHy4BL2

Validation

uv run pytest tests/test_mle_bench_environment.py -q
# 6 passed

uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
print(type(loaded).__name__, loaded.taskset.taskset_id, len(list(loaded.taskset.source())))
PY
# Env mle-bench 1

uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed

uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted

git diff --check
# no output

Note

Medium Risk
Adds a new sandboxed evaluation environment that executes filesystem checks and a validation script inside the sandbox; correctness depends on runtime images and validator output parsing.

Overview
Adds a new installable mle-bench v1 environment that models each MLE-Bench competition as a sandboxed task requiring the agent to write /home/submission/submission.csv and earn reward only when /home/validate_submission.sh accepts it.

The taskset supports low/lite/dev/all splits, optionally enriches prompts from the upstream mlebench registry when available, and records validator stdout/stderr/exit code in rollout state; includes packaging (pyproject.toml entry point), documentation, and unit tests, plus a README index update to list the new environment.

^{Reviewed by Cursor Bugbot for commit e246cbb. Bugbot is set up for automated code reviews on this repo. Configure here.}

poofeth · 2026-05-11T14:03:21Z

/claim https://algora.io/PrimeIntellect-ai/bounties/ZC7X91RCWXHy4BL2

poofeth · 2026-05-11T14:06:09Z

Follow-up commit 8e220cf strengthens the MLE-Bench handoff:

added grading_submission_row(task) and grading_submission_jsonl(task) for the mlebench grade JSONL shape
added submission_nonempty and validator_available sandbox metrics
extended tests for grader handoff rows and the new sandbox metrics

Fresh validation:

uv run pytest tests/test_mle_bench_environment.py -q
# 8 passed

uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(type(loaded).__name__, loaded.taskset.taskset_id, mle_bench.grading_submission_row(row)['competition_id'])
PY
# Env mle-bench aerial-cactus-identification

uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed

uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted

git diff --check
# no output

poofeth · 2026-05-11T14:17:35Z

Additional package-install smoke passed for the environment package metadata:

uv run --with ./environments/mle_bench python - <<'PY'
from mle_bench import load_environment, grading_submission_row
loaded = load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(loaded.taskset.taskset_id, grading_submission_row(row)['competition_id'])
PY
# mle-bench aerial-cactus-identification

poofeth · 2026-05-11T14:22:21Z

Addressed the two Cursor Bugbot findings in commit edc3966:

replaced substring matching on "valid" with exact validator success-line detection, so "invalid" can no longer be accepted accidentally
removed the unused DEFAULT_SUBMISSION_JSONL constant
added regression coverage for invalid validator output and exact success-line parsing

Fresh validation:

uv run pytest tests/test_mle_bench_environment.py -q
# 10 passed

uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(type(loaded).__name__, loaded.taskset.taskset_id, mle_bench.grading_submission_row(row)['competition_id'])
PY
# Env mle-bench aerial-cactus-identification

uv run --with ./environments/mle_bench python - <<'PY'
from mle_bench import load_environment, grading_submission_row
loaded = load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(loaded.taskset.taskset_id, grading_submission_row(row)['competition_id'])
PY
# mle-bench aerial-cactus-identification

uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed

uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted

git diff --check
# no output

poofeth · 2026-05-11T14:33:34Z

Addressed the current-head Bugbot prompt-path finding in e246cbb.

Changes:

Removed default /home/... paths from the shared BENCHMARK_INSTRUCTIONS system prompt.
build_prompt now derives the competition instructions path, dataset directory, submission path, and validation command from the configured workdir, submission_path, and validate_script.
Added test_mle_bench_prompt_uses_configured_paths to cover custom path configurations and assert the default paths are not leaked into the prompt.

Validation after the fix:

uv run pytest tests/test_mle_bench_environment.py -q -> 11 passed
uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py -> passed
uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py -> passed
uv run --with ./environments/mle_bench python ... -> mle-bench aerial-cactus-identification
git diff --check -> passed

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit e246cbb. Configure here.}

poofeth · 2026-05-11T14:42:26Z

Addressed the remaining current-head Bugbot workdir finding in d2e6b9e.

Changes:

Added workdir to each MLE-Bench task record under task["info"].
valid_submission now passes the configured workdir as the validator command working directory instead of hardcoding /home.
Added test_mle_bench_valid_submission_uses_configured_workdir to cover non-default workdir, submission path, and validator path together.

Validation after the fix:

uv run pytest tests/test_mle_bench_environment.py -q -> 12 passed
uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py -> passed
uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py -> passed
PYTHONPATH=environments/mle_bench uv run python ... -> local import smoke loaded environments/mle_bench/mle_bench.py and printed mle-bench aerial-cactus-identification /home
git diff --check -> passed

Note: uv run --with ./environments/mle_bench ... reused a cached package archive for the same local package name during this session, so I used PYTHONPATH for the final smoke to prove the current checkout contents directly.

Add MLE-Bench environment

86a9b66

Improve MLE-Bench submission handoff

8e220cf

cursor Bot reviewed May 11, 2026

View reviewed changes

Comment thread environments/mle_bench/mle_bench.py Outdated

Comment thread environments/mle_bench/mle_bench.py Outdated

Fix MLE-Bench validator success detection

edc3966

cursor Bot reviewed May 11, 2026

View reviewed changes

Comment thread environments/mle_bench/mle_bench.py

Fix MLE-Bench prompt path configuration

e246cbb

cursor Bot reviewed May 11, 2026

View reviewed changes

Comment thread environments/mle_bench/mle_bench.py Outdated

Honor MLE-Bench workdir during validation

d2e6b9e

poofeth closed this May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MLE-Bench competition environment#1341

Add MLE-Bench competition environment#1341
poofeth wants to merge 5 commits into
PrimeIntellect-ai:mainfrom
poofeth:bounty/mle-bench-env

poofeth commented May 11, 2026 •

edited by cursor Bot

Loading

Uh oh!

poofeth commented May 11, 2026

Uh oh!

poofeth commented May 11, 2026

Uh oh!

Uh oh!

Uh oh!

poofeth commented May 11, 2026

Uh oh!

poofeth commented May 11, 2026

Uh oh!

Uh oh!

poofeth commented May 11, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

poofeth commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

poofeth commented May 11, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Uh oh!

poofeth commented May 11, 2026

Uh oh!

poofeth commented May 11, 2026

Uh oh!

Uh oh!

Uh oh!

poofeth commented May 11, 2026

Uh oh!

poofeth commented May 11, 2026

Uh oh!

Uh oh!

poofeth commented May 11, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

poofeth commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poofeth commented May 11, 2026 •

edited by cursor Bot

Loading