Add MLE-Bench competition environment #1341
Conversation
Follow-up commit
Fresh validation:
uv run pytest tests/test_mle_bench_environment.py -q
# 8 passed
uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(type(loaded).__name__, loaded.taskset.taskset_id, mle_bench.grading_submission_row(row)['competition_id'])
PY
# Env mle-bench aerial-cactus-identification
uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed
uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted
git diff --check
# no output
An additional package-install smoke test passed for the environment package metadata:
uv run --with ./environments/mle_bench python - <<'PY'
from mle_bench import load_environment, grading_submission_row
loaded = load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(loaded.taskset.taskset_id, grading_submission_row(row)['competition_id'])
PY
# mle-bench aerial-cactus-identification
Addressed the two Cursor Bugbot findings in a follow-up commit.
Fresh validation:
uv run pytest tests/test_mle_bench_environment.py -q
# 10 passed
uv run python - <<'PY'
import sys
from pathlib import Path
sys.path.insert(0, str(Path('environments/mle_bench').resolve()))
import mle_bench
loaded = mle_bench.load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(type(loaded).__name__, loaded.taskset.taskset_id, mle_bench.grading_submission_row(row)['competition_id'])
PY
# Env mle-bench aerial-cactus-identification
uv run --with ./environments/mle_bench python - <<'PY'
from mle_bench import load_environment, grading_submission_row
loaded = load_environment(limit=1)
row = list(loaded.taskset.source())[0]
print(loaded.taskset.taskset_id, grading_submission_row(row)['competition_id'])
PY
# mle-bench aerial-cactus-identification
uv run ruff check environments/mle_bench tests/test_mle_bench_environment.py
# All checks passed
uv run ruff format --check environments/mle_bench tests/test_mle_bench_environment.py
# 2 files already formatted
git diff --check
# no output
Addressed the current-head Bugbot prompt-path finding in a follow-up commit.
Changes:
Validation after the fix:
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e246cbb.
Addressed the remaining current-head Bugbot workdir finding in a follow-up commit.
Changes:
Validation after the fix:
Note:

Summary
Adds an installable mle-bench environment for the Prime MLE-Bench bounty. This environment:
- enriches prompts from the upstream mlebench registry when that package is installed
- requires the agent to write /home/submission/submission.csv
- grades the submission by running /home/validate_submission.sh /home/submission/submission.csv when run in an MLE-Bench image with prepared data

/claim https://algora.io/PrimeIntellect-ai/bounties/ZC7X91RCWXHy4BL2
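For reference, a minimal usage sketch built only from the calls already exercised in the validation snippets above (load_environment, taskset.source, grading_submission_row). The split-selection keyword is not shown anywhere in this PR, so it is omitted here rather than guessed:

uv run --with ./environments/mle_bench python - <<'PY'
# Usage sketch, assuming only the API surface shown elsewhere in this PR.
# Split selection (low/lite/dev/all) exists per the overview, but its keyword
# name is not confirmed here, so only limit= is used.
from mle_bench import load_environment, grading_submission_row

loaded = load_environment(limit=3)
print(loaded.taskset.taskset_id)
for row in loaded.taskset.source():
    # Per-competition grading metadata, including the competition id
    # the sandbox validator is run against.
    print(grading_submission_row(row)["competition_id"])
PY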
Validation
Note
Medium Risk
Adds a new sandboxed evaluation environment that executes filesystem checks and a validation script inside the sandbox; correctness depends on runtime images and validator output parsing.
Overview
Adds a new installable mle-bench v1 environment that models each MLE-Bench competition as a sandboxed task requiring the agent to write /home/submission/submission.csv and earn reward only when /home/validate_submission.sh accepts it.
The taskset supports low/lite/dev/all splits, optionally enriches prompts from the upstream mlebench registry when available, and records validator stdout/stderr/exit code in rollout state. The change includes packaging (a pyproject.toml entry point), documentation, and unit tests, plus a README index update to list the new environment.
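As a rough sketch of the grading flow described above, not the environment's actual implementation: the reward check amounts to running the validator against the agent's submission inside the sandbox and keeping its stdout, stderr, and exit code for the rollout state. Whether acceptance is keyed off the exit code alone or off parsed validator output is an assumption here.

import subprocess

# Illustrative only: paths follow the MLE-Bench image layout described above;
# the real environment performs this check inside the competition sandbox.
SUBMISSION = "/home/submission/submission.csv"
VALIDATOR = "/home/validate_submission.sh"

def grade_submission() -> dict:
    result = subprocess.run(
        ["bash", VALIDATOR, SUBMISSION],
        capture_output=True,
        text=True,
    )
    # Assumption: the validator signals acceptance with exit code 0.
    # stdout/stderr/exit code are returned so they can be recorded in rollout state.
    return {
        "reward": 1.0 if result.returncode == 0 else 0.0,
        "validator_stdout": result.stdout,
        "validator_stderr": result.stderr,
        "validator_exit_code": result.returncode,
    }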
low/lite/dev/allsplits, optionally enriches prompts from the upstreammlebenchregistry when available, and records validator stdout/stderr/exit code in rollout state; includes packaging (pyproject.tomlentry point), documentation, and unit tests, plus a README index update to list the new environment.Reviewed by Cursor Bugbot for commit e246cbb. Bugbot is set up for automated code reviews on this repo. Configure here.