3 changes: 2 additions & 1 deletion environments/README.md
@@ -60,6 +60,7 @@ This folder contains installable example environments that showcase common usage
 - **bfcl_v3**: BFCL v3 function-calling eval using task-local dynamic tool schemas and v1 rewards.
 - **dspy_flights**: Sandboxed DSPy flight-support `program.fn` entrypoint installed from its package `pyproject.toml` and configured against the v1 interception endpoint.
 - **hello_group_reward_v1**: Deterministic v1 reference for group updates, metrics, rewards, advantages, and cleanup.
+- **mle_bench**: MLE-Bench competition-submission taskset using the benchmark CSV submission contract and validator.
 - **tau2_bench_v1**: `tau2-bench-v1` τ²-bench taskset/user/tool pattern on the v1 harness runtime.

 ### Composition
@@ -94,7 +95,7 @@ This folder contains installable example environments that showcase common usage
 - **CLI agent sandboxes**: `opencode_harbor`, `terminus_harbor`, `hello_mcp_harbor`
 - **MCP integration**: `mcp_search_env`, `hello_mcp_harbor`
 - **RLM (recursive LLM)**: `rlm_secrets`
-- **Taskset/Harness v1**: use this pattern for new environments that need reusable tasksets, reusable harnesses, framework programs, endpoint interception, or sandboxed Python/command programs. Examples include `dspy_rlm`, `openai_agents_env`, `langchain_deep_agents_wikispeedia`, `reverse_text`, `alphabet_sort`, `wiki_search`, `math_python`, `mcp_search_env`, `opencode_harbor`, `bfcl_v3`, `hello_subagent_v1`, `nested_harness_v1`, `hello_self_judge_v1`, `hello_parallel_sandbox_v1`, `hello_group_reward_v1`, `hello_rlm_v1`, `rlm_swe_v1`, `dspy_flights`, and `tau2-bench-v1`.
+- **Taskset/Harness v1**: use this pattern for new environments that need reusable tasksets, reusable harnesses, framework programs, endpoint interception, or sandboxed Python/command programs. Examples include `dspy_rlm`, `openai_agents_env`, `langchain_deep_agents_wikispeedia`, `reverse_text`, `alphabet_sort`, `wiki_search`, `math_python`, `mcp_search_env`, `opencode_harbor`, `bfcl_v3`, `hello_subagent_v1`, `nested_harness_v1`, `hello_self_judge_v1`, `hello_parallel_sandbox_v1`, `hello_group_reward_v1`, `hello_rlm_v1`, `rlm_swe_v1`, `mle_bench`, `dspy_flights`, and `tau2-bench-v1`.
 - `opencode_harbor` uses the packaged `vf.HarborTaskset` + `vf.OpenCode` boundary. These reusable implementations live under `verifiers.v1.packages` and are re-exported from `verifiers.v1`.
 - **Environment and rubric composition**: `math_group`, `math_python`, `wiki_search`
 - **Procedural datasets**: `reasoning_gym_env`
41 changes: 41 additions & 0 deletions environments/mle_bench/README.md
@@ -0,0 +1,41 @@
# mle-bench

V1 taskset/harness environment for MLE-Bench competition submissions.

```python
from mle_bench import load_environment

env = load_environment()
```

The taskset represents each MLE-Bench competition as a sandboxed machine-learning
engineering task. The model receives the benchmark-level instructions plus the
competition description and must create:

```text
/home/submission/submission.csv
```
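
As a minimal sketch of what satisfying that contract looks like from inside the sandbox (assuming `pandas` is available in the image; the columns below use the public spaceship-titanic schema purely for illustration, since the real schema comes from each competition's sample submission):

```python
# Illustrative solution program running inside the sandbox: write predictions
# to the expected submission path. Column names are assumptions drawn from the
# spaceship-titanic example, not a general contract.
import pandas as pd

predictions = pd.DataFrame({"PassengerId": ["0001_01"], "Transported": [True]})
predictions.to_csv("/home/submission/submission.csv", index=False)
```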

When run in an image that has the MLE-Bench data and the validation server/script
available, the reward function calls:

```bash
/home/validate_submission.sh /home/submission/submission.csv
```

and awards a reward of `1.0` only when the benchmark validator accepts the
submission. This keeps the environment aligned with the official MLE-Bench
submission contract without downloading Kaggle data during import or during
local unit tests.
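
A rough sketch of such a reward, assuming acceptance is signalled by a zero exit status (the actual environment may inspect the validator's output instead):

```python
# Illustrative only: run the benchmark validator against the submission and
# map the outcome to a binary reward. Assumes a zero exit status means the
# validator accepted the submission.
import subprocess

def submission_reward() -> float:
    result = subprocess.run(
        ["/home/validate_submission.sh", "/home/submission/submission.csv"],
        capture_output=True,
        text=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```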

For handoff to the benchmark grader, `grading_submission_row(task)` returns the
JSONL row expected by `mlebench grade`:

```json
{"competition_id": "spaceship-titanic", "submission_path": "/home/submission/submission.csv"}
```
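
A hedged usage sketch for assembling that manifest (importing `grading_submission_row` from `mle_bench` and iterating `env.taskset` are assumptions about the API, not confirmed names):

```python
# Write one JSONL row per task for `mlebench grade`.
import json

from mle_bench import grading_submission_row, load_environment

env = load_environment()
with open("submissions.jsonl", "w") as handle:
    for task in env.taskset:  # assumed attribute name for the task collection
        handle.write(json.dumps(grading_submission_row(task)) + "\n")
```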

By default, the environment uses the low-complexity ("lite") split. If the
`mlebench` Python package is installed, competition descriptions are loaded
from its registry; otherwise, the built-in competition IDs are still exposed so
the environment can be imported and tested without the upstream repo or Kaggle
credentials.
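
The fallback might look roughly like this (the `mlebench` registry import and the `description` attribute are assumptions about the upstream package, not this environment's actual internals):

```python
# Hypothetical illustration of the description fallback; names are assumed.
try:
    from mlebench.registry import registry  # upstream MLE-Bench package

    description = registry.get_competition("spaceship-titanic").description
except ImportError:
    # Without the upstream package, only the built-in competition ID is known.
    description = ""
```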