3 changes: 2 additions & 1 deletion environments/README.md
@@ -60,6 +60,7 @@ This folder contains installable example environments that showcase common usage
 - **bfcl_v3**: BFCL v3 function-calling eval using task-local dynamic tool schemas and v1 rewards.
 - **dspy_flights**: Sandboxed DSPy flight-support `program.fn` entrypoint installed from its package `pyproject.toml` and configured against the v1 interception endpoint.
 - **hello_group_reward_v1**: Deterministic v1 reference for group updates, metrics, rewards, advantages, and cleanup.
+- **mle_bench**: MLE-Bench competition-submission taskset using the benchmark CSV submission contract and validator.
 - **tau2_bench_v1**: `tau2-bench-v1` τ²-bench taskset/user/tool pattern on the v1 harness runtime.

 ### Composition
@@ -94,7 +95,7 @@ This folder contains installable example environments that showcase common usage
 - **CLI agent sandboxes**: `opencode_harbor`, `terminus_harbor`, `hello_mcp_harbor`
 - **MCP integration**: `mcp_search_env`, `hello_mcp_harbor`
 - **RLM (recursive LLM)**: `rlm_secrets`
-- **Taskset/Harness v1**: use this pattern for new environments that need reusable tasksets, reusable harnesses, framework programs, endpoint interception, or sandboxed Python/command programs. Examples include `dspy_rlm`, `openai_agents_env`, `langchain_deep_agents_wikispeedia`, `reverse_text`, `alphabet_sort`, `wiki_search`, `math_python`, `mcp_search_env`, `opencode_harbor`, `bfcl_v3`, `hello_subagent_v1`, `nested_harness_v1`, `hello_self_judge_v1`, `hello_parallel_sandbox_v1`, `hello_group_reward_v1`, `hello_rlm_v1`, `rlm_swe_v1`, `dspy_flights`, and `tau2-bench-v1`.
+- **Taskset/Harness v1**: use this pattern for new environments that need reusable tasksets, reusable harnesses, framework programs, endpoint interception, or sandboxed Python/command programs. Examples include `dspy_rlm`, `openai_agents_env`, `langchain_deep_agents_wikispeedia`, `reverse_text`, `alphabet_sort`, `wiki_search`, `math_python`, `mcp_search_env`, `opencode_harbor`, `bfcl_v3`, `hello_subagent_v1`, `nested_harness_v1`, `hello_self_judge_v1`, `hello_parallel_sandbox_v1`, `hello_group_reward_v1`, `hello_rlm_v1`, `rlm_swe_v1`, `mle_bench`, `dspy_flights`, and `tau2-bench-v1`.
 - `opencode_harbor` uses the packaged `vf.HarborTaskset` + `vf.OpenCode` boundary. These reusable implementations live under `verifiers.v1.packages` and are re-exported from `verifiers.v1`.
 - **Environment and rubric composition**: `math_group`, `math_python`, `wiki_search`
 - **Procedural datasets**: `reasoning_gym_env`
41 changes: 41 additions & 0 deletions environments/mle_bench/README.md
@@ -0,0 +1,41 @@
# mle-bench

V1 taskset/harness environment for MLE-Bench competition submissions.

```python
from mle_bench import load_environment

env = load_environment()
```

The taskset represents each MLE-Bench competition as a sandboxed machine-learning
engineering task. The model receives the benchmark-level instructions plus the
competition description and must create:

```text
/home/submission/submission.csv
```
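
As a minimal sketch of what satisfying that contract looks like from inside the sandbox (assuming `pandas` is available in the image; the columns below use the public spaceship-titanic schema purely for illustration, since the real schema comes from each competition's sample submission):

```python
# Illustrative solution program running inside the sandbox: write predictions
# to the expected submission path. Column names are assumptions drawn from the
# spaceship-titanic example, not a general contract.
import pandas as pd

predictions = pd.DataFrame({"PassengerId": ["0001_01"], "Transported": [True]})
predictions.to_csv("/home/submission/submission.csv", index=False)
```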

When run in an image that has the MLE-Bench data and the validation server/script
available, the reward function calls:

```bash
/home/validate_submission.sh /home/submission/submission.csv
```

and awards a reward of `1.0` only when the benchmark validator accepts the
submission. This keeps the environment aligned with the official MLE-Bench
submission contract without downloading Kaggle data during import or during
local unit tests.
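
A rough sketch of such a reward, assuming acceptance is signalled by a zero exit status (the actual environment may inspect the validator's output instead):

```python
# Illustrative only: run the benchmark validator against the submission and
# map the outcome to a binary reward. Assumes a zero exit status means the
# validator accepted the submission.
import subprocess

def submission_reward() -> float:
    result = subprocess.run(
        ["/home/validate_submission.sh", "/home/submission/submission.csv"],
        capture_output=True,
        text=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```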

For handoff to the benchmark grader, `grading_submission_row(task)` returns the
JSONL row expected by `mlebench grade`:

```json
{"competition_id": "spaceship-titanic", "submission_path": "/home/submission/submission.csv"}
```
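
A hedged usage sketch for assembling that manifest (importing `grading_submission_row` from `mle_bench` and iterating `env.taskset` are assumptions about the API, not confirmed names):

```python
# Write one JSONL row per task for `mlebench grade`.
import json

from mle_bench import grading_submission_row, load_environment

env = load_environment()
with open("submissions.jsonl", "w") as handle:
    for task in env.taskset:  # assumed attribute name for the task collection
        handle.write(json.dumps(grading_submission_row(task)) + "\n")
```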

By default, the environment uses the low-complexity ("lite") split. If the
`mlebench` Python package is installed, competition descriptions are loaded
from its registry; otherwise, the built-in competition IDs are still exposed so
the environment can be imported and tested without the upstream repo or Kaggle
credentials.
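
The fallback might look roughly like this (the `mlebench` registry import and the `description` attribute are assumptions about the upstream package, not this environment's actual internals):

```python
# Hypothetical illustration of the description fallback; names are assumed.
try:
    from mlebench.registry import registry  # upstream MLE-Bench package

    description = registry.get_competition("spaceship-titanic").description
except ImportError:
    # Without the upstream package, only the built-in competition ID is known.
    description = ""
```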