Closed
17 commits
2cd4e13
feat: upgraded InspectAI, OpenAI and MCP versions to support Gemini 3…
geelen Dec 2, 2025
eff6da2
feat: add ProgressiveMCPBench evaluation
geelen Dec 17, 2025
561d630
chore: update benchmark docs [skip ci]
github-actions[bot] Dec 17, 2025
bd1cf52
fix(groq): enable strict mode for function calling
geelen Dec 18, 2025
a74bb64
feat(progressivemcpbench): add deferred_mode=directory to remote MCP …
geelen Dec 22, 2025
3d2b72d
feat(remote_mcp): add base module structure and RemoteMCPHandler ABC
geelen Dec 22, 2025
02fba24
feat(remote_mcp): add GroqRemoteMCPHandler for Responses API with MCP
geelen Dec 22, 2025
14d22e1
feat(remote_mcp): add AnthropicRemoteMCPHandler with MCP connector an…
geelen Dec 22, 2025
3545894
feat(remote_mcp): add registry for provider dispatch
geelen Dec 22, 2025
fd15dd5
feat(progressivemcpbench): add provider-agnostic minimal-servers-remo…
geelen Dec 22, 2025
0322c72
fix(remote_mcp): detect provider via ModelAPI type instead of model n…
geelen Dec 22, 2025
0c4aa3a
fix(anthropic): correct MCP connector API structure
geelen Dec 22, 2025
b38d607
fix(remote_mcp): support both GroqAPI and GroqResponsesAPI in Groq ha…
geelen Dec 22, 2025
138ccf0
refactor: remove groq-responses provider
geelen Dec 22, 2025
af776b5
feat(remote_mcp): add Groq-Beta header for MCP deferred directory
geelen Dec 22, 2025
6b1b323
feat(remote_mcp): add directory-lazy tool discovery option for Groq
geelen Dec 22, 2025
b25ff9e
feat(progressivemcpbench): embed server list in meta__ls tool descrip…
geelen Dec 23, 2025
157 changes: 157 additions & 0 deletions docs/evals/progressivemcpbench.mdx
@@ -0,0 +1,157 @@
---
title: ProgressiveMCPBench
description: Evaluating LLM agents with progressive tool discovery strategies using MCP
---

# ProgressiveMCPBench

ProgressiveMCPBench is a benchmark for evaluating how effectively language models discover and use Model Context Protocol (MCP) tools. It tests agents on tasks that require MCP tools, ranging from file operations to API calls, while controlling *how* the tools are presented to the model.

## Quick Start

```bash
# Run with directory-style tool discovery
bench eval progressivemcpbench --model openai/gpt-4o -T strategy=directory

# Run with only required servers per task
bench eval progressivemcpbench --model openai/gpt-4o -T strategy=minimal-servers

# Run with only exact required tools per task
bench eval progressivemcpbench --model openai/gpt-4o -T strategy=minimal-tools

# Run with Groq server-side MCP (single-shot, no local tool loop)
bench eval progressivemcpbench --model groq-responses/openai/gpt-oss-20b -T strategy=minimal-servers-remote
```

The dataset is downloaded automatically on first run. The synthetic MCP HTTP server is also started automatically when you run the eval, except for `minimal-servers-remote`, which uses a hosted remote server.

## Architecture Overview

The benchmark uses a **synthetic MCP layer** that provides:
- **Deterministic responses** - Same inputs always produce same outputs
- **Fast execution** - No external network calls or third-party server startup
- **No dependencies** - No Node.js, npm packages, or external services required
- **Reliable evaluation** - No infrastructure failures or timeouts

```
┌─────────────────────────────────────────────────────────────┐
│                       Inspect AI Eval                       │
├─────────────────────────────────────────────────────────────┤
│  Strategy Layer                                             │
│  ├─ copilot ────────────► Semantic search discovery         │
│  ├─ directory ──────────► Filesystem-like exploration       │
│  ├─ minimal-servers ────► Server-level filtering            │
│  ├─ minimal-tools ──────► Exact tools only                  │
│  ├─ distraction-64 ─────► Required + distractors (64)       │
│  └─ distraction-128 ────► Required + distractors (128)      │
├─────────────────────────────────────────────────────────────┤
│  Synthetic HTTP MCP Server (localhost:8765)                 │
│  ├─ filesystem handler ──► data/files/root/                 │
│  ├─ table_lookup handler ► data/api/*.json                  │
│  ├─ excel_reader handler ► data/files/root/**/*.xlsx        │
│  ├─ static_json handler ─► Fixed responses                  │
│  └─ web_corpus handler ──► data/web/                        │
└─────────────────────────────────────────────────────────────┘
```
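
As a rough illustration of why a synthetic layer yields deterministic results, the sketch below dispatches tool calls to in-memory handlers keyed by handler type. The handler names mirror the diagram, but `TABLE_DATA`, `STATIC_RESPONSES`, `dispatch`, and the shape of the tool spec are hypothetical and are not taken from the benchmark's implementation.

```python
from typing import Any, Callable

# Hypothetical fixture data standing in for data/api/*.json and fixed responses.
TABLE_DATA: dict[str, dict[str, Any]] = {
    "weather": {"paris": {"temp_c": 18}, "tokyo": {"temp_c": 24}},
}
STATIC_RESPONSES: dict[str, Any] = {"ping": {"status": "ok"}}


def table_lookup(spec: dict[str, Any], params: dict[str, Any]) -> Any:
    """Return the row for params[key_field] from a fixed in-memory table."""
    table = TABLE_DATA[spec["table"]]
    key = str(params[spec["key_field"]]).lower()
    return table.get(key, {"error": "not found"})


def static_json(spec: dict[str, Any], params: dict[str, Any]) -> Any:
    """Return the same canned response regardless of params."""
    return STATIC_RESPONSES[spec["response"]]


HANDLERS: dict[str, Callable[[dict[str, Any], dict[str, Any]], Any]] = {
    "table_lookup": table_lookup,
    "static_json": static_json,
}


def dispatch(spec: dict[str, Any], params: dict[str, Any]) -> Any:
    """Route a tool call to its handler; no network or subprocess is involved."""
    return HANDLERS[spec["handler"]](spec, params)


# Assumed spec shape: a handler name plus handler-specific settings.
weather_tool = {"handler": "table_lookup", "table": "weather", "key_field": "city"}
print(dispatch(weather_tool, {"city": "Paris"}))
```

Because every handler reads only fixed data, repeating a call with the same arguments returns the same result, which is the property the benchmark relies on.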

---

## Strategies

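All strategies except `minimal-servers-remote` run against the local synthetic MCP server with deterministic responses; `minimal-servers-remote` uses a hosted remote server instead.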

| Strategy | Description | Use Case |
|--------------------------|-------------------------------------------------------------|-----------------------------------|
| `copilot` | Semantic search via `route()` + `execute-tool()` | Tests RAG-style tool discovery |
| `directory` | Filesystem-like exploration via `ls()` + `read-tool-file()` | Tests explicit tool browsing |
| `minimal-servers` | Direct access to required servers only | Tests server-level filtering |
| `minimal-servers-remote` | Groq server-side MCP (single-shot) | Tests remote MCP orchestration |
| `minimal-tools` | Direct access to exact tools needed | Upper bound on performance |
| `distraction-64` | Required tools + distractors (64 total) | Tests distraction resistance |
| `distraction-128` | Required tools + distractors (128 total) | Tests distraction resistance |

### Strategy Details

**Copilot** - The model uses two meta-tools:
- `route(query)` - Semantic search across all available tools using embeddings
- `execute-tool(server, tool, params)` - Execute a discovered tool
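
The routing idea can be sketched in a few lines. This is a toy, assuming a hypothetical `CATALOG` of tool descriptions and a bag-of-words vector in place of real embeddings; the benchmark's actual `route()` uses an embedding model and its own catalog.

```python
import math
from collections import Counter

# Hypothetical tool catalog: (server, tool) -> description.
CATALOG = {
    ("filesystem", "read_file"): "read the contents of a file from disk",
    ("weather", "get_forecast"): "get the weather forecast for a city",
    ("excel", "read_sheet"): "read rows from an excel spreadsheet",
}


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query: str, top_k: int = 2) -> list[tuple[str, str]]:
    """Return the top_k (server, tool) pairs whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(CATALOG, key=lambda key: cosine(q, embed(CATALOG[key])), reverse=True)
    return ranked[:top_k]


print(route("what is the weather forecast in Tokyo"))
```

The model would then pass one of the returned pairs to `execute-tool` together with the tool's parameters.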

**Directory** - Tools are presented as a filesystem:
- `ls(path)` - List available servers or tools within a server
- `read-tool-file(paths)` - Read tool descriptions (can batch multiple)
- `execute-tool(tool_path, params)` - Execute a tool by its path
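
A minimal sketch of the directory flow, assuming a hypothetical in-memory `REGISTRY` and path layout of `/<server>/<tool>`. The Python function names use underscores in place of the hyphenated tool names, and none of this is the benchmark's actual code.

```python
from typing import Any

# Hypothetical registry: server -> tool -> description and implementation.
REGISTRY: dict[str, dict[str, dict[str, Any]]] = {
    "math": {
        "add": {"description": "Add two numbers a and b.", "fn": lambda p: p["a"] + p["b"]},
    },
    "text": {
        "upper": {"description": "Upper-case a string s.", "fn": lambda p: p["s"].upper()},
    },
}


def _split(path: str) -> list[str]:
    """Split '/server/tool' into its non-empty components."""
    return [part for part in path.split("/") if part]


def ls(path: str = "/") -> list[str]:
    """List servers at the root, or the tools inside one server."""
    parts = _split(path)
    if not parts:
        return sorted(REGISTRY)
    return sorted(REGISTRY[parts[0]])


def read_tool_file(paths: list[str]) -> dict[str, str]:
    """Read one or more tool descriptions in a single batched call."""
    descriptions = {}
    for path in paths:
        server, tool = _split(path)
        descriptions[path] = REGISTRY[server][tool]["description"]
    return descriptions


def execute_tool(tool_path: str, params: dict[str, Any]) -> Any:
    """Execute a tool addressed by its path."""
    server, tool = _split(tool_path)
    return REGISTRY[server][tool]["fn"](params)


print(ls("/"))
print(read_tool_file(["/math/add"]))
print(execute_tool("/math/add", {"a": 2, "b": 3}))
```

The model pays a few extra turns to browse, but only the descriptions it chooses to read enter its context.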

**Minimal strategies** - No discovery needed; tools are provided directly based on task annotations.

**Minimal-servers-remote** - Uses Groq's server-side MCP capability. Instead of executing tools locally, the eval sends MCP server specs to Groq's Responses API, which handles all tool discovery and execution internally. This is a single-shot approach with no local tool-calling loop. It only works with the `groq-responses` provider.

```bash
# Example: Run with remote MCP
bench eval progressivemcpbench --model groq-responses/openai/gpt-oss-20b -T strategy=minimal-servers-remote
```

**Distraction strategies** - Required tools plus additional "distractor" tools, padded to exactly 64 or 128 tools total. Distractors are selected deterministically based on task ID.
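
One way to get deterministic padding is to seed a random generator from a hash of the task ID. The sketch below assumes that approach; the helper name `pad_with_distractors`, the SHA-256 seeding, and the tool names are illustrative assumptions, not the benchmark's actual selection logic.

```python
import hashlib
import random


def pad_with_distractors(
    task_id: str, required: list[str], pool: list[str], total: int
) -> list[str]:
    """Return the required tools plus distractors, padded to exactly `total` tools."""
    if len(required) > total:
        raise ValueError("more required tools than the target total")
    # Sort candidates so the result does not depend on the pool's input order.
    candidates = sorted(set(pool) - set(required))
    needed = total - len(required)
    if needed > len(candidates):
        raise ValueError("not enough distractor tools in the pool")
    # Derive a stable seed from the task ID (assumed scheme).
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return list(required) + rng.sample(candidates, needed)


pool = [f"server{i // 10}/tool{i}" for i in range(300)]
tools = pad_with_distractors("task-001", ["server0/tool3"], pool, total=64)
print(len(tools))
```

Seeding per task keeps the distractor set stable across runs and models, so score differences are not caused by a different draw of distractors.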

---

## Dataset

The dataset is automatically downloaded from [github.com/geelen/progressivemcpbench](https://github.com/geelen/progressivemcpbench) on first run and cached at `~/.openbench/progressivemcpbench/dataset/`.

You can override the dataset location by setting the `PROGRESSIVEMCPBENCH_DATA_DIR` environment variable.
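
The lookup order described above can be sketched as follows. The function name `resolve_dataset_dir` is hypothetical; only the environment variable name and the default cache path come from this page.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default cache location stated in the docs.
DEFAULT_DATASET_DIR = Path.home() / ".openbench" / "progressivemcpbench" / "dataset"


def resolve_dataset_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Prefer PROGRESSIVEMCPBENCH_DATA_DIR when set, else fall back to the cache."""
    env = os.environ if env is None else env
    override = env.get("PROGRESSIVEMCPBENCH_DATA_DIR")
    if override:
        return Path(override).expanduser()
    return DEFAULT_DATASET_DIR


print(resolve_dataset_dir({"PROGRESSIVEMCPBENCH_DATA_DIR": "/tmp/pmb-data"}))
```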

The dataset includes:
- **Task definitions** - Questions, expected answers, and tool requirements
- **Server configurations** - MCP server and tool schemas with handlers
- **Synthetic data** - Files, API responses, and web content for deterministic evaluation

---

## Scoring

Tasks are scored using LLM-as-a-judge with the SimpleQA grader template:

| Score | Meaning |
|-------|---------------------------------|
| 1.0 | Correct answer |
| 0.5 | Partially correct (fuzzy match) |
| 0.0 | Incorrect or no answer |

The model must output a JSON object with a `final_answer` field containing the answer.
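
For illustration, a tolerant extractor for that field might look like the sketch below. It assumes the JSON object can be surrounded by other text and that the last object with a `final_answer` key wins; the helper name `extract_final_answer` and these rules are assumptions, not the scorer's actual parsing.

```python
import json
from typing import Optional


def extract_final_answer(output: str) -> Optional[str]:
    """Return final_answer from the last top-level JSON object that has it, if any."""
    decoder = json.JSONDecoder()
    answer: Optional[str] = None
    idx = output.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and reports where it ends.
            obj, end = decoder.raw_decode(output, idx)
        except json.JSONDecodeError:
            idx = output.find("{", idx + 1)
            continue
        if isinstance(obj, dict) and "final_answer" in obj:
            answer = str(obj["final_answer"])
        idx = output.find("{", end)
    return answer


print(extract_final_answer('Done. {"final_answer": "42"}'))
```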

---

## CLI Reference

```bash
bench eval progressivemcpbench --model <model> -T strategy=<strategy> [options]

# Strategy options:
-T strategy=copilot # Semantic search discovery
-T strategy=directory # Filesystem-like exploration
-T strategy=minimal-servers # Only required servers
-T strategy=minimal-servers-remote # Groq server-side MCP (groq-responses only)
-T strategy=minimal-tools # Only exact tools
-T strategy=distraction-64 # Required + distractors (64 total)
-T strategy=distraction-128 # Required + distractors (128 total)

# Common options:
--limit N # Run only N tasks
--epochs N # Run each task N times
--epochs-reducer # How to combine epoch scores (mean, max, etc.)
--log-dir DIR # Directory for eval logs
```
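
To make the effect of `--epochs` and `--epochs-reducer` concrete, the sketch below combines per-epoch scores for a single task with the two reducers named above. The `reduce_epochs` helper and the example scores are illustrative only and are not part of the CLI.

```python
from statistics import mean
from typing import Callable, Sequence

# Reducers named in the CLI comment above; other reducers may exist.
REDUCERS: dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "max": max,
}


def reduce_epochs(scores: Sequence[float], reducer: str = "mean") -> float:
    """Collapse the scores from several epochs of one task into a single score."""
    return float(REDUCERS[reducer](scores))


# One task run for three epochs: correct, incorrect, partially correct.
epoch_scores = [1.0, 0.0, 0.5]
print(reduce_epochs(epoch_scores, "mean"))
print(reduce_epochs(epoch_scores, "max"))
```

With `mean`, a task counts in proportion to how often the model solves it; with `max`, a single successful epoch is enough.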

---

## Requirements

- Python 3.10+
- `OPENAI_API_KEY` environment variable (used for embeddings in the `copilot` strategy)

---

## References

- [Dataset repository](https://github.com/geelen/progressivemcpbench)
- Based on [LiveMCPBench](https://github.com/icip-cas/LiveMCPBench) with extensions for progressive tool discovery research.
14 changes: 14 additions & 0 deletions docs/snippets/benchmarks.data.mdx
@@ -6841,6 +6841,20 @@ export const benchmarksData = [
"function_name": "polyglotoxicity",
"is_alpha": false
},
{
"name": "ProgressiveMCPBench",
"description": "Evaluating LLM agents with progressive tool discovery strategies using MCP. Tests how effectively models discover and use MCP tools—from file operations to API calls—while controlling how tools are presented (copilot, directory, minimal-servers, minimal-tools, distraction modes).",
"category": "agents",
"tags": [
"agents",
"tools",
"mcp",
"tool-discovery",
"synthetic"
],
"function_name": "progressivemcpbench",
"is_alpha": true
},
{
"name": "PubMedQA",
"description": "Biomedical question answering from PubMed abstracts",
7 changes: 4 additions & 3 deletions packages/openbench-core/pyproject.toml
@@ -15,19 +15,20 @@ authors = [
 dependencies = [
     "datasets>=3.6.0",
     "groq>=0.33.0",
-    "inspect-ai==0.3.141",
+    "inspect-ai==0.3.151",
     "inspect_swe>=0.2.26",
     "anthropic>=0.69.0",
-    "openai>=2.0.0",
+    "openai>=2.8.0",
     "pillow>=10.0.0",
     "jsonschema>=4.23.0",
-    "mcp>=1.13.1",
+    "mcp>=1.22.0",
     "platformdirs>=4.0.0",
     "pydantic-settings>=2.9.1",
     "scipy>=1.15.3",
     "tiktoken>=0.11.0",
     "typer>=0.15.3",
     "numpy==2.2.6",
+    "aiohttp>=3.11.18",
 ]

[project.urls]
7 changes: 4 additions & 3 deletions pyproject.toml
@@ -16,19 +16,20 @@ authors = [
 dependencies = [
     "datasets>=3.6.0",
     "groq>=0.33.0",
-    "inspect-ai==0.3.141",
+    "inspect-ai==0.3.151",
     "inspect_swe>=0.2.26",
     "anthropic>=0.69.0",
-    "openai>=2.0.0",
+    "openai>=2.8.0",
     "pillow>=10.0.0",
     "jsonschema>=4.23.0",
-    "mcp>=1.13.1",
+    "mcp>=1.22.0",
     "platformdirs>=4.0.0",
     "pydantic-settings>=2.9.1",
     "scipy>=1.15.3",
     "tiktoken>=0.11.0",
     "typer>=0.15.3",
     "numpy==2.2.6",
+    "aiohttp>=3.11.18",
 ]

[project.urls]
21 changes: 21 additions & 0 deletions src/openbench/_cli/eval_command.py
@@ -22,6 +22,10 @@
     prepare_livemcpbench_cache,
     clear_livemcpbench_root,
 )
+from openbench.utils.progressivemcpbench_cache import (
+    prepare_progressivemcpbench_cache,
+    clear_progressivemcpbench_root,
+)
 from openbench.utils.factscore_cache import download_factscore_db


@@ -636,6 +640,14 @@ def run_eval(
             envvar="BENCH_KEEP_LIVEMCP_ROOT",
         ),
     ] = False,
+    keep_progressivemcp_root: Annotated[
+        bool,
+        typer.Option(
+            "--keep-progressivemcp-root",
+            help="Do not auto-clean ~/.openbench/progressivemcpbench/root after eval",
+            envvar="BENCH_KEEP_PROGRESSIVEMCP_ROOT",
+        ),
+    ] = False,
     alpha: Annotated[
         bool,
         typer.Option(
@@ -737,6 +749,10 @@ def run_eval(
     # auto-prepare caches for livemcpbench
     if "livemcpbench" in expanded_benchmarks:
         prepare_livemcpbench_cache()
+    if "progressivemcpbench" in expanded_benchmarks:
+        # Pass strategy to cache preparation so it knows whether embeddings are needed
+        strategy = task_args.get("strategy")
+        prepare_progressivemcpbench_cache(strategy=strategy)
     # auto-prepare CVEBench challenges directory
     if "cvebench" in expanded_benchmarks:
         from importlib import import_module
@@ -876,6 +892,11 @@ def run_eval(
     # Auto-clean root sandbox for livemcpbench unless opted out
     if "livemcpbench" in expanded_benchmarks and not keep_livemcp_root:
         clear_livemcpbench_root(quiet=False)
+    if (
+        "progressivemcpbench" in expanded_benchmarks
+        and not keep_progressivemcp_root
+    ):
+        clear_progressivemcpbench_root(quiet=False)
     if "factscore" in expanded_benchmarks:
         from openbench.scorers.factscore import cleanup_factscore_runners

20 changes: 20 additions & 0 deletions src/openbench/config.py
@@ -3385,6 +3385,26 @@ class EvalGroup:
module_path="openbench.evals.livemcpbench",
function_name="livemcpbench",
is_alpha=False,
),
"progressivemcpbench": BenchmarkMetadata(
name="ProgressiveMCPBench",
description=(
"Evaluating LLM agents with progressive tool discovery strategies using MCP. "
"Tests how effectively models discover and use MCP tools—from file operations "
"to API calls—while controlling how tools are presented (copilot, directory, "
"minimal-servers, minimal-tools, distraction modes)."
),
category="agents",
tags=[
"agents",
"tools",
"mcp",
"tool-discovery",
"synthetic",
],
module_path="openbench.evals.progressivemcpbench",
function_name="progressivemcpbench",
is_alpha=True,
), # GLUE/SuperGLUE benchmarks
"anli": BenchmarkMetadata(
name="ANLI (All Rounds)",
95 changes: 95 additions & 0 deletions src/openbench/datasets/progressivemcpbench.py
@@ -0,0 +1,95 @@
"""ProgressiveMCPBench dataset loader.

ProgressiveMCPBench is a benchmark for evaluating LLM agents on real-world tasks
using the Model Context Protocol (MCP). It uses a synthetic MCP server for
deterministic evaluation with exact/fuzzy answer matching.
"""

import json
import logging
from typing import Any, Optional

from inspect_ai.dataset import Dataset, MemoryDataset, Sample

from openbench.utils.progressivemcpbench_cache import (
    get_progressivemcpbench_dataset_dir,
)
logger = logging.getLogger(__name__)


def record_to_sample(record: dict[str, Any]) -> Optional[Sample]:
    """Convert a ProgressiveMCPBench record to an Inspect Sample.

    Args:
        record: A dictionary containing ProgressiveMCPBench fields.

    Returns:
        Sample: Converted sample for evaluation.
        None: If the record should be skipped (e.g. answer is None or empty).
    """
    # Skip tasks whose answer is explicitly null.
    raw_answer = record.get("answer")
    if raw_answer is None:
        return None

    # Normalize the answer to a single string.
    if isinstance(raw_answer, list):
        # For a list, take the first non-empty answer.
        answer = next((str(a).strip() for a in raw_answer if str(a).strip()), "")
    else:
        # Convert unconditionally so a falsy but valid answer such as 0 is kept.
        answer = str(raw_answer).strip()

    # Skip records with no usable answer.
    if not answer:
        return None

    metadata = {
        "category": record.get("category"),
        "file_name": record.get("file_name"),
        "annotator_metadata": record.get("Annotator Metadata", {}),
    }

    # Add tool requirement annotations if present (for minimal strategies).
    if "required_servers" in record:
        metadata["required_servers"] = record["required_servers"]
    if "required_tools" in record:
        metadata["required_tools"] = record["required_tools"]

    # Add scorer instructions if present.
    if record.get("scorer_instructions"):
        metadata["scorer_instructions"] = record["scorer_instructions"]

    return Sample(
        id=record["task_id"],
        input=record["Question"],
        target=answer,  # single answer string
        metadata=metadata,
    )


def get_dataset() -> Dataset:
    """Load ProgressiveMCPBench dataset.

    This dataset uses the synthetic MCP server for deterministic evaluation.
    It loads from the progressivemcpbench sibling repository.
    """
    dataset_dir = get_progressivemcpbench_dataset_dir()
    tasks_file = dataset_dir / "tasks" / "progressivemcpbench.json"
    try:
        with tasks_file.open("r", encoding="utf-8") as f:
            raw_records: list[dict[str, Any]] = json.load(f)
    except FileNotFoundError as e:
        logger.error(f"Dataset file not found: {tasks_file}")
        raise FileNotFoundError(
            f"Dataset file not found at {tasks_file}. "
            "Ensure the progressivemcpbench repository is cloned as a sibling directory."
        ) from e
    except json.JSONDecodeError as e:
        # A malformed file is a different failure from a missing one; report it as such.
        logger.error(f"Failed to parse dataset file {tasks_file}: {e}")
        raise ValueError(f"Dataset file {tasks_file} is not valid JSON.") from e

    samples: list[Sample] = []
    for record in raw_records:
        sample = record_to_sample(record)
        if sample is not None:
            samples.append(sample)

    return MemoryDataset(samples=samples, name="progressivemcpbench")