groq · geelen · Dec 2, 2025 · Dec 17, 2025 · Dec 17, 2025 · Dec 18, 2025
@@ -0,0 +1,157 @@
+---
+title: ProgressiveMCPBench
+description: Evaluating LLM agents with progressive tool discovery strategies using MCP
+---
+
+# ProgressiveMCPBench
+
+ProgressiveMCPBench is a benchmark for evaluating how effectively language models can discover and use Model Context Protocol (MCP) tools. It tests agents on tasks that require using MCP tools - from file operations to API calls - while controlling *how* tools are presented to the model.
+
+## Quick Start
+
+```bash
+# Run with directory-style tool discovery
+bench eval progressivemcpbench --model openai/gpt-4o -T strategy=directory
+
+# Run with only required servers per task
+bench eval progressivemcpbench --model openai/gpt-4o -T strategy=minimal-servers
+
+# Run with only exact required tools per task
+bench eval progressivemcpbench --model openai/gpt-4o -T strategy=minimal-tools
+
+# Run with Groq server-side MCP (single-shot, no local tool loop)
+bench eval progressivemcpbench --model groq-responses/openai/gpt-oss-20b -T strategy=minimal-servers-remote
+```
+
+The dataset is automatically downloaded on first run. The synthetic MCP HTTP server is started automatically when you run the eval (except for `minimal-servers-remote`, which uses a hosted remote server).
+
+## Architecture Overview
+
+The benchmark uses a **synthetic MCP layer** that provides:
+- **Deterministic responses** - Same inputs always produce same outputs
+- **Fast execution** - No external network latency or third-party server startup time
+- **No dependencies** - No Node.js, npm packages, or external services required
+- **Reliable evaluation** - No infrastructure failures or timeouts
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     Inspect AI Eval                         │
+├─────────────────────────────────────────────────────────────┤
+│  Strategy Layer                                             │
+│  ├─ copilot ────────────► Semantic search discovery         │
+│  ├─ directory ──────────► Filesystem-like exploration       │
+│  ├─ minimal-servers ────► Server-level filtering            │
+│  ├─ minimal-tools ──────► Exact tools only                  │
+│  ├─ distraction-64 ─────► Required + distractors (64)       │
+│  └─ distraction-128 ────► Required + distractors (128)      │
+├─────────────────────────────────────────────────────────────┤
+│  Synthetic HTTP MCP Server (localhost:8765)                 │
+│  ├─ filesystem handler ──► data/files/root/                 │
+│  ├─ table_lookup handler ► data/api/*.json                  │
+│  ├─ excel_reader handler ► data/files/root/**/*.xlsx        │
+│  ├─ static_json handler ─► Fixed responses                  │
+│  └─ web_corpus handler ──► data/web/                        │
+└─────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Strategies
+
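+All strategies are backed by the synthetic MCP layer with deterministic responses. Every strategy except `minimal-servers-remote` talks to the local synthetic server; `minimal-servers-remote` uses a hosted remote server instead.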
+
+| Strategy                 | Description                                                 | Use Case                          |
+|--------------------------|-------------------------------------------------------------|-----------------------------------|
+| `copilot`                | Semantic search via `route()` + `execute-tool()`            | Tests RAG-style tool discovery    |
+| `directory`              | Filesystem-like exploration via `ls()` + `read-tool-file()` | Tests explicit tool browsing      |
+| `minimal-servers`        | Direct access to required servers only                      | Tests with server-level filtering |
+| `minimal-servers-remote` | Groq server-side MCP (single-shot)                          | Tests remote MCP orchestration    |
+| `minimal-tools`          | Direct access to exact tools needed                         | Upper bound on performance        |
+| `distraction-64`         | Required tools + distractors (64 total)                     | Tests distraction resistance      |
+| `distraction-128`        | Required tools + distractors (128 total)                    | Tests distraction resistance      |
+
+### Strategy Details
+
+**Copilot** - The model uses two meta-tools:
+- `route(query)` - Semantic search across all available tools using embeddings
+- `execute-tool(server, tool, params)` - Execute a discovered tool
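The `route(query)` idea can be illustrated with a toy ranking function. This is only a sketch: the benchmark uses embeddings, whereas the code below substitutes bag-of-words cosine similarity so it runs without an API key. The `TOOLS` catalogue, the tool names, and the `top_k` parameter are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical catalogue: (server, tool) -> description.
TOOLS = {
    ("filesystem", "read_file"): "read the contents of a text file from disk",
    ("weather", "get_forecast"): "get the weather forecast for a city",
    ("spreadsheet", "read_sheet"): "read rows from an excel spreadsheet",
}


def _vec(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(query: str, top_k: int = 2) -> list[tuple[str, str]]:
    """Return the top_k (server, tool) pairs ranked by similarity to the query."""
    q = _vec(query)
    ranked = sorted(TOOLS, key=lambda k: _cosine(q, _vec(TOOLS[k])), reverse=True)
    return ranked[:top_k]


best = route("weather forecast for paris")[0]
```

The model would then pass the returned server and tool names to `execute-tool`.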
+
+**Directory** - Tools are presented as a filesystem:
+- `ls(path)` - List available servers or tools within a server
+- `read-tool-file(paths)` - Read tool descriptions (can batch multiple)
+- `execute-tool(tool_path, params)` - Execute a tool by its path
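A minimal in-memory sketch of the directory-style flow follows. The `/server/tool` path format, the `CATALOG` contents, and the handlers are assumptions made for illustration; the Python function names use underscores in place of the hyphenated tool names.

```python
from typing import Any, Callable

# Hypothetical catalogue: server -> tool -> (description, handler).
CATALOG: dict[str, dict[str, tuple[str, Callable[..., Any]]]] = {
    "filesystem": {
        "read_file": ("Read a text file by path.", lambda path: f"<contents of {path}>"),
    },
    "calculator": {
        "add": ("Add two numbers.", lambda a, b: a + b),
    },
}


def ls(path: str) -> list[str]:
    """List servers at the root, or the tools inside one server."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return sorted(CATALOG)
    return sorted(CATALOG[parts[0]])


def read_tool_file(paths: list[str]) -> dict[str, str]:
    """Return the description for each /server/tool path (batched)."""
    out = {}
    for p in paths:
        server, tool = p.strip("/").split("/")
        out[p] = CATALOG[server][tool][0]
    return out


def execute_tool(tool_path: str, params: dict[str, Any]) -> Any:
    """Run the handler registered at /server/tool with keyword params."""
    server, tool = tool_path.strip("/").split("/")
    return CATALOG[server][tool][1](**params)


servers = ls("/")
result = execute_tool("/calculator/add", {"a": 2, "b": 3})
```

An agent would typically call `ls("/")`, then `ls` on a promising server, read one or more tool files, and only then execute.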
+
+**Minimal strategies** - No discovery needed; tools are provided directly based on task annotations.
+
+**Minimal-servers-remote** - Uses Groq's server-side MCP capability. Instead of executing tools locally, the eval sends MCP server specs to Groq's Responses API, which handles all tool discovery and execution internally. This is a single-shot approach with no local tool-calling loop, and it only works with the `groq-responses` provider.
+
+```bash
+# Example: Run with remote MCP
+bench eval progressivemcpbench --model groq-responses/openai/gpt-oss-20b -T strategy=minimal-servers-remote
+```
+
+**Distraction strategies** - Required tools plus additional "distractor" tools, padded to exactly 64 or 128 tools total. Distractors are selected deterministically based on task ID.
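One way to pick distractors deterministically from a task ID is to seed a random generator from a stable hash. This is a sketch under assumptions, not the benchmark's actual selection code; `pick_tools`, the pool contents, and the use of SHA-256 are illustrative.

```python
import hashlib
import random


def pick_tools(task_id: str, required: list[str], pool: list[str], total: int) -> list[str]:
    """Return `required` plus distractors drawn from `pool`, padded to `total`."""
    candidates = sorted(set(pool) - set(required))
    needed = total - len(required)
    if needed < 0 or needed > len(candidates):
        raise ValueError("cannot pad to the requested total")
    # Seed from a stable hash of the task ID. Python's built-in hash() is
    # salted per process, so it would not be reproducible across runs.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return list(required) + rng.sample(candidates, needed)


# Hypothetical tool pool of 200 names.
pool = [f"server{i // 8}/tool{i}" for i in range(200)]
tools = pick_tools("task-001", ["server0/tool3"], pool, total=64)
```

Sorting the candidates before sampling keeps the result independent of the pool's input order.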
+
+---
+
+## Dataset
+
+The dataset is automatically downloaded from [github.com/geelen/progressivemcpbench](https://github.com/geelen/progressivemcpbench) on first run and cached at `~/.openbench/progressivemcpbench/dataset/`.
+
+You can override the dataset location by setting the `PROGRESSIVEMCPBENCH_DATA_DIR` environment variable.
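The lookup order described above can be sketched as follows. `resolve_dataset_dir` is a hypothetical helper, not the real `get_progressivemcpbench_dataset_dir`, and it assumes the environment variable points directly at the dataset directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default cache location stated in the README.
DEFAULT_DATASET_DIR = Path.home() / ".openbench" / "progressivemcpbench" / "dataset"


def resolve_dataset_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the override directory if set, else the default cache path."""
    env = os.environ if env is None else env
    override = env.get("PROGRESSIVEMCPBENCH_DATA_DIR")
    # Assumption: the variable names the dataset directory itself.
    return Path(override).expanduser() if override else DEFAULT_DATASET_DIR


dataset_dir = resolve_dataset_dir({"PROGRESSIVEMCPBENCH_DATA_DIR": "/tmp/pmb"})
```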
+
+The dataset includes:
+- **Task definitions** - Questions, expected answers, and tool requirements
+- **Server configurations** - MCP server and tool schemas with handlers
+- **Synthetic data** - Files, API responses, and web content for deterministic evaluation
+
+---
+
+## Scoring
+
+Tasks are scored using LLM-as-a-judge with the SimpleQA grader template:
+
+| Score | Meaning                         |
+|-------|---------------------------------|
+| 1.0   | Correct answer                  |
+| 0.5   | Partially correct (fuzzy match) |
+| 0.0   | Incorrect or no answer          |
+
+The model must output a JSON object with a `final_answer` field containing the answer.
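For illustration, a tolerant way to pull `final_answer` out of a model response might look like this. The benchmark's own parsing is not shown in this change, so `extract_final_answer` and its "last top-level JSON object wins" rule are assumptions.

```python
import json
from typing import Optional


def extract_final_answer(output: str) -> Optional[str]:
    """Return `final_answer` from the last top-level JSON object in `output`."""
    decoder = json.JSONDecoder()
    answer = None
    pos = output.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(output, pos)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            pos = output.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and "final_answer" in obj:
            answer = str(obj["final_answer"])
        pos = output.find("{", end)
    return answer


answer = extract_final_answer('Done. {"final_answer": "42"}')
```

Returning `None` when no such object exists lets the caller treat the response as "no answer", which scores 0.0 in the table above.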
+
+---
+
+## CLI Reference
+
+```bash
+bench eval progressivemcpbench --model <model> -T strategy=<strategy> [options]
+
+# Strategy options:
+-T strategy=copilot                 # Semantic search discovery
+-T strategy=directory               # Filesystem-like exploration
+-T strategy=minimal-servers         # Only required servers
+-T strategy=minimal-servers-remote  # Groq server-side MCP (groq-responses only)
+-T strategy=minimal-tools           # Only exact tools
+-T strategy=distraction-64          # Required + distractors (64 total)
+-T strategy=distraction-128         # Required + distractors (128 total)
+
+# Common options:
+--limit N          # Run only N tasks
+--epochs N         # Run each task N times
+--epochs-reducer   # How to combine epoch scores (mean, max, etc.)
+--log-dir DIR      # Directory for eval logs
+```
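To make the epoch options concrete, the sketch below shows how per-epoch scores for one task could be combined. It only illustrates the arithmetic; the actual reducers come from the eval framework, and `reduce_epochs` and `REDUCERS` are hypothetical names.

```python
from statistics import mean
from typing import Callable, Sequence

# Illustrative subset of reducers named in the CLI reference.
REDUCERS: dict[str, Callable[[Sequence[float]], float]] = {
    "mean": mean,
    "max": max,
}


def reduce_epochs(scores: Sequence[float], reducer: str = "mean") -> float:
    """Combine one task's per-epoch scores into a single score."""
    if not scores:
        raise ValueError("no epoch scores to reduce")
    return REDUCERS[reducer](scores)


# One task run for three epochs: correct, partially correct, incorrect.
epoch_scores = [1.0, 0.5, 0.0]
mean_score = reduce_epochs(epoch_scores, "mean")
best_score = reduce_epochs(epoch_scores, "max")
```

With these scores, `mean` rewards consistency while `max` reports whether the model ever solved the task.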
+
+---
+
+## Requirements
+
+- Python 3.10+
+- `OPENAI_API_KEY` (for embeddings in the `copilot` strategy)
+
+---
+
+## References
+
+- [Dataset repository](https://github.com/geelen/progressivemcpbench)
+- Based on [LiveMCPBench](https://github.com/icip-cas/LiveMCPBench) with extensions for progressive tool discovery research.
@@ -6841,6 +6841,20 @@ export const benchmarksData = [
     "function_name": "polyglotoxicity",
     "is_alpha": false
   },
+  {
+    "name": "ProgressiveMCPBench",
+    "description": "Evaluating LLM agents with progressive tool discovery strategies using MCP. Tests how effectively models discover and use MCP tools—from file operations to API calls—while controlling how tools are presented (copilot, directory, minimal-servers, minimal-tools, distraction modes).",
+    "category": "agents",
+    "tags": [
+      "agents",
+      "tools",
+      "mcp",
+      "tool-discovery",
+      "synthetic"
+    ],
+    "function_name": "progressivemcpbench",
+    "is_alpha": true
+  },
   {
     "name": "PubMedQA",
     "description": "Biomedical question answering from PubMed abstracts",

@@ -15,19 +15,20 @@ authors = [
 dependencies = [
     "datasets>=3.6.0",
     "groq>=0.33.0",
-    "inspect-ai==0.3.141",
+    "inspect-ai==0.3.151",
     "inspect_swe>=0.2.26",
     "anthropic>=0.69.0",
-    "openai>=2.0.0",
+    "openai>=2.8.0",
     "pillow>=10.0.0",
     "jsonschema>=4.23.0",
-    "mcp>=1.13.1",
+    "mcp>=1.22.0",
     "platformdirs>=4.0.0",
     "pydantic-settings>=2.9.1",
     "scipy>=1.15.3",
     "tiktoken>=0.11.0",
     "typer>=0.15.3",
     "numpy==2.2.6",
+    "aiohttp>=3.11.18",
 ]
 
 [project.urls]

@@ -16,19 +16,20 @@ authors = [
 dependencies = [
     "datasets>=3.6.0",
     "groq>=0.33.0",
-    "inspect-ai==0.3.141",
+    "inspect-ai==0.3.151",
     "inspect_swe>=0.2.26",
     "anthropic>=0.69.0",
-    "openai>=2.0.0",
+    "openai>=2.8.0",
     "pillow>=10.0.0",
     "jsonschema>=4.23.0",
-    "mcp>=1.13.1",
+    "mcp>=1.22.0",
     "platformdirs>=4.0.0",
     "pydantic-settings>=2.9.1",
     "scipy>=1.15.3",
     "tiktoken>=0.11.0",
     "typer>=0.15.3",
     "numpy==2.2.6",
+    "aiohttp>=3.11.18",
 ]
 
 [project.urls]

@@ -22,6 +22,10 @@
     prepare_livemcpbench_cache,
     clear_livemcpbench_root,
 )
+from openbench.utils.progressivemcpbench_cache import (
+    prepare_progressivemcpbench_cache,
+    clear_progressivemcpbench_root,
+)
 from openbench.utils.factscore_cache import download_factscore_db
 
 
@@ -636,6 +640,14 @@ def run_eval(
             envvar="BENCH_KEEP_LIVEMCP_ROOT",
         ),
     ] = False,
+    keep_progressivemcp_root: Annotated[
+        bool,
+        typer.Option(
+            "--keep-progressivemcp-root",
+            help="Do not auto-clean ~/.openbench/progressivemcpbench/root after eval",
+            envvar="BENCH_KEEP_PROGRESSIVEMCP_ROOT",
+        ),
+    ] = False,
     alpha: Annotated[
         bool,
         typer.Option(
@@ -737,6 +749,10 @@ def run_eval(
         # auto-prepare caches for livemcpbench
         if "livemcpbench" in expanded_benchmarks:
             prepare_livemcpbench_cache()
+        if "progressivemcpbench" in expanded_benchmarks:
+            # Pass strategy to cache preparation so it knows whether embeddings are needed
+            strategy = task_args.get("strategy")
+            prepare_progressivemcpbench_cache(strategy=strategy)
         # auto-prepare CVEBench challenges directory
         if "cvebench" in expanded_benchmarks:
             from importlib import import_module
@@ -876,6 +892,11 @@ def run_eval(
         # Auto-clean root sandbox for livemcpbench unless opted out
         if "livemcpbench" in expanded_benchmarks and not keep_livemcp_root:
             clear_livemcpbench_root(quiet=False)
+        if (
+            "progressivemcpbench" in expanded_benchmarks
+            and not keep_progressivemcp_root
+        ):
+            clear_progressivemcpbench_root(quiet=False)
         if "factscore" in expanded_benchmarks:
             from openbench.scorers.factscore import cleanup_factscore_runners
 

@@ -3385,6 +3385,26 @@ class EvalGroup:
         module_path="openbench.evals.livemcpbench",
         function_name="livemcpbench",
         is_alpha=False,
+    ),
+    "progressivemcpbench": BenchmarkMetadata(
+        name="ProgressiveMCPBench",
+        description=(
+            "Evaluating LLM agents with progressive tool discovery strategies using MCP. "
+            "Tests how effectively models discover and use MCP tools—from file operations "
+            "to API calls—while controlling how tools are presented (copilot, directory, "
+            "minimal-servers, minimal-tools, distraction modes)."
+        ),
+        category="agents",
+        tags=[
+            "agents",
+            "tools",
+            "mcp",
+            "tool-discovery",
+            "synthetic",
+        ],
+        module_path="openbench.evals.progressivemcpbench",
+        function_name="progressivemcpbench",
+        is_alpha=True,
     ),  # GLUE/SuperGLUE benchmarks
     "anli": BenchmarkMetadata(
         name="ANLI (All Rounds)",

@@ -0,0 +1,95 @@
+"""ProgressiveMCPBench dataset loader.
+
+ProgressiveMCPBench is a benchmark for evaluating LLM agents on real-world tasks
+using the Model Context Protocol (MCP). It uses a synthetic MCP server for
+deterministic evaluation with exact/fuzzy answer matching.
+"""
+
+import json
+import logging
+from typing import Any, Optional
+
+from inspect_ai.dataset import Dataset, MemoryDataset, Sample
+
+from openbench.utils.progressivemcpbench_cache import (
+    get_progressivemcpbench_dataset_dir,
+)
+
+logger = logging.getLogger(__name__)
+
+
+def record_to_sample(record: dict[str, Any]) -> Optional[Sample]:
+    """Convert a ProgressiveMCPBench record to an Inspect Sample.
+
+    Args:
+        record: A dictionary containing ProgressiveMCPBench fields.
+
+    Returns:
+        Sample: Converted sample for evaluation.
+        None: If the record should be skipped (e.g. answer is None or empty).
+    """
+    # Skip tasks whose answer is explicitly null (or missing).
+    raw_answer = record.get("answer")
+    if raw_answer is None:
+        return None
+
+    # Normalise the answer to a single stripped string.
+    if isinstance(raw_answer, list):
+        # For a list, take the first non-empty entry, ignoring None values.
+        answer = next(
+            (str(a).strip() for a in raw_answer if a is not None and str(a).strip()),
+            "",
+        )
+    else:
+        # Convert with str() first so falsy-but-valid answers such as 0 are kept.
+        answer = str(raw_answer).strip()
+
+    # Skip records with no usable answer
+    if not answer:
+        return None
+
+    metadata = {
+        "category": record.get("category"),
+        "file_name": record.get("file_name"),
+        "annotator_metadata": record.get("Annotator Metadata", {}),
+    }
+
+    # Add tool requirement annotations if present (for minimal strategies)
+    if "required_servers" in record:
+        metadata["required_servers"] = record["required_servers"]
+    if "required_tools" in record:
+        metadata["required_tools"] = record["required_tools"]
+
+    # Add scorer instructions if present
+    if record.get("scorer_instructions"):
+        metadata["scorer_instructions"] = record["scorer_instructions"]
+
+    return Sample(
+        id=record["task_id"],
+        input=record["Question"],
+        target=answer,  # single answer string
+        metadata=metadata,
+    )
+
+
+def get_dataset() -> Dataset:
+    """Load ProgressiveMCPBench dataset.
+
+    This dataset uses the synthetic MCP server for deterministic evaluation.
+    It loads tasks from the dataset directory returned by
+    get_progressivemcpbench_dataset_dir().
+    """
+    dataset_dir = get_progressivemcpbench_dataset_dir()
+    tasks_file = dataset_dir / "tasks" / "progressivemcpbench.json"
+    try:
+        with tasks_file.open("r", encoding="utf-8") as f:
+            raw_records: list[dict[str, Any]] = json.load(f)
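+    except FileNotFoundError as e:
+        logger.error(f"Dataset file not found at {tasks_file}: {e}")
+        raise FileNotFoundError(
+            f"Dataset file not found at {tasks_file}. "
+            "Check the dataset cache or the PROGRESSIVEMCPBENCH_DATA_DIR override."
+        ) from e
+    except json.JSONDecodeError as e:
+        # A corrupt file is a different failure from a missing one.
+        logger.error(f"Failed to parse dataset file {tasks_file}: {e}")
+        raise ValueError(f"Dataset file {tasks_file} is not valid JSON.") from e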
+
+    samples: list[Sample] = []
+    for record in raw_records:
+        sample = record_to_sample(record)
+        if sample:
+            samples.append(sample)
+
+    return MemoryDataset(samples=samples, name="progressivemcpbench")