diff --git a/docs/gemini_cli_agent_testing.md b/docs/gemini_cli_agent_testing.md new file mode 100644 index 00000000..c18ef394 --- /dev/null +++ b/docs/gemini_cli_agent_testing.md @@ -0,0 +1,831 @@ +# Gemini CLI Evaluation Guide + +This guide covers how to use EvalBench for evaluating Gemini CLI agent workflows using **MCP Servers**, **Extensions**, and **Skills**. It includes configuration reference, evaluation dataset format, scoring metrics, and step-by-step instructions for running evaluations. + +--- + +## Table of Contents + +- [Overview](#overview) +- [Architecture](#architecture) +- [Prerequisites](#prerequisites) +- [Quick Start](#quick-start) +- [Configuration Reference](#configuration-reference) + - [Run Configuration](#1-run-configuration) + - [Model Configuration](#2-model-configuration) + - [Evaluation Dataset (Evalset)](#3-evaluation-dataset-evalset) +- [Tool Paradigms](#tool-paradigms) + - [MCP Servers](#mcp-servers) + - [Extensions](#extensions) + - [Skills](#skills) + - [Fake MCP Servers (Testing)](#fake-mcp-servers-testing) +- [Scorers](#scorers) +- [Reporting](#reporting) +- [End-to-End Examples](#end-to-end-examples) +- [Troubleshooting](#troubleshooting) + +--- + +## Overview + +EvalBench's Gemini CLI integration enables automated, multi-turn evaluation of agentic AI workflows. The Gemini CLI acts as the orchestrator that connects to various tool backends—MCP servers, extensions, or skills—and executes scenarios defined in an evaluation dataset. A **simulated user** powered by an LLM drives multi-turn conversations following a conversation plan, allowing realistic testing without human interaction. 
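The multi-turn loop summarized in this overview can be sketched roughly as below. This is an illustration under stated assumptions, not EvalBench's actual code: `run_conversation`, `agent_reply`, and `simulated_user_reply` are hypothetical names, while `max_turns` and the `TERMINATE` sentinel are the stopping conditions this guide describes.

```python
from typing import Callable

TERMINATE = "TERMINATE"  # sentinel the simulated user sends to end the conversation


def run_conversation(
    starting_prompt: str,
    agent_reply: Callable[[str], str],  # hypothetical: one Gemini CLI turn
    simulated_user_reply: Callable[[list[tuple[str, str]]], str],  # hypothetical: LLM user
    max_turns: int,
) -> list[tuple[str, str]]:
    """Alternate agent and simulated-user turns; return (user, agent) pairs."""
    history: list[tuple[str, str]] = []
    user_message = starting_prompt
    for _ in range(max_turns):
        # The agent answers the current user message.
        history.append((user_message, agent_reply(user_message)))
        # The simulated user decides what to say next, based on the history so far.
        user_message = simulated_user_reply(history)
        if user_message.strip() == TERMINATE:
            break
    return history
```

In a real run, the simulated-user callable would also receive the scenario's `conversation_plan`; it is omitted here to keep the sketch short.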
+ +### Key Capabilities + +- **Multi-turn evaluation** with LLM-powered simulated users +- **Three tool paradigms**: MCP servers, Gemini CLI extensions, and skills +- **Fake MCP server support** for deterministic, offline testing +- **8 built-in scorers** covering correctness, efficiency, and behavior quality +- **CSV and BigQuery reporting** + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ EvalBench Pipeline │ +│ │ +│ ┌──────────────┐ ┌──────────────────┐ ┌───────────────────┐ │ +│ │ Run Config │───▶│ AgentOrchestrator│───▶│ AgentEvaluator │ │ +│ │ (YAML) │ │ │ │ │ │ +│ └──────────────┘ └──────────────────┘ └────────┬──────────┘ │ +│ │ │ +│ ┌──────────────┐ ┌──────────────────────┼──────────┐ │ +│ │ Eval Dataset│ │ Per Scenario │ │ │ +│ │ (JSON) │─────────────▶│ ▼ │ │ +│ └──────────────┘ │ ┌──────────────────────────┐ │ │ +│ │ │ GeminiCliGenerator │ │ │ +│ ┌──────────────┐ │ │ ┌─────────┐ ┌─────────┐ │ │ │ +│ │ Model Config │──────────────│─▶│ │MCP/Ext/ │ │Simulated│ │ │ │ +│ │ (YAML) │ │ │ │Skills │ │User │ │ │ │ +│ └──────────────┘ │ │ └─────────┘ └─────────┘ │ │ │ +│ │ └───────────┬──────────────┘ │ │ +│ │ │ │ │ +│ │ ▼ │ │ +│ │ ┌──────────────────────────┐ │ │ +│ │ │ Scorers (8 metrics) │ │ │ +│ │ └──────────────────────────┘ │ │ +│ └─────────────────────────────────┘ │ +│ │ +│ ┌──────────────────────────────────────────────────────────────┐ │ +│ │ Reporting (CSV / BigQuery) │ │ +│ └──────────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Flow:** +1. The **Run Config** ties together the dataset, model config, scorers, and reporting. +2. The **AgentOrchestrator** loads the `geminicli` evaluator. +3. 
For each scenario in the evalset, the **AgentEvaluator** runs a multi-turn conversation loop: + - Sends the starting prompt to Gemini CLI + - A **SimulatedUser** (LLM) generates realistic follow-up responses based on the conversation plan + - Tools are accumulated across turns + - Conversation continues until `max_turns` is reached or the simulated user sends `TERMINATE` +4. Results are scored by all configured **scorers**. +5. Reports are written to CSV and/or BigQuery. + +--- + +## Prerequisites + +1. **Python 3.10+** and project dependencies installed: + ```bash + cd evalbench + pip install -r requirements.txt + ``` + +2. **Node.js and npm** (for Gemini CLI execution) + +3. **GCP Authentication** (for Vertex AI models and MCP servers): + ```bash + gcloud auth application-default login + ``` + +4. **Environment Variables**: + ```bash + export EVAL_GCP_PROJECT_ID=your_project_id + export EVAL_GCP_PROJECT_REGION=us-central1 + ``` + +--- + +## Quick Start + +### 1. Set the run configuration + +```bash +# For MCP Server evaluation: +export EVAL_CONFIG=datasets/gemini-cli-tools/example_run_config.yaml + +# For Skills evaluation: +export EVAL_CONFIG=datasets/gemini-cli-tools/example_run_skills_config.yaml + +# For Fake MCP (offline testing): +export EVAL_CONFIG=datasets/gemini-cli-tools/example_run_fake_config.yaml +``` + +### 2. Run the evaluation + +```bash +./evalbench/run.sh +``` + +This executes all scenarios, runs scorers, and writes results to the `results/` directory. + +--- + +## Configuration Reference + +Three YAML/JSON files work together to define an evaluation run: + +### 1. Run Configuration + +The top-level config that ties everything together. For Gemini CLI, set `orchestrator: geminicli` and `dataset_format: gemini-cli-format`. 
+ +| Key | Required | Description | +|-----|----------|-------------| +| `dataset_config` | Yes | Path to the evalset JSON file | +| `dataset_format` | Yes | Must be `gemini-cli-format` | +| `orchestrator` | Yes | Must be `geminicli` | +| `model_config` | Yes | Path to the Gemini CLI model config YAML | +| `simulated_user_model_config` | Yes | Path to the model config for the simulated user LLM | +| `scorers` | Yes | Dictionary of scorer configurations (see [Scorers](#scorers)) | +| `reporting` | Optional | CSV and/or BigQuery output options | + +**Example** ([example_run_config.yaml](/datasets/gemini-cli-tools/example_run_config.yaml)): + +```yaml +############################################################ +### Dataset / Eval Items +############################################################ +dataset_config: datasets/gemini-cli-tools/gemini-cli.evalset.json +dataset_format: gemini-cli-format + +# Orchestrator Configuration +orchestrator: geminicli +model_config: datasets/model_configs/gemini_cli_model.yaml +simulated_user_model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + +############################################################ +### Scorer Related Configs +############################################################ +scorers: + trajectory_matcher: {} + goal_completion: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + behavioral_metrics: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + parameter_analysis: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + turn_count: {} + end_to_end_latency: {} + tool_call_latency: {} + token_consumption: {} + +############################################################ +### Reporting Related Configs +############################################################ +reporting: + csv: + output_directory: 'results' +``` + +--- + +### 2. 
Model Configuration + +The model config defines the Gemini CLI version, environment, and the tool setup (MCP servers, extensions, or skills). This is the **critical file** that determines which tool paradigm is used. + +#### Common Fields + +| Key | Required | Description | +|-----|----------|-------------| +| `gemini_cli_version` | Yes | NPM package specifier for Gemini CLI (e.g., `@google/gemini-cli@0.25.1`) | +| `generator` | Yes | Must be `gemini_cli` | +| `env` | Optional | Environment variables passed to the CLI process | +| `setup` | Optional | Tool setup block containing `mcp_servers`, `extensions`, `skills`, or `fake_mcp_servers` | + +#### Environment Variables + +| Variable | Description | +|----------|-------------| +| `GOOGLE_CLOUD_PROJECT` | GCP project for API calls | +| `GOOGLE_CLOUD_LOCATION` | GCP region | +| `GOOGLE_GENAI_USE_VERTEXAI` | Set to `"true"` to use Vertex AI | +| `GEMINI_API_MODEL` | Override the model used by Gemini CLI | +| `GEMINI_MODEL` | Alternative model override | + +--- + +### 3. Evaluation Dataset (Evalset) + +The evalset JSON file defines the test scenarios. Each scenario represents an agentic user journey. + +#### Evalset Structure + +```json +{ + "scenarios": [ + { + "id": "unique-scenario-id", + "starting_prompt": "The initial user message", + "conversation_plan": "Instructions for the simulated user...", + "expected_trajectory": ["tool_1", "tool_2"], + "env": { + "GOOGLE_CLOUD_PROJECT": "my-project" + }, + "kind": "tools", + "max_turns": 6 + } + ] +} +``` + +#### Scenario Fields + +| Field | Required | Description | +|-------|----------|-------------| +| `id` | Yes | Unique identifier for the scenario | +| `starting_prompt` | Yes | The first user message sent to Gemini CLI | +| `conversation_plan` | Yes | Natural language instructions that guide the simulated user's behavior across turns. This defines the goals, expected information to provide, and how to react to agent responses. 
| +| `expected_trajectory` | Yes | Ordered list of tool names the agent is expected to call. Used by `trajectory_matcher` scorer. | +| `env` | Optional | Per-scenario environment variables (merged with model config env) | +| `kind` | Optional | Category label (e.g., `"tools"`) | +| `max_turns` | Yes | Maximum number of conversation turns before the evaluation stops | + +#### Writing Good Conversation Plans + +The `conversation_plan` is a critical part of each scenario. It instructs the simulated user LLM how to behave. Best practices: + +1. **Be specific about the goal**: Clearly state what the user wants to accomplish. +2. **Provide concrete values**: Include specific names, values, and parameters the simulated user should provide when asked. +3. **Handle ambiguity intentionally**: Some scenarios test the agent's ability to handle vague requests (e.g., `"I need a database."`). +4. **Include decision points**: Tell the simulated user how to respond to agent confirmations or questions. +5. **Define the project context**: Always specify the GCP project and relevant details. + +**Example — Ambiguous Multi-turn Scenario:** +```json +{ + "id": "csql-create-ambiguous-multiturn-01", + "starting_prompt": "I need a database.", + "conversation_plan": "The user starts with a vague request. You want to CREATE a NEW Cloud SQL instance named 'my-pg-app'. If the agent offers to create one, say YES. When asked for details, provide 'my-pg-app' as the instance name and 'user_data' as the database name. Never claim to have an existing instance. 
The goal is for the agent to eventually create the database 'user_data' inside 'my-pg-app' in astana-evaluation project.", + "expected_trajectory": ["list_instances", "create_instance", "create_database"], + "env": { + "GOOGLE_CLOUD_PROJECT": "astana-evaluation" + }, + "kind": "tools", + "max_turns": 6 +} +``` + +--- + +## Tool Paradigms + +EvalBench supports three distinct tool paradigms for Gemini CLI, each configured in the `setup` section of the model config YAML. You can also use **fake MCP servers** for deterministic, offline testing. + +### MCP Servers + +MCP (Model Context Protocol) servers expose tools via a standardized protocol. These can be remote HTTP-based APIs (like Google Cloud managed MCP servers) or local stdio-based servers. + +#### Configuration + +In the model config, under `setup.mcp_servers`: + +```yaml +setup: + mcp_servers: + "server-name": + # For HTTP-based MCP servers: + httpUrl: "https://example.googleapis.com/mcp" + authProviderType: google_credentials + oauth: + scopes: + - https://www.googleapis.com/auth/cloud-platform + headers: + X-Goog-User-Project: my-project + + # For stdio-based MCP servers: + # command: "python" + # args: + # - "path/to/server.py" +``` + +#### How It Works + +1. **Setup Phase**: `GeminiCliGenerator._setup_mcp_servers()` writes the MCP server configuration into `settings.json` in a sandboxed fake home directory (`.venv/fake_home/.gemini/settings.json`). +2. **Verification**: By default, EvalBench verifies each MCP server by running Gemini CLI and asking it to list loaded tools. If no MCP-specific tools are found, the run fails with a `RuntimeError`. This ensures the server is reachable before evaluation begins. +3. **Execution**: During evaluation, Gemini CLI connects to the configured MCP server(s) and uses the exposed tools to complete tasks. 
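The setup phase above can be sketched as follows. This is a minimal illustration, not the real `GeminiCliGenerator._setup_mcp_servers()` implementation: the helper name `write_mcp_settings` is hypothetical, and the sketch assumes the server definitions belong under the `mcpServers` key of `settings.json`.

```python
import json
from pathlib import Path


def write_mcp_settings(fake_home: Path, mcp_servers: dict) -> Path:
    """Merge MCP server definitions into <fake_home>/.gemini/settings.json."""
    settings_path = fake_home / ".gemini" / "settings.json"
    settings_path.parent.mkdir(parents=True, exist_ok=True)

    # Preserve any settings that already exist in the sandboxed home.
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text(encoding="utf-8"))

    # Assumption: Gemini CLI reads server definitions from the "mcpServers" key.
    settings.setdefault("mcpServers", {}).update(mcp_servers)
    settings_path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
    return settings_path
```

For example, `write_mcp_settings(Path(".venv/fake_home"), config["setup"]["mcp_servers"])` would produce the sandboxed file named in step 1, given a model config already loaded into a `config` dict.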
+ +#### Example: Cloud SQL Managed MCP Server + +```yaml +# datasets/model_configs/gemini_cli_model.yaml +gemini_cli_version: "@google/gemini-cli@0.25.1" +generator: gemini_cli +env: + GOOGLE_CLOUD_PROJECT: "my-project" + GOOGLE_CLOUD_LOCATION: "us-central1" + GOOGLE_GENAI_USE_VERTEXAI: "true" +setup: + mcp_servers: + "cloud-sql": + httpUrl: "https://sqladmin.googleapis.com/mcp" + authProviderType: google_credentials + oauth: + scopes: + - https://www.googleapis.com/auth/cloud-platform + headers: + X-Goog-User-Project: my-project +``` + +> **Note**: Any valid MCP server configuration that Gemini CLI accepts can be used here. The `setup.mcp_servers` block is written directly into the Gemini CLI settings file. + +--- + +### Extensions + +Extensions are GitHub-hosted Gemini CLI plugins that provide additional tools. EvalBench can install, configure, and manage extensions automatically. + +#### Configuration + +In the model config, under `setup.extensions`: + +```yaml +setup: + extensions: + "https://github.com/org/extension-repo": + settings: + SETTING_KEY_1: "value1" + SETTING_KEY_2: "value2" +``` + +#### How It Works + +1. **Installation**: `GeminiCliGenerator._install_extensions()` calls `gemini extensions install --consent` via the CLI. +2. **Idempotent Management**: Before installing, EvalBench lists current extensions, removes any that are no longer in the config, and skips already-installed ones. +3. **Headless Compatibility**: Extension manifests with `"sensitive": true` are automatically patched to `"sensitive": false` to avoid keychain requirements in CI/headless environments. +4. **Settings**: Extension-specific environment variables (from `settings`) are passed during installation and at CLI runtime. 
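Steps 2 and 3 above can be illustrated with the sketch below. Both helper names are hypothetical, the manifest shape used in the usage note is invented for illustration, and the sketch assumes installed and configured extensions can be compared by a shared identifier. It shows the idea only, not EvalBench's actual `_install_extensions()` logic.

```python
import copy


def plan_extension_changes(
    installed: set[str], configured: set[str]
) -> tuple[set[str], set[str]]:
    """Return (to_remove, to_install) so the installed set matches the config."""
    to_remove = installed - configured   # installed but no longer in the config
    to_install = configured - installed  # in the config but not yet installed
    return to_remove, to_install


def patch_sensitive_settings(manifest: dict) -> dict:
    """Return a copy of the manifest with every `"sensitive": true` set to false."""
    patched = copy.deepcopy(manifest)

    def _walk(node):
        if isinstance(node, dict):
            if node.get("sensitive") is True:
                node["sensitive"] = False
            for value in node.values():
                _walk(value)
        elif isinstance(node, list):
            for item in node:
                _walk(item)

    _walk(patched)
    return patched
```

Calling `plan_extension_changes({"a", "b"}, {"b", "c"})` yields `({"a"}, {"c"})`: extension `a` is removed, `c` is installed, and `b` is skipped because it is already present. `patch_sensitive_settings` leaves the input untouched and returns a patched copy, so the original manifest can still be inspected when debugging.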
+ +#### Example: Cloud SQL PostgreSQL Extension + +```yaml +# Model config with extensions +gemini_cli_version: "@google/gemini-cli@0.25.1" +generator: gemini_cli +env: + GOOGLE_CLOUD_PROJECT: "my-project" + GOOGLE_CLOUD_LOCATION: "us-central1" + GOOGLE_GENAI_USE_VERTEXAI: "true" +setup: + extensions: + "https://github.com/gemini-cli-extensions/cloud-sql-postgresql": + settings: + CLOUD_SQL_POSTGRES_PROJECT: "my-project" + CLOUD_SQL_POSTGRES_INSTANCE: "my-instance" + CLOUD_SQL_POSTGRES_REGION: "us-central1" + CLOUD_SQL_POSTGRES_DATABASE: "mydb" + CLOUD_SQL_POSTGRES_USER: "app" + CLOUD_SQL_POSTGRES_PASSWORD: "secret" + CLOUD_SQL_POSTGRES_IP_TYPE: "PUBLIC" +``` + +--- + +### Skills + +Skills are local Gemini CLI skill packages that can be linked, installed, enabled, disabled, or uninstalled. + +#### Configuration + +In the model config, under `setup.skills`: + +```yaml +setup: + skills: + # Link a skill from a local path + - action: link + path: "/path/to/skill/directory" + + # Install a skill by name or path + - action: install + name: "skill-name" + # or + - action: install + path: "/path/to/skill" + + # Enable/Disable a skill + - action: enable + name: "skill-name" + - action: disable + name: "skill-name" + + # Uninstall a skill + - action: uninstall + name: "skill-name" +``` + +You can also specify skills by name only (as strings), which copies them from `~/.gemini/skills/`: + +```yaml +setup: + skills: + - "my-existing-skill" +``` + +#### Supported Actions + +| Action | Required Fields | Description | +|--------|----------------|-------------| +| `link` | `path` | Creates a symlink to a local skill directory. Runs `gemini skills link --consent`. | +| `install` | `name` or `path` | Installs a skill from a registry or path. Runs `gemini skills install --consent`. | +| `enable` | `name` | Enables a previously installed skill. | +| `disable` | `name` | Disables a skill without uninstalling it. | +| `uninstall` | `name` | Removes a skill. | + +#### How It Works + +1. 
**String-based skills**: If a skill is specified as a plain string, EvalBench copies it from `~/.gemini/skills/` to the sandboxed fake home. +2. **Action-based skills**: If specified as a dict with an `action`, EvalBench runs the corresponding `gemini skills` subcommand. +3. **Sandboxing**: All skills operate within a fake home directory (`.venv/fake_home/`) to isolate the evaluation environment. + +#### Example: Cloud SQL Admin Skill + +```yaml +# datasets/model_configs/gemini_cli_skills_model.yaml +gemini_cli_version: "@google/gemini-cli@0.31.0" +generator: gemini_cli +env: + GOOGLE_CLOUD_PROJECT: "my-project" + GOOGLE_CLOUD_LOCATION: "us-central1" + GOOGLE_GENAI_USE_VERTEXAI: "true" + GEMINI_API_MODEL: "gemini-2.5-flash" + GEMINI_MODEL: "gemini-2.5-flash" +setup: + skills: + - action: link + path: "/path/to/cloud-sql-postgresql/skills/cloudsql-postgres-admin" +``` + +--- + +### Fake MCP Servers (Testing) + +Fake MCP servers let you test your evalset and pipeline with deterministic, hardcoded tool responses, with no real API calls needed. + +#### Configuration + +Fake MCP servers are defined in **two parts**: the server process definition in `setup.fake_mcp_servers`, and the tool definitions in the top-level `fake_mcp_tools` section of the model config. + +**Server definition:** +```yaml +setup: + fake_mcp_servers: + "server-name": + command: "python" + args: + - "evalbench/util/fake_mcp_server.py" + - "--server-name" + - "server-name" + - "--config" + - "path/to/this_model_config.yaml" +``` + +**Tool definitions:** +```yaml +fake_mcp_tools: + "server-name": + - name: tool_name + description: "What this tool does" + parameters: + type: object + properties: + param1: + type: string + description: "Description of param1" + required: ["param1"] + response: + status: "success" + message: "Hardcoded response" +``` + +#### How It Works + +1. EvalBench starts `fake_mcp_server.py` as a stdio-based MCP server. +2. 
The server reads tool definitions from the YAML config's `fake_mcp_tools` section. +3. When called, each tool returns its hardcoded `response` (or a default success response containing the tool name and arguments). +4. Tool verification is **skipped** for fake servers (`verify_tools=False`). + +#### Tool Response Types + +- **Success**: Return a success message + ```yaml + response: + status: "success" + message: "Instance created successfully" + ``` +- **Failure**: Return an error + ```yaml + response: + status: "failure" + error: + code: 404 + message: "Instance not found or permission denied" + ``` +- **Default**: If no `response` is specified, returns `{"status": "success", "tool": "", "args": {}}` + +#### Example: Fake Cloud SQL MCP + +```yaml +# datasets/model_configs/gemini_cli_fake_model.yaml +gemini_cli_version: "@google/gemini-cli@0.25.1" +generator: gemini_cli +env: + GOOGLE_CLOUD_PROJECT: "astana-evaluation" + GOOGLE_CLOUD_LOCATION: "us-central1" + GOOGLE_GENAI_USE_VERTEXAI: "true" +setup: + fake_mcp_servers: + "cloud-sql": + command: "python" + args: + - "evalbench/util/fake_mcp_server.py" + - "--server-name" + - "cloud-sql" + - "--config" + - "datasets/model_configs/gemini_cli_fake_model.yaml" + +fake_mcp_tools: + "cloud-sql": + - name: create_instance + description: "Creates a Cloud SQL instance" + parameters: + type: object + properties: + project_id: + type: string + instance_name: + type: string + required: ["project_id", "instance_name"] + response: + status: "success" + message: "Instance created successfully" + - name: get_instance + description: "Gets details of a Cloud SQL instance" + parameters: + type: object + properties: + project_id: + type: string + instance_name: + type: string + required: ["project_id", "instance_name"] + response: + status: "failure" + error: + code: 404 + message: "Instance not found or permission denied" +``` + +--- + +## Scorers + +EvalBench provides **8 scorers** for Gemini CLI evaluations. 
Scorers are configured in the `scorers` section of the run config. + +### LLM-Based Scorers + +These require a `model_config` pointing to an LLM for evaluation: + +| Scorer | Score Range | Description | +|--------|------------|-------------| +| `goal_completion` | 0–100 | Uses an LLM to evaluate whether the agent accomplished the conversation plan's intent. Returns `100` for PASS, `0` for FAIL. | +| `behavioral_metrics` | 0–100 | Evaluates hallucination rate and clarification rate in a single LLM pass. Starts at 100 and penalizes: **-50 per hallucination**, **-20 per unnecessary clarification**. | +| `parameter_analysis` | 100 (qualitative) | Uses an LLM to provide qualitative feedback on tool parameters used. Always scores 100; the value is in the textual explanation. | + +### Deterministic Scorers + +These require no additional model: + +| Scorer | Score Range | Description | +|--------|------------|-------------| +| `trajectory_matcher` | 0–100 | Compares expected vs. actual tool usage. Uses **Jaccard Similarity** by default (set-based, order-insensitive). Set `enforce_order: true` for **Levenshtein distance** (order-sensitive). | +| `turn_count` | Count | Reports the number of conversation turns the agent took. Lower is generally better. | +| `end_to_end_latency` | Milliseconds | Total latency = model API latency + tool execution latency. | +| `tool_call_latency` | Milliseconds | Sum of all tool execution durations across all turns. | +| `token_consumption` | Count | Total tokens consumed (input + output) across all turns. 
| + +### Scorer Configuration Example + +```yaml +scorers: + # Deterministic scorers (no model needed) + trajectory_matcher: {} + # trajectory_matcher: + # enforce_order: true # Use Levenshtein for ordered matching + turn_count: {} + end_to_end_latency: {} + tool_call_latency: {} + token_consumption: {} + + # LLM-based scorers (require model_config) + goal_completion: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + behavioral_metrics: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + parameter_analysis: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml +``` + +--- + +## Reporting + +### CSV Reporting + +Results are output as CSV files in the specified directory: + +```yaml +reporting: + csv: + output_directory: 'results' +``` + +### BigQuery Reporting + +For centralized result storage and dashboarding: + +```yaml +reporting: + bigquery: + gcp_project_id: my-gcp-project +``` + +--- + +## End-to-End Examples + +### Example 1: Evaluate MCP Server (Real API) + +**Goal**: Test Cloud SQL management via the managed MCP server. + +1. **Create model config** (`model_configs/gemini_cli_model.yaml`): + ```yaml + gemini_cli_version: "@google/gemini-cli@0.25.1" + generator: gemini_cli + env: + GOOGLE_CLOUD_PROJECT: "my-project" + GOOGLE_CLOUD_LOCATION: "us-central1" + GOOGLE_GENAI_USE_VERTEXAI: "true" + setup: + mcp_servers: + "cloud-sql": + httpUrl: "https://sqladmin.googleapis.com/mcp" + authProviderType: google_credentials + oauth: + scopes: + - https://www.googleapis.com/auth/cloud-platform + headers: + X-Goog-User-Project: my-project + ``` + +2. **Create evalset** (`my-evalset.json`): + ```json + { + "scenarios": [ + { + "id": "list-and-inspect-01", + "starting_prompt": "list all instances in project my-project", + "conversation_plan": "Ask the agent to list instances. 
Once listed, get details of the 'prod-db' instance and verify it is RUNNABLE.", + "expected_trajectory": ["list_instances", "get_instance"], + "env": { "GOOGLE_CLOUD_PROJECT": "my-project" }, + "kind": "tools", + "max_turns": 3 + } + ] + } + ``` + +3. **Create run config** (`my-run-config.yaml`): + ```yaml + dataset_config: my-evalset.json + dataset_format: gemini-cli-format + orchestrator: geminicli + model_config: model_configs/gemini_cli_model.yaml + simulated_user_model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + scorers: + trajectory_matcher: {} + goal_completion: + model_config: datasets/model_configs/gemini_2.5_pro_model.yaml + turn_count: {} + reporting: + csv: + output_directory: 'results' + ``` + +4. **Run**: + ```bash + export EVAL_CONFIG=my-run-config.yaml + ./evalbench/run.sh + ``` + +### Example 2: Evaluate Skills + +**Goal**: Test Cloud SQL management via a linked skill. + +1. **Create model config** (`model_configs/gemini_cli_skills_model.yaml`): + ```yaml + gemini_cli_version: "@google/gemini-cli@0.31.0" + generator: gemini_cli + env: + GOOGLE_CLOUD_PROJECT: "my-project" + GOOGLE_CLOUD_LOCATION: "us-central1" + GOOGLE_GENAI_USE_VERTEXAI: "true" + GEMINI_API_MODEL: "gemini-2.5-flash" + GEMINI_MODEL: "gemini-2.5-flash" + setup: + skills: + - action: link + path: "/path/to/skills/cloudsql-postgres-admin" + ``` + +2. **Use the same evalset** — the scenarios are tool-paradigm agnostic. + +3. **Create run config** pointing to the skills model config. + +4. **Run**: + ```bash + export EVAL_CONFIG=my-skills-run-config.yaml + ./evalbench/run.sh + ``` + +### Example 3: Offline Testing with Fake MCP + +**Goal**: Validate the pipeline without making real API calls. + +1. **Create model config** with fake tools (see [Fake MCP Servers](#fake-mcp-servers-testing) section for full example). + +2. 
**Create a simple evalset** targeting the fake tools: + ```json + { + "scenarios": [ + { + "id": "fake-create-success", + "starting_prompt": "Create a new Cloud SQL instance named 'test-db' in project 'my-project'.", + "conversation_plan": "All details are in the prompt. The agent should call create_instance and report success.", + "expected_trajectory": ["create_instance"], + "env": { "GOOGLE_CLOUD_PROJECT": "my-project" }, + "kind": "tools", + "max_turns": 3 + } + ] + } + ``` + +3. **Run**: + ```bash + export EVAL_CONFIG=datasets/gemini-cli-tools/example_run_fake_config.yaml + ./evalbench/run.sh + ``` + +--- + +## Troubleshooting + +### MCP Server Verification Fails + +``` +RuntimeError: MCP Server 'cloud-sql' failed verification +``` + +- Ensure the MCP server URL is correct and accessible. +- Check that `gcloud auth application-default login` has been run. +- Verify the OAuth scopes and project ID in the headers. +- Check network/firewall rules if the server is remote. + +### Extension Installation Fails + +- Ensure `npm` is installed and accessible. +- Check the extension GitHub URL is correct and accessible. +- Look at logs for "keychain" errors — these are auto-patched, but the extension may have other issues. +- Run `npm exec --yes @google/gemini-cli -- extensions list` manually to debug. + +### Skill Linking Fails + +- Verify the skill path exists and contains a valid skill structure. +- Ensure the Gemini CLI version supports the `skills link` command. +- Check that the path is an absolute path. + +### Empty or Missing Results + +- Confirm `dataset_format` is set to `gemini-cli-format`. +- Check that the evalset JSON has the correct `"scenarios"` key. +- Verify the `model_config` path is correct relative to the repo root. + +### NPM Authentication Issues + +- EvalBench auto-configures NPM auth tokens for private registries using `gcloud auth print-access-token`. +- If this fails, run `gcloud auth login` and retry. + +--- \ No newline at end of file