Introduce eval test framework #127

anyoung-tableau · 2025-10-03T18:41:24Z

This PR introduces eval tests for a few of the tools. For details on eval tests and how to run them locally, see docs/docs/developers/eval-tests.md, added in this PR. Each test is graded by an LLM and also by deterministic ("human") checks, which verify that the tool(s) we expect to be called are called and return what we expect.
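As a rough illustration of the "human" grading half, the sketch below checks a list of recorded tool executions for an expected tool name and expected output fragments. The `ToolExecution` shape and the `checkToolExecutions` helper are hypothetical names chosen for this example; the actual types under tests/eval may look different.

```typescript
// Hypothetical shape of one recorded tool call; not taken from the PR's code.
interface ToolExecution {
  name: string;
  arguments: Record<string, unknown>;
  output: string;
}

// Returns a list of problems; an empty list means the deterministic checks passed.
function checkToolExecutions(
  executions: ToolExecution[],
  expectedTool: string,
  expectedSubstrings: string[],
): string[] {
  const problems: string[] = [];

  // The expected tool must have been called at least once.
  const matching = executions.filter((e) => e.name === expectedTool);
  if (matching.length === 0) {
    problems.push(`expected tool "${expectedTool}" was not called`);
    return problems;
  }

  // Every expected fragment must appear somewhere in that tool's output.
  const combined = matching.map((e) => e.output).join('\n');
  for (const fragment of expectedSubstrings) {
    if (!combined.includes(fragment)) {
      problems.push(`output missing "${fragment}"`);
    }
  }
  return problems;
}
```

Inside a Vitest test, one plausible use would be `expect(checkToolExecutions(executions, 'list_datasources', ['Superstore Datasource'])).toEqual([])`, so that a failure message lists every unmet expectation at once.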

Example test output

Evaluating prompt: List the data sources that are available on my Tableau site. Do not perform any analysis on the list of data sources, just show the list.

Here are the data sources available on your Tableau site:

Superstore Datasource
- Project: Samples
TS Users
- Project: Admin Insights
- Description: Contains data about your users such as remaining licenses, site roles, and content ownership metrics.
TS Events
- Project: Admin Insights
- Description: Functions as a primary audit data source. Contains data about various events happening on your site, including sign-ins, publishes, and accessed views.
Permissions
- Project: Admin Insights
- Description: Contains the effective permissions for all users and content on the site.
Subscriptions
- Project: Admin Insights
- Description: Contains detailed information for subscriptions on the site, such as creator and recipient, content item name, job status, and schedules.
Groups
- Project: Admin Insights
- Description: Identifies the group membership of users.
Job Performance
- Project: Admin Insights
- Description: Contains events and runtime information for background jobs initiated by scheduled tasks on the site.
Viz Load Times
- Project: Admin Insights
- Description: Contains the load time information for views on your site.
Site Content
- Project: Admin Insights
- Description: Provides essential governance information about the projects, data sources, workbooks, and views on a site.
Tokens
- Project: Admin Insights
- Description: Contains information about active, unexpired user tokens on the site, including personal access tokens (PATs), refresh tokens, and OAuth tokens.

```json
{
    "accuracy": 5,
    "completeness": 5,
    "relevance": 5,
    "clarity": 5,
    "reasoning": 5,
    "comments": "<redacted and provided below>"
}
```

"comments": "This is an excellent response that perfectly fulfills the user's request. The answer provides a clear, well-structured list of 10 data sources available on the Tableau site, organized with names, projects, and descriptions where applicable. The LLM correctly interpreted the instruction to simply present the list without performing analysis, avoiding any unnecessary commentary or evaluation. The formatting is clean and easy to scan, using numbered items with bold names and consistent structure. All information appears accurate and relevant to Tableau's standard data sources, particularly the Admin Insights data sources which are typical for Tableau sites. The response demonstrates strong reasoning by including contextual information (project names and descriptions) that helps identify each data source without overstepping into analysis territory."
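A grade object like the one above can be validated and thresholded before a test passes or fails. The following is a minimal sketch under assumptions: the `Grade` interface simply mirrors the fields shown in the example output, while `parseGrade`, `failingDimensions`, and the idea of a per-dimension minimum score are hypothetical and not confirmed by the PR.

```typescript
// Mirrors the fields in the example grade output; the interface name is an assumption.
interface Grade {
  accuracy: number;
  completeness: number;
  relevance: number;
  clarity: number;
  reasoning: number;
  comments: string;
}

const DIMENSIONS = ['accuracy', 'completeness', 'relevance', 'clarity', 'reasoning'] as const;

// Parses the grader's raw JSON reply and rejects anything missing a field.
function parseGrade(raw: string): Grade {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('grade is not a JSON object');
  }
  const record = parsed as Record<string, unknown>;
  for (const dimension of DIMENSIONS) {
    if (typeof record[dimension] !== 'number') {
      throw new Error(`grade is missing numeric "${dimension}"`);
    }
  }
  if (typeof record.comments !== 'string') {
    throw new Error('grade is missing string "comments"');
  }
  return record as unknown as Grade;
}

// Lists the dimensions that scored below the (hypothetical) minimum.
function failingDimensions(grade: Grade, minScore: number): string[] {
  return DIMENSIONS.filter((dimension) => grade[dimension] < minScore);
}
```

Validating the reply before reading scores matters because the grader is itself an LLM and may return malformed or partial JSON; failing loudly at parse time keeps that case distinct from a genuinely low score.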

🛠️ tool executions:
  🔨 list_datasources
    👉 arguments: {}
    👈 output: {"type":"text","text":"<redacted and provided below as formatted json>"}

👈 output:

[
  {
    "id": "2d935df8-fe7e-4fd8-bb14-35eb4ba31d45",
    "name": "Superstore Datasource",
    "project": {
      "name": "Samples",
      "id": "cbec32db-a4a2-4308-b5f0-4fc67322f359"
    }
  },
  {
    "id": "7cdfa922-a5e7-4534-bcac-ab560fafbaa9",
    "name": "TS Users",
    "description": "*Overview*: TS Users contains data about your users such as remaining licenses, site roles, and content ownership metrics.\n\n*What is a Row of Data?* Each row of data corresponds to a unique User on your Tableau Online site.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "3df27e47-5f62-4ac0-86d7-49711c880472",
    "name": "TS Events",
    "description": "*Overview*: TS Events functions as a primary audit data source. It contains data about various events happening on your site, including sign-ins, publishes, and accessed views.\n\n*What is a Row of Data?* Each row of data correlates with a single event that occurs on the Tableau Online site.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "77ba5e2c-ad01-4431-81da-ff3f736dd2fe",
    "name": "Permissions",
    "description": "*Overview*: Permissions contains the effective permissions for all users and content on the site.\n*What is a Row of Data?* Each row of data corresponds to a combination of user, content item, and permission capability.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "ba1da5d9-e92b-4ff2-ad91-4238265d877c",
    "name": "Subscriptions",
    "description": "*Overview*: Subscriptions contains detailed information for subscriptions on the site, such as creator and recipient, content item name, job status, and schedules. A user following a Pulse metric is also considered a subscription for this data source.\n*What is a Row of Data?* Each row of data represents a user who is subscribed to a view or follows a Pulse metric.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "7d184d68-9fa6-4dc5-8499-eff8875c2aea",
    "name": "Groups",
    "description": "*Overview*: Groups identifies the group membership of users. Groups without members, and users not in a group, will be included as a row of data with null values represented as \"NULL\".\n\n*What is a Row of Data?* There is one row of data for each unique combination of group and user pairing.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "51e4be20-4d6c-4cba-9ad3-bcaea958924a",
    "name": "Job Performance",
    "description": "*Overview*: Job Performance contains events and runtime information for background jobs initiated by scheduled tasks on the site. Jobs include extract refreshes, flow runs, and Bridge refreshes.\n\n*What is a Row of Data?* Each row of data corresponds to a unique background job and the related content item, providing information about job status, duration, and error details.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "c7a5a367-95b3-4f77-a0c3-ac1c6e3264b6",
    "name": "Viz Load Times",
    "description": "*Overview*: Viz Load Times contains the load time information for views on your site.\n\n*What is a Row of Data?* Each row of data corresponds to a request for a content item and the load duration, measured in seconds.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "4e7829b6-ad82-46ce-b0c0-9684862035d0",
    "name": "Site Content",
    "description": "*Overview*: Site Content provides essential governance information about the projects, data sources, workbooks, and views on a site. The data provided about a content item may be unique to its item type. Item types with unique fields are in folders with titles that correspond to their Item Type.\n\n_Note_: Users that connect to the Site Content data source will see data about all content items on the site, regardless of the permissions set for each item. Keep this in mind if you plan to distribute to non-administrative users.\n\n*What is a Row of Data?* Each row of data corresponds to a unique content item. Item LUID is the unique identifier for content items on a Tableau Online pod.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "8fbf558a-d712-4cdc-84c3-fff94ad25b67",
    "name": "Tokens",
    "description": "*Overview*: Tokens contains information about active, unexpired user tokens on the site, including personal access tokens (PATs), refresh tokens, and OAuth tokens. The data source allows site administrators to monitor token usage, ensuring that necessary tokens are renewed.\n\n*What is a Row of Data?* Each row of data corresponds to a combination of Token ID and Owner Email.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  }
]

Copilot

Pull Request Overview

Introduces an eval test framework for testing MCP tools using LLM-based scoring. The framework evaluates tool implementations across multiple dimensions including accuracy, completeness, relevance, clarity, and reasoning.

- Adds eval tests for list_datasources and get_datasource_metadata tools with LLM-based evaluation
- Restructures test directory from e2e/ to tests/ with separate e2e/ and eval/ subdirectories
- Adds comprehensive documentation for setting up and running eval tests locally

Reviewed Changes

Copilot reviewed 31 out of 34 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vitest.config.eval.ts | New Vitest configuration for eval tests with 60-second timeout |
| tests/eval/*.ts | Core eval test framework including API client, grading logic, and test cases |
| tests/testEnv.ts | Updated environment paths to use new `tests/` directory structure |
| tests/e2e/*.ts | Updated import paths to reflect new directory structure |
| package.json | Added eval test script and OpenAI dependencies |
| docs/docs/developers/eval-tests.md | Comprehensive documentation for eval test setup and usage |
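For orientation, a Vitest configuration with a 60-second timeout might look roughly like the sketch below. Only the timeout value comes from the summary above; the `include` glob is an assumption about where the test files live, and `fileParallelism: false` is a guess at how the "Disable test parallelism" commit could be expressed, so the real vitest.config.eval.ts may differ.

```typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Assumed location of the eval test files.
    include: ['tests/eval/**/*.test.ts'],
    // 60-second per-test timeout, since each test waits on LLM and tool calls.
    testTimeout: 60_000,
    // Assumption: run test files one at a time rather than in parallel.
    fileParallelism: false,
  },
});
```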

anyoung-tableau added 14 commits September 25, 2025 21:56

- Add prototype for agent tests (4b1bdd6)
- Clean up tests, add assertions (ccf0485)
- Use LLM Gateway Express (d3d1688)
- Split up tests (968c6fd)
- Create tests folder (fc2b22c)
- Add docs (c130c2a)
- Add evals (8564ec5)
- Disable test parallelism (2acdfa9)
- Update docs (cdcd367)
- Clean up a little (e76e35f)
- Clean up some more (7cda0ba)
- Decrease timeout (fe92073)
- Remove duplicate const (262b019)
- Update docs (9a6f379)

Copilot AI review requested due to automatic review settings October 3, 2025 18:41

Copilot AI reviewed Oct 3, 2025

docs/docs/developers/eval-tests.md (resolved)

tests/eval/grade.ts (resolved)

tests/eval/grade.ts (resolved)

anyoung-tableau commented Oct 3, 2025

tests/e2e/pulse/generatePulseMetricValueInsightBundleTool.test.ts (resolved)

- Fix URL in error message (1142977)

anyoung-tableau commented Oct 3, 2025

tests/eval/base.ts (resolved)

anyoung-tableau commented Oct 3, 2025

tests/eval/grade.ts (resolved)

anyoung-tableau added 3 commits October 3, 2025 12:37

- Merge remote-tracking branch 'origin/main' into anyoung/eval (335fcab)
- Fix broken link (67a8f2d)
- Fix lint errors (66b8e84)

anyoung-tableau commented Oct 3, 2025

eslint.config.mjs (resolved)

anyoung-tableau added 2 commits October 3, 2025 12:58

- Bump versions (a4ffc29)
- Remove LLM GE stuff (11b204a)

stephendeoca approved these changes Oct 15, 2025

anyoung-tableau merged commit cfa9d6b into main Oct 15, 2025
5 checks passed

anyoung-tableau deleted the anyoung/eval branch October 15, 2025 17:42
