@anyoung-tableau anyoung-tableau commented Oct 3, 2025

This PR introduces eval tests for a few of the tools. For details on eval tests and how to run them locally, see docs/developer/eval-tests.md, added in this PR. Each test is graded by an LLM and also gets "human" grading: the test asserts that the tool(s) we expect to be called are in fact called and return what we expect.
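The "human" grading half can be pictured as a plain deterministic assertion over the tool calls recorded during a run. The following is only a minimal sketch: the `ToolExecution` shape and the `expectToolCalled` helper are hypothetical names chosen for illustration, and the framework added in this PR may structure this differently.

```typescript
// Hypothetical record of one tool call captured during an eval run.
interface ToolExecution {
  name: string;
  arguments: Record<string, unknown>;
  output: string;
}

// Returns the first execution of the named tool, or throws if it was never called.
function expectToolCalled(executions: ToolExecution[], toolName: string): ToolExecution {
  const match = executions.find((e) => e.name === toolName);
  if (!match) {
    const called = executions.map((e) => e.name).join(', ') || '(none)';
    throw new Error(`Expected tool "${toolName}" to be called; called: ${called}`);
  }
  return match;
}

// Example: the model called list_datasources with no arguments.
const executions: ToolExecution[] = [
  { name: 'list_datasources', arguments: {}, output: '[{"name":"Superstore Datasource"}]' },
];
const call = expectToolCalled(executions, 'list_datasources');
```

A check like this fails loudly when the model answers from memory instead of calling the tool, which an LLM grader alone might not catch.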

Example test output

Evaluating prompt: List the data sources that are available on my Tableau site. Do not perform any analysis on the the list of data sources, just show the list.

Here are the data sources available on your Tableau site:

  1. Superstore Datasource

    • Project: Samples
  2. TS Users

    • Project: Admin Insights
    • Description: Contains data about your users such as remaining licenses, site roles, and content ownership metrics.
  3. TS Events

    • Project: Admin Insights
    • Description: Functions as a primary audit data source. Contains data about various events happening on your site, including sign-ins, publishes, and accessed views.
  4. Permissions

    • Project: Admin Insights
    • Description: Contains the effective permissions for all users and content on the site.
  5. Subscriptions

    • Project: Admin Insights
    • Description: Contains detailed information for subscriptions on the site, such as creator and recipient, content item name, job status, and schedules.
  6. Groups

    • Project: Admin Insights
    • Description: Identifies the group membership of users.
  7. Job Performance

    • Project: Admin Insights
    • Description: Contains events and runtime information for background jobs initiated by scheduled tasks on the site.
  8. Viz Load Times

    • Project: Admin Insights
    • Description: Contains the load time information for views on your site.
  9. Site Content

    • Project: Admin Insights
    • Description: Provides essential governance information about the projects, data sources, workbooks, and views on a site.
  10. Tokens

    • Project: Admin Insights
    • Description: Contains information about active, unexpired user tokens on the site, including personal access tokens (PATs), refresh tokens, and OAuth tokens.
```json
{
    "accuracy": 5,
    "completeness": 5,
    "relevance": 5,
    "clarity": 5,
    "reasoning": 5,
    "comments": "<redacted and provided below>"
}
```
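One way a test could turn a grade like this into a pass/fail result is to require a minimum score on every dimension. This is a sketch under assumptions: the `Grade` type mirrors the JSON above, while `failingDimensions` and the minimum score of 4 are hypothetical and not taken from the PR.

```typescript
// Shape of the grade object shown above.
interface Grade {
  accuracy: number;
  completeness: number;
  relevance: number;
  clarity: number;
  reasoning: number;
  comments: string;
}

const DIMENSIONS = ['accuracy', 'completeness', 'relevance', 'clarity', 'reasoning'] as const;

// Returns the dimensions that scored below the (hypothetical) minimum.
function failingDimensions(grade: Grade, minScore: number): string[] {
  return DIMENSIONS.filter((d) => grade[d] < minScore);
}

const grade: Grade = JSON.parse(
  '{"accuracy":5,"completeness":5,"relevance":5,"clarity":5,"reasoning":5,"comments":"..."}',
);
// The threshold of 4 is an assumption for illustration only.
const failures = failingDimensions(grade, 4);
```

Reporting the failing dimensions by name, rather than a single boolean, makes a failed eval easier to diagnose.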

"comments": "This is an excellent response that perfectly fulfills the user's request. The answer provides a clear, well-structured list of 10 data sources available on the Tableau site, organized with names, projects, and descriptions where applicable. The LLM correctly interpreted the instruction to simply present the list without performing analysis, avoiding any unnecessary commentary or evaluation. The formatting is clean and easy to scan, using numbered items with bold names and consistent structure. All information appears accurate and relevant to Tableau's standard data sources, particularly the Admin Insights data sources which are typical for Tableau sites. The response demonstrates strong reasoning by including contextual information (project names and descriptions) that helps identify each data source without overstepping into analysis territory."

🛠️ tool executions:
  🔨 list_datasources
    👉 arguments: {}
    👈 output: {"type":"text","text":"<redacted and provided below as formatted json>"}

👈 output:

[
  {
    "id": "2d935df8-fe7e-4fd8-bb14-35eb4ba31d45",
    "name": "Superstore Datasource",
    "project": {
      "name": "Samples",
      "id": "cbec32db-a4a2-4308-b5f0-4fc67322f359"
    }
  },
  {
    "id": "7cdfa922-a5e7-4534-bcac-ab560fafbaa9",
    "name": "TS Users",
    "description": "*Overview*: TS Users contains data about your users such as remaining licenses, site roles, and content ownership metrics.\n\n*What is a Row of Data?* Each row of data corresponds to a unique User on your Tableau Online site.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "3df27e47-5f62-4ac0-86d7-49711c880472",
    "name": "TS Events",
    "description": "*Overview*: TS Events functions as a primary audit data source. It contains data about various events happening on your site, including sign-ins, publishes, and accessed views.\n\n*What is a Row of Data?* Each row of data correlates with a single event that occurs on the Tableau Online site.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "77ba5e2c-ad01-4431-81da-ff3f736dd2fe",
    "name": "Permissions",
    "description": "*Overview*: Permissions contains the effective permissions for all users and content on the site.\n*What is a Row of Data?* Each row of data corresponds to a combination of user, content item, and permission capability.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "ba1da5d9-e92b-4ff2-ad91-4238265d877c",
    "name": "Subscriptions",
    "description": "*Overview*: Subscriptions contains detailed information for subscriptions on the site, such as creator and recipient, content item name, job status, and schedules. A user following a Pulse metric is also considered a subscription for this data source.\n*What is a Row of Data?* Each row of data represents a user who is subscribed to a view or follows a Pulse metric.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "7d184d68-9fa6-4dc5-8499-eff8875c2aea",
    "name": "Groups",
    "description": "*Overview*: Groups identifies the group membership of users. Groups without members, and users not in a group, will be included as a row of data with null values represented as \"NULL\".\n\n*What is a Row of Data?* There is one row of data for each unique combination of group and user pairing.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "51e4be20-4d6c-4cba-9ad3-bcaea958924a",
    "name": "Job Performance",
    "description": "*Overview*: Job Performance contains events and runtime information for background jobs initiated by scheduled tasks on the site. Jobs include extract refreshes, flow runs, and Bridge refreshes.\n\n*What is a Row of Data?* Each row of data corresponds to a unique background job and the related content item, providing information about job status, duration, and error details.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "c7a5a367-95b3-4f77-a0c3-ac1c6e3264b6",
    "name": "Viz Load Times",
    "description": "*Overview*: Viz Load Times contains the load time information for views on your site.\n\n*What is a Row of Data?* Each row of data corresponds to a request for a content item and the load duration, measured in seconds.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "4e7829b6-ad82-46ce-b0c0-9684862035d0",
    "name": "Site Content",
    "description": "*Overview*: Site Content provides essential governance information about the projects, data sources, workbooks, and views on a site. The data provided about a content item may be unique to its item type. Item types with unique fields are in folders with titles that correspond to their Item Type.\n\n_Note_: Users that connect to the Site Content data source will see data about all content items on the site, regardless of the permissions set for each item. Keep this in mind if you plan to distribute to non-administrative users.\n\n*What is a Row of Data?* Each row of data corresponds to a unique content item. Item LUID is the unique identifier for content items on a Tableau Online pod.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  },
  {
    "id": "8fbf558a-d712-4cdc-84c3-fff94ad25b67",
    "name": "Tokens",
    "description": "*Overview*: Tokens contains information about active, unexpired user tokens on the site, including personal access tokens (PATs), refresh tokens, and OAuth tokens. The data source allows site administrators to monitor token usage, ensuring that necessary tokens are renewed.\n\n*What is a Row of Data?* Each row of data corresponds to a combination of Token ID and Owner Email.",
    "project": {
      "name": "Admin Insights",
      "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe"
    }
  }
]
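A test could check the tool's return value by parsing the text payload and comparing data source names against an expected list. This is a rough sketch: the field names follow the output above, but `TextContent`, `DataSource`, and `missingDataSources` are hypothetical names, not identifiers from the PR.

```typescript
// Fields observed in the list_datasources output above; description is optional.
interface DataSource {
  id: string;
  name: string;
  description?: string;
  project: { name: string; id: string };
}

// Shape of the tool result as it appears in the log: a text part carrying JSON.
interface TextContent {
  type: 'text';
  text: string;
}

// Parses the text payload and returns the expected names that are missing from it.
function missingDataSources(content: TextContent, expectedNames: string[]): string[] {
  const dataSources = JSON.parse(content.text) as DataSource[];
  const actual = new Set(dataSources.map((ds) => ds.name));
  return expectedNames.filter((name) => !actual.has(name));
}

// Sample payload containing only the first entry from the output above.
const content: TextContent = {
  type: 'text',
  text: JSON.stringify([
    {
      id: '2d935df8-fe7e-4fd8-bb14-35eb4ba31d45',
      name: 'Superstore Datasource',
      project: { name: 'Samples', id: 'cbec32db-a4a2-4308-b5f0-4fc67322f359' },
    },
  ]),
};
const missing = missingDataSources(content, ['Superstore Datasource', 'TS Users']);
```

Comparing on names rather than ids keeps the check readable, at the cost of assuming names are unique on the site.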

@Copilot Copilot AI review requested due to automatic review settings October 3, 2025 18:41

Copilot AI left a comment

Pull Request Overview

Introduces an eval test framework for testing MCP tools using LLM-based scoring. The framework evaluates tool implementations across multiple dimensions including accuracy, completeness, relevance, clarity, and reasoning.

  • Adds eval tests for list_datasources and get_datasource_metadata tools with LLM-based evaluation
  • Restructures test directory from e2e/ to tests/ with separate e2e/ and eval/ subdirectories
  • Adds comprehensive documentation for setting up and running eval tests locally

Reviewed Changes

Copilot reviewed 31 out of 34 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| vitest.config.eval.ts | New Vitest configuration for eval tests with 60-second timeout |
| tests/eval/*.ts | Core eval test framework including API client, grading logic, and test cases |
| tests/testEnv.ts | Updated environment paths to use new tests/ directory structure |
| tests/e2e/*.ts | Updated import paths to reflect new directory structure |
| package.json | Added eval test script and OpenAI dependencies |
| docs/docs/developers/eval-tests.md | Comprehensive documentation for eval test setup and usage |
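
For orientation, a Vitest configuration with a 60-second timeout might look roughly like the sketch below. Only the timeout value comes from the summary above; the `include` glob is an assumption, and the actual vitest.config.eval.ts in this PR may set other options.

```typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Assumed glob for eval test files; the real pattern is not shown in this PR page.
    include: ['tests/eval/**/*.test.ts'],
    // 60 seconds, since each eval test waits on LLM and tool round trips.
    testTimeout: 60_000,
  },
});
```

Keeping the eval settings in a separate config file lets the slower eval run use a longer timeout without affecting other test runs.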

@anyoung-tableau anyoung-tableau merged commit cfa9d6b into main Oct 15, 2025
5 checks passed
@anyoung-tableau anyoung-tableau deleted the anyoung/eval branch October 15, 2025 17:42