Introduce eval test framework #127
Pull Request Overview
Introduces an eval test framework for testing MCP tools using LLM-based scoring. The framework evaluates tool implementations across multiple dimensions including accuracy, completeness, relevance, clarity, and reasoning.
- Adds eval tests for `list_datasources` and `get_datasource_metadata` tools with LLM-based evaluation
- Restructures test directory from `e2e/` to `tests/` with separate `e2e/` and `eval/` subdirectories
- Adds comprehensive documentation for setting up and running eval tests locally
Reviewed Changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vitest.config.eval.ts | New Vitest configuration for eval tests with 60-second timeout |
| tests/eval/*.ts | Core eval test framework including API client, grading logic, and test cases |
| tests/testEnv.ts | Updated environment paths to use new tests/ directory structure |
| tests/e2e/*.ts | Updated import paths to reflect new directory structure |
| package.json | Added eval test script and OpenAI dependencies |
| docs/docs/developers/eval-tests.md | Comprehensive documentation for eval test setup and usage |
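
For readers unfamiliar with Vitest, a dedicated config with a 60-second timeout might look roughly like the sketch below. This is not the file from this PR: the `include` glob is an assumption based on the `tests/eval/*.ts` entry in the table, and only `testTimeout` reflects what the table describes.

```typescript
// vitest.config.eval.ts (illustrative sketch, not the file added in this PR)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Assumed glob; the table only says the eval tests live under tests/eval/.
    include: ['tests/eval/**/*.test.ts'],
    // 60-second timeout per test, since each eval waits on LLM calls.
    testTimeout: 60_000,
  },
});
```

A config like this can be selected with Vitest's `--config` flag, for example `npx vitest run --config vitest.config.eval.ts`. The actual script name added to package.json is not shown here.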
This PR introduces eval tests for a few of the tools. For details on eval tests and how to run them locally, see docs/developer/eval-tests.md, added in this PR. Each test uses LLM grading and also performs "human" grading, in which the test asserts that the expected tool(s) are called and return the expected output.
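The two kinds of grading described above can be sketched as follows. This is a minimal illustration, not the framework's code: the `Grade` and `ToolExecution` shapes and both helper names are hypothetical, chosen to mirror the fields visible in the example output.

```typescript
// Hypothetical shapes for illustration; the PR's real types may differ.
interface Grade {
  accuracy: number;
  completeness: number;
  relevance: number;
  clarity: number;
  reasoning: number;
  comments: string;
}

interface ToolExecution {
  name: string;
  arguments: Record<string, unknown>;
  output: string;
}

const DIMENSIONS = ['accuracy', 'completeness', 'relevance', 'clarity', 'reasoning'] as const;

// LLM grading: report every dimension whose score falls below the minimum.
function lowScoringDimensions(grade: Grade, minScore: number): string[] {
  return DIMENSIONS.filter((dimension) => grade[dimension] < minScore);
}

// "Human" grading: look up the execution of the tool we expected to be called.
function findToolExecution(
  executions: ToolExecution[],
  toolName: string,
): ToolExecution | undefined {
  return executions.find((execution) => execution.name === toolName);
}
```

In a test, one would typically assert that `lowScoringDimensions(grade, threshold)` is empty and that `findToolExecution(executions, 'list_datasources')` is defined, then compare its `output` against the expected data. The threshold value is a choice left to the test author.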
Example test output
Evaluating prompt: List the data sources that are available on my Tableau site. Do not perform any analysis on the the list of data sources, just show the list.
Here are the data sources available on your Tableau site:
Superstore Datasource
TS Users
TS Events
Permissions
Subscriptions
Groups
Job Performance
Viz Load Times
Site Content
Tokens
```json
{
  "accuracy": 5,
  "completeness": 5,
  "relevance": 5,
  "clarity": 5,
  "reasoning": 5,
  "comments": "<redacted and provided below>"
}
```

"comments": "This is an excellent response that perfectly fulfills the user's request. The answer provides a clear, well-structured list of 10 data sources available on the Tableau site, organized with names, projects, and descriptions where applicable. The LLM correctly interpreted the instruction to simply present the list without performing analysis, avoiding any unnecessary commentary or evaluation. The formatting is clean and easy to scan, using numbered items with bold names and consistent structure. All information appears accurate and relevant to Tableau's standard data sources, particularly the Admin Insights data sources which are typical for Tableau sites. The response demonstrates strong reasoning by including contextual information (project names and descriptions) that helps identify each data source without overstepping into analysis territory."

🛠️ tool executions:
🔨 list_datasources
👉 arguments: {}
👈 output: {"type":"text","text":"<redacted and provided below as formatted json"}

👈 output:
[ { "id": "2d935df8-fe7e-4fd8-bb14-35eb4ba31d45", "name": "Superstore Datasource", "project": { "name": "Samples", "id": "cbec32db-a4a2-4308-b5f0-4fc67322f359" } }, { "id": "7cdfa922-a5e7-4534-bcac-ab560fafbaa9", "name": "TS Users", "description": "*Overview*: TS Users contains data about your users such as remaining licenses, site roles, and content ownership metrics.\n\n*What is a Row of Data?* Each row of data corresponds to a unique User on your Tableau Online site.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "3df27e47-5f62-4ac0-86d7-49711c880472", "name": "TS Events", "description": "*Overview*: TS Events functions as a primary audit data source. It contains data about various events happening on your site, including sign-ins, publishes, and accessed views.\n\n*What is a Row of Data?* Each row of data correlates with a single event that occurs on the Tableau Online site.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "77ba5e2c-ad01-4431-81da-ff3f736dd2fe", "name": "Permissions", "description": "*Overview*: Permissions contains the effective permissions for all users and content on the site.\n*What is a Row of Data?* Each row of data corresponds to a combination of user, content item, and permission capability.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "ba1da5d9-e92b-4ff2-ad91-4238265d877c", "name": "Subscriptions", "description": "*Overview*: Subscriptions contains detailed information for subscriptions on the site, such as creator and recipient, content item name, job status, and schedules. 
A user following a Pulse metric is also considered a subscription for this data source.\n*What is a Row of Data?* Each row of data represents a user who is subscribed to a view or follows a Pulse metric.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "7d184d68-9fa6-4dc5-8499-eff8875c2aea", "name": "Groups", "description": "*Overview*: Groups identifies the group membership of users. Groups without members, and users not in a group, will be included as a row of data with null values represented as \"NULL\".\n\n*What is a Row of Data?* There is one row of data for each unique combination of group and user pairing.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "51e4be20-4d6c-4cba-9ad3-bcaea958924a", "name": "Job Performance", "description": "*Overview*: Job Performance contains events and runtime information for background jobs initiated by scheduled tasks on the site. Jobs include extract refreshes, flow runs, and Bridge refreshes.\n\n*What is a Row of Data?* Each row of data corresponds to a unique background job and the related content item, providing information about job status, duration, and error details.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "c7a5a367-95b3-4f77-a0c3-ac1c6e3264b6", "name": "Viz Load Times", "description": "*Overview*: Viz Load Times contains the load time information for views on your site.\n\n*What is a Row of Data?* Each row of data corresponds to a request for a content item and the load duration, measured in seconds.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "4e7829b6-ad82-46ce-b0c0-9684862035d0", "name": "Site Content", "description": "*Overview*: Site Content provides essential governance information about the projects, data sources, workbooks, and views on a site. 
The data provided about a content item may be unique to its item type. Item types with unique fields are in folders with titles that correspond to their Item Type.\n\n_Note_: Users that connect to the Site Content data source will see data about all content items on the site, regardless of the permissions set for each item. Keep this in mind if you plan to distribute to non-administrative users.\n\n*What is a Row of Data?* Each row of data corresponds to a unique content item. Item LUID is the unique identifier for content items on a Tableau Online pod.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } }, { "id": "8fbf558a-d712-4cdc-84c3-fff94ad25b67", "name": "Tokens", "description": "*Overview*: Tokens contains information about active, unexpired user tokens on the site, including personal access tokens (PATs), refresh tokens, and OAuth tokens. The data source allows site administrators to monitor token usage, ensuring that necessary tokens are renewed.\n\n*What is a Row of Data?* Each row of data corresponds to a combination of Token ID and Owner Email.", "project": { "name": "Admin Insights", "id": "4862efd9-3c24-4053-ae1f-18caf18b6ffe" } } ]
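
In the example output, the grader wraps its scores in a json code fence. A minimal sketch of how such a reply could be parsed and validated is shown below. The helper name `parseGrade` is hypothetical, and the 1 to 5 score range is an assumption inferred from the scores of 5 above, not something this PR states.

```typescript
// Build the three-backtick fence marker programmatically to keep this example readable.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE);
const SCORE_KEYS = ['accuracy', 'completeness', 'relevance', 'clarity', 'reasoning'];

// Hypothetical helper: extract the numeric scores from a grader reply.
// Accepts either a fenced JSON block or a bare JSON object.
function parseGrade(reply: string): Record<string, number> {
  const match = reply.match(FENCED_BLOCK);
  const parsed: unknown = JSON.parse((match ? match[1] : reply).trim());
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Grader reply is not a JSON object');
  }
  const fields = parsed as Record<string, unknown>;
  const scores: Record<string, number> = {};
  for (const key of SCORE_KEYS) {
    const value = fields[key];
    // Assumed scale of 1 to 5; adjust if the grading prompt uses another range.
    if (typeof value !== 'number' || value < 1 || value > 5) {
      throw new Error(`Missing or out-of-range score for "${key}"`);
    }
    scores[key] = value;
  }
  return scores;
}
```

Validating the reply before asserting on it keeps a malformed grader response from surfacing as a confusing score failure; the test fails with a parse error instead.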