Getting Started

Add vitest-evals to an existing Vitest project, adapt your agent
with a harness, write normal test bodies around `run(input)`, and
publish the same results locally and in GitHub Actions.

Install the core package plus the harness that matches your app runtime.
Swap the harness package in the command if you use OpenAI Agents or pi-ai.

- `@vitest-evals/harness-ai-sdk`
- `@vitest-evals/harness-openai-agents`
- `@vitest-evals/harness-pi-ai`

Keep evals on their own command and Vitest config. The separate config keeps
longer provider timeouts, eval-only includes, reporter setup, and replay
defaults out of unit tests.

Then add the eval config. The reporter line gives local runs the eval
summary, while GitHub later adds the JSON reporter beside it.

A harness is the adapter between your app and the eval runner. It executes
the app once, returns typed output for assertions, and normalizes
transcript, tool calls, usage, artifacts, and errors for judges and reports.

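
To make the adapter role concrete, here is a minimal sketch in plain
TypeScript. It does not use the vitest-evals API: `RunSketch`, `AppResult`,
and `runOnce` are hypothetical names invented for illustration, and the real
`HarnessRun` shape likely carries more fields than the ones shown.

```typescript
// Hypothetical shapes for illustration; not the vitest-evals type definitions.
type ToolCallRecord = { name: string; arguments: unknown; result?: unknown };

type RunSketch<Output> = {
  output?: Output;
  toolCalls: ToolCallRecord[];
  usage: { inputTokens: number; outputTokens: number };
  errors: string[];
};

// What a hypothetical app might return from one execution.
type AppResult = {
  text: string;
  calls: Array<{ tool: string; args: unknown; result: unknown }>;
  tokens: { in: number; out: number };
};

// The adapter: execute the app once, then map its result into the run shape.
async function runOnce<Output>(
  app: (input: string) => Promise<AppResult>,
  parse: (text: string) => Output,
  input: string,
): Promise<RunSketch<Output>> {
  try {
    const raw = await app(input);
    return {
      output: parse(raw.text),
      toolCalls: raw.calls.map((c) => ({
        name: c.tool,
        arguments: c.args,
        result: c.result,
      })),
      usage: { inputTokens: raw.tokens.in, outputTokens: raw.tokens.out },
      errors: [],
    };
  } catch (error) {
    // Failures become data so judges and reports can still inspect the run.
    return {
      toolCalls: [],
      usage: { inputTokens: 0, outputTokens: 0 },
      errors: [error instanceof Error ? error.message : String(error)],
    };
  }
}
```

The point of the sketch is the split of responsibilities: the app returns
whatever is natural for production, and only the adapter knows about the
eval-facing shape.
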
The examples below show the production-facing agent shape first, then the
eval-only harness file that adapts it. In your project, keep the agent close
to application code and put harness files next to evals.

Start with the section for your runtime. Each adapter returns the same
eval-facing `HarnessRun` shape, so the eval suite you write later
stays mostly the same.

Use the AI SDK harness when your production code already calls
`generateText`, `streamText`, or an agent wrapper around
them. The compatible shape is small: expose `run(input, runtime)`
or `generate(input, runtime)`, and read tools from
`runtime.tools` so the harness can capture and replay local tool
calls.

- Pass local tools in `tools`; those are the calls
  the harness can normalize for `ToolCallJudge()`.
- Use `output` when your app returns a raw AI SDK result and tests
  need a domain value such as `{ status: "approved" }`.
- Omit `output` if your app entrypoint already returns
  `{ output }` or a full `HarnessRun`.

The first file is the kind of production-facing agent shape the harness
expects. The second file is the eval-only adapter.

The harness provides the replay-aware tool map, chooses which tools can be
recorded, and converts the AI SDK result into the output your tests assert
on.

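
The runtime contract described above can be sketched without any SDK.
Everything here is an assumption made for illustration: `refundAgent`,
`recordingRuntime`, and the `Runtime` type are hypothetical, the model call
is left out entirely, and a real harness would add replay on top of
recording.

```typescript
// Hypothetical types; the real AI SDK and harness types are richer.
type Tool = (args: Record<string, unknown>) => Promise<unknown>;
type Runtime = { tools: Record<string, Tool> };
type RecordedCall = {
  name: string;
  args: Record<string, unknown>;
  result: unknown;
};

// Production-facing shape: the agent takes its tools from the runtime
// instead of importing them directly.
const refundAgent = {
  async run(input: string, runtime: Runtime): Promise<{ status: string }> {
    const lookupInvoice = runtime.tools["lookupInvoice"];
    if (lookupInvoice === undefined) {
      throw new Error("lookupInvoice tool is missing");
    }
    const invoice = (await lookupInvoice({ invoiceId: input })) as {
      refundable: boolean;
    };
    return { status: invoice.refundable ? "approved" : "denied" };
  },
};

// Eval-side sketch: wrap each tool so every completed call is recorded.
function recordingRuntime(
  tools: Record<string, Tool>,
  calls: RecordedCall[],
): Runtime {
  const wrapped: Record<string, Tool> = {};
  for (const [name, tool] of Object.entries(tools)) {
    wrapped[name] = async (args) => {
      const result = await tool(args);
      calls.push({ name, args, result });
      return result;
    };
  }
  return { tools: wrapped };
}
```

Because the agent never imports its tools directly, the same `run` function
works with live tools in production and with wrapped tools in evals.
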
Use the OpenAI Agents harness when your app already owns an
`Agent` and runs it with a `Runner`. The harness keeps
that model intact: it creates or receives the agent and runner, calls
`runner.run(agent, input, options)`, captures output items and
local function tool activity, then lets `output` select the typed
value for test assertions.

- Use `agent` and `runner` factories when each eval case
  should get fresh state.
- Set run options in `runOptions`, not inside the test body.
- Key `toolReplay` by the OpenAI function tool name, such as
  `lookup_invoice`.
- Use `output` to parse `result.finalOutput` or to
  project a structured SDK output into the app-facing value.

The agent file stays close to normal OpenAI Agents code; the harness file
only describes how evals should run and normalize it.

The harness keeps the runner setup in one place, applies run options for
every case, and parses the final output into your domain type.

Use the pi-ai harness when your app exposes a pi-ai agent, a
toolset, or an app entrypoint that can accept a harness runtime.
Compatible agents usually implement `run(input, runtime)`. The
harness provides replay-aware tools through `runtime.tools` and a
lightweight event recorder through `runtime.events`.

- Pass `tools` only when the app hides the tool surface.
- Call `runtime.events.assistant(...)` when the pi-ai flow returns
  text, provider, model, or usage data that should appear in reports.
- Return `{ output }` from `run` when tests should
  assert on a parsed domain object instead of raw assistant text.
- Key `toolReplay` by the pi-ai tool name, such as
  `lookupInvoice`.

The agent file shows the minimum runtime contract. The harness file wires
that contract into eval execution.

The harness can create the agent per case, pass metadata into your agent
factory, and opt individual pi-ai tools into replay.

Use `createHarness(...)` when the first-party adapters do not fit.
This is the right choice for workflow engines, RAG pipelines, CLIs, or
service functions that can return JSON-safe output, messages, tool calls, or
usage without going through a supported SDK adapter.

- Type `createHarness<Input, Output, Metadata>`
  so `run(input, { metadata })`, `result.output`,
  and judges stay typed.
- Return `output` for ordinary Vitest assertions and
  `toolCalls` for deterministic tool judges.
- Use `setArtifact` for JSON-safe debug details that belong in
  reports but not in the app output contract.
- Return a full `HarnessRun` only when you need complete control
  over messages, usage, timings, artifacts, and errors.

A custom harness can wrap a plain application function. Keep that function
focused on production behavior, then normalize only the eval-facing details
in the harness file.

The core package normalizes this lightweight result into the same
`HarnessRun` shape used by the first-party adapters.

Bind one harness per suite. Put the user prompt or event in
`input`, expected values in `metadata`, and use ordinary
Vitest assertions on `result.output`.

At this point the eval suite should run locally with the dedicated Vitest
config. Use replay modes only when you want to record or enforce cached tool
calls.

Judges score the run that the harness already captured. Suite-level judges
run after each `run(...)`; explicit judge assertions are useful for
one-off checks or rubric scores you want to record without failing.

- `ToolCallJudge()` reads `expectedTools` from metadata or matcher options and
  checks tool names, order, and arguments.
- `StructuredOutputJudge()` reads `expected` from metadata or matcher options
  and checks JSON-safe output fields.

Use explicit judge assertions when a test needs an extra judge or when you
want to record a score without making that score fail the test.

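
As a rough mental model, a tool-call judge compares the recorded calls with
the expected ones. The sketch here is not the `ToolCallJudge()`
implementation: `judgeToolCalls`, `ToolCall`, and `Verdict` are hypothetical,
and treating expected arguments as a subset match is an assumption.

```typescript
// Hypothetical shapes; the library's judge and tool-call types may differ.
type ToolCall = { name: string; arguments: Record<string, unknown> };
type Verdict = { score: number; reason: string };

// Compare recorded tool calls against expectations: names, order, arguments.
function judgeToolCalls(actual: ToolCall[], expected: ToolCall[]): Verdict {
  if (actual.length !== expected.length) {
    return {
      score: 0,
      reason: `expected ${expected.length} calls, got ${actual.length}`,
    };
  }
  for (let i = 0; i < expected.length; i++) {
    const want = expected[i]!;
    const got = actual[i]!;
    // Position matters: the call at index i must have the expected name.
    if (got.name !== want.name) {
      return {
        score: 0,
        reason: `call ${i}: expected ${want.name}, got ${got.name}`,
      };
    }
    // Every expected argument must match; extra actual arguments are ignored.
    for (const [key, value] of Object.entries(want.arguments)) {
      if (JSON.stringify(got.arguments[key]) !== JSON.stringify(value)) {
        return { score: 0, reason: `call ${i}: argument ${key} differs` };
      }
    }
  }
  return { score: 1, reason: "tool calls match" };
}
```

Returning a verdict instead of throwing is what makes it possible to record
a score without failing the test.
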
Replay lets evals keep testing real agent behavior without paying for every
external dependency on every run. Use it for local tool calls that are
expensive, slow, rate-limited, or unstable but still useful to preserve in
the eval trace: web search requests, retrieval calls, third-party APIs,
browser fetches, or internal service lookups.

Opt in per tool from the harness, sanitize anything sensitive or
high-cardinality, commit useful recordings alongside the evals, then choose
a mode with `VITEST_EVALS_REPLAY_MODE`. Keep live provider model
calls live unless your app exposes them as local tools.

| Mode | Behavior |
|---|---|
| `off` | Call live tools and do not read or write recordings. |
| `record` | Call live tools and overwrite recordings. |
| `auto` | Replay when a recording exists; otherwise call live and record. |
| `strict` | Require an existing recording and fail when one is missing. |

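
The four modes can be read as a small decision function. This sketch only
restates the table in code: `planToolCall` and `ReplayPlan` are hypothetical
names, not library exports.

```typescript
type ReplayMode = "off" | "record" | "auto" | "strict";
type ReplayPlan = {
  action: "live" | "replay" | "fail";
  writeRecording: boolean;
};

// Decide what to do for one tool call, given the mode and cache state.
function planToolCall(mode: ReplayMode, hasRecording: boolean): ReplayPlan {
  // off: always live, recordings are neither read nor written.
  if (mode === "off") return { action: "live", writeRecording: false };
  // record: always live, and the recording is overwritten.
  if (mode === "record") return { action: "live", writeRecording: true };
  // auto and strict both replay when a recording exists.
  if (hasRecording) return { action: "replay", writeRecording: false };
  // No recording: auto falls back to a live call and records it; strict fails.
  return mode === "auto"
    ? { action: "live", writeRecording: true }
    : { action: "fail", writeRecording: false };
}
```
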
Use a replay config object when the cache key needs to ignore unstable
values, when recordings need a version, or when outputs need redaction.

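
One way to picture those three concerns is a cache key built from stable
arguments plus a version, and a redaction pass over outputs before they are
saved. The names below (`ReplaySketchConfig`, `cacheKey`, `redact`) are
hypothetical and the library's actual option names may differ.

```typescript
// Hypothetical config shape, not the library's replay option names.
type ReplaySketchConfig = {
  version: number; // bump to invalidate older recordings
  ignoreKeys: string[]; // argument names left out of the cache key
  redactKeys: string[]; // output fields masked before saving
};

// Build a deterministic key from the tool name, version, and stable arguments.
function cacheKey(
  tool: string,
  args: Record<string, unknown>,
  config: ReplaySketchConfig,
): string {
  const stable = Object.keys(args)
    .filter((key) => !config.ignoreKeys.includes(key))
    .sort() // sort so argument order does not change the key
    .map((key) => [key, args[key]]);
  return JSON.stringify({ tool, version: config.version, args: stable });
}

// Return a shallow copy of the output with sensitive fields masked.
function redact(
  output: Record<string, unknown>,
  config: ReplaySketchConfig,
): Record<string, unknown> {
  const copy: Record<string, unknown> = { ...output };
  for (const key of config.redactKeys) {
    if (key in copy) copy[key] = "[redacted]";
  }
  return copy;
}
```

With this shape, two calls that differ only in an ignored argument such as a
request ID map to the same recording, while bumping `version` forces a fresh
one.
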
Emit Vitest JSON, then run the action with `if: always()` so results are
published even when evals fail. The action reads the JSON file and publishes
summaries, annotations, and optional Check Runs.

For sharded evals, upload one JSON result per matrix job and publish once
from a reducer job.

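
To illustrate the reducer idea, the sketch below combines per-shard
summaries into one. It is not the action's implementation. `ShardReport` and
`mergeShardReports` are hypothetical, and the field names assume the
Jest-style summary that Vitest's JSON reporter writes, so check them against
your own output files.

```typescript
// Assumed subset of the Jest-style fields in a Vitest JSON report.
type ShardReport = {
  numTotalTests: number;
  numPassedTests: number;
  numFailedTests: number;
  success: boolean;
  testResults: unknown[];
};

// Combine per-shard reports into one summary for a single publish step.
function mergeShardReports(shards: ShardReport[]): ShardReport {
  const merged: ShardReport = {
    numTotalTests: 0,
    numPassedTests: 0,
    numFailedTests: 0,
    // An empty set of shards is treated as a failure, not a silent pass.
    success: shards.length > 0,
    testResults: [],
  };
  for (const shard of shards) {
    merged.numTotalTests += shard.numTotalTests;
    merged.numPassedTests += shard.numPassedTests;
    merged.numFailedTests += shard.numFailedTests;
    // One failing shard fails the combined result.
    merged.success = merged.success && shard.success;
    merged.testResults.push(...shard.testResults);
  }
  return merged;
}
```
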