vitest evals#1232

Merged
cpinn merged 22 commits into main from caitlin/vitest
Feb 26, 2026

Conversation

@cpinn
Contributor

@cpinn cpinn commented Jan 6, 2026

Adds the ability to create experiments by writing tests with vitest. Users can:

- Use existing Braintrust datasets or pass their own data to tests
- Use existing scorers or pass their own custom scorers to tests

Each describe block creates an experiment, and each test inside becomes its own span in the experiment.

See golden-ts-vitest-experiment-v* projects for some examples

Example:

import { initDataset } from "braintrust";

bt.describe("My experiment", async () => {
  const evalData = initDataset({
    project: "llm-evals",
    dataset: "qa-benchmark",
  }).fetchedData();
  // test.each accepts the fetched dataset array (expanded per record), or you can pass your own data array
  bt.test.each(await evalData)(
    "Q&A evaluation",
    {
      scorers: [
        Factuality, // from autoevals
        ({ output, expected }) => ({ // custom scorer
          name: "conciseness",
          score: output.length <= expected.length * 1.2 ? 1 : 0.7,
        }),
      ],
    },
    async (record) => {
      const { input, expected } = record;
      const answer = await llm.answer(input.question);
      return answer; 
    },
  );
});
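The "pass their own data" variant could look roughly like the following. To keep the snippet self-contained, the `bt` object here is a minimal local stand-in for the wrapped vitest API; all names are illustrative, not the actual SDK.

```typescript
// Minimal stand-in for bt.test.each to illustrate passing inline data and a
// custom scorer; this is a sketch, not the real braintrust/vitest wrapper.
type EvalRecord = { input: string; expected: string };
type Scorer = (args: { output: string; expected: string }) => {
  name: string;
  score: number;
};

const results: Array<{ name: string; score: number }> = [];

const bt = {
  test: {
    each:
      (records: EvalRecord[]) =>
      async (
        name: string,
        opts: { scorers: Scorer[] },
        fn: (record: EvalRecord) => Promise<string>,
      ) => {
        // Each record becomes its own test case (span), scored after it runs.
        for (const record of records) {
          const output = await fn(record);
          for (const scorer of opts.scorers) {
            results.push(scorer({ output, expected: record.expected }));
          }
        }
      },
  },
};

await bt.test.each([
  { input: "2 + 2", expected: "4" },
  { input: "3 + 3", expected: "6" },
])(
  "inline data",
  {
    scorers: [
      ({ output, expected }) => ({
        name: "exact",
        score: output === expected ? 1 : 0,
      }),
    ],
  },
  async ({ input }) => (input === "2 + 2" ? "4" : "6"),
);

console.log(results.map((r) => r.score)); // [ 1, 1 ]
```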

Collaborator

@ibolmo ibolmo left a comment

various questions


// indicate the project name the tests will be sent to
const bt = wrapVitest(
{ test, expect, describe, afterAll },
Collaborator

I'd recommend for users to:

import * as vitest from 'vitest';

const { describe, expect, test, ... } = wrapVitest(vitest);

simpler, less prone to error, and future proof if we add more functions to our support

metadata: { category: "math" },
tags: ["arithmetic"],
},
async ({ input, expected }) => {
Collaborator

TIL... so the context is re-inserted as an argument.

@@ -0,0 +1,46 @@
import tsconfigPaths from "vite-tsconfig-paths";
Collaborator

didn't expect to see this file in here 🤔

return originalDescribe(suiteName, () => {
// Lazily initialize experiment context on first access
let context: ExperimentContext | null = null;
const getOrCreateContext = (): ExperimentContext => {
Collaborator

Seems like we want to extract and reuse the same getOrCreateContext inside of test() calls.
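The extraction suggested here could be a small shared lazy initializer, so the describe() and test() wrappers resolve the same context instance. A minimal sketch with hypothetical names:

```typescript
// Hypothetical sketch: one lazy initializer shared by the describe() and
// test() wrappers, so both resolve the same ExperimentContext instance.
type ExperimentContext = {
  experimentName: string;
  datasetExamples: Map<string, string>; // test name -> example id
};

function lazy<T>(create: () => T): () => T {
  let value: T | undefined;
  // create() runs at most once; every later call returns the cached value.
  return () => (value ??= create());
}

const getOrCreateContext = lazy<ExperimentContext>(() => ({
  experimentName: "My experiment",
  datasetExamples: new Map(),
}));

// Both call sites get the same instance.
console.log(getOrCreateContext() === getOrCreateContext()); // true
```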

"Braintrust: vitestMethods.describe is required. Please pass in the describe function from vitest.",
);
}
if (!vitestMethods.expect) {
Collaborator

i didn't see a wrapExpect in wrapper.ts. I wonder if each expect is a scorer?

});

// If test function returns a value, log it as output
if (testResult !== undefined) {
Collaborator

I think if you traced(maybeFn || configOrFn, ...) you may have gotten this automatically?

scores: {
pass: 0,
},
metadata: {
Collaborator

you should probably just throw again. the traced() call should handle the error.
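The rethrow pattern suggested here could look like the following. `traced` below is a local stand-in for a tracing wrapper, not the actual braintrust API; the point is that the wrapper records the failure and the rethrow keeps the test runner's failure semantics.

```typescript
// traced() here is a local stand-in for a tracing wrapper; the real braintrust
// API differs. Log the failure on the span, then rethrow so the test runner
// still marks the test as failed.
type Span = { error?: string; scores?: { pass: number } };

async function traced<T>(span: Span, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    span.error = String(err); // record the failure on the span...
    span.scores = { pass: 0 };
    throw err; // ...then rethrow instead of swallowing and hand-logging scores
  }
}

const span: Span = {};
let rethrown = false;
try {
  await traced(span, async () => {
    throw new Error("boom");
  });
} catch {
  rethrown = true;
}
console.log(rethrown, span.scores?.pass); // true 0
```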

datasetExamples: Map<string, string>; // test name -> example id
}

// Global context holder (one per describe block)
Collaborator

what happens with concurrent calls i.e.

it.concurrent(
 describe(..., () => {
 })
);

it.concurrent(
 describe(..., () => {
 })
);

did you give currentExperiment a try?

Contributor Author

I modified how the context is created for experiments. I added some additional tests for the concurrent experiments.

@cpinn cpinn marked this pull request as ready for review February 11, 2026 22:48
@cpinn cpinn changed the title [WIP] vitest evals vitest evals Feb 18, 2026
@cpinn cpinn requested a review from ibolmo February 20, 2026 19:29
Collaborator

@ibolmo ibolmo left a comment

Small suggestions, not blocking.

async ({ input, expected }) => {
const result = 4;
logOutputs({ answer: result });
expect(result).toBe(expected);
Collaborator

are the expects scorers?

Collaborator

ah yes, could we try multiple expect() calls? I wonder how that would look in the bt logs.

what does the custom message look like in braintrust? i.e.

expect(result, 'equality').toBe(expected)

return {
test: wrappedTest,
it: wrappedTest,
expect: vitestMethods.expect,
Collaborator

ah so we don't wrap expects.

Like I mentioned earlier, it would be interesting to use expect(..., 'message'), where the message is the key for the output; or maybe we keep a counter/stack that each expect(output) is pushed onto. Then the output might be something like

expect(0).toBe(0);
expect('something', 'message').toBe('something');
expect('foo').toBe('bar');

then the event could be

{
  ...
  output: {
    0: output,
    'message': output,
    1: output,
  },
  scores: {
    0: 1,
    'message': 1,
    1: 0,
  },
}
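A self-contained sketch of that counter/key scheme (all names illustrative, not the SDK):

```typescript
// Hypothetical sketch: each expect() records its output under the optional
// message key, otherwise under a running numeric index.
type EventLog = {
  output: Record<string | number, unknown>;
  scores: Record<string | number, number>;
};

function makeRecordingExpect(event: EventLog) {
  let counter = 0;
  return (actual: unknown, message?: string) => ({
    toBe(expected: unknown) {
      const key = message ?? counter++;
      event.output[key] = actual;
      event.scores[key] = actual === expected ? 1 : 0;
    },
  });
}

const event: EventLog = { output: {}, scores: {} };
const expect = makeRecordingExpect(event);
expect(0).toBe(0);
expect("something", "message").toBe("something");
expect("foo").toBe("bar");
console.log(JSON.stringify(event.scores)); // {"0":1,"1":0,"message":1}
```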

Contributor Author

Ah yes, I didn't wrap the expect method, just the test methods. I was thinking people cared about getting the scoring, which does add its output to individual logs. Thinking about this flow a bit.

Contributor Author

I made a compromise and log named outputs automatically. Users can still log additional outputs as necessary. We'll see what the feedback is like.

pnpm dlx tsx openai.ts
```

### Vitest Golden Tests
Collaborator

great coverage

* expected: 'hola',
* metadata: { language: 'spanish' },
* },
* async ({ input, expected }) => {
Collaborator

i wonder if we need the trace() method 🤔

*
* bt.describe('Translation Tests', () => {
* bt.afterAll(async () => {
* await bt.flushExperiment(); // Flushes and displays experiment summary
Collaborator

I'd expect this to be done automatically 🤔. Looking at their docs, it doesn't seem like they require it to be explicit.

Contributor Author

@cpinn cpinn Feb 26, 2026

This is done automatically. The method is exposed for access but doesn't need to be called. I thought I removed it from the examples and tests; I might have missed this one.
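The auto-registration described here can be sketched as follows; `afterAll` and `flushExperiment` below are local stand-ins rather than the real vitest/braintrust APIs.

```typescript
// Illustrative sketch: the describe wrapper registers an afterAll hook
// itself, so users never need to call flushExperiment() explicitly.
const hooks: Array<() => Promise<void>> = [];
const afterAll = (fn: () => Promise<void>) => {
  hooks.push(fn);
};

let flushed = false;
async function flushExperiment(): Promise<void> {
  // Real version would flush spans and print the experiment summary.
  flushed = true;
}

function wrappedDescribe(name: string, body: () => void): void {
  body();
  afterAll(flushExperiment); // registered automatically on the user's behalf
}

wrappedDescribe("Translation Tests", () => {
  /* tests would register here */
});

// Simulate vitest running the suite's afterAll hooks:
for (const hook of hooks) await hook();
console.log(flushed); // true
```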

@cpinn cpinn merged commit 2111f87 into main Feb 26, 2026
43 of 44 checks passed
@cpinn cpinn deleted the caitlin/vitest branch February 26, 2026 22:34
AbhiPrasad added a commit that referenced this pull request Feb 27, 2026