50 changes: 50 additions & 0 deletions .github/workflows/eval-py-uv.yml
@@ -0,0 +1,50 @@
+name: Run Python evals
+
+on:
+  push:
+    # files:
+    #   - 'test-eval/**'
+
+permissions:
+  pull-requests: write
+  contents: read
+
+jobs:
+  eval:
+    name: Run Python evals
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        id: checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          submodules: "recursive"
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.12" # TODO: Matrix test different versions
+
+      - name: Install dependencies
+        run: |
+          cd test-eval-py
+          uv lock --check
+          uv sync --no-dev
+
+      - name: Run Evals
+        uses: ./
+        with:
+          api_key: ${{ secrets.BRAINTRUST_API_KEY }}
+          root: test-eval-py
+          runtime: python
+          package_manager: uv
+
+      # - name: Start terminal session
+      #   uses: mxschmitt/action-tmate@v3
+      #   with:
+      #     limit-access-to-actor: true
13 changes: 8 additions & 5 deletions README.md
@@ -22,11 +22,13 @@ You can configure the following variables:
 - `paths`: Specific paths, relative to the root, containing evals you'd like to
   run.
 - `runtime`: Either `node` or `python`
+- `package_manager`: Either `npm`, `pnpm`, or `yarn` for a `node` runtime, or
+  `pip` or `uv` for a `python` runtime.
 - `use_proxy`: Either `true` or `false`. If set, `OPENAI_BASE_URL` will be set
   to `https://braintrustproxy.com/v1`, which will automatically cache repetitive
   LLM calls and run your evals faster. Defaults to `true`.
-- `terminate_on_failure`: Either `true` or `false`. If set to `true`, the evaluation
-  process will stop when an error occurs. Defaults to `false`.
+- `terminate_on_failure`: Either `true` or `false`. If set to `true`, the
+  evaluation process will stop when an error occurs. Defaults to `false`.

 ## Full example

@@ -82,9 +84,10 @@ jobs:

 To see examples of fully configured templates, see the `examples` directory:

-- [`node with npm`](examples/npm.yml)
-- [`node with pnpm`](examples/pnpm.yml)
-- [`python`](examples/python.yml)
+- [`node with npm`](examples/node/npm.yml)
+- [`node with pnpm`](examples/node/pnpm.yml)
+- [`python with pip`](examples/python/pip.yml)
+- [`python with uv`](examples/python/uv.yml)

 ## How it works

10 changes: 8 additions & 2 deletions action.yml
@@ -23,6 +23,12 @@ inputs:
   runtime:
     description: "The runtime to use for evals. Valid values: node, python."
     required: true
+  package_manager:
+    description:
+      "The package manager to use for evals. Valid values: npm, pnpm, yarn, pip,
+      or uv depending on the runtime."
+    required: false
+    default: ""

Contributor:
Should we define a default? It looks like if `package_manager` is `""`, it goes into an empty case statement.

Contributor Author:
Yeah, it's a bit odd, but it's for backward compatibility. I could try without the default `""`, but ultimately the code will fall back to `""` anyway because of the zod parsing.

Contributor:
Ahh, gotcha! Makes sense.

   use_proxy:
     description:
       "Whether to use the Braintrust proxy (to cache LLM calls). Set to 'true'
@@ -31,8 +37,8 @@ inputs:
     default: "true"
   terminate_on_failure:
     description:
-      "Whether to terminate the evaluation process when an error occurs. Set to 'true'
-      or 'false'."
+      "Whether to terminate the evaluation process when an error occurs. Set to
+      'true' or 'false'."
     required: false
     default: "false"
   github_token:
58 changes: 29 additions & 29 deletions eval/dist/index.js

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions eval/dist/index.js.map

Large diffs are not rendered by default.

45 changes: 34 additions & 11 deletions eval/src/braintrust.ts
@@ -17,6 +17,7 @@ function snakeToCamelCase(str: string) {
 }

 async function runCommand(command: string, onSummary: OnSummaryFn) {
+  core.info(`> $ ${command}`);
   return new Promise((resolve, reject) => {
     const process = execSync(command);

@@ -76,18 +77,40 @@ export async function runEval(args: Params, onSummary: OnSummaryFn) {
   // Change working directory
   process.chdir(path.resolve(root));

-  let command: string;
   const terminateFlag = terminate_on_failure ? "--terminate-on-failure" : "";

-  switch (args.runtime) {
-    case "node":
-      command = `npx braintrust eval --jsonl ${terminateFlag} ${paths}`;
-      break;
-    case "python":
-      command = `braintrust eval --jsonl ${terminateFlag} ${paths}`;
-      break;
-    default:
-      throw new Error(`Unsupported runtime: ${args.runtime}`);
-  }
+  const baseCommand = (() => {
+    switch (args.runtime.toLowerCase().trim()) {
+      case "node":
+        switch (args.package_manager) {
+          case "":

Contributor:
Wondering if a default should be defined and this case removed. What happens in this empty case?

Contributor Author (@ibolmo, Jul 7, 2025):
Yeah, for backward compatibility it'll be whatever we were doing before: for node it'll be `npx` and for python it'll be `pip`. Is that what you had in mind? Right now the empty case is just a switch case aliased to `npm`.

Contributor:
Yeet, I forgot how case statements fall through when there's no return. This makes sense.

+          case "npm":
+            return "npx braintrust";
+          case "pnpm":
+            return "pnpm dlx braintrust";
+          default:
+            throw new Error(
+              `Unsupported package manager: ${args.package_manager}`,
+            );
+        }
+      case "python":
+        switch ((args.package_manager || "").toLowerCase().trim()) {
+          case "":
+          case "pip":
+            return `braintrust`;
+          case "uv":
+            return `uv run braintrust`;
+          default:
+            throw new Error(
+              `Unsupported package manager: ${args.package_manager}`,
+            );
+        }
+      default:
+        throw new Error(`Unsupported runtime: ${args.runtime}`);
+    }
+  })();
+
+  const command = `${baseCommand} eval --jsonl ${terminateFlag} ${paths}`;
+
   await runCommand(command, onSummary);
 }
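
The fall-through behaviour discussed in the review thread can be seen in isolation with a small sketch. This is not the action's code: `resolveBaseCommand` is a hypothetical standalone function that mirrors the nested switch above, and it normalizes the package manager for both runtimes, which the diff only does for `python`.

```typescript
// Hypothetical standalone restatement of the command selection in runEval.
// An empty package manager has no case body, so it falls through to the
// legacy default for that runtime (npm for node, pip for python).
function resolveBaseCommand(runtime: string, packageManager: string): string {
  const manager = packageManager.toLowerCase().trim();
  switch (runtime.toLowerCase().trim()) {
    case "node":
      switch (manager) {
        case "": // falls through to "npm"
        case "npm":
          return "npx braintrust";
        case "pnpm":
          return "pnpm dlx braintrust";
        default:
          throw new Error(`Unsupported package manager: ${packageManager}`);
      }
    case "python":
      switch (manager) {
        case "": // falls through to "pip"
        case "pip":
          return "braintrust";
        case "uv":
          return "uv run braintrust";
        default:
          throw new Error(`Unsupported package manager: ${packageManager}`);
      }
    default:
      throw new Error(`Unsupported runtime: ${runtime}`);
  }
}
```

Under this sketch, `resolveBaseCommand("node", "")` and `resolveBaseCommand("node", "npm")` yield the same command, which is what keeps workflows that never set `package_manager` working as before.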
61 changes: 44 additions & 17 deletions eval/src/main.ts
@@ -6,23 +6,49 @@ import { ExperimentSummary } from "braintrust";
 import { capitalize } from "@braintrust/core";
 import { z } from "zod";

-const paramsSchema = z.strictObject({
-  api_key: z.string(),
-  root: z.string(),
-  paths: z.string(),
-  runtime: z.enum(["node", "python"]),
-  use_proxy: z
-    .string()
-    .toLowerCase()
-    .transform(x => JSON.parse(x))
-    .pipe(z.boolean()),
-  terminate_on_failure: z
-    .string()
-    .toLowerCase()
-    .transform(x => JSON.parse(x))
-    .pipe(z.boolean())
-    .default("false"),
-});
+const nodeManagers = ["npm", "pnpm"];
+const pythonManagers = ["pip", "uv"];
+
+const paramsSchema = z
+  .strictObject({
+    api_key: z.string(),
+    root: z.string(),
+    paths: z.string(),
+    runtime: z.enum(["node", "python"]),
+    package_manager: z
+      .enum(["", ...nodeManagers, ...pythonManagers])
+      .describe("The preferred package manager for the runtime selected")
+      .default(""),
+    use_proxy: z
+      .string()
+      .toLowerCase()
+      .transform(x => JSON.parse(x))
+      .pipe(z.boolean()),
+    terminate_on_failure: z
+      .string()
+      .toLowerCase()
+      .transform(x => JSON.parse(x))
+      .pipe(z.boolean())
+      .default("false"),
+  })
+  .refine(
+    data => {
+      if (data.package_manager === "") {
+        return true;
+      }
+      if (data.runtime === "node") {
+        return nodeManagers.includes(data.package_manager as any);
+      }
+      if (data.runtime === "python") {
+        return pythonManagers.includes(data.package_manager as any);
+      }
+      return false;
+    },
+    {
+      message: "Package manager must match the selected runtime",
+      path: ["package_manager"], // This will show the error on the package_manager field
+    },
+  );
 export type Params = z.infer<typeof paramsSchema>;

 const TITLE = "## Braintrust eval report\n";
@@ -37,6 +63,7 @@ async function main(): Promise<void> {
     root: core.getInput("root"),
     paths: core.getInput("paths"),
     runtime: core.getInput("runtime"),
+    package_manager: core.getInput("package_manager"),
     use_proxy: core.getInput("use_proxy"),
     terminate_on_failure: core.getInput("terminate_on_failure"),
   });
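
The cross-field rule that `.refine()` adds to the schema can be restated without zod, which may make it easier to see which runtime and package manager pairs are accepted. This is a minimal sketch, not the action's code: `managersByRuntime` and `isValidPairing` are hypothetical names, and the allowed values are copied from `nodeManagers` and `pythonManagers` in the diff.

```typescript
// Hypothetical, dependency-free restatement of the refine() rule.
// The allowed values mirror nodeManagers and pythonManagers from the diff.
const managersByRuntime = new Map<string, readonly string[]>([
  ["node", ["npm", "pnpm"]],
  ["python", ["pip", "uv"]],
]);

function isValidPairing(runtime: string, packageManager: string): boolean {
  // An unset package manager is always accepted so that existing
  // workflows keep their previous behaviour.
  if (packageManager === "") {
    return true;
  }
  const allowed = managersByRuntime.get(runtime);
  return allowed !== undefined && allowed.includes(packageManager);
}
```

For example, `isValidPairing("python", "uv")` is accepted while `isValidPairing("node", "uv")` is rejected, which corresponds to the "Package manager must match the selected runtime" error in the schema.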
File renamed without changes.
1 change: 1 addition & 0 deletions examples/pnpm.yml → examples/node/pnpm.yml
@@ -41,4 +41,5 @@ jobs:
         with:
           api_key: ${{ secrets.BRAINTRUST_API_KEY }}
           runtime: node
+          package_manager: pnpm
           root: my_eval_dir
File renamed without changes.
42 changes: 42 additions & 0 deletions examples/python/uv.yml
@@ -0,0 +1,42 @@
+name: Run Python evals
+
+on:
+  push:
+    # Uncomment to run only when files in the 'evals' directory change
+    # - paths:
+    #     - "evals/**"
+
+permissions:
+  pull-requests: write
+  contents: read
+
+jobs:
+  eval:
+    name: Run evals
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        id: checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.12" # Replace with your Python version
+
+      # Tweak this to a dependency manager of your choice
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r test-eval-py/requirements.txt
+
+      - name: Run Evals
+        uses: braintrustdata/eval-action@v1
+        with:
+          api_key: ${{ secrets.BRAINTRUST_API_KEY }}
+          runtime: python
+          package_manager: uv
+          root: my_eval_dir
4 changes: 4 additions & 0 deletions mise.toml
@@ -0,0 +1,4 @@
+[tools]
+node = "20.6.0"
+pnpm = "8"
+python = "latest"
4 changes: 2 additions & 2 deletions script/release
@@ -25,7 +25,7 @@ GREEN='\033[0;32m'
 BLUE='\033[0;34m'

 # Get the latest release tag
-latest_tag=$(git describe --tags "$(git rev-list --tags --max-count=1)")
+latest_tag=$(git tag -l 'v*' --sort=-v:refname | head -n 1)

 if [[ -z "$latest_tag" ]]; then
   # There are no existing release tags
@@ -59,6 +59,6 @@ git tag -a "$tag_first_part" -m "$tag_first_part Release" -f
 echo -e "${GREEN}Tagged: $tag_first_part${OFF}"

 # Push the new tag to the remote
-git push --tags -f
+git push --tags
 echo -e "${GREEN}Release tag pushed to remote${OFF}"
 echo -e "${GREEN}Done!${OFF}"
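
The release script now selects the latest tag by sorting `v*` tags in descending version order rather than describing the most recently tagged commit. The sketch below shows roughly what ordering by version (as opposed to plain string order) means. It is an illustration only: `parseTag`, `compareTags`, and `latestTag` are hypothetical helpers, they assume tags of the form `vMAJOR.MINOR.PATCH`, and git's own `v:refname` ordering has rules this sketch does not model.

```typescript
// Hypothetical helpers: compare "v1.2.3"-style tags component by component.
function parseTag(tag: string): number[] {
  return tag
    .replace(/^v/, "")
    .split(".")
    .map(part => Number.parseInt(part, 10) || 0);
}

// Negative if a < b, positive if a > b, zero if they compare equal.
function compareTags(a: string, b: string): number {
  const pa = parseTag(a);
  const pb = parseTag(b);
  const length = Math.max(pa.length, pb.length);
  for (let i = 0; i < length; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) {
      return diff;
    }
  }
  return 0;
}

// Keep only tags starting with "v", sort descending, take the first one.
function latestTag(tags: string[]): string | undefined {
  return tags
    .filter(tag => tag.startsWith("v"))
    .sort((a, b) => compareTags(b, a))[0];
}
```

With the tags `v1.9.0`, `v1.10.0`, and `v1.2.3`, a plain string sort would rank `v1.9.0` highest, while the numeric comparison in this sketch picks `v1.10.0`.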