diff --git a/examples/APIAgent-Py/APIAgent.ipynb b/examples/APIAgent-Py/APIAgent.ipynb
index adffce2..d31a0a8 100644
--- a/examples/APIAgent-Py/APIAgent.ipynb
+++ b/examples/APIAgent-Py/APIAgent.ipynb
@@ -826,7 +826,7 @@
"source": [
"Awesome! The logs now have a `no_hallucination` score which we can use to filter out hallucinations.\n",
"\n",
- "\n"
+ "\n"
]
},
{
@@ -839,7 +839,7 @@
"non-hallucinations are correct, but in a real-world scenario, you could [collect user feedback](https://www.braintrust.dev/docs/guides/logging#user-feedback)\n",
"and treat positively rated feedback as ground truth.\n",
"\n",
- "\n",
+ "\n",
"\n",
"## Running evals\n",
"\n",
@@ -1020,7 +1020,7 @@
"\n",
"To understand why, we can filter down to this regression, and take a look at a side-by-side diff.\n",
"\n",
- "\n",
+ "\n",
"\n",
"Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step.\n",
"Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields.\n",
diff --git a/examples/APIAgent-Py/assets/dataset-setup.gif b/examples/APIAgent-Py/assets/dataset-setup.gif
deleted file mode 100644
index 144e53b..0000000
Binary files a/examples/APIAgent-Py/assets/dataset-setup.gif and /dev/null differ
diff --git a/examples/APIAgent-Py/assets/dataset-setup.mp4 b/examples/APIAgent-Py/assets/dataset-setup.mp4
new file mode 100644
index 0000000..3cc5be3
Binary files /dev/null and b/examples/APIAgent-Py/assets/dataset-setup.mp4 differ
diff --git a/examples/APIAgent-Py/assets/logs-with-score.gif b/examples/APIAgent-Py/assets/logs-with-score.gif
deleted file mode 100644
index db34bc8..0000000
Binary files a/examples/APIAgent-Py/assets/logs-with-score.gif and /dev/null differ
diff --git a/examples/APIAgent-Py/assets/logs-with-score.mp4 b/examples/APIAgent-Py/assets/logs-with-score.mp4
new file mode 100644
index 0000000..c88cf40
Binary files /dev/null and b/examples/APIAgent-Py/assets/logs-with-score.mp4 differ
diff --git a/examples/APIAgent-Py/assets/regression-diff.gif b/examples/APIAgent-Py/assets/regression-diff.gif
deleted file mode 100644
index e3676c2..0000000
Binary files a/examples/APIAgent-Py/assets/regression-diff.gif and /dev/null differ
diff --git a/examples/APIAgent-Py/assets/regression-diff.mp4 b/examples/APIAgent-Py/assets/regression-diff.mp4
new file mode 100644
index 0000000..39ad362
Binary files /dev/null and b/examples/APIAgent-Py/assets/regression-diff.mp4 differ
diff --git a/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb b/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb
index 6685083..800171c 100644
--- a/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb
+++ b/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb
@@ -423,7 +423,7 @@
"- You should see the eval scores increase and you can see which test cases improved.\n",
"- You can also filter the test cases by improvements to know exactly why the scores changed.\n",
"\n",
- "\n",
+ "\n",
"\n"
]
},
diff --git a/examples/ClassifyingNewsArticles/assets/inspect.gif b/examples/ClassifyingNewsArticles/assets/inspect.gif
deleted file mode 100644
index 87ab876..0000000
Binary files a/examples/ClassifyingNewsArticles/assets/inspect.gif and /dev/null differ
diff --git a/examples/ClassifyingNewsArticles/assets/inspect.mp4 b/examples/ClassifyingNewsArticles/assets/inspect.mp4
new file mode 100644
index 0000000..811d8fd
Binary files /dev/null and b/examples/ClassifyingNewsArticles/assets/inspect.mp4 differ
diff --git a/examples/Github-Issues/Github-Issues.ipynb b/examples/Github-Issues/Github-Issues.ipynb
index 7de01da..c5a9471 100644
--- a/examples/Github-Issues/Github-Issues.ipynb
+++ b/examples/Github-Issues/Github-Issues.ipynb
@@ -482,7 +482,7 @@
"\n",
"Happy evaluating!\n",
"\n",
- "\n"
+ "\n"
]
}
],
diff --git a/examples/Github-Issues/assets/improvements.gif b/examples/Github-Issues/assets/improvements.gif
deleted file mode 100644
index 1a86c58..0000000
Binary files a/examples/Github-Issues/assets/improvements.gif and /dev/null differ
diff --git a/examples/Github-Issues/assets/improvements.mp4 b/examples/Github-Issues/assets/improvements.mp4
new file mode 100644
index 0000000..4e90f58
Binary files /dev/null and b/examples/Github-Issues/assets/improvements.mp4 differ
diff --git a/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb b/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb
index c034dae..f34369b 100644
--- a/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb
+++ b/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb
@@ -756,7 +756,7 @@
"\n",
"Although it's a fraction of the cost, it's both slower (likely due to rate limits) and worse performing than GPT-4o. 12 of the 60 cases failed to parse. Let's take a look at one of those in depth.\n",
"\n",
- "\n",
+ "\n",
"\n",
"That definitely looks like an invalid tool call. Maybe we can experiment with tweaking the prompt to get better results.\n",
"\n",
diff --git a/examples/LLaMa-3_1-Tools/assets/parsing-failure.gif b/examples/LLaMa-3_1-Tools/assets/parsing-failure.gif
deleted file mode 100644
index a9148b0..0000000
Binary files a/examples/LLaMa-3_1-Tools/assets/parsing-failure.gif and /dev/null differ
diff --git a/examples/LLaMa-3_1-Tools/assets/parsing-failure.mp4 b/examples/LLaMa-3_1-Tools/assets/parsing-failure.mp4
new file mode 100644
index 0000000..f22236e
Binary files /dev/null and b/examples/LLaMa-3_1-Tools/assets/parsing-failure.mp4 differ
diff --git a/examples/OTEL-logging/assets/add-post-filter.gif b/examples/OTEL-logging/assets/add-post-filter.gif
deleted file mode 100644
index 8dfbcbf..0000000
Binary files a/examples/OTEL-logging/assets/add-post-filter.gif and /dev/null differ
diff --git a/examples/OTEL-logging/assets/add-post-filter.mp4 b/examples/OTEL-logging/assets/add-post-filter.mp4
new file mode 100644
index 0000000..bbdbf0a
Binary files /dev/null and b/examples/OTEL-logging/assets/add-post-filter.mp4 differ
diff --git a/examples/OTEL-logging/assets/otel-demo.gif b/examples/OTEL-logging/assets/otel-demo.gif
deleted file mode 100644
index 17f8395..0000000
Binary files a/examples/OTEL-logging/assets/otel-demo.gif and /dev/null differ
diff --git a/examples/OTEL-logging/assets/otel-demo.mp4 b/examples/OTEL-logging/assets/otel-demo.mp4
new file mode 100644
index 0000000..7bcb2f4
Binary files /dev/null and b/examples/OTEL-logging/assets/otel-demo.mp4 differ
diff --git a/examples/OTEL-logging/assets/spans.gif b/examples/OTEL-logging/assets/spans.gif
deleted file mode 100644
index 27afc0c..0000000
Binary files a/examples/OTEL-logging/assets/spans.gif and /dev/null differ
diff --git a/examples/OTEL-logging/assets/spans.mp4 b/examples/OTEL-logging/assets/spans.mp4
new file mode 100644
index 0000000..1fd1e53
Binary files /dev/null and b/examples/OTEL-logging/assets/spans.mp4 differ
diff --git a/examples/OTEL-logging/otel-logging.mdx b/examples/OTEL-logging/otel-logging.mdx
index 98ceaed..7865d20 100644
--- a/examples/OTEL-logging/otel-logging.mdx
+++ b/examples/OTEL-logging/otel-logging.mdx
@@ -141,7 +141,7 @@ Run `npm install` to install the required dependencies, then `npm run dev` to la
Open your Braintrust project to the **Logs** page, and select **What orders have shipped?** in your applications. You should be able to watch the logs filter in as your application makes HTTP requests and LLM calls.
-
+
Because this application is using multi-step streaming and tool calls, the logs are especially interesting. In Braintrust, logs consist of [traces](/docs/guides/traces), which roughly correspond to a single request or interaction in your application. Traces consist of one or more spans, each of which corresponds to a unit of work in your application. In this example, each step and tool call is logged inside of its own span. This level of granularity makes it easier to debug issues, track user behavior, and collect data into datasets.
@@ -149,11 +149,11 @@ Because this application is using multi-step streaming and tool calls, the logs
Run a couple more queries in the app and notice the logs that are generated. Our app is logging both `GET` and `POST` requests, but we’re most interested in the `POST` requests since they contain our LLM calls. We can apply a filter using the [BTQL](/docs/reference/btql) query `Name LIKE 'POST%'` so that we only see the traces we care about:
-
+
You should now have a list of traces for all the `POST` requests your app has made. Each contains the inputs and outputs of each LLM call in a span called `ai.streamText`. If you go further into the trace, you’ll also notice a span for each tool call.
-
+
This is valuable data that can be used to evaluate the quality and accuracy of your application in Braintrust.
diff --git a/examples/PDFPlayground/PDFPlayground.mdx b/examples/PDFPlayground/PDFPlayground.mdx
index 9eea335..43e3cb9 100644
--- a/examples/PDFPlayground/PDFPlayground.mdx
+++ b/examples/PDFPlayground/PDFPlayground.mdx
@@ -348,7 +348,7 @@ Once your traces have been logged, you can use the Braintrust UI to manage your
You can store the user spans from your PDF traces into a dataset. Select the span, and then select **Add span to dataset**, or use the hotkey `D` to speed this up.
-
+
### Trying system prompts in a playground
@@ -357,7 +357,7 @@ Select a system prompt span, and then select **Try prompt** to:
1. Save the prompt (for example, "system1") to your library by selecting **Save as custom prompt**
2. Launch a playground using the saved prompt by selecting **Create playground with prompt**
-
+
### File attachment methods
@@ -365,13 +365,13 @@ There are two ways to attach PDF files in playgrounds: using the paperclip butto
- To upload files directly from your local machine, start by selecting **+ Message** to add a user prompt. Then, select **+ Message Part** > **File**. This will display a paperclip icon on the right side. Select it to upload a file from your local machine.
-
+
This method is particularly useful when you're working with local files that aren't accessible via public URL.
- To use the public URL method, paste the URL directly into the file message input field. You can also use mustache syntax to extract the URL from metadata.
-
+
This method streamlines the process when you're working with publicly available PDFs, like the earnings call transcripts we're using in this cookbook.
diff --git a/examples/PDFPlayground/assets/add-span-to-dataset.gif b/examples/PDFPlayground/assets/add-span-to-dataset.gif
deleted file mode 100644
index a1ee2a4..0000000
Binary files a/examples/PDFPlayground/assets/add-span-to-dataset.gif and /dev/null differ
diff --git a/examples/PDFPlayground/assets/add-span-to-dataset.mp4 b/examples/PDFPlayground/assets/add-span-to-dataset.mp4
new file mode 100644
index 0000000..99afd87
Binary files /dev/null and b/examples/PDFPlayground/assets/add-span-to-dataset.mp4 differ
diff --git a/examples/PDFPlayground/assets/paperclip.gif b/examples/PDFPlayground/assets/paperclip.gif
deleted file mode 100644
index 8eb8f9b..0000000
Binary files a/examples/PDFPlayground/assets/paperclip.gif and /dev/null differ
diff --git a/examples/PDFPlayground/assets/paperclip.mp4 b/examples/PDFPlayground/assets/paperclip.mp4
new file mode 100644
index 0000000..8cc02ba
Binary files /dev/null and b/examples/PDFPlayground/assets/paperclip.mp4 differ
diff --git a/examples/PDFPlayground/assets/try-prompt.gif b/examples/PDFPlayground/assets/try-prompt.gif
deleted file mode 100644
index 32366b6..0000000
Binary files a/examples/PDFPlayground/assets/try-prompt.gif and /dev/null differ
diff --git a/examples/PDFPlayground/assets/try-prompt.mp4 b/examples/PDFPlayground/assets/try-prompt.mp4
new file mode 100644
index 0000000..266b869
Binary files /dev/null and b/examples/PDFPlayground/assets/try-prompt.mp4 differ
diff --git a/examples/PDFPlayground/assets/url.gif b/examples/PDFPlayground/assets/url.gif
deleted file mode 100644
index af3b4af..0000000
Binary files a/examples/PDFPlayground/assets/url.gif and /dev/null differ
diff --git a/examples/PDFPlayground/assets/url.mp4 b/examples/PDFPlayground/assets/url.mp4
new file mode 100644
index 0000000..37c231b
Binary files /dev/null and b/examples/PDFPlayground/assets/url.mp4 differ
diff --git a/examples/ProviderBenchmark/ProviderBenchmark.ipynb b/examples/ProviderBenchmark/ProviderBenchmark.ipynb
index 2a90e01..2fe278f 100644
--- a/examples/ProviderBenchmark/ProviderBenchmark.ipynb
+++ b/examples/ProviderBenchmark/ProviderBenchmark.ipynb
@@ -433,7 +433,7 @@
"\n",
"Let's start by looking at the project view. Braintrust makes it easy to morph this into a multi-level grouped analysis where we can see the score vs. duration in a scatter plot, and how each provider stacks up in the table.\n",
"\n",
- "\n",
+ "\n",
"\n",
"### Insights\n",
"\n",
diff --git a/examples/ProviderBenchmark/assets/configuring-graph.gif b/examples/ProviderBenchmark/assets/configuring-graph.gif
deleted file mode 100644
index 0088f61..0000000
Binary files a/examples/ProviderBenchmark/assets/configuring-graph.gif and /dev/null differ
diff --git a/examples/ProviderBenchmark/assets/configuring-graph.mp4 b/examples/ProviderBenchmark/assets/configuring-graph.mp4
new file mode 100644
index 0000000..a0290d9
Binary files /dev/null and b/examples/ProviderBenchmark/assets/configuring-graph.mp4 differ
diff --git a/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx b/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx
index f296a3b..dcb95cb 100644
--- a/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx
+++ b/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx
@@ -7,7 +7,7 @@ import { LoomVideo } from "#/ui/docs/loom";
import Link from "fumadocs-core/link";
import { Callout } from "fumadocs-ui/components/callout";
import { Step, Steps } from "fumadocs-ui/components/steps";
-import Image from 'next/image';
+import Image from "next/image";
# Changelog
@@ -52,18 +52,18 @@ import Image from 'next/image';
- The Traceloop OTEL integration now uses the input and output attributes to populate the corresponding fields in Braintrust.
- The monitor page now supports querying experiment metrics.
- Removed the `filters` param from the REST API fetch endpoint. For complex
-queries, we recommend using the `/btql` endpoint ([docs](/docs/reference/btql)).
+ queries, we recommend using the `/btql` endpoint ([docs](/docs/reference/btql)).
- New experiment summary layout option, a url-friendly view for experiment summaries that respects all filters.
- Add a default limit of 10 to all fetch and `/btql` requests for project_logs.
- You can now export your prompts from the playground as code snippets and run them through the [AI proxy](/docs/guides/proxy).
- Add a fallback for the "add prompt" dropdown button in the playground, which
-will search for prompts within the current project if the cross-org prompts
-query fails.
+ will search for prompts within the current project if the cross-org prompts
+ query fails.
### SDK (version 0.0.171)
- Add a `.data` method to the `Attachment` class, which lets you inspect the
-loaded attachment data.
+ loaded attachment data.
## Week of 2024-11-12
@@ -99,6 +99,7 @@ loaded attachment data.
- Create custom columns on dataset, experiment and logs tables from `JSON` values in `input`, `output`, `expected`, or `metadata` fields.
### API (version 0.0.59)
+
- Fix permissions bug with updating org-scoped env vars
## Week of 2024-10-28
@@ -151,7 +152,7 @@ loaded attachment data.
### SDK (version 0.0.164)
- Add `braintrust.permalink` function to create deep links pointing to
-particular spans in the Braintrust UI.
+ particular spans in the Braintrust UI.
## Week of 2024-10-07
@@ -170,7 +171,7 @@ particular spans in the Braintrust UI.
### SDK (version 0.0.161)
- Add utility function `spanComponentsToObjectId` for resolving the object ID
-from an exported span slug.
+ from an exported span slug.
## Week of 2024-09-30
@@ -178,19 +179,21 @@ from an exported span slug.
- Add support for [Cerebras](https://cerebras.ai/) models in the proxy, playground, and saved prompts.
- You can now create [span iframe viewers](/docs/guides/tracing#custom-span-iframes) to visualize span data in a custom iframe.
In this example, the "Table" section is a custom span iframe.
-
+ 
- `NOT LIKE`, `NOT ILIKE`, `NOT INCLUDES`, and `NOT CONTAINS` supported in BTQL.
- Add "Upload Rows" button to insert rows into an existing dataset from CSV or JSON.
- Add "Maximum" aggregate score type.
- The experiment table now supports grouping by input (for trials) or by a metadata field.
- - The Name and Input columns are now pinned
+ - The Name and Input columns are now pinned
- Gemini models now support multimodal inputs.
## Week of 2024-09-23
- Basic monitor page that shows aggregate values for latency, token count, time to first token, and cost for logs.
- Create custom tools to use in your prompts and in the playground. See the [docs](/docs/guides/prompts#calling-external-tools) for more details.
-- Set org-wide environment variables to use in these tools
+-
+ Set org-wide environment variables
+ to use in these tools
- Pull your prompts to your codebase using the `braintrust pull` command.
- Select and compare multiple experiments in the experiment view using the `compared with` dropdown.
- The playground now displays aggregate scores (avg/max/min) for each prompt and supports sorting rows by a score.
@@ -220,7 +223,6 @@ from an exported span slug.
- The tag picker now includes tags that were added dynamically via API, in addition to the tags configured for your project.
- Added a REST API for managing AI secrets. See [docs](/docs/reference/api/AiSecrets).
-
### SDK (version 0.0.158)
- A dedicated `update` method is now available for datasets.
@@ -233,11 +235,11 @@ from an exported span slug.
- You can now create server-side online evaluations for your logs. Online evals support both [autoevals](/docs/reference/autoevals) and
[custom scorers](/docs/guides/playground) you define as LLM-as-a-judge, TypeScript, or Python functions. See
[docs](/docs/guides/evals/write#online-evaluation) for more details.
-
+
- New member invitations now support being added to multiple permission groups.
- Move datasets and prompts to a new Library navigation tab, and include a list of custom scorers.
- Clean up tree view by truncating the root preview and showing a preview of a node only if collapsed.
-
+ 
- Automatically save changes to table views.
## Week of 2024-09-02
@@ -294,12 +296,13 @@ npx braintrust eval --bundle
## Week of 2024-08-12
- You can now create custom LLM and code (TypeScript and Python) evaluators in the playground.
-
+
+
- Fullscreen trace toggle
- Datasets now accept JSON file uploads
- When uploading a CSV/JSON file to a dataset, columns/fields named `input`, `expected`, and `metadata`
-are now auto-assigned to the corresponding dataset fields
+ are now auto-assigned to the corresponding dataset fields
- Fix bug in logs/dataset viewer when changing the search params.
### API (version 0.0.53)
@@ -315,7 +318,7 @@ are now auto-assigned to the corresponding dataset fields
- These metrics, along with cost, now exclude LLM calls used in autoevals (as of 0.0.85)
- Switching organizations via the header navigates to the same-named project in the selected organization
- Added `MarkAsyncWrapper` to the Python SDK to allow explicitly marking
-functions which return awaitable objects as async
+ functions which return awaitable objects as async
### Autoevals (version 0.0.85)
@@ -370,37 +373,37 @@ functions which return awaitable objects as async
## Week of 2024-07-22
- Categorical human review scores can now be re-ordered via Drag-n-Drop.
-
+ 
- Human review row selection is now a free text field, enabling a quick jump to a specific row.
-
+ 
- Added REST endpoint for managing org membership. See
[docs](/docs/reference/api/Organizations#modify-organization-membership).
### API (version 0.0.51)
-* The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some
+- The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some
exciting new features. Here is what you need to know:
- * The updates are available as of API version 0.0.51.
- * The proxy is now accessible at `https://api.braintrust.dev/v1/proxy`. You can use this as a base URL in your OpenAI client,
+ - The updates are available as of API version 0.0.51.
+ - The proxy is now accessible at `https://api.braintrust.dev/v1/proxy`. You can use this as a base URL in your OpenAI client,
instead of `https://braintrustproxy.com/v1`. [NOTE: The latter is still supported, but will be deprecated in the future.]
- * If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as
+ - If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as
a separate service.
- * If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the
+ - If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the
"Outputs" tab.

- * Then, replace that in your settings page
+- Then, replace that in your settings page

- * If you have a Docker-based deployment, you can just update your containers.
- * Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set.
+- If you have a Docker-based deployment, you can just update your containers.
+- Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set.
### SDK (version 0.0.146)
-* Add support for `max_concurrency` in the Python SDK
-* Hill climbing evals that use a `BaseExperiment` as data will use that as the default base experiment.
+- Add support for `max_concurrency` in the Python SDK
+- Hill climbing evals that use a `BaseExperiment` as data will use that as the default base experiment.
## Week of 2024-07-15
@@ -420,14 +423,14 @@ functions which return awaitable objects as async
### Autoevals (version 0.0.77)
-* Officially switch the default model to be `gpt-4o`. Our testing showed that it performed on average 10% more accurately than `gpt-3.5-turbo`!
-* Support claude models (e.g. claude-3-5-sonnet-20240620). You can use them by simply specifying the `model` param in any LLM based evaluator.
- * Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings.
+- Officially switch the default model to be `gpt-4o`. Our testing showed that it performed on average 10% more accurately than `gpt-3.5-turbo`!
+- Support claude models (e.g. claude-3-5-sonnet-20240620). You can use them by simply specifying the `model` param in any LLM based evaluator.
+ - Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings.
## Week of 2024-07-08
- Human review scores are now sortable from the project configuration page.
-
+ 
- Streaming support for tool calls in Anthropic models through the proxy and playground.
- The playground now supports different "parsing" modes:
- `auto`: (same as before) the completion text and the first tool call arguments, if any
@@ -437,7 +440,6 @@ functions which return awaitable objects as async
- Cleaned up environment variables in the public [docker
deployment](https://github.com/braintrustdata/braintrust-deployment/tree/main/docker). Functionally, nothing has changed.
-
### Autoevals (version 0.0.76)
- New `.partial(...)` syntax to initialize a scorer with partial arguments like `criteria` in `ClosedQA`.
@@ -447,7 +449,7 @@ functions which return awaitable objects as async
- Table views [can now be saved](/docs/reference/views), persisting the BTQL filters, sorts, and column state.
- Add support for the new `window.ai` model into the playground.
-
+ 
- Use push history when navigating table rows to allow for back button navigation.
- In the experiments list, grouping by a metadata field will group rows in the table as well.
- Allow the trace tree panel to be resized.
@@ -471,8 +473,8 @@ const foo = wrapTraced(async function foo(input) {
});
```
-
### SDK (version 0.0.138)
+
- The TypeScript SDK's `Eval()` function now takes a `maxConcurrency` parameter, which bounds the
number of concurrent tasks that run.
- `braintrust install api` now sets up your API and Proxy URL in your environment.
@@ -512,7 +514,7 @@ const foo = wrapTraced(async function foo(input) {
## Week of 2024-06-03
- You can now collapse the trace tree. It's auto collapsed if you have a single span.
-
+ 
- Improvements to the experiment chart including greyed out lines for inactive scores and improved legend.
- Show diffs when you save a new prompt version.
@@ -819,7 +821,7 @@ server with `curl [api-url]/version`, where the API URL can be found on the

+ view](/docs/release-notes/ReleaseNotes11-27-search.mp4)
- Upgraded AI Proxy to support [tracking Prometheus metrics](https://github.com/braintrustdata/braintrust-proxy/blob/a31a82e6d46ff442a3c478773e6eec21f3d0ba69/apis/cloudflare/wrangler-template.toml#L19C1-L19C1)
- Modified Autoevals library to use the [AI proxy](/docs/guides/proxy)
@@ -1176,7 +1178,7 @@ Eval([eval_name], {
- Fixed our libraries including Autoevals to work with OpenAI’s new libraries

+ playground](/docs/release-notes/ReleaseNotes-2023-11-functions.mp4)
- Added support for function calling and tools in our prompt playground
- Added tabs on a project page for datasets, experiments, etc.
@@ -1283,7 +1285,7 @@ Eval(
- The prompt playground is now live! We're excited to get your feedback as we continue to build
this feature out. See [the docs](/docs/guides/playground) for more information.
-
+
## Week of 2023-08-21
@@ -1295,7 +1297,7 @@ Eval(
changes to your code.
- You can now edit datasets in the UI.
-
+
## Week of 2023-08-14
@@ -1399,11 +1401,11 @@ braintrust install api --update-template
- You can now swap the primary and comparison experiment with a single click.
-
+
- You can now compare `output` vs. `expected` within an experiment.
-
+
- Version 0.0.19 is out for the SDK. It is an important update that throws an error if your payload is larger than 64KB in size.
@@ -1435,7 +1437,7 @@ braintrust install api --update-template
- New scatter plot and histogram insights to quickly analyze scores and filter down examples.
- 
+ 
- API keys that can be set in the SDK (explicitly or through an environment variable) and do not require user login.
Visit the settings page to create an API key.
diff --git a/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx b/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx
index 8059234..1aa25bb 100644
--- a/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx
+++ b/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx
@@ -13,7 +13,7 @@ feedback from end users, subject matter experts, and product teams in one place.
use human review to evaluate/compare experiments, assess the efficacy of your automated scoring
methods, and curate log events to use in your evals.
-
+
## Configuring human review
@@ -32,7 +32,6 @@ options and their scores.
Once you create a score, it will automatically appear in the "Scores" section in each experiment
and log event throughout the project.
-
### Writing to expected fields
You may choose to write categorical scores to the `expected` field of a span instead of a score.
@@ -40,9 +39,9 @@ To enable this, simply check the "Write to expected field instead of score" opti
an option to select multiple values when writing to the expected field.
- A numeric score will not be assigned to the categorical options when writing to the expected
- field. If there is an existing object in the expected field, the categorical value will be
- appended to the object.
+ A numeric score will not be assigned to the categorical options when writing
+ to the expected field. If there is an existing object in the expected field,
+ the categorical value will be appended to the object.

@@ -54,7 +53,7 @@ In addition to categorical scores, you can always directly edit the structured o
To manually review results in your logs or an experiment, simply click on a row, and you'll see
the human review scores you configured in the expanded trace view.
-
+
As you set scores, they will be automatically saved and reflected in the summary metrics. The exact same
mechanism works whether you're reviewing logs or experiments.
@@ -64,7 +63,7 @@ mechanism works whether you're reviewing logs or experiments.
In addition to setting scores, you can also add comments to spans and update their `expected` values. These updates
are tracked alongside score updates to form an audit trail of edits to a span.
-
+
## Rapid review mode
@@ -72,7 +71,7 @@ If you or a subject matter expert is reviewing a large number of logs, you can u
a UI that's optimized specifically for review. To enter review mode, hit the "r" key or the expand ()
icon next to the "Human review" header.
-
+
In review mode, you can set scores, leave comments, and edit expected values. Review mode is optimized for keyboard
navigation, so you can quickly move between scores and rows with keyboard shortcuts. You can also share a link to the
diff --git a/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx b/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx
index 1963380..cd33703 100644
--- a/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx
+++ b/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx
@@ -31,7 +31,7 @@ that includes one or more prompts and is linked to a dataset.
Playgrounds are designed for collaboration and automatically synchronize in real-time.
-
+
To share a playground, simply copy the URL and send it to your collaborators. Your collaborators
must be members of your organization to see the session. You can invite users from the settings page.
diff --git a/examples/ReceiptExtraction/ReceiptExtraction.ipynb b/examples/ReceiptExtraction/ReceiptExtraction.ipynb
index d51589f..278727e 100644
--- a/examples/ReceiptExtraction/ReceiptExtraction.ipynb
+++ b/examples/ReceiptExtraction/ReceiptExtraction.ipynb
@@ -403,7 +403,7 @@
"\n",
"If you click into the gpt-4o experiment and compare it to gpt-4o-mini, you can drill down into the individual improvements and regressions.\n",
"\n",
- "\n",
+ "\n",
"\n",
"There are several different types of regressions, one of which appears to be that `gpt-4o` returns information in a different case than `gpt-4o-mini`. That may or\n",
"may not be important for this use case, but if not, we could adjust our scoring functions to lowercase everything before comparing.\n",
diff --git a/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.gif b/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.gif
deleted file mode 100644
index b31998d..0000000
Binary files a/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.gif and /dev/null differ
diff --git a/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.mp4 b/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.mp4
new file mode 100644
index 0000000..f99a6f0
Binary files /dev/null and b/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.mp4 differ
diff --git a/examples/SimpleRagas/SimpleRagas.ipynb b/examples/SimpleRagas/SimpleRagas.ipynb
index b60c723..8f3f9bb 100644
--- a/examples/SimpleRagas/SimpleRagas.ipynb
+++ b/examples/SimpleRagas/SimpleRagas.ipynb
@@ -449,7 +449,7 @@
"results, and maybe we should try using `gpt-4` instead. Braintrust lets us test the effect of this quickly, directly in the UI, before we run\n",
"a full experiment:\n",
"\n",
- "\n",
+ "\n",
"\n",
"Looks better. Let's update our scoring function to use it and re-run the experiment.\n"
]
@@ -575,7 +575,7 @@
"We can drill down on individual examples of each regression type to better understand it. The side-by-side diffs built into Braintrust make\n",
"it easy to deeply understand every step of the pipeline, for example, which documents were missing, and why.\n",
"\n",
- "\n",
+ "\n",
"\n",
"And there you have it! Ragas is a powerful technique that, with the right tools and iteration, can lead to really high-quality RAG applications. Happy evaling!\n"
]
diff --git a/examples/SimpleRagas/assets/missing-docs.gif b/examples/SimpleRagas/assets/missing-docs.gif
deleted file mode 100644
index 15e36ef..0000000
Binary files a/examples/SimpleRagas/assets/missing-docs.gif and /dev/null differ
diff --git a/examples/SimpleRagas/assets/missing-docs.mp4 b/examples/SimpleRagas/assets/missing-docs.mp4
new file mode 100644
index 0000000..f5bc15e
Binary files /dev/null and b/examples/SimpleRagas/assets/missing-docs.mp4 differ
diff --git a/examples/SimpleRagas/assets/try-gpt-4.gif b/examples/SimpleRagas/assets/try-gpt-4.gif
deleted file mode 100644
index 78a398a..0000000
Binary files a/examples/SimpleRagas/assets/try-gpt-4.gif and /dev/null differ
diff --git a/examples/SimpleRagas/assets/try-gpt-4.mp4 b/examples/SimpleRagas/assets/try-gpt-4.mp4
new file mode 100644
index 0000000..b4b6ea0
Binary files /dev/null and b/examples/SimpleRagas/assets/try-gpt-4.mp4 differ
diff --git a/examples/Text2SQL-Data/Text2SQL-Data.ipynb b/examples/Text2SQL-Data/Text2SQL-Data.ipynb
index a523128..9b842f6 100644
--- a/examples/Text2SQL-Data/Text2SQL-Data.ipynb
+++ b/examples/Text2SQL-Data/Text2SQL-Data.ipynb
@@ -400,7 +400,7 @@
"1. Let's capture the good data into a dataset. Since our eval pipeline did the hard work of generating a reference query and results, we can\n",
" now save these, and make sure that future changes we make do not _regress_ the results.\n",
"\n",
- "\n",
+ "\n",
"\n",
"- The incorrect query didn't seem to get the date format correct. That would probably be improved by showing a sample of the data to the model.\n",
"\n",
@@ -986,7 +986,7 @@
"\n",
"Braintrust makes it easy to filter down to the regressions, and view a side-by-side diff:\n",
"\n",
- "\n",
+ "\n",
"\n",
"## Conclusion\n",
"\n",
diff --git a/examples/Text2SQL-Data/assets/add-to-dataset.gif b/examples/Text2SQL-Data/assets/add-to-dataset.gif
deleted file mode 100644
index 17812b2..0000000
Binary files a/examples/Text2SQL-Data/assets/add-to-dataset.gif and /dev/null differ
diff --git a/examples/Text2SQL-Data/assets/add-to-dataset.mp4 b/examples/Text2SQL-Data/assets/add-to-dataset.mp4
new file mode 100644
index 0000000..ec985c9
Binary files /dev/null and b/examples/Text2SQL-Data/assets/add-to-dataset.mp4 differ
diff --git a/examples/Text2SQL-Data/assets/analyze-regressions.gif b/examples/Text2SQL-Data/assets/analyze-regressions.gif
deleted file mode 100644
index a0c2b72..0000000
Binary files a/examples/Text2SQL-Data/assets/analyze-regressions.gif and /dev/null differ
diff --git a/examples/Text2SQL-Data/assets/analyze-regressions.mp4 b/examples/Text2SQL-Data/assets/analyze-regressions.mp4
new file mode 100644
index 0000000..77204be
Binary files /dev/null and b/examples/Text2SQL-Data/assets/analyze-regressions.mp4 differ
diff --git a/examples/ToolOCR/ToolOCR.mdx b/examples/ToolOCR/ToolOCR.mdx
index 7ac7132..8970bff 100644
--- a/examples/ToolOCR/ToolOCR.mdx
+++ b/examples/ToolOCR/ToolOCR.mdx
@@ -95,7 +95,7 @@ braintrust push ocr.py --requirements requirements.txt
To try out the tool, visit the **toolOCR** project in Braintrust, and navigate to **Tools**. Here, you can test different images and see what kinds of outputs you're getting from the tool.
-
+
This is helpful information for deciding if you'd like to do any additional post-processing of the text output. For example, you may notice that your output contains `\n` to indicate new lines in the parsed text, and you could add processing in your tool to handle these. If you change your code, just run `braintrust push ocr.py --requirements requirements.txt` again to sync the tool with Braintrust.
@@ -117,7 +117,7 @@ prompt = project.prompts.create(
Just like the tool, the prompt can be run in the UI, and you can even try it out on some examples:
-
+
If you visit the **Logs** tab, you can check out detailed logs for each call:
@@ -142,7 +142,7 @@ Then, navigate to **Dataset** in your playground and select the **Recipes** data
Your playground is now set up with a prompt, model choice, dataset, and the tool we created. Hit **Run** to run the prompt and tool on the images in the dataset.
-
+
## Iterating on the prompt
@@ -150,7 +150,7 @@ Now that we have an interactive environment to test out our prompt and tool call
Hit the copy icon to duplicate your prompt and start tweaking. You can also tweak the original prompt and save your changes there if you'd like. For example, you can try instructing the model to always list the quantity of each ingredient you need to purchase.
-
+
Once you're satisfied with the prompt, hit **Update** to save the changes. Each time you save the prompt, you
create a new version. To learn more about how to use a prompt in your code, check out the
diff --git a/examples/ToolOCR/assets/run-playground.gif b/examples/ToolOCR/assets/run-playground.gif
deleted file mode 100644
index aa30536..0000000
Binary files a/examples/ToolOCR/assets/run-playground.gif and /dev/null differ
diff --git a/examples/ToolOCR/assets/run-playground.mp4 b/examples/ToolOCR/assets/run-playground.mp4
new file mode 100644
index 0000000..24d9e0a
Binary files /dev/null and b/examples/ToolOCR/assets/run-playground.mp4 differ
diff --git a/examples/ToolOCR/assets/try-prompt.gif b/examples/ToolOCR/assets/try-prompt.gif
deleted file mode 100644
index 864ac57..0000000
Binary files a/examples/ToolOCR/assets/try-prompt.gif and /dev/null differ
diff --git a/examples/ToolOCR/assets/try-prompt.mp4 b/examples/ToolOCR/assets/try-prompt.mp4
new file mode 100644
index 0000000..9fccb54
Binary files /dev/null and b/examples/ToolOCR/assets/try-prompt.mp4 differ
diff --git a/examples/ToolOCR/assets/try-tool.gif b/examples/ToolOCR/assets/try-tool.gif
deleted file mode 100644
index bef1402..0000000
Binary files a/examples/ToolOCR/assets/try-tool.gif and /dev/null differ
diff --git a/examples/ToolOCR/assets/try-tool.mp4 b/examples/ToolOCR/assets/try-tool.mp4
new file mode 100644
index 0000000..03dd298
Binary files /dev/null and b/examples/ToolOCR/assets/try-tool.mp4 differ
diff --git a/examples/ToolOCR/assets/tweak-prompt.gif b/examples/ToolOCR/assets/tweak-prompt.gif
deleted file mode 100644
index 55ba4cf..0000000
Binary files a/examples/ToolOCR/assets/tweak-prompt.gif and /dev/null differ
diff --git a/examples/ToolOCR/assets/tweak-prompt.mp4 b/examples/ToolOCR/assets/tweak-prompt.mp4
new file mode 100644
index 0000000..2c68b78
Binary files /dev/null and b/examples/ToolOCR/assets/tweak-prompt.mp4 differ
diff --git a/examples/ToolRAG/ToolRAG.mdx b/examples/ToolRAG/ToolRAG.mdx
index 1731fab..6628849 100644
--- a/examples/ToolRAG/ToolRAG.mdx
+++ b/examples/ToolRAG/ToolRAG.mdx
@@ -7,7 +7,7 @@ to compare multiple versions side-by-side, you'd have to deploy each version sep
Using Braintrust, you can experiment with different
prompts together with retrieval logic, side-by-side, all within the playground UI. In this cookbook, we'll walk through exactly how.
-
+
## Architecture
@@ -117,7 +117,7 @@ The output should be:
To try out the tool, visit the project in Braintrust, and navigate to **Tools**.
-
+
Here, you can test different searches and refine the logic. For example, you could try playing with various
`top_k` values, or adding a prefix to the query to guide the results. If you change the code, run
@@ -150,7 +150,7 @@ npx braintrust push prompt.ts
Once the prompt uploads, you can run it in the UI and even try it out on some examples:
-
+
If you visit the **Logs** tab, you can check out detailed logs for each call:
@@ -181,12 +181,12 @@ Once you create it, if you visit the **Datasets** tab, you'll be able to explore
To try out the prompt together with the dataset, we'll create a playground.
-
+
Once you create the playground, hit **Run** to run the prompt and tool on the questions
in the dataset.
-
+
### Define a scorer
@@ -228,7 +228,7 @@ Once you define the scorer, hit **Run** to run it on the questions in the datase
Now, let's tweak the prompt to see if we can improve the results. Hit the copy icon to duplicate your prompt and start tweaking. You can also tweak the original prompt and save your changes there if you'd like. For example, you can try instructing the model to always include a Python and
TypeScript code snippet.
-
+
Once you're satisfied with the prompt, hit **Update** to save the changes. Each time you save the prompt, you
create a new version. To learn more about how to use a prompt in your code, check out the
diff --git a/examples/ToolRAG/assets/Create-playground.gif b/examples/ToolRAG/assets/Create-playground.gif
deleted file mode 100644
index c12c6b1..0000000
Binary files a/examples/ToolRAG/assets/Create-playground.gif and /dev/null differ
diff --git a/examples/ToolRAG/assets/Create-playground.mp4 b/examples/ToolRAG/assets/Create-playground.mp4
new file mode 100644
index 0000000..43d4af0
Binary files /dev/null and b/examples/ToolRAG/assets/Create-playground.mp4 differ
diff --git a/examples/ToolRAG/assets/Run-playground.gif b/examples/ToolRAG/assets/Run-playground.gif
deleted file mode 100644
index 494ee17..0000000
Binary files a/examples/ToolRAG/assets/Run-playground.gif and /dev/null differ
diff --git a/examples/ToolRAG/assets/Run-playground.mp4 b/examples/ToolRAG/assets/Run-playground.mp4
new file mode 100644
index 0000000..b1d3280
Binary files /dev/null and b/examples/ToolRAG/assets/Run-playground.mp4 differ
diff --git a/examples/ToolRAG/assets/Side-by-side.gif b/examples/ToolRAG/assets/Side-by-side.gif
deleted file mode 100644
index 5c333a2..0000000
Binary files a/examples/ToolRAG/assets/Side-by-side.gif and /dev/null differ
diff --git a/examples/ToolRAG/assets/Side-by-side.mp4 b/examples/ToolRAG/assets/Side-by-side.mp4
new file mode 100644
index 0000000..bc732d6
Binary files /dev/null and b/examples/ToolRAG/assets/Side-by-side.mp4 differ
diff --git a/examples/ToolRAG/assets/Test-prompt.gif b/examples/ToolRAG/assets/Test-prompt.gif
deleted file mode 100644
index 7ed7f95..0000000
Binary files a/examples/ToolRAG/assets/Test-prompt.gif and /dev/null differ
diff --git a/examples/ToolRAG/assets/Test-prompt.mp4 b/examples/ToolRAG/assets/Test-prompt.mp4
new file mode 100644
index 0000000..53f26e2
Binary files /dev/null and b/examples/ToolRAG/assets/Test-prompt.mp4 differ
diff --git a/examples/ToolRAG/assets/Test-tool.gif b/examples/ToolRAG/assets/Test-tool.gif
deleted file mode 100644
index 70b01bb..0000000
Binary files a/examples/ToolRAG/assets/Test-tool.gif and /dev/null differ
diff --git a/examples/ToolRAG/assets/Test-tool.mp4 b/examples/ToolRAG/assets/Test-tool.mp4
new file mode 100644
index 0000000..387f22c
Binary files /dev/null and b/examples/ToolRAG/assets/Test-tool.mp4 differ
diff --git a/examples/ToolRAG/assets/Tweak-prompt.gif b/examples/ToolRAG/assets/Tweak-prompt.gif
deleted file mode 100644
index fe25c28..0000000
Binary files a/examples/ToolRAG/assets/Tweak-prompt.gif and /dev/null differ
diff --git a/examples/ToolRAG/assets/Tweak-prompt.mp4 b/examples/ToolRAG/assets/Tweak-prompt.mp4
new file mode 100644
index 0000000..1b1696d
Binary files /dev/null and b/examples/ToolRAG/assets/Tweak-prompt.mp4 differ
diff --git a/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx b/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx
index ec7ad13..1252fca 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx
@@ -545,7 +545,7 @@ Question: How do I purchase GPUs through Braintrust?
Awesome! The logs now have a `no_hallucination` score which we can use to filter down hallucinations.
-
+
### Creating datasets
@@ -553,7 +553,7 @@ Let's create two datasets: one for good answers and the other for hallucinations
non-hallucinations are correct, but in a real-world scenario, you could [collect user feedback](https://www.braintrust.dev/docs/guides/logging#user-feedback)
and treat positively rated feedback as ground truth.
-
+
## Running evals
@@ -680,7 +680,7 @@ Awesome! Looks like we were able to solve the hallucinations, although we may ha
To understand why, we can filter down to this regression, and take a look at a side-by-side diff.
-
+
Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step.
Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields.
diff --git a/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx b/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx
index 9be6106..11c9ea1 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx
@@ -390,4 +390,4 @@ them into your evals.
Happy evaluating!
-
+
diff --git a/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx b/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx
index d430e96..652b905 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx
@@ -605,7 +605,7 @@ Ok, let's dig into the results. To start, we'll look at how LLaMa-3.1-8B compare
Although it's a fraction of the cost, it's both slower (likely due to rate limits) and worse-performing than GPT-4o. 12 of the 60 cases failed to parse. Let's take a look at one of those in depth.
-
+
That definitely looks like an invalid tool call. Maybe we can experiment with tweaking the prompt to get better results.
diff --git a/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx b/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx
index 842c7ae..792c920 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx
@@ -381,7 +381,7 @@ await Promise.all(providers.map(runProviderBenchmark));
Let's start by looking at the project view. Braintrust makes it easy to morph this into a multi-level grouped analysis where we can see the score vs. duration in a scatter plot, and how each provider stacks up in the table.
-
+
### Insights
diff --git a/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx b/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx
index 6dfb6b2..9e5a068 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx
@@ -325,7 +325,7 @@ Let's dig into these individual results in some more depth.
If you click into the gpt-4o experiment and compare it to gpt-4o-mini, you can drill down into the individual improvements and regressions.
-
+
There are several different types of regressions, one of which appears to be that `gpt-4o` returns information in a different case than `gpt-4o-mini`. That may or
may not be important for this use case, but if not, we could adjust our scoring functions to lowercase everything before comparing.
diff --git a/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx b/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx
index 632dfe9..2aabe67 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx
@@ -339,7 +339,7 @@ By default, Ragas is configured to use `gpt-3.5-turbo-16k`. As we observed, it l
results, and maybe we should try using `gpt-4` instead. Braintrust lets us test the effect of this quickly, directly in the UI, before we run
a full experiment:
-
+
Looks better. Let's update our scoring function to use it and re-run the experiment.
@@ -427,6 +427,6 @@ Although not a pure fail, it does seem like in 3 cases we're not retrieving the
We can drill down on individual examples of each regression type to better understand it. The side-by-side diffs built into Braintrust make
it easy to deeply understand every step of the pipeline, for example, which documents were missing, and why.
-
+
And there you have it! Ragas is a powerful technique that, with the right tools and iteration, can lead to really high-quality RAG applications. Happy evaling!
diff --git a/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx b/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx
index dcb6089..8d58255 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx
@@ -290,7 +290,7 @@ To best utilize these results:
1. Let's capture the good data into a dataset. Since our eval pipeline did the hard work of generating a reference query and results, we can
now save these, and make sure that future changes we make do not _regress_ the results.
-
+
- The incorrect query didn't seem to get the date format correct. That would probably be improved by showing a sample of the data to the model.
@@ -685,7 +685,7 @@ Interesting. It seems like that was not a slam dunk. There were a few regression
Braintrust makes it easy to filter down to the regressions, and view a side-by-side diff:
-
+
## Conclusion
diff --git a/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx b/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx
index 8ad4a9a..30e3804 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx
@@ -163,7 +163,7 @@ Now, let’s use the comprehensiveness scorer to create a feedback loop that all
Go to your Braintrust **Logs** and select one of your logs. In the expanded view on the left-hand side of your screen, select the **generate-changelog** span, then select **Add to dataset**. Create a new dataset called `eval dataset`, and add a couple more logs to the same dataset. We'll use this dataset to run an experiment that evaluates for comprehensiveness to understand where the prompt might need adjustments.
Select the icon next to the "Human review" header to enter review mode.
-
+
In review mode, you can set scores, leave comments, and edit expected values. Review mode is optimized for keyboard
navigation, so you can quickly move between scores and rows with keyboard shortcuts. You can also share a link to the
diff --git a/examples/ToolRAG/tool-rag/docs-sample/logging.mdx b/examples/ToolRAG/tool-rag/docs-sample/logging.mdx
index baf0df6..f0b6c89 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/logging.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/logging.mdx
@@ -439,7 +439,7 @@ def my_feedback_handler(req):
Braintrust supports curating logs by adding tags, and then filtering on them in the UI. Tags naturally flow from logs to datasets, and even
to experiments, so you can use them to track various kinds of data across your application and see how they change over time.
-
+
### Configuring tags
@@ -575,7 +575,7 @@ def my_feedback_handler(req):
To filter by tags, simply select the tags you want to filter by in the UI.
-
+
### Using tags to create queues
@@ -586,7 +586,7 @@ and one to indicate that the event is no longer in the queue. For example, you m
As you're reviewing logs, simply add the `triage` tag to the logs you want to review later. To see the logs in the queue, filter by the
`triage` tag. You can add an additional label, like `NOT (tags includes 'triaged')` to exclude logs that have been marked as done.
-
+
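Reviewer note: the triage-queue pattern above amounts to a simple tag predicate. Here is a minimal sketch of that filter in plain Python over hypothetical log records (this is illustration only, not the Braintrust SDK or its query syntax):

```python
def in_triage_queue(log: dict) -> bool:
    """A log is in the queue if it is tagged 'triage' and not yet marked 'triaged',
    mirroring the filter `tags includes 'triage' AND NOT (tags includes 'triaged')`."""
    tags = log.get("tags", [])
    return "triage" in tags and "triaged" not in tags

# Hypothetical log records for illustration.
logs = [
    {"id": 1, "tags": ["triage"]},
    {"id": 2, "tags": ["triage", "triaged"]},
    {"id": 3, "tags": []},
]
queue = [log["id"] for log in logs if in_triage_queue(log)]
print(queue)  # [1]
```

Only log 1 survives: log 2 has been marked done, and log 3 was never queued.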
## Online evaluation
@@ -599,7 +599,7 @@ a sampling rate along with more granular filters to control which logs get evalu
To create an online evaluation, navigate to the "Configuration" tab in a project and create an online scoring rule.
-
+
The score will now automatically run at the specified sampling rate for all logs in the project.
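Reviewer note: "sampling rate" here works the way you'd expect; a quick sketch of the decision, assuming a uniform-random draw per log (the actual server-side mechanism is not documented in this passage):

```python
import random

def should_score(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether a given log gets an online score, per the sampling rate."""
    return rng.random() < sampling_rate

# At a 25% rate, roughly a quarter of logs are scored.
rng = random.Random(0)
scored = sum(should_score(0.25, rng) for _ in range(10_000))
print(scored)  # roughly 2,500
```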
diff --git a/examples/ToolRAG/tool-rag/docs-sample/playground.mdx b/examples/ToolRAG/tool-rag/docs-sample/playground.mdx
index 28b2b63..5236ef9 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/playground.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/playground.mdx
@@ -29,7 +29,7 @@ that includes one or more prompts and is linked to a dataset.
Playgrounds are designed for collaboration and automatically synchronize in real-time.
-
+
To share a playground, simply copy the URL and send it to your collaborators. Your collaborators
must be members of your organization to see the session. You can invite users from the settings page.
diff --git a/examples/ToolRAG/tool-rag/docs-sample/prompts.mdx b/examples/ToolRAG/tool-rag/docs-sample/prompts.mdx
index 36554f7..102d904 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/prompts.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/prompts.mdx
@@ -19,7 +19,7 @@ To create a prompt, visit the prompts tab in a project, and click the "+ Prompt"
for your prompt. The slug is an immutable identifier that you can use to reference it in your code. As you change
the prompt's name, description, or contents, its slug stays constant.
-
+
Prompts can use [mustache](https://mustache.github.io/mustache.5.html) templating syntax to refer to variables. These variables are substituted
automatically in the API, playground, and using the `.build()` function in your code. More on that below.
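Reviewer note: for readers unfamiliar with mustache, the substitution that `.build()` performs can be sketched in a few lines of plain Python. The template and variable names below are hypothetical, and this intentionally covers only basic `{{variable}}` interpolation, not full mustache sections:

```python
import re

def render_mustache(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders, mimicking basic mustache interpolation."""
    def repl(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing template variable: {key}")
        return str(variables[key])

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", repl, template)

prompt_template = "Summarize the following text in {{language}}:\n{{text}}"
print(render_mustache(prompt_template, {"language": "French", "text": "Hello world"}))
```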
@@ -29,7 +29,7 @@ automatically in the API, playground, and using the `.build()` function in your
Each prompt change is versioned, e.g. `5878bd218351fb8e`. You can use this identifier to pin a specific
version of the prompt in your code.
-
+
You can use this identifier to refer to a specific version of the prompt in your code.
@@ -38,7 +38,7 @@ You can use this identifier to refer to a specific version of the prompt in your
While developing a prompt, it can be useful to test it out on real-world data in the [Playground](/docs/guides/playground).
You can open a prompt in the playground, tweak it, and save a new version once you're ready.
-
+
## Using tools
@@ -115,19 +115,19 @@ To use a tool, simply select it in the "Tools" dropdown. Braintrust will automat
- Call the model again with the tool's result as context
- Continue for up to 5 iterations (by default) or until the model produces a non-tool result
-
+
### Coercing a model's output schema
To define a set of tools available to a model, expand the "Tools" dropdown and select the Raw tab. You can enter an array of tool definitions,
following the [OpenAI tool format](https://platform.openai.com/docs/guides/function-calling).
-
+
By default, if a tool is called, Braintrust will return the arguments of the first tool call as a JSON object. If you use the [`invoke` API](#executing-directly),
you'll receive a JSON object as the result.
-
+
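Reviewer note: "return the arguments of the first tool call as a JSON object" can be made concrete with a small sketch. The message dict below is a hypothetical OpenAI-style tool-call payload; this is not the Braintrust implementation:

```python
import json

def first_tool_call_args(message: dict) -> dict:
    """Parse the first tool call's arguments into a JSON object,
    mirroring the default (non-parallel) behavior described above."""
    calls = message.get("tool_calls", [])
    if not calls:
        raise ValueError("no tool calls in message")
    return json.loads(calls[0]["function"]["arguments"])

# Hypothetical assistant message with two parallel tool calls.
message = {
    "tool_calls": [
        {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}},
        {"function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}},
    ]
}
print(first_tool_call_args(message))  # {'city': 'Paris'}
```

In the default mode only the first call's arguments come back, which is why the `parallel` mode exists.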
If you specify `parallel` as the mode, then instead of the first tool call's
@@ -641,7 +641,7 @@ When you use a prompt in your code, Braintrust automatically links spans to the
you to click to open a span in the playground, and see the prompt that generated it alongside the input variables. You can
even test and save a new version of the prompt directly from the playground.
-
+
This workflow is very powerful. It effectively allows you to debug, iterate, and publish changes to your prompts directly
within Braintrust. And because Braintrust flexibly allows you to load the latest prompt, a specific version, or even a version
diff --git a/examples/ToolRAG/tool-rag/docs-sample/proxy.mdx b/examples/ToolRAG/tool-rag/docs-sample/proxy.mdx
index a6446bc..010a4b2 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/proxy.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/proxy.mdx
@@ -329,7 +329,7 @@ resiliency in case one provider is down.
You can set up endpoints directly on the [secrets page](/app/settings?subroute=secrets) in your Braintrust account
by adding them:
-
+
## Advanced configuration
diff --git a/examples/ToolRAG/tool-rag/docs-sample/tracing.mdx b/examples/ToolRAG/tool-rag/docs-sample/tracing.mdx
index bd4894c..70a4676 100644
--- a/examples/ToolRAG/tool-rag/docs-sample/tracing.mdx
+++ b/examples/ToolRAG/tool-rag/docs-sample/tracing.mdx
@@ -206,7 +206,7 @@ def my_route_handler(req):
-
+
When using `wrapOpenAI`/`wrap_openai`, you technically do not need to use
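Reviewer note: the essence of a client wrapper like `wrap_openai` is to record input, output, and latency around each call. A generic sketch, using a stand-in function rather than a real OpenAI client (the actual SDK also handles spans, streaming, and errors):

```python
import time
from functools import wraps

def wrap_with_logging(fn, log: list):
    """Generic sketch of a tracing wrapper: append an input/output/latency
    record to `log` for every call to `fn`."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        log.append({"input": kwargs, "output": result,
                    "duration_s": time.time() - start})
        return result
    return wrapper

log = []
# Stand-in for a completion call; a real wrapper would wrap the client's methods.
fake_completion = wrap_with_logging(lambda **kw: {"text": "hi"}, log)
fake_completion(model="gpt-4o", prompt="hello")
print(len(log))  # 1
```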
diff --git a/examples/UnreleasedAI/UnreleasedAI.mdx b/examples/UnreleasedAI/UnreleasedAI.mdx
index 7b033ce..1eaac54 100644
--- a/examples/UnreleasedAI/UnreleasedAI.mdx
+++ b/examples/UnreleasedAI/UnreleasedAI.mdx
@@ -126,7 +126,13 @@ Now, let’s use the comprehensiveness scorer to create a feedback loop that all
Go to your Braintrust **Logs** and select one of your logs. In the expanded view on the left-hand side of your screen, select the **generate-changelog** span, then select **Add to dataset**. Create a new dataset called `eval dataset`, and add a couple more logs to the same dataset. We'll use this dataset to run an experiment that evaluates for comprehensiveness to understand where the prompt might need adjustments.
-
+
Alternatively, you can define a dataset in [eval/sampleData.ts](https://github.com/braintrustdata/unreleased-ai/blob/main/eval/sampleData.ts).
diff --git a/examples/UnreleasedAI/assets/add-logs-to-dataset.gif b/examples/UnreleasedAI/assets/add-logs-to-dataset.gif
deleted file mode 100644
index bb97027..0000000
Binary files a/examples/UnreleasedAI/assets/add-logs-to-dataset.gif and /dev/null differ
diff --git a/examples/VercelAISDKTracing/assets/traced.gif b/examples/VercelAISDKTracing/assets/traced.gif
deleted file mode 100644
index 68880c7..0000000
Binary files a/examples/VercelAISDKTracing/assets/traced.gif and /dev/null differ
diff --git a/examples/VercelAISDKTracing/assets/traced.mp4 b/examples/VercelAISDKTracing/assets/traced.mp4
new file mode 100644
index 0000000..9d11773
Binary files /dev/null and b/examples/VercelAISDKTracing/assets/traced.mp4 differ
diff --git a/examples/VercelAISDKTracing/assets/wrapAISDKModel.gif b/examples/VercelAISDKTracing/assets/wrapAISDKModel.gif
deleted file mode 100644
index 7660f4e..0000000
Binary files a/examples/VercelAISDKTracing/assets/wrapAISDKModel.gif and /dev/null differ
diff --git a/examples/VercelAISDKTracing/assets/wrapAISDKModel.mp4 b/examples/VercelAISDKTracing/assets/wrapAISDKModel.mp4
new file mode 100644
index 0000000..434d130
Binary files /dev/null and b/examples/VercelAISDKTracing/assets/wrapAISDKModel.mp4 differ
diff --git a/examples/VercelAISDKTracing/assets/wrapTraced.gif b/examples/VercelAISDKTracing/assets/wrapTraced.gif
deleted file mode 100644
index 492353a..0000000
Binary files a/examples/VercelAISDKTracing/assets/wrapTraced.gif and /dev/null differ
diff --git a/examples/VercelAISDKTracing/assets/wrapTraced.mp4 b/examples/VercelAISDKTracing/assets/wrapTraced.mp4
new file mode 100644
index 0000000..c8519be
Binary files /dev/null and b/examples/VercelAISDKTracing/assets/wrapTraced.mp4 differ
diff --git a/examples/VercelAISDKTracing/vercel-ai-sdk-tracing.mdx b/examples/VercelAISDKTracing/vercel-ai-sdk-tracing.mdx
index 4d86737..511531e 100644
--- a/examples/VercelAISDKTracing/vercel-ai-sdk-tracing.mdx
+++ b/examples/VercelAISDKTracing/vercel-ai-sdk-tracing.mdx
@@ -94,7 +94,7 @@ const model = wrapAISDKModel(openai("gpt-4o"));
When we use the chatbot again, we see three logs appear in Braintrust: one log for the `getWeather` tool call, one log for the `getFahrenheit` tool call, and one log for the call that forms the final response. However, it'd probably be more useful to have all of these operations in the same log.
-
+
### Creating spans (and sub-spans)
@@ -147,7 +147,7 @@ export async function POST(request: Request) {
After you uncomment those lines of code, you should see the following:
-
+
A couple of things happened in this step:
@@ -239,7 +239,7 @@ export const getWeather = tool({
After we finish uncommenting the correct lines, we see how the `wrapTraced` function enriches our trace with tool calls.
-
+
Take note of how the `type` argument in both `traced` and `wrapTraced` changes the icon within the trace tree. Also, since `checkFreezing` was called by `weatherFunction`, the trace preserves the hierarchy.
diff --git a/examples/VideoQA/VideoQA.ipynb b/examples/VideoQA/VideoQA.ipynb
index 667ca2d..8f47afc 100644
--- a/examples/VideoQA/VideoQA.ipynb
+++ b/examples/VideoQA/VideoQA.ipynb
@@ -221,7 +221,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- ""
+ ""
]
},
{
@@ -365,7 +365,7 @@
"\n",
"After running the evaluation, head over to **Evaluations** in the Braintrust UI to see your results. Select your most recent experiment to review the video frames included in the prompt, the model's answer for each sample, and the scoring by our LLM-based judge. We also attached metadata like `subject` and `question_type`, which you can use to filter in the Braintrust UI. This makes it easy to see whether the model underperforms on a certain type of question or domain. If you discover specific weaknesses, consider refining your prompt with more context or switching models.\n",
"\n",
- ""
+ ""
]
},
{
diff --git a/examples/VideoQA/assets/attachments.gif b/examples/VideoQA/assets/attachments.gif
deleted file mode 100644
index b47a4af..0000000
Binary files a/examples/VideoQA/assets/attachments.gif and /dev/null differ
diff --git a/examples/VideoQA/assets/attachments.mp4 b/examples/VideoQA/assets/attachments.mp4
new file mode 100644
index 0000000..4909ca3
Binary files /dev/null and b/examples/VideoQA/assets/attachments.mp4 differ
diff --git a/examples/VideoQA/assets/filters.gif b/examples/VideoQA/assets/filters.gif
deleted file mode 100644
index 042743e..0000000
Binary files a/examples/VideoQA/assets/filters.gif and /dev/null differ
diff --git a/examples/VideoQA/assets/filters.mp4 b/examples/VideoQA/assets/filters.mp4
new file mode 100644
index 0000000..3c7e939
Binary files /dev/null and b/examples/VideoQA/assets/filters.mp4 differ
diff --git a/examples/VideoQATwelveLabs/VideoQATwelveLabs.ipynb b/examples/VideoQATwelveLabs/VideoQATwelveLabs.ipynb
index d29da2b..b55e235 100644
--- a/examples/VideoQATwelveLabs/VideoQATwelveLabs.ipynb
+++ b/examples/VideoQATwelveLabs/VideoQATwelveLabs.ipynb
@@ -402,7 +402,7 @@
"source": [
"After you run the evaluation, you'll be able to investigate each video as an attachment in Braintrust, so you can dig into any cases that may need attention during evaluation. \n",
"\n",
- ""
+ ""
]
},
{
diff --git a/examples/VideoQATwelveLabs/assets/view-attachment.gif b/examples/VideoQATwelveLabs/assets/view-attachment.gif
deleted file mode 100644
index 8b39fcb..0000000
Binary files a/examples/VideoQATwelveLabs/assets/view-attachment.gif and /dev/null differ
diff --git a/examples/VideoQATwelveLabs/assets/view-attachment.mp4 b/examples/VideoQATwelveLabs/assets/view-attachment.mp4
new file mode 100644
index 0000000..69b89d4
Binary files /dev/null and b/examples/VideoQATwelveLabs/assets/view-attachment.mp4 differ