diff --git a/examples/APIAgent-Py/APIAgent.ipynb b/examples/APIAgent-Py/APIAgent.ipynb index adffce2..d31a0a8 100644 --- a/examples/APIAgent-Py/APIAgent.ipynb +++ b/examples/APIAgent-Py/APIAgent.ipynb @@ -826,7 +826,7 @@ "source": [ "Awesome! The logs now have a `no_hallucination` score which we can use to filter down hallucinations.\n", "\n", - "![Hallucination logs](./assets/logs-with-score.gif)\n" + "![Hallucination logs](./assets/logs-with-score.mp4)\n" ] }, { @@ -839,7 +839,7 @@ "non-hallucinations are correct, but in a real-world scenario, you could [collect user feedback](https://www.braintrust.dev/docs/guides/logging#user-feedback)\n", "and treat positively rated feedback as ground truth.\n", "\n", - "![Dataset setup](./assets/dataset-setup.gif)\n", + "![Dataset setup](./assets/dataset-setup.mp4)\n", "\n", "## Running evals\n", "\n", @@ -1020,7 +1020,7 @@ "\n", "To understand why, we can filter down to this regression, and take a look at a side-by-side diff.\n", "\n", - "![Regression diff](./assets/regression-diff.gif)\n", + "![Regression diff](./assets/regression-diff.mp4)\n", "\n", "Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step.\n", "Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields.\n", diff --git a/examples/APIAgent-Py/assets/dataset-setup.gif b/examples/APIAgent-Py/assets/dataset-setup.gif deleted file mode 100644 index 144e53b..0000000 Binary files a/examples/APIAgent-Py/assets/dataset-setup.gif and /dev/null differ diff --git a/examples/APIAgent-Py/assets/dataset-setup.mp4 b/examples/APIAgent-Py/assets/dataset-setup.mp4 new file mode 100644 index 0000000..3cc5be3 Binary files /dev/null and b/examples/APIAgent-Py/assets/dataset-setup.mp4 differ diff --git a/examples/APIAgent-Py/assets/logs-with-score.gif b/examples/APIAgent-Py/assets/logs-with-score.gif deleted file mode 100644 index db34bc8..0000000 Binary files a/examples/APIAgent-Py/assets/logs-with-score.gif and /dev/null differ diff --git a/examples/APIAgent-Py/assets/logs-with-score.mp4 b/examples/APIAgent-Py/assets/logs-with-score.mp4 new file mode 100644 index 0000000..c88cf40 Binary files /dev/null and b/examples/APIAgent-Py/assets/logs-with-score.mp4 differ diff --git a/examples/APIAgent-Py/assets/regression-diff.gif b/examples/APIAgent-Py/assets/regression-diff.gif deleted file mode 100644 index e3676c2..0000000 Binary files a/examples/APIAgent-Py/assets/regression-diff.gif and /dev/null differ diff --git a/examples/APIAgent-Py/assets/regression-diff.mp4 b/examples/APIAgent-Py/assets/regression-diff.mp4 new file mode 100644 index 0000000..39ad362 Binary files /dev/null and b/examples/APIAgent-Py/assets/regression-diff.mp4 differ diff --git a/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb b/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb index 6685083..800171c 100644 --- a/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb +++ b/examples/ClassifyingNewsArticles/ClassifyingNewsArticles.ipynb @@ -423,7 +423,7 @@ "- You should see the eval scores increase and you can see which test cases improved.\n", "- You can also filter the test cases by improvements to know exactly why the scores changed.\n", "\n", - "![Compare](assets/inspect.gif)\n", + "![Compare](assets/inspect.mp4)\n", "\n" ] }, diff --git a/examples/ClassifyingNewsArticles/assets/inspect.gif b/examples/ClassifyingNewsArticles/assets/inspect.gif deleted file mode 100644 
index 87ab876..0000000 Binary files a/examples/ClassifyingNewsArticles/assets/inspect.gif and /dev/null differ diff --git a/examples/ClassifyingNewsArticles/assets/inspect.mp4 b/examples/ClassifyingNewsArticles/assets/inspect.mp4 new file mode 100644 index 0000000..811d8fd Binary files /dev/null and b/examples/ClassifyingNewsArticles/assets/inspect.mp4 differ diff --git a/examples/Github-Issues/Github-Issues.ipynb b/examples/Github-Issues/Github-Issues.ipynb index 7de01da..c5a9471 100644 --- a/examples/Github-Issues/Github-Issues.ipynb +++ b/examples/Github-Issues/Github-Issues.ipynb @@ -482,7 +482,7 @@ "\n", "Happy evaluating!\n", "\n", - "![improvements](./assets/improvements.gif)\n" + "![improvements](./assets/improvements.mp4)\n" ] } ], diff --git a/examples/Github-Issues/assets/improvements.gif b/examples/Github-Issues/assets/improvements.gif deleted file mode 100644 index 1a86c58..0000000 Binary files a/examples/Github-Issues/assets/improvements.gif and /dev/null differ diff --git a/examples/Github-Issues/assets/improvements.mp4 b/examples/Github-Issues/assets/improvements.mp4 new file mode 100644 index 0000000..4e90f58 Binary files /dev/null and b/examples/Github-Issues/assets/improvements.mp4 differ diff --git a/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb b/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb index c034dae..f34369b 100644 --- a/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb +++ b/examples/LLaMa-3_1-Tools/LLaMa-3_1-Tools.ipynb @@ -756,7 +756,7 @@ "\n", "Although it's a fraction of the cost, it's both slower (likely due to rate limits) and worse performing than GPT-4o. 12 of the 60 cases failed to parse. Let's take a look at one of those in depth.\n", "\n", - "![parsing-failure](./assets/parsing-failure.gif)\n", + "![parsing-failure](./assets/parsing-failure.mp4)\n", "\n", "That definitely looks like an invalid tool call. 
Maybe we can experiment with tweaking the prompt to get better results.\n", "\n", diff --git a/examples/LLaMa-3_1-Tools/assets/parsing-failure.gif b/examples/LLaMa-3_1-Tools/assets/parsing-failure.gif deleted file mode 100644 index a9148b0..0000000 Binary files a/examples/LLaMa-3_1-Tools/assets/parsing-failure.gif and /dev/null differ diff --git a/examples/LLaMa-3_1-Tools/assets/parsing-failure.mp4 b/examples/LLaMa-3_1-Tools/assets/parsing-failure.mp4 new file mode 100644 index 0000000..f22236e Binary files /dev/null and b/examples/LLaMa-3_1-Tools/assets/parsing-failure.mp4 differ diff --git a/examples/OTEL-logging/assets/add-post-filter.gif b/examples/OTEL-logging/assets/add-post-filter.gif deleted file mode 100644 index 8dfbcbf..0000000 Binary files a/examples/OTEL-logging/assets/add-post-filter.gif and /dev/null differ diff --git a/examples/OTEL-logging/assets/add-post-filter.mp4 b/examples/OTEL-logging/assets/add-post-filter.mp4 new file mode 100644 index 0000000..bbdbf0a Binary files /dev/null and b/examples/OTEL-logging/assets/add-post-filter.mp4 differ diff --git a/examples/OTEL-logging/assets/otel-demo.gif b/examples/OTEL-logging/assets/otel-demo.gif deleted file mode 100644 index 17f8395..0000000 Binary files a/examples/OTEL-logging/assets/otel-demo.gif and /dev/null differ diff --git a/examples/OTEL-logging/assets/otel-demo.mp4 b/examples/OTEL-logging/assets/otel-demo.mp4 new file mode 100644 index 0000000..7bcb2f4 Binary files /dev/null and b/examples/OTEL-logging/assets/otel-demo.mp4 differ diff --git a/examples/OTEL-logging/assets/spans.gif b/examples/OTEL-logging/assets/spans.gif deleted file mode 100644 index 27afc0c..0000000 Binary files a/examples/OTEL-logging/assets/spans.gif and /dev/null differ diff --git a/examples/OTEL-logging/assets/spans.mp4 b/examples/OTEL-logging/assets/spans.mp4 new file mode 100644 index 0000000..1fd1e53 Binary files /dev/null and b/examples/OTEL-logging/assets/spans.mp4 differ diff --git a/examples/OTEL-logging/otel-logging.mdx b/examples/OTEL-logging/otel-logging.mdx index 98ceaed..7865d20 100644 --- a/examples/OTEL-logging/otel-logging.mdx +++ b/examples/OTEL-logging/otel-logging.mdx @@ -141,7 +141,7 @@ Run `npm install` to install the required dependencies, then `npm run dev` to la Open your Braintrust project to the **Logs** page, and select **What orders have shipped?** in your applications. You should be able to watch the logs filter in as your application makes HTTP requests and LLM calls. -![LLM calls and logs side by side](assets/otel-demo.gif) +![LLM calls and logs side by side](assets/otel-demo.mp4) Because this application is using multi-step streaming and tool calls, the logs are especially interesting. In Braintrust, logs consist of [traces](/docs/guides/traces), which roughly correspond to a single request or interaction in your application. Traces consist of one or more spans, each of which corresponds to a unit of work in your application. In this example, each step and tool call is logged inside of its own span. This level of granularity makes it easier to debug issues, track user behavior, and collect data into datasets. @@ -149,11 +149,11 @@ Because this application is using multi-step streaming and tool calls, the logs Run a couple more queries in the app and notice the logs that are generated. Our app is logging both `GET` and `POST` requests, but we’re most interested in the `POST` requests since they contain our LLM calls. 
We can apply a filter using the [BTQL](/docs/reference/btql) query `Name LIKE 'POST%'` so that we only see the traces we care about: -![Filter using BTQL](assets/add-post-filter.gif) +![Filter using BTQL](assets/add-post-filter.mp4) You should now have a list of traces for all the `POST` requests your app has made. Each contains the inputs and outputs of each LLM call in a span called `ai.streamText`. If you go further into the trace, you’ll also notice a span for each tool call. -![Expanding tool call and stream spans](assets/spans.gif) +![Expanding tool call and stream spans](assets/spans.mp4) This is valuable data that can be used to evaluate the quality and accuracy of your application in Braintrust. diff --git a/examples/PDFPlayground/PDFPlayground.mdx b/examples/PDFPlayground/PDFPlayground.mdx index 9eea335..43e3cb9 100644 --- a/examples/PDFPlayground/PDFPlayground.mdx +++ b/examples/PDFPlayground/PDFPlayground.mdx @@ -348,7 +348,7 @@ Once your traces have been logged, you can use the Braintrust UI to manage your You can store the user spans from your PDF traces into a dataset. Select the span, and then select **Add span to dataset**, or use the hotkey `D` to speed this up. -![add span to dataset](./assets/add-span-to-dataset.gif) +![add span to dataset](./assets/add-span-to-dataset.mp4) ### Trying system prompts in a playground Select a system prompt span, and then select **Try prompt** to: 1. Save the prompt (for example, "system1") to your library by selecting **Save as custom prompt** 2. Launch a playground using the saved prompt by selecting **Create playground with prompt** -![try prompt from span](./assets/try-prompt.gif) +![try prompt from span](./assets/try-prompt.mp4) ### File attachment methods There are two ways to attach PDF files in playgrounds: using the paperclip butto - To upload files directly from your local machine, start by selecting **+ Message** to add a user prompt. Then, select **+ Message Part** > **File**. This will display a paperclip icon on the right side. Select it to upload a file from your local machine. -![paperclip UI method](./assets/paperclip.gif) +![paperclip UI method](./assets/paperclip.mp4) This method is particularly useful when you're working with local files that aren't accessible via public URL. - To use the public URL method, paste the URL directly into the file message input field. You can also use mustache syntax to extract the URL from metadata. -![public url method](./assets/url.gif) +![public url method](./assets/url.mp4) This method streamlines the process when you're working with publicly available PDFs, like the earnings call transcripts we're using in this cookbook.
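To make the mustache approach above concrete, here is a minimal sketch of seeding a dataset whose rows carry a public PDF link that a playground file message part can reference; the project name, dataset name, and `metadata.url` field are illustrative assumptions, not names prescribed by the cookbook:

```typescript
import { initDataset } from "braintrust";

async function seed() {
  // Hypothetical project and dataset names for illustration.
  const dataset = initDataset("PDF Playground", { dataset: "earnings-calls" });

  dataset.insert({
    input: "Summarize the key risks discussed on this earnings call.",
    // In the playground's file message part, reference this link with
    // mustache syntax: {{metadata.url}}
    metadata: { url: "https://example.com/acme-q3-earnings-transcript.pdf" },
  });

  // Ensure the queued row is written before the process exits.
  await dataset.flush();
}

seed();
```

With rows shaped like this, the same prompt can resolve `{{metadata.url}}` to a different PDF on every row.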
diff --git a/examples/PDFPlayground/assets/add-span-to-dataset.gif b/examples/PDFPlayground/assets/add-span-to-dataset.gif deleted file mode 100644 index a1ee2a4..0000000 Binary files a/examples/PDFPlayground/assets/add-span-to-dataset.gif and /dev/null differ diff --git a/examples/PDFPlayground/assets/add-span-to-dataset.mp4 b/examples/PDFPlayground/assets/add-span-to-dataset.mp4 new file mode 100644 index 0000000..99afd87 Binary files /dev/null and b/examples/PDFPlayground/assets/add-span-to-dataset.mp4 differ diff --git a/examples/PDFPlayground/assets/paperclip.gif b/examples/PDFPlayground/assets/paperclip.gif deleted file mode 100644 index 8eb8f9b..0000000 Binary files a/examples/PDFPlayground/assets/paperclip.gif and /dev/null differ diff --git a/examples/PDFPlayground/assets/paperclip.mp4 b/examples/PDFPlayground/assets/paperclip.mp4 new file mode 100644 index 0000000..8cc02ba Binary files /dev/null and b/examples/PDFPlayground/assets/paperclip.mp4 differ diff --git a/examples/PDFPlayground/assets/try-prompt.gif b/examples/PDFPlayground/assets/try-prompt.gif deleted file mode 100644 index 32366b6..0000000 Binary files a/examples/PDFPlayground/assets/try-prompt.gif and /dev/null differ diff --git a/examples/PDFPlayground/assets/try-prompt.mp4 b/examples/PDFPlayground/assets/try-prompt.mp4 new file mode 100644 index 0000000..266b869 Binary files /dev/null and b/examples/PDFPlayground/assets/try-prompt.mp4 differ diff --git a/examples/PDFPlayground/assets/url.gif b/examples/PDFPlayground/assets/url.gif deleted file mode 100644 index af3b4af..0000000 Binary files a/examples/PDFPlayground/assets/url.gif and /dev/null differ diff --git a/examples/PDFPlayground/assets/url.mp4 b/examples/PDFPlayground/assets/url.mp4 new file mode 100644 index 0000000..37c231b Binary files /dev/null and b/examples/PDFPlayground/assets/url.mp4 differ diff --git a/examples/ProviderBenchmark/ProviderBenchmark.ipynb b/examples/ProviderBenchmark/ProviderBenchmark.ipynb index 2a90e01..2fe278f 100644 --- a/examples/ProviderBenchmark/ProviderBenchmark.ipynb +++ b/examples/ProviderBenchmark/ProviderBenchmark.ipynb @@ -433,7 +433,7 @@ "\n", "Let's start by looking at the project view. Braintrust makes it easy to morph this into a multi-level grouped analysis where we can see the score vs. 
duration in a scatter plot, and how each provider stacks up in the table.\n", "\n", - "![Setting up the table](./assets/configuring-graph.gif)\n", + "![Setting up the table](./assets/configuring-graph.mp4)\n", "\n", "### Insights\n", "\n", diff --git a/examples/ProviderBenchmark/assets/configuring-graph.gif b/examples/ProviderBenchmark/assets/configuring-graph.gif deleted file mode 100644 index 0088f61..0000000 Binary files a/examples/ProviderBenchmark/assets/configuring-graph.gif and /dev/null differ diff --git a/examples/ProviderBenchmark/assets/configuring-graph.mp4 b/examples/ProviderBenchmark/assets/configuring-graph.mp4 new file mode 100644 index 0000000..a0290d9 Binary files /dev/null and b/examples/ProviderBenchmark/assets/configuring-graph.mp4 differ diff --git a/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx b/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx index f296a3b..dcb95cb 100644 --- a/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx +++ b/examples/Realtime/realtime-rag/utils/docs-sample/changelog.mdx @@ -7,7 +7,7 @@ import { LoomVideo } from "#/ui/docs/loom"; import Link from "fumadocs-core/link"; import { Callout } from "fumadocs-ui/components/callout"; import { Step, Steps } from "fumadocs-ui/components/steps"; -import Image from 'next/image'; +import Image from "next/image"; # Changelog @@ -52,18 +52,18 @@ import Image from 'next/image'; - The Traceloop OTEL integration now uses the input and output attributes to populate the corresponding fields in Braintrust. - The monitor page now supports querying experiment metrics. - Removed the `filters` param from the REST API fetch endpoint. For complex -queries, we recommend using the `/btql` endpoint ([docs](/docs/reference/btql)). + queries, we recommend using the `/btql` endpoint ([docs](/docs/reference/btql)). - New experiment summary layout option, a url-friendly view for experiment summaries that respects all filters. - Add a default limit of 10 to all fetch and `/btql` requests for project_logs. - You can now export your prompts from the playground as code snippets and run them through the [AI proxy](/docs/guides/proxy). - Add a fallback for the "add prompt" dropdown button in the playground, which -will search for prompts within the current project if the cross-org prompts -query fails. + will search for prompts within the current project if the cross-org prompts + query fails. ### SDK (version 0.0.171) - Add a `.data` method to the `Attachment` class, which lets you inspect the -loaded attachment data. + loaded attachment data. ## Week of 2024-11-12 @@ -99,6 +99,7 @@ loaded attachment data. - Create custom columns on dataset, experiment and logs tables from `JSON` values in `input`, `output`, `expected`, or `metadata` fields. ### API (version 0.0.59) + - Fix permissions bug with updating org-scoped env vars ## Week of 2024-10-28 @@ -151,7 +152,7 @@ loaded attachment data. ### SDK (version 0.0.164) - Add `braintrust.permalink` function to create deep links pointing to -particular spans in the Braintrust UI. + particular spans in the Braintrust UI. ## Week of 2024-10-07 @@ -170,7 +171,7 @@ particular spans in the Braintrust UI. ### SDK (version 0.0.161) - Add utility function `spanComponentsToObjectId` for resolving the object ID -from an exported span slug. + from an exported span slug. ## Week of 2024-09-30 @@ -178,19 +179,21 @@ from an exported span slug. - Add support for [Cerebras](https://cerebras.ai/) models in the proxy, playground, and saved prompts. 
- You can now create [span iframe viewers](/docs/guides/tracing#custom-span-iframes) to visualize span data in a custom iframe. In this example, the "Table" section is a custom span iframe. -![Span iframe](./guides/traces/span-iframe.png) + ![Span iframe](./guides/traces/span-iframe.png) - `NOT LIKE`, `NOT ILIKE`, `NOT INCLUDES`, and `NOT CONTAINS` supported in BTQL. - Add "Upload Rows" button to insert rows into an existing dataset from CSV or JSON. - Add "Maximum" aggregate score type. - The experiment table now supports grouping by input (for trials) or by a metadata field. - - The Name and Input columns are now pinned + - The Name and Input columns are now pinned - Gemini models now support multimodal inputs. ## Week of 2024-09-23 - Basic monitor page that shows aggregate values for latency, token count, time to first token, and cost for logs. - Create custom tools to use in your prompts and in the playground. See the [docs](/docs/guides/prompts#calling-external-tools) for more details. -- Set org-wide environment variables to use in these tools +- + Set org-wide environment variables + to use in these tools - Pull your prompts to your codebase using the `braintrust pull` command. - Select and compare multiple experiments in the experiment view using the `compared with` dropdown. - The playground now displays aggregate scores (avg/max/min) for each prompt and supports sorting rows by a score. @@ -220,7 +223,6 @@ from an exported span slug. - The tag picker now includes tags that were added dynamically via API, in addition to the tags configured for your project. - Added a REST API for managing AI secrets. See [docs](/docs/reference/api/AiSecrets). - ### SDK (version 0.0.158) - A dedicated `update` method is now available for datasets. @@ -233,11 +235,11 @@ from an exported span slug. - You can now create server-side online evaluations for your logs. Online evals support both [autoevals](/docs/reference/autoevals) and [custom scorers](/docs/guides/playground) you define as LLM-as-a-judge, TypeScript, or Python functions. See [docs](/docs/guides/evals/write#online-evaluation) for more details. - + - New member invitations now support being added to multiple permission groups. - Move datasets and prompts to a new Library navigation tab, and include a list of custom scorers. - Clean up tree view by truncating the root preview and showing a preview of a node only if collapsed. -![Truncated tree view](./reference/release-notes/truncated-tree-view.png) + ![Truncated tree view](./reference/release-notes/truncated-tree-view.png) - Automatically save changes to table views. ## Week of 2024-09-02 @@ -294,12 +296,13 @@ npx braintrust eval --bundle ## Week of 2024-08-12 - You can now create custom LLM and code (TypeScript and Python) evaluators in the playground. - + + - Fullscreen trace toggle - Datasets now accept JSON file uploads - When uploading a CSV/JSON file to a dataset, columns/fields named `input`, `expected`, and `metadata` -are now auto-assigned to the corresponding dataset fields + are now auto-assigned to the corresponding dataset fields - Fix bug in logs/dataset viewer when changing the search params. 
### API (version 0.0.53) @@ -315,7 +318,7 @@ - These metrics, along with cost, now exclude LLM calls used in autoevals (as of 0.0.85) - Switching organizations via the header navigates to the same-named project in the selected organization - Added `MarkAsyncWrapper` to the Python SDK to allow explicitly marking -functions which return awaitable objects as async + functions which return awaitable objects as async ### Autoevals (version 0.0.85) @@ -370,37 +373,37 @@ ## Week of 2024-07-22 - Categorical human review scores can now be re-ordered via Drag-n-Drop. -![Reorder categorical score](./reference/release-notes/category-score-reorder.gif) + ![Reorder categorical score](./reference/release-notes/category-score-reorder.mp4) - Human review row selection is now a free text field, enabling a quick jump to a specific row. -![Human review free text](./reference/release-notes/humanreviewfreetext.png) + ![Human review free text](./reference/release-notes/humanreviewfreetext.png) - Added REST endpoint for managing org membership. See [docs](/docs/reference/api/Organizations#modify-organization-membership). ### API (version 0.0.51) -* The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some +- The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some exciting new features. Here is what you need to know: - * The updates are available as of API version 0.0.51. - * The proxy is now accessible at `https://api.braintrust.dev/v1/proxy`. You can use this as a base URL in your OpenAI client, + - The updates are available as of API version 0.0.51. + - The proxy is now accessible at `https://api.braintrust.dev/v1/proxy`. You can use this as a base URL in your OpenAI client, instead of `https://braintrustproxy.com/v1`. [NOTE: The latter is still supported, but will be deprecated in the future.] - * If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as + - If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as a separate service. - * If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the + - If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the "Outputs" tab. ![Universal URL Cloudformation](./reference/release-notes/universal-url-cloudformation.png) - * Then, replace that in your settings page settings page +- Then, replace that in your settings page ![Universal API](./reference/release-notes/universal-api.png) - * If you have a Docker-based deployment, you can just update your containers. - * Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set. +- If you have a Docker-based deployment, you can just update your containers. +- Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set. ### SDK (version 0.0.146) -* Add support for `max_concurrency` in the Python SDK -* Hill climbing evals that use a `BaseExperiment` as data will use that as the default base experiment.
+- Add support for `max_concurrency` in the Python SDK +- Hill climbing evals that use a `BaseExperiment` as data will use that as the default base experiment. ## Week of 2024-07-15 @@ -420,14 +423,14 @@ ### Autoevals (version 0.0.77) -* Officially switch the default model to be `gpt-4o`. Our testing showed that it performed on average 10% more accurately than `gpt-3.5-turbo`! -* Support claude models (e.g. claude-3-5-sonnet-20240620). You can use them by simply specifying the `model` param in any LLM based evaluator. - * Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings. +- Officially switch the default model to be `gpt-4o`. Our testing showed that it performed on average 10% more accurately than `gpt-3.5-turbo`! +- Support Claude models (e.g. claude-3-5-sonnet-20240620). You can use them by simply specifying the `model` param in any LLM-based evaluator. + - Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings. ## Week of 2024-07-08 - Human review scores are now sortable from the project configuration page. -![Reorder scores](./reference/release-notes/reorder-human-review-scores.gif) + ![Reorder scores](./reference/release-notes/reorder-human-review-scores.mp4) - Streaming support for tool calls in Anthropic models through the proxy and playground. - The playground now supports different "parsing" modes: - `auto`: (same as before) the completion text and the first tool call arguments, if any @@ -437,7 +440,6 @@ - Cleaned up environment variables in the public [docker deployment](https://github.com/braintrustdata/braintrust-deployment/tree/main/docker). Functionally, nothing has changed. - ### Autoevals (version 0.0.76) - New `.partial(...)` syntax to initialize a scorer with partial arguments like `criteria` in `ClosedQA`. @@ -447,7 +449,7 @@ - Table views [can now be saved](/docs/reference/views), persisting the BTQL filters, sorts, and column state. - Add support for the new `window.ai` model into the playground. -![window.ai](./reference/release-notes/window-ai.gif) + ![window.ai](./reference/release-notes/window-ai.mp4) - Use push history when navigating table rows to allow for back button navigation. - In the experiments list, grouping by a metadata field will group rows in the table as well. - Allow the trace tree panel to be resized. @@ -471,8 +473,8 @@ const foo = wrapTraced(async function foo(input) { }); ``` - ### SDK (version 0.0.138) + - The TypeScript SDK's `Eval()` function now takes a `maxConcurrency` parameter, which bounds the number of concurrent tasks that run. - `braintrust install api` now sets up your API and Proxy URL in your environment. @@ -512,7 +514,7 @@ ## Week of 2024-06-03 - You can now collapse the trace tree. It's auto collapsed if you have a single span. -![Collapsible trace tree](./reference/release-notes/trace-tree.png) + ![Collapsible trace tree](./reference/release-notes/trace-tree.png) - Improvements to the experiment chart including greyed out lines for inactive scores and improved legend. - Show diffs when you save a new prompt version.
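To illustrate the `maxConcurrency` parameter mentioned in the SDK 0.0.138 notes above, here is a minimal sketch of a TypeScript eval that caps parallelism; the project name, data, and trivial identity task are placeholders:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Project", {
  data: () => [
    { input: "hello", expected: "hello" },
    { input: "world", expected: "world" },
  ],
  // A trivial identity task, just to have something to run.
  task: async (input: string) => input,
  scores: [Levenshtein],
  // Bound how many test cases run at once.
  maxConcurrency: 5,
});
```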
@@ -819,7 +821,7 @@ server with `curl [api-url]/version`, where the API URL can be found on the ![Experiment search and filtering on project - view](/docs/release-notes/ReleaseNotes11-27-search.gif) + view](/docs/release-notes/ReleaseNotes11-27-search.mp4) - Upgraded AI Proxy to support [tracking Prometheus metrics](https://github.com/braintrustdata/braintrust-proxy/blob/a31a82e6d46ff442a3c478773e6eec21f3d0ba69/apis/cloudflare/wrangler-template.toml#L19C1-L19C1) - Modified Autoevals library to use the [AI proxy](/docs/guides/proxy) @@ -1176,7 +1178,7 @@ Eval([eval_name], { - Fixed our libraries including Autoevals to work with OpenAI’s new libraries
![Added OpenAI function calling in the prompt - playground](/docs/release-notes/ReleaseNotes-2023-11-functions.gif) + playground](/docs/release-notes/ReleaseNotes-2023-11-functions.mp4)
- Added support for function calling and tools in our prompt playground - Added tabs on a project page for datasets, experiments, etc. @@ -1283,7 +1285,7 @@ Eval( - The prompt playground is now live! We're excited to get your feedback as we continue to build this feature out. See [the docs](/docs/guides/playground) for more information. -![Sync Playground](/docs/release-notes/ReleaseNotes-2023-08-Playground.gif) +![Sync Playground](/docs/release-notes/ReleaseNotes-2023-08-Playground.mp4) ## Week of 2023-08-21 @@ -1295,7 +1297,7 @@ Eval( changes to your code. - You can now edit datasets in the UI. -![Edit Dataset](/docs/release-notes/ReleaseNotes-2023-08-EditDataset.gif) +![Edit Dataset](/docs/release-notes/ReleaseNotes-2023-08-EditDataset.mp4) ## Week of 2023-08-14 @@ -1399,11 +1401,11 @@ braintrust install api --update-template - You can now swap the primary and comparison experiment with a single click. -![Swap experiments](/docs/release-notes/ReleaseNotes-2023-07-Swap.gif) +![Swap experiments](/docs/release-notes/ReleaseNotes-2023-07-Swap.mp4) - You can now compare `output` vs. `expected` within an experiment. -![Diff output and expected](/docs/release-notes/ReleaseNotes-2023-07-Diff.gif) +![Diff output and expected](/docs/release-notes/ReleaseNotes-2023-07-Diff.mp4) - Version 0.0.19 is out for the SDK. It is an important update that throws an error if your payload is larger than 64KB in size. @@ -1435,7 +1437,7 @@ braintrust install api --update-template - New scatter plot and histogram insights to quickly analyze scores and filter down examples. - ![Scatter Plot](/docs/release-notes/ReleaseNotes-2023-06-Scatter.gif) + ![Scatter Plot](/docs/release-notes/ReleaseNotes-2023-06-Scatter.mp4) - API keys that can be set in the SDK (explicitly or through an environment variable) and do not require user login. Visit the settings page to create an API key. diff --git a/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx b/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx index 8059234..1aa25bb 100644 --- a/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx +++ b/examples/Realtime/realtime-rag/utils/docs-sample/human-review.mdx @@ -13,7 +13,7 @@ feedback from end users, subject matter experts, and product teams in one place. use human review to evaluate/compare experiments, assess the efficacy of your automated scoring methods, and curate log events to use in your evals. -![Human review label](./human-review/label.gif) +![Human review label](./human-review/label.mp4) ## Configuring human review @@ -32,7 +32,6 @@ options and their scores. Once you create a score, it will automatically appear in the "Scores" section in each experiment and log event throughout the project. - ### Writing to expected fields You may choose to write categorical scores to the `expected` field of a span instead of a score. @@ -40,9 +39,9 @@ To enable this, simply check the "Write to expected field instead of score" opti an option to select multiple values when writing to the expected field. - A numeric score will not be assigned to the categorical options when writing to the expected - field. If there is an existing object in the expected field, the categorical value will be - appended to the object. + A numeric score will not be assigned to the categorical options when writing + to the expected field. If there is an existing object in the expected field, + the categorical value will be appended to the object. 
![Write to expected](./human-review/write-to-expected.webp) @@ -54,7 +53,7 @@ In addition to categorical scores, you can always directly edit the structured o To manually review results in your logs or an experiment, simply click on a row, and you'll see the human review scores you configured in the expanded trace view. -![Set score](./human-review/in-experiment.gif) +![Set score](./human-review/in-experiment.mp4) As you set scores, they will be automatically saved and reflected in the summary metrics. The exact same mechanism works whether you're reviewing logs or experiments. @@ -64,7 +63,7 @@ mechanism works whether you're reviewing logs or experiments. In addition to setting scores, you can also add comments to spans and update their `expected` values. These updates are tracked alongside score updates to form an audit trail of edits to a span. -![Save comment](./human-review/comment.gif) +![Save comment](./human-review/comment.mp4) ## Rapid review mode @@ -72,7 +71,7 @@ If you or a subject matter expert is reviewing a large number of logs, you can u a UI that's optimized specifically for review. To enter review mode, hit the "r" key or the expand () icon next to the "Human review" header. -![Review mode](./human-review/review-mode.gif) +![Review mode](./human-review/review-mode.mp4) In review mode, you can set scores, leave comments, and edit expected values. Review mode is optimized for keyboard navigation, so you can quickly move between scores and rows with keyboard shortcuts. You can also share a link to the diff --git a/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx b/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx index 1963380..cd33703 100644 --- a/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx +++ b/examples/Realtime/realtime-rag/utils/docs-sample/playground.mdx @@ -31,7 +31,7 @@ that includes one or more prompts and is linked to a dataset. Playgrounds are designed for collaboration and automatically synchronize in real-time. -![Sync Playground](/docs/guides/playground/sync-playground.gif) +![Sync Playground](/docs/guides/playground/sync-playground.mp4) To share a playground, simply copy the URL and send it to your collaborators. Your collaborators must be members of your organization to see the session. You can invite users from the settings page. diff --git a/examples/ReceiptExtraction/ReceiptExtraction.ipynb b/examples/ReceiptExtraction/ReceiptExtraction.ipynb index d51589f..278727e 100644 --- a/examples/ReceiptExtraction/ReceiptExtraction.ipynb +++ b/examples/ReceiptExtraction/ReceiptExtraction.ipynb @@ -403,7 +403,7 @@ "\n", "If you click into the gpt-4o experiment and compare it to gpt-4o-mini, you can drill down into the individual improvements and regressions.\n", "\n", - "![Regressions](./assets/GPT-4o-vs-4o-mini.gif)\n", + "![Regressions](./assets/GPT-4o-vs-4o-mini.mp4)\n", "\n", "There are several different types of regressions, one of which appears to be that `gpt-4o` returns information in a different case than `gpt-4o-mini`. 
That may or\nmay not be important for this use case, but if not, we could adjust our scoring functions to lowercase everything before comparing.\n", diff --git a/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.gif b/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.gif deleted file mode 100644 index b31998d..0000000 Binary files a/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.gif and /dev/null differ diff --git a/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.mp4 b/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.mp4 new file mode 100644 index 0000000..f99a6f0 Binary files /dev/null and b/examples/ReceiptExtraction/assets/GPT-4o-vs-4o-mini.mp4 differ diff --git a/examples/SimpleRagas/SimpleRagas.ipynb b/examples/SimpleRagas/SimpleRagas.ipynb index b60c723..8f3f9bb 100644 --- a/examples/SimpleRagas/SimpleRagas.ipynb +++ b/examples/SimpleRagas/SimpleRagas.ipynb @@ -449,7 +449,7 @@ "results, and maybe we should try using `gpt-4` instead. Braintrust lets us test the effect of this quickly, directly in the UI, before we run\n", "a full experiment:\n", "\n", - "![try gpt-4](./assets/try-gpt-4.gif)\n", + "![try gpt-4](./assets/try-gpt-4.mp4)\n", "\n", "Looks better. Let's update our scoring function to use it and re-run the experiment.\n" ] @@ -575,7 +575,7 @@ "We can drill down on individual examples of each regression type to better understand it. The side-by-side diffs built into Braintrust make\n", "it easy to deeply understand every step of the pipeline, for example, which documents were missing, and why.\n", "\n", - "![missing docs](./assets/missing-docs.gif)\n", + "![missing docs](./assets/missing-docs.mp4)\n", "\n", "And there you have it! Ragas is a powerful technique that, with the right tools and iteration, can lead to really high-quality RAG applications. Happy evaling!\n" ] diff --git a/examples/SimpleRagas/assets/missing-docs.gif b/examples/SimpleRagas/assets/missing-docs.gif deleted file mode 100644 index 15e36ef..0000000 Binary files a/examples/SimpleRagas/assets/missing-docs.gif and /dev/null differ diff --git a/examples/SimpleRagas/assets/missing-docs.mp4 b/examples/SimpleRagas/assets/missing-docs.mp4 new file mode 100644 index 0000000..f5bc15e Binary files /dev/null and b/examples/SimpleRagas/assets/missing-docs.mp4 differ diff --git a/examples/SimpleRagas/assets/try-gpt-4.gif b/examples/SimpleRagas/assets/try-gpt-4.gif deleted file mode 100644 index 78a398a..0000000 Binary files a/examples/SimpleRagas/assets/try-gpt-4.gif and /dev/null differ diff --git a/examples/SimpleRagas/assets/try-gpt-4.mp4 b/examples/SimpleRagas/assets/try-gpt-4.mp4 new file mode 100644 index 0000000..b4b6ea0 Binary files /dev/null and b/examples/SimpleRagas/assets/try-gpt-4.mp4 differ diff --git a/examples/Text2SQL-Data/Text2SQL-Data.ipynb b/examples/Text2SQL-Data/Text2SQL-Data.ipynb index a523128..9b842f6 100644 --- a/examples/Text2SQL-Data/Text2SQL-Data.ipynb +++ b/examples/Text2SQL-Data/Text2SQL-Data.ipynb @@ -400,7 +400,7 @@ "1. Let's capture the good data into a dataset. Since our eval pipeline did the hard work of generating a reference query and results, we can\n", " now save these, and make sure that future changes we make do not _regress_ the results.\n", "\n", - "![add to dataset](./assets/add-to-dataset.gif)\n", + "![add to dataset](./assets/add-to-dataset.mp4)\n", "\n", "- The incorrect query didn't seem to get the date format correct.
That would probably be improved by showing a sample of the data to the model.\n", "\n", @@ -986,7 +986,7 @@ "\n", "Braintrust makes it easy to filter down to the regressions, and view a side-by-side diff:\n", "\n", - "![diff](./assets/analyze-regressions.gif)\n", + "![diff](./assets/analyze-regressions.mp4)\n", "\n", "## Conclusion\n", "\n", diff --git a/examples/Text2SQL-Data/assets/add-to-dataset.gif b/examples/Text2SQL-Data/assets/add-to-dataset.gif deleted file mode 100644 index 17812b2..0000000 Binary files a/examples/Text2SQL-Data/assets/add-to-dataset.gif and /dev/null differ diff --git a/examples/Text2SQL-Data/assets/add-to-dataset.mp4 b/examples/Text2SQL-Data/assets/add-to-dataset.mp4 new file mode 100644 index 0000000..ec985c9 Binary files /dev/null and b/examples/Text2SQL-Data/assets/add-to-dataset.mp4 differ diff --git a/examples/Text2SQL-Data/assets/analyze-regressions.gif b/examples/Text2SQL-Data/assets/analyze-regressions.gif deleted file mode 100644 index a0c2b72..0000000 Binary files a/examples/Text2SQL-Data/assets/analyze-regressions.gif and /dev/null differ diff --git a/examples/Text2SQL-Data/assets/analyze-regressions.mp4 b/examples/Text2SQL-Data/assets/analyze-regressions.mp4 new file mode 100644 index 0000000..77204be Binary files /dev/null and b/examples/Text2SQL-Data/assets/analyze-regressions.mp4 differ diff --git a/examples/ToolOCR/ToolOCR.mdx b/examples/ToolOCR/ToolOCR.mdx index 7ac7132..8970bff 100644 --- a/examples/ToolOCR/ToolOCR.mdx +++ b/examples/ToolOCR/ToolOCR.mdx @@ -95,7 +95,7 @@ braintrust push ocr.py --requirements requirements.txt To try out the tool, visit the **toolOCR** project in Braintrust, and navigate to **Tools**. Here, you can test different images and see what kinds of outputs you're getting from the tool. -![Try gif](assets/try-tool.gif) +![Try tool](assets/try-tool.mp4) This is helpful information for deciding if you'd like to do any additional post-processing on the text output. For example, you may notice that your output contains `\n` to indicate new lines in the parsed text. You could include additional processing in your tool to handle these. If you change your code, just run `braintrust push ocr.py --requirements requirements.txt` again to sync the tool with Braintrust. @@ -117,7 +117,7 @@ prompt = project.prompts.create( Just like the tool, you can run it in the UI and even try it out on some examples: -![Try prompt](assets/try-prompt.gif) +![Try prompt](assets/try-prompt.mp4) If you visit the **Logs** tab, you can check out detailed logs for each call: @@ -142,7 +142,7 @@ Then, navigate to **Dataset** in your playground and select the **Recipes** data Your playground is now set up with a prompt, model choice, dataset, and the tool we created. Hit **Run** to run the prompt and tool on the images in the dataset. -![Run playground](assets/run-playground.gif) +![Run playground](assets/run-playground.mp4) ## Iterating on the prompt Now that we have an interactive environment to test out our prompt and tool call Hit the copy icon to duplicate your prompt and start tweaking. You can also tweak the original prompt and save your changes there if you'd like. For example, you can try instructing the model to always list the quantity of each ingredient you need to purchase. -![Tweak prompt](assets/tweak-prompt.gif) +![Tweak prompt](assets/tweak-prompt.mp4) Once you're satisfied with the prompt, hit **Update** to save the changes. Each time you save the prompt, you create a new version.
To learn more about how to use a prompt in your code, check out the diff --git a/examples/ToolOCR/assets/run-playground.gif b/examples/ToolOCR/assets/run-playground.gif deleted file mode 100644 index aa30536..0000000 Binary files a/examples/ToolOCR/assets/run-playground.gif and /dev/null differ diff --git a/examples/ToolOCR/assets/run-playground.mp4 b/examples/ToolOCR/assets/run-playground.mp4 new file mode 100644 index 0000000..24d9e0a Binary files /dev/null and b/examples/ToolOCR/assets/run-playground.mp4 differ diff --git a/examples/ToolOCR/assets/try-prompt.gif b/examples/ToolOCR/assets/try-prompt.gif deleted file mode 100644 index 864ac57..0000000 Binary files a/examples/ToolOCR/assets/try-prompt.gif and /dev/null differ diff --git a/examples/ToolOCR/assets/try-prompt.mp4 b/examples/ToolOCR/assets/try-prompt.mp4 new file mode 100644 index 0000000..9fccb54 Binary files /dev/null and b/examples/ToolOCR/assets/try-prompt.mp4 differ diff --git a/examples/ToolOCR/assets/try-tool.gif b/examples/ToolOCR/assets/try-tool.gif deleted file mode 100644 index bef1402..0000000 Binary files a/examples/ToolOCR/assets/try-tool.gif and /dev/null differ diff --git a/examples/ToolOCR/assets/try-tool.mp4 b/examples/ToolOCR/assets/try-tool.mp4 new file mode 100644 index 0000000..03dd298 Binary files /dev/null and b/examples/ToolOCR/assets/try-tool.mp4 differ diff --git a/examples/ToolOCR/assets/tweak-prompt.gif b/examples/ToolOCR/assets/tweak-prompt.gif deleted file mode 100644 index 55ba4cf..0000000 Binary files a/examples/ToolOCR/assets/tweak-prompt.gif and /dev/null differ diff --git a/examples/ToolOCR/assets/tweak-prompt.mp4 b/examples/ToolOCR/assets/tweak-prompt.mp4 new file mode 100644 index 0000000..2c68b78 Binary files /dev/null and b/examples/ToolOCR/assets/tweak-prompt.mp4 differ diff --git a/examples/ToolRAG/ToolRAG.mdx b/examples/ToolRAG/ToolRAG.mdx index 1731fab..6628849 100644 --- a/examples/ToolRAG/ToolRAG.mdx +++ b/examples/ToolRAG/ToolRAG.mdx @@ -7,7 +7,7 @@ to compare multiple versions side-by-side, you'd have to deploy each version sep Using Braintrust, you can experiment with different prompts together with retrieval logic, side-by-side, all within the playground UI. In this cookbook, we'll walk through exactly how. -![Side-by-side](./assets/Side-by-side.gif) +![Side-by-side](./assets/Side-by-side.mp4) ## Architecture @@ -117,7 +117,7 @@ The output should be: To try out the tool, visit the project in Braintrust, and navigate to **Tools**. -![Test tool](./assets/Test-tool.gif) +![Test tool](./assets/Test-tool.mp4) Here, you can test different searches and refine the logic. For example, you could try playing with various `top_k` values, or adding a prefix to the query to guide the results. If you change the code, run @@ -150,7 +150,7 @@ npx braintrust push prompt.ts Once the prompt uploads, you can run it in the UI and even try it out on some examples: -![Test prompt](./assets/Test-prompt.gif) +![Test prompt](./assets/Test-prompt.mp4) If you visit the **Logs** tab, you can check out detailed logs for each call: @@ -181,12 +181,12 @@ Once you create it, if you visit the **Datasets** tab, you'll be able to explore To try out the prompt together with the dataset, we'll create a playground. -![Create playground](./assets/Create-playground.gif) +![Create playground](./assets/Create-playground.mp4) Once you create the playground, hit **Run** to run the prompt and tool on the questions in the dataset. 
-![Run playground](./assets/Run-playground.gif) +![Run playground](./assets/Run-playground.mp4) ### Define a scorer @@ -228,7 +228,7 @@ Once you define the scorer, hit **Run** to run it on the questions in the datase Now, let's tweak the prompt to see if we can improve the results. Hit the copy icon to duplicate your prompt and start tweaking. You can also tweak the original prompt and save your changes there if you'd like. For example, you can try instructing the model to always include a Python and TypeScript code snippet. -![Tweak prompt](./assets/Tweak-prompt.gif) +![Tweak prompt](./assets/Tweak-prompt.mp4) Once you're satisfied with the prompt, hit **Update** to save the changes. Each time you save the prompt, you create a new version. To learn more about how to use a prompt in your code, check out the diff --git a/examples/ToolRAG/assets/Create-playground.gif b/examples/ToolRAG/assets/Create-playground.gif deleted file mode 100644 index c12c6b1..0000000 Binary files a/examples/ToolRAG/assets/Create-playground.gif and /dev/null differ diff --git a/examples/ToolRAG/assets/Create-playground.mp4 b/examples/ToolRAG/assets/Create-playground.mp4 new file mode 100644 index 0000000..43d4af0 Binary files /dev/null and b/examples/ToolRAG/assets/Create-playground.mp4 differ diff --git a/examples/ToolRAG/assets/Run-playground.gif b/examples/ToolRAG/assets/Run-playground.gif deleted file mode 100644 index 494ee17..0000000 Binary files a/examples/ToolRAG/assets/Run-playground.gif and /dev/null differ diff --git a/examples/ToolRAG/assets/Run-playground.mp4 b/examples/ToolRAG/assets/Run-playground.mp4 new file mode 100644 index 0000000..b1d3280 Binary files /dev/null and b/examples/ToolRAG/assets/Run-playground.mp4 differ diff --git a/examples/ToolRAG/assets/Side-by-side.gif b/examples/ToolRAG/assets/Side-by-side.gif deleted file mode 100644 index 5c333a2..0000000 Binary files a/examples/ToolRAG/assets/Side-by-side.gif and /dev/null differ diff --git a/examples/ToolRAG/assets/Side-by-side.mp4 b/examples/ToolRAG/assets/Side-by-side.mp4 new file mode 100644 index 0000000..bc732d6 Binary files /dev/null and b/examples/ToolRAG/assets/Side-by-side.mp4 differ diff --git a/examples/ToolRAG/assets/Test-prompt.gif b/examples/ToolRAG/assets/Test-prompt.gif deleted file mode 100644 index 7ed7f95..0000000 Binary files a/examples/ToolRAG/assets/Test-prompt.gif and /dev/null differ diff --git a/examples/ToolRAG/assets/Test-prompt.mp4 b/examples/ToolRAG/assets/Test-prompt.mp4 new file mode 100644 index 0000000..53f26e2 Binary files /dev/null and b/examples/ToolRAG/assets/Test-prompt.mp4 differ diff --git a/examples/ToolRAG/assets/Test-tool.gif b/examples/ToolRAG/assets/Test-tool.gif deleted file mode 100644 index 70b01bb..0000000 Binary files a/examples/ToolRAG/assets/Test-tool.gif and /dev/null differ diff --git a/examples/ToolRAG/assets/Test-tool.mp4 b/examples/ToolRAG/assets/Test-tool.mp4 new file mode 100644 index 0000000..387f22c Binary files /dev/null and b/examples/ToolRAG/assets/Test-tool.mp4 differ diff --git a/examples/ToolRAG/assets/Tweak-prompt.gif b/examples/ToolRAG/assets/Tweak-prompt.gif deleted file mode 100644 index fe25c28..0000000 Binary files a/examples/ToolRAG/assets/Tweak-prompt.gif and /dev/null differ diff --git a/examples/ToolRAG/assets/Tweak-prompt.mp4 b/examples/ToolRAG/assets/Tweak-prompt.mp4 new file mode 100644 index 0000000..1b1696d Binary files /dev/null and b/examples/ToolRAG/assets/Tweak-prompt.mp4 differ diff --git a/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx 
b/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx index ec7ad13..1252fca 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/APIAgent-Py.mdx @@ -545,7 +545,7 @@ Question: How do I purchase GPUs through Braintrust? Awesome! The logs now have a `no_hallucination` score which we can use to filter down hallucinations. -![Hallucination logs](./../assets/APIAgent-Py/logs-with-score.gif) +![Hallucination logs](./../assets/APIAgent-Py/logs-with-score.mp4) ### Creating datasets @@ -553,7 +553,7 @@ Let's create two datasets: one for good answers and the other for hallucinations non-hallucinations are correct, but in a real-world scenario, you could [collect user feedback](https://www.braintrust.dev/docs/guides/logging#user-feedback) and treat positively rated feedback as ground truth. -![Dataset setup](./../assets/APIAgent-Py/dataset-setup.gif) +![Dataset setup](./../assets/APIAgent-Py/dataset-setup.mp4) ## Running evals @@ -680,7 +680,7 @@ Awesome! Looks like we were able to solve the hallucinations, although we may ha To understand why, we can filter down to this regression, and take a look at a side-by-side diff. -![Regression diff](./../assets/APIAgent-Py/regression-diff.gif) +![Regression diff](./../assets/APIAgent-Py/regression-diff.mp4) Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step. Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields. diff --git a/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx b/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx index 9be6106..11c9ea1 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/Github-Issues.mdx @@ -390,4 +390,4 @@ them into your evals. Happy evaluating! -![improvements](./../assets/Github-Issues/improvements.gif) +![improvements](./../assets/Github-Issues/improvements.mp4) diff --git a/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx b/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx index d430e96..652b905 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/LLaMa-3_1-Tools.mdx @@ -605,7 +605,7 @@ Ok, let's dig into the results. To start, we'll look at how LLaMa-3.1-8B compare Although it's a fraction of the cost, it's both slower (likely due to rate limits) and worse performing than GPT-4o. 12 of the 60 cases failed to parse. Let's take a look at one of those in depth. -![parsing-failure](./../assets/LLaMa-3_1-Tools/parsing-failure.gif) +![parsing-failure](./../assets/LLaMa-3_1-Tools/parsing-failure.mp4) That definitely looks like an invalid tool call. Maybe we can experiment with tweaking the prompt to get better results. diff --git a/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx b/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx index 842c7ae..792c920 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/ProviderBenchmark.mdx @@ -381,7 +381,7 @@ await Promise.all(providers.map(runProviderBenchmark)); Let's start by looking at the project view. Braintrust makes it easy to morph this into a multi-level grouped analysis where we can see the score vs. duration in a scatter plot, and how each provider stacks up in the table. 
-![Setting up the table](./../assets/ProviderBenchmark/configuring-graph.gif) +![Setting up the table](./../assets/ProviderBenchmark/configuring-graph.mp4) ### Insights diff --git a/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx b/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx index 6dfb6b2..9e5a068 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/ReceiptExtraction.mdx @@ -325,7 +325,7 @@ Let's dig into these individual results in some more depth. If you click into the gpt-4o experiment and compare it to gpt-4o-mini, you can drill down into the individual improvements and regressions. -![Regressions](./../assets/ReceiptExtraction/GPT-4o-vs-4o-mini.gif) +![Regressions](./../assets/ReceiptExtraction/GPT-4o-vs-4o-mini.mp4) There are several different types of regressions, one of which appears to be that `gpt-4o` returns information in a different case than `gpt-4o-mini`. That may or may not be important for this use case, but if not, we could adjust our scoring functions to lowercase everything before comparing. diff --git a/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx b/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx index 632dfe9..2aabe67 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/SimpleRagas.mdx @@ -339,7 +339,7 @@ By default, Ragas is configured to use `gpt-3.5-turbo-16k`. As we observed, it l results, and maybe we should try using `gpt-4` instead. Braintrust lets us test the effect of this quickly, directly in the UI, before we run a full experiment: -![try gpt-4](./../assets/SimpleRagas/try-gpt-4.gif) +![try gpt-4](./../assets/SimpleRagas/try-gpt-4.mp4) Looks better. Let's update our scoring function to use it and re-run the experiment. @@ -427,6 +427,6 @@ Although not a pure fail, it does seem like in 3 cases we're not retrieving the We can drill down on individual examples of each regression type to better understand it. The side-by-side diffs built into Braintrust make it easy to deeply understand every step of the pipeline, for example, which documents were missing, and why. -![missing docs](./../assets/SimpleRagas/missing-docs.gif) +![missing docs](./../assets/SimpleRagas/missing-docs.mp4) And there you have it! Ragas is a powerful technique that, with the right tools and iteration, can lead to really high-quality RAG applications. Happy evaling! diff --git a/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx b/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx index dcb6089..8d58255 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/Text2SQL-Data.mdx @@ -290,7 +290,7 @@ To best utilize these results: 1. Let's capture the good data into a dataset. Since our eval pipeline did the hard work of generating a reference query and results, we can now save these, and make sure that future changes we make do not _regress_ the results. -![add to dataset](./../assets/Text2SQL-Data/add-to-dataset.gif) +![add to dataset](./../assets/Text2SQL-Data/add-to-dataset.mp4) - The incorrect query didn't seem to get the date format correct. That would probably be improved by showing a sample of the data to the model. @@ -685,7 +685,7 @@ Interesting. It seems like that was not a slam dunk.
There were a few regression Braintrust makes it easy to filter down to the regressions, and view a side-by-side diff: -![diff](./../assets/Text2SQL-Data/analyze-regressions.gif) +![diff](./../assets/Text2SQL-Data/analyze-regressions.mp4) ## Conclusion diff --git a/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx b/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx index 8ad4a9a..30e3804 100644 --- a/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx +++ b/examples/ToolRAG/tool-rag/docs-sample/UnreleasedAI.mdx @@ -163,7 +163,7 @@ Now, let’s use the comprehensiveness scorer to create a feedback loop that all Go to your Braintrust **Logs** and select one of your logs. In the expanded view on the left-hand side of your screen, select the **generate-changelog** span, then select **Add to dataset**. Create a new dataset called `eval dataset`, and add a couple more logs to the same dataset. We'll use this dataset to run an experiment that evaluates for comprehensiveness to understand where the prompt might need adjustments.
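To close the loop described above, here is a minimal sketch of an eval that runs over the curated `eval dataset`; the project name is assumed to match the cookbook's, and `generateChangelog` and `comprehensiveness` are hypothetical stand-ins for the generate-changelog prompt logic and the comprehensiveness scorer defined earlier, not a fixed API:

```typescript
import { Eval, initDataset } from "braintrust";
// Stand-ins for the cookbook's generate-changelog task and
// comprehensiveness scorer defined earlier.
import { generateChangelog, comprehensiveness } from "./changelog";

Eval("UnreleasedAI", {
  // Pull the rows curated from the logs into the experiment.
  data: initDataset("UnreleasedAI", { dataset: "eval dataset" }),
  task: async (input: string) => generateChangelog(input),
  scores: [comprehensiveness],
});
```

Running this produces an experiment whose comprehensiveness scores point to where the prompt might need adjustments.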