examples/Mistral-Benchmark/Mistral-Benchmark.ipynb (new file, 219 additions, 0 deletions)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "quNDJhiCJ-AI"
},
"source": [
"# Mistral benchmark\n",
"\n",
"Before starting, please make sure that you have a Braintrust account. If you do not, please [sign up](https://braintrust.dev). Within the account,\n",
"make sure to add OpenAI and Mistral API keys to the [AI secrets](https://www.braintrust.dev/app/settings?subroute=secrets) configuration.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2tMYg0jyKelb"
},
"source": [
"## Setting up the environment\n",
"\n",
"The next few commands install the libraries we need and set up the benchmark. Feel free to copy/paste/tweak/reuse this code in your own tools.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "pGuPB2SqUWdZ"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -U autoevals braintrust openai --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fajc2XybK_gt"
},
"source": [
"### Loading the data\n",
"\n",
"We'll use the CoQA dataset created for the [LLaMa 3.1 Tools](https://www.braintrust.dev/docs/cookbook/recipes/LLaMa-3_1-Tools) guide.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'input': {'input': 'What color was Cotton?', 'output': 'white', 'expected': 'white'}, 'expected': 1, 'metadata': {'source': 'mctest', 'story': 'Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer\\'s horses slept. But Cotton wasn\\'t alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton\\'s mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer\\'s orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \\n\\n\"What are you doing, Cotton?!\" \\n\\n\"I only wanted to be more like you\". \\n\\nCotton\\'s mommy rubbed her face on Cotton\\'s and said \"Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way\". And with that, Cotton\\'s mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton\\'s fur was all all dry. \\n\\n\"Don\\'t ever do that again, Cotton!\" they all cried. \"Next time you might mess up that pretty white fur of yours and we wouldn\\'t want that!\" \\n\\nThen Cotton thought, \"I change my mind. 
I like being special\".', 'questions': ['What color was Cotton?', 'Where did she live?', 'Did she live alone?', 'Who did she live with?', 'What color were her sisters?', 'Was Cotton happy that she looked different than the rest of her family?', 'What did she do to try to make herself the same color as her sisters?', 'Whose paint was it?', \"What did Cotton's mother and siblings do when they saw her painted orange?\", \"Where did Cotton's mother put her to clean the paint off?\", 'What did the other cats do when Cotton emerged from the bucket of water?', 'Did they want Cotton to change the color of her fur?'], 'answers': {'input_text': ['white', 'in a barn', 'no', 'with her mommy and 5 sisters', 'orange and white', 'no', 'she painted herself', 'the farmer', 'they started laughing', 'a bucket of water', 'licked her face', 'no'], 'answer_start': [59, 18, 196, 281, 428, 512, 678, 647, 718, 1035, 1143, 965], 'answer_end': [93, 80, 215, 315, 490, 549, 716, 676, 776, 1097, 1170, 1008]}}}\n"
]
}
],
"source": [
"import json\n",
"data_path = \"../LLaMa-3_1-Tools/coqa-factuality.json\"\n",
"data = json.load(open(data_path))\n",
"\n",
"print(data[0])"
]
},
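{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each record pairs an `input` dict (the question, a model's output, and the expected answer) with an `expected` factuality score and `metadata` carrying the source story. As an optional sanity check, we can confirm this shape before running the evals:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm the record structure before evaluating.\n",
"print(len(data), \"records\")\n",
"print(sorted(data[0].keys()))\n",
"print(sorted(data[0][\"input\"].keys()))"
]
},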
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Experiment gpt-4o is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o\n",
"coqa-factuality [experiment_name=gpt-4o] (data): 20it [00:00, 87655.26it/s]\n",
"coqa-factuality [experiment_name=gpt-4o] (tasks): 100%|██████████| 20/20 [00:00<00:00, 22.72it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"=========================SUMMARY=========================\n",
"85.00% 'NumericDiff' score\n",
"\n",
"See results for gpt-4o at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Experiment gpt-4o-mini is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o-mini\n",
"coqa-factuality [experiment_name=gpt-4o-mini] (data): 20it [00:00, 111848.11it/s]\n",
"coqa-factuality [experiment_name=gpt-4o-mini] (tasks): 100%|██████████| 20/20 [00:00<00:00, 27.50it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"=========================SUMMARY=========================\n",
"gpt-4o-mini compared to gpt-4o:\n",
"85.00% (-) 'NumericDiff' score\t(0 improvements, 0 regressions)\n",
"\n",
"See results for gpt-4o-mini at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/gpt-4o-mini\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Experiment mistral-large-latest is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/mistral-large-latest\n",
"coqa-factuality [experiment_name=mistral-large-latest] (data): 20it [00:00, 78179.01it/s]\n",
"coqa-factuality [experiment_name=mistral-large-latest] (tasks): 100%|██████████| 20/20 [00:00<00:00, 31.59it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"=========================SUMMARY=========================\n",
"mistral-large-latest compared to gpt-4o-mini:\n",
"85.00% (-) 'NumericDiff' score\t(0 improvements, 0 regressions)\n",
"\n",
"See results for mistral-large-latest at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/mistral-large-latest\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Experiment open-mistral-nemo is running at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/open-mistral-nemo\n",
"coqa-factuality [experiment_name=open-mistral-nemo] (data): 20it [00:00, 62695.13it/s]\n",
"coqa-factuality [experiment_name=open-mistral-nemo] (tasks): 100%|██████████| 20/20 [00:00<00:00, 27.29it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"=========================SUMMARY=========================\n",
"open-mistral-nemo compared to mistral-large-latest:\n",
"85.00% (-) 'NumericDiff' score\t(0 improvements, 0 regressions)\n",
"\n",
"See results for open-mistral-nemo at https://www.braintrust.dev/app/braintrustdata.com/p/coqa-factuality/experiments/open-mistral-nemo\n"
]
}
],
"source": [
"import os\n",
"\n",
"from autoevals import Factuality, NumericDiff\n",
"from braintrust import current_span, Eval\n",
"\n",
"models = [\"gpt-4o\", \"gpt-4o-mini\", \"mistral-large-latest\", \"open-mistral-nemo\"]\n",
"\n",
"for model in models:\n",
"    # Bind the current model as a default argument so each task uses the\n",
"    # model from its own loop iteration.\n",
"    async def task(input, model=model):\n",
"        # Factuality itself calls an LLM to grade each answer; passing the\n",
"        # Braintrust API key routes those calls through Braintrust's AI proxy,\n",
"        # which can serve both the OpenAI and Mistral models.\n",
"        result = await Factuality(\n",
"            api_key=os.environ[\"BRAINTRUST_API_KEY\"],\n",
"            model=model,\n",
"        ).eval_async(**input)\n",
"        current_span().log(output=result)\n",
"        return result.score\n",
"\n",
"    await Eval(\n",
"        \"coqa-factuality\",\n",
"        data=data[:20],\n",
"        task=task,\n",
"        scores=[NumericDiff()],\n",
"        trial_count=1,\n",
"        experiment_name=model,\n",
"        metadata={\"model\": model},\n",
"    )"
]
}
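,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All four models land on the same 85% `NumericDiff` score on this 20-example slice. `NumericDiff` compares the numeric score the task returns against the `expected` value in each record; identical values score 1.0, and the score falls as the gap grows. A minimal illustration, assuming the scorer's standard callable interface:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NumericDiff rewards closeness: matching values score 1.0,\n",
"# and the score drops as output and expected diverge.\n",
"print(NumericDiff()(output=1, expected=1).score)\n",
"print(NumericDiff()(output=0.5, expected=1).score)"
]
}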
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}