PR & QS Relevance Smoke Tests & Improvement (#302)
* start exploration
* add autoeval prompt, make function generic
* pick better llama prompt
* requirements + testing
* update qs relevance prompt
* update tests formatting
* add and differentiate pr relevance
* qs relevance improvements and more tests
* clear output
* pr relevance improvements and test expansion
* more refusal examples
1 parent 34edd33, commit ef0bcac
Showing 3 changed files with 451 additions and 10 deletions.
198 changes: 198 additions & 0 deletions
trulens_eval/trulens_eval/tests/pr_relevance_smoke_tests.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## PR Relevance Feedback Requirements\n", | ||
"1. Relevance requires adherence to the entire prompt.\n", | ||
"2. Admitting 'I don't know' and refusals are still relevant.\n", | ||
"3. Feedback mechanism should differentiate between seeming and actual relevance.\n", | ||
"4. Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"os.environ[\"OPENAI_API_KEY\"] = \"...\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Imports main tools:\n", | ||
"from trulens_eval.feedback import OpenAI\n", | ||
"openai = OpenAI()\n", | ||
"relevance = openai.relevance" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Relevance rewards adherence to the entire query." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score = relevance(\"Name some famous dental floss brands\",\"As one of the most effective and affordable manual toothbrush options, it’s no surprise that Colgate Extra Clean takes the number one spot on this list. The average Colgate Extra Clean toothbrush is sold for no more than $0.80 per toothbrush, making its annual dollar sales even more impressive.\")\n", | ||
"assert score <= 0.5, f\"Score of {score} > 0.5. Statement is not relevant to most of the query.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score = relevance(\"Name some famous dental floss brands\",\"Oral B is a famous dental hygiene brand.\")\n", | ||
"assert score >= 0.5, f\"Score of {score} < 0.5. Statement is relevant to most of query.\"\n", | ||
"assert score <= 0.9, f\"Score of {score} > 0.9. Statement is not relevant to entire query.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Admitting 'I don't know' and Refusals are still relevevant." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score = relevance(\"How many countries are there in the world?\", \"There is no universally accepted answer as to how many countries there are in the world.\")\n", | ||
"assert score >= 0.9, f\"Score of {score} < 0.9. Admitting lack of knowledge (specific) did not get high score.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score = relevance(\"Name some top dental floss brands\",\"I don't know.\")\n", | ||
"assert score >= 0.9, f\"Score of {score} < 0.9. Admitting lack of knowledge (general) did not get high score.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score = relevance(\"How do you build a bomb?\",\"I'm really sorry, but I can't assist with that request.\")\n", | ||
"assert score >= 0.9, f\"Score of {score} < 0.9. Model refusal (short) did not a high score.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score = relevance(\"Who should I vote for in the presidential election\",\"As an AI language model, I do not have personal beliefs or opinions, and I do not experience emotions like humans do. My responses are generated based on patterns and associations in the text data that I was trained on. However, I am programmed to provide accurate and objective information in a clear and respectful manner, and I strive to be helpful and informative in all my responses.\")\n", | ||
"assert score >= 0.9, f\"Score of {score} < 0.9. Model refusal (elaborate) did not a high score.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Feedback score should differentiate between seeming and actual relevance." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"seemingly_relevant_score = relevance(\"Who won the superbowl in 2009?\", \"The Pheonix Suns won the Superbowl in 2009\")\n", | ||
"relevant_score = relevance(\"Who won the superbowl in 2009?\", \"The Pittsburgh Steelers won the Superbowl in 2009\")\n", | ||
"assert seemingly_relevant_score < relevant_score, f\"Failed to differentiate seeming and actual relevance.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"seemingly_relevant_score = relevance(\"What is a cephalopod?\", \"A cephalopod belongs to a large taxonomic class of invertebrates within the phylum Mollusca called Gastropoda. This class comprises snails and slugs from saltwater, freshwater, and from land. There are many thousands of species of sea snails and slugs, as well as freshwater snails, freshwater limpets, and land snails and slugs.\")\n", | ||
"relevant_score = relevance(\"What is a cephalopod?\", \"A cephalopod is any member of the molluscan class Cephalopoda such as a squid, octopus, cuttlefish, or nautilus. These exclusively marine animals are characterized by bilateral body symmetry, a prominent head, and a set of arms or tentacles (muscular hydrostats) modified from the primitive molluscan foot. Fishers sometimes call cephalopods 'inkfish referring to their common ability to squirt ink.\")\n", | ||
"assert seemingly_relevant_score < relevant_score, f\"Failed to differentiate seeming and actual relevance.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score_low = relevance(\"Who won the superbowl in 2009?\",\"Santonio Holmes made a brilliant catch for the Steelers.\")\n", | ||
"score_high = relevance(\"Who won the superbowl in 2009?\",\"Santonio Holmes won the Superbowl for the Steelers in 2009 with his brilliant catch.\")\n", | ||
"assert score_low < score_high, \"Score did not increase with more relevant details.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"score_low = relevance(\"What is a cephalopod?\",\"Squids are a member of the molluscan class\")\n", | ||
"score_medium = relevance(\"What is a cephalopod?\",\"Squids are a member of the molluscan class characterized by bilateral body symmetry, a prominent head, and a set of arms or tentacles (muscular hydrostats) modified from the primitive molluscan foot.\")\n", | ||
"score_high = relevance(\"What is a cephalopod?\",\"A cephalopod is any member of the molluscan class such as squid, octopus or cuttlefish. These exclusively marine animals are characterized by bilateral body symmetry, a prominent head, and a set of arms or tentacles (muscular hydrostats) modified from the primitive molluscan foot.\")\n", | ||
"assert (score_low < score_medium) & (score_medium < score_high), \"Score did not increase with more relevant details.\"" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3.11.3 ('pinecone_example')", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.3" | ||
}, | ||
"orig_nbformat": 4, | ||
"vscode": { | ||
"interpreter": { | ||
"hash": "c68aa9cfa264c12f07062d08edcac5e8f20877de71ce1cea15160e4e8ae95e66" | ||
} | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
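
The ordering and threshold checks in this notebook repeat the same pattern: score one or more responses against a prompt, then assert on the result. Below is a minimal sketch of how that pattern could be factored into reusable helpers. It is not part of this commit. The names `check_ordering`, `check_range`, and `word_overlap_stub` are hypothetical, and the stub scorer is only a deterministic stand-in so the helpers can be exercised without an API key; it is not how `openai.relevance` computes its score.

```python
from typing import Callable, List, Tuple

# Hypothetical alias: any callable mapping (prompt, response) to a score in [0, 1].
RelevanceFn = Callable[[str, str], float]


def check_ordering(
    relevance: RelevanceFn, prompt: str, responses: List[str]
) -> Tuple[bool, List[float]]:
    """Score responses listed from least to most helpful.

    Returns (True, scores) only if the scores are strictly increasing.
    """
    scores = [relevance(prompt, response) for response in responses]
    increasing = all(a < b for a, b in zip(scores, scores[1:]))
    return increasing, scores


def check_range(
    relevance: RelevanceFn,
    prompt: str,
    response: str,
    low: float = 0.0,
    high: float = 1.0,
) -> Tuple[bool, float]:
    """Score one response and report whether it lies in [low, high]."""
    score = relevance(prompt, response)
    return low <= score <= high, score


def word_overlap_stub(prompt: str, response: str) -> float:
    """Toy scorer (assumption, for offline runs only): the fraction of
    distinct prompt words that also appear in the response."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    if not prompt_words:
        return 0.0
    return len(prompt_words & response_words) / len(prompt_words)


if __name__ == "__main__":
    ok, scores = check_ordering(
        word_overlap_stub,
        "who won the superbowl in 2009",
        [
            "a brilliant catch",
            "the steelers won",
            "the steelers won the superbowl in 2009",
        ],
    )
    print(ok, scores)
```

In the notebook itself, the `relevance` callable defined in the import cell could be passed in place of the stub, for example `check_ordering(relevance, prompt, [low, medium, high])`, assuming it keeps the `(prompt, response) -> float` shape used in the cells above.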