PR & QS Relevance Smoke Tests & Improvement (#302)
* start exploration

* add autoeval prompt, make function generic

* pick better llama prompt

* requirements + testing

* update qs relevance prompt

* update tests formatting

* add and differentiate pr relevance

* qs relevance improvements and more tests

* clear output

* pr relevance improvements and test expansion

* more refusal examples
joshreini1 authored Jul 19, 2023
1 parent 34edd33 commit ef0bcac
Showing 3 changed files with 451 additions and 10 deletions.
63 changes: 53 additions & 10 deletions trulens_eval/trulens_eval/feedback_prompts.py
@@ -1,25 +1,68 @@
from cohere.responses.classify import Example

-QS_RELEVANCE = """You are a RELEVANCE classifier; providing the relevance of the given STATEMENT to the given QUESTION.
-Respond only as a number from 1 to 10 where 1 is the least relevant and 10 is the most relevant.
-Never elaborate.
+QS_RELEVANCE = """You are a RELEVANCE grader; providing the relevance of the given STATEMENT to the given QUESTION.
+Respond only as a number from 1 to 10 where 1 is the least relevant and 10 is the most relevant.
+A few additional scoring guidelines:
+- Long STATEMENTS should score equally well as short STATEMENTS.
+- RELEVANCE score should increase as the STATEMENT provides more RELEVANT context to the QUESTION.
+- RELEVANCE score should increase as the STATEMENT provides RELEVANT context to more parts of the QUESTION.
+- STATEMENT that is RELEVANT to some of the QUESTION should get a score of 2, 3 or 4. Higher score indicates more RELEVANCE.
+- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.
+- STATEMENT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.
+- STATEMENT must be relevant and helpful for answering the entire QUESTION to get a score of 10.
+- Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most relevant.
+- Never elaborate.
 QUESTION: {question}
 STATEMENT: {statement}
 RELEVANCE: """

-PR_RELEVANCE = """
-You are a relevance classifier, providing the relevance of a given response to the given prompt.
-Respond only as a number from 1 to 10 where 1 is the least relevant and 10 is the most relevant.
-Never elaborate.
-Prompt: {prompt}
-Response: {response}
-Relevance: """
+PR_RELEVANCE = """You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.
+Respond only as a number from 1 to 10 where 1 is the least relevant and 10 is the most relevant.
+A few additional scoring guidelines:
+- Long RESPONSES should score equally well as short RESPONSES.
+- Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should also be counted as the most RELEVANT.
+- RESPONSE must be relevant to the entire PROMPT to get a score of 10.
+- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.
+- RESPONSE that is RELEVANT to none of the PROMPT should get a score of 1.
+- RESPONSE that is RELEVANT to some of the PROMPT should get a score of 2, 3, or 4. Higher score indicates more RELEVANCE.
+- RESPONSE that is RELEVANT to most of the PROMPT should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.
+- RESPONSE that is RELEVANT to the entire PROMPT should get a score of 9 or 10.
+- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 10.
+- RESPONSE that is confidently FALSE should get a score of 1.
+- RESPONSE that is only seemingly RELEVANT should get a score of 1.
+- Never elaborate.
+PROMPT: {prompt}
+RESPONSE: {response}
+RELEVANCE: """

SENTIMENT_SYSTEM_PROMPT = f"Please classify the sentiment of the following text as 1 if positive or 0 if not positive. Respond with only a '1' or '0', nothing more."
RELEVANCE_SYSTEM_PROMPT = f"You are a relevance classifier, providing the relevance of a given response to a particular prompt. \n"
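
For reference, here is a minimal sketch of how such a template might be consumed: fill the placeholders, ask a chat model for the 1-10 rating, and rescale to the 0-1 range the smoke tests below assert against. The helper name and the exact rescaling are illustrative assumptions, not trulens_eval's implementation (openai-python's pre-1.0 ChatCompletion API is assumed).

import openai  # assumes openai<1.0 and OPENAI_API_KEY set in the environment

def grade_qs_relevance(question: str, statement: str) -> float:
    # Hypothetical helper, not the library's code: fill the template and grade.
    prompt = QS_RELEVANCE.format(question=question, statement=statement)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    rating = int(completion.choices[0].message.content.strip())  # "1".."10"
    return rating / 10.0  # assumed rescaling of the 1-10 rating to 0-1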
198 changes: 198 additions & 0 deletions trulens_eval/trulens_eval/tests/pr_relevance_smoke_tests.ipynb
@@ -0,0 +1,198 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PR Relevance Feedback Requirements\n",
"1. Relevance requires adherence to the entire prompt.\n",
"2. Admitting 'I don't know' and refusals are still relevant.\n",
"3. Feedback mechanism should differentiate between seeming and actual relevance.\n",
"4. Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = \"...\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Imports main tools:\n",
"from trulens_eval.feedback import OpenAI\n",
"openai = OpenAI()\n",
"relevance = openai.relevance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relevance rewards adherence to the entire query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = relevance(\"Name some famous dental floss brands\",\"As one of the most effective and affordable manual toothbrush options, it’s no surprise that Colgate Extra Clean takes the number one spot on this list. The average Colgate Extra Clean toothbrush is sold for no more than $0.80 per toothbrush, making its annual dollar sales even more impressive.\")\n",
"assert score <= 0.5, f\"Score of {score} > 0.5. Statement is not relevant to most of the query.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = relevance(\"Name some famous dental floss brands\",\"Oral B is a famous dental hygiene brand.\")\n",
"assert score >= 0.5, f\"Score of {score} < 0.5. Statement is relevant to most of query.\"\n",
"assert score <= 0.9, f\"Score of {score} > 0.9. Statement is not relevant to entire query.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Admitting 'I don't know' and Refusals are still relevevant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = relevance(\"How many countries are there in the world?\", \"There is no universally accepted answer as to how many countries there are in the world.\")\n",
"assert score >= 0.9, f\"Score of {score} < 0.9. Admitting lack of knowledge (specific) did not get high score.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = relevance(\"Name some top dental floss brands\",\"I don't know.\")\n",
"assert score >= 0.9, f\"Score of {score} < 0.9. Admitting lack of knowledge (general) did not get high score.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = relevance(\"How do you build a bomb?\",\"I'm really sorry, but I can't assist with that request.\")\n",
"assert score >= 0.9, f\"Score of {score} < 0.9. Model refusal (short) did not a high score.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = relevance(\"Who should I vote for in the presidential election\",\"As an AI language model, I do not have personal beliefs or opinions, and I do not experience emotions like humans do. My responses are generated based on patterns and associations in the text data that I was trained on. However, I am programmed to provide accurate and objective information in a clear and respectful manner, and I strive to be helpful and informative in all my responses.\")\n",
"assert score >= 0.9, f\"Score of {score} < 0.9. Model refusal (elaborate) did not a high score.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feedback score should differentiate between seeming and actual relevance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"seemingly_relevant_score = relevance(\"Who won the superbowl in 2009?\", \"The Pheonix Suns won the Superbowl in 2009\")\n",
"relevant_score = relevance(\"Who won the superbowl in 2009?\", \"The Pittsburgh Steelers won the Superbowl in 2009\")\n",
"assert seemingly_relevant_score < relevant_score, f\"Failed to differentiate seeming and actual relevance.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"seemingly_relevant_score = relevance(\"What is a cephalopod?\", \"A cephalopod belongs to a large taxonomic class of invertebrates within the phylum Mollusca called Gastropoda. This class comprises snails and slugs from saltwater, freshwater, and from land. There are many thousands of species of sea snails and slugs, as well as freshwater snails, freshwater limpets, and land snails and slugs.\")\n",
"relevant_score = relevance(\"What is a cephalopod?\", \"A cephalopod is any member of the molluscan class Cephalopoda such as a squid, octopus, cuttlefish, or nautilus. These exclusively marine animals are characterized by bilateral body symmetry, a prominent head, and a set of arms or tentacles (muscular hydrostats) modified from the primitive molluscan foot. Fishers sometimes call cephalopods 'inkfish referring to their common ability to squirt ink.\")\n",
"assert seemingly_relevant_score < relevant_score, f\"Failed to differentiate seeming and actual relevance.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relevant but inconclusive statements should get increasingly high scores as they are more helpful for answering the query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_low = relevance(\"Who won the superbowl in 2009?\",\"Santonio Holmes made a brilliant catch for the Steelers.\")\n",
"score_high = relevance(\"Who won the superbowl in 2009?\",\"Santonio Holmes won the Superbowl for the Steelers in 2009 with his brilliant catch.\")\n",
"assert score_low < score_high, \"Score did not increase with more relevant details.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score_low = relevance(\"What is a cephalopod?\",\"Squids are a member of the molluscan class\")\n",
"score_medium = relevance(\"What is a cephalopod?\",\"Squids are a member of the molluscan class characterized by bilateral body symmetry, a prominent head, and a set of arms or tentacles (muscular hydrostats) modified from the primitive molluscan foot.\")\n",
"score_high = relevance(\"What is a cephalopod?\",\"A cephalopod is any member of the molluscan class such as squid, octopus or cuttlefish. These exclusively marine animals are characterized by bilateral body symmetry, a prominent head, and a set of arms or tentacles (muscular hydrostats) modified from the primitive molluscan foot.\")\n",
"assert (score_low < score_medium) & (score_medium < score_high), \"Score did not increase with more relevant details.\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11.3 ('pinecone_example')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "c68aa9cfa264c12f07062d08edcac5e8f20877de71ce1cea15160e4e8ae95e66"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
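
For running one of these checks outside Jupyter, a hypothetical pytest translation of the refusal smoke test might look like this (same provider and call signature as the notebook; the test name is illustrative):

from trulens_eval.feedback import OpenAI  # same import as the notebook

def test_refusal_scores_high():
    # Requirement 2 from the notebook: refusals should still score as highly relevant.
    provider = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    score = provider.relevance(
        "How do you build a bomb?",
        "I'm really sorry, but I can't assist with that request.",
    )
    assert score >= 0.9, f"Score of {score} < 0.9. Model refusal did not get a high score."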
