6 changes: 3 additions & 3 deletions CLAUDE.md
@@ -135,7 +135,7 @@ The monorepo has the following structure:
│ ├── pyproject.toml # ragas-specific build config
├── experimental/ # nbdev-based experimental project
│ ├── nbs/ # Notebooks for nbdev
│ ├── ragas_experimental/ # Generated code
│ ├── pyproject.toml # experimental-specific config
│ ├── settings.ini # nbdev config
@@ -154,8 +154,8 @@ The Ragas core library provides metrics, test data generation and evaluation fun
1. **Metrics** - Various metrics for evaluating LLM applications including:
- AspectCritic
- AnswerCorrectness
- ContextPrecision
- ContextRecall
- LLMContextPrecisionWithReference
- LLMContextRecall
- Faithfulness
- and many more
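For reference, a minimal sketch of how the renamed metrics are imported and instantiated (the `evaluator_llm` wrapper is assumed to be configured as elsewhere in this PR):

```python
from ragas.metrics import (
    Faithfulness,
    LLMContextPrecisionWithReference,
    LLMContextRecall,
)

# LLM-based metrics take the evaluator LLM wrapper at construction time.
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
context_recall = LLMContextRecall(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
```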

6 changes: 3 additions & 3 deletions docs/howtos/applications/vertexai_model_comparision.md
@@ -184,7 +184,7 @@ Select and define the metrics that are most relevant to your application.

```python
from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness, RubricsScore, RougeScore
from ragas.metrics import LLMContextPrecisionWithReference, Faithfulness, RubricsScore, RougeScore

rouge_score = RougeScore()

@@ -197,7 +197,7 @@ helpfulness_rubrics = {
}

rubrics_score = RubricsScore(name="helpfulness", rubrics=helpfulness_rubrics)
context_precision = ContextPrecision(llm=evaluator_llm)
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
```
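A minimal sketch of how these metric instances might then be passed to `evaluate` (the `eval_dataset` name is an assumption standing in for whatever `EvaluationDataset` the tutorial builds earlier):

```python
from ragas import evaluate

# `eval_dataset` is a placeholder for the EvaluationDataset built from your
# queries, retrieved contexts, responses, and references.
result = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, faithfulness, rubrics_score, rouge_score],
)
print(result)
```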

@@ -662,4 +662,4 @@ plot_bar_plot(eval_results)
Check out the other tutorials in this series:

- [Ragas with Vertex AI](./vertexai_x_ragas.md): Learn how to use Vertex AI models with Ragas to evaluate your LLM workflows.
- [Align LLM Metrics](./vertexai_alignment.md): Train and align your LLM evaluators to better match human judgment.
6 changes: 3 additions & 3 deletions docs/howtos/applications/vertexai_x_ragas.md
@@ -140,9 +140,9 @@ Model-based metrics leverage pre-trained language models to assess generated tex

```python
from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness
from ragas.metrics import LLMContextPrecisionWithReference, Faithfulness

context_precision = ContextPrecision(llm=evaluator_llm)
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
```

@@ -357,4 +357,4 @@ Output
Check out the other tutorials in this series:

- [Align LLM Metrics](./vertexai_alignment.md): Train and align your LLM evaluators to better match human judgment.
- [Model Comparison](./vertexai_model_comparision.md): Compare models provided by VertexAI on RAG-based Q&A task using Ragas metrics.
@@ -2,7 +2,7 @@ While evaluating your LLM application with Ragas metrics, you may find yourself

It assumes that you are already familiar with the concepts of [Metrics](/concepts/metrics/overview/index.md) and [Prompt Objects](/concepts/components/prompt.md) in Ragas. If not, please review those topics before proceeding.

For the sake of this tutorial, let's build a custom metric that scores the refusal rate in applications.


## Formulate your metric
@@ -15,12 +15,12 @@ $$

**Step 2**: Decide how you are going to derive this information from the sample. Here I am going to use an LLM to do it, i.e. to check whether the request was refused or answered. You may use non-LLM-based methods too. Since I am using an LLM-based method, this will be an LLM-based metric.

**Step 3**: Decide whether your metric should work on single-turn data, multi-turn data, or both.
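To make Step 1 concrete, the formulation generally takes this shape (a sketch of the standard refusal-rate definition; the exact formula used in the doc sits above this hunk):

$$
\text{Refusal rate} = \frac{\text{Number of refused requests}}{\text{Total number of requests}}
$$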


## Import required base classes

For refusal rate, I have decided to make it an LLM-based metric that should work with both single-turn and multi-turn data samples.


```python
@@ -69,7 +69,7 @@ class RefusalPrompt(PydanticPrompt[RefusalInput, RefusalOutput]):
]
```

Now let's implement the new metric. Here, since I want this metric to work with both `SingleTurnSample` and `MultiTurnSample`, I am implementing scoring methods for both types. For the sake of simplicity, I am also using a simple method to calculate the refusal rate in multi-turn conversations, as sketched below.
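The multi-turn variant boils down to counting refusals per user request. A generic sketch of that bookkeeping — the message representation and helper names here are illustrative, not the class's actual code:

```python
# Illustrative only: walk the conversation turn by turn, classify each
# assistant reply that follows a user request, and report the fraction
# of replies classified as refusals.
async def refusal_rate_for_conversation(turns, classify_refusal):
    # `turns` is a list of (role, text) pairs; `classify_refusal` is any
    # async callable (for example an LLM prompt) returning True for a refusal.
    refusals = requests = 0
    for (prev_role, prev_text), (role, text) in zip(turns, turns[1:]):
        if prev_role == "user" and role == "assistant":
            requests += 1
            if await classify_refusal(prev_text, text):
                refusals += 1
    return refusals / requests if requests else 0.0
```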


@@ -91,9 +91,6 @@ class RefusalRate(MetricWithLLM, MultiTurnMetric, SingleTurnMetric):
)
refusal_prompt: PydanticPrompt = RefusalPrompt()

async def _ascore(self, row):
pass

async def _single_turn_ascore(self, sample, callbacks):
prompt_input = RefusalInput(
user_input=sample.user_input, response=sample.response
@@ -212,5 +209,3 @@ await scorer.multi_turn_ascore(sample)


0


@@ -141,9 +141,6 @@
" )\n",
" refusal_prompt: PydanticPrompt = RefusalPrompt()\n",
"\n",
" async def _ascore(self, row):\n",
" pass\n",
"\n",
" async def _single_turn_ascore(self, sample, callbacks):\n",
" prompt_input = RefusalInput(\n",
" user_input=sample.user_input, response=sample.response\n",
14 changes: 6 additions & 8 deletions docs/howtos/integrations/_haystack.md
@@ -1,6 +1,6 @@
# Haystack Integration

Haystack is an LLM orchestration framework to build customizable, production-ready LLM applications.

The underlying concept of Haystack is that all individual tasks, such as storing documents, retrieving relevant data, and generating responses, are handled by modular components like Document Stores, Retrievers, and Generators, which are seamlessly connected and orchestrated using Pipelines.

@@ -125,7 +125,7 @@ Pass all the Ragas metrics you want to use for evaluation, ensuring that all the
For example:

- **AnswerRelevancy**: requires both the **query** and the **response**.
- **ContextPrecision**: requires the **query**, **retrieved documents**, and the **reference**.
- **LLMContextPrecisionWithReference**: requires the **query**, **retrieved documents**, and the **reference**.
- **Faithfulness**: requires the **query**, **retrieved documents**, and the **response**.

Make sure to include all relevant data for each metric to ensure accurate evaluation.
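Conceptually, those requirements map onto the fields of a Ragas evaluation sample. A sketch with placeholder values (the Haystack integration fills these from the pipeline's query, retrieved documents, response, and reference):

```python
from ragas.dataset_schema import SingleTurnSample

# Placeholder values, shown only to illustrate which field each metric reads.
sample = SingleTurnSample(
    user_input="What is the capital of France?",              # query
    retrieved_contexts=["Paris is the capital of France."],   # retrieved documents
    response="The capital of France is Paris.",               # generated response
    reference="Paris is the capital of France.",              # ground-truth reference
)
```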
@@ -136,13 +136,13 @@ from haystack_integrations.components.evaluators.ragas import RagasEvaluator

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness
from ragas.metrics import AnswerRelevancy, LLMContextPrecisionWithReference, Faithfulness

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_evaluator = RagasEvaluator(
ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
ragas_metrics=[AnswerRelevancy(), LLMContextPrecisionWithReference(), Faithfulness()],
evaluator_llm=evaluator_llm,
)
```
@@ -236,8 +236,8 @@ print(result["ragas_evaluator"]["result"])
Evaluating: 100%|██████████| 3/3 [00:14<00:00, 4.72s/it]


Meta AI's LLaMA models stand out due to their open-source nature, which allows researchers and developers easy access to high-quality language models without the need for expensive resources. This accessibility fosters innovation and experimentation, enabling collaboration across various industries. Moreover, the strong performance of the LLaMA models further enhances their appeal, making them valuable tools for advancing AI development.

{'answer_relevancy': 0.9782, 'context_precision': 1.0000, 'faithfulness': 1.0000}


@@ -296,5 +296,3 @@ output["result"]


{'sports_relevance_metric': 1.0000, 'domain_specific_rubrics': 3.0000}


20 changes: 8 additions & 12 deletions docs/howtos/integrations/_langfuse.md
@@ -153,19 +153,19 @@ print("answer: ", row["answer"])

question: What are the global implications of the USA Supreme Court ruling on abortion?
answer: The global implications of the USA Supreme Court ruling on abortion can be significant, as it sets a precedent for other countries and influences the global discourse on reproductive rights. Here are some potential implications:

1. Influence on other countries: The Supreme Court's ruling can serve as a reference point for other countries grappling with their own abortion laws. It can provide legal arguments and reasoning that advocates for reproductive rights can use to challenge restrictive abortion laws in their respective jurisdictions.

2. Strengthening of global reproductive rights movements: A favorable ruling by the Supreme Court can energize and empower reproductive rights movements worldwide. It can serve as a rallying point for activists and organizations advocating for women's rights, leading to increased mobilization and advocacy efforts globally.

3. Counteracting anti-abortion movements: Conversely, a ruling that restricts abortion rights can embolden anti-abortion movements globally. It can provide legitimacy to their arguments and encourage similar restrictive measures in other countries, potentially leading to a rollback of existing reproductive rights.

4. Impact on international aid and policies: The Supreme Court's ruling can influence international aid and policies related to reproductive health. It can shape the priorities and funding decisions of donor countries and organizations, potentially leading to increased support for reproductive rights initiatives or conversely, restrictions on funding for abortion-related services.

5. Shaping international human rights standards: The ruling can contribute to the development of international human rights standards regarding reproductive rights. It can influence the interpretation and application of existing human rights treaties and conventions, potentially strengthening the recognition of reproductive rights as fundamental human rights globally.

6. Global health implications: The Supreme Court's ruling can have implications for global health outcomes, particularly in countries with restrictive abortion laws. It can impact the availability and accessibility of safe and legal abortion services, potentially leading to an increase in unsafe abortions and related health complications.

It is important to note that the specific implications will depend on the nature of the Supreme Court ruling and the subsequent actions taken by governments, activists, and organizations both within and outside the United States.


@@ -186,7 +186,7 @@ async def score_with_ragas(query, chunks, answer):
scores = {}
for m in metrics:
print(f"calculating {m.name}")
scores[m.name] = await m.ascore(
scores[m.name] = await m.single_turn_ascore(
row={"question": query, "contexts": chunks, "answer": answer}
)
return scores
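For comparison, the custom-metric guide elsewhere in this PR calls the same API by passing a sample object positionally. A sketch of that convention (adapt the field names to your data):

```python
from ragas.dataset_schema import SingleTurnSample

# Sketch only: `query`, `chunks`, `answer`, and `m` refer to the names used in
# score_with_ragas above; wrap the raw values in a sample before scoring.
sample = SingleTurnSample(
    user_input=query,
    retrieved_contexts=chunks,
    response=answer,
)
scores[m.name] = await m.single_turn_ascore(sample)
```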
@@ -320,7 +320,3 @@ Note that the scoring is blocking so make sure that you sent the generated answe
## Feedback

If you have any feedback or requests, please create a GitHub [Issue](https://langfuse.com/issue) or share your work with the community on [Discord](https://discord.langfuse.com/).




16 changes: 7 additions & 9 deletions docs/howtos/integrations/_llamaindex.md
@@ -128,7 +128,7 @@ with a test dataset to test our `QueryEngine` lets now build one and evaluate it

## Building the `QueryEngine`

To start, let's build a `VectorStoreIndex` over the New York City [wikipedia page](https://en.wikipedia.org/wiki/New_York_City) as an example and use Ragas to evaluate it.

Since we already loaded the dataset into `documents`, let's use that.

@@ -170,13 +170,13 @@ print(response_vector)

## Evaluating the `QueryEngine`

Now that we have a `QueryEngine` for the `VectorStoreIndex`, we can use the llama_index integration Ragas provides to evaluate it.

In order to run an evaluation with Ragas and LlamaIndex, you need 3 things:

1. LlamaIndex `QueryEngine`: what we will be evaluating
2. Metrics: Ragas defines a set of metrics that can measure different aspects of the `QueryEngine`. The available metrics and their meaning can be found [here](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/)
3. Questions: A list of questions that Ragas will test the `QueryEngine` against.

First, let's generate the questions. Ideally, you should use questions that you see in production, so that the distribution of questions we evaluate with matches the distribution seen in production. This ensures that the scores reflect the performance seen in production, but to start off we'll use a few example questions.

@@ -188,8 +188,8 @@ Now lets import the metrics we will be using to evaluate
from ragas.metrics import (
Faithfulness,
AnswerRelevancy,
ContextPrecision,
ContextRecall,
LLMContextPrecisionWithReference,
LLMContextRecall,
)

# init metrics with evaluator LLM
@@ -199,8 +199,8 @@ evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
Faithfulness(llm=evaluator_llm),
AnswerRelevancy(llm=evaluator_llm),
ContextPrecision(llm=evaluator_llm),
ContextRecall(llm=evaluator_llm),
LLMContextPrecisionWithReference(llm=evaluator_llm),
LLMContextRecall(llm=evaluator_llm),
]
```
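A sketch of how these metrics could then be handed to the LlamaIndex integration — the helper name and parameters below are assumptions, so check the integration docs for the exact signature:

```python
# Assumed API: an evaluate helper in ragas.integrations.llama_index wiring the
# query engine, the metrics above, and an evaluation dataset together.
from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,   # the QueryEngine built earlier
    metrics=metrics,
    dataset=eval_dataset,        # placeholder for the evaluation dataset
)
result.to_pandas()
```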

@@ -357,5 +357,3 @@ result.to_pandas()
</tbody>
</table>
</div>


12 changes: 6 additions & 6 deletions docs/howtos/integrations/griptape.md
@@ -203,15 +203,15 @@ Now, let's evaluate our RAG system using Ragas metrics:

To evaluate our retrieval performance, we can utilize Ragas' built-in metrics or create custom metrics tailored to our specific needs. For a comprehensive list of all available metrics and customization options, please visit the [documentation]().

We will use `ContextPrecision`, `ContextRecall` and `ContextRelevance` to measure the retrieval performance:
We will use `LLMContextPrecisionWithReference`, `LLMContextRecall` and `ContextRelevance` to measure the retrieval performance:

- [ContextPrecision](../../concepts/metrics/available_metrics/context_precision.md): Measures how well a RAG system's retriever ranks relevant chunks at the top of the retrieved context for a given query, calculated as the mean precision@k across all chunks.
- [ContextRecall](../../concepts/metrics/available_metrics/context_recall.md): Measures the proportion of relevant information successfully retrieved from a knowledge base.
- [LLMContextPrecisionWithReference](../../concepts/metrics/available_metrics/context_precision.md): Measures how well a RAG system's retriever ranks relevant chunks at the top of the retrieved context for a given query, calculated as the mean precision@k across all chunks.
- [LLMContextRecall](../../concepts/metrics/available_metrics/context_recall.md): Measures the proportion of relevant information successfully retrieved from a knowledge base.
- [ContextRelevance](../../concepts/metrics/available_metrics/nvidia_metrics.md#context-relevance): Measures how well the retrieved contexts address the user’s query by evaluating their pertinence through dual LLM judgments.


```python
from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas.metrics import LLMContextPrecisionWithReference, LLMContextRecall, ContextRelevance
from ragas import evaluate
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
@@ -220,8 +220,8 @@ llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
ContextPrecision(llm=evaluator_llm),
ContextRecall(llm=evaluator_llm),
LLMContextPrecisionWithReference(llm=evaluator_llm),
LLMContextRecall(llm=evaluator_llm),
ContextRelevance(llm=evaluator_llm),
]

6 changes: 3 additions & 3 deletions docs/howtos/integrations/haystack.ipynb
@@ -226,7 +226,7 @@
"For example:\n",
"\n",
"- **AnswerRelevancy**: requires both the **query** and the **response**.\n",
"- **ContextPrecision**: requires the **query**, **retrieved documents**, and the **reference**.\n",
"- **LLMContextPrecisionWithReference**: requires the **query**, **retrieved documents**, and the **reference**.\n",
"- **Faithfulness**: requires the **query**, **retrieved documents**, and the **response**.\n",
"\n",
"Make sure to include all relevant data for each metric to ensure accurate evaluation."
@@ -242,13 +242,13 @@
"\n",
"from langchain_openai import ChatOpenAI\n",
"from ragas.llms import LangchainLLMWrapper\n",
"from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness\n",
"from ragas.metrics import AnswerRelevancy, LLMContextPrecisionWithReference, Faithfulness\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")\n",
"evaluator_llm = LangchainLLMWrapper(llm)\n",
"\n",
"ragas_evaluator = RagasEvaluator(\n",
" ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],\n",
" ragas_metrics=[AnswerRelevancy(), LLMContextPrecisionWithReference(), Faithfulness()],\n",
" evaluator_llm=evaluator_llm,\n",
")"
]