6 changes: 3 additions & 3 deletions CLAUDE.md
@@ -135,7 +135,7 @@ The monorepo has the following structure:
│ ├── pyproject.toml # ragas-specific build config
├── experimental/ # nbdev-based experimental project
│ ├── nbs/ # Notebooks for nbdev
│ ├── ragas_experimental/ # Generated code
│ ├── pyproject.toml # experimental-specific config
│ ├── settings.ini # nbdev config
@@ -154,8 +154,8 @@ The Ragas core library provides metrics, test data generation and evaluation fun
1. **Metrics** - Various metrics for evaluating LLM applications including:
- AspectCritic
- AnswerCorrectness
- ContextPrecision
- ContextRecall
- LLMContextPrecisionWithReference
- LLMContextRecall
- Faithfulness
- and many more
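For reference, a minimal sketch of how the renamed metrics are imported and instantiated (the `evaluator_llm` wrapper is assumed to be configured as elsewhere in this PR):

```python
from ragas.metrics import (
    Faithfulness,
    LLMContextPrecisionWithReference,
    LLMContextRecall,
)

# LLM-based metrics take the evaluator LLM wrapper at construction time.
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
context_recall = LLMContextRecall(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
```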

6 changes: 3 additions & 3 deletions docs/howtos/applications/vertexai_model_comparision.md
@@ -184,7 +184,7 @@ Select and define the metrics that are most relevant to your application.

```python
from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness, RubricsScore, RougeScore
from ragas.metrics import LLMContextPrecisionWithReference, Faithfulness, RubricsScore, RougeScore

rouge_score = RougeScore()

@@ -197,7 +197,7 @@ helpfulness_rubrics = {
}

rubrics_score = RubricsScore(name="helpfulness", rubrics=helpfulness_rubrics)
context_precision = ContextPrecision(llm=evaluator_llm)
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
```
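A minimal sketch of how these metric instances might then be passed to `evaluate` (the `eval_dataset` name is an assumption standing in for whatever `EvaluationDataset` the tutorial builds earlier):

```python
from ragas import evaluate

# `eval_dataset` is a placeholder for the EvaluationDataset built from your
# queries, retrieved contexts, responses, and references.
result = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, faithfulness, rubrics_score, rouge_score],
)
print(result)
```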

@@ -662,4 +662,4 @@ plot_bar_plot(eval_results)
Check out the other tutorials in this series:

- [Ragas with Vertex AI](./vertexai_x_ragas.md): Learn how to use Vertex AI models with Ragas to evaluate your LLM workflows.
- [Align LLM Metrics](./vertexai_alignment.md): Train and align your LLM evaluators to better match human judgment.
6 changes: 3 additions & 3 deletions docs/howtos/applications/vertexai_x_ragas.md
@@ -140,9 +140,9 @@ Model-based metrics leverage pre-trained language models to assess generated tex

```python
from ragas import evaluate
from ragas.metrics import ContextPrecision, Faithfulness
from ragas.metrics import LLMContextPrecisionWithReference, Faithfulness

context_precision = ContextPrecision(llm=evaluator_llm)
context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
faithfulness = Faithfulness(llm=evaluator_llm)
```

@@ -357,4 +357,4 @@ Output
Check out the other tutorials in this series:

- [Align LLM Metrics](./vertexai_alignment.md): Train and align your LLM evaluators to better match human judgment.
- [Model Comparison](./vertexai_model_comparision.md): Compare models provided by VertexAI on RAG-based Q&A task using Ragas metrics.
@@ -2,7 +2,7 @@ While evaluating your LLM application with Ragas metrics, you may find yourself

It assumes that you are already familiar with the concepts of [Metrics](/concepts/metrics/overview/index.md) and [Prompt Objects](/concepts/components/prompt.md) in Ragas. If not, please review those topics before proceeding.

For the sake of this tutorial, let's build a custom metric that scores the refusal rate in applications.


## Formulate your metric
@@ -15,12 +15,12 @@ $$

**Step 2**: Decide how you are going to derive this information from the sample. Here I am going to use an LLM to do it, i.e. to check whether the request was refused or answered. You may use non-LLM-based methods too. Since I am using an LLM-based method, this will be an LLM-based metric.

**Step 3**: Decide whether your metric should work on single-turn data, multi-turn data, or both.
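To make Step 1 concrete, the formulation generally takes this shape (a sketch of the standard refusal-rate definition; the exact formula used in the doc sits above this hunk):

$$
\text{Refusal rate} = \frac{\text{Number of refused requests}}{\text{Total number of requests}}
$$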


## Import required base classes

For refusal rate, I have decided to make it an LLM-based metric that should work with both single-turn and multi-turn data samples.


```python
@@ -69,7 +69,7 @@ class RefusalPrompt(PydanticPrompt[RefusalInput, RefusalOutput]):
]
```

Now let's implement the new metric. Here, since I want this metric to work with both `SingleTurnSample` and `MultiTurnSample`, I am implementing scoring methods for both types. For the sake of simplicity, I am also using a simple method to calculate the refusal rate in multi-turn conversations, as sketched below.
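The multi-turn variant boils down to counting refusals per user request. A generic sketch of that bookkeeping — the message representation and helper names here are illustrative, not the class's actual code:

```python
# Illustrative only: walk the conversation turn by turn, classify each
# assistant reply that follows a user request, and report the fraction
# of replies classified as refusals.
async def refusal_rate_for_conversation(turns, classify_refusal):
    # `turns` is a list of (role, text) pairs; `classify_refusal` is any
    # async callable (for example an LLM prompt) returning True for a refusal.
    refusals = requests = 0
    for (prev_role, prev_text), (role, text) in zip(turns, turns[1:]):
        if prev_role == "user" and role == "assistant":
            requests += 1
            if await classify_refusal(prev_text, text):
                refusals += 1
    return refusals / requests if requests else 0.0
```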


@@ -91,9 +91,6 @@ class RefusalRate(MetricWithLLM, MultiTurnMetric, SingleTurnMetric):
)
refusal_prompt: PydanticPrompt = RefusalPrompt()

async def _ascore(self, row):
pass

async def _single_turn_ascore(self, sample, callbacks):
prompt_input = RefusalInput(
user_input=sample.user_input, response=sample.response
@@ -212,5 +209,3 @@ await scorer.multi_turn_ascore(sample)


0


@@ -141,9 +141,6 @@
" )\n",
" refusal_prompt: PydanticPrompt = RefusalPrompt()\n",
"\n",
" async def _ascore(self, row):\n",
" pass\n",
"\n",
" async def _single_turn_ascore(self, sample, callbacks):\n",
" prompt_input = RefusalInput(\n",
" user_input=sample.user_input, response=sample.response\n",
14 changes: 6 additions & 8 deletions docs/howtos/integrations/_haystack.md
@@ -1,6 +1,6 @@
# Haystack Integration

Haystack is an LLM orchestration framework to build customizable, production-ready LLM applications.

The underlying concept of Haystack is that all individual tasks, such as storing documents, retrieving relevant data, and generating responses, are handled by modular components like Document Stores, Retrievers, and Generators, which are seamlessly connected and orchestrated using Pipelines.

@@ -125,7 +125,7 @@ Pass all the Ragas metrics you want to use for evaluation, ensuring that all the
For example:

- **AnswerRelevancy**: requires both the **query** and the **response**.
- **ContextPrecision**: requires the **query**, **retrieved documents**, and the **reference**.
- **LLMContextPrecisionWithReference**: requires the **query**, **retrieved documents**, and the **reference**.
- **Faithfulness**: requires the **query**, **retrieved documents**, and the **response**.

Make sure to include all relevant data for each metric to ensure accurate evaluation.
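Conceptually, those requirements map onto the fields of a Ragas evaluation sample. A sketch with placeholder values (the Haystack integration fills these from the pipeline's query, retrieved documents, response, and reference):

```python
from ragas.dataset_schema import SingleTurnSample

# Placeholder values, shown only to illustrate which field each metric reads.
sample = SingleTurnSample(
    user_input="What is the capital of France?",              # query
    retrieved_contexts=["Paris is the capital of France."],   # retrieved documents
    response="The capital of France is Paris.",               # generated response
    reference="Paris is the capital of France.",              # ground-truth reference
)
```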
@@ -136,13 +136,13 @@ from haystack_integrations.components.evaluators.ragas import RagasEvaluator

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness
from ragas.metrics import AnswerRelevancy, LLMContextPrecisionWithReference, Faithfulness

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_evaluator = RagasEvaluator(
ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],
ragas_metrics=[AnswerRelevancy(), LLMContextPrecisionWithReference(), Faithfulness()],
evaluator_llm=evaluator_llm,
)
```
@@ -236,8 +236,8 @@ print(result["ragas_evaluator"]["result"])
Evaluating: 100%|██████████| 3/3 [00:14<00:00, 4.72s/it]


Meta AI's LLaMA models stand out due to their open-source nature, which allows researchers and developers easy access to high-quality language models without the need for expensive resources. This accessibility fosters innovation and experimentation, enabling collaboration across various industries. Moreover, the strong performance of the LLaMA models further enhances their appeal, making them valuable tools for advancing AI development.

{'answer_relevancy': 0.9782, 'context_precision': 1.0000, 'faithfulness': 1.0000}


@@ -296,5 +296,3 @@ output["result"]


{'sports_relevance_metric': 1.0000, 'domain_specific_rubrics': 3.0000}


20 changes: 8 additions & 12 deletions docs/howtos/integrations/_langfuse.md
@@ -153,19 +153,19 @@ print("answer: ", row["answer"])

question: What are the global implications of the USA Supreme Court ruling on abortion?
answer: The global implications of the USA Supreme Court ruling on abortion can be significant, as it sets a precedent for other countries and influences the global discourse on reproductive rights. Here are some potential implications:

1. Influence on other countries: The Supreme Court's ruling can serve as a reference point for other countries grappling with their own abortion laws. It can provide legal arguments and reasoning that advocates for reproductive rights can use to challenge restrictive abortion laws in their respective jurisdictions.

2. Strengthening of global reproductive rights movements: A favorable ruling by the Supreme Court can energize and empower reproductive rights movements worldwide. It can serve as a rallying point for activists and organizations advocating for women's rights, leading to increased mobilization and advocacy efforts globally.

3. Counteracting anti-abortion movements: Conversely, a ruling that restricts abortion rights can embolden anti-abortion movements globally. It can provide legitimacy to their arguments and encourage similar restrictive measures in other countries, potentially leading to a rollback of existing reproductive rights.

4. Impact on international aid and policies: The Supreme Court's ruling can influence international aid and policies related to reproductive health. It can shape the priorities and funding decisions of donor countries and organizations, potentially leading to increased support for reproductive rights initiatives or conversely, restrictions on funding for abortion-related services.

5. Shaping international human rights standards: The ruling can contribute to the development of international human rights standards regarding reproductive rights. It can influence the interpretation and application of existing human rights treaties and conventions, potentially strengthening the recognition of reproductive rights as fundamental human rights globally.

6. Global health implications: The Supreme Court's ruling can have implications for global health outcomes, particularly in countries with restrictive abortion laws. It can impact the availability and accessibility of safe and legal abortion services, potentially leading to an increase in unsafe abortions and related health complications.

It is important to note that the specific implications will depend on the nature of the Supreme Court ruling and the subsequent actions taken by governments, activists, and organizations both within and outside the United States.


@@ -186,7 +186,7 @@ async def score_with_ragas(query, chunks, answer):
scores = {}
for m in metrics:
print(f"calculating {m.name}")
scores[m.name] = await m.ascore(
scores[m.name] = await m.single_turn_ascore(
row={"question": query, "contexts": chunks, "answer": answer}
)
return scores
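For comparison, the custom-metric guide elsewhere in this PR calls the same API by passing a sample object positionally. A sketch of that convention (adapt the field names to your data):

```python
from ragas.dataset_schema import SingleTurnSample

# Sketch only: `query`, `chunks`, `answer`, and `m` refer to the names used in
# score_with_ragas above; wrap the raw values in a sample before scoring.
sample = SingleTurnSample(
    user_input=query,
    retrieved_contexts=chunks,
    response=answer,
)
scores[m.name] = await m.single_turn_ascore(sample)
```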
@@ -320,7 +320,3 @@ Note that the scoring is blocking so make sure that you sent the generated answe
## Feedback

If you have any feedback or requests, please create a GitHub [Issue](https://langfuse.com/issue) or share your work with the community on [Discord](https://discord.langfuse.com/).




16 changes: 7 additions & 9 deletions docs/howtos/integrations/_llamaindex.md
@@ -128,7 +128,7 @@ with a test dataset to test our `QueryEngine` lets now build one and evaluate it

## Building the `QueryEngine`

To start, let's build a `VectorStoreIndex` over the New York City [wikipedia page](https://en.wikipedia.org/wiki/New_York_City) as an example and use Ragas to evaluate it.

Since we already loaded the dataset into `documents`, let's use that.

@@ -170,13 +170,13 @@ print(response_vector)

## Evaluating the `QueryEngine`

Now that we have a `QueryEngine` for the `VectorStoreIndex`, we can use the llama_index integration Ragas provides to evaluate it.

In order to run an evaluation with Ragas and LlamaIndex, you need 3 things:

1. LlamaIndex `QueryEngine`: what we will be evaluating
2. Metrics: Ragas defines a set of metrics that can measure different aspects of the `QueryEngine`. The available metrics and their meaning can be found [here](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/)
3. Questions: A list of questions that Ragas will test the `QueryEngine` against.

First, let's generate the questions. Ideally, you should use questions that you see in production, so that the distribution of questions we evaluate with matches the distribution seen in production. This ensures that the scores reflect the performance seen in production, but to start off we'll use a few example questions.

@@ -188,8 +188,8 @@ Now lets import the metrics we will be using to evaluate
from ragas.metrics import (
Faithfulness,
AnswerRelevancy,
ContextPrecision,
ContextRecall,
LLMContextPrecisionWithReference,
LLMContextRecall,
)

# init metrics with evaluator LLM
@@ -199,8 +199,8 @@ evaluator_llm = LlamaIndexLLMWrapper(OpenAI(model="gpt-4o"))
metrics = [
Faithfulness(llm=evaluator_llm),
AnswerRelevancy(llm=evaluator_llm),
ContextPrecision(llm=evaluator_llm),
ContextRecall(llm=evaluator_llm),
LLMContextPrecisionWithReference(llm=evaluator_llm),
LLMContextRecall(llm=evaluator_llm),
]
```
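A sketch of how these metrics could then be handed to the LlamaIndex integration — the helper name and parameters below are assumptions, so check the integration docs for the exact signature:

```python
# Assumed API: an evaluate helper in ragas.integrations.llama_index wiring the
# query engine, the metrics above, and an evaluation dataset together.
from ragas.integrations.llama_index import evaluate

result = evaluate(
    query_engine=query_engine,   # the QueryEngine built earlier
    metrics=metrics,
    dataset=eval_dataset,        # placeholder for the evaluation dataset
)
result.to_pandas()
```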

@@ -357,5 +357,3 @@ result.to_pandas()
</tbody>
</table>
</div>


12 changes: 6 additions & 6 deletions docs/howtos/integrations/griptape.md
@@ -203,15 +203,15 @@ Now, let's evaluate our RAG system using Ragas metrics:

To evaluate our retrieval performance, we can utilize Ragas' built-in metrics or create custom metrics tailored to our specific needs. For a comprehensive list of all available metrics and customization options, please visit the [documentation]().

We will use `ContextPrecision`, `ContextRecall` and `ContextRelevance` to measure the retrieval performance:
We will use `LLMContextPrecisionWithReference`, `LLMContextRecall` and `ContextRelevance` to measure the retrieval performance:

- [ContextPrecision](../../concepts/metrics/available_metrics/context_precision.md): Measures how well a RAG system's retriever ranks relevant chunks at the top of the retrieved context for a given query, calculated as the mean precision@k across all chunks.
- [ContextRecall](../../concepts/metrics/available_metrics/context_recall.md): Measures the proportion of relevant information successfully retrieved from a knowledge base.
- [LLMContextPrecisionWithReference](../../concepts/metrics/available_metrics/context_precision.md): Measures how well a RAG system's retriever ranks relevant chunks at the top of the retrieved context for a given query, calculated as the mean precision@k across all chunks.
- [LLMContextRecall](../../concepts/metrics/available_metrics/context_recall.md): Measures the proportion of relevant information successfully retrieved from a knowledge base.
- [ContextRelevance](../../concepts/metrics/available_metrics/nvidia_metrics.md#context-relevance): Measures how well the retrieved contexts address the user’s query by evaluating their pertinence through dual LLM judgments.


```python
from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas.metrics import LLMContextPrecisionWithReference, LLMContextRecall, ContextRelevance
from ragas import evaluate
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
@@ -220,8 +220,8 @@ llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
ContextPrecision(llm=evaluator_llm),
ContextRecall(llm=evaluator_llm),
LLMContextPrecisionWithReference(llm=evaluator_llm),
LLMContextRecall(llm=evaluator_llm),
ContextRelevance(llm=evaluator_llm),
]

6 changes: 3 additions & 3 deletions docs/howtos/integrations/haystack.ipynb
@@ -226,7 +226,7 @@
"For example:\n",
"\n",
"- **AnswerRelevancy**: requires both the **query** and the **response**.\n",
"- **ContextPrecision**: requires the **query**, **retrieved documents**, and the **reference**.\n",
"- **LLMContextPrecisionWithReference**: requires the **query**, **retrieved documents**, and the **reference**.\n",
"- **Faithfulness**: requires the **query**, **retrieved documents**, and the **response**.\n",
"\n",
"Make sure to include all relevant data for each metric to ensure accurate evaluation."
@@ -242,13 +242,13 @@
"\n",
"from langchain_openai import ChatOpenAI\n",
"from ragas.llms import LangchainLLMWrapper\n",
"from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness\n",
"from ragas.metrics import AnswerRelevancy, LLMContextPrecisionWithReference, Faithfulness\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")\n",
"evaluator_llm = LangchainLLMWrapper(llm)\n",
"\n",
"ragas_evaluator = RagasEvaluator(\n",
" ragas_metrics=[AnswerRelevancy(), ContextPrecision(), Faithfulness()],\n",
" ragas_metrics=[AnswerRelevancy(), LLMContextPrecisionWithReference(), Faithfulness()],\n",
" evaluator_llm=evaluator_llm,\n",
")"
]