
Commit 0e93396

docs: updated testset generation (#510)
Co-authored-by: Jithin James <[email protected]>
1 parent fc6ef22 commit 0e93396

File tree

4 files changed (+40 -60 lines)


docs/_static/imgs/question_types.png

-5.43 KB (binary file changed)

docs/concepts/testset_generation.md

+12 -37 lines changed
@@ -57,40 +57,25 @@ Checkout [llama-index](https://gpt-index.readthedocs.io/en/stable/core_modules/d


```{code-block} python
-:caption: Customising test set generation
-from ragas.testset import TestsetGenerator
-from langchain.embeddings import OpenAIEmbeddings
-from langchain.chat_models import ChatOpenAI
-from ragas.llms import LangchainLLM
+:caption: Customising test data distribution
+from ragas.testset.generator import TestsetGenerator
+from ragas.testset.evolutions import simple, reasoning, multi_context

# documents = load your documents

-# Add custom llms and embeddings
-generator_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-3.5-turbo"))
-critic_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-4"))
-embeddings_model = OpenAIEmbeddings()
+# generator with openai models
+generator = TestsetGenerator.with_openai()

# Change resulting question type distribution
-testset_distribution = {
-    "simple": 0.25,
-    "reasoning": 0.5,
-    "multi_context": 0.0,
-    "conditional": 0.25,
+distributions = {
+    simple: 0.5,
+    multi_context: 0.4,
+    reasoning: 0.1
}

-# percentage of conversational question
-chat_qa = 0.2
-
-test_generator = TestsetGenerator(
-    generator_llm=generator_llm,
-    critic_llm=critic_llm,
-    embeddings_model=embeddings_model,
-    testset_distribution=testset_distribution,
-    chat_qa=chat_qa,
-)
-
-testset = test_generator.generate(documents, test_size=5)
+# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
+testset = generator.generate_with_langchain_docs(documents, 10, distributions)
+testset.to_pandas()

```

@@ -109,16 +94,6 @@ test_df.head()

Analyze the frequency of different question types in the created dataset

-```{code-block} python
-:caption: bar graph of question types
-import seaborn as sns
-sns.set(rc={'figure.figsize':(9,6)})
-
-test_data_dist = test_df.question_type.value_counts().to_frame().reset_index()
-sns.set_theme(style="whitegrid")
-g = sns.barplot(y='count',x='question_type', data=test_data_dist)
-g.set_title("Question type distribution",fontdict = { 'fontsize': 20})
-```

<p align="left">
<img src="../_static/imgs/question_types.png" alt="test-outputs" width="450" height="400" />
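
Note: the new `distributions` mapping in this doc is just a dict of evolution weights, so skewing generation toward a particular question type only means changing the weights. A minimal sketch under the same assumptions as the snippet above (`OPENAI_API_KEY` set and `documents` loaded with a langchain loader; the weights here are illustrative, not library defaults):

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# documents = load your documents with a langchain document loader

generator = TestsetGenerator.with_openai()

# illustrative weights that favour reasoning-style questions
heavy_reasoning = {
    simple: 0.2,
    reasoning: 0.6,
    multi_context: 0.2,
}

# same call shape as in the docs above: (documents, test_size, distributions)
testset = generator.generate_with_langchain_docs(documents, 10, heavy_reasoning)
testset.to_pandas()
```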

docs/getstarted/testset_generation.md

+16 -19 lines changed
@@ -11,30 +11,23 @@ os.environ["OPENAI_API_KEY"] = "your-openai-key"

## Documents

-To begin, we require a collection of documents to generate synthetic Question/Context/Answer samples. Here, we will employ the llama-index document loaders to retrieve documents.
+To begin, we require a collection of documents to generate synthetic Question/Context/Answer samples. Here, we will employ the langchain document loader to load documents.

```{code-block} python
-:caption: Load documents from Semantic Scholar
-from llama_index import download_loader
-
-SemanticScholarReader = download_loader("SemanticScholarReader")
-loader = SemanticScholarReader()
-# Narrow down the search space
-query_space = "large language models"
-# Increase the limit to obtain more documents
-documents = loader.load_data(query=query_space, limit=10)
+:caption: Load documents from directory
+from langchain.document_loaders import DirectoryLoader
+loader = DirectoryLoader("your-directory")
+documents = loader.load()
```

:::{note}
Each Document object contains a metadata dictionary, which can be used to store additional information about the document and can be accessed with `Document.metadata`. Please ensure that the metadata dictionary contains a key called `file_name`, as this will be used in the generation process. The `file_name` attribute in metadata is used to identify chunks belonging to the same document. For example, pages belonging to the same research publication can be identified using the filename.

-An example of how to do this for `SemanticScholarReader` is shown below.
+An example of how to do this is shown below.

```{code-block} python
-for d in documents:
-    d.metadata["file_name"] = d.metadata["title"]
-
-documents[0].metadata
+for document in documents:
+    document.metadata['file_name'] = document.metadata['source']
```
:::

@@ -46,11 +39,15 @@ We will now import and use Ragas' `Testsetgenerator` to promptly generate a synt

```{code-block} python
:caption: Create 10 samples using default configuration
-from ragas.testset import TestsetGenerator
+from ragas.testset.generator import TestsetGenerator
+from ragas.testset.evolutions import simple, reasoning, multi_context
+
+# generator with openai models
+generator = TestsetGenerator.with_openai()

-testsetgenerator = TestsetGenerator.from_default()
-test_size = 10
-testset = testsetgenerator.generate(documents, test_size=test_size)
+# generate testset
+testset = generator.generate_with_langchain_docs(documents, test_size=10)
+testset.to_pandas()
```

Subsequently, we can export the results into a Pandas DataFrame.
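
Note: read end to end, the updated getting-started page amounts to the flow below. This is a sketch that assumes a valid OpenAI key and that "your-directory" stands in for a real folder of documents; every call used here appears in the diffs above.

```python
import os

from langchain.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator

os.environ["OPENAI_API_KEY"] = "your-openai-key"

# load documents with the langchain directory loader
loader = DirectoryLoader("your-directory")
documents = loader.load()

# the generator expects a `file_name` key in each document's metadata;
# langchain loaders provide `source`, which is reused here
for document in documents:
    document.metadata["file_name"] = document.metadata["source"]

# default OpenAI-backed generator and a 10-sample testset
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(documents, test_size=10)
testset.to_pandas()
```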

src/ragas/testset/generator.py

+12 -4 lines changed
@@ -13,17 +13,25 @@
from ragas.executor import Executor
from ragas.llms import BaseRagasLLM, LangchainLLMWrapper
from ragas.testset.docstore import Document, DocumentStore, InMemoryDocumentStore
-from ragas.testset.evolutions import ComplexEvolution, CurrentNodes, DataRow
+from ragas.testset.evolutions import (
+    ComplexEvolution,
+    CurrentNodes,
+    DataRow,
+    multi_context,
+    reasoning,
+    simple,
+)
from ragas.testset.filters import EvolutionFilter, NodeFilter, QuestionFilter

if t.TYPE_CHECKING:
    from llama_index.readers.schema import Document as LlamaindexDocument
    from langchain_core.documents import Document as LCDocument

-Distributions = t.Dict[t.Any, float]
-
logger = logging.getLogger(__name__)

+Distributions = t.Dict[t.Any, float]
+DEFAULT_DISTRIBUTION = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}
+

@dataclass
class TestDataset:

@@ -126,7 +134,7 @@ def generate_with_langchain_docs(
    def generate(
        self,
        test_size: int,
-        distributions: Distributions = {},
+        distributions: Distributions = DEFAULT_DISTRIBUTION,
        with_debugging_logs=False,
    ):
        # init filters and evolutions
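
Note: the practical effect of the new default is that callers omitting `distributions` get the 50/25/25 split instead of an empty mapping. A minimal sketch against the `generate` signature above, assuming the generator's document store has already been populated (for example via one of the `generate_with_*_docs` helpers shown in the docs diffs):

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import multi_context, reasoning, simple

generator = TestsetGenerator.with_openai()
# assumption: documents were already added to the generator's docstore

# no distributions passed: falls back to
# DEFAULT_DISTRIBUTION = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}
testset = generator.generate(test_size=10)

# an explicit mapping still overrides the default
testset = generator.generate(
    test_size=10,
    distributions={reasoning: 0.5, multi_context: 0.5},
)
```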
