tutorials/nemo-retriever-synthetic-data-generation/README.md (+21 -8)
@@ -45,22 +45,35 @@ Navigate to the [quick start notebook](notebooks/quickstart.ipynb) and follow th
### Run Pipeline (CLI)
-The pipeline can be run with datasets in rawdoc (only text, title and ids if any) format. To test the pipeline, you can use the provided example data at ```sample_data_rawdoc.jsonl```
+The pipeline accepts datasets in ```jsonl``` format (only text, title, and ids, if any). To test the pipeline, use the provided example data at ```sample_data/sample_data_rawdoc.jsonl```.
-Navigate to the top level of this project directory and run the following command in your command line. It will take roughly 5-10 minutes.
+To use jsonl format, provide your data in one or more `.jsonl` files. Each record should follow this format: `{"text": <document>, "title": <title>}`. If the documents already have document ids, the input file can include them, and the same ids are persisted in the generated data. In that case the accepted format is `{"_id": <document_id>, "text": <document>, "title": <title>}`.
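A minimal sketch of the record format described in the added line above, for readers who want to check their input before running the pipeline. The helper names (`is_valid_record`, `to_jsonl`, `parse_jsonl`) are hypothetical and not part of the tutorial, and rejecting keys other than `text`, `title`, and `_id` is an assumption drawn from the "only text, title and ids" wording.

```python
import json

# Keys named in the README text; anything else is treated as unexpected (assumption).
REQUIRED_KEYS = {"text", "title"}
OPTIONAL_KEYS = {"_id"}


def is_valid_record(record):
    """Return True if the record has 'text' and 'title' and, at most, an extra '_id'."""
    keys = set(record)
    return REQUIRED_KEYS <= keys <= (REQUIRED_KEYS | OPTIONAL_KEYS)


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def parse_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Illustrative records only; the content is made up for the example.
sample = [
    {"text": "First document body.", "title": "Doc one"},
    {"_id": "doc-2", "text": "Second document body.", "title": "Doc two"},
]
assert all(is_valid_record(r) for r in sample)
assert parse_jsonl(to_jsonl(sample)) == sample
```

Writing the string returned by `to_jsonl` to a file with a `.jsonl` extension would give an input in the shape the README describes.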
--`Rawdoc format`
-
-To use rawdoc format, provide your data in a `.jsonl` file. The structure of the data should follow this format: `{"text": <document>, "title": <title>}`. Additionally, if the documents already have a document id, the input file can also contain document ids. The same ids will be persisted in the generated data as well. Another accepted format is `{"_id": <document_id>, "text": <document>, "title": <title>}`.
+The pipeline can be run in two modes: generation and filtering. To run the full pipeline in generation mode, use the script ```main.py``` with the flag ```--pipeline-type=generate```.
+The generated data can be saved in two formats: jsonl and beir. For large datasets, pass the ```--n-partitions``` flag to speed up generation.
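The mode selection described in the added lines above could be assembled as follows. This is only a sketch: it uses the script name and the two flags that appear in this diff (`--pipeline-type` and `--n-partitions`), the `build_command` helper is hypothetical, and the `--n-partitions=<int>` value syntax is an assumption. Any other arguments the script requires (input, output, credentials) are not visible in this excerpt and are deliberately left out.

```python
import shlex

# The two modes named in the README text.
PIPELINE_TYPES = ("generate", "filter")


def build_command(pipeline_type, n_partitions=None):
    """Build an argv list for main.py from the flags mentioned in the README.

    Assumption: --n-partitions takes a positive integer via '=' syntax.
    The README only mentions it for generation; its effect in filter mode is not stated.
    """
    if pipeline_type not in PIPELINE_TYPES:
        raise ValueError(f"unknown pipeline type: {pipeline_type!r}")
    cmd = ["python", "main.py", f"--pipeline-type={pipeline_type}"]
    if n_partitions is not None:
        if n_partitions < 1:
            raise ValueError("n_partitions must be a positive integer")
        cmd.append(f"--n-partitions={n_partitions}")
    return cmd


# Render the command as a shell-safe string for inspection.
print(shlex.join(build_command("generate", n_partitions=4)))
# prints: python main.py --pipeline-type=generate --n-partitions=4
```

Building the command as a list keeps each flag a separate argument, so it can be passed to `subprocess.run` without shell quoting concerns once the remaining required arguments are added.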
-In order to run the pipeline, use the script ```main.py```
+To filter pre-generated data, run ```main.py``` with ```--pipeline-type=filter```.
+Note the change in ```input-dir```: it must point to the generated data in jsonl format.
num_criteria: 4 # Number of criteria to parse from the response. It must be aligned with the prompt template
answerability_system_prompt: |
  You are an evaluator who is rating questions to given context passages based on the given criteria. Assess the given question for clarity and answerability given enough domain knowledge, consider the following evaluation criterion: