Commit 2fbbaaa

refactor: remove unused argument rouge_threshold
instructlab.sdg.generate_data doesn't use argument `rouge_threshold` anymore.

TODO: remove the argument from generate_data() completely.

Signed-off-by: Costa Shulyupin <[email protected]>
1 parent 9f413fe commit 2fbbaaa

3 files changed, +1 -12 lines changed

CHANGELOG.md (+1)

@@ -4,6 +4,7 @@
 
 * InstructLab now uses XDG-based directories on macOS, similar to Linux.
 Users are advised to re-initialize their config files and remove cached models.
+* Removed unused argument `--rouge-threshold` of `ilab data generate`
 
 ## v0.18.1
 

TROUBLESHOOTING.md (-3)

@@ -72,9 +72,6 @@ The data generation step is executed via the `ilab data generate` command, and i
 1. Increase the number of instructions generated by passing the `--num-instructions` flag to the `ilab data generate` command as follows: `ilab data generate --num-instructions 1000`.
 The `--num-instructions` flag will generate 1000 points of synthetic data based on your provided examples. The greater the number of instructions generated, the better the model will be trained (within reasonable limits).
 
-2. Adjust the rouge threshold via `--rouge-threshold` parameter. Rogue threshold is a parameter that determines how likely a synthetic data point, generated by the model, will be added to the output based on how similar it is to previously generated data points.
-The value of rouge threshold ranges from 1.0 to 0.0, where 1.0 indicates maximum leniency (every newly generated data point is accepted) and 0.0 indicates maximum strictness (every newly generated data point is rejected). Setting a rouge threshold value closer to 0.0 would force the model to generate data that is different from what it has already generated, leading to a more diverse dataset overall. Rouge threshold can be set as follows: `ilab data generate --rouge-threshold 0.75`
-
 3. Using a better model via `--model`. Larger models can lead to better data generation. This option requires users to be familiar with various existing models, and which specific models would suit their needs. This could mean either using a model with more nodes than the default InstructLab `merlinite-7b-lab` model, such as the `Mixtral-8x7B-Instruct-v0.1` model, or using an unquantized version of the InstructLab `merlinite-7b-lab` model. It can be used as follows: `ilab serve --model-path models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf` and `ilab data generate --model models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf`
 
 4. Set the number of CPU cores that can be used to generate data via `--num-cpus`. This defaults to 10, but increasing this value could potentially lead to better generated data. It can be used as follows: `ilab data generate --num-cpus 15`
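
For readers unfamiliar with the option documented in the removed paragraph, the following is a minimal sketch of similarity-threshold filtering of that general kind. It is not the InstructLab implementation: the function names, the whitespace tokenization, the ROUGE-L F1 stand-in, and the `<=` comparison (chosen so that a threshold of 1.0 accepts every sample, as the removed help text states) are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L style F1 over whitespace tokens (illustrative stand-in)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_by_threshold(samples, threshold):
    """Keep a sample only if its max similarity to kept samples is <= threshold.

    Assumption: a sample with nothing to compare against scores 0.0.
    """
    kept = []
    for sample in samples:
        max_score = max((rouge_l_f1(sample, k) for k in kept), default=0.0)
        if max_score <= threshold:
            kept.append(sample)
    return kept


samples = ["list the planets", "list the planets", "name three fruits"]
kept_all = filter_by_threshold(samples, 1.0)   # lenient: duplicates survive
kept_half = filter_by_threshold(samples, 0.5)  # stricter: exact duplicate dropped
```

In this sketch a lower threshold rejects more near-duplicates, which matches the removed text's description of lower values producing a more diverse dataset.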

src/instructlab/data/generate.py (-9)

@@ -72,13 +72,6 @@
     cls=clickext.ConfigOption,
     help="Path to output generated files.",
 )
-@click.option(
-    "--rouge-threshold",
-    type=click.FLOAT,
-    default=0.9,
-    show_default=True,
-    help="Threshold of (max) Rouge score to keep samples; 1.0 means accept all samples.",
-)
 @click.option(
     "--quiet",
     is_flag=True,
@@ -173,7 +166,6 @@ def generate(
     taxonomy_path,
     taxonomy_base,
     output_dir,
-    rouge_threshold,
     quiet,
     endpoint_url,
     api_key,
@@ -307,7 +299,6 @@ def generate(
             taxonomy=taxonomy_path,
             taxonomy_base=taxonomy_base,
             output_dir=output_dir,
-            rouge_threshold=rouge_threshold,
             console_output=not quiet,
             yaml_rules=yaml_rules,
             chunk_word_count=chunk_word_count,
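
The commit message notes that the argument still has to be removed from `generate_data()` itself. A small sketch of why the caller-side change can land first, assuming the callee still declares the parameter with a default: `generate_data_stub` below is a hypothetical stand-in, not the real `instructlab.sdg.generate_data` signature.

```python
import inspect


def generate_data_stub(taxonomy, output_dir, rouge_threshold=None, console_output=True):
    """Hypothetical stand-in: rouge_threshold is still accepted but ignored."""
    return {
        "taxonomy": taxonomy,
        "output_dir": output_dir,
        "console_output": console_output,
    }


# Call in the style of the updated CLI code: rouge_threshold is no longer passed.
result = generate_data_stub(
    taxonomy="taxonomy",
    output_dir="generated",
    console_output=True,
)

# The unused parameter is still part of the signature until the TODO is done.
params = inspect.signature(generate_data_stub).parameters
unused_still_present = "rouge_threshold" in params
```

Under that assumption, dropping the keyword at the call site does not change behavior, and the parameter can be deleted from the callee in a follow-up change.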
