refactor: remove unused argument rouge_threshold

makelinux · makelinux · commit 2fbbaaaeed37 · 2024-09-10T11:20:58.000+03:00
instructlab.sdg.generate_data no longer uses the `rouge_threshold`
argument, so drop the `--rouge-threshold` CLI option and stop passing it.

TODO: remove the argument from generate_data() completely.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,7 @@
 
 * InstructLab now uses XDG-based directories on macOS, similar to Linux.
   Users are advised to re-initialize their config files and remove cached models.
+* Removed unused argument `--rouge-threshold` of `ilab data generate`
 
 ## v0.18.1
 
diff --git a/TROUBLESHOOTING.md b/TROUBLESHOOTING.md
@@ -72,9 +72,6 @@ The data generation step is executed via the `ilab data generate` command, and i
 1. Increase the number of instructions generated by passing the `--num-instructions` flag to the `ilab data generate` command as follows: `ilab data generate --num-instructions 1000`.
    The `--num-instructions` flag will generate 1000 points of synthetic data based on your provided examples. The greater the number of instructions generated, the better the model will be trained (within reasonable limits).
 
-2. Adjust the rouge threshold via `--rouge-threshold` parameter. Rogue threshold is a parameter that determines how likely a synthetic data point, generated by the model, will be added to the output based on how similar it is to previously generated data points.
-    The value of rouge threshold ranges from 1.0 to 0.0, where 1.0 indicates maximum leniency (every newly generated data point is accepted) and 0.0 indicates maximum strictness (every newly generated data point is rejected). Setting a rouge threshold value closer to 0.0 would force the model to generate data that is different from what it has already generated, leading to a more diverse dataset overall. Rouge threshold can be set as follows: `ilab data generate --rouge-threshold 0.75`
-
 3. Using a better model via `--model`. Larger models can lead to better data generation. This option requires users to be familiar with various existing models, and which specific models would suit their needs. This could mean either using a model with more nodes than the default InstructLab `merlinite-7b-lab` model, such as the `Mixtral-8x7B-Instruct-v0.1` model, or using an unquantized version of the InstructLab `merlinite-7b-lab` model. It can be used as follows: `ilab serve --model-path models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf` and `ilab data generate --model models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf`
 
 4. Set the number of CPU cores that can be used to generate data via `--num-cpus`. This defaults to 10, but increasing this value could potentially lead to better generated data. It can be used as follows: `ilab data generate --num-cpus 15`
diff --git a/src/instructlab/data/generate.py b/src/instructlab/data/generate.py
@@ -72,13 +72,6 @@
     cls=clickext.ConfigOption,
     help="Path to output generated files.",
 )
-@click.option(
-    "--rouge-threshold",
-    type=click.FLOAT,
-    default=0.9,
-    show_default=True,
-    help="Threshold of (max) Rouge score to keep samples; 1.0 means accept all samples.",
-)
 @click.option(
     "--quiet",
     is_flag=True,
@@ -173,7 +166,6 @@ def generate(
     taxonomy_path,
     taxonomy_base,
     output_dir,
-    rouge_threshold,
     quiet,
     endpoint_url,
     api_key,
@@ -307,7 +299,6 @@ def generate(
             taxonomy=taxonomy_path,
             taxonomy_base=taxonomy_base,
             output_dir=output_dir,
-            rouge_threshold=rouge_threshold,
             console_output=not quiet,
             yaml_rules=yaml_rules,
             chunk_word_count=chunk_word_count,