Contributor

@Sanjaykumar030 Sanjaykumar030 commented Sep 13, 2025

This PR fixes a bug in the generate_examples function where datasets.Value features with a float dtype were incorrectly generated using np.random.randint. This resulted in integer values being cast to float, which is not representative of true floating-point data.

Key changes include:

  • Added explicit handling for float features using np.random.rand to generate continuous values.
  • Introduced fail-fast type checks for unsupported dtypes to improve robustness.
  • Added validation for sequence features to ensure seq_shapes is provided.
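The three changes above can be sketched roughly as follows. This is a minimal illustration and not the actual diff: the helper names random_scalar and random_sequence are hypothetical, dtypes are passed as plain strings instead of datasets.Value objects, and the exception types are assumptions.

```python
import numpy as np

def random_scalar(dtype: str):
    """Hypothetical helper: draw one dummy value for a scalar dtype name."""
    if dtype == "string":
        return "The small grey turtle was surprisingly fast..."
    if dtype.startswith("float"):
        # Continuous draw from [0, 1), then cast to the requested float width.
        return np.random.rand(1).astype(dtype).item()
    if dtype.startswith(("int", "uint")):
        # Integers keep the randint path.
        return np.random.randint(10, size=1).astype(dtype).item()
    # Fail fast instead of silently generating misleading data.
    raise TypeError(f"Unsupported dtype: {dtype!r}")

def random_sequence(name: str, dtype: str, seq_shapes=None):
    """Hypothetical helper: draw a dummy array for a sequence feature."""
    if seq_shapes is None or name not in seq_shapes:
        raise ValueError(f"seq_shapes must provide a shape for {name!r}")
    return np.random.rand(*seq_shapes[name]).astype(dtype)
```

Casting to a narrow float type can round a draw very close to 1 up to exactly 1.0, so the half-open range is only guaranteed before the cast.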

Before Fix

Float features were generated incorrectly as integers cast to float:

- Example 0:
- int_feature: 0
- float_feature: 9.0  <-- Incorrect: An integer disguised as a float
- string_feature: The small grey turtle was surprisingly fast...
- seq_feature: [0.3048 0.4291 0.4283]

After Fix

Float features are now correctly generated as continuous numbers in the range [0, 1):

+ Example 0:
+ int_feature: 0
+ float_feature: 0.0183  <-- Correct: A true random float
+ string_feature: The small grey turtle was surprisingly fast...
+ seq_feature: [0.9237 0.7972 0.8526]
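The difference between the two outputs can be checked directly with NumPy. The snippet below is a small sketch of that check and is not taken from the PR; the variable names and the sample size of 5 are illustrative.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the check is repeatable

# Old path: integers drawn with randint, then cast to float.
old_values = np.random.randint(10, size=5).astype("float32").tolist()

# New path: continuous draws from [0, 1).
new_values = np.random.rand(5).tolist()

# Every old value is a whole number stored as a float.
old_all_whole = all(v.is_integer() for v in old_values)

# New values fall inside the half-open unit interval.
new_in_unit_interval = all(0.0 <= v < 1.0 for v in new_values)
```

A value such as 9.0 passes is_integer(), which is what the "Before Fix" output shows for float_feature.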

Note: This PR follows up on the previously closed PR #7769, which is referenced here for context.

@Sanjaykumar030 Sanjaykumar030 changed the title Fix: Correct float feature generation in generate_examples #7769 Fix: Correct float feature generation in generate_examples Sep 13, 2025
@Sanjaykumar030
Contributor Author

Hi @lhoestq, just a gentle follow-up on this PR.
