Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Dataset Configuration for Modular Typing, Discriminated Unions, and Backward Compatibility #2271

Open
5 tasks done
NJordan72 opened this issue Jan 21, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

@NJordan72
Copy link
Contributor

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions didn't find any similar feature requests.
  • I searched previous Issues didn't find any similar feature requests.

🔖 Feature description

Our current dataset configuration combines multiple concerns (e.g., how to locate the dataset vs. how to interpret its fields) across different classes (SFTDataset, DPODataset, and KTODataset), with overlapping fields and loosely enforced typing. We want to improve maintainability, reduce duplication, and enforce clearer type boundaries without breaking existing configurations.

✔️ Solution

  1. Modular Classes

    • Separate “metadata” (where/how the dataset is hosted) from “field definitions” (how dataset fields are processed).
    • Consider that metadata itself might need a discriminated union (e.g., local vs. remote hosting).
  2. Discriminated Unions

    • Use tagged (discriminated) unions to let Pydantic clearly decide which dataset type (SFT, DPO, or KTO) is in use, potentially via a callable discriminator if older configs do not include a type field.
  3. Multiple Inheritance or Composition

    • Consolidate shared fields into base or mixin classes to avoid duplication.
    • Let each dataset-specific class inherit or compose these base pieces, keeping user-facing shapes similar to the existing config.
  4. Backward Compatibility

    • Preserve older configs (e.g., do not force a new field if it’s not already present).
    • Use fallback logic to detect dataset types if a type field is missing.

Why: This approach reduces duplicated fields, clarifies validation (especially for dataset-specific logic), and allows new hosting or dataset types to be added more seamlessly in the future.

❓ Alternatives

  • Single Monolithic Class: Keep everything as-is but add more validators. This would not fully address the duplication or clarity issues.
  • Require an Explicit type Field: Easier for strict typing but could break older configs. The callable discriminator is more flexible.
  • Composition Over Inheritance: Instead of multiple inheritance, each dataset class could hold separate “MetadataConfig” or “FieldConfig” objects. It’s another valid pattern, but multiple inheritance may better preserve existing class shapes.

📝 Additional Context

  • Some fields (e.g., path, split, data_files, revision) are repeated across all current dataset classes.
  • Each dataset also has unique validation (e.g., SFT’s chat templates vs. DPO’s chosen/rejected fields).
  • We want to be able to expand hosting modes (like local, Hugging Face, or S3) without tangling new logic into each dataset class.
  • A callable discriminator helps if we don’t want to require new fields in legacy configs.

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant