You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
⚠️ Please check that this feature request hasn't been suggested before.
I searched previous Ideas in Discussions didn't find any similar feature requests.
I searched previous Issues didn't find any similar feature requests.
🔖 Feature description
Our current dataset configuration combines multiple concerns (e.g., how to locate the dataset vs. how to interpret its fields) across different classes (SFTDataset, DPODataset, and KTODataset), with overlapping fields and loosely enforced typing. We want to improve maintainability, reduce duplication, and enforce clearer type boundaries without breaking existing configurations.
✔️ Solution
Modular Classes
Separate “metadata” (where/how the dataset is hosted) from “field definitions” (how dataset fields are processed).
Consider that metadata itself might need a discriminated union (e.g., local vs. remote hosting).
Discriminated Unions
Use tagged (discriminated) unions to let Pydantic clearly decide which dataset type (SFT, DPO, or KTO) is in use, potentially via a callable discriminator if older configs do not include a type field.
Multiple Inheritance or Composition
Consolidate shared fields into base or mixin classes to avoid duplication.
Let each dataset-specific class inherit or compose these base pieces, keeping user-facing shapes similar to the existing config.
Backward Compatibility
Preserve older configs (e.g., do not force a new field if it’s not already present).
Use fallback logic to detect dataset types if a type field is missing.
Why: This approach reduces duplicated fields, clarifies validation (especially for dataset-specific logic), and allows new hosting or dataset types to be added more seamlessly in the future.
❓ Alternatives
Single Monolithic Class: Keep everything as-is but add more validators. This would not fully address the duplication or clarity issues.
Require an Explicit type Field: Easier for strict typing but could break older configs. The callable discriminator is more flexible.
Composition Over Inheritance: Instead of multiple inheritance, each dataset class could hold separate “MetadataConfig” or “FieldConfig” objects. It’s another valid pattern, but multiple inheritance may better preserve existing class shapes.
📝 Additional Context
Some fields (e.g., path, split, data_files, revision) are repeated across all current dataset classes.
Each dataset also has unique validation (e.g., SFT’s chat templates vs. DPO’s chosen/rejected fields).
We want to be able to expand hosting modes (like local, Hugging Face, or S3) without tangling new logic into each dataset class.
A callable discriminator helps if we don’t want to require new fields in legacy configs.
Acknowledgements
My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.
The text was updated successfully, but these errors were encountered:
🔖 Feature description
Our current dataset configuration combines multiple concerns (e.g., how to locate the dataset vs. how to interpret its fields) across different classes (SFTDataset, DPODataset, and KTODataset), with overlapping fields and loosely enforced typing. We want to improve maintainability, reduce duplication, and enforce clearer type boundaries without breaking existing configurations.
✔️ Solution
Modular Classes
Discriminated Unions
type
field.Multiple Inheritance or Composition
Backward Compatibility
type
field is missing.Why: This approach reduces duplicated fields, clarifies validation (especially for dataset-specific logic), and allows new hosting or dataset types to be added more seamlessly in the future.
❓ Alternatives
type
Field: Easier for strict typing but could break older configs. The callable discriminator is more flexible.📝 Additional Context
path
,split
,data_files
,revision
) are repeated across all current dataset classes.Acknowledgements
The text was updated successfully, but these errors were encountered: