Refactor Dataset Configuration for Modular Typing, Discriminated Unions, and Backward Compatibility #2271

NJordan72 · 2025-01-21T14:11:47Z

⚠️ Please check that this feature request hasn't been suggested before.

I searched previous Ideas in Discussions didn't find any similar feature requests.
I searched previous Issues didn't find any similar feature requests.

🔖 Feature description

Our current dataset configuration combines multiple concerns (e.g., how to locate the dataset vs. how to interpret its fields) across different classes (SFTDataset, DPODataset, and KTODataset), with overlapping fields and loosely enforced typing. We want to improve maintainability, reduce duplication, and enforce clearer type boundaries without breaking existing configurations.

✔️ Solution

Modular Classes
- Separate “metadata” (where/how the dataset is hosted) from “field definitions” (how dataset fields are processed).
- Consider that metadata itself might need a discriminated union (e.g., local vs. remote hosting).
Discriminated Unions
- Use tagged (discriminated) unions to let Pydantic clearly decide which dataset type (SFT, DPO, or KTO) is in use, potentially via a callable discriminator if older configs do not include a type field.
Multiple Inheritance or Composition
- Consolidate shared fields into base or mixin classes to avoid duplication.
- Let each dataset-specific class inherit or compose these base pieces, keeping user-facing shapes similar to the existing config.
Backward Compatibility
- Preserve older configs (e.g., do not force a new field if it’s not already present).
- Use fallback logic to detect dataset types if a type field is missing.

Why: This approach reduces duplicated fields, clarifies validation (especially for dataset-specific logic), and allows new hosting or dataset types to be added more seamlessly in the future.

❓ Alternatives

Single Monolithic Class: Keep everything as-is but add more validators. This would not fully address the duplication or clarity issues.
Require an Explicit type Field: Easier for strict typing but could break older configs. The callable discriminator is more flexible.
Composition Over Inheritance: Instead of multiple inheritance, each dataset class could hold separate “MetadataConfig” or “FieldConfig” objects. It’s another valid pattern, but multiple inheritance may better preserve existing class shapes.

📝 Additional Context

Some fields (e.g., path, split, data_files, revision) are repeated across all current dataset classes.
Each dataset also has unique validation (e.g., SFT’s chat templates vs. DPO’s chosen/rejected fields).
We want to be able to expand hosting modes (like local, Hugging Face, or S3) without tangling new logic into each dataset class.
A callable discriminator helps if we don’t want to require new fields in legacy configs.

Acknowledgements

My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.

The text was updated successfully, but these errors were encountered:

NJordan72 added the enhancement New feature or request label Jan 21, 2025

NJordan72 mentioned this issue Jan 21, 2025

feat: add config for optional parameters in a chat message #2260

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Dataset Configuration for Modular Typing, Discriminated Unions, and Backward Compatibility #2271

Refactor Dataset Configuration for Modular Typing, Discriminated Unions, and Backward Compatibility #2271

NJordan72 commented Jan 21, 2025

Refactor Dataset Configuration for Modular Typing, Discriminated Unions, and Backward Compatibility #2271

Refactor Dataset Configuration for Modular Typing, Discriminated Unions, and Backward Compatibility #2271

Comments

NJordan72 commented Jan 21, 2025

⚠️ Please check that this feature request hasn't been suggested before.

🔖 Feature description

✔️ Solution

❓ Alternatives

📝 Additional Context

Acknowledgements