[ON HOLD] feat: auto detection of model encoding types #288

Closed
wants to merge 37 commits into from

shuangwu5
Contributor

@shuangwu5 shuangwu5 commented Feb 28, 2025

Main changes:

  • Introduce new model encoding types: TABULAR_AUTO, LANGUAGE_AUTO
  • Change the behavior of the existing model encoding type AUTO: before this PR it meant the same as TABULAR_AUTO; now it behaves like the union of TABULAR_AUTO and LANGUAGE_AUTO
  • Extend the auto-detection logic:
    • LANGUAGE_AUTO -> auto-detects LANGUAGE_NUMERIC, LANGUAGE_DATETIME, LANGUAGE_TEXT
    • TABULAR_AUTO -> "highly-unique categorical columns" (categorical columns that have more than 100 rows and in which more than 5% of rows contain unique values) are detected as TABULAR_CHARACTER
    • AUTO -> "highly-unique categorical columns" are detected as TABULAR_CHARACTER when the values in all rows have the same length; otherwise they are detected as LANGUAGE_TEXT
  • Run auto detection before creating a generator (if columns are not defined in the config at all, or have one of the *_AUTO encoding types)
  • Skip validation of *_model_configuration in SourceTableConfig before auto detection
  • Revalidate SourceTableConfig strictly (i.e., including *_model_configuration) after auto detection is done
  • Ensure that SourceTable always has strict validation
  • New unit tests: test_auto_detect_encoding_types_and_pk, test__auto_detect_encoding_type, test__auto_detect_primary_key
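
The "highly-unique categorical column" rule above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names are hypothetical, "more than 5% of rows contain unique values" is interpreted here as the ratio of distinct values to rows, and the `TABULAR_CATEGORICAL` fallback for columns that are not highly unique is an assumption.

```python
from collections.abc import Sequence

# Thresholds quoted in the PR description.
MIN_ROWS = 100
MIN_UNIQUE_RATIO = 0.05


def is_highly_unique(values: Sequence[str]) -> bool:
    """Assumed reading of the rule: >100 rows and distinct/rows ratio above 5%."""
    n_rows = len(values)
    if n_rows <= MIN_ROWS:
        return False
    return len(set(values)) / n_rows > MIN_UNIQUE_RATIO


def detect_categorical_encoding(values: Sequence[str], mode: str = "AUTO") -> str:
    """Hypothetical helper: pick an encoding type name for a categorical column."""
    if mode not in ("AUTO", "TABULAR_AUTO"):
        raise ValueError(f"unsupported mode: {mode}")
    if not is_highly_unique(values):
        # Assumption: ordinary categorical columns keep a categorical encoding.
        return "TABULAR_CATEGORICAL"
    if mode == "TABULAR_AUTO":
        return "TABULAR_CHARACTER"
    # AUTO: fixed-length values are treated as character sequences,
    # variable-length values as free text.
    same_length = len({len(v) for v in values}) == 1
    return "TABULAR_CHARACTER" if same_length else "LANGUAGE_TEXT"
```

For example, 200 distinct zero-padded codes of equal length would map to TABULAR_CHARACTER under both modes, while 200 distinct strings of varying length would map to LANGUAGE_TEXT under AUTO and TABULAR_CHARACTER under TABULAR_AUTO.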

Code refactoring:

  • Move the main logic for detecting the location schema from Core to the SDK
  • Move the auto-detection logic from an isolated script into the DataTable class

Misc / Changes that are not directly related to the requirements:

  • Update CustomBaseModel in mostlyai/sdk/_data/metadata_objects.py so that the classes can be initialized without using aliases/camelCase field names.
  • Fix a bug in MostlyAI.train(): allow data to be fed as the first unnamed argument
  • Fix one incorrect test case in test_execution_plan.py
  • Add .DS_Store to .gitignore
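
For the CustomBaseModel change, one common way to let a Pydantic v2 model accept both snake_case field names and camelCase aliases is `populate_by_name=True`. The sketch below assumes that approach; the class names are hypothetical stand-ins, and the actual change in `metadata_objects.py` may differ.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel


class CamelAliasModel(BaseModel):
    """Hypothetical stand-in for CustomBaseModel (not the SDK's actual class)."""

    # alias_generator derives camelCase aliases from the field names;
    # populate_by_name additionally allows the Python field names on input.
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)


class TableSketch(CamelAliasModel):
    """Hypothetical example model with a single snake_case field."""

    primary_key: Optional[str] = None


by_name = TableSketch(primary_key="id")   # snake_case field name
by_alias = TableSketch(primaryKey="id")   # camelCase alias
```

Both constructions yield equal objects, and `by_name.model_dump(by_alias=True)` serializes back to the camelCase key.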

@shuangwu5 shuangwu5 changed the title feat: auto detection of model encoding types [ON HOLD] feat: auto detection of model encoding types May 20, 2025
@shuangwu5 shuangwu5 closed this Jul 8, 2025