Reusage schemas fix #1252

Merged
merged 8 commits into from
Jun 17, 2025
Conversation

Jolanrensen
Collaborator

@Jolanrensen Jolanrensen commented Jun 13, 2025

Fixes #1222 by replacing the strictlyEqualNestedSchemas parameter with a more explicit ComparisonMode.

Comparing two schemas in LENIENT mode can return IsSuper, IsDerived, IsEqual, or None.
Comparing in STRICT mode can only return IsEqual or None, because the schemas need to match exactly.
STRICT_FOR_NESTED_SCHEMAS works in LENIENT mode at the top level, but in STRICT mode for nested schemas. This is often used in Jupyter notebooks to prevent nested types from extending each other and thus avoid a potential comparison explosion (there can be a lot of nested types).
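
To make the three modes concrete, here is a minimal, self-contained sketch. It is a toy model and not the library's implementation: Column, Leaf, Group, combine, and compareSchemas are hypothetical names, a schema is just a map from column name to column, and "IsSuper" is assumed to mean that the left schema has a subset of the right schema's columns.

```kotlin
// Toy model only: these types are illustrative assumptions, not DataFrame's API.
enum class ComparisonMode { LENIENT, STRICT, STRICT_FOR_NESTED_SCHEMAS }

enum class CompareResult { Equals, IsSuper, IsDerived, None }

sealed interface Column
data class Leaf(val type: String) : Column
data class Group(val schema: Map<String, Column>) : Column

// Merge two partial results; conflicting directions collapse to None.
fun combine(a: CompareResult, b: CompareResult): CompareResult = when {
    a == b -> a
    a == CompareResult.Equals -> b
    b == CompareResult.Equals -> a
    else -> CompareResult.None
}

fun compareSchemas(
    a: Map<String, Column>,
    b: Map<String, Column>,
    mode: ComparisonMode = ComparisonMode.LENIENT,
): CompareResult {
    // The set of column names decides the starting relation.
    var result = when {
        a.keys == b.keys -> CompareResult.Equals
        b.keys.containsAll(a.keys) -> CompareResult.IsSuper
        a.keys.containsAll(b.keys) -> CompareResult.IsDerived
        else -> return CompareResult.None
    }
    // Only LENIENT stays lenient below the top level;
    // STRICT_FOR_NESTED_SCHEMAS switches to STRICT for nested schemas.
    val nestedMode =
        if (mode == ComparisonMode.LENIENT) ComparisonMode.LENIENT else ComparisonMode.STRICT
    for (name in a.keys intersect b.keys) {
        val x = a.getValue(name)
        val y = b.getValue(name)
        val columnResult = when {
            x is Leaf && y is Leaf ->
                if (x == y) CompareResult.Equals else CompareResult.None
            x is Group && y is Group -> compareSchemas(x.schema, y.schema, nestedMode)
            else -> CompareResult.None
        }
        result = combine(result, columnResult)
        if (result == CompareResult.None) return CompareResult.None
    }
    // STRICT accepts nothing but an exact match.
    return if (mode == ComparisonMode.STRICT && result != CompareResult.Equals) {
        CompareResult.None
    } else {
        result
    }
}
```

In this sketch, two schemas that differ only by an extra top-level column compare as IsSuper in both LENIENT and STRICT_FOR_NESTED_SCHEMAS modes and as None in STRICT mode. If the extra column sits inside a nested group instead, only LENIENT still reports IsSuper; the other two modes report None.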

Also added documentation throughout.

Requires a tiny patch in the compiler plugin.

@Jolanrensen Jolanrensen marked this pull request as ready for review June 16, 2025 11:11
@Jolanrensen Jolanrensen requested a review from koperagen June 16, 2025 11:12
@Jolanrensen Jolanrensen force-pushed the reusage-schemas-fix branch from b751010 to a0c3f2e Compare June 16, 2025 11:21
Collaborator

@koperagen koperagen left a comment

Seems good, but it's quite hard to review with many stylistic / non-functional changes :(

public fun compare(other: DataFrameSchema, strictlyEqualNestedSchemas: Boolean = false): CompareResult
public fun compare(
other: DataFrameSchema,
comparisonMode: ComparisonMode = STRICT_FOR_NESTED_SCHEMAS,
Collaborator

I think LENIENT should be the default for more visibility: it makes it easier to see where our "special" codegen mode, STRICT_FOR_NESTED_SCHEMAS, is used.

Collaborator Author

I agree :)

internal fun compareStrictlyEqualNestedSchemas(other: ColumnSchema): CompareResult = compare(other, true)

private fun compare(other: ColumnSchema, strictlyEqualNestedSchemas: Boolean): CompareResult {
public fun compare(other: ColumnSchema, comparisonMode: ComparisonMode = STRICT_FOR_NESTED_SCHEMAS): CompareResult {
if (kind != other.kind) return CompareResult.None
if (this === other) return CompareResult.Equals
return when (this) {
is Value -> compare(other as Value)
Collaborator

It seems odd that the comparison mode is not used here. How do you tell the difference between a nullable and a non-nullable column?

Collaborator Author

That's true, I had it at one point, but it wasn't in the implementation before... Let's see what breaks if I add it back

Collaborator Author

Ah, the same behavior was achieved by if (comparison != Equals && comparisonMode == STRICT) None else comparison in the other file. I've improved it so that ColumnSchema.Value also takes a comparisonMode argument and the "strictness increase" is better explained.
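
As a rough illustration of that "strictness increase" for value columns, the sketch below first computes a lenient result and then collapses anything other than Equals to None when the mode is STRICT. ValueColumn, typeName, and nullable are hypothetical stand-ins (the real ColumnSchema.Value is not modeled here), and treating a nullable column as the "super" of its non-nullable counterpart is an assumption of this toy model.

```kotlin
// Toy model only: ValueColumn is an illustrative assumption, not ColumnSchema.Value.
enum class ComparisonMode { LENIENT, STRICT, STRICT_FOR_NESTED_SCHEMAS }

enum class CompareResult { Equals, IsSuper, IsDerived, None }

data class ValueColumn(val typeName: String, val nullable: Boolean) {
    fun compare(
        other: ValueColumn,
        mode: ComparisonMode = ComparisonMode.LENIENT,
    ): CompareResult {
        // Step 1: lenient comparison that distinguishes nullability.
        val lenient = when {
            typeName != other.typeName -> CompareResult.None
            nullable == other.nullable -> CompareResult.Equals
            nullable -> CompareResult.IsSuper // e.g. String? accepts every String
            else -> CompareResult.IsDerived
        }
        // Step 2: the "strictness increase". STRICT keeps only exact matches.
        // Assumption: a column nested inside a group is compared with STRICT
        // when the caller started with STRICT_FOR_NESTED_SCHEMAS.
        return if (mode == ComparisonMode.STRICT && lenient != CompareResult.Equals) {
            CompareResult.None
        } else {
            lenient
        }
    }
}
```

With this model, a nullable String column compared against a non-nullable one yields IsSuper in LENIENT mode and None in STRICT mode, which matches the mixed-nullability case the linked issue describes.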

@Jolanrensen
Collaborator Author

> Seems good, but it's quite hard to review with many stylistic / non-functional changes :(

Sorry, I just had no clue what was going on before refactoring, let alone debug how it should behave. Hopefully the new approach expresses the intention behind the code better :)

@Jolanrensen Jolanrensen requested a review from koperagen June 16, 2025 14:04
@Jolanrensen Jolanrensen merged commit 1d6756e into master Jun 17, 2025
6 checks passed
Development

Successfully merging this pull request may close these issues.

Jupyter: Data schema generation bug with mixed nullability of similarly named column
2 participants