Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51067][SQL] Revert session level collation as object level collation will be used instead #49772

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

dejankrak-db
Copy link
Contributor

@dejankrak-db dejankrak-db commented Feb 3, 2025

What changes were proposed in this pull request?

This PR is a partial revert of the original PR #48962 that introduced the resolution of default session level collation for DDL and DML queries.
The part that is reverted is the default collation resolution for DML queries, whereas the part that is kept is the default collation resolution for DDL queries, which is required to apply the object level collation that was introduced as part of PR #49084.

Why are the changes needed?

As there were some unresolved technical issues when attempting to merge the functionality from PR #48962 on Delta side, due to its effect on DML queries, it was decided to pause this functionality for now, thus partially reverting unused parts for maintaining a cleaner code moving forward.
Also, this is inline with customer feedback where object level collation is much more requested functionality, so the focus is to introduce the resolution of object level collation for DDL queries instead, allowing the collation to be specified per table or view on their creation or modification, with propagating the default collation specified to subsequent queries on top of those entities.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests that cover the collations functionality, as well as some of the new dedicated tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Feb 3, 2025
@dejankrak-db
Copy link
Contributor Author

@cloud-fan, @stefankandic, please take a look - this is just a revert of PR #48962, as we decided not to proceed with session level collations for now, and will do a follow up to apply object level collations for queries.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the other audience, could you provide a link for this decision, @dejankrak-db ?

The decision has since been made not to ship this functionality for now,

@dejankrak-db dejankrak-db changed the title [SPARK-51067][SQL] Revert session level collation changes [SPARK-51067][SQL] Partially revert session level collation as object level collation will be used instead Feb 4, 2025
@dejankrak-db
Copy link
Contributor Author

dejankrak-db commented Feb 4, 2025

For the other audience, could you provide a link for this decision, @dejankrak-db ?

The decision has since been made not to ship this functionality for now,

@dongjoon-hyun , there are 2 main reasons for this decision:

  1. There were some unresolved technical issues when attempting to merge the original PR functionality on Delta side, due to its effect on DML queries when changing the underlying collation in this way.
  2. As per customer feedback gathered so far, object level collation is much more requested functionality, whereas there were no explicit requests for default session level collation so far, hence the focus has shifted to introducing the resolution of object level collation for DDL queries instead, allowing the collation to be specified per table or view on their creation or modification, with propagating the default collation specified to subsequent queries on top of those entities.

Therefore, it was decided to pause session level collation functionality for now, thus partially reverting unused parts of the original PR for maintaining a cleaner code moving forward, while still keeping other parts required to support object level collation resolution. Hope this clarifies the reasoning well! I have also updated the PR description with this info, thanks!

@@ -47,7 +46,7 @@ object ResolveDefaultStringTypes extends Rule[LogicalPlan] {
if (isDDLCommand(plan)) {
transformDDL(plan)
} else {
transformPlan(plan, sessionDefaultStringType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we remove the transformPlan method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also remove the hack in the apply method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefankandic kindly helped refactor this code to remove all unnecessary/unused references, but we still need to do transform plan for DML statements using the default string type which is now UTF8_BINARY, and the apply method logic is still needed to ensure correct results where default string type is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire rule is useless now because there is no longer session collation. The DDL collation resolution is not implemented yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can think of it as writing a new rule to resolve DDL commands, and it should be very different from the current form.

@cloud-fan
Copy link
Contributor

I'm good with removing this hacky feature. It's too fragile to use object StringType as undetermined string collation, and hard for third party Spark extensions to follow.

@github-actions github-actions bot added the AVRO label Feb 7, 2025
@dejankrak-db dejankrak-db changed the title [SPARK-51067][SQL] Partially revert session level collation as object level collation will be used instead [SPARK-51067][SQL] Revert session level collation as object level collation will be used instead Feb 7, 2025
@dejankrak-db
Copy link
Contributor Author

I'm good with removing this hacky feature. It's too fragile to use object StringType as undetermined string collation, and hard for third party Spark extensions to follow.

@cloud-fan, we actually agreed on fully removing the associated DEFAULT_COLLATION and defaultStringType from the code, which essentially removes the entire feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants