[SPARK-51067][SQL] Revert session level collation as object level collation will be used instead #49772

dejankrak-db · 2025-02-03T12:32:27Z

What changes were proposed in this pull request?

This PR is a partial revert of PR #48962, which introduced resolution of the default session-level collation for DDL and DML queries.
The default collation resolution for DML queries is reverted, while the resolution for DDL queries is kept, because it is required to apply the object-level collation introduced in PR #49084.

Why are the changes needed?

Merging the functionality from PR #48962 on the Delta side ran into unresolved technical issues caused by its effect on DML queries, so it was decided to pause this functionality for now and partially revert the unused parts to keep the code cleaner going forward.
This is also in line with customer feedback, where object-level collation is a much more requested feature. The focus is therefore on resolving object-level collation for DDL queries instead, allowing the collation to be specified per table or view at creation or modification time, with the specified default collation propagating to subsequent queries on those entities.
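To make the object-level idea above concrete, here is a minimal Scala sketch of a table-level default collation propagating to its string columns. It is a toy model, not Spark code: `Column`, `Table`, `ObjectLevelCollation`, and `resolve` are hypothetical names, and the precedence order (explicit column collation, then table default, then UTF8_BINARY) is an assumption made for illustration.

```scala
// Toy model only: these types are hypothetical and are not Spark APIs.
final case class Column(name: String, collation: Option[String])
final case class Table(name: String, defaultCollation: Option[String], columns: Seq[Column])

object ObjectLevelCollation {
  // Assumed fallback when neither the column nor the table names a collation.
  val SystemDefault = "UTF8_BINARY"

  // Assumed precedence: explicit column collation, then table default, then fallback.
  def resolve(table: Table): Seq[(String, String)] =
    table.columns.map { c =>
      c.name -> c.collation.orElse(table.defaultCollation).getOrElse(SystemDefault)
    }

  def main(args: Array[String]): Unit = {
    val people = Table(
      "people",
      defaultCollation = Some("UNICODE_CI"),
      columns = Seq(Column("name", None), Column("code", Some("UTF8_LCASE")))
    )
    // prints "name -> UNICODE_CI" then "code -> UTF8_LCASE"
    resolve(people).foreach { case (n, c) => println(s"$n -> $c") }
  }
}
```

In this sketch the default lives on the table object rather than on the session, so the resolved collation of a column does not depend on which session reads it.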

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests that cover collation functionality, as well as some new dedicated tests.

Was this patch authored or co-authored using generative AI tooling?

No

dejankrak-db · 2025-02-03T12:34:18Z

@cloud-fan, @stefankandic, please take a look. This is just a revert of PR #48962, as we decided not to proceed with session-level collations for now; we will do a follow-up to apply object-level collations for queries.

dongjoon-hyun

For the other audience, could you provide a link for this decision, @dejankrak-db ?

The decision has since been made not to ship this functionality for now,

dejankrak-db · 2025-02-04T00:26:23Z

@dongjoon-hyun, there are two main reasons for this decision:

1. There were unresolved technical issues when attempting to merge the original PR's functionality on the Delta side, due to its effect on DML queries when the underlying collation is changed in this way.
2. According to customer feedback gathered so far, object-level collation is a much more requested feature, and there have been no explicit requests for a default session-level collation. The focus has therefore shifted to resolving object-level collation for DDL queries instead, allowing the collation to be specified per table or view at creation or modification time, with the specified default collation propagating to subsequent queries on those entities.

Therefore, it was decided to pause session-level collation for now and partially revert the unused parts of the original PR to keep the code cleaner going forward, while keeping the parts required to support object-level collation resolution. Hope this clarifies the reasoning! I have also updated the PR description with this info, thanks!

cloud-fan · 2025-02-05T08:25:13Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultStringTypes.scala

@@ -47,7 +46,7 @@ object ResolveDefaultStringTypes extends Rule[LogicalPlan] {
    if (isDDLCommand(plan)) {
      transformDDL(plan)
    } else {
-      transformPlan(plan, sessionDefaultStringType)


Shall we remove the transformPlan method?

cloud-fan · 2025-02-05

Can we also remove the hack in the apply method?

dejankrak-db · 2025-02-07

@stefankandic kindly helped refactor this code to remove all unnecessary or unused references, but we still need to transform the plan for DML statements using the default string type, which is now UTF8_BINARY, and the apply method logic is still needed to ensure correct results where the default string type is used.

cloud-fan · 2025-02-07

This entire rule is useless now because there is no longer a session collation. The DDL collation resolution is not implemented yet.

You can think of it as writing a new rule to resolve DDL commands, and it should be very different from the current form.

cloud-fan · 2025-02-05T08:28:26Z

I'm good with removing this hacky feature. It's too fragile to use object StringType as an undetermined string collation, and hard for third-party Spark extensions to follow.
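One way to see the fragility described in this comment is a small Scala sketch in which a singleton object doubles as the "collation not yet decided" marker. This is a toy model under stated assumptions, not Spark's actual classes: `StrType`, `SentinelDemo`, and `isUndetermined` are hypothetical names, and the identity-based check is assumed for illustration.

```scala
// Toy model only: StrType is hypothetical and is not Spark's StringType.
class StrType(val collationId: Int) {
  override def equals(other: Any): Boolean = other match {
    case s: StrType => s.collationId == collationId
    case _          => false
  }
  override def hashCode(): Int = collationId.hashCode
}

// The singleton doubles as the "undetermined" marker (assumed design).
object StrType extends StrType(0)

object SentinelDemo {
  // The marker is detected by reference identity, not by value.
  def isUndetermined(t: StrType): Boolean = t eq StrType

  def main(args: Array[String]): Unit = {
    // An equal value built separately, e.g. by code unaware of the convention.
    val rebuilt = new StrType(0)
    println(rebuilt == StrType)      // prints "true": the two values compare equal
    println(isUndetermined(StrType)) // prints "true"
    println(isUndetermined(rebuilt)) // prints "false": the marker meaning is lost
  }
}
```

In this sketch, two values that compare equal carry different meanings depending on which instance they are, so any code that copies or rebuilds the type silently drops the "undetermined" state.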

dejankrak-db · 2025-02-07T02:13:35Z

@cloud-fan, we actually agreed on fully removing the associated DEFAULT_COLLATION and defaultStringType from the code, which essentially removes the entire feature.

Addressing merge conflicts

7b7fdb1

github-actions bot added the SQL label Feb 3, 2025

Fixing dependencies

fcca8b1

dongjoon-hyun reviewed Feb 3, 2025

dejankrak-db added 7 commits February 3, 2025 19:51

Reintroducing default string type resolution

61659b5

Including analyzer rule

60dee86

Reintroducing parts of the previous resolution logic

0b822a5

Reintroducing remaining changes from the original PR that should remain

f096e52

Fix indentation

0295f03

Minor indentation fix

6e050df

Align brackets

6026281

dejankrak-db changed the title ~~[SPARK-51067][SQL] Revert session level collation changes~~ [SPARK-51067][SQL] Partially revert session level collation as object level collation will be used instead Feb 4, 2025

Remove string resolution clause in CollationTypeCoercion

1b49e57

cloud-fan reviewed Feb 5, 2025

dejankrak-db and others added 12 commits February 7, 2025 01:23

Revert

4649b2b

Revert

e95325d

Revert

8b60e26

Revert

86dd157

Revert

5a8f898

Revert

724496d

Revert

c2d11c8

Revert

1dab4f6

Revert

ec9a59d

Revert

8ee3481

initial

d387e9e

initial

0783ec5

stefankandic added 4 commits February 7, 2025 02:16

initial

aa1223a

initial

0637483

fix default suite

c430d8c

fix hll test

c1e86be

github-actions bot added the AVRO label Feb 7, 2025

dejankrak-db changed the title ~~[SPARK-51067][SQL] Partially revert session level collation as object level collation will be used instead~~ [SPARK-51067][SQL] Revert session level collation as object level collation will be used instead Feb 7, 2025

dejankrak-db added 2 commits February 7, 2025 03:07

Merge with latest master

8e94a68

Merge branch 'apache:master' into revert-session-collations

a99c407
