Skip to content

Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype #15239

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Mar 18, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype

What changes are included in this PR?

Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype

Are these changes tested?

yes

Are there any user-facing changes?

Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype.

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 15, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuqi-lucas -- I have one request otherwise I think this looks good to me

@github-actions github-actions bot added the common Related to common crate label Mar 16, 2025
@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 16, 2025

Thanks @zhuqi-lucas -- I have one request otherwise I think this looks good to me

Thank you @alamb for review, i change back the original call to equivalent_names_and_types and add cast checking, because the check_arrow_schema_type_compatible will cause testing failed, it only check the datatype not checking other things such as filed length and schema field name.

Also clean up the code which check_arrow_schema_type_compatible is not used.

@github-actions github-actions bot added the substrait Changes to the substrait crate label Mar 16, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuqi-lucas -- this code looks really nice 👨‍🍳 👌

I think we should avoid API breakages but otherwise looks great to me

///
/// This is a specialized version of Eq that ignores differences
/// in nullability and metadata.
///
/// Use [DFSchema]::logically_equivalent_names_and_types for a weaker
/// logical type checking, which for example would consider a dictionary
/// encoded UTF8 array to be equivalent to a plain UTF8 array.
pub fn equivalent_names_and_types(&self, other: &Self) -> bool {
pub fn equivalent_names_and_types(&self, other: &Self) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, to ease upgrade, can you please put the old function signature back and mark it deprecated and put the new signature as a new function? Perhaps something like

   #[deprecated(since = "47.0.0", note = "Use has_equivalent_names_and_types`  instead")]
    pub fn equivalent_names_and_types(&self, other: &Self) -> bool {
      self.has_equivalent_names_and_types.is_err()
    }

    pub fn has_equivalent_names_and_types(&self, other: &Self) -> Result<()> { 
...
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for view, got it, good guide for me to upgrade. Addressed in latest PR.

@zhuqi-lucas
Copy link
Contributor Author

Thanks @zhuqi-lucas -- this code looks really nice 👨‍🍳 👌

I think we should avoid API breakages but otherwise looks great to me

Thank you for review @alamb , addressed it in latest PR.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuqi-lucas and @xudong963 -- I pushed a small commit to move / update the documentation.

I think this is ready to merge now

@alamb alamb merged commit 9e1e32f into apache:main Mar 18, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 18, 2025

Thanks again

@zhuqi-lucas
Copy link
Contributor Author

Thank you @alamb and @xudong963 for review!

|| (!DFSchema::datatype_is_semantically_equal(
f1.data_type(),
f2.data_type(),
) && !can_cast_types(f2.data_type(), f1.data_type()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhuqi-lucas this addition of "can_cast_types" broken Substrait consumer (since we use this logic to decide if we should cast things to get names to match - this now always passes).

But also overall using can_cast_types here doesn't feel right. Can_cast_types is very lenient. I think what you'd want to do is to allow Utf8 -> Utf8View in logically_equivalent_names_and_types and switch the call site you're interested to use that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Blizzara for testing and feedback, i am sorry for the broken Substrait consumer which i was not aware. I will help review when you submit the fix.

@zhuqi-lucas this addition of "can_cast_types" broken Substrait consumer (since we use this logic to decide if we should cast things to get names to match - this now always passes).

But also overall using can_cast_types here doesn't feel right. Can_cast_types is very lenient. I think what you'd want to do is to allow Utf8 -> Utf8View in logically_equivalent_names_and_types and switch the call site you're interested to use that one.

alamb pushed a commit that referenced this pull request Apr 10, 2025
…ds (#15634)

* add a failing test case for #15239 (comment)

* fix invariants to use logical schema equality instead, and mark utf8 and utf8view as logically equivalent

* fix logical equivalence to be strictly superset of semantic equivalence, including ignoring decimal precision/scale
@@ -564,6 +564,7 @@ impl DFSchema {
}

/// Check to see if fields in 2 Arrow schemas are compatible
#[deprecated(since = "47.0.0", note = "This method is no longer used")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a replaceable method for this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we have a similar API already:

logically_equivalent_names_and_types

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…ds (apache#15634)

* add a failing test case for apache#15239 (comment)

* fix invariants to use logical schema equality instead, and mark utf8 and utf8view as logically equivalent

* fix logical equivalence to be strictly superset of semantic equivalence, including ignoring decimal precision/scale
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants