-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype #15239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ombined with Utf8 datatype
…ombined with Utf8 datatype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuqi-lucas -- I have one request otherwise I think this looks good to me
Thank you @alamb for review, i change back the original call to equivalent_names_and_types and add cast checking, because the check_arrow_schema_type_compatible will cause testing failed, it only check the datatype not checking other things such as filed length and schema field name. Also clean up the code which check_arrow_schema_type_compatible is not used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuqi-lucas -- this code looks really nice 👨🍳 👌
I think we should avoid API breakages but otherwise looks great to me
datafusion/common/src/dfschema.rs
Outdated
/// | ||
/// This is a specialized version of Eq that ignores differences | ||
/// in nullability and metadata. | ||
/// | ||
/// Use [DFSchema]::logically_equivalent_names_and_types for a weaker | ||
/// logical type checking, which for example would consider a dictionary | ||
/// encoded UTF8 array to be equivalent to a plain UTF8 array. | ||
pub fn equivalent_names_and_types(&self, other: &Self) -> bool { | ||
pub fn equivalent_names_and_types(&self, other: &Self) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise, to ease upgrade, can you please put the old function signature back and mark it deprecated and put the new signature as a new function? Perhaps something like
#[deprecated(since = "47.0.0", note = "Use has_equivalent_names_and_types` instead")]
pub fn equivalent_names_and_types(&self, other: &Self) -> bool {
self.has_equivalent_names_and_types.is_err()
}
pub fn has_equivalent_names_and_types(&self, other: &Self) -> Result<()> {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for view, got it, good guide for me to upgrade. Addressed in latest PR.
Thank you for review @alamb , addressed it in latest PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuqi-lucas and @xudong963 -- I pushed a small commit to move / update the documentation.
I think this is ready to merge now
Thanks again |
Thank you @alamb and @xudong963 for review! |
|| (!DFSchema::datatype_is_semantically_equal( | ||
f1.data_type(), | ||
f2.data_type(), | ||
) && !can_cast_types(f2.data_type(), f1.data_type())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhuqi-lucas this addition of "can_cast_types" broken Substrait consumer (since we use this logic to decide if we should cast things to get names to match - this now always passes).
But also overall using can_cast_types
here doesn't feel right. Can_cast_types is very lenient. I think what you'd want to do is to allow Utf8 -> Utf8View in logically_equivalent_names_and_types
and switch the call site you're interested to use that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Blizzara for testing and feedback, i am sorry for the broken Substrait consumer which i was not aware. I will help review when you submit the fix.
@zhuqi-lucas this addition of "can_cast_types" broken Substrait consumer (since we use this logic to decide if we should cast things to get names to match - this now always passes).
But also overall using
can_cast_types
here doesn't feel right. Can_cast_types is very lenient. I think what you'd want to do is to allow Utf8 -> Utf8View inlogically_equivalent_names_and_types
and switch the call site you're interested to use that one.
…ds (#15634) * add a failing test case for #15239 (comment) * fix invariants to use logical schema equality instead, and mark utf8 and utf8view as logically equivalent * fix logical equivalence to be strictly superset of semantic equivalence, including ignoring decimal precision/scale
@@ -564,6 +564,7 @@ impl DFSchema { | |||
} | |||
|
|||
/// Check to see if fields in 2 Arrow schemas are compatible | |||
#[deprecated(since = "47.0.0", note = "This method is no longer used")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a replaceable method for this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we have a similar API already:
logically_equivalent_names_and_types
…ds (apache#15634) * add a failing test case for apache#15239 (comment) * fix invariants to use logical schema equality instead, and mark utf8 and utf8view as logically equivalent * fix logical equivalence to be strictly superset of semantic equivalence, including ignoring decimal precision/scale
Which issue does this PR close?
Rationale for this change
Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype
What changes are included in this PR?
Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype
Are these changes tested?
yes
Are there any user-facing changes?
Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype.