Skip to content

fix(substrait): fix regressed edge case in renaming inner struct fields #15634

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Apr 8, 2025

Which issue does this PR close?

#15239 introduced a regression in Substrait consumer, where if a field was correctly named but contained struct fields which were not, those would not be renamed. See #15239 (comment)

However, the logic used in that PR feels wrong also on a more general level. The can_cast_types is a very general check, so has_equivalent_names_and_types should quite certainly not use it as a criteria.

Rationale for this change

What changes are included in this PR?

  1. add a failing test case for Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype #15239 (comment)
  2. switch invariants to use logically_equivalent_names_and_types instead, since that's meant to be a weaker comparison
  3. mark Utf8 and Utf8View as logically the same

This does make the check in invariants tighter again. I'd argue that's a good thing, since the "can_cast_types" check is very relaxed and certainly doesn't fulfill the criteria in https://datafusion.apache.org/contributor-guide/specification/invariants.html#logical-schema-is-invariant-under-logical-optimization. That said, I'm not sure if the Utf8->Utf8View change fulfills that criteria either, but makes sense to me to allow it?

Are these changes tested?

Added a new test to catch the regression, + existing tests for the invariants.

Are there any user-facing changes?

@github-actions github-actions bot added the substrait Changes to the substrait crate label Apr 8, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules common Related to common crate labels Apr 8, 2025
@@ -1061,7 +1061,7 @@ async fn roundtrip_literal_list() -> Result<()> {
async fn roundtrip_literal_struct() -> Result<()> {
let plan = generate_plan_from_sql(
"SELECT STRUCT(1, true, CAST(NULL AS STRING)) FROM data",
false,
true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated, but it was passing with true so why not have that

@@ -1076,6 +1076,46 @@ async fn roundtrip_literal_struct() -> Result<()> {
Ok(())
}

#[tokio::test]
async fn roundtrip_literal_named_struct() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this uses a named_struct() func so it wasn't hitting the bug, but I kept it as an addition still

@Blizzara Blizzara marked this pull request as ready for review April 8, 2025 12:43
@Blizzara
Copy link
Contributor Author

Blizzara commented Apr 8, 2025

FYI @zhuqi-lucas @alamb @xudong963 from the earlier PR

@@ -711,6 +711,9 @@ impl DFSchema {
.zip(iter2)
.all(|((t1, f1), (t2, f2))| t1 == t2 && Self::field_is_logically_equal(f1, f2))
}
// Utf8 and Utf8View are logically equivalent
(DataType::Utf8, DataType::Utf8View) => true,
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Blizzara for fix, this new relax looks good to me, and if we want more relax type, we can add later.

…ce, including ignoring decimal precision/scale
// This test aims to hit a case where the struct column itself has the expected name, but its
// inner field needs to be renamed.
let plan = generate_plan_from_sql(
"SELECT CAST((STRUCT(1)) AS Struct<\"int_field\"Int>) AS 'Struct({c0:1})' FROM data",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this query is converting String --> StringView

(this seems like a good test in my opinion)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, there's no String -> StringView in this test. This is just the test the regression that happened in Substrait consumer, where we were no longer renaming the inner fields, as the "is the schema the same" check passed too easily.

The String -> StringView is tested by the slt tests added in #15239

@alamb
Copy link
Contributor

alamb commented Apr 9, 2025

Thanks @Blizzara and @zhuqi-lucas

@alamb alamb merged commit 3b2df6f into apache:main Apr 10, 2025
27 checks passed
@Blizzara Blizzara deleted the avo/substrait-consumer-inner-struct-fields-edge-case branch April 11, 2025 08:06
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…ds (apache#15634)

* add a failing test case for apache#15239 (comment)

* fix invariants to use logical schema equality instead, and mark utf8 and utf8view as logically equivalent

* fix logical equivalence to be strictly superset of semantic equivalence, including ignoring decimal precision/scale
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate logical-expr Logical plan and expressions optimizer Optimizer rules substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants