Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Support arrays_overlap function #1312

Merged
merged 1 commit into from
Jan 28, 2025

Conversation

erenavsarogullari
Copy link
Member

@erenavsarogullari erenavsarogullari commented Jan 20, 2025

Which issue does this PR close?

Related to Epic: #1042
arrays_overlap: select arrays_overlap(array('hello', '-', 'world'), array('hello')) => true
DataFusion' s array_has_any has same behavior with Spark 's arrays_overlap function
Spark: https://docs.databricks.com/en/sql/language-manual/functions/arrays_overlap.html
DataFusion: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#array-has-any

Rationale for this change

Defined under Epic: #1042

What changes are included in this PR?

planner.rs: Maps Spark 's arrays_overlap function to DataFusion array_has_any physical expression from Spark physical expression with return type: DataType::Boolean,
expr.proto: arrays_overlap message has been added,
QueryPlanSerde.scala: arrays_overlap pattern matching case has been added,
CometExpressionSuite.scala: A new UT has been added for arrays_overlap function.

How are these changes tested?

A new UT has been added.

@codecov-commenter
Copy link

codecov-commenter commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 57.14286% with 6 lines in your changes missing coverage. Please review.

Project coverage is 39.09%. Comparing base (f09f8af) to head (96f776e).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 57.14% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1312       +/-   ##
=============================================
- Coverage     56.12%   39.09%   -17.04%     
- Complexity      976     2065     +1089     
=============================================
  Files           119      260      +141     
  Lines         11743    60269    +48526     
  Branches       2251    12834    +10583     
=============================================
+ Hits           6591    23562    +16971     
- Misses         4012    32226    +28214     
- Partials       1140     4481     +3341     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -2300,6 +2300,12 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim
expr.children(1),
inputs,
(builder, binaryExpr) => builder.setArrayAppend(binaryExpr))
case ArraysOverlap(leftArrayExpr, rightArrayExpr) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we support this for all input data types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the PR only contains basic tests, could you add a check to enable this expression only if CometConf.COMET_CAST_ALLOW_INCOMPATIBLE is enabled? We can remove this check in a future PR that adds comprehensive tests and demonstrates that we have Spark-compatible behavior for all supported data types.`

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense and COMET_CAST_ALLOW_INCOMPATIBLE check has been added.

@@ -2568,4 +2568,21 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
}
}
}

test("arrays_overlap") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests where one or both input expressions evaluate to null on some rows? This can be achieved with a CASE expression as in some of the other array tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for more coverage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case Expression based test case has been added.

@andygrove
Copy link
Member

@erenavsarogullari could you rebase this PR when you have time? We can go ahead and merge this and then improve tests as part of #1269

@erenavsarogullari
Copy link
Member Author

erenavsarogullari commented Jan 28, 2025

@erenavsarogullari could you rebase this PR when you have time? We can go ahead and merge this and then improve tests as part of #1269

Thanks @andygrove and @kazuyukitanimura for the review. PR has been rebased and addressed the comments.

FYI, Currently, DataFusion does not support arrays_overlap function but supports same functionality by array_has_any function. Other query engines like Snowflake, Presto support arrays_overlap function. In terms of this, i also created DataFusion PR: apache/datafusion#14217

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @erenavsarogullari. LGTM.

checkSparkAnswerAndOperator(sql(
"SELECT arrays_overlap(array('a', null), array('b', null)) from t1 where _1 is not null"))
checkSparkAnswerAndOperator(spark.sql(
"SELECT arrays_overlap((CASE WHEN _2 =_3 THEN array(_6, _7) END), array(_6, _7)) FROM t1"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think _2 = _3 may always be true. The problem with makeParquetFileAllTypes is that every for each row, each column contains the same integer value cast to the column's type, so it is not ideal for tests like this. We can improve as part of #1269

@andygrove andygrove merged commit 07274e8 into apache:main Jan 28, 2025
74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants