Feat: Support array_intersect function #1271

erenavsarogullari · 2025-01-13T01:52:31Z

Which issue does this PR close?

Related to Epic: #1042
array_intersect: select array_intersect(array(1, 2, 3), array(2, 3, 4)) => array(2, 3)
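
For readers unfamiliar with the function, the example above can be sketched in plain Rust. This is only an illustration of the intended semantics (the result keeps elements common to both arrays, without duplicates). It is not the code in planner.rs or in DataFusion; the function name and the choice to keep the order of the first array are assumptions made for the sketch.

```rust
use std::collections::HashSet;

/// Illustrative only: intersection of two integer arrays, deduplicated,
/// in the order elements first appear in `left` (ordering is an assumption).
fn array_intersect(left: &[i32], right: &[i32]) -> Vec<i32> {
    // Build a lookup set from the right-hand array.
    let right_set: HashSet<i32> = right.iter().copied().collect();
    // Track values already emitted so each appears at most once.
    let mut seen = HashSet::new();
    left.iter()
        .copied()
        .filter(|v| right_set.contains(v) && seen.insert(*v))
        .collect()
}
```

Calling `array_intersect(&[1, 2, 3], &[2, 3, 4])` with this sketch yields `[2, 3]`, matching the SQL example.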

Rationale for this change

Defined under Epic: #1042

What changes are included in this PR?

planner.rs: creates the DataFusion array_intersect physical expression from the Spark physical expression.
expr.proto: adds the array_intersect message.
QueryPlanSerde.scala: adds the array_intersect pattern-matching case.
CometExpressionSuite.scala: adds a new unit test for the array_intersect function.

How are these changes tested?

A new unit test has been added to CometExpressionSuite.scala.

andygrove · 2025-01-13T14:55:45Z

spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

+        makeParquetFileAllTypes(path, dictionaryEnabled, 10000)
+        spark.read.parquet(path.toString).createOrReplaceTempView("t1")
+        checkSparkAnswerAndOperator(
+          sql("SELECT array_intersect(array(_2, _3, _4), array(_9, _10)) from t1"))


It isn't obvious to me whether any of these arrays actually intersect. Perhaps you could add one that is guaranteed to intersect, such as array_intersect(array(_2, _3, _4), array(_3, _4)). Or does Spark optimize that out?

erenavsarogullari · 2025-01-14

Thanks for the review. I have updated the unit test case. The Spark and Comet physical plans are as follows:
Spark Physical Plan:

*(1) Project [array_intersect(array(cast(_2#1 as int), cast(_3#2 as int), _4#3), array(cast(_3#2 as int), _4#3)) AS array_intersect(array(_2, _3, _4), array(_3, _4))#44]
+- *(1) ColumnarToRow
   +- FileScan parquet [_2#1,_3#2,_4#3] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/jq/jhn012m16zzg7dc9lcgbdvjc0000gp/T/spark-97..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_2:tinyint,_3:smallint,_4:int>

Comet Physical Plan:

*(1) CometColumnarToRow
+- CometProject [array_intersect(array(_2, _3, _4), array(_3, _4))#49], [array_intersect(array(cast(_2#1 as int), cast(_3#2 as int), _4#3), array(cast(_3#2 as int), _4#3)) AS array_intersect(array(_2, _3, _4), array(_3, _4))#49]
   +- CometScan parquet [_2#1,_3#2,_4#3] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/jq/jhn012m16zzg7dc9lcgbdvjc0000gp/T/spark-97..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_2:tinyint,_3:smallint,_4:int>

Also, the execution result is as follows (by test.parquet):

andygrove · 2025-01-14

Did you intend to add a commit that updates the unit test? I don't see any changes.

erenavsarogullari · 2025-01-14

Yes, I have just pushed it. Thanks for letting me know.

andygrove

Thanks @erenavsarogullari. It would be good to also have tests for null and empty arrays and also for other data types such as strings, but I think we can handle that as part of #1269 since this applies to all of the recently added array functions.
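
The extra cases mentioned here (null elements, empty arrays, and non-integer types such as strings) can be illustrated with a small generic sketch. It is not Comet or DataFusion code: the function name is hypothetical, `Option<T>` stands in for a nullable element, and the assumptions that a null element matches another null and that the output follows the order of the first array should be verified against Spark before being relied on in a test.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Illustrative only: intersection over nullable elements of any hashable type.
/// Assumption: `None` (null) matches `None`, and output order follows `left`.
fn array_intersect_nullable<T: Eq + Hash + Clone>(
    left: &[Option<T>],
    right: &[Option<T>],
) -> Vec<Option<T>> {
    // Lookup set over references into the right-hand array.
    let right_set: HashSet<&Option<T>> = right.iter().collect();
    // Elements already emitted, so duplicates in `left` appear once.
    let mut seen: HashSet<&Option<T>> = HashSet::new();
    let mut out = Vec::new();
    for v in left {
        if right_set.contains(v) && seen.insert(v) {
            out.push(v.clone());
        }
    }
    out
}
```

Under these assumptions, an empty input on either side gives an empty result, and the same function covers string elements, for example `array_intersect_nullable(&[Some("a"), Some("b")], &[Some("b")])`.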

erenavsarogullari · 2025-01-15T00:35:30Z

Sure @andygrove. I can also work on #1269 by assigning myself and cover array_intersect as part of it.

andygrove · 2025-01-18T16:18:19Z

Thanks @erenavsarogullari. It would be great to have help with this. I will try and add some more notes to the issue with suggestions for how we can improve coverage.

erenavsarogullari · 2025-01-19T19:40:04Z

Thanks for #1308. We will need to apply the same approach to the other array functions after #1308 is merged, as part of #1269. I think our scope here is to test all supported types per array function and catch any violations that surface after the analysis phase has passed.

andygrove · 2025-01-21T19:24:51Z

I agree. We are hoping to merge the comet-parquet-exec branch into main today or tomorrow, and once that is done I will go ahead and start merging the current array function PRs and then we can work on the testing.

Feat: Support array_intersect

015aeb6

andygrove reviewed Jan 13, 2025

Address review comment

7bcf6cb

andygrove approved these changes Jan 15, 2025

andygrove mentioned this pull request Jan 7, 2025

[EPIC] Add support for all array expressions #1042

Open

21 tasks

erenavsarogullari changed the title ~~Feat: Support array_intersect~~ Feat: Support array_intersect function Jan 20, 2025

andygrove merged commit 824ad1a into apache:main Jan 21, 2025
75 checks passed
