fix: SortMergeJoin for timestamp keys #1901

Open: SKY-ALIN wants to merge 5 commits into main

Conversation

SKY-ALIN

Which issue does this PR close?

Closes #1900.

Rationale for this change

This type is supported, but it was missed at the proto serialization stage, and the fallback message formatting is incorrect.

How are these changes tested?

These changes are tested locally by comparing results from Spark with and without the Comet extension.

@SKY-ALIN
Author

SKY-ALIN commented Jun 18, 2025

It also fixes the message formatting; now it looks like this:

25/06/18 12:53:47 WARN CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
Project
+- BroadcastHashJoin
   :- Project
   :  +- Window
   :     +- Sort
   :        +-  Exchange [COMET: ]
   :           +- Project
   :              +-  SortMergeJoin [COMET: Unsupported join key type TimestampType on key CAST(time AS TIMESTAMP)]
...

After adding TimestampType to the match statement, this message no longer appears at all; just noting it for clarity. :)

@SKY-ALIN SKY-ALIN changed the title Fix SortMergeJoin for timestamp keys fix: SortMergeJoin for timestamp keys Jun 18, 2025
@mbutrovich
Contributor

mbutrovich commented Jun 18, 2025

Thanks for the contribution, @SKY-ALIN! Could we add a test case with timestamps as the join key?

@@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {
    */
   private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean = dataType match {
     case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: FloatType |
-        _: DoubleType | _: StringType | _: DateType | _: DecimalType | _: BooleanType =>
+        _: DoubleType | _: StringType | _: DateType | _: DecimalType | _: BooleanType |
+        _: TimestampType =>
Member

Should TimestampNTZType also be supported?

Author

TimestampNTZType is supported; it is handled by a separate case one line below.

    case TimestampNTZType => true
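
For reference, a sketch of how the relevant part of the match likely reads after this change, assembled from the diff above and the quoted line (the fallthrough case and the case bodies are assumptions):

  // Sketch only: pieced together from the diff and the quoted line; the
  // exact surrounding cases and bodies are assumptions.
  private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean = dataType match {
    case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: FloatType |
        _: DoubleType | _: StringType | _: DateType | _: DecimalType | _: BooleanType |
        _: TimestampType =>
      true
    // TimestampNTZType is matched by its own case, as noted above.
    case TimestampNTZType => true
    case _ => false
  }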

@parthchandra
Contributor

> Thanks for the contribution, @SKY-ALIN! Could we add a test case with timestamps as the join key?

The test should have the left-side and right-side timestamps in different timezones.

> Should TimestampNTZType also be supported?

If the above test case passes, we probably can.

@codecov-commenter

codecov-commenter commented Jun 19, 2025

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 42.54%. Comparing base (f09f8af) to head (75d681d).
Report is 270 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 0.00% 4 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1901       +/-   ##
=============================================
- Coverage     56.12%   42.54%   -13.58%     
+ Complexity      976      938       -38     
=============================================
  Files           119      130       +11     
  Lines         11743    12828     +1085     
  Branches       2251     2414      +163     
=============================================
- Hits           6591     5458     -1133     
- Misses         4012     6283     +2271     
+ Partials       1140     1087       -53     


@SKY-ALIN
Author

@mbutrovich done.

@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
       .toSeq)
   }
 
-  test("SortMergeJoin with unsupported key type should fall back to Spark") {
Contributor

The test you have added is great. Thank you!

However, this removed test exercises a few things that we should also check for: timestamps read from Parquet, and the actual plan created. For the latter, you can simply change the test to use checkSparkAnswerAndOperator and remove the check that the canonicalized plans are the same.

Also, to make sure we are really testing timestamps, the left side and the right side of the join should use timestamps in different timezones.

To create timestamps with different timezones, we can modify this test to create the test files separately:

  withSQLConf(
    SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
    SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    withTable("t1", "t2") {
      withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Australia/Darwin") {
        sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
        sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01 11:11:11')")
      }
      withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Canada/Pacific") {
        sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
        sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01 11:11:11')")
      }
      ...

The join above will return zero rows, since the two inserted timestamps, written under different session timezones, are not the same instant (see the sketch below).
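
For illustration (not from this PR), a minimal sketch of why the same literal written under two session timezones yields different instants and therefore an empty equality join; the spark session handle is assumed:

  // Spark interprets a TIMESTAMP literal in the session-local timezone and
  // stores it as a UTC instant, so the same wall-clock literal differs
  // between the two tables above.
  spark.conf.set("spark.sql.session.timeZone", "Australia/Darwin") // UTC+9:30
  val darwin = spark.sql(
    "SELECT unix_timestamp(timestamp'2019-01-01 11:11:11')").head.getLong(0)

  spark.conf.set("spark.sql.session.timeZone", "Canada/Pacific")   // UTC-8 in January
  val pacific = spark.sql(
    "SELECT unix_timestamp(timestamp'2019-01-01 11:11:11')").head.getLong(0)

  // darwin and pacific differ by 17.5 hours, so t1.time = t2.time matches nothing.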

Also, we could rename the test.

Author

Such a test will not work, because the current implementation of DataFusion doesn't handle this case.

Contributor

A DF join on timestamp columns does not take into account time zones? Then it wouldn't be correct, would it?

@parthchandra
Contributor

I don't think this PR is making the correct change.
With this PR, the removed test fails to execute the query (let alone pass the assertion):

  test("SortMergeJoin with unsupported key type should fall back to Spark") {
    withSQLConf(
      SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
      SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
      withTable("t1", "t2") {
        sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
        sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01 11:11:11')")

        sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
        sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01 11:11:11')")

        val df = sql("SELECT * FROM t1 JOIN t2 ON t1.time = t2.time")
        val (sparkPlan, cometPlan) = checkSparkAnswer(df)                // should NOT fail here, but does
        assert(sparkPlan.canonicalized === cometPlan.canonicalized)       // should fail here
      }
    }
  }

The reason for the failure is:

org.apache.comet.CometNativeException: Unsupported data type in sort merge join comparator: Timestamp(Microsecond, Some("UTC"))

This means that supportedSortMergeJoinEqualType should not return true for TimestampType.

The newly added test does not exercise the change. The plan (see below) does not include a SortMergeJoin, and the test passes both with and without the changes in this PR.

AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [ts#4], [ts#10], Inner, BuildRight, false
   :- LocalTableScan [ts#4]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, timestamp, true]),false), [plan_id=106]
      +- LocalTableScan [ts#10]
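
For reference, a sketch (not part of this PR) of how a test could force a SortMergeJoin so that supportedSortMergeJoinEqualType is actually exercised; the table names are illustrative, the broadcast-threshold settings come from the removed test, and spark.sql.join.preferSortMergeJoin is an additional standard Spark config:

  // Sketch only: disable broadcast joins so the equi-join on the timestamp
  // key plans as a SortMergeJoin rather than a BroadcastHashJoin.
  withSQLConf(
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
    SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
    SQLConf.PREFER_SORTMERGEJOIN.key -> "true") {
    withTable("t1", "t2") {
      sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
      sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01 11:11:11')")
      sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
      sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01 11:11:11')")

      val df = sql("SELECT * FROM t1 JOIN t2 ON t1.time = t2.time")
      // checkSparkAnswerAndOperator (suggested earlier in the review) checks
      // the result against Spark and that the join runs as a Comet operator.
      checkSparkAnswerAndOperator(df)
    }
  }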
