fix: SortMergeJoin for timestamp keys #1901
Conversation
It also fixes the formatting; the fallback log now looks like this:

```
25/06/18 12:53:47 WARN CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
Project
+- BroadcastHashJoin
   :- Project
   :  +- Window
   :     +- Sort
   :        +- Exchange [COMET: ]
   :           +- Project
   :              +- SortMergeJoin [COMET: Unsupported join key type TimestampType on key CAST(time AS TIMESTAMP)]
```

...but after adding TimestampType to the match statement this message no longer appears anyway :), just for clarity.
Thanks for the contribution, @SKY-ALIN! Could we add a test case with timestamps as the join key?
```diff
@@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {
    */
   private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean = dataType match {
     case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: FloatType |
-        _: DoubleType | _: StringType | _: DateType | _: DecimalType | _: BooleanType =>
+        _: DoubleType | _: StringType | _: DateType | _: DecimalType | _: BooleanType |
+        _: TimestampType =>
```
Should TimestampNTZType also be supported?
TimestampNTZType is supported; there is another case one line below:

```scala
case TimestampNTZType => true
```
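Putting the two together, a sketch of how the full match might read after this PR; only the cases visible in the diff above and the NTZ case quoted here are from the source, and the default `false` branch is an assumption:

```scala
private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean = dataType match {
  case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: FloatType |
      _: DoubleType | _: StringType | _: DateType | _: DecimalType | _: BooleanType |
      _: TimestampType =>
    true
  // The NTZ variant is matched separately, one line below the changed cases.
  case TimestampNTZType => true
  case _ => false // assumed: all other key types fall back to Spark
}
```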
The test should have the left-side and the right-side timestamps in different timezones.
If the above test case passes, we probably can.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff              @@
##               main    #1901       +/-   ##
=============================================
- Coverage     56.12%   42.54%   -13.58%
+ Complexity      976      938       -38
=============================================
  Files           119      130       +11
  Lines         11743    12828     +1085
  Branches       2251     2414      +163
=============================================
- Hits           6591     5458     -1133
- Misses         4012     6283     +2271
+ Partials       1140     1087       -53
```

☔ View full report in Codecov by Sentry.
@mbutrovich done.
```diff
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
         .toSeq)
   }

-  test("SortMergeJoin with unsupported key type should fall back to Spark") {
```
The test you have added is great. Thank you!
However, this removed test exercises a few things that we should also check for: timestamps read from Parquet, and the actual plan created. For the latter you can simply change the test to use checkSparkAnswerAndOperator and remove the check that the canonicalized plans are the same.
Also, to make sure that we are really testing timestamps, the left side and the right side of the join should use timestamps with different timezones.
To create timestamps with different timezones, we can modify this test to create the test files separately:
```scala
withSQLConf(
  SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
  SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  withTable("t1", "t2") {
    // Write t1 with one session timezone...
    withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Australia/Darwin") {
      sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
      sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01 11:11:11')")
    }
    // ...and t2 with another, so the same literal is stored as a different instant.
    withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Canada/Pacific") {
      sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
      sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01 11:11:11')")
    }
    ...
```
The join above with different timezones will return zero rows, since the same literal written under different session timezones denotes different instants.
Also, we could rename the test.
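To make the suggestion concrete, here is a minimal sketch of how the elided test body might continue, using the checkSparkAnswerAndOperator helper mentioned above; the query text is an illustrative assumption, not part of this PR:

```scala
// Join the two tables on the timestamp key. Broadcast joins are disabled in
// the outer withSQLConf, so Spark should plan a SortMergeJoin here.
val df = sql("SELECT t1.name, t2.time FROM t1 JOIN t2 ON t1.time = t2.time")
// Zero rows are expected: the two inserts used different session timezones,
// so the stored instants differ even though the literals look identical.
checkSparkAnswerAndOperator(df)
```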
Such a test will not work, as the current implementation of DataFusion doesn't handle this case.
A DF join on timestamp columns does not take time zones into account? Then it wouldn't be correct, would it?
I don't think this PR is making the correct change.
The reason for the failure is -
This means that the newly added test does not exercise the change. The plan (see below) does not include a SortMergeJoin, and the test passes both with and without the changes in this PR.
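One way to make the test exercise the change would be to force a sort-merge join by disabling broadcast joins, as in the earlier snippet. A minimal sketch, with the query and the plan assertion as illustrative assumptions:

```scala
withSQLConf(
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
  SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  val df = sql("SELECT t1.name FROM t1 JOIN t2 ON t1.time = t2.time")
  // With broadcast joins disabled, an equi-join should fall back to a
  // sort-merge join, so the plan now contains the operator this PR targets.
  assert(df.queryExecution.executedPlan.toString.contains("SortMergeJoin"))
}
```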
Which issue does this PR close?
Closes #1900.
Rationale for this change
This type is supported, but was missed at the proto stage; in addition, the fallback message formatting was incorrect.
How are these changes tested?
These changes are tested locally by comparing results with Spark, without and with the Comet extension.
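As a rough illustration of that local check (not the author's actual procedure), Comet can be toggled with the spark.comet.enabled configuration, so a comparison in a spark-shell session might look like this sketch, with the query text assumed:

```scala
import org.apache.spark.sql.Row

// Run the same timestamp-key join with Comet disabled and enabled and compare.
// `spark.comet.enabled` is Comet's on/off switch; the query is illustrative.
val query = "SELECT t1.name FROM t1 JOIN t2 ON t1.time = t2.time"
spark.conf.set("spark.comet.enabled", "false")
val expected: Array[Row] = spark.sql(query).collect()
spark.conf.set("spark.comet.enabled", "true")
assert(spark.sql(query).collect().sameElements(expected))
```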