You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When translating Spark expressions to protocol buffer-encoded native expressions in QueryPlanSerde, we should check that we can support the expressions natively and guarantee that we will produce the same results as Spark.
One of these checks should be to check that we support the expression for the specified input data types. For example, maybe we can support Sha2 with string inputs but not with binary inputs (this is just a hypothetical example).
We should ensure that we have tests to cover all of the data types that we say that we can support.
We currently have some gaps in this area and this epic is intended to link to issues to improve the situation.
Here are some specific examples where our current approach is lacking.
Shared data type checks for multiple expressions
In some cases, we call a generic supportedDataType method:
case add @Add(left, right, _) if supportedDataType(left.dataType) =>case s @ execution.ScalarSubquery(_, _) if supportedDataType(s.dataType) =>caseXxHash64(children, seed) =>valfirstUnSupportedInput= children.find(c =>!supportedDataType(c.dataType))
It seems unlikely that all these expressions have the exact same supported types and it is also unlikely that we have tests today covering all the types that we claim to support.
No data type checks
In some cases, we have no checks at all. For example, for Sha2 we have no checks:
caseSha2(left, numBits) =>
Sha2 in Spark supports StringType and BinaryType input, but we only have tests for StringType, so there is a gap. Theoretically, we may produce incompatible results for BinaryType (we won't know until we add tests).
Preparing for complex type support
Comet will soon support reading complex types (array, struct, map) from Parquet and Iceberg, so we must update the type checks to determine which expressions support which complex types. We cannot simply update supportedDataType and say that all expressions now also support all complex types.
Describe the potential solution
No response
Additional context
No response
The text was updated successfully, but these errors were encountered:
andygrove
changed the title
[EPIC] Improve supported data type checks and tests for expressions
[EPIC] Improve validation logic and testing for supported data types for expressions
Jan 27, 2025
What is the problem the feature request solves?
Problem Statement
When translating Spark expressions to protocol buffer-encoded native expressions in
QueryPlanSerde
, we should check that we can support the expressions natively and guarantee that we will produce the same results as Spark.One of these checks should be to check that we support the expression for the specified input data types. For example, maybe we can support
Sha2
with string inputs but not with binary inputs (this is just a hypothetical example).We should ensure that we have tests to cover all of the data types that we say that we can support.
We currently have some gaps in this area and this epic is intended to link to issues to improve the situation.
Here are some specific examples where our current approach is lacking.
Shared data type checks for multiple expressions
In some cases, we call a generic
supportedDataType
method:It seems unlikely that all these expressions have the exact same supported types and it is also unlikely that we have tests today covering all the types that we claim to support.
No data type checks
In some cases, we have no checks at all. For example, for
Sha2
we have no checks:Sha2 in Spark supports
StringType
andBinaryType
input, but we only have tests forStringType
, so there is a gap. Theoretically, we may produce incompatible results forBinaryType
(we won't know until we add tests).Preparing for complex type support
Comet will soon support reading complex types (array, struct, map) from Parquet and Iceberg, so we must update the type checks to determine which expressions support which complex types. We cannot simply update
supportedDataType
and say that all expressions now also support all complex types.Describe the potential solution
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: