
# [EPIC] Improve validation logic and testing for supported data types for expressions #1345

Open · andygrove opened this issue Jan 27, 2025 · Label: enhancement
### What is the problem the feature request solves?

#### Problem Statement

When translating Spark expressions to protocol buffer-encoded native expressions in `QueryPlanSerde`, we should check that we can support the expressions natively and guarantee that we will produce the same results as Spark.

One of these checks should verify that we support the expression for the specified input data types. For example, perhaps we can support `Sha2` with string inputs but not with binary inputs (a hypothetical example).

We should ensure that we have tests covering all of the data types that we claim to support.
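
As a rough sketch of what such coverage could look like, the test below iterates over the claimed input types for one expression and compares results against Spark. The Parquet file naming scheme is invented for illustration, and `checkSparkAnswerAndOperator` is assumed to behave like Comet's test helper that verifies both the answer and that the native operator ran:

```scala
// Illustrative sketch only: exercise an expression against every input
// type we claim to support, comparing Comet's results with Spark's.
import org.apache.spark.sql.types._

val claimedInputTypes: Seq[DataType] = Seq(StringType, BinaryType)

claimedInputTypes.foreach { dt =>
  // one test table per data type; this naming scheme is hypothetical
  val df = spark.read.parquet(s"test-data-${dt.typeName}.parquet")
  checkSparkAnswerAndOperator(df.selectExpr("sha2(c0, 256)"))
}
```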

We currently have gaps in this area, and this epic links to issues that will improve the situation.

Here are some specific examples where our current approach is lacking.

#### Shared data type checks for multiple expressions

In some cases, we call a generic `supportedDataType` method:

```scala
case add @ Add(left, right, _) if supportedDataType(left.dataType) =>

case s @ execution.ScalarSubquery(_, _) if supportedDataType(s.dataType) =>

case XxHash64(children, seed) =>
  val firstUnSupportedInput = children.find(c => !supportedDataType(c.dataType))
```

It seems unlikely that all of these expressions support exactly the same set of types, and it is also unlikely that we have tests today covering all of the types that we claim to support.
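
One possible direction, sketched below, is to maintain per-expression allow-lists rather than one shared check. The value names and type sets are illustrative assumptions, not the actual types Comet supports:

```scala
// Hypothetical per-expression type checks in place of one shared
// supportedDataType. The sets below are placeholders.
import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

private val addSupportedTypes: Set[DataType] = Set(IntegerType, LongType)
private val xxHash64SupportedTypes: Set[DataType] = Set(IntegerType, LongType, StringType)

// Sketch of usage inside the QueryPlanSerde-style pattern match:
//   case add @ Add(left, right, _)
//       if addSupportedTypes.contains(left.dataType) => ...
//   case XxHash64(children, seed)
//       if children.forall(c => xxHash64SupportedTypes.contains(c.dataType)) => ...
```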

#### No data type checks

In other cases, we have no checks at all. For example, there is no data type check for `Sha2`:

```scala
case Sha2(left, numBits) =>
```

`Sha2` in Spark supports `StringType` and `BinaryType` inputs, but we only have tests for `StringType`, so there is a gap. We may produce incompatible results for `BinaryType` (we won't know until we add tests).
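
A minimal guard for `Sha2` might look like the sketch below. The `serializeSha2` helper is invented for illustration, and `withInfo` is used in the style of Comet's existing fallback reporting:

```scala
// Sketch: only serialize Sha2 for the input type we have verified,
// and record a fallback reason otherwise.
case Sha2(left, numBits) if left.dataType == StringType =>
  serializeSha2(left, numBits) // hypothetical helper; today's protobuf path
case s @ Sha2(left, _) =>
  // BinaryType (and anything else) falls back until we add test coverage
  withInfo(s, s"Unsupported input type for Sha2: ${left.dataType}")
  None
```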

#### Preparing for complex type support

Comet will soon support reading complex types (array, struct, map) from Parquet and Iceberg, so we must update the type checks to determine which expressions support which complex types. We cannot simply update `supportedDataType` to say that all expressions now support all complex types.
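
One way to express this, as a purely hypothetical sketch, is a recursive check parameterized per expression instead of returning one global answer:

```scala
// Hypothetical sketch: a recursive type check that takes a flag for
// whether the calling expression supports complex types at all. A real
// implementation would likely carry richer per-expression capabilities.
import org.apache.spark.sql.types._

def supportedDataType(dt: DataType, allowComplex: Boolean): Boolean = dt match {
  case _: StructType | _: ArrayType | _: MapType if !allowComplex => false
  case StructType(fields) =>
    fields.forall(f => supportedDataType(f.dataType, allowComplex))
  case ArrayType(elementType, _) =>
    supportedDataType(elementType, allowComplex)
  case MapType(keyType, valueType, _) =>
    supportedDataType(keyType, allowComplex) && supportedDataType(valueType, allowComplex)
  case _ =>
    true // primitive types: keep whatever per-type logic exists today
}
```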

### Describe the potential solution

No response

### Additional context

No response
