Skip to content

[EPIC] Improve validation logic and testing for supported data types for expressions #1345

Open
@andygrove

Description

@andygrove

What is the problem the feature request solves?

Problem Statement

When translating Spark expressions to protocol buffer-encoded native expressions in QueryPlanSerde, we should check that we can support the expressions natively and guarantee that we will produce the same results as Spark.

One of these checks should be to check that we support the expression for the specified input data types. For example, maybe we can support Sha2 with string inputs but not with binary inputs (this is just a hypothetical example).

We should ensure that we have tests to cover all of the data types that we say that we can support.

We currently have some gaps in this area and this epic is intended to link to issues to improve the situation.

Here are some specific examples where our current approach is lacking.

Shared data type checks for multiple expressions

In some cases, we call a generic supportedDataType method:

      case add @ Add(left, right, _) if supportedDataType(left.dataType) =>

      case s @ execution.ScalarSubquery(_, _) if supportedDataType(s.dataType) =>

      case XxHash64(children, seed) =>
        val firstUnSupportedInput = children.find(c => !supportedDataType(c.dataType))

It seems unlikely that all these expressions have the exact same supported types and it is also unlikely that we have tests today covering all the types that we claim to support.

No data type checks

In some cases, we have no checks at all. For example, for Sha2 we have no checks:

      case Sha2(left, numBits) =>

Sha2 in Spark supports StringType and BinaryType input, but we only have tests for StringType, so there is a gap. Theoretically, we may produce incompatible results for BinaryType (we won't know until we add tests).

Preparing for complex type support

Comet will soon support reading complex types (array, struct, map) from Parquet and Iceberg, so we must update the type checks to determine which expressions support which complex types. We cannot simply update supportedDataType and say that all expressions now also support all complex types.

Describe the potential solution

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions