@benbellick
Member

@benbellick benbellick commented Nov 21, 2025

Support both opaque (google.protobuf.Any) and
structured (Literal.Struct) encodings for user-defined type literals per Substrait spec.

  • Split UserDefinedLiteral into UserDefinedAny and UserDefinedStruct
  • Move type parameters to interface level for parameterized types
  • Add first-class POJO representation for type parameters
  • Test coverage including roundtrip tests
  • Throw exception on unhandled struct-based representation in isthmus

Closes #611

@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 133b704 to e4f36ea Compare November 21, 2025 19:08
@benbellick benbellick changed the title from "Benbellick/handle structured udt2" to "handle struct-based UDT literals in core" Nov 21, 2025
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from e4f36ea to cef9798 Compare November 21, 2025 19:16
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from cef9798 to f3379c1 Compare November 21, 2025 19:51
@benbellick
Member Author

FYI, this PR makes no attempt to validate whether the struct representation provided matches the definition in the yaml file. I think validating it is the right thing to do, but it turned out to be slightly more complicated than I thought, as it involves threading an ExtensionCollection through the codebase a bit. Thus, I left it for another issue (#614).

@benbellick benbellick marked this pull request as ready for review November 21, 2025 20:01
@benbellick benbellick requested a review from vbarua November 21, 2025 20:01
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 1eb4fb4 to f5b6341 Compare November 24, 2025 16:19
extensionCollector.getTypeReference(SimpleExtension.TypeAnchor.of(expr.urn(), expr.name()));
return lit(
bldr -> {
try {
Member Author

This exception doesn't happen anymore because we don't parse the Any here. Instead, we have a reference to the pre-parsed proto directly.

public ParameterizedType userDefined(
int ref, java.util.List<io.substrait.type.Type.Parameter> typeParameters) {
throw new UnsupportedOperationException(
"User defined types are not supported in Parameterized Types for now");
Member Author

This is consistent with the above, where we don't yet support ParameterizedType conversion.

public DerivationExpression userDefined(
int ref, java.util.List<io.substrait.type.Type.Parameter> typeParameters) {
throw new UnsupportedOperationException(
"User defined types are not supported in Derivation Expressions for now");
Member Author

This is also consistent with the above.

public RexNode visit(Expression.UserDefinedStructLiteral expr, Context context)
throws RuntimeException {
throw new UnsupportedOperationException(
"UserDefinedStructLiteral representation is not yet supported in Isthmus");
Member Author

This is #612

@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 78488b4 to e8cb862 Compare November 25, 2025 19:27
Member

@vbarua vbarua left a comment

Started taking a look at this and left some comments. Mostly looks reasonable. Want to come back and look at your tests with fresh eyes, and also think about parameterized types with a fresh 🧠.

return Type.withNullability(nullable()).userDefined(urn(), name());
public abstract List<io.substrait.type.Type.Parameter> typeParameters();

public abstract List<Literal> fields();
Member

Is there a reason to use a List<Literal> here instead of just an Expression.StructLiteral, which would map directly to the protobuf?

Member Author

Sharing this comment because I believe it is relevant.


Basically it is the same issue, which is that the POJO class called StructLiteral is representing a special case of the Literal class.

  message Literal { // Both StructLiteral and UserDefinedStructLiteral are representing this _whole_ message
    oneof literal_type {
      ...
      Struct struct = 25;
      ...
      UserDefined user_defined = 33;
    }
    ...
    message Struct {
      // A possibly heterogeneously typed list of literals
      repeated Literal fields = 1;
    }

    message UserDefined {
      oneof type_anchor_type {
        // points to a type_anchor defined in this plan
        uint32 type_reference = 1;

        // points to a type_alias_anchor defined in this plan.
        uint32 type_alias_reference = 5;
      }

      // The parameters to be bound to the type class, if the type class is
      // parameterizable.
      repeated Type.Parameter type_parameters = 3;

      // a user-defined literal can be encoded in one of two ways
      oneof val {
        // the value of the literal, serialized using some type-specific protobuf message
        google.protobuf.Any value = 2;
        // the value of the literal, serialized using the structure definition in its declaration
        Literal.Struct struct = 4;
      }
    }
  }

Back to your comment, switching this member variable to be Expression.StructLiteral would amount to embedding one literal proto inside of another. For example, the proto Struct doesn't actually have a nullable field. But the POJO Expression.StructLiteral does have a nullable field because it inherits from Literal.

This doesn't mean that we couldn't replace the member variable as you suggest, but if we did do that, we would be carrying around extra meaningless data, which I think is more confusing ultimately.
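
To make the "extra meaningless data" point concrete, here is a minimal sketch comparing the two shapes. All class names below are hypothetical, simplified stand-ins written for this illustration; they are not the actual substrait-java classes, and the real POJOs carry more state (URN, name, type parameters).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration; not the real substrait-java classes.
interface Literal {
  boolean nullable();
}

final class I32Literal implements Literal {
  final int value;

  I32Literal(int value) {
    this.value = value;
  }

  @Override
  public boolean nullable() {
    return false;
  }
}

// Stand-in for a POJO that models a whole proto Literal whose literal_type is `struct`.
final class StructLiteral implements Literal {
  final boolean nullable;
  final List<Literal> fields;

  StructLiteral(boolean nullable, List<Literal> fields) {
    this.nullable = nullable;
    this.fields = fields;
  }

  @Override
  public boolean nullable() {
    return nullable;
  }
}

// Shape argued for above: hold the fields directly, like the proto message Literal.Struct.
final class UdtWithFieldList implements Literal {
  final boolean nullable;
  final List<Literal> fields;

  UdtWithFieldList(boolean nullable, List<Literal> fields) {
    this.nullable = nullable;
    this.fields = fields;
  }

  @Override
  public boolean nullable() {
    return nullable;
  }
}

// Alternative from the review: embed a StructLiteral, which brings a second nullable flag.
final class UdtWithStructLiteral implements Literal {
  final boolean nullable;
  final StructLiteral struct;

  UdtWithStructLiteral(boolean nullable, StructLiteral struct) {
    this.nullable = nullable;
    this.struct = struct;
  }

  @Override
  public boolean nullable() {
    return nullable;
  }
}

final class UdtEncodingDemo {
  // True when the embedded literal's nullability disagrees with the outer literal's.
  static boolean hasConflictingNullability(UdtWithStructLiteral udt) {
    return udt.nullable() != udt.struct.nullable();
  }

  public static void main(String[] args) {
    List<Literal> fields = Arrays.<Literal>asList(new I32Literal(1), new I32Literal(2));
    UdtWithStructLiteral embedded =
        new UdtWithStructLiteral(false, new StructLiteral(true, fields));
    System.out.println(hasConflictingNullability(embedded));
  }
}
```

In the embedded shape, the inner flag has no counterpart in the proto, so a reader has to guess which of the two flags is authoritative. The field-list shape has only one place where nullability can live.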

* User-defined literal with value encoded as {@code Literal.Struct}.
*
* <p>This encoding uses a structured list of fields to represent the literal value.
*/
Member

The docstring feels a little inconsistent. You have

literal with value encoded as {@code Literal.Struct}

but in the class the values are encoded as a List<Literal>, and the second part of the docstring is consistent with that.

Member Author

I guess that this could be made less confusing. The intention was to refer to the proto Literal.Struct, not some Java class. What do you think about just altering the message to say

User-defined literal with value encoded via the proto message {@code Literal.Struct}.

?

Member Author

@benbellick benbellick Nov 26, 2025

I altered it to instead link to the actual proto. Let me know what you think! I also did the same for the any case.

@Override
public abstract List<io.substrait.type.Type.Parameter> typeParameters();

public abstract com.google.protobuf.Any value();
Member

Release Notes
Capturing the value as an Any instead of a ByteString does feel nicer ✨

* @see UserDefinedAnyLiteral
* @see UserDefinedStructLiteral
*/
interface UserDefinedLiteral extends Literal {
Member

Release Notes
We should call out that we don't construct UserDefinedLiterals anymore.

* parameters (like the {@code 10} in {@code VARCHAR<10>}). This interface provides a type-safe
* representation of all possible parameter kinds.
*/
interface Parameter {}
Member

Member Author

🤔 yes that is an interesting point. Looking into it!

Member Author

So as I understand it, ParameterizedType.java is used for representing abstract types with parameters in yaml files, whereas the Parameter being introduced above is a concrete argument passed into the type.

So for example, List<any1> could be a ParameterizedType, whereas List<int32> is a type with parameters [int32].
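
A rough sketch of that distinction, using plain strings in place of real type objects. `TypeTemplate`, `render`, and `bind` are hypothetical names invented for this illustration and do not exist in substrait-java; the sketch only shows an abstract pattern with a placeholder being turned into a type with concrete arguments.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from substrait-java.
final class TypeTemplate {
  final String name;
  final List<String> placeholders; // abstract parameters, e.g. ["any1"]

  TypeTemplate(String name, List<String> placeholders) {
    this.name = name;
    this.placeholders = placeholders;
  }

  // The abstract form, as it might be written in a yaml file.
  String render() {
    return name + "<" + String.join(", ", placeholders) + ">";
  }

  // Substitute every placeholder with a concrete argument.
  String bind(Map<String, String> arguments) {
    List<String> bound = new ArrayList<>();
    for (String placeholder : placeholders) {
      String argument = arguments.get(placeholder);
      if (argument == null) {
        throw new IllegalArgumentException("No argument for placeholder " + placeholder);
      }
      bound.add(argument);
    }
    return name + "<" + String.join(", ", bound) + ">";
  }

  public static void main(String[] args) {
    TypeTemplate list = new TypeTemplate("List", Collections.singletonList("any1"));
    System.out.println(list.render());
    System.out.println(list.bind(Collections.singletonMap("any1", "int32")));
  }
}
```

Here the template plays the role of the ParameterizedType side, and the values passed to `bind` play the role of the concrete parameters.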

@Value.Immutable
abstract class ParameterBooleanValue implements Parameter {
public abstract boolean value();
}
Member

Is there a use case for having boolean literal type parameters? I can't imagine a case where something like MySpecialType<false> is something you would need.

Member Author

#613 (comment)

🤷 just wanted to be consistent with the spec

@Value.Immutable
abstract class ParameterEnumValue implements Parameter {
public abstract String value();
}
Member

What would this type be used for?

Member Author

As discussed offline, these were inspired by this portion of the simple extension schema:

  type_param_defs: # an array of compound type parameter definitions
    type: array
    items:
      ...
      properties:
        ...
        type: # expected metatype for the parameter
          type: string
          enum:
            - dataType
            - boolean
            - integer
            - enumeration
            - string

So while I don't understand the usage of it, I thought it was best to include all of them for consistency with the spec.
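
For reference, a minimal sketch of how the five metatypes in the schema excerpt could line up with one Parameter class each. `ParameterBooleanValue`, `ParameterEnumValue`, and `ParameterStringValue` appear in this PR's diff; `ParameterDataType`, `ParameterIntegerValue`, and the `metatypeOf` helper are names assumed for this illustration. The classes are plain Java stand-ins rather than the `@Value.Immutable` abstract classes used in the PR.

```java
// Simplified stand-ins; the PR uses @Value.Immutable abstract classes instead.
interface Parameter {}

// Assumed name: a parameter whose value is itself a data type, held here as a string.
final class ParameterDataType implements Parameter {
  final String typeName;

  ParameterDataType(String typeName) {
    this.typeName = typeName;
  }
}

final class ParameterBooleanValue implements Parameter {
  final boolean value;

  ParameterBooleanValue(boolean value) {
    this.value = value;
  }
}

// Assumed name: an integer parameter, like the 10 in VARCHAR<10>.
final class ParameterIntegerValue implements Parameter {
  final long value;

  ParameterIntegerValue(long value) {
    this.value = value;
  }
}

final class ParameterEnumValue implements Parameter {
  final String value;

  ParameterEnumValue(String value) {
    this.value = value;
  }
}

final class ParameterStringValue implements Parameter {
  final String value;

  ParameterStringValue(String value) {
    this.value = value;
  }
}

final class ParameterKinds {
  // Hypothetical helper: map a parameter to the metatype name used in the schema.
  static String metatypeOf(Parameter parameter) {
    if (parameter instanceof ParameterDataType) return "dataType";
    if (parameter instanceof ParameterBooleanValue) return "boolean";
    if (parameter instanceof ParameterIntegerValue) return "integer";
    if (parameter instanceof ParameterEnumValue) return "enumeration";
    if (parameter instanceof ParameterStringValue) return "string";
    throw new IllegalArgumentException("Unknown parameter kind: " + parameter.getClass());
  }
}
```

Enum and string parameters both carry a `String`, so the class is the only thing that tells them apart, which is one argument for keeping them as separate types.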

@Value.Immutable
abstract class ParameterStringValue implements Parameter {
public abstract String value();
}
Member

What would this type be used for?

Member Author

@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from fea5921 to 77f0bd7 Compare November 26, 2025 18:14
@benbellick benbellick force-pushed the benbellick/handle-structured-udt2 branch from 3814053 to 4b4f5ee Compare November 26, 2025 18:26
@benbellick
Member Author

FYI, I have a WIP PR for implementing this in Isthmus but I split it into two PRs because the code there is more complicated. This keeps the PRs a bit smaller!

Development

Successfully merging this pull request may close these issues.

Handle struct-based UDT literals in core

3 participants