Skip to content

feat: introduce Scalar and StructLike interfaces for Transform evaluation #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gty404
Copy link
Contributor

@gty404 gty404 commented Apr 16, 2025

Problems to Solve

  1. nanoarrow does not provide a copy interface for ArrowArray, which is needed when implementing IdentityTransform.
  2. Both Transform and Expression would benefit from a lightweight, uniform data structure for input values — ideally something like a simplified StructLike.

Proposed Solution

  1. Introduce an abstract Scalar interface to represent values (both primitive and complex).
  2. Define a StructLike interface to represent structured values with field-based access.
  3. Allow classes such as ManifestEntity, DataFile, etc., to implement StructLike and serve as evaluation inputs for Transform and Expression.

Benefits

  1. Decouples Transform/Expression from ArrowArray, improving flexibility.
  2. Enables easier testing, mocking, and transformation logic using in-memory representations.
  3. Lays the foundation for future enhancements like expression optimization or code generation.

virtual ~ScalarVisitor() = default;

/// Visit methods for primitive types
virtual void visit(int32_t value) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it accept Int32Scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
The PrimitiveScalar::Accept method directly uses value, and using PrimitiveScalar is also acceptable. I will make some modifications.

Copy link
Member

@wgtmac wgtmac Apr 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it accepts physical values instead of typed values, how can it differentiate between the actual types? For example, long and timestamp both use int64_t.

std::optional<T> value_;
};

// Aliases for common primitive scalar types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are supposed to specialize each type like DecimalType, BinaryType, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be specialized.

/**
* @brief Base class representing a generic scalar value.
*/
class ICEBERG_EXPORT Scalar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make std::shared_ptr<Type> as a member variable to Scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three possible approaches to organizing type information:

  1. Scalar holds built-in types independently, and StructLike does not rely on StructType.
  2. Scalar does not contain any type information, and its type is inferred from the associated StructType within StructLike.
  3. Both Scalar and StructType carry type information, resulting in some redundancy.
    I have chosen the first approach. Which one would you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Scalar does not carry the type info, we have to pass Schema wherever Scalar is used. In the case of schema evolution, then we need both Scalar schema and expected (table) schema.

* @tparam Derived The concrete subclass type (CRTP).
*/
template <typename Derived>
class VectorScalarBase : public Scalar {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little bit weird that values in it have different meanings in the StructScalar (horizontal) and ArrayScalar (vertical).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit strange because the type information is all in StructLike, with StructScalar/ArrayScalar both being a group of Scalar members and there is not much difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this sounds like an anti-pattern in the OOP. You cannot use the base VectorScalarBase class without knowing its subclass implementation.

/**
* @brief Represents an array-like scalar (list of values).
*/
class ArrayScalar : public VectorScalarBase<ArrayScalar> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we define MapScalar? Should it be array<struct<key,value>>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I haven't thought of a good design, perhaps map can still be used, and Scalar needs to add interfaces for hash and equality judgment?

/**
* @brief Abstract interface for structured field access.
*/
class ICEBERG_EXPORT StructLike {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we simply reuse StructScalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructLike has a StructType, while StructScalar does not have one, I think this should depend on the discussion result regarding whether a Scalar should have its own type information.

/**
* @brief Visitor interface for traversing StructLike fields.
*/
class ICEBERG_EXPORT StructLikeVisitor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, we can remove this as well.

* @brief Checks if the scalar represents a null value.
* @return true if the value is null, false otherwise.
*/
virtual bool IsNull() const = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the implication that this is not a non-trivial operation? (Else it might be is_null to imply it's a simple getter?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Scalar class, is_null can be a member variable, or the IsNull interface can be implemented by subclasses, where subclasses might use optional or pointer to determine if it is null. Here, the second option was chosen; I actually don't have a strong preference. StructScalar also uses the is_null field, so it might be more appropriate to directly make is_null a member variable of Scalar. I'll make a modification.

* @brief Creates a deep copy of the scalar.
* @return A shared pointer to the cloned scalar.
*/
virtual std::shared_ptr<Scalar> clone() const = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be Clone? (Ditto in general for the PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will modify it.

Comment on lines +91 to +94
/**
* @brief Called when visiting a null scalar.
*/
virtual void visitNull() = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a specific NullScalar type, or any scalar with a null value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see the behavior below. IMO I'd rather we pass the scalars to the visitor so that the users can tell between different types of nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, visit I will modify it to accept Scalar types as input

* @return Reference to the value.
* @note Behavior is undefined if the scalar is null.
*/
const T& value() const { return *value_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but perhaps in DEBUG it should use optional::value() to force an exception and in NDEBUG it can use operator* to bypass that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, I will modify here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants