
[C++][Parquet] Add variant type #45375

Draft · wants to merge 9 commits into main

Conversation

neilechao

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?


Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@neilechao neilechao changed the title Add variant type [C++] Add variant type Jan 29, 2025
@wgtmac (Member) commented Feb 6, 2025

Thanks @neilechao for working on this! I saw your reply on the dev@parquet ML. Let me know if you have any questions.

@neilechao (Author)

Thanks @wgtmac! My main question is this: is it possible to add variant type support to Parquet without adding a conversion to and from Arrow? The Variant encoding and shredding specs are in parquet-format, but I don't think the community has spent much time thinking about the on-wire format of variant.

@wgtmac (Member) commented Feb 11, 2025

There was a discussion on it: #42069. IMHO, we can get started with the variant binary format of the Parquet spec. cc @mapleFU @pitrou @emkornfield @wjones127 @westonpace

@emkornfield (Contributor)

> There was a discussion on it: #42069. IMHO, we can get started with the variant binary format of the Parquet spec. cc @mapleFU @pitrou @emkornfield @wjones127 @westonpace

My thoughts here mirror @wgtmac's: let's first be able to read/write the encoded version (including having APIs for decoding from binary). Then we can add low-level Parquet writes for shredding/deshredding, and for now Arrow can look like struct<metadata, value> with perhaps a logical type. Finally, if there is bandwidth, we can discuss standardizing what shredded Arrow would look like. Open to other suggestions.

@neilechao (Author)

Got it, thanks @wgtmac and @emkornfield!

 private:
  Variant()
      : LogicalType::Impl(LogicalType::Type::VARIANT, SortOrder::UNKNOWN),
        LogicalType::Impl::SimpleApplicable(parquet::Type::BYTE_ARRAY) {}
Author

@wgtmac @emkornfield - I initially had Variant inherit from SimpleApplicable(BYTE_ARRAY), but Variant should actually be composed of two separate byte arrays - one for metadata (the dict) and one for values. This muddies the applicability of VariantLogicalType to a single parquet type.

parquet column-size variant_basic.parquet
VARIANT_COL.value->     Size In Bytes: 69  Size In Ratio: 0.52671754
VARIANT_COL.metadata->  Size In Bytes: 62  Size In Ratio: 0.47328246

  1. One possibility is to create separate VariantMetadataLogicalType and VariantValueLogicalType, with VariantLogicalType containing both as class members. The pro is that this reflects the storage in Parquet, where metadata and values are stored in separate columns; the con is that it diverges from parquet.thrift and potentially from the other language implementations.
  2. Another option would be to have VariantMetadata and VariantValue present, but not as logical types.

What are your thoughts on these approaches?

Member

IMO, VariantLogicalType should be similar to MapLogicalType and ListLogicalType, which annotate a group type. Though it (the unshredded form) is composed of two separate byte arrays, these two fields live under the same group type, as below. Can we model it as a struct<binary,binary>?

optional group variant_name (VARIANT) {
  required binary metadata;
  required binary value;
}
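
For reference, a minimal sketch (not code from this PR) of how that group shape could be built with the existing parquet-cpp schema APIs. The variant logical type annotation is hypothetical here, since adding it is exactly what this PR proposes; List and Map follow the same "logical type annotates a group node" pattern.

// Sketch only: builds the unshredded VARIANT group shape with parquet-cpp
// schema nodes. Annotating the group with a variant logical type is
// hypothetical until this PR lands.
#include <string>

#include "parquet/schema.h"
#include "parquet/types.h"

using parquet::Repetition;
using parquet::Type;
using parquet::schema::GroupNode;
using parquet::schema::NodePtr;
using parquet::schema::PrimitiveNode;

NodePtr MakeUnshreddedVariantNode(const std::string& name) {
  NodePtr metadata =
      PrimitiveNode::Make("metadata", Repetition::REQUIRED, Type::BYTE_ARRAY);
  NodePtr value =
      PrimitiveNode::Make("value", Repetition::REQUIRED, Type::BYTE_ARRAY);
  // A real implementation would also pass the (not yet existing) variant
  // logical type when making the group node.
  return GroupNode::Make(name, Repetition::OPTIONAL, {metadata, value});
}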


@wgtmac - I updated the thrift definition of variant to include required metadata and value binary members, and I passed the metadata and value into VariantLogicalType / LogicalType::Impl::Variant to populate the required helper methods.

Afterwards, I saw that Maps and Lists take a different approach - they appear to have barebones MapLogicalType and ListLogicalType respectively, and I don't see their structure defined clearly in parquet.thrift. It looks like their structure is described in LogicalTypes.md and referenced again in some tests.

What's the difference between defining the members in the parquet.thrift struct versus using GroupNodes built on top of PrimitiveNodes?

Member

Sorry, I may not have explained it clearly.

Parquet has two different kinds of nodes: primitive and group. A primitive node is usually for a primitive physical type (e.g. int64, double, binary, etc.) while a group node is for a complex type (e.g. struct, map, list, etc.). Whatever the node type is, it occupies a SchemaElement in the Parquet schema: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L502

A LogicalType can be assigned to both primitive and group nodes. Most logical types annotate a primitive node (examples are DECIMAL, TIMESTAMP, etc.). However, List/Map/Variant are logical types that must annotate a group node.

For List and Map, their group structures are dynamic because of different subtypes (the element type of a List or the key/value types of a Map). For Variant, its group structure is also dynamic, depending on whether it is shredded and on the shredded value types.

> I updated the thrift definition of variant to include required metadata and value binary members

Based on the above explanation, we cannot modify parquet.thrift. To facilitate the current development, maybe we can define a VariantExtensionType similar to the implementations at https://github.com/apache/arrow/tree/01e3f1e6829d6fcc9021ac47aebb6350590ca134/cpp/src/arrow/extension with a storage type of struct<metadata:binary,value:binary>. Once stable, we can make it canonical following the procedure at https://arrow.apache.org/docs/format/CanonicalExtensions.html
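
For illustration, a minimal sketch (an assumption-level example, not code from this PR) of the suggested struct<metadata: binary, value: binary> storage type, using the standard Arrow C++ type factories:

// Sketch only: the storage type a VariantExtensionType could wrap. Field
// names and nullability follow the unshredded layout discussed above; the
// extension type class itself is not shown here.
#include <memory>

#include "arrow/api.h"

std::shared_ptr<arrow::DataType> VariantStorageType() {
  return arrow::struct_({
      // Variant metadata (the dictionary) is always required.
      arrow::field("metadata", arrow::binary(), /*nullable=*/false),
      arrow::field("value", arrow::binary(), /*nullable=*/false),
  });
}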

Do you have any opinion? @emkornfield @pitrou @mapleFU

Author

If I create a VariantExtensionType with storage type struct<metadata:binary, value:binary>, will the Parquet Reader and Writer break down the struct members (metadata and value) into separate columns for reading and writing?

Just double checking, since the Arrow documentation on writing Parquet files says "An Arrow Extension type is written out as its storage type", and I want to make sure that separate columns for metadata and value are read and written.

Member

Yes, these two binary columns are processed individually by the Parquet writer and reader. You might want to implement a VariantExtensionArray on top of the StructArray to restore the variant-typed values.
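
A rough sketch of what such a VariantExtensionArray could look like (assumed names, not code from this PR), exposing the two child arrays of the struct storage:

// Sketch only: an ExtensionArray wrapper whose storage is
// struct<metadata: binary, value: binary>. The accessors simply forward to
// the children of the underlying StructArray.
#include <memory>

#include "arrow/api.h"
#include "arrow/extension_type.h"

class VariantExtensionArray : public arrow::ExtensionArray {
 public:
  using arrow::ExtensionArray::ExtensionArray;

  std::shared_ptr<arrow::Array> metadata() const {
    return std::static_pointer_cast<arrow::StructArray>(storage())->field(0);
  }

  std::shared_ptr<arrow::Array> value() const {
    return std::static_pointer_cast<arrow::StructArray>(storage())->field(1);
  }
};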

@ianmcook ianmcook changed the title [C++] Add variant type [C++][Parquet] Add variant type Feb 12, 2025
Comment on lines 34 to 43
bool VariantExtensionType::IsSupportedStorageType(
    std::shared_ptr<DataType> storage_type) {
  if (storage_type->id() == Type::STRUCT) {
    // TODO(neilechao) assertions for binary types, and non-nullable first field for
    // metadata
    return storage_type->num_fields() == 3;
  }

  return false;
}
Author

@wgtmac - quick sanity check, is this the proper way to define the underlying DataType?

Member

Yes, please see my inline comments.

  if (storage_type->id() == Type::STRUCT) {
    // TODO(neilechao) assertions for binary types, and non-nullable first field for
    // metadata
    return storage_type->num_fields() == 3;
Member

Isn't it storage_type->num_fields() >= 2?

I see a TODO here so I assume you will check the type later.
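
For illustration, one way the completed check could look once the TODO is addressed (a sketch along the lines suggested in this thread, not the PR's actual implementation): require at least a non-nullable binary metadata field and a binary value field, leaving any additional fields for future shredding support.

// Sketch only; assumes the same includes and namespace context as the
// snippet above. Extra trailing fields (for shredding) are not validated.
bool VariantExtensionType::IsSupportedStorageType(
    std::shared_ptr<DataType> storage_type) {
  if (storage_type->id() != Type::STRUCT || storage_type->num_fields() < 2) {
    return false;
  }
  const auto& metadata = storage_type->field(0);
  const auto& value = storage_type->field(1);
  return metadata->type()->id() == Type::BINARY && !metadata->nullable() &&
         value->type()->id() == Type::BINARY;
}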


@wgtmac (Member) commented Feb 26, 2025

BTW, do you plan to add the variant binary encoding in this PR or a separate one?
