[C++][Parquet] Add variant type #45375
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also:
Thanks @neilechao for working on this! I saw your reply on the dev@parquet ML. Let me know if you have any questions.
Thanks @wgtmac! My main question is this: is it possible to add variant type support to Parquet without adding a conversion to and from Arrow? The Variant encoding and shredding specs are in parquet-format, but I don't think the community has spent much time thinking about the on-wire format of variant.
There was a discussion on it: #42069. IMHO, we can get started with the variant binary format of the Parquet spec. cc @mapleFU @pitrou @emkornfield @wjones127 @westonpace
My thoughts here mirror @wgtmac's, I think. Let's first be able to read/write the encoded version (including having APIs for decoding from binary). Then we can add low-level Parquet writes for shredding/deshredding; for now the Arrow side can look like struct<metadata, value>, perhaps with a logical type. Finally, if there is bandwidth, we can discuss standardizing what shredded Arrow would look like. Open to other suggestions.
Got it, thanks @wgtmac and @emkornfield!
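To make the "APIs for decoding from binary" step concrete, here is a minimal sketch of reading the two header bytes of an encoded variant. It assumes the bit layout described in the Variant encoding document in parquet-format (metadata header: version in bits 0-3, sorted_strings in bit 4, offset_size_minus_one in bits 6-7; value header: basic_type in the two low bits). The struct, enum, and function names are hypothetical and are not part of Arrow or Parquet C++.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not an Arrow/Parquet API.
struct VariantMetadataHeader {
  std::uint8_t version;      // bits 0-3 of the header byte
  bool sorted_strings;       // bit 4: dictionary strings are sorted and unique
  std::uint8_t offset_size;  // bits 6-7 hold (offset_size - 1), so 1..4 bytes
};

// Basic type stored in the two low bits of the first byte of a value.
enum class VariantBasicType : std::uint8_t {
  kPrimitive = 0,
  kShortString = 1,
  kObject = 2,
  kArray = 3,
};

// Decodes the first byte of the metadata buffer.
// Returns false for any version other than 1 (assumed to be the only
// version defined at the time of writing).
bool DecodeMetadataHeader(std::uint8_t header, VariantMetadataHeader* out) {
  const std::uint8_t version = header & 0x0F;
  if (version != 1) return false;
  out->version = version;
  out->sorted_strings = (header & 0x10) != 0;
  out->offset_size = static_cast<std::uint8_t>(((header >> 6) & 0x03) + 1);
  return true;
}

// Extracts the basic type from the first byte of the value buffer.
VariantBasicType DecodeBasicType(std::uint8_t value_header) {
  return static_cast<VariantBasicType>(value_header & 0x03);
}
```

A real decoder would go on to read the dictionary size and offsets using offset_size, and would dispatch on the basic type to parse the remaining value bytes.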
 private:
  Variant()
      : LogicalType::Impl(LogicalType::Type::VARIANT, SortOrder::UNKNOWN),
        LogicalType::Impl::SimpleApplicable(parquet::Type::BYTE_ARRAY) {}
@wgtmac @emkornfield - I initially had Variant inherit from SimpleApplicable(BYTE_ARRAY), but Variant should actually be composed of two separate byte arrays - one for metadata (the dict) and one for values. This muddies the applicability of VariantLogicalType to a single parquet type.
parquet column-size variant_basic.parquet
VARIANT_COL.value-> Size In Bytes: 69 Size In Ratio: 0.52671754
VARIANT_COL.metadata-> Size In Bytes: 62 Size In Ratio: 0.47328246
- One possibility is to create separate VariantMetadataLogicalType and VariantValueLogicalType, with VariantLogicalType containing both as class members. The pro is that this reflects the storage in Parquet, where metadata and values are stored in separate columns. The con is that it diverges from parquet.thrift and potentially from the other language implementations.
- Another option would be to have VariantMetadata and VariantValue present, but not as logical types.
What are your thoughts on these approaches?
IMO, VariantLogicalType should be similar to MapLogicalType and ListLogicalType, which annotate a group type. Though it (the unshredded form) is composed of two separate byte arrays, these two fields are under the same group type, as below. Can we model it as a struct<binary,binary>?
optional group variant_name (VARIANT) {
  required binary metadata;
  required binary value;
}
@wgtmac - I updated the thrift definition of variant to include required metadata and value binary members, and I passed the metadata and value into VariantLogicalType / LogicalType::Impl::Variant to populate the required helper methods.
Afterwards, I saw that Maps and Lists take a different approach: they appear to have barebones MapLogicalType and ListLogicalType classes, and I don't see their structure defined clearly in parquet.thrift. It looks like their structure is described in LogicalTypes.md and referenced again in some tests.
What's the difference between defining the members in the parquet.thrift struct versus using GroupNodes built on top of PrimitiveNodes?
Sorry, I may not have explained it clearly.
Parquet has two kinds of nodes: primitive and group. A primitive node usually represents a primitive physical type (e.g. int64, double, binary), while a group node represents a complex type (e.g. struct, map, list). Whatever the node type is, it occupies one SchemaElement in the Parquet schema: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L502
A LogicalType can be assigned to either a primitive or a group node. Most logical types annotate a primitive node, for example DECIMAL and TIMESTAMP. However, List, Map, and Variant are logical types that must annotate a group node.
For List and Map, the group structure is dynamic because the subtypes vary (the element type of a List, or the key/value types of a Map). For Variant, the group structure is also dynamic: it depends on whether the value is shredded and on the shredded value types.
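The distinction between primitive and group nodes can be sketched with a small stand-alone model. This is an illustration only: the `Node` struct and the helper functions are hypothetical stand-ins, not the real `parquet::schema::PrimitiveNode` and `parquet::schema::GroupNode` classes. It shows a VARIANT annotation sitting on a group node whose children are two required binary primitives, and it renders the same schema text as the unshredded layout discussed earlier in this thread.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not parquet-cpp classes.
enum class Repetition { kRequired, kOptional };

struct Node {
  std::string name;
  Repetition repetition;
  std::string physical_type;   // set for primitive nodes, e.g. "binary"
  std::string logical_type;    // optional annotation, e.g. "VARIANT"
  std::vector<Node> children;  // used only by group nodes

  // In this model a node without a physical type is a group node.
  bool is_group() const { return physical_type.empty(); }
};

Node Primitive(std::string name, Repetition rep, std::string physical_type) {
  return Node{std::move(name), rep, std::move(physical_type), "", {}};
}

Node Group(std::string name, Repetition rep, std::string logical_type,
           std::vector<Node> children) {
  return Node{std::move(name), rep, "", std::move(logical_type),
              std::move(children)};
}

// Unshredded variant: a group annotated with VARIANT that wraps two
// required binary fields.
Node MakeUnshreddedVariant(const std::string& name) {
  return Group(name, Repetition::kOptional, "VARIANT",
               {Primitive("metadata", Repetition::kRequired, "binary"),
                Primitive("value", Repetition::kRequired, "binary")});
}

// Renders the node in Parquet's textual schema notation.
std::string ToString(const Node& node, int indent = 0) {
  const std::string pad(indent, ' ');
  const std::string rep =
      node.repetition == Repetition::kRequired ? "required" : "optional";
  if (!node.is_group()) {
    return pad + rep + " " + node.physical_type + " " + node.name + ";\n";
  }
  std::string out = pad + rep + " group " + node.name;
  if (!node.logical_type.empty()) out += " (" + node.logical_type + ")";
  out += " {\n";
  for (const Node& child : node.children) out += ToString(child, indent + 2);
  out += pad + "}\n";
  return out;
}
```

Because the annotation lives on the group, a shredded layout could add further children without changing the logical type itself, which is the sense in which the group structure is dynamic.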
> I updated the thrift definition of variant to include required metadata and value binary members
Based on the above explanation, we cannot modify parquet.thrift. To facilitate the current development, maybe we can define a VariantExtensionType, similar to the implementations at https://github.com/apache/arrow/tree/01e3f1e6829d6fcc9021ac47aebb6350590ca134/cpp/src/arrow/extension, with a storage type of struct<metadata:binary,value:binary>. Once stable, we can make it canonical following the procedure at https://arrow.apache.org/docs/format/CanonicalExtensions.html
Do you have any opinion? @emkornfield @pitrou @mapleFU
If I create a VariantExtensionType with storage type struct<metadata:binary, value:binary>, will the Parquet reader and writer break the struct members (metadata and value) down into separate columns for reading and writing?
Just double checking, since the documentation on writing Arrow extension types to Parquet files says "An Arrow Extension type is written out as its storage type", and I want to make sure that separate columns for metadata and value are read and written.
Yes, these two binary columns are processed individually by the Parquet writer and reader. You might want to implement a VariantExtensionArray on top of the StructArray to restore the variant-typed values.
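As a rough illustration of that idea, the sketch below models the storage as two independent columns and layers a thin view on top that reassembles one variant per row. All names here (`VariantColumns`, `VariantRow`, `VariantArrayView`) are hypothetical, and plain `std::vector<std::string>` stands in for Arrow's binary child arrays; a real VariantExtensionArray would wrap an `arrow::StructArray` instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the struct storage: one child column per field.
struct VariantColumns {
  std::vector<std::string> metadata;  // "metadata" column, one entry per row
  std::vector<std::string> value;     // "value" column, one entry per row
};

// One logical variant value, reassembled from both columns.
struct VariantRow {
  std::string metadata;
  std::string value;
};

// Thin view over the two columns; it does not own the storage.
class VariantArrayView {
 public:
  explicit VariantArrayView(const VariantColumns& storage)
      : storage_(storage) {}

  // Both child columns must describe the same number of rows.
  bool IsValid() const {
    return storage_.metadata.size() == storage_.value.size();
  }

  std::size_t length() const { return storage_.value.size(); }

  // Pairs up the i-th entry of each column; throws std::out_of_range
  // (from vector::at) if i is past the end of either column.
  VariantRow Get(std::size_t i) const {
    return VariantRow{storage_.metadata.at(i), storage_.value.at(i)};
  }

 private:
  const VariantColumns& storage_;
};
```

The columns stay separate on disk and in memory, and only the view knows that row i of each belongs to the same variant value.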
cpp/src/arrow/extension/variant.cc
Outdated
bool VariantExtensionType::IsSupportedStorageType(
    std::shared_ptr<DataType> storage_type) {
  if (storage_type->id() == Type::STRUCT) {
    // TODO(neilechao) assertions for binary types, and non-nullable first field for
    // metadata
    return storage_type->num_fields() == 3;
  }

  return false;
}
@wgtmac - quick sanity check, is this the proper way to define the underlying DataType?
Yes, please see my inline comments.
cpp/src/arrow/extension/variant.cc
Outdated
  if (storage_type->id() == Type::STRUCT) {
    // TODO(neilechao) assertions for binary types, and non-nullable first field for
    // metadata
    return storage_type->num_fields() == 3;
Shouldn't it be storage_type->num_fields() >= 2?
I see a TODO here, so I assume you will check the field types later.
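For illustration, the checks mentioned in this comment and in the TODO might look roughly like the sketch below. It deliberately uses small hypothetical stand-in types (`TypeId`, `FieldDesc`, `StorageDesc`) rather than `arrow::DataType` and `arrow::Field`, so it compiles on its own. The exact rules are assumptions drawn from this thread: a struct with at least two fields, where the first is a non-nullable binary named metadata and the second is a binary named value.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for arrow::Type / arrow::Field / arrow::DataType.
enum class TypeId { kBinary, kInt64, kStruct };

struct FieldDesc {
  std::string name;
  TypeId type;
  bool nullable;
};

struct StorageDesc {
  TypeId id;
  std::vector<FieldDesc> fields;  // populated only when id == kStruct
};

// Assumed rules, taken from the review discussion:
//  - storage must be a struct with at least two fields (extra fields are
//    allowed so a shredded layout can add more later);
//  - field 0 is a non-nullable binary named "metadata";
//  - field 1 is a binary named "value".
bool IsSupportedStorage(const StorageDesc& storage) {
  if (storage.id != TypeId::kStruct) return false;
  if (storage.fields.size() < 2) return false;
  const FieldDesc& metadata = storage.fields[0];
  const FieldDesc& value = storage.fields[1];
  return metadata.name == "metadata" && metadata.type == TypeId::kBinary &&
         !metadata.nullable && value.name == "value" &&
         value.type == TypeId::kBinary;
}
```

Using `>= 2` rather than `== 3` keeps the plain struct<metadata:binary,value:binary> storage valid while leaving room for additional fields.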
BTW, do you plan to add the variant binary encoding in this PR or a separate one?