-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet] Add variant type #45375
Draft
neilechao
wants to merge
9
commits into
apache:main
Choose a base branch
from
neilechao:variant_type
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+2,511
−1,945
Draft
Changes from 3 commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
ee627e3
Regenerate thrift with variant
neilechao 30c3a40
Variant LogicalType class
neilechao 2e94210
schema tests
neilechao f398e9a
Add binary metadata and json to Variant thrift struct
neilechao 8b48c5d
Pass metadata and value into VariantLogicalType
neilechao 89e90c8
Remove variant thrift members and regenerate
neilechao 3e9c4f7
Revert "Pass metadata and value into VariantLogicalType"
neilechao 5096d74
Variant arrow extension type
neilechao 8c5ec3e
Moving VariantExtensionType to parquet::arrow namespace
neilechao File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
Oops, something went wrong.
Large diffs are not rendered by default.
Oops, something went wrong.
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac @emkornfield - I initially had Variant inherit from SimpleApplicable(BYTE_ARRAY), but Variant should actually be composed of two separate byte arrays - one for metadata (the dict) and one for values. This muddies the applicability of VariantLogicalType to a single parquet type.
parquet column-size variant_basic.parquet VARIANT_COL.value-> Size In Bytes: 69 Size In Ratio: 0.52671754 VARIANT_COL.metadata-> Size In Bytes: 62 Size In Ratio: 0.47328246
What are your thoughts on these approaches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO,
VariantLogicalType
should be similar toMapLogicalType
andListLogicalType
which annotate group type. Though it (the unshredded form) is composed of two separate byte arrays, these two types are under the same group type as below. Can we model it as astruct<binary,binary>
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac - I updated the thrift definition of variant to include required metadata and value binary members, and I passed the metadata and value into VariantLogicalType / LogicalType::Impl::Variant to populate the required helper methods.
Afterwards, I saw that Maps and Lists take a different approach - they appear to have barebones MapLogicalType and ListLogicalTypes respectively, and I don't see their structure defined clearly in parquet.thrift. It looks like their structure is listed in LogicalTypes.md and referencing again in some tests.
What's the difference between defining the members in the parquet.thrift strict versus using GroupNodes built on top of PrimitiveNodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I may have not explained it clearly.
Parquet has two different kind of nodes:
primitive
andgroup
.primitive
node is usually for a primitive physical type (e.g. int64, double, binary, etc.) whilegroup
is for a complex type (e.g. struct, map, list, etc.). Whatever the node type is, it occupies aSchemaElement
in the Parquet schema: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L502LogicalType
can be assigned to both primitive and group node. Most logical types annotate a primitive node, examples are DECIMAL, TIMESTAMP, etc. However,List/Map/Variant
are logical types that must annotate a group node.For
List
andMap
, their group structures are dynamic because of different subtypes (the element type of a List or the key/value types of a Map). ForVariant
, its group structure is also dynamic depending on whether it is shredded and shredded value types.Based on above explanation, we cannot modify
parquet.thrift
. To facilitate the current development, maybe we can define aVariantExtensionType
similar to implementations at https://github.com/apache/arrow/tree/01e3f1e6829d6fcc9021ac47aebb6350590ca134/cpp/src/arrow/extension with a storage type ofstruct<metadata:binary,value:binary>
. Once stable, we can make it canonical following the procedure at https://arrow.apache.org/docs/format/CanonicalExtensions.htmlDo you have any opinion? @emkornfield @pitrou @mapleFU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I create a VariantExtensionType with storage type struct<metadata:binary, value:binary>, will the Parquet Reader and Writer break down the struct members (metadata and value) into separate columns for reading and writing?
Just double checking, since the documentation on Arrow Extension type writing Parquet files says "An Arrow Extension type is written out as its storage type", and I want to make sure that separate columns for metadata and binary are read and written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these two binary columns are processed individually by the Parquet writer and reader. You might want to implement a
VariantExtensionArray
on top of the StructArray to restore the variant-typed values.