[VARIANT] Accept variantType RFC #4096

Open · wants to merge 2 commits into `master`
Changes from 1 commit
5 changes: 3 additions & 2 deletions PROTOCOL.md
@@ -1408,13 +1408,14 @@ metadata | binary | The binary-encoded Variant metadata, as described in [Varian

The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
Struct fields which start with `_` (underscore) can be safely ignored.

Variant shredding will be introduced in a separate `variantShredding` table feature.
richardc-db marked this conversation as resolved.
Collaborator:

Should this reference to variant shredding link to the parquet variant shredding spec? Or is that overkill?

Contributor Author:

I think it makes sense to keep it separate for now because I imagine the parquet variant shredding spec may contain more information than is necessary for Delta (i.e. the parquet spec currently contains details of the binary encoding, which we don't include here).

I'd let @gene-db make the call here, though. Gene, do you have an opinion?
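
For reference, a minimal sketch of the physical layout described above, assuming PyArrow. It is not part of the spec: the field names `value` and `metadata` come from the protocol, while the column name, file name, and placeholder bytes are purely illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A variant column is written to parquet as a struct with exactly two binary fields.
variant_type = pa.struct([
    pa.field("value", pa.binary(), nullable=False),
    pa.field("metadata", pa.binary(), nullable=False),
])
schema = pa.schema([pa.field("v", variant_type)])

# The bytes stored in `value` and `metadata` must conform to the Variant
# binary encoding specification; the bytes below are placeholders only.
table = pa.table(
    {"v": [{"value": b"\x00", "metadata": b"\x00"}]},
    schema=schema,
)
pq.write_table(table, "variant_example.parquet")
```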


## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.
- must not write additional parquet struct fields.
Collaborator:

Is there a particular reason we need to forbid extra fields in this specific case?

Normally the Delta spec just says readers should ignore any unknown fields they might encounter, because the lack of a table feature to protect such fields means they do not affect correctness.

Contributor Author:

I don't think there's any particular reason. I think we can follow other features, i.e. modify this to say that readers can ignore fields that aren't `metadata` or `value`.

@gene-db or @cashmand, do you see any issues with this? I figure this should be OK - the shredding table feature later will specify the "important" columns for shredding.

Reply:

One reason to fail is that shredding adds a `typed_value` column. If a shredding-unaware reader doesn't fail when it encounters a column other than `value` and `metadata`, it could incorrectly read `value` and `metadata`, which might look valid, but would not contain the full value.

Contributor Author (@richardc-db, Feb 4, 2025):

@cashmand, in this case the `variantShredding` table feature would be enabled on the table though, right? So a shredding-unaware reader won't be able to read the table in the first place.

Maybe I'm missing something...

Reply:

That's true. Is it possible that a customer could create a Delta table using external parquet files, and we wouldn't know that we need the shredding feature? It seems safer to me for a reader to fail.

Collaborator:

Does the spec have to require that, though? It seems like an implementation detail how paranoid to be when validating various aspects of the spec. An implementation could still choose to warn or error if it finds obvious shredding-related columns in a parquet file when the shredding feature is not supported.

Collaborator:

A similar situation could arise if an engine produced deletion vectors when the DV feature is not supported.

Is that a scenario we worry about?

Reply:

Maybe not. But my understanding is that deletion vectors are a Delta concept, whereas shredding is (expected to be) part of the Parquet spec. So it seems easier to imagine importing some Parquet files written by another tool that unwittingly contain shredded data.

In any case, I think that was the motivation for including this line. If you don't think it's a concern, it should be fine to remove.
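
For illustration, a stricter reader along the lines discussed here could validate the physical schema before decoding. This is a hedged sketch assuming PyArrow; the check itself is an example, not something the spec text mandates.

```python
import pyarrow.parquet as pq


def check_variant_struct(path: str, column: str) -> None:
    """Raise if a variant column has struct fields other than value/metadata.

    An unexpected field such as typed_value would suggest shredded data that
    a shredding-unaware reader cannot reconstruct correctly.
    """
    schema = pq.read_schema(path)
    struct_type = schema.field(column).type
    names = {struct_type.field(i).name for i in range(struct_type.num_fields)}
    expected = {"value", "metadata"}
    if names != expected:
        extra = sorted(names - expected)
        raise ValueError(
            f"variant column {column!r} has unexpected struct fields: {extra}"
        )
```

A more permissive reader, as suggested earlier in the thread, would instead ignore any field other than `value` and `metadata`.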


## Reader Requirements for Variant Data Type

5 changes: 3 additions & 2 deletions protocol_rfcs/accepted/variant-type.md
@@ -58,13 +58,14 @@ metadata | binary | The binary-encoded Variant metadata, as described in [Varian

The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
Struct fields which start with `_` (underscore) can be safely ignored.

Variant shredding will be introduced in a separate `variantShredding` table feature.

## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.
- must not write additional parquet struct fields.

## Reader Requirements for Variant Data Type
