[VARIANT] Accept variantType RFC #4096

Open · wants to merge 2 commits into base: master
PROTOCOL.md (94 additions, 0 deletions)
@@ -62,6 +62,11 @@
- [Reader Requirements for Vacuum Protocol Check](#reader-requirements-for-vacuum-protocol-check)
- [Clustered Table](#clustered-table)
- [Writer Requirements for Clustered Table](#writer-requirements-for-clustered-table)
- [Variant Data Type](#variant-data-type)
- [Variant data in Parquet](#variant-data-in-parquet)
- [Writer Requirements for Variant Data Type](#writer-requirements-for-variant-data-type)
- [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
- [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
- [Requirements for Writers](#requirements-for-writers)
- [Creation of New Log Entries](#creation-of-new-log-entries)
- [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
@@ -100,6 +105,7 @@
- [Struct Field](#struct-field)
- [Array Type](#array-type)
- [Map Type](#map-type)
- [Variant Type](#variant-type)
- [Column Metadata](#column-metadata)
- [Example](#example)
- [Checkpoint Schema](#checkpoint-schema)
@@ -1353,6 +1359,86 @@ The example above converts `configuration` field into JSON format, including esc
}
```


# Variant Data Type

This feature enables support for the Variant data type, which stores semi-structured data.
richardc-db marked this conversation as resolved.
The schema serialization method is described in [Schema Serialization Format](#schema-serialization-format).
Collaborator: It seems a bit odd to specifically mention the schema serialization format here? AFAIK it didn't change with this table feature. Meanwhile, it seems like we should specifically state that the new data type is called `variant`, rather than relying on the example below to show it.

Collaborator: Ah -- the schema serialization section adds a new entry for `variant`, which explains that the type name is `variant`.

To support this feature:
- The table must be on Reader Version 3 and Writer Version 7
- The feature `variantType` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`.
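
For illustration, a `protocol` action that enables the feature could look like the following (a sketch; a real table's feature lists will usually contain additional features):

```
{
  "protocol": {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "readerFeatures": ["variantType"],
    "writerFeatures": ["variantType"]
  }
}
```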

## Example JSON-Encoded Delta Table Schema with Variant types

```
{
  "type" : "struct",
  "fields" : [ {
    "name" : "raw_data",
    "type" : "variant",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "variant_array",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "variant"
      },
      "containsNull" : false
    },
    "nullable" : false,
    "metadata" : { }
  } ]
}
```

Comment on lines +1386 to +1389

Collaborator: According to the variant spec, a variant column could store a JSON null value even if the variant field is non-nullable, as in this example. How is that handled? Does the Delta spec need to say anything about it, or is that the query engine's problem, independent of Delta?

Contributor Author: Hmm, yeah, I think this is less a storage-layer issue and more of an engine issue. That should be addressed in the actual binary layout of Variant (which is considered in the parquet spec).

## Variant data in Parquet

The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
The two binary values are named `value` and `metadata`.

When writing Variant data to Parquet files, the Variant data is written as a single Parquet struct with the following fields:

Struct field name | Parquet primitive type | Description
-|-|-
value | binary | The binary-encoded Variant value, as described in [Variant binary encoding](https://github.com/apache/spark/blob/master/common/variant/README.md)
metadata | binary | The binary-encoded Variant metadata, as described in [Variant binary encoding](https://github.com/apache/spark/blob/master/common/variant/README.md)

The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
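
As an illustration (not part of the spec), here is a minimal sketch of this physical layout using pyarrow; the column name `raw_data`, the file name, and the payload bytes are placeholders:

```
import pyarrow as pa
import pyarrow.parquet as pq

# Physical layout of one Variant column: a struct of two binary fields.
variant_physical = pa.struct([
    pa.field("value", pa.binary(), nullable=False),
    pa.field("metadata", pa.binary(), nullable=False),
])
schema = pa.schema([pa.field("raw_data", variant_physical, nullable=True)])

# The payloads must conform to the Variant binary encoding specification;
# these placeholder bytes are illustrative only.
table = pa.table(
    {"raw_data": [{"value": b"\x0c\x01", "metadata": b"\x01\x00\x00"}]},
    schema=schema,
)
pq.write_table(table, "variant_example.parquet")
```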

Variant shredding will be introduced in a separate `variantShredding` table feature.
richardc-db marked this conversation as resolved.
Collaborator: Should this reference to variant shredding link to the parquet variant shredding spec? Or is that overkill?

Contributor Author: I think it makes sense to keep it separate for now, because I imagine the parquet variant shredding spec may contain more information than is necessary for Delta (i.e. the parquet spec currently contains information for the binary encoding, which we don't include here). I'd let @gene-db make the call here, though. Gene, do you have an opinion?

Collaborator: Maybe a better way to put it -- should we give some indication here of what "shredding" even is? (A hyperlink to the spec being perhaps the least intrusive approach?)

Contributor: I think it would make sense to link to the parquet shredding spec.

Suggested change:
- Variant shredding will be introduced in a separate `variantShredding` table feature.
+ [Variant shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) will be introduced in a separate `variantShredding` table feature.

With the link, people can refer to it to get more details about the shredding, but we don't have to put those details in the text here.


## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- must not write additional parquet struct fields.
Collaborator: Is there a particular reason we need to forbid extra fields in this specific case? Normally the Delta spec just says readers should ignore any unknown fields they might encounter, because the lack of a table feature to protect such fields means they do not affect correctness.

Contributor Author: I don't think there's any particular reason; I think we can follow other features, i.e. modify this to say that readers can ignore fields that aren't `metadata` or `value`. @gene-db or @cashmand, do you see any issues with this? I figure this should be ok -- the shredding table feature later will specify the "important" columns for shredding.

Comment: One reason to fail is that shredding adds a `typed_value` column. If a shredding-unaware reader doesn't fail when it encounters a column other than `value` and `metadata`, it could incorrectly read `value` and `metadata`, which might look valid but would not contain the full value.

Contributor Author (richardc-db, Feb 4, 2025): @cashmand, in this case, the `variantShredding` table feature would be enabled on the table though, right? So a shredding-unaware reader won't be able to read the table in the first place. Maybe I'm missing something...

Comment: That's true. Is it possible that a customer could create a Delta table using external parquet files, and we wouldn't know that we need the shredding feature? It seems safer to me for a reader to fail.

Collaborator: Actually -- both the spark and parquet shredding specs specifically address compatibility of extra fields:

> Any fields in the same group as typed_value/variant_value that start with _ (underscore) can be ignored. This is intended to allow future backwards-compatible extensions. ... Any extra field names that do not start with an underscore should be assumed to be backwards incompatible, and readers should fail when reading such a schema.

Neither top-level binary variant spec has that provision for `value`, but the intent seems clear enough, so Delta should probably follow the same convention.

Comment: @scovich There's a large PR to change the shredding spec that is removing both of these: apache/parquet-format#461. The spec has simplified the fields to rename `untyped_value` to just `value`, in order to make unshredded variant a special case of shredded variant; the PR also removes the wording about underscores. @bart-samwel's suggestion to specifically call out `typed_value` sounds reasonable to me.

Collaborator: Agree -- if we know of a common failure mode where invalid data could be silently accepted, it's probably worth calling that out here, especially if `value` no longer disappears to make it obvious when shredding is in use. Alternatively -- should we take it up a level of abstraction? Require the Delta client to check for shredded files, and to reject them unless the `variantShredding` feature is also enabled?

Contributor: I think the simplest way to clarify this spec is to call out the `typed_value` field. As for checking for shredded files, that would require opening every file footer. That is too high of an overhead for this requirement, right?

Collaborator: Agree that requiring clients to check every footer is too much. It is enough to say what the rules are, and the reader can decide on their own how to handle the possibility of non-conforming writes.
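
A minimal sketch (assuming pyarrow; the helper name and check are hypothetical, not part of the spec) of the failure mode discussed above: a `variantType`-only reader rejecting a Variant struct that carries fields other than `value` and `metadata`, such as `typed_value` written by a shredding-aware writer:

```
import pyarrow.parquet as pq

ALLOWED_VARIANT_FIELDS = {"value", "metadata"}

def check_variant_struct(parquet_path: str, column: str) -> None:
    # Inspect only the footer schema; no data pages are read.
    struct_type = pq.read_schema(parquet_path).field(column).type
    names = {struct_type.field(i).name for i in range(struct_type.num_fields)}
    unexpected = names - ALLOWED_VARIANT_FIELDS
    if unexpected:
        # e.g. {"typed_value"}: the file may require the variantShredding feature.
        raise ValueError(
            f"Variant column {column!r} has unexpected fields {sorted(unexpected)}"
        )
```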


## Reader Requirements for Variant Data Type

When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
- must recognize and tolerate a `variant` data type in a Delta schema
- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file
- must make the column available to the engine:
- [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
- [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant.
- [Alternate] Convert the struct-of-binary to a string, and expose the string representation, e.g. if the engine does not support Variant.
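
For example, a minimal sketch (assuming pyarrow and the hypothetical file from the earlier writer example) of the [Alternate] option that exposes the raw physical struct-of-binary; actually interpreting the bytes would require an implementation of the Variant binary encoding:

```
import pyarrow.parquet as pq

# Read the physical struct-of-binary for the Variant column.
table = pq.read_table("variant_example.parquet", columns=["raw_data"])

for row in table.column("raw_data").to_pylist():
    if row is None:
        continue  # the column itself is nullable
    # An engine without Variant support can surface these buffers as-is,
    # or hand them to a Variant decoder / string converter.
    value_bytes, metadata_bytes = row["value"], row["metadata"]
    print(len(value_bytes), len(metadata_bytes))
```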

## Compatibility with other Delta Features

Feature | Support for Variant Data Type
-|-
Partition Columns | **Supported:** A Variant column is allowed to be a non-partitioned column of a partitioned table. <br/> **Unsupported:** Variant is not a comparable data type, so it cannot be included in a partition column.
Clustered Tables | **Supported:** A Variant column is allowed to be a non-clustering column of a clustered table. <br/> **Unsupported:** Variant is not a comparable data type, so it cannot be included in a clustering column.
Delta Column Statistics | **Supported:** A Variant column supports the `nullCount` statistic. <br/> **Unsupported:** Variant is not a comparable data type, so a Variant column does not support the `minValues` and `maxValues` statistics.
Generated Columns | **Supported:** A Variant column is allowed to be used as a source in a generated column expression, as long as the Variant type is not the result type of the generated column expression. <br/> **Unsupported:** The Variant data type is not allowed to be the result type of a generated column expression.
Delta CHECK Constraints | **Supported:** A Variant column is allowed to be used for a CHECK constraint expression.
Default Column Values | **Supported:** A Variant column is allowed to have a default column value.
Change Data Feed | **Supported:** A table using the Variant data type is allowed to enable the Delta Change Data Feed.
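
For example, the per-file `stats` JSON for a table with an integer column `id` and a Variant column `raw_data` (both names hypothetical) would record the Variant column only under `nullCount`:

```
{
  "numRecords": 100,
  "minValues": { "id": 1 },
  "maxValues": { "id": 100 },
  "nullCount": { "id": 0, "raw_data": 3 }
}
```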

# In-Commit Timestamps

The In-Commit Timestamps writer feature strongly associates a monotonically increasing timestamp with each commit by storing it in the commit's metadata.
@@ -1965,6 +2051,14 @@ type| Always the string "map".
keyType| The type of element used for the key of this map, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
valueType| The type of element used for the value of this map, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition

### Variant Type

Variant data uses the Delta type name `variant` for Delta schema serialization.

Field Name | Description
-|-
type | Always the string "variant"
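
For example, a nullable top-level Variant column serializes as the following struct field (matching the schema example earlier in this section):

```
{
  "name" : "raw_data",
  "type" : "variant",
  "nullable" : true,
  "metadata" : { }
}
```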

### Column Metadata
Column metadata stores various information about the column.
For example, this MAY contain some keys like [`delta.columnMapping`](#column-mapping) or [`delta.generationExpression`](#generated-columns) or [`CURRENT_DEFAULT`](#default-columns).
@@ -1,4 +1,6 @@
# Variant Data Type
**Folded into [PROTOCOL.md](../../protocol.md#variant-data-type)**

**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2864**

This protocol change adds support for the Variant data type.
@@ -56,13 +58,14 @@ metadata | binary | The binary-encoded Variant metadata, as described in [Varian

The parquet struct must include the two struct fields `value` and `metadata`.
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
~~Struct fields which start with `_` (underscore) can be safely ignored.~~

Variant shredding will be introduced in a separate `variantShredding` table feature.

## Writer Requirements for Variant Data Type

When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
- ~~must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.~~
- must not write additional parquet struct fields.

## Reader Requirements for Variant Data Type
