[VARIANT] Accept variantType RFC #4096

It seems a bit odd to specifically mention the schema serialization format here? AFAIK it didn't change with this table feature?
Meanwhile, it seems like we should specifically state that the new data type is called `variant`, rather than relying on the example below to show it?
Ah -- the schema serialization section adds a new entry for variant, which explains that the type name is `variant`.
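For concreteness, here is a rough sketch of what such a serialized-schema entry could look like, modeled as a Python dict mirroring Delta's JSON field serialization. The column name `v` and the surrounding details are illustrative assumptions, not taken from the RFC:

```python
import json

# Hypothetical serialized-schema entry for a variant column. The
# {name, type, nullable, metadata} shape follows Delta's usual field
# serialization; the column name "v" is made up for illustration.
variant_field = {
    "name": "v",
    "type": "variant",  # the new primitive type name under discussion
    "nullable": True,
    "metadata": {},
}

print(json.dumps(variant_field, indent=2))
```
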
According to the variant spec, a variant column could store a JSON `null` value even if the variant field is non-nullable, as in this example. How is that handled? Does the Delta spec need to say anything about it, or is that the query engine's problem, independent of Delta?
Hmm, yeah, I think this is less a storage-layer issue and more of an engine issue. That should be addressed in the actual binary layout of variant (which is covered in the parquet spec).
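To make the distinction concrete, a small modeling sketch (hypothetical Python types, not any engine's or the spec's API): a variant cell holding the JSON literal null is a *present* value, which is different from the cell being SQL NULL.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Variant:
    """Hypothetical stand-in for an engine's in-memory variant value."""
    value: Any  # may legitimately be None, i.e. the JSON literal null

# A non-null variant whose content is JSON null: the cell exists.
json_null_cell: Optional[Variant] = Variant(value=None)

# SQL NULL: there is no variant value at all. A non-nullable variant
# column forbids this case, but can still hold the JSON null above.
sql_null_cell: Optional[Variant] = None

assert json_null_cell is not None and sql_null_cell is None
```
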
Should this reference to variant shredding link to the parquet variant shredding spec? Or is that overkill?
I think it makes sense to keep it separate for now, because I imagine the parquet variant shredding spec may contain more information than is necessary for Delta (e.g., the parquet spec currently contains details of the binary encoding, which we don't include here).
I'd let @gene-db make the call here, though. Gene, do you have an opinion?
Maybe a better way to put it -- should we give some indication here of what "shredding" even is? (a hyperlink to the spec being perhaps the least intrusive approach)?
I think it would make sense to link to the parquet shredding spec.
With the link, people can refer to it to get more details about the shredding, but we don't have to put those details in the text here.
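For readers who land here without the link handy, a rough sketch of the physical layouts in question, written with pyarrow types purely for illustration. The field names follow the parquet variant spec; the concrete type of `typed_value` depends on the shredding schema and is arbitrarily assumed to be int64 here:

```python
import pyarrow as pa

# Unshredded variant: an opaque binary value plus its metadata dictionary.
unshredded_variant = pa.struct([
    ("metadata", pa.binary()),
    ("value", pa.binary()),
])

# Shredded variant: shredding additionally materializes part of the value
# into a strongly typed column, so `value` alone no longer carries the
# full content. The int64 is an arbitrary example type.
shredded_variant = pa.struct([
    ("metadata", pa.binary()),
    ("value", pa.binary()),
    ("typed_value", pa.int64()),
])
```
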
Is there a particular reason we need to forbid extra fields in this specific case?
Normally the Delta spec just says readers should ignore any unknown fields they might encounter, because the lack of a table feature to protect such fields means they do not affect correctness.
I don't think there's any particular reason; I think we can follow other features, i.e. modify this to say that readers can ignore fields that aren't `metadata` or `value`.
@gene-db or @cashmand, do you see any issues with this? I figure this should be OK - the shredding table feature later will specify the "important" columns for shredding.
One reason to fail is that shredding adds a `typed_value` column. If a shredding-unaware reader doesn't fail when it encounters a column other than `value` and `metadata`, it could incorrectly read `value` and `metadata`, which might look valid but would not contain the full value.
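A minimal sketch of the fail-fast behavior being argued for, assuming the reader can enumerate the field names of a variant column from a file footer; the helper and names are hypothetical, not part of the Delta spec:

```python
EXPECTED_VARIANT_FIELDS = {"metadata", "value"}

def validate_variant_column(field_names: set[str]) -> None:
    """Fail on unknown fields rather than silently reading a partial value."""
    extra = field_names - EXPECTED_VARIANT_FIELDS
    if extra:
        # e.g. {"typed_value"} written by a shredding-aware writer: reading
        # only `value` and `metadata` would look valid but lose data.
        raise ValueError(f"unsupported variant fields: {sorted(extra)}")

validate_variant_column({"metadata", "value"})                   # ok
# validate_variant_column({"metadata", "value", "typed_value"})  # raises
```
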
@cashmand, in this case, the `VariantShredding` table feature would be enabled on the table though, right? So a shredding-unaware reader won't be able to read the table in the first place.
Maybe I'm missing something...
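For what it's worth, the gating described here might look roughly like the following; the feature names and the set-based check are assumptions for illustration, not the actual Delta protocol API:

```python
# Reader features this hypothetical client understands. A client that has
# not implemented shredding simply never lists "variantShredding" here.
SUPPORTED_READER_FEATURES = {"variantType-preview"}

def can_read_table(required_reader_features: set[str]) -> bool:
    """Refuse any table requiring an unknown reader feature, so a
    shredding-unaware reader never opens shredded files at all."""
    return required_reader_features <= SUPPORTED_READER_FEATURES

assert can_read_table({"variantType-preview"})
assert not can_read_table({"variantType-preview", "variantShredding"})
```
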
That's true. Is it possible that a customer could create a Delta table using external parquet files, and we wouldn't know that we need the shredding feature? It seems safer to me for a reader to fail.
Actually --
Both the spark and parquet shredding specs specifically address compatibility of extra fields. Neither top-level binary variant spec has that provision for `value`, but the intent seems clear enough, and so Delta should probably follow the same convention.
@scovich There's a large PR to change the shredding spec that is removing both of these: apache/parquet-format#461.
The spec has simplified the fields by renaming `untyped_value` to just `value`, in order to make unshredded variant a special case of shredded variant; the PR also removes the wording about the underscore.
@bart-samwel's suggestion to specifically call out `typed_value` sounds reasonable to me.
Agree -- if we know of a common failure mode where invalid data could be silently accepted, it's probably worth calling that out here. Especially if `value` no longer disappears to make it obvious when shredding is in use.
Alternatively -- should we take it up a level of abstraction? Require the Delta client to check for shredded files, and to reject them unless the `variantShredding` feature is also enabled?
I think the simplest way to clarify this spec is to call out the `typed_value` field.
As for checking for shredded files, that would require opening every file footer. That is too high an overhead for this requirement, right?
Agree that requiring clients to check every footer is too much. Enough to say what the rules are, and the reader can decide on their own how to handle the possibility of non-conforming writes.