Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VARIANT] Accept
variantType
RFC #4096base: master
Are you sure you want to change the base?
[VARIANT] Accept
variantType
RFC #4096Changes from 1 commit
2120bc9
cdea198
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this reference to variant shredding link to the parquet variant shredding spec? Or is that overkill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to keep it separate for now because i imagine the parquet variant shredding spec may contain more information than is necessary for delta (i.e. the parquet binary format currently contains information for the binary encoding, which we don't include here).
I'd let @gene-db make the call here, though. Gene, do you have an opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason we need to forbid extra fields in this specific case?
Normally the Delta spec just says readers should ignore any unknown fields they might encounter, because the lack of a table feature to protect such fields means they do not affect correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think theres any particular reason, I think we can follow other features, i.e. modify this to say that readers can ignore fields that aren't
metadata
orvalue
.@gene-db or @cashmand do you see any issues with this? I figure this should be ok - the shredding table feature later will specify the "important" columns for shredding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason to fail is that shredding adds a
typed_value
column. If a shredding-unaware reader doesn't fail when it encounters a column other thanvalue
andmetadata
, it could incorrectly readvalue
andmetadata
, which might look valid, but would not contain the full value.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cashmad, in this case, the
VariantShredding
table feature would be enabled on the table though, right? So a shredding-unaware reader won't be able to read the table in the first place.maybe i'm missing something...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Is it possible that a customer could create a delta table using external parquet files, and we wouldn't know that we nee the shredding feature? It seems safer to me for a reader to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the spec have to require that tho? Seems like an implementation detail, how paranoid to be in validating various aspects of the spec? An implementation could still choose to warn/error if it finds obvious shredding-related columns in a parquet file when the shredding feature is not supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A similar situation could arise if an engine produced deletion vectors when DV feature is not supported.
Is that a scenario we worry about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not. But my understanding is that deletion vectors are a Delta concept, whereas shredding is (expected to be) part of the Parquet spec. So it seems easier to imagine importing some Parquet files written by another tool that unwittingly contain shredded data.
In any case, I think that was the motivation for including this line. If you don't think it's a concern, it should be fine to remove.