-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Write fields instead of spec object #846
Conversation
cc @kevinjqliu @syun64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a few comments
def _meta(self) -> Dict[str, str]: | ||
return { | ||
"schema": self._schema.model_dump_json(), | ||
"partition-spec": to_json(self._spec.fields).decode("utf-8"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why we dont want to use the same logic? Like
"partition-spec": self._spec.model_dump_json(),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the spec.fields returns a list, which is not a Pydantic object, but a native Python construct. So the method isn't available.
"schema": schema.model_dump_json(), | ||
"partition-spec": spec.model_dump_json(), | ||
"partition-spec-id": str(spec.spec_id), | ||
"format-version": "1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is 👍 since the version
function is defined
"schema": schema.model_dump_json(), | ||
"partition-spec": spec.model_dump_json(), | ||
"partition-spec-id": str(spec.spec_id), | ||
"format-version": "2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is 👍 since the version
function is defined
@@ -348,8 +348,8 @@ def test_write_manifest( | |||
|
|||
expected_metadata = { | |||
"schema": test_schema.model_dump_json(), | |||
"partition-spec": test_spec.model_dump_json(), | |||
"partition-spec-id": str(test_spec.spec_id), | |||
"partition-spec": """[{"source-id":1,"field-id":1,"transform":"identity","name":"VendorID"},{"source-id":2,"field-id":2,"transform":"identity","name":"tpep_pickup_datetime"}]""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to not hardcode this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually like that we are hardcoding this value because the issue wasn't caught because we inferred it from test_spec before :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to hardcore the expected value so it is clear what is being returned when you go over the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @Fokko - thank you for the quick fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late here. @Fokko Great catch! Thanks for fixing this and the refactoring :). @syun64 @kevinjqliu Thanks for reviewing!
Related to projectnessie#9042, Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either via a provided `Map<Integer, PartitionSpec>` or re-constructs it from Avro metadata attributes. pyiceberg until including version 0.6.1 however writes _invalid_ manifest files (see apache/iceberg-python#846) with the `partition-spec` Avro metadata attribute containing the JSON of the whole partition-spec instead of just the partition-spec fields. This change propagates the mentioned map down to the manifest-reader to work around this pyiceberg issue.
Related to #9042, Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either via a provided `Map<Integer, PartitionSpec>` or re-constructs it from Avro metadata attributes. pyiceberg until including version 0.6.1 however writes _invalid_ manifest files (see apache/iceberg-python#846) with the `partition-spec` Avro metadata attribute containing the JSON of the whole partition-spec instead of just the partition-spec fields. This change propagates the mentioned map down to the manifest-reader to work around this pyiceberg issue.
It should write the fields instead of the full spec: #208 (comment)
Also, did a small OOP refactor.