🐛 Write fields instead of spec object #846

Merged: 1 commit, Jun 24, 2024

Conversation

Contributor

@Fokko Fokko commented Jun 21, 2024

It should write the fields instead of the full spec: #208 (comment)

Also, did a small OOP refactor.
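
For illustration, the difference between the two payloads can be sketched with plain dictionaries. The spec values below are made up to mirror the test fixture in this PR, and the `json` calls are only a stand-in for pyiceberg's own serializers.

```python
import json

# Hypothetical partition spec, shaped like the JSON a spec serializes to.
spec = {
    "spec-id": 0,
    "fields": [
        {"source-id": 1, "field-id": 1, "transform": "identity", "name": "VendorID"},
    ],
}

# Before the fix: the whole spec object ended up in the manifest metadata.
whole_spec_json = json.dumps(spec, separators=(",", ":"))

# After the fix: only the list of partition fields is written.
fields_json = json.dumps(spec["fields"], separators=(",", ":"))

print(whole_spec_json)
print(fields_json)
```

The first value is a JSON object wrapping the fields, the second is a bare JSON array, which is the shape this PR switches to.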

Contributor Author

Fokko commented Jun 21, 2024

cc @kevinjqliu @syun64

Contributor

@kevinjqliu kevinjqliu left a comment

added a few comments

def _meta(self) -> Dict[str, str]:
    return {
        "schema": self._schema.model_dump_json(),
        "partition-spec": to_json(self._spec.fields).decode("utf-8"),

Contributor

Is there a reason we don't want to use the same logic here? For example:

                "partition-spec": self._spec.model_dump_json(),

Contributor Author

Unfortunately, spec.fields returns a list, which is a native Python construct rather than a Pydantic object, so model_dump_json() isn't available on it.
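
A minimal sketch of that constraint, using a hypothetical `Field` class with a `model_dump_json()` method as a stand-in for a Pydantic model (neither Pydantic nor pyiceberg is imported here). The container holding the fields is a plain Python sequence, so it needs a separate helper to serialize it; the made-up `fields_to_json` plays that role.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for a model object with its own JSON dump method."""

    name: str
    transform: str

    def model_dump_json(self) -> str:
        return json.dumps(asdict(self), separators=(",", ":"))


def fields_to_json(fields: List[Field]) -> bytes:
    """Hypothetical helper: serialize a plain list of models to JSON bytes."""
    return json.dumps([asdict(f) for f in fields], separators=(",", ":")).encode("utf-8")


fields = [Field(name="VendorID", transform="identity")]

# A plain list has no model_dump_json(), even when every element does.
has_method = hasattr(fields, "model_dump_json")

# The helper returns bytes, so the caller decodes to get a str for the metadata map.
payload = fields_to_json(fields).decode("utf-8")
```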

"schema": schema.model_dump_json(),
"partition-spec": spec.model_dump_json(),
"partition-spec-id": str(spec.spec_id),
"format-version": "1",

Contributor

this is 👍 since the version function is defined

"schema": schema.model_dump_json(),
"partition-spec": spec.model_dump_json(),
"partition-spec-id": str(spec.spec_id),
"format-version": "2",

Contributor

this is 👍 since the version function is defined

@@ -348,8 +348,8 @@ def test_write_manifest(

expected_metadata = {
"schema": test_schema.model_dump_json(),
"partition-spec": test_spec.model_dump_json(),
"partition-spec-id": str(test_spec.spec_id),
"partition-spec": """[{"source-id":1,"field-id":1,"transform":"identity","name":"VendorID"},{"source-id":2,"field-id":2,"transform":"identity","name":"tpep_pickup_datetime"}]""",

Contributor

is it possible to not hardcode this value?

Collaborator

I actually like that we are hardcoding this value: the issue wasn't caught before precisely because we inferred the expected value from test_spec :)

Contributor

I see, makes sense

Contributor Author

I prefer to hardcode the expected value so it is clear what is being returned when you go over the tests.
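
The point made in this thread can be sketched with a toy writer. `buggy_meta` is a hypothetical function that reproduces the mistake of writing the whole spec; it is not pyiceberg code. A check that derives its expected value from the same serialization passes anyway, while a hardcoded literal exposes the bug.

```python
import json

# Made-up spec mirroring the shape of the test fixture.
SPEC = {
    "spec-id": 0,
    "fields": [
        {"source-id": 1, "field-id": 1, "transform": "identity", "name": "VendorID"},
    ],
}


def dump(obj) -> str:
    return json.dumps(obj, separators=(",", ":"))


def buggy_meta(spec: dict) -> dict:
    """Hypothetical writer with the bug: serializes the whole spec."""
    return {"partition-spec": dump(spec)}


actual = buggy_meta(SPEC)

# Expected value derived the same way the writer derives it: the bug goes unnoticed.
derived_expected = {"partition-spec": dump(SPEC)}
derived_check_passes = actual == derived_expected

# Expected value written out by hand: the mismatch is caught.
hardcoded_expected = {
    "partition-spec": '[{"source-id":1,"field-id":1,"transform":"identity","name":"VendorID"}]'
}
hardcoded_check_passes = actual == hardcoded_expected
```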

Collaborator

@sungwy sungwy left a comment

LGTM @Fokko - thank you for the quick fix!

Contributor

@HonahX HonahX left a comment

Sorry for being late here. @Fokko Great catch! Thanks for fixing this and the refactoring :). @syun64 @kevinjqliu Thanks for reviewing!

@HonahX HonahX merged commit 8cdf4ab into apache:main Jun 24, 2024
7 checks passed
@Fokko Fokko deleted the fd-buggg branch June 24, 2024 07:03
snazy added a commit to snazy/nessie that referenced this pull request Jul 18, 2024
Related to projectnessie#9042. Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either via a provided `Map<Integer, PartitionSpec>` or reconstructs it from Avro metadata attributes. pyiceberg up to and including version 0.6.1, however, writes _invalid_ manifest files (see apache/iceberg-python#846) in which the `partition-spec` Avro metadata attribute contains the JSON of the whole partition spec instead of just the partition-spec fields.

This change propagates the mentioned map down to the manifest-reader to work around this pyiceberg issue.
snazy added a commit to projectnessie/nessie that referenced this pull request Jul 18, 2024
Related to #9042. Iceberg's `o.a.iceberg.ManifestReader.ManifestReader()` extracts the partition spec either via a provided `Map<Integer, PartitionSpec>` or reconstructs it from Avro metadata attributes. pyiceberg up to and including version 0.6.1, however, writes _invalid_ manifest files (see apache/iceberg-python#846) in which the `partition-spec` Avro metadata attribute contains the JSON of the whole partition spec instead of just the partition-spec fields.

This change propagates the mentioned map down to the manifest-reader to work around this pyiceberg issue.
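
The Nessie change referenced above works around the problem by passing the known specs down to the reader. As a separate illustration only (this is not what that change does), a reader could also detect which shape the `partition-spec` attribute has. `parse_partition_fields` is a hypothetical helper, and the sketch assumes the invalid payload is a JSON object carrying the fields under a `fields` key.

```python
import json
from typing import Any, Dict, List


def parse_partition_fields(raw: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: accept either a bare field list or a whole spec object."""
    value = json.loads(raw)
    if isinstance(value, list):
        # Valid shape: just the partition fields.
        return value
    if isinstance(value, dict) and isinstance(value.get("fields"), list):
        # Assumed invalid shape: the whole spec object was written.
        return value["fields"]
    raise ValueError("unrecognized partition-spec payload")


valid = parse_partition_fields('[{"name":"VendorID"}]')
legacy = parse_partition_fields('{"spec-id":0,"fields":[{"name":"VendorID"}]}')
```

Both calls yield the same list of field dictionaries, so downstream code would not need to care which writer produced the manifest.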