jeroko commented Oct 27, 2025

Rationale for this change

Closes #2131

This PR relaxes the constraint that prevented adding any file with field IDs and replaces it with one that rejects only files whose field IDs are inconsistent with the field IDs of the table. Files with compatible field IDs can be added safely; files with incompatible field IDs are rejected.
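
To illustrate the intended behavior, here is a minimal sketch of such a consistency check in plain Python. It is not the PyIceberg implementation: `find_field_id_conflicts` and the dict-based schema representation (column name to field ID) are hypothetical and only show the accept/reject rule described above.

```python
from typing import Dict, List, Optional, Tuple


def find_field_id_conflicts(
    table_ids: Dict[str, int],
    file_ids: Dict[str, Optional[int]],
) -> List[Tuple[str, int, int]]:
    """Return (column, file_id, table_id) for each column whose file field ID disagrees with the table."""
    conflicts = []
    for name, file_id in file_ids.items():
        table_id = table_ids.get(name)
        # A column without a field ID (None) has nothing to contradict, and a
        # column unknown to the table is left to the regular schema check.
        if file_id is None or table_id is None:
            continue
        if file_id != table_id:
            conflicts.append((name, file_id, table_id))
    return conflicts


table_ids = {"id": 1, "name": 2}
print(find_field_id_conflicts(table_ids, {"id": 1, "name": 2}))  # [] -> file can be added
print(find_field_id_conflicts(table_ids, {"id": 2, "name": 1}))  # [('id', 2, 1), ('name', 1, 2)] -> file is rejected
```

An empty result means the file's field IDs are consistent with the table; any non-empty result would cause the file to be rejected.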

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

jeroko force-pushed the remove-field_id-constraint-on-add_files branch 2 times, most recently from 0b599c6 to d580102 October 27, 2025 14:00
jeroko force-pushed the remove-field_id-constraint-on-add_files branch from d580102 to 1addf60 October 27, 2025 14:31
jeroko marked this pull request as ready for review October 27, 2025 14:57
jeroko requested a review from Fokko October 29, 2025 08:19
Contributor

kevinjqliu left a comment

Thanks for the PR! This is a great addition. Added a few comments

Comment on lines 2644 to 2647
Contributor

I think we should at least check that the Parquet field IDs align with the Iceberg field IDs.

Author

Hi @kevinjqliu, what kind of extra ID alignment do you expect that is not already covered by `_check_schema_compatible`?

`add_files` can work with Parquet files both with and without field IDs in their metadata:

- **Files with field IDs**: When field IDs are present in the Parquet metadata, they must match the corresponding field IDs in the Iceberg table schema. This is common for files generated by tools like Spark or by other libraries that write explicit field ID metadata.
- **Files without field IDs**: When field IDs are absent, the table must have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) to map field names to Iceberg field IDs. `add_files` will automatically create a Name Mapping based on the table's current schema if one doesn't already exist.

In both cases, a Name Mapping is created if the table doesn't have one, ensuring compatibility with various readers.
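
As a rough illustration of the second case, the sketch below builds a flat name mapping in the serialized shape the Iceberg spec describes (a list of entries with `field-id` and `names`) and uses it to assign field IDs to the columns of a file that has none. The helpers `build_name_mapping` and `resolve_ids_by_name` are hypothetical, cover only top-level fields, and are not how PyIceberg implements this.

```python
from typing import Dict, List


def build_name_mapping(schema_fields: Dict[str, int]) -> List[dict]:
    """Build a flat name mapping from {column name: field ID} for top-level fields only."""
    return [{"field-id": field_id, "names": [name]} for name, field_id in schema_fields.items()]


def resolve_ids_by_name(column_names: List[str], name_mapping: List[dict]) -> Dict[str, int]:
    """Assign field IDs to the columns of a file without field IDs, using the name mapping."""
    lookup = {name: entry["field-id"] for entry in name_mapping for name in entry["names"]}
    # Columns that the mapping does not cover are simply skipped in this sketch.
    return {name: lookup[name] for name in column_names if name in lookup}


mapping = build_name_mapping({"id": 1, "name": 2})
print(mapping)  # [{'field-id': 1, 'names': ['id']}, {'field-id': 2, 'names': ['name']}]
print(resolve_ids_by_name(["name", "id", "extra"], mapping))  # {'name': 2, 'id': 1}
```

Resolution here depends only on column names, which is why a file without field IDs needs the table to carry a Name Mapping.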
Contributor

For Parquet files with field IDs, I don't think we necessarily need the name mapping if the IDs are aligned with the table schema field IDs.
But we can address this separately.

Development

Successfully merging this pull request may close these issues.

Add files support for parquet field_ids