Build new TableMetadata without reassigning field IDs #919

Open
jdockerty opened this issue Jan 28, 2025 · 0 comments

This is in part a question and open for discussion.

When building TableMetadata through the TableMetadataBuilder, every option for building "from scratch" forces a reassignment of field IDs.

I noticed that any desired TableMetadata could be produced by constructing the struct directly, but all of its fields are restricted to pub(crate) scope. I suspect the reason for this is safety, i.e. ensuring that creation goes through the builder pattern, where the relevant checks are performed on the call to build().

Questions:

  1. Would it be problematic to lift the restriction on the TableMetadata fields to be pub [1], or to allow the creation of TableMetadata without reassigning field IDs?
  2. If the above is not possible, is there an example of creating the iceberg metadata file hierarchy in the correct way?
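
To make question 1 concrete, here is a minimal sketch of what an opt-out could look like while keeping the fields private. Everything in it (Field, Metadata, MetadataBuilder, preserve_field_ids) is a hypothetical stand-in invented for illustration, not the iceberg-rust API, and the uniqueness check is only an assumed example of the kind of validation build() might perform.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a schema field; not the iceberg-rust type.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub id: i32,
    pub name: String,
}

/// The field list stays private, so values can only come from the builder.
#[derive(Debug, PartialEq)]
pub struct Metadata {
    fields: Vec<Field>,
}

impl Metadata {
    pub fn fields(&self) -> &[Field] {
        &self.fields
    }
}

pub struct MetadataBuilder {
    fields: Vec<Field>,
    reassign_ids: bool,
}

impl MetadataBuilder {
    /// Defaults to reassigning IDs, mirroring the behaviour described above.
    pub fn new(fields: Vec<Field>) -> Self {
        Self { fields, reassign_ids: true }
    }

    /// Hypothetical opt-out: keep the IDs supplied by the caller.
    pub fn preserve_field_ids(mut self) -> Self {
        self.reassign_ids = false;
        self
    }

    pub fn build(mut self) -> Result<Metadata, String> {
        if self.reassign_ids {
            // Assign 1, 2, 3, ... in the order the fields were given.
            for (position, field) in self.fields.iter_mut().enumerate() {
                field.id = position as i32 + 1;
            }
        } else {
            // Caller-supplied IDs are kept, so validate them instead.
            let mut seen = HashSet::new();
            for field in &self.fields {
                if !seen.insert(field.id) {
                    return Err(format!("duplicate field id {}", field.id));
                }
            }
        }
        Ok(Metadata { fields: self.fields })
    }
}
```

With this shape, safety still comes from build(): the default path renumbers, and the opt-out path trades renumbering for validation of whatever IDs were passed in.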

For extra context, we're currently constructing Iceberg metadata around pre-existing parquet files written by another system; however, there is no Iceberg catalog or prior metadata JSON. I noticed there is also a StaticTable, but it requires either pre-existing JSON from FileIO or an input TableMetadata, and the second option brings us back to the issue above.

This reassignment leads to a mismatch between the field IDs shown in the table metadata JSON and those in the actual parquet file:

Parquet schema:
required group field_id=-1 arrow_schema {
  optional binary field_id=2 cpu (String);
  optional binary field_id=3 host1 (String);
  optional int64 field_id=1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

The IDs are reassigned according to the order in which the fields appear in the parquet/arrow Schema, rather than the given field IDs.

Iceberg metadata JSON schema snippet:

  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        {
          "id": 1, <-- field_id=2 in parquet
          "name": "cpu",
          "required": false,
          "type": "string"
        },
        {
          "id": 2, <-- field_id=3 in parquet
          "name": "host1",
          "required": false,
          "type": "string"
        },
        {
          "id": 3, <-- field_id=1 in parquet
          "name": "time",
          "required": false,
          "type": "timestamp"
        }
      ]
    }
  ],
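
The mismatch matters because Iceberg tracks columns by field ID. The sketch below shows which parquet column each metadata entry would resolve to under a purely ID-based lookup. It uses only the standard library; resolve_by_id and the tuple layout are hypothetical names for this illustration and do not mirror how any particular reader is implemented.

```rust
use std::collections::HashMap;

/// For each (id, name) in the metadata, return the name of the parquet column
/// carrying that field ID, or None if no parquet column has that ID.
fn resolve_by_id(
    metadata: &[(i32, &str)],
    parquet: &[(i32, &str)],
) -> Vec<(String, Option<String>)> {
    // Index the parquet columns by their embedded field_id.
    let by_id: HashMap<i32, &str> = parquet.iter().copied().collect();
    metadata
        .iter()
        .map(|(id, name)| (name.to_string(), by_id.get(id).map(|n| n.to_string())))
        .collect()
}
```

Called with the metadata IDs above, [(1, "cpu"), (2, "host1"), (3, "time")], and the parquet IDs [(2, "cpu"), (3, "host1"), (1, "time")], every metadata column resolves to a different parquet column than its name suggests.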

Footnotes

  1. Since this would conflict with the native Java implementation, I suspect it would also be problematic in the Rust version.
