This is in part a question and open for discussion.
When building TableMetadata through the TableMetadataBuilder, all options for building "from scratch" force a reassignment of field IDs:
- TableMetadataBuilder::new
- TableMetadataBuilder::from_table_creation, as this is a wrapper over TableMetadataBuilder::new using the TableCreation struct.
I noticed that it would be possible to get any desired TableMetadata by constructing the object directly, but all of its fields are restricted to pub(crate) scope. I suspect the reason for this is safety, i.e. ensuring that creation occurs through the builder pattern, where the relevant checks are performed on the call to build().
Questions:
- Would it be problematic to lift the restriction on the TableMetadata fields to be pub [1], or to allow the creation of TableMetadata without reassigning field IDs?
- If the above is not possible, is there an example of creating the iceberg metadata file hierarchy in the correct way?
For extra context, we're currently constructing Iceberg metadata around pre-existing parquet files written by another system; there is no Iceberg catalog or prior metadata JSON. I noticed there is also a StaticTable, but it requires either pre-existing JSON read through FileIO or an input TableMetadata, and the second option brings us back to the issue above.
This reassignment leads to a mismatch between the field IDs shown in the table metadata JSON and those in the actual parquet file:
parquet schema
required group field_id=-1 arrow_schema {
optional binary field_id=2 cpu (String);
optional binary field_id=3 host1 (String);
optional int64 field_id=1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
iceberg metadata JSON schema snippet
The reassignment follows the order in which the fields appear within the parquet/arrow Schema, rather than the given field IDs:
"schemas": [
{
"schema-id": 0,
"type": "struct",
"fields": [
{
"id": 1, <-- field_id=2 in parquet
"name": "cpu",
"required": false,
"type": "string"
},
{
"id": 2, <-- field_id=3 in parquet
"name": "host1",
"required": false,
"type": "string"
},
{
"id": 3, <-- field_id=1 in parquet
"name": "time",
"required": false,
"type": "timestamp"
}
]
}
],
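To make the mismatch concrete, here is a minimal, self-contained sketch of the behaviour described above. It does not use iceberg-rust; the Field struct and the reassign_in_order helper are hypothetical stand-ins that only model a flat schema and assume IDs are handed out sequentially from 1 in schema order, which is what the JSON snippet above suggests.

```rust
/// Hypothetical stand-in for a schema field: the field ID plus the column name.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: &'static str,
}

/// Fields as they appear in the parquet schema above, in schema order.
fn parquet_fields() -> Vec<Field> {
    vec![
        Field { id: 2, name: "cpu" },
        Field { id: 3, name: "host1" },
        Field { id: 1, name: "time" },
    ]
}

/// Hypothetical model of the reassignment: IDs are handed out as 1, 2, 3, ...
/// following the order of the fields, ignoring the IDs they arrived with.
fn reassign_in_order(fields: &[Field]) -> Vec<Field> {
    fields
        .iter()
        .enumerate()
        .map(|(pos, f)| Field { id: pos as i32 + 1, name: f.name })
        .collect()
}

fn main() {
    let original = parquet_fields();
    let reassigned = reassign_in_order(&original);
    // Print the old and new ID side by side for each column.
    for (old, new) in original.iter().zip(&reassigned) {
        println!(
            "{}: parquet field_id={} -> metadata id={}",
            old.name, old.id, new.id
        );
    }
}
```

Under these assumptions every column ends up with a different ID in the metadata than in the file, so a reader that resolves columns by field ID would look up the wrong parquet column.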
This is also referenced by a question in the iceberg slack
Footnotes
[1] Considering this conflicts with the native Java implementation, I would also suspect it is problematic to do in the Rust version.