Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #305: ManifestEntry partition field schema should be dynamically … #307

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

arnaudbriche
Copy link

Avro schemas for Iceberg metadata files are now built dynamically with code.
I know this changes a lot of code, but I did not find other less intrusive ways to fix the issue.

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of comments plus you need to add the Apache license blurb at the top of the new files.

utils.go Outdated
Comment on lines 277 to 294
switch f.Type.(type) {
case *StringType:
sch = avro.NewPrimitiveSchema(avro.String, nil)
case *Int32Type:
sch = avro.NewPrimitiveSchema(avro.Int, nil)
case *Int64Type:
sch = avro.NewPrimitiveSchema(avro.Long, nil)
case *BinaryType:
sch = avro.NewPrimitiveSchema(avro.Bytes, nil)
case *BooleanType:
sch = avro.NewPrimitiveSchema(avro.Boolean, nil)
case *Float32Type:
sch = avro.NewPrimitiveSchema(avro.Float, nil)
case *Float64Type:
sch = avro.NewPrimitiveSchema(avro.Double, nil)
case *DateType:
sch = avro.NewPrimitiveSchema(avro.Int, avro.NewPrimitiveLogicalSchema(avro.Date))
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing some types here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Do we agree that we only need to handle primitive/scalar types here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm aware, yes. You can't partition on a struct/list/map type to my knowledge.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you see missing ones.

return b.m
}

type manifestFileV2 struct {
partitionType *StructType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be integrated with the FieldSummary instead of being at the top level here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In either case, we should probably be populating this after reading in a manifest file too, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it belongs to FieldSummary.
We also need to be able to do Avro -> Iceberg schema conversion to populate the field on read.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done Avro <--> Iceberg schema conversions.
Can you please help me with the populate part ? Maybe point me to the right part of the code ?

… single generic Must function

- create constants of non-parametric Avro schema to prevent wasting resources on rebuilding them each time
- add license blurb to all new files
…etadata: https://iceberg.apache.org/spec/#manifests

- fix a few bug in Metadata builder that triggers on table creation
@arnaudbriche
Copy link
Author

arnaudbriche commented Feb 20, 2025

I'm trying to create and maintain Iceberg tables from externally generated Parquet files with this lib.
The goal is for Clickhouse to be able to query the table.

While writing the various Avro metadata files with the lib, I encountered many errors from ClickHouse, which complained about missing metadata on the Avro manifest files.

As of the spec, Manifest files must have specific metadata to be valid: https://iceberg.apache.org/spec/#manifests

I thus created the WriteManifestEntries that writes the Avro Manifest files with proper metadata.
Not sure how to make the WriteEntries method of manifestFileV1 and manifestFileV2 write compliant Avro Manifest files.

As of now, the table produced is queryable by ClickHouse.

@zeroshade
Copy link
Member

@arnaudbriche i'll take a look and see if I can condense and simplify this a bit for you, probably quicker than going back and forth like this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants