Manifests table scan should return Iceberg schema rather than Arrow schema #868
Comments
Makes sense to me.
👍 let me fix this
Also happy to help with this but not sure why / if we want this. Looking at the DataFusion integration, it seems useful for the metadata If we need to expose the Arrow schema for 1) the DataFusion integration and 2) to build
The reason I suggest returning the Iceberg schema is that the metadata table is a concept in the Iceberg library, not only in the DataFusion integration. The difference is that the Iceberg library will be used by more engines, such as datafuse, Polars, etc. We provide a record batch stream for convenience; IMO we should provide a similar plan-files API so that other engines could consume it, but since Arrow is the de facto standard for in-memory data exchange, I'm fine with keeping the scan API.
The Iceberg schema requirement really complicates things because of how we handle Iceberg field ids in Arrow types. I'm trying to explain in #863 (comment). Basically, it's difficult to build Arrow arrays with nested types if you need the nested types to have the right ids in field metadata. (It's certainly possible; it just seems complicated to me because I'm not seeing the simple solution. I'm new to arrow-rs and iceberg-rust.) I do get your argument that Iceberg wants to be agnostic of Arrow. But the engines that Iceberg Rust is going to integrate with any time soon do rely on Arrow (DataFusion, Polars, PyIceberg-on-Rust). So they're well-served if we return Arrow tables, regardless of what Maybe we can avoid the discussion of the @liurenjie1024, would appreciate your thoughts on this.
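To make the field-id concern concrete, here is a minimal std-only sketch. The `Ty` and `Field` types are hypothetical stand-ins, not the arrow-rs API, and the column names and ids are invented for illustration. It assumes ids are carried in per-field metadata under the `PARQUET:field_id` key, and shows that every nested field (list elements, struct children) needs its own id before the schema can be mapped back to an Iceberg schema.

```rust
use std::collections::HashMap;

// Assumed metadata key under which the Iceberg field id is stored.
const FIELD_ID_KEY: &str = "PARQUET:field_id";

// Hypothetical, simplified stand-ins for Arrow's data type and field.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Utf8,
    Struct(Vec<Field>),
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: Ty,
    metadata: HashMap<String, String>,
}

// Builds a field that carries its Iceberg id in metadata.
fn field_with_id(name: &str, ty: Ty, id: i32) -> Field {
    let mut metadata = HashMap::new();
    metadata.insert(FIELD_ID_KEY.to_string(), id.to_string());
    Field { name: name.to_string(), ty, metadata }
}

// Builds a field without any id, as a plain Arrow builder would produce.
fn field_without_id(name: &str, ty: Ty) -> Field {
    Field { name: name.to_string(), ty, metadata: HashMap::new() }
}

// Visits one field and its children depth-first, pushing each id.
// Returns None as soon as a field has no parseable id.
fn visit(f: &Field, out: &mut Vec<i32>) -> Option<()> {
    let id = f.metadata.get(FIELD_ID_KEY)?.parse::<i32>().ok()?;
    out.push(id);
    match &f.ty {
        Ty::Struct(children) => {
            for child in children {
                visit(child, out)?;
            }
        }
        Ty::List(element) => visit(element, out)?,
        Ty::Int64 | Ty::Utf8 => {}
    }
    Some(())
}

// Collects all field ids of a schema, or None if any level lacks one.
fn collect_field_ids(fields: &[Field]) -> Option<Vec<i32>> {
    let mut out = Vec::new();
    for f in fields {
        visit(f, &mut out)?;
    }
    Some(out)
}

// Example schema: a string column plus a list of structs (ids are made up).
fn example_schema(tag_inner_fields: bool) -> Vec<Field> {
    let inner = if tag_inner_fields {
        vec![
            field_with_id("lower_bound", Ty::Utf8, 4),
            field_with_id("count", Ty::Int64, 5),
        ]
    } else {
        vec![
            field_without_id("lower_bound", Ty::Utf8),
            field_without_id("count", Ty::Int64),
        ]
    };
    let element = field_with_id("element", Ty::Struct(inner), 3);
    vec![
        field_with_id("path", Ty::Utf8, 1),
        field_with_id("summaries", Ty::List(Box::new(element)), 2),
    ]
}
```

Calling `collect_field_ids(&example_schema(true))` yields the ids of all five fields, while `example_schema(false)` fails the conversion because the struct children inside the list carry no id. This is the extra bookkeeping that nested array builders would have to do under the approach discussed here.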
Similar to @liurenjie1024, I'm in favor of exposing the schema as Iceberg instead of Arrow. I think in general it is not good to expose third-party libraries in your public API (since you don't control them, and what happens if something else comes along?).
Originally posted by @liurenjie1024 in #861 (comment)