Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Introduce C FFI for iceberg rust #966

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

feat: Introduce C FFI for iceberg rust #966

wants to merge 10 commits into from

Conversation

Xuanwo
Copy link
Member

@Xuanwo Xuanwo commented Feb 13, 2025

This PR introduces a C FFI for iceberg-rust, allowing users to interact with Iceberg from C.

This PR is a PoC demonstrating how this can be implemented and what it will look like. Most of the code is inspired by https://github.com/apache/opendal.


The iceberg-c project will be written in Rust but will expose a C ABI as .so and .a files. This design ensures that rust developers don't need to worry about the C part or handle C-related tasks. Everything related to the C API will be contained within bindings/c.

We also use cbindgen to generate a header file based on our public API. This allows users to link iceberg-c just as they would with a C or C++ library.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license-eye has checked 261 files.

Valid Invalid Ignored Fixed
221 1 39 0
Click to see the invalid file list
  • bindings/c/.gitignore
Use this command to fix any missing license headers
```bash

docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix

</details>

Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Xuanwo for this great pr, it looks like a good starting point!


#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub enum PrimitiveLiteralTag {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worring about maintainability of this approach? Should we have things like:

pub extern "C" fn datatum_int(val: i32, datum_out: *mut *mut Datum) {
 ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that we can't expose Datum to C side directly.

liurenjie1024
liurenjie1024 previously approved these changes Feb 18, 2025
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Xuanwo for this pr! LGTM as initial move toward c ffi, let's wait for a moment to have others to take a look.

@manuzhang
Copy link
Contributor

@Xuanwo Just curious how would this interface work with iceberg-cpp? Will there be collaboration or duplication?

@Xuanwo
Copy link
Member Author

Xuanwo commented Feb 18, 2025

Just curious how would this interface work with iceberg-cpp? Will there be collaboration or duplication?

Thanks for the question! I have to admit that there is some duplication for specific use cases, such as when users only need a C ABI, which both the iceberg-rust C binding and iceberg-cpp can provide—although neither of them is fully functional at the moment.

However, I started this C binding with the intention of fostering collaboration:

  • The iceberg-rust C binding exposes a C ABI, unlocking significant potential for broader community engagement with languages like Lua, Swift, Zig, PHP, and more.
  • iceberg-cpp could leverage our C binding for certain components like file I/O and catalog management.

@manuzhang
Copy link
Contributor

I'd suggest notifying the dev list on this feature and open early discussion with iceberg-cpp folks for collaboration.

# specific language governing permissions and limitations
# under the License.

build/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a separate .gitignore file needed instead of updating that under root directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to keep bindings/c as a mostly separate project so that C developers can manage it as they see fit without impacting the entire project.

[package]
name = "iceberg-c"
publish = false
edition = "2021"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this edition mean? why is it 2021?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, it's the rust edition, and 2021 is the latest one. The 2024 edition will be next, but it hasn't been released yet.

@wgtmac
Copy link
Member

wgtmac commented Feb 20, 2025

With regard to the possible integration with iceberg-cpp (or even replace it), I have some initial questions (might be dumb since I'm a newbie to the Rust world):

  1. Is it a battery-included solution or is flexible enough to let downstreams plugin their own implementations of I/O (cloud object store, hdfs, async I/O, etc.), file formats (parquet/avro/orc)? This has been discussed a lot in the iceberg-cpp project and should be the 1st priority.
  2. Is there any limitation to support different architectures (e.g. ARM) or platforms (e.g. Windows)?
  3. Is it possible to automatically generate C++ headers to be more friendly to C++?
  4. Is there any performance penalty compared to a pure C/C++ implementation?

@lidavidm
Copy link
Member

I'm also curious if there's any potential for linking issues if say, we use this (with OpenDAL enabled) and also OpenDAL's C++ bindings separately. (For another project, we have received reports of multiple Go-built shared objects being linked into one application causing problems on some platforms - presumably Rust, lacking a heavy language runtime, wouldn't cause such issues but it'd be good to think about).

I'm also assuming (given Rust has no stable ABI) that this builds one shared object with all enabled features; given that maybe another thing to think about (down the line!) is integration with CMake etc. so consumers can choose which features to build and link more easily?

@Xuanwo
Copy link
Member Author

Xuanwo commented Feb 21, 2025

For @wgtmac

  • Is it a battery-included solution or is flexible enough to let downstreams plugin their own implementations of I/O (cloud object store, hdfs, async I/O, etc.), file formats (parquet/avro/orc)? This has been discussed a lot in the iceberg-cpp project and should be the 1st priority.

It will be battery-included solution but can split into different modules so that users can use their own IO.

  • Is there any limitation to support different architectures (e.g. ARM) or platforms (e.g. Windows)?

No limitation on mainstream platforms. (x86_64, ARM) × (Linux, macOS, Windows) are known to be supported.

The complete list of supported platforms can be found at https://doc.rust-lang.org/nightly/rustc/platform-support.html.

  • Is it possible to automatically generate C++ headers to be more friendly to C++?

cbindgen also supports generating C++ headers, but I haven't tried it yet. I'll take a look if needed.

It's also possible to use cxx to create a pure C++ binding, but it depends on the direction we want to take for iceberg-cpp.

  • Is there any performance penalty compared to a pure C/C++ implementation?

This FFI bridge operates at zero or negligible overhead, i.e. no copying, no serialization, no runtime checks needed.


For @lidavidm

I'm also curious if there's any potential for linking issues if say, we use this (with OpenDAL enabled) and also OpenDAL's C++ bindings separately.

I believe they will work seamlessly since neither iceberg-rust nor opendal has runtime or global variables to share.

I'm also assuming (given Rust has no stable ABI) that this builds one shared object with all enabled features; given that maybe another thing to think about (down the line!) is integration with CMake etc. so consumers can choose which features to build and link more easily?

Yes, it is definitely possible. We have explored this in OpenDAL, where we can build it using only selected services or disable async support. Users can control these options through different CMake profiles.

@lidavidm
Copy link
Member

Good to know, thanks!

I have run into cxx in an internal project and IMO even with that it takes some manual wrapping to generate "nice" APIs.

As far as I am concerned, as long as it is possible to disable most/all of the IO then this would provide the same functionality I'm interested in from the C++ library. I'd like to parse snapshots/manifests (preferably with a C++-provided reader) but would want to do all the reading of data files myself (to leverage cuDF etc)

@manuzhang
Copy link
Contributor

It's also possible to use cxx to create a pure C++ binding, but it depends on the direction we want to take for iceberg-cpp.

Maybe we can prioritize on a C++ binding?


# custom target for cargo build
add_custom_target(cargo_build
COMMAND sh -c "cargo build ${CARGO_BUILD_TYPE}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't quite seem to work for me. The following does:

Suggested change
COMMAND sh -c "cargo build ${CARGO_BUILD_TYPE}"
COMMAND cargo build ${CARGO_BUILD_TYPE}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs install commands so that the expectations of CMake users are met. Also, ideally it would generate CMake targets and possibly a pkg-config file as well.

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CARGO_DIST_DIR "${PROJECT_SOURCE_DIR}/target/debug")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, for me they end up under target/x86_64-unknown-linux-gnu/debug/ not target/debug

@lidavidm
Copy link
Member

Ah, and the bindings are fairly minimal I see? I was going to try to experiment with them + cuDF but it seems more work is necessary. In any case it's neat that we can get a C/potentially C++ interface out of this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants