Add files to add existing Parquet files to a table #932

Open
ZENOTME opened this issue Feb 1, 2025 · 4 comments

Comments

@ZENOTME
Contributor

ZENOTME commented Feb 1, 2025

In #345, we added support for writing new data files and appending them to the table. However, we don't yet support appending existing data files, which requires reading those files and generating the corresponding DataFile metadata.
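For illustration, here is a minimal Python sketch of what "generating the corresponding DataFile metadata" could involve: folding per-row-group footer statistics into file-level values. The classes `ColumnChunkStats`, `RowGroup`, and `DataFileSketch` and the function `data_file_from_footer` are hypothetical stand-ins, not the iceberg-rust or pyiceberg API. The field names follow the Iceberg spec's data file fields, but the maps are keyed by column name here for readability, whereas Iceberg keys them by field ID.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ColumnChunkStats:
    """Hypothetical stand-in for one column chunk's footer statistics."""
    null_count: int
    min_value: int
    max_value: int


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group listed in the footer."""
    num_rows: int
    columns: dict[str, ColumnChunkStats]


@dataclass
class DataFileSketch:
    """Simplified subset of the fields an Iceberg data file entry carries."""
    file_path: str
    file_size_in_bytes: int
    record_count: int = 0
    null_value_counts: dict[str, int] = field(default_factory=dict)
    lower_bounds: dict[str, int] = field(default_factory=dict)
    upper_bounds: dict[str, int] = field(default_factory=dict)


def data_file_from_footer(path: str, size: int, row_groups: list[RowGroup]) -> DataFileSketch:
    """Aggregate row-group statistics into file-level metadata (assumed approach)."""
    data_file = DataFileSketch(file_path=path, file_size_in_bytes=size)
    for row_group in row_groups:
        data_file.record_count += row_group.num_rows
        for name, stats in row_group.columns.items():
            # Null counts add up across row groups.
            data_file.null_value_counts[name] = (
                data_file.null_value_counts.get(name, 0) + stats.null_count
            )
            # Bounds widen to cover every row group seen so far.
            data_file.lower_bounds[name] = min(
                data_file.lower_bounds.get(name, stats.min_value), stats.min_value
            )
            data_file.upper_bounds[name] = max(
                data_file.upper_bounds.get(name, stats.max_value), stats.max_value
            )
    return data_file


# Example footer with two row groups for a single integer column "id".
footer = [
    RowGroup(num_rows=100, columns={"id": ColumnChunkStats(0, 1, 100)}),
    RowGroup(num_rows=50, columns={"id": ColumnChunkStats(2, 101, 150)}),
]
example = data_file_from_footer("s3://bucket/data/a.parquet", 4096, footer)
```

In a real implementation the statistics would come from the Parquet footer rather than hand-built objects, and the file's schema would need to be checked against the table schema before the entry is accepted.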

@jonathanc-n
Contributor

I would like to try working on this.

@ZENOTME
Contributor Author

ZENOTME commented Feb 5, 2025

> I would like to try working on this.

Thanks @jonathanc-n! Feel free to send the PR for this.

@jonathanc-n
Contributor

@ZENOTME When appending existing data files, should the system load file metadata by reading the current snapshot’s manifest lists from an existing Iceberg table, or would you prefer that the user specify a file path, which the system then scans to infer the metadata? Based on the answer, I plan to perform a TableScan and add the resulting DataFiles with add_data_file.

@ZENOTME
Contributor Author

ZENOTME commented Feb 6, 2025

> @ZENOTME When appending existing data files, should the system load file metadata by reading the current snapshot’s manifest lists from an existing Iceberg table, or would you prefer that the user specify a file path, which the system then scans to infer the metadata? Based on the answer, I plan to perform a TableScan and add the resulting DataFiles with add_data_file.

Hi @jonathanc-n, I think we can refer to the pyiceberg implementation: https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L669C9-L669C18.

> should the system load file metadata by reading the current snapshot’s manifest lists from an existing Iceberg table, or would you prefer to specify a file path from which the system scans and infers metadata?

I think the user will add files through the transaction API, so we already know which table they will be appended to, along with its metadata.
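A rough Python sketch of that idea, under stated assumptions: the transaction is created from a table, so the target table and its current file list are already known when files are added. `TableSketch`, `TransactionSketch`, and their methods are hypothetical names for illustration only and are not the iceberg-rust or pyiceberg API. The duplicate check is also an assumption about how such an API might guard against adding the same file twice.

```python
class TableSketch:
    """Hypothetical table holding the paths referenced by its current snapshot."""

    def __init__(self, name):
        self.name = name
        self.data_file_paths = []


class TransactionSketch:
    """Hypothetical transaction bound to one table, so the target is always known."""

    def __init__(self, table):
        self.table = table
        self._pending = []

    def add_files(self, file_paths):
        # Reject paths repeated within this call.
        if len(set(file_paths)) != len(file_paths):
            raise ValueError("file paths must be unique")
        # Reject paths the table or this transaction already references.
        known = set(self.table.data_file_paths) | set(self._pending)
        duplicates = [path for path in file_paths if path in known]
        if duplicates:
            raise ValueError(f"files already referenced by table: {duplicates}")
        self._pending.extend(file_paths)
        return self

    def commit(self):
        # Stand-in for writing manifests and producing a new snapshot.
        self.table.data_file_paths.extend(self._pending)
        self._pending = []
        return self.table


table = TableSketch("db.events")
TransactionSketch(table).add_files(["a.parquet", "b.parquet"]).commit()
```

Adding `a.parquet` a second time through a new transaction on the same table would raise `ValueError` in this sketch, since the path is already in `table.data_file_paths`.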
