feat(parquet/file): Add SeekToRow for Column readers #283

Merged (5 commits) on Feb 20, 2025

zeroshade
Member

Rationale for this change

Addresses the comments in #278 (comment) by allowing reads to be optimized: entire pages can be skipped, using the offset index if it exists.
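To illustrate why an offset index makes page skipping cheap, here is a minimal, self-contained sketch. It is not the code from this PR: `pageLocation` and `pageForRow` are hypothetical names, and the struct is a simplified stand-in for an offset index entry. The assumption is that the index records the first row of every page, so the page holding a given row can be found with a binary search instead of decoding each page in turn.

```go
package main

import (
	"fmt"
	"sort"
)

// pageLocation is a hypothetical, simplified stand-in for one offset index
// entry: where a page starts and which row it begins with.
type pageLocation struct {
	Offset        int64 // byte offset of the page in the file
	FirstRowIndex int64 // index of the page's first row within the row group
}

// pageForRow returns the index of the page that contains row, or -1 if the
// index is empty or row precedes the first page. It assumes pages is sorted
// by FirstRowIndex. The caller is expected to check row against the total
// row count, since any row past the last page start maps to the last page.
func pageForRow(pages []pageLocation, row int64) int {
	if len(pages) == 0 || row < 0 {
		return -1
	}
	// Find the first page that starts after row; the page before it holds row.
	i := sort.Search(len(pages), func(i int) bool {
		return pages[i].FirstRowIndex > row
	})
	return i - 1
}

func main() {
	pages := []pageLocation{
		{Offset: 4, FirstRowIndex: 0},
		{Offset: 1028, FirstRowIndex: 100},
		{Offset: 2100, FirstRowIndex: 250},
	}
	p := pageForRow(pages, 120)
	fmt.Printf("row 120 is in page %d at offset %d\n", p, pages[p].Offset)
	// prints: row 120 is in page 1 at offset 1028
}
```

With the page located, a reader can jump straight to its byte offset and only has to skip the remaining rows inside that one page.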

What changes are included in this PR?

Deprecating the old NewColumnChunkReader and NewPageReader methods as they really aren't safe to use outside of the package, and have proved difficult to evolve without breaking changes. Instead users should rely on using the RowGroupReader to perform the creation of the column readers and page readers, which is generally what is done by consumers already.

Adds a SeekToRow method on the ColumnChunkReader that skips to a particular row in the column chunk (which also allows quickly resetting to the beginning of a column), along with a SeekToPageWithRow method on the page reader. Also updates the Skip method to properly skip rows in a repeated column, not just values.
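The distinction between rows and values in a repeated column can be sketched as follows. This is an illustrative example rather than the implementation in this PR, and `levelsToSkip` is a hypothetical helper. It relies on the Parquet convention that a repetition level of 0 marks the start of a new row, and it assumes the slice begins on a row boundary.

```go
package main

import "fmt"

// levelsToSkip reports how many repetition-level entries must be consumed to
// skip the first `rows` complete rows of a repeated column. A repetition
// level of 0 starts a new row; any higher level continues the current row.
// If fewer than `rows` rows are available, every entry is consumed.
func levelsToSkip(repLevels []int16, rows int) int {
	if rows <= 0 {
		return 0
	}
	started := 0 // number of rows whose first entry has been seen
	for i, lvl := range repLevels {
		if lvl == 0 {
			if started == rows {
				// Entry i begins the first row that must be kept.
				return i
			}
			started++
		}
	}
	return len(repLevels)
}

func main() {
	// Three rows: [a, b, c], [d], [e, f]
	levels := []int16{0, 1, 1, 0, 0, 1}
	fmt.Println(levelsToSkip(levels, 2)) // prints 4
}
```

Skipping 2 rows here consumes 4 entries, whereas skipping 2 values would stop in the middle of the first row.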

Are these changes tested?

Yes, tests are included.

Are there any user-facing changes?

Just the new methods. The deprecated methods are not removed currently.

@zeroshade zeroshade merged commit 6dc6926 into apache:main Feb 20, 2025
23 checks passed
@zeroshade zeroshade deleted the seek-to-row branch February 20, 2025 16:39