Add arrow_reader_clickbench benchmark #7470

Draft
wants to merge 1 commit into
base: main
alamb
Contributor

@alamb alamb commented May 5, 2025

Which issue does this PR close?

Rationale for this change

We are trying to improve the performance of row filter application in the Parquet arrow reader, and we need a benchmark that can guide those optimization efforts.

However, as discussed in #7428, the existing arrow_reader_row_filter microbenchmark (run with the command below) doesn't currently reflect the performance we actually see in our end-to-end application (DataFusion).

cargo bench --all-features --bench arrow_reader_row_filter

Thus, we think we need a benchmark that reads the actual ClickBench dataset and applies the filters the ClickBench queries use.

What changes are included in this PR?

  1. Adds a new arrow_reader_clickbench benchmark

This benchmark measures applying the actual ClickBench filters (and materializing the projected columns) across:

  1. Single file and partitioned (100 file) datasets
  2. async and sync readers
  3. All ClickBench query patterns
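
For readers unfamiliar with what "applying a filter and materializing columns" means here, the following is a minimal sketch of the two-phase idea, not the code in this PR. It uses plain `Vec`s as stand-ins for Parquet column data, and the names `evaluate_predicate` and `materialize` are hypothetical (they are not part of the parquet crate API). The assumption is that the predicate is evaluated on the filter column first and that the other projected columns are decoded only for the rows that pass:

```rust
// Illustrative sketch only: plain slices stand in for Parquet column chunks.
// `evaluate_predicate` and `materialize` are hypothetical names, not parquet crate APIs.

/// Phase 1: evaluate the predicate on the filter column only,
/// producing one boolean per row (the "selection").
fn evaluate_predicate<F>(filter_column: &[i64], predicate: F) -> Vec<bool>
where
    F: Fn(i64) -> bool,
{
    filter_column.iter().map(|&v| predicate(v)).collect()
}

/// Phase 2: materialize a projected column, keeping only the rows
/// that the selection marks as `true`.
fn materialize<T: Clone>(column: &[T], selection: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

A benchmark of this pattern is sensitive to both phases: how selective the predicate is, and how expensive the projected columns are to decode, which is why realistic data and realistic predicates matter.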

Are there any user-facing changes?

A new benchmark and, hopefully as a result, improved filter and projection performance.

TODO

  • Change String types to use Utf8View
  • Add sync/async reader
  • Add hits_partitioned / hits
  • Complete other predicate types

@alamb
Contributor Author

alamb commented May 6, 2025

This benchmark is now looking pretty nice -- it tests just the parquet reading and has all the query predicate patterns. Tomorrow I need to finish adding all the other query patterns and give it a final polish.

@alamb alamb force-pushed the alamb/clickbench_filter_benchmark branch from 85fff8d to 139f2a4 Compare May 7, 2025 18:48
Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

arrow_reader_row_filter benchmark doesn't capture page cache improvements
1 participant