Add `arrow_reader_clickbench` benchmark #7470

alamb · 2025-05-05T17:14:44Z

Which issue does this PR close?

Closes arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460
Part of [EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456

Rationale for this change

We are trying to improve the performance of row filter application in the Parquet arrow reader, and part of that effort is a benchmark we can use to guide optimization.

However, as discussed in #7428, the arrow_reader_row_filter microbenchmark doesn't currently reflect the actual performance we see in our end-to-end application (DataFusion).

cargo bench --all-features --bench arrow_reader_row_filter

Thus, we think we need a benchmark that uses the actual ClickBench dataset with appropriate filtering.

See arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460 for more details.

What changes are included in this PR?

Adds a new arrow_reader_clickbench benchmark

This benchmark applies the actual ClickBench filters (and column materialization), covering:

Single file and partitioned (100 file) datasets
async and sync readers
All ClickBench query patterns
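
For readers unfamiliar with what "filters (and column materialization)" means here, the following is a minimal, std-only sketch of the two-phase pattern the benchmark exercises: evaluate a predicate on the filter column first, then materialize the projected column only for the rows that passed. It is not the parquet crate's API. The function names `evaluate_predicate` and `materialize_selected` are hypothetical and exist only for illustration, and plain slices stand in for decoded Parquet columns.

```rust
/// Phase 1: evaluate the predicate on the filter column only,
/// producing one boolean per row (true = row is selected).
/// Hypothetical helper, not part of the parquet crate.
fn evaluate_predicate<T>(filter_column: &[T], predicate: impl Fn(&T) -> bool) -> Vec<bool> {
    filter_column.iter().map(predicate).collect()
}

/// Phase 2: materialize a projected column, keeping only the rows
/// whose selection flag is true. Rows that failed the predicate are
/// never copied into the output.
/// Hypothetical helper, not part of the parquet crate.
fn materialize_selected<T: Clone>(column: &[T], selection: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

As a usage example under the same assumptions: given a filter column `[62, 7, 62, 13]` and the predicate `|id| *id == 62`, phase 1 yields the selection `[true, false, true, false]`, and phase 2 applied to a four-row string column returns only its first and third values. The less selective the predicate, the more rows reach phase 2, which is the case the parent epic is concerned with.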

Are there any user-facing changes?

A new benchmark, which we hope will lead to improved filter / projection performance.

TODO

Change String types to use Utf8View
Add sync/async reader
Add hits_partitioned / hits
Complete other predicate types

alamb · 2025-05-06T20:11:41Z

This benchmark is now looking pretty nice -- it tests just the parquet reading and has all the query predicate patterns. Tomorrow I need to finish adding all the other query patterns and give it a final polish.

github-actions bot added the parquet label (Changes to the parquet crate) May 5, 2025

This was referenced May 5, 2025

arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460

Open

Update arrow_reader_row_filter benchmark to reflect ClickBench distribution #7461

Open

alamb force-pushed the alamb/clickbench_filter_benchmark branch from fef38a7 to 85fff8d May 6, 2025 20:10

alamb mentioned this pull request May 7, 2025

Improve documentation and add examples for ArrowPredicateFn #7480

Open

Add arrow_reader_clickbench

139f2a4

alamb force-pushed the alamb/clickbench_filter_benchmark branch from 85fff8d to 139f2a4 May 7, 2025 18:48
