Update arrow_reader_row_filter benchmark to reflect ClickBench distribution #7461

Open · alamb wants to merge 1 commit into main

Conversation

alamb (Contributor) commented on May 1, 2025

Which issue does this PR close?

Rationale for this change

We would like a benchmark that accurately reflects the predicate patterns we see in ClickBench.

See full analysis here #7460 (comment)

ClickHouse Data in hits.parquet:

  • Selectivity is: 13172392 / 99997497 = 0.132
  • Number of RowSelections = 14054784
  • Average run length of each RowSelection: 99997497 / 14054784 = 7.114

Data in the current arrow_reader_row_filter benchmark (see the sketch after this list for how these statistics can be derived):

  • Selectivity is: 80147 / 100000 = 0.80
  • Number of RowSelections = 67989
  • Average run length of each RowSelection: 100000 / 32010 = 3.1
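
For reference, statistics like these can be computed directly from the boolean result of the filter predicate: selectivity is the fraction of rows that pass, and each maximal run of consecutive kept (or skipped) rows becomes one RowSelector. The sketch below is illustrative only and is not the code used for the analysis above; the helper name and the toy mask are made up:

```rust
/// Given the boolean result of evaluating the filter predicate on every row,
/// return (selectivity, number of RowSelection runs, average run length).
/// Each maximal run of consecutive equal values corresponds to one RowSelector.
fn selection_stats(mask: &[bool]) -> (f64, usize, f64) {
    let total_rows = mask.len();
    let selected_rows = mask.iter().filter(|&&keep| keep).count();

    // Count alternating skip/select runs.
    let mut runs = 0;
    let mut prev: Option<bool> = None;
    for &keep in mask {
        if prev != Some(keep) {
            runs += 1;
            prev = Some(keep);
        }
    }

    let selectivity = selected_rows as f64 / total_rows as f64;
    let avg_run_len = total_rows as f64 / runs as f64;
    (selectivity, runs, avg_run_len)
}

fn main() {
    // Toy mask: 3 selected rows out of 8, in 4 runs
    // => selectivity 0.375, average run length 2.0
    let mask = [false, false, true, true, false, false, false, true];
    let (selectivity, runs, avg_run_len) = selection_stats(&mask);
    println!("selectivity={selectivity:.3} runs={runs} avg_run_len={avg_run_len:.2}");
}
```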

What changes are included in this PR?

Change the distribution of the string view column so it is much closer to the hits.parquet data (especially the selectivity); a hypothetical generator for such a clustered distribution is sketched after the list below:

  • Selectivity is: 15144 / 100000 = 0.15144
  • Number of RowSelections = 12904
  • Average run length of each RowSelection: 100000 / 12904 = 7.75
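
For illustration, one way to get a clustered distribution like this is to emit the filter column in runs rather than deciding each row independently. The sketch below is hypothetical and is not the generator used by the benchmark; the cycle length, value format, and use of empty strings (rather than nulls) are assumptions chosen to roughly approximate the numbers above:

```rust
/// Hypothetical generator (not the benchmark's actual code): emit the string
/// column in runs so roughly 15% of rows are non-empty and the select/skip
/// runs are a handful of rows long, approximating the distribution above.
fn generate_string_column(total_rows: usize) -> Vec<Option<String>> {
    // A repeating cycle of 2 non-empty rows followed by 11 empty rows gives
    // selectivity 2/13 ≈ 0.154 and an average run length of 13/2 = 6.5.
    const CYCLE: usize = 13;
    const NON_EMPTY_PER_CYCLE: usize = 2;

    (0..total_rows)
        .map(|i| {
            if i % CYCLE < NON_EMPTY_PER_CYCLE {
                // Row passes the "non-empty" predicate.
                Some(format!("value_{i}"))
            } else {
                // Empty string: row is filtered out by the predicate.
                Some(String::new())
            }
        })
        .collect()
}

fn main() {
    let column = generate_string_column(100_000);
    let selected = column
        .iter()
        .filter(|v| v.as_deref().map_or(false, |s| !s.is_empty()))
        .count();
    println!("selectivity = {}", selected as f64 / column.len() as f64);
    // When building the benchmark's parquet file, this Vec<Option<String>>
    // would be converted into a StringViewArray column.
}
```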

Are there any user-facing changes?

No, this is a benchmark-only change.

Here is the new file generated by this change: new_test.zip

The github-actions bot added the parquet label (Changes to the parquet crate) on May 1, 2025.
alamb (Contributor, Author) commented on May 1, 2025

Unfortunately, even after adjusting the benchmark on this branch I still don't see major changes in #7428.

I will look more deeply tomorrow

cargo bench --all-features --bench arrow_reader_row_filter -- Utf8ViewNonEmpty

Main compared to #7428

arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/async
                        time:   [4.1253 ms 4.1553 ms 4.1881 ms]
                        change: [-4.8097% -3.9190% -3.0939%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  14 (14.00%) high mild
  5 (5.00%) high severe
arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/sync
                        time:   [4.2269 ms 4.2340 ms 4.2419 ms]
                        change: [-1.5246% -1.1130% -0.7616%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/async
                        time:   [3.0754 ms 3.0802 ms 3.0857 ms]
                        change: [-3.2754% -2.7568% -2.2574%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/sync
                        time:   [3.0774 ms 3.0839 ms 3.0909 ms]
                        change: [-1.4528% -1.1133% -0.7921%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

zhuqi-lucas (Contributor) commented:

Thank you @alamb for this work; it seems we still need to investigate more. Would it be possible to take a 10% subset of hits.parquet and run some benchmarks on the arrow-rs side?

alamb (Contributor, Author) commented on May 1, 2025

That is an interesting idea -- to make a benchmark in arrow-rs that runs against hits.parquet (and hits_partitioned) directly 🤔 (and which could, e.g., require downloading those files before running).
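
To make that idea concrete, such a benchmark could open the downloaded file with the parquet crate's arrow reader and push a ClickBench-style predicate down as a RowFilter. The sketch below only illustrates the suggestion and is not code from this PR; the local file path and the choice of URL as the filter column are assumptions:

```rust
use std::fs::File;

use arrow_array::{cast::AsArray, BooleanArray};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: assumes a ClickBench file was downloaded beforehand.
    let file = File::open("hits.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the filter column for the predicate; "URL" is an assumed
    // column name used for illustration.
    let parquet_schema = builder.parquet_schema();
    let url_idx = parquet_schema
        .columns()
        .iter()
        .position(|c| c.name() == "URL")
        .expect("URL column not found");
    let predicate_mask = ProjectionMask::leaves(parquet_schema, [url_idx]);

    // Keep only rows where the column is non-empty, mirroring the
    // Utf8ViewNonEmpty case in the benchmark.
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        let urls = batch.column(0).as_string::<i32>();
        Ok(urls
            .iter()
            .map(|v| Some(v.map_or(false, |s| !s.is_empty())))
            .collect::<BooleanArray>())
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    // Count the rows that survive the pushed-down filter.
    let mut selected_rows = 0;
    for batch in reader {
        selected_rows += batch?.num_rows();
    }
    println!("rows passing the non-empty filter: {selected_rows}");
    Ok(())
}
```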

zhuqi-lucas (Contributor) commented:

I am trying to do the first step: maybe we can download one partition of hits.parquet and use it as the data set for arrow-rs. Since there are 100 partition files, a single partition is about 1% of the data, which seems a manageable amount to work with.

alamb (Contributor, Author) commented on May 2, 2025

I have been thinking about it too -- I probably won't have a chance this weekend but can work on it next week

alamb (Contributor, Author) commented on May 5, 2025

Will keep at it tomorrow

Successfully merging this pull request may close these issues.

arrow_reader_row_filter benchmark doesn't capture page cache improvements