Update arrow_reader_row_filter benchmark to reflect ClickBench distribution #7461

Open · alamb wants to merge 1 commit into main

Conversation

alamb (Contributor) commented on May 1, 2025

Which issue does this PR close?

Rationale for this change

We would like a benchmark that accurately reflects the predicate patterns we see in ClickBench.

See full analysis here #7460 (comment)

ClickHouse Data in hits.parquet:

  • Selectivity is: 13172392 / 99997497 = 0.132
  • Number of RowSelections = 14054784
  • Average run length of each RowSelection: 99997497 / 14054784 = 7.114

Data in the current arrow_reader_row_filter benchmark (see the sketch after this list for how these statistics can be derived):

  • Selectivity is: 80147 / 100000 = 0.80
  • Number of RowSelections = 67989
  • Average run length of each RowSelection: 100000 / 32010 = 3.1
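
For reference, statistics like these can be computed directly from the boolean result of the filter predicate: selectivity is the fraction of rows that pass, and each maximal run of consecutive kept (or skipped) rows becomes one RowSelector. The sketch below is illustrative only and is not the code used for the analysis above; the helper name and the toy mask are made up:

```rust
/// Given the boolean result of evaluating the filter predicate on every row,
/// return (selectivity, number of RowSelection runs, average run length).
/// Each maximal run of consecutive equal values corresponds to one RowSelector.
fn selection_stats(mask: &[bool]) -> (f64, usize, f64) {
    let total_rows = mask.len();
    let selected_rows = mask.iter().filter(|&&keep| keep).count();

    // Count alternating skip/select runs.
    let mut runs = 0;
    let mut prev: Option<bool> = None;
    for &keep in mask {
        if prev != Some(keep) {
            runs += 1;
            prev = Some(keep);
        }
    }

    let selectivity = selected_rows as f64 / total_rows as f64;
    let avg_run_len = total_rows as f64 / runs as f64;
    (selectivity, runs, avg_run_len)
}

fn main() {
    // Toy mask: 3 selected rows out of 8, in 4 runs
    // => selectivity 0.375, average run length 2.0
    let mask = [false, false, true, true, false, false, false, true];
    let (selectivity, runs, avg_run_len) = selection_stats(&mask);
    println!("selectivity={selectivity:.3} runs={runs} avg_run_len={avg_run_len:.2}");
}
```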

What changes are included in this PR?

Change the distribution of the string view column so it is much closer to the hits.parquet data (especially the selectivity); a hypothetical generator for such a clustered distribution is sketched after the list below:

  • Selectivity is: 15144 / 100000 = 0.15144
  • Number of RowSelections = 12904
  • Average run length of each RowSelection: 100000 / 12904 = 7.75
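
For illustration, one way to get a clustered distribution like this is to emit the filter column in runs rather than deciding each row independently. The sketch below is hypothetical and is not the generator used by the benchmark; the cycle length, value format, and use of empty strings (rather than nulls) are assumptions chosen to roughly approximate the numbers above:

```rust
/// Hypothetical generator (not the benchmark's actual code): emit the string
/// column in runs so roughly 15% of rows are non-empty and the select/skip
/// runs are a handful of rows long, approximating the distribution above.
fn generate_string_column(total_rows: usize) -> Vec<Option<String>> {
    // A repeating cycle of 2 non-empty rows followed by 11 empty rows gives
    // selectivity 2/13 ≈ 0.154 and an average run length of 13/2 = 6.5.
    const CYCLE: usize = 13;
    const NON_EMPTY_PER_CYCLE: usize = 2;

    (0..total_rows)
        .map(|i| {
            if i % CYCLE < NON_EMPTY_PER_CYCLE {
                // Row passes the "non-empty" predicate.
                Some(format!("value_{i}"))
            } else {
                // Empty string: row is filtered out by the predicate.
                Some(String::new())
            }
        })
        .collect()
}

fn main() {
    let column = generate_string_column(100_000);
    let selected = column
        .iter()
        .filter(|v| v.as_deref().map_or(false, |s| !s.is_empty()))
        .count();
    println!("selectivity = {}", selected as f64 / column.len() as f64);
    // When building the benchmark's parquet file, this Vec<Option<String>>
    // would be converted into a StringViewArray column.
}
```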

Are there any user-facing changes?

No, this is a benchmark-only change.

Here is the new file generated by this change: new_test.zip

The github-actions bot added the parquet label (Changes to the parquet crate) on May 1, 2025.
alamb (Contributor, Author) commented on May 1, 2025

Unfortunately, even after adjusting the benchmark on this branch I still don't see major changes in #7428.

I will look more deeply tomorrow

cargo bench --all-features --bench arrow_reader_row_filter -- Utf8ViewNonEmpty

Main compared to #7428

arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/async
                        time:   [4.1253 ms 4.1553 ms 4.1881 ms]
                        change: [-4.8097% -3.9190% -3.0939%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  14 (14.00%) high mild
  5 (5.00%) high severe
arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/sync
                        time:   [4.2269 ms 4.2340 ms 4.2419 ms]
                        change: [-1.5246% -1.1130% -0.7616%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/async
                        time:   [3.0754 ms 3.0802 ms 3.0857 ms]
                        change: [-3.2754% -2.7568% -2.2574%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/sync
                        time:   [3.0774 ms 3.0839 ms 3.0909 ms]
                        change: [-1.4528% -1.1133% -0.7921%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

zhuqi-lucas (Contributor) commented:

Thank you @alamb for this work; it seems we still need to investigate more. Would it be possible to take a 10% subset of hits.parquet and run some benchmarks on the arrow-rs side?

alamb (Contributor, Author) commented on May 1, 2025

That is an interesting idea -- to make a benchmark in arrow-rs that runs against hits.parquet (and hits_partitioned) directly 🤔 (and which could, e.g., require downloading those files before running).
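
To make that idea concrete, such a benchmark could open the downloaded file with the parquet crate's arrow reader and push a ClickBench-style predicate down as a RowFilter. The sketch below only illustrates the suggestion and is not code from this PR; the local file path and the choice of URL as the filter column are assumptions:

```rust
use std::fs::File;

use arrow_array::{cast::AsArray, BooleanArray};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: assumes a ClickBench file was downloaded beforehand.
    let file = File::open("hits.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the filter column for the predicate; "URL" is an assumed
    // column name used for illustration.
    let parquet_schema = builder.parquet_schema();
    let url_idx = parquet_schema
        .columns()
        .iter()
        .position(|c| c.name() == "URL")
        .expect("URL column not found");
    let predicate_mask = ProjectionMask::leaves(parquet_schema, [url_idx]);

    // Keep only rows where the column is non-empty, mirroring the
    // Utf8ViewNonEmpty case in the benchmark.
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        let urls = batch.column(0).as_string::<i32>();
        Ok(urls
            .iter()
            .map(|v| Some(v.map_or(false, |s| !s.is_empty())))
            .collect::<BooleanArray>())
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    // Count the rows that survive the pushed-down filter.
    let mut selected_rows = 0;
    for batch in reader {
        selected_rows += batch?.num_rows();
    }
    println!("rows passing the non-empty filter: {selected_rows}");
    Ok(())
}
```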

zhuqi-lucas (Contributor) commented:

I am trying to do the first step: maybe we can download one partition of hits.parquet and use it as the data set for arrow-rs. Since there are 100 partition files, a single partition is about 1% of the data, which seems a manageable amount to work with.

alamb (Contributor, Author) commented on May 2, 2025

I have been thinking about it too -- I probably won't have a chance this weekend but can work on it next week

alamb (Contributor, Author) commented on May 5, 2025

Will keep at it tomorrow

Successfully merging this pull request may close these issues.

arrow_reader_row_filter benchmark doesn't capture page cache improvements