
arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460


Open
Tracked by #7456
alamb opened this issue Apr 30, 2025 · 4 comments · May be fixed by #7461 or #7470


alamb commented Apr 30, 2025

We are trying to improve the performance of row filter application, and part of that effort is a benchmark we can use to guide optimization:

cargo bench --all-features --bench arrow_reader_row_filter

However, as shown in #7428, we have a case where a change shows a clear performance benefit when running an end-to-end query in DataFusion, but the same improvement is not seen in the benchmark.

This ticket tracks figuring out why the benchmark doesn't show an improvement even when the end-to-end query does.

Interestingly, the decoder cache doesn't seem to help much on my test machine (which is a fairly low-end GCP VM). I couldn't reproduce the results listed in #7363 (comment) 🤔

Thank you @alamb, it seems there is no obvious improvement compared to main. This branch only improves PointLookup for the 1,000,000-row dataset compared to the original better-decode branch.

I agree, we need to find a way to reproduce the ClickBench results from the arrow-rs side.

Originally posted by @zhuqi-lucas in #7428 (comment)
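
For context, the code path being benchmarked is the parquet crate's RowFilter API. Below is a minimal sketch of how such a filter is attached to a reader; the file path, column name, and leaf index are placeholders for illustration, not code from the benchmark itself.

use std::fs::File;

use arrow::array::{AsArray, BooleanArray, RecordBatch};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("hits.parquet")?; // placeholder path
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Evaluate the predicate against only the "SearchPhrase" leaf column.
    // Leaf index 0 is a placeholder; the real index depends on the schema.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch: RecordBatch| {
        // Equivalent of `"SearchPhrase" <> ''`: true for non-empty values.
        let col = batch.column(0).as_string::<i32>();
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| v.map(|s| !s.is_empty())),
        ))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    // Only rows passing the filter are decoded into the produced batches.
    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}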


alamb commented Apr 30, 2025

I did some analysis on hits.parquet:

  • Selectivity is: 13172392 / 99997497 = 0.132
  • Number of RowSelections = 14054784
  • Average run length of each RowSelection: 99997497 / 14054784 = 7.114
Details

> SELECT count("SearchPhrase") FROM hits_partitioned;
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 99997497                             |
+--------------------------------------+

> SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <> '';
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 13172392                             |
+--------------------------------------+

So selectivity is: 13172392 / 99997497 = 0.1317272171

To figure out the pattern of passing rows, we want to know how many runs of contiguous rows there are in the filter result. For example, this sequence of filter values:

(t)
(t)
(t)
(f)
(f)

would produce two selections (a run of 3 true values and a run of 2 false values).
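
As a rough sketch (not code from the issue), that pattern maps directly onto the parquet crate's RowSelector/RowSelection types, which is what the "Number of RowSelections" figure above counts:

use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // The (t)(t)(t)(f)(f) pattern above: one "select" run of 3 rows
    // followed by one "skip" run of 2 rows.
    let selection: RowSelection =
        vec![RowSelector::select(3), RowSelector::skip(2)].into();

    // Only the 3 selected rows are decoded by the reader.
    assert_eq!(selection.row_count(), 3);
}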

Here is the SQL:

SET datafusion.execution.target_partitions = 1;

WITH
hits as (
  SELECT
    "SearchPhrase",
    row_number() OVER () as rn
  FROM
    'hits.parquet'
)
,results as (
  SELECT
    rn,
    "SearchPhrase",
    "SearchPhrase" <> '',
    ("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as "filter_same_as_previous"
  FROM
   hits
)
SELECT
  filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
  filter_same_as_previous
--LIMIT 10
;

+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL                    | 1        |
| true                    | 85942712 |
| false                   | 14054784 |
+-------------------------+----------+
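
Here is a small sketch (not from the issue) of the arithmetic tying this table back to the bullet points above: each row whose filter value differs from the previous row starts a new run, so the 14054784 false rows correspond to roughly that many RowSelections, and the average run length is the total row count divided by the number of runs.

// `mask` stands in for the materialized `"SearchPhrase" <> ''` filter.
fn run_stats(mask: &[bool]) -> (usize, f64) {
    // Transitions between consecutive values are the `false` rows in the
    // `filter_same_as_previous` query above; runs = transitions + 1.
    let transitions = mask.windows(2).filter(|w| w[0] != w[1]).count();
    let runs = transitions + 1;
    (runs, mask.len() as f64 / runs as f64)
}

fn main() {
    // The (t)(t)(t)(f)(f) example: 2 runs with an average length of 2.5.
    let (runs, avg) = run_stats(&[true, true, true, false, false]);
    assert_eq!(runs, 2);
    assert!((avg - 2.5).abs() < 1e-9);

    // For hits.parquet: runs ≈ 14054784 + 1, average ≈ 99997497 / 14054785 ≈ 7.1
}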


alamb commented Apr 30, 2025

Here is the test.parquet file created by the benchmark: test.zip

The equivalent numbers are:

  • Selectivity is: 80147 / 100000 = 0.80
  • Number of RowSelections = 32010
  • Average run length of each RowSelection: 100000 / 32010 = 3.1

So, in other words, I think the filter pattern in the benchmark doesn't quite match what is in the ClickBench file: the synthetic data is far less selective (~80% vs ~13%) and has much shorter runs (~3.1 vs ~7.1 rows).

Details

> select count(*)  from '/tmp/test.parquet' where "utf8View" <> '';
+----------+
| count(*) |
+----------+
| 80147    |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.

> select count(*)  from '/tmp/test.parquet';
+----------+
| count(*) |
+----------+
| 100000   |
+----------+
1 row(s) fetched.
Elapsed 0.004 seconds.

>
WITH
hits as (
  SELECT
    "utf8View",
    row_number() OVER () as rn
  FROM
    '/tmp/test.parquet'
)
,results as (
  SELECT
    rn,
    "utf8View",
    "utf8View" <> '',
    ("utf8View" <> '') = (LAG("utf8View" <> '', 1) OVER ()) as "filter_same_as_previous"
  FROM
   hits
)
SELECT
  filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
  filter_same_as_previous
--LIMIT 10
;

+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL                    | 1        |
| true                    | 67989    |
| false                   | 32010    |
+-------------------------+----------+
3 row(s) fetched.
Elapsed 0.017 seconds.
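
One illustrative way (an assumption on my part, not necessarily the approach taken in #7461 or #7470) to make the synthetic benchmark data look more like hits.parquet would be to generate the filter column in runs targeting roughly 13% selectivity and a ~7-row average run length. A rough sketch using the rand crate:

use rand::Rng;

/// Build a filter mask of `len` rows out of runs: each run's length is drawn
/// uniformly from 1..=2*avg_run_len (so drawn runs average about avg_run_len),
/// and each run matches with probability `selectivity`.
/// Note: adjacent runs that draw the same value merge, so observed runs are
/// somewhat longer than avg_run_len; good enough for a sketch.
fn clickbench_like_mask(len: usize, selectivity: f64, avg_run_len: usize) -> Vec<bool> {
    let mut rng = rand::thread_rng();
    let mut mask = Vec::with_capacity(len);
    while mask.len() < len {
        let run = rng.gen_range(1..=avg_run_len * 2).min(len - mask.len());
        let value = rng.gen_bool(selectivity);
        mask.extend(std::iter::repeat(value).take(run));
    }
    mask
}

fn main() {
    let mask = clickbench_like_mask(1_000_000, 0.13, 7);
    let selected = mask.iter().filter(|&&b| b).count();
    println!("selectivity: {:.3}", selected as f64 / mask.len() as f64);
}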


alamb commented May 5, 2025

I am starting to write a benchmark that uses the actual ClickBench data.

alamb linked a pull request May 5, 2025 that will close this issue

alamb commented May 5, 2025

I have begun work on a new benchmark and it is promising so far, though it still has a ways to go.
