-
Notifications
You must be signed in to change notification settings - Fork 916
arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I did some analysis on
Worked Details
> SELECT count("SearchPhrase") FROM hits_partitioned;
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 99997497 |
+--------------------------------------+
> SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <> '';
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 13172392 |
+--------------------------------------+ So selectivity is: 13172392 / 99997497 = 0.1317272171 To figure out the pattern of passing rows, we want to know how many contiguous rows there are in the matched filter
Would have two selections (3 true, 2 false) Here is the sql: SET datafusion.execution.target_partitions = 1;
WITH
hits as (
SELECT
"SearchPhrase",
row_number() OVER () as rn
FROM
'hits.parquet'
)
,results as (
SELECT
rn,
"SearchPhrase",
"SearchPhrase" <> '',
("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as "filter_same_as_previous"
FROM
hits
)
SELECT
filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
filter_same_as_previous
--LIMIT 10
;
+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL | 1 |
| true | 85942712 |
| false | 14054784 |
+-------------------------+----------+ |
Here is the The equivalent numbers are:
So in other words I think the filter benchmark doesn't quite match what is in the ClickBench file Details
> select count(*) from '/tmp/test.parquet' where "utf8View" <> '';
+----------+
| count(*) |
+----------+
| 80147 |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.
> select count(*) from '/tmp/test.parquet';
+----------+
| count(*) |
+----------+
| 100000 |
+----------+
1 row(s) fetched.
Elapsed 0.004 seconds.
>
WITH
hits as (
SELECT
"utf8View",
row_number() OVER () as rn
FROM
'/tmp/test.parquet'
)
,results as (
SELECT
rn,
"utf8View",
"utf8View" <> '',
("utf8View" <> '') = (LAG("utf8View" <> '', 1) OVER ()) as "filter_same_as_previous"
FROM
hits
)
SELECT
filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
filter_same_as_previous
--LIMIT 10
;
+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL | 1 |
| true | 67989 |
| false | 32010 |
+-------------------------+----------+
3 row(s) fetched.
Elapsed 0.017 seconds. |
I am starting to write a benchmark that uses the actual ClickBench data |
I have begun work on a new benchmark and it is promising so far -- though it has a ways to go: |
We are trying to improve the performance of row filter application and part of that is a benchmark that we can use to guide optimization efforts.
However, as shown in #7428 we have a case where we see the performance benefit when running an end to end query in datafusion but the same improvement is not seen in the benchmark.
This ticket tracks figuring out why the benchmark doesn't show an improvement even when the end to end query does.
Originally posted by @zhuqi-lucas in #7428 (comment)
The text was updated successfully, but these errors were encountered: