
arrow_reader_row_filter benchmark doesn't capture page cache improvements #7460


Open
Tracked by #7456
alamb opened this issue Apr 30, 2025 · 4 comments · May be fixed by #7461 or #7470


alamb commented Apr 30, 2025

We are trying to improve the performance of row filter application, and part of that effort is a benchmark we can use to guide optimization:

cargo bench --all-features --bench arrow_reader_row_filter

However, as shown in #7428, we have a case where a change shows a clear performance benefit when running an end-to-end query in DataFusion, but the same improvement is not seen in the benchmark.

This ticket tracks figuring out why the benchmark doesn't show an improvement even when the end-to-end query does.

Interestingly, the decoder cache doesn't seem to help much on my test machine (which is a fairly low-end GCP VM). I couldn't reproduce the results listed in #7363 (comment) 🤔

Thank you @alamb, it seems there is no obvious improvement compared to main. This branch only improves PointLookup for the 1,000,000-row dataset compared to the original better-decode branch.

I agree, we need to find a way to reproduce the ClickBench results from the arrow-rs side.

Originally posted by @zhuqi-lucas in #7428 (comment)
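
For context, the code path being benchmarked is the parquet crate's RowFilter API. Below is a minimal sketch of how such a filter is attached to a reader; the file path, column name, and leaf index are placeholders for illustration, not code from the benchmark itself.

use std::fs::File;

use arrow::array::{AsArray, BooleanArray, RecordBatch};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("hits.parquet")?; // placeholder path
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Evaluate the predicate against only the "SearchPhrase" leaf column.
    // Leaf index 0 is a placeholder; the real index depends on the schema.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch: RecordBatch| {
        // Equivalent of `"SearchPhrase" <> ''`: true for non-empty values.
        let col = batch.column(0).as_string::<i32>();
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| v.map(|s| !s.is_empty())),
        ))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    // Only rows passing the filter are decoded into the produced batches.
    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}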


alamb commented Apr 30, 2025

I did some analysis on hits.parquet:

  • Selectivity is: 13172392 / 99997497 = 0.132
  • Number of RowSelections = 14054784
  • Average run length of each RowSelection: 99997497 / 14054784 = 7.114
Details

> SELECT count("SearchPhrase") FROM hits_partitioned;
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 99997497                             |
+--------------------------------------+

> SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <> '';
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 13172392                             |
+--------------------------------------+

So selectivity is: 13172392 / 99997497 = 0.1317272171

To figure out the pattern of passing rows, we want to know how many runs of contiguous rows there are in the filter result. For example, this sequence of filter values:

(t)
(t)
(t)
(f)
(f)

would produce two selections (a run of 3 true values and a run of 2 false values).
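
As a rough sketch (not code from the issue), that pattern maps directly onto the parquet crate's RowSelector/RowSelection types, which is what the "Number of RowSelections" figure above counts:

use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // The (t)(t)(t)(f)(f) pattern above: one "select" run of 3 rows
    // followed by one "skip" run of 2 rows.
    let selection: RowSelection =
        vec![RowSelector::select(3), RowSelector::skip(2)].into();

    // Only the 3 selected rows are decoded by the reader.
    assert_eq!(selection.row_count(), 3);
}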

Here is the SQL:

SET datafusion.execution.target_partitions = 1;

WITH
hits as (
  SELECT
    "SearchPhrase",
    row_number() OVER () as rn
  FROM
    'hits.parquet'
)
,results as (
  SELECT
    rn,
    "SearchPhrase",
    "SearchPhrase" <> '',
    ("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as "filter_same_as_previous"
  FROM
   hits
)
SELECT
  filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
  filter_same_as_previous
--LIMIT 10
;

+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL                    | 1        |
| true                    | 85942712 |
| false                   | 14054784 |
+-------------------------+----------+
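
Here is a small sketch (not from the issue) of the arithmetic tying this table back to the bullet points above: each row whose filter value differs from the previous row starts a new run, so the 14054784 false rows correspond to roughly that many RowSelections, and the average run length is the total row count divided by the number of runs.

// `mask` stands in for the materialized `"SearchPhrase" <> ''` filter.
fn run_stats(mask: &[bool]) -> (usize, f64) {
    // Transitions between consecutive values are the `false` rows in the
    // `filter_same_as_previous` query above; runs = transitions + 1.
    let transitions = mask.windows(2).filter(|w| w[0] != w[1]).count();
    let runs = transitions + 1;
    (runs, mask.len() as f64 / runs as f64)
}

fn main() {
    // The (t)(t)(t)(f)(f) example: 2 runs with an average length of 2.5.
    let (runs, avg) = run_stats(&[true, true, true, false, false]);
    assert_eq!(runs, 2);
    assert!((avg - 2.5).abs() < 1e-9);

    // For hits.parquet: runs ≈ 14054784 + 1, average ≈ 99997497 / 14054785 ≈ 7.1
}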


alamb commented Apr 30, 2025

Here is the test.parquet file created by the benchmark: test.zip

The equivalent numbers are:

  • Selectivity is: 80147 / 100000 = 0.80
  • Number of RowSelections = 32010
  • Average run length of each RowSelection: 100000 / 32010 = 3.1

So, in other words, I think the filter pattern in the benchmark doesn't quite match what is in the ClickBench file: the synthetic data is far less selective (~80% vs ~13%) and has much shorter runs (~3.1 vs ~7.1 rows).

Details

> select count(*)  from '/tmp/test.parquet' where "utf8View" <> '';
+----------+
| count(*) |
+----------+
| 80147    |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.

> select count(*)  from '/tmp/test.parquet';
+----------+
| count(*) |
+----------+
| 100000   |
+----------+
1 row(s) fetched.
Elapsed 0.004 seconds.

>
WITH
hits as (
  SELECT
    "utf8View",
    row_number() OVER () as rn
  FROM
    '/tmp/test.parquet'
)
,results as (
  SELECT
    rn,
    "utf8View",
    "utf8View" <> '',
    ("utf8View" <> '') = (LAG("utf8View" <> '', 1) OVER ()) as "filter_same_as_previous"
  FROM
   hits
)
SELECT
  filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
  filter_same_as_previous
--LIMIT 10
;

+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL                    | 1        |
| true                    | 67989    |
| false                   | 32010    |
+-------------------------+----------+
3 row(s) fetched.
Elapsed 0.017 seconds.
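
One illustrative way (an assumption on my part, not necessarily the approach taken in #7461 or #7470) to make the synthetic benchmark data look more like hits.parquet would be to generate the filter column in runs targeting roughly 13% selectivity and a ~7-row average run length. A rough sketch using the rand crate:

use rand::Rng;

/// Build a filter mask of `len` rows out of runs: each run's length is drawn
/// uniformly from 1..=2*avg_run_len (so drawn runs average about avg_run_len),
/// and each run matches with probability `selectivity`.
/// Note: adjacent runs that draw the same value merge, so observed runs are
/// somewhat longer than avg_run_len; good enough for a sketch.
fn clickbench_like_mask(len: usize, selectivity: f64, avg_run_len: usize) -> Vec<bool> {
    let mut rng = rand::thread_rng();
    let mut mask = Vec::with_capacity(len);
    while mask.len() < len {
        let run = rng.gen_range(1..=avg_run_len * 2).min(len - mask.len());
        let value = rng.gen_bool(selectivity);
        mask.extend(std::iter::repeat(value).take(run));
    }
    mask
}

fn main() {
    let mask = clickbench_like_mask(1_000_000, 0.13, 7);
    let selected = mask.iter().filter(|&&b| b).count();
    println!("selectivity: {:.3}", selected as f64 / mask.len() as f64);
}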


alamb commented May 5, 2025

I am starting to write a benchmark that uses the actual ClickBench data.

alamb linked a pull request May 5, 2025 that will close this issue

alamb commented May 5, 2025

I have begun work on a new benchmark and it is promising so far, though it still has a ways to go.
