Skip to content

Parquet decoder / decoded page Cache #7363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 5 tasks
Tracked by #7456
alamb opened this issue Mar 31, 2025 · 45 comments
Open
2 of 5 tasks
Tracked by #7456

Parquet decoder / decoded page Cache #7363

alamb opened this issue Mar 31, 2025 · 45 comments
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are some times where enabling row filter actually slows down the parquet decoding and one reason for this is having to decompress (e.g. ZSTD) the pages twice

Describe the solution you'd like
#6921 has a cache to avoid this second decode

I would like to get that PR merged

Steps:

Describe alternatives you've considered

Additional context

I am filing this as a separate ticket as there are a lot of other ideas on #6921 that make it kind of hard to follow

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Mar 31, 2025
@alamb alamb self-assigned this Mar 31, 2025
@alamb alamb changed the title Parquet decoded page Cache Parquet decoder / decoded page Cache Mar 31, 2025
@alamb
Copy link
Contributor Author

alamb commented Mar 31, 2025

I have run some performance tests, see:

@alamb
Copy link
Contributor Author

alamb commented Apr 1, 2025

Here are the current clickbench results for the code in #6921

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_filter_pushdown ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.57ms │                0.58ms │     no change │
│ QQuery 1     │    68.16ms │               70.64ms │     no change │
│ QQuery 2     │   116.52ms │              118.28ms │     no change │
│ QQuery 3     │   123.86ms │              123.76ms │     no change │
│ QQuery 4     │   776.15ms │              797.72ms │     no change │
│ QQuery 5     │   848.82ms │              880.01ms │     no change │
│ QQuery 6     │    64.80ms │               67.14ms │     no change │
│ QQuery 7     │    77.36ms │               93.04ms │  1.20x slower │
│ QQuery 8     │   957.21ms │              983.79ms │     no change │
│ QQuery 9     │  1239.49ms │             1275.73ms │     no change │
│ QQuery 10    │   299.68ms │              318.01ms │  1.06x slower │
│ QQuery 11    │   344.19ms │              360.58ms │     no change │
│ QQuery 12    │   945.45ms │             1058.68ms │  1.12x slower │
│ QQuery 13    │  1323.58ms │             1548.38ms │  1.17x slower │
│ QQuery 14    │   885.86ms │             1063.57ms │  1.20x slower │
│ QQuery 15    │  1110.56ms │             1134.68ms │     no change │
│ QQuery 16    │  1834.24ms │             1789.95ms │     no change │
│ QQuery 17    │  1662.90ms │             1650.05ms │     no change │
│ QQuery 18    │  3176.45ms │             3164.17ms │     no change │
│ QQuery 19    │   116.43ms │              123.84ms │  1.06x slower │
│ QQuery 20    │  1206.01ms │             1204.47ms │     no change │
│ QQuery 21    │  1445.16ms │             1351.41ms │ +1.07x faster │
│ QQuery 22    │  2708.56ms │             2401.18ms │ +1.13x faster │
│ QQuery 23    │  8690.72ms │             5234.73ms │ +1.66x faster │
│ QQuery 24    │   509.28ms │              684.55ms │  1.34x slower │
│ QQuery 25    │   426.36ms │              553.26ms │  1.30x slower │
│ QQuery 26    │   581.56ms │              802.46ms │  1.38x slower │
│ QQuery 27    │  1797.38ms │             2464.11ms │  1.37x slower │
│ QQuery 28    │ 13274.54ms │            14650.91ms │  1.10x slower │
│ QQuery 29    │   629.26ms │              598.10ms │     no change │
│ QQuery 30    │   970.56ms │             1286.68ms │  1.33x slower │
│ QQuery 31    │  1008.49ms │             1398.40ms │  1.39x slower │
│ QQuery 32    │  3220.54ms │             3249.25ms │     no change │
│ QQuery 33    │  3948.22ms │             3595.57ms │ +1.10x faster │
│ QQuery 34    │  3968.56ms │             3536.51ms │ +1.12x faster │
│ QQuery 35    │  1477.42ms │             1317.28ms │ +1.12x faster │
│ QQuery 36    │   310.58ms │              277.12ms │ +1.12x faster │
│ QQuery 37    │   144.55ms │              136.09ms │ +1.06x faster │
│ QQuery 38    │   188.40ms │              167.13ms │ +1.13x faster │
│ QQuery 39    │   537.25ms │              419.03ms │ +1.28x faster │
│ QQuery 40    │    73.01ms │              110.69ms │  1.52x slower │
│ QQuery 41    │    83.59ms │              105.50ms │  1.26x slower │
│ QQuery 42    │    91.66ms │               96.24ms │     no change │
└──────────────┴────────────┴───────────────────────┴───────────────┘

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)               │ 63263.96ms │
│ Total Time (alamb_filter_pushdown)   │ 62263.28ms │
│ Average Time (main_base)             │  1471.25ms │
│ Average Time (alamb_filter_pushdown) │  1447.98ms │
│ Queries Faster                       │         10 │
│ Queries Slower                       │         15 │
│ Queries with No Change               │         18 │
└──────────────────────────────────────┴────────────┘

@alamb
Copy link
Contributor Author

alamb commented Apr 1, 2025

I did some analysis on the queries that show the largest slowdown:

│ QQuery 24    │   509.28ms │              684.55ms │  1.34x slower │
│ QQuery 25    │   426.36ms │              553.26ms │  1.30x slower │
│ QQuery 26    │   581.56ms │              802.46ms │  1.38x slower │
...
│ QQuery 30    │   970.56ms │             1286.68ms │  1.33x slower │
│ QQuery 31    │  1008.49ms │             1398.40ms │  1.39x slower │

Query 24-26
sql
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime"), "SearchPhrase" LIMIT 10;

Queries 30-31

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

Basically they all have the filter SearchPhrase <> '' (cleaning out empty strings)

How selective is SearchPhrase <> ''?

This predicate select 13/99M

> select count(*) from 'hits.parquet' WHERE "SearchPhrase" <> '';
+----------+
| count(*) |
+----------+
| 13172392 |
+----------+
1 row(s) fetched.
Elapsed 0.303 seconds.

> select count(*) from 'hits.parquet';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.

I am now profiling to determine the root cause

@alamb
Copy link
Contributor Author

alamb commented Apr 1, 2025

I profiled this command:

./datafusion-cli-filter-pushdown -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM hits WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

And I see that about 15% of the time is being spent "skipping" records -- aka evaluating the RowSelection in the parquet reader
Image

This is likely the same observation that lead @XiangpengHao to propose adding bitmask support in

I need to think about what the right next steps are

@XiangpengHao
Copy link
Contributor

This is likely the same observation that lead @XiangpengHao to propose adding bitmask support in

Exactly, the predicate SearchPhase <> '' is not selective, meaning that we have lots of small selections:

select 2 
skip 3
select 4
....

Each selector is 16 bytes, causing a lot of memory overhead.

This also means that for each small select and skip, we will call the corresponding read_records and skip_records. These methods are currently not super optimized for millions of calls.

The associated and_then method is also expensive to evaluate.

@XiangpengHao
Copy link
Contributor

I need to think about what the right next steps are

One potential item I can help is to make the reader a optional opt-in reader, and merge it in.

With that, we will have a testbed for future optimization like better selection mechanisms.

@alamb
Copy link
Contributor Author

alamb commented Apr 2, 2025

I need to think about what the right next steps are

One potential item I can help is to make the reader a optional opt-in reader, and merge it in.

With that, we will have a testbed for future optimization like better selection mechanisms.

I think this is a good idea

@alamb
Copy link
Contributor Author

alamb commented Apr 2, 2025

Actually, I think all we need to show is that the code in arrow-rs helps in some cases and doesn't degrade performance for others -- we don't need to show that it completely fixes the issue downstream in DataFusion

I will try and gather evidence to determine if the decoded page cache can be always be used

@zhuqi-lucas
Copy link
Contributor

I am interested for this topic, if anything i can help the testing to compare the performance or code improvement as a follow-up?

@alamb
Copy link
Contributor Author

alamb commented Apr 8, 2025

I am interested for this topic, if anything i can help the testing to compare the performance or code improvement as a follow-up?

Thank you so much @zhuqi-lucas ! that is great news.

I think the first thing we should do is

  1. Run the existing arrow_reader benchmarks against Experimental parquet decoder with first-class selection pushdown support #6921 and see if it shows any regressions
  2. Add new benchmarks in arrow_reader (or maybe in a new arrow_reader_row_filter) that test reading parquet data with a row filter (aka with_row_filter). There should be benchmarks both with 1) a filter on a column that is also selected and 2) a filter on a column that is not also selected (aka projection=a, filter=b > 1 or something)

Does that make sense?

@zhuqi-lucas
Copy link
Contributor

It makes sense @alamb , thank you for the guide, i will try to do it.

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 8, 2025

  1. Here is the result for benchmark compare, almost no regression for the benchmark:
group                                                                                                      better-decoder                         main
-----                                                                                                      --------------                         ----
arrow_array_reader/BYTE_ARRAY/Decimal128Array/plain encoded, mandatory, no NULLs                           1.01   889.0±13.52µs        ? ?/sec    1.00   879.2±31.80µs        ? ?/sec
arrow_array_reader/BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                          1.13   776.7±36.97µs        ? ?/sec    1.00   687.5±31.70µs        ? ?/sec
arrow_array_reader/BYTE_ARRAY/Decimal128Array/plain encoded, optional, no NULLs                            1.00   890.6±12.19µs        ? ?/sec    1.00   887.0±22.95µs        ? ?/sec
arrow_array_reader/BinaryArray/dictionary encoded, mandatory, no NULLs                                     1.01   403.8±12.84µs        ? ?/sec    1.00    399.8±7.63µs        ? ?/sec
arrow_array_reader/BinaryArray/dictionary encoded, optional, half NULLs                                    1.16   376.2±21.20µs        ? ?/sec    1.00   323.7±14.98µs        ? ?/sec
arrow_array_reader/BinaryArray/dictionary encoded, optional, no NULLs                                      1.01   403.7±11.99µs        ? ?/sec    1.00    398.3±5.42µs        ? ?/sec
arrow_array_reader/BinaryArray/plain encoded, mandatory, no NULLs                                          1.05   535.7±30.66µs        ? ?/sec    1.00   509.6±21.13µs        ? ?/sec
arrow_array_reader/BinaryArray/plain encoded, optional, half NULLs                                         1.18   444.1±19.58µs        ? ?/sec    1.00   377.1±21.79µs        ? ?/sec
arrow_array_reader/BinaryArray/plain encoded, optional, no NULLs                                           1.02   536.7±37.88µs        ? ?/sec    1.00   527.5±39.52µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, mandatory, no NULLs                                 1.00    105.9±1.16µs        ? ?/sec    1.18    125.0±6.83µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, half NULLs                                1.00    155.1±7.94µs        ? ?/sec    1.08    168.0±7.36µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, no NULLs                                  1.00    111.3±1.80µs        ? ?/sec    1.18    130.8±7.56µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs                                      1.01    165.7±3.21µs        ? ?/sec    1.00    164.7±2.88µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs, short string                        1.01    228.8±5.38µs        ? ?/sec    1.00    226.3±2.31µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs                                     1.02    186.4±7.74µs        ? ?/sec    1.00    182.8±3.39µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs                                       1.01    170.8±3.93µs        ? ?/sec    1.00    169.3±2.11µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, mandatory, no NULLs     1.03   576.2±22.74µs        ? ?/sec    1.00    561.3±5.71µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, half NULLs    1.00   562.9±10.24µs        ? ?/sec    1.04   586.0±27.51µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/byte_stream_split encoded, optional, no NULLs      1.03   587.3±16.88µs        ? ?/sec    1.00   570.5±10.65µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, mandatory, no NULLs                 1.03   268.0±10.85µs        ? ?/sec    1.00   259.3±10.75µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs                1.00    415.7±8.55µs        ? ?/sec    1.03   428.1±16.34µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Decimal128Array/plain encoded, optional, no NULLs                  1.04    275.0±8.26µs        ? ?/sec    1.00    264.4±4.68µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, mandatory, no NULLs        1.02     57.8±2.04µs        ? ?/sec    1.00     56.8±0.76µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, half NULLs       1.00    122.1±0.79µs        ? ?/sec    1.01    123.5±2.70µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/byte_stream_split encoded, optional, no NULLs         1.00     61.9±0.67µs        ? ?/sec    1.00     61.7±0.50µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, mandatory, no NULLs                    1.00     19.9±0.23µs        ? ?/sec    1.00     19.8±0.36µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, half NULLs                   1.00    104.9±0.78µs        ? ?/sec    1.00    104.8±3.11µs        ? ?/sec
arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array/plain encoded, optional, no NULLs                     1.00     24.9±0.17µs        ? ?/sec    1.00     24.8±0.32µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, mandatory, no NULLs                    1.03    347.9±4.17µs        ? ?/sec    1.00    339.2±6.45µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, half NULLs                   1.04   359.6±10.81µs        ? ?/sec    1.00    345.4±7.48µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/byte_stream_split encoded, optional, no NULLs                     1.02    352.9±3.67µs        ? ?/sec    1.00   345.8±13.92µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, mandatory, no NULLs                                1.05     40.7±4.30µs        ? ?/sec    1.00     38.6±2.02µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, half NULLs                               1.01    208.0±6.85µs        ? ?/sec    1.00    206.2±9.71µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(16)/plain encoded, optional, no NULLs                                 1.01     45.4±2.26µs        ? ?/sec    1.00     44.9±2.58µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, mandatory, no NULLs                     1.00     45.7±0.53µs        ? ?/sec    1.01     46.3±2.38µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, half NULLs                    1.00    111.5±1.41µs        ? ?/sec    1.00    111.3±1.94µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/byte_stream_split encoded, optional, no NULLs                      1.03     50.9±2.17µs        ? ?/sec    1.00     49.3±0.55µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, mandatory, no NULLs                                 1.01      8.4±0.32µs        ? ?/sec    1.00      8.3±0.31µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, half NULLs                                1.01     93.1±2.14µs        ? ?/sec    1.00     92.1±1.41µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(2)/plain encoded, optional, no NULLs                                  1.02     12.6±0.37µs        ? ?/sec    1.00     12.4±0.33µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, mandatory, no NULLs                     1.02     88.9±0.74µs        ? ?/sec    1.00     87.0±2.80µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, half NULLs                    1.04    150.1±1.55µs        ? ?/sec    1.00    144.1±1.36µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/byte_stream_split encoded, optional, no NULLs                      1.03     93.2±3.52µs        ? ?/sec    1.00     90.9±0.86µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, mandatory, no NULLs                                 1.02     13.5±0.18µs        ? ?/sec    1.00     13.3±0.32µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, half NULLs                                1.04    112.6±1.35µs        ? ?/sec    1.00    107.7±3.41µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(4)/plain encoded, optional, no NULLs                                  1.02     18.1±0.52µs        ? ?/sec    1.00     17.7±0.28µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, mandatory, no NULLs                     1.00    172.9±1.77µs        ? ?/sec    1.01   174.7±10.11µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, half NULLs                    1.01    251.7±5.00µs        ? ?/sec    1.00    249.7±5.55µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/byte_stream_split encoded, optional, no NULLs                      1.02    178.4±6.38µs        ? ?/sec    1.00    174.3±6.38µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, mandatory, no NULLs                                 1.00     22.8±0.76µs        ? ?/sec    1.01     23.0±1.19µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, half NULLs                                1.00    176.8±6.67µs        ? ?/sec    1.01    178.2±4.92µs        ? ?/sec
arrow_array_reader/FixedLenByteArray(8)/plain encoded, optional, no NULLs                                  1.01     27.7±0.66µs        ? ?/sec    1.00     27.5±0.93µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/binary packed skip, mandatory, no NULLs                           1.01     70.8±0.53µs        ? ?/sec    1.00     70.3±2.16µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/binary packed skip, optional, half NULLs                          1.02     90.1±0.63µs        ? ?/sec    1.00     88.4±0.85µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/binary packed skip, optional, no NULLs                            1.03     74.4±2.08µs        ? ?/sec    1.00     72.3±0.56µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/binary packed, mandatory, no NULLs                                1.00    101.8±0.71µs        ? ?/sec    1.01    102.8±3.79µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/binary packed, optional, half NULLs                               1.02    153.8±1.49µs        ? ?/sec    1.00    150.9±2.13µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/binary packed, optional, no NULLs                                 1.03    107.5±2.69µs        ? ?/sec    1.00    104.8±0.94µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/byte_stream_split encoded, mandatory, no NULLs                    1.00     45.2±0.38µs        ? ?/sec    1.00     45.0±1.96µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/byte_stream_split encoded, optional, half NULLs                   1.03    125.4±3.75µs        ? ?/sec    1.00    122.1±1.83µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/byte_stream_split encoded, optional, no NULLs                     1.00     50.5±0.50µs        ? ?/sec    1.00     50.5±1.81µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/dictionary encoded, mandatory, no NULLs                           1.00     80.7±0.65µs        ? ?/sec    1.00     80.9±3.00µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/dictionary encoded, optional, half NULLs                          1.03    145.7±4.25µs        ? ?/sec    1.00    141.4±1.62µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/dictionary encoded, optional, no NULLs                            1.01     85.6±0.74µs        ? ?/sec    1.00     84.7±2.72µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/plain encoded, mandatory, no NULLs                                1.02     47.3±0.47µs        ? ?/sec    1.00     46.6±0.75µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/plain encoded, optional, half NULLs                               1.02    125.1±1.31µs        ? ?/sec    1.00    122.3±1.26µs        ? ?/sec
arrow_array_reader/INT32/Decimal128Array/plain encoded, optional, no NULLs                                 1.03     53.7±2.12µs        ? ?/sec    1.00     52.3±1.58µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/binary packed skip, mandatory, no NULLs                           1.01     67.6±0.66µs        ? ?/sec    1.00     67.0±2.51µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/binary packed skip, optional, half NULLs                          1.02     92.6±0.75µs        ? ?/sec    1.00     90.4±0.72µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/binary packed skip, optional, no NULLs                            1.04     70.9±2.17µs        ? ?/sec    1.00     68.4±0.73µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/binary packed, mandatory, no NULLs                                1.00    100.1±0.94µs        ? ?/sec    1.00     99.9±4.14µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/binary packed, optional, half NULLs                               1.02    161.3±1.50µs        ? ?/sec    1.00    157.6±1.58µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/binary packed, optional, no NULLs                                 1.03    105.9±3.80µs        ? ?/sec    1.00    103.1±1.41µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/byte_stream_split encoded, mandatory, no NULLs                    1.01    105.6±1.79µs        ? ?/sec    1.00    104.3±6.07µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/byte_stream_split encoded, optional, half NULLs                   1.04    163.6±4.37µs        ? ?/sec    1.00    157.6±1.60µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/byte_stream_split encoded, optional, no NULLs                     1.05    112.4±4.12µs        ? ?/sec    1.00    106.8±1.38µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/dictionary encoded, mandatory, no NULLs                           1.02     88.0±1.14µs        ? ?/sec    1.00     86.4±3.21µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/dictionary encoded, optional, half NULLs                          1.03    156.2±3.92µs        ? ?/sec    1.00    151.5±1.50µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/dictionary encoded, optional, no NULLs                            1.04     93.4±3.51µs        ? ?/sec    1.00     90.0±1.12µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/plain encoded, mandatory, no NULLs                                1.02     60.5±0.61µs        ? ?/sec    1.00     59.1±0.73µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/plain encoded, optional, half NULLs                               1.03    139.9±3.52µs        ? ?/sec    1.00    136.2±1.57µs        ? ?/sec
arrow_array_reader/INT64/Decimal128Array/plain encoded, optional, no NULLs                                 1.06     68.2±3.92µs        ? ?/sec    1.00     64.5±1.95µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed skip, mandatory, no NULLs                                      1.00     56.3±0.45µs        ? ?/sec    1.02     57.5±1.03µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed skip, optional, half NULLs                                     1.00     75.4±2.12µs        ? ?/sec    1.01     76.0±0.52µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed skip, optional, no NULLs                                       1.01     59.9±2.22µs        ? ?/sec    1.00     59.3±0.50µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                           1.00     73.9±0.52µs        ? ?/sec    1.02     75.1±0.36µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                          1.00    124.3±3.51µs        ? ?/sec    1.03    127.8±1.18µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                            1.00     78.5±0.85µs        ? ?/sec    1.02     79.8±0.68µs        ? ?/sec
arrow_array_reader/Int32Array/byte_stream_split encoded, mandatory, no NULLs                               1.00     18.4±0.15µs        ? ?/sec    1.00     18.4±0.34µs        ? ?/sec
arrow_array_reader/Int32Array/byte_stream_split encoded, optional, half NULLs                              1.00     96.0±1.05µs        ? ?/sec    1.01     97.0±2.01µs        ? ?/sec
arrow_array_reader/Int32Array/byte_stream_split encoded, optional, no NULLs                                1.00     23.5±0.72µs        ? ?/sec    1.00     23.6±0.34µs        ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs                                      1.00     53.1±0.47µs        ? ?/sec    1.02     53.9±1.07µs        ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs                                     1.00    115.0±1.44µs        ? ?/sec    1.01    115.9±1.38µs        ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs                                       1.00     58.3±2.06µs        ? ?/sec    1.00     58.3±0.51µs        ? ?/sec
arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs                                           1.00     19.7±0.71µs        ? ?/sec    1.08     21.4±0.14µs        ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs                                          1.00     94.7±3.61µs        ? ?/sec    1.05     99.7±1.06µs        ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs                                            1.00     24.1±0.34µs        ? ?/sec    1.08     26.2±0.43µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed skip, mandatory, no NULLs                                      1.02     54.0±1.51µs        ? ?/sec    1.00     53.0±0.70µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed skip, optional, half NULLs                                     1.02     77.7±0.81µs        ? ?/sec    1.00     76.5±0.94µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed skip, optional, no NULLs                                       1.01     55.5±0.35µs        ? ?/sec    1.00     55.0±0.71µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                           1.02     74.4±2.54µs        ? ?/sec    1.00     72.8±0.63µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                          1.00    131.5±0.87µs        ? ?/sec    1.00    131.5±2.56µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                            1.01     77.9±0.92µs        ? ?/sec    1.00     77.1±0.87µs        ? ?/sec
arrow_array_reader/Int64Array/byte_stream_split encoded, mandatory, no NULLs                               1.03     78.5±3.34µs        ? ?/sec    1.00     76.1±1.67µs        ? ?/sec
arrow_array_reader/Int64Array/byte_stream_split encoded, optional, half NULLs                              1.02    133.7±1.21µs        ? ?/sec    1.00    130.8±2.32µs        ? ?/sec
arrow_array_reader/Int64Array/byte_stream_split encoded, optional, no NULLs                                1.02     82.5±0.74µs        ? ?/sec    1.00     80.6±0.95µs        ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs                                      1.01     61.3±2.01µs        ? ?/sec    1.00     60.4±1.18µs        ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs                                     1.01    126.5±0.93µs        ? ?/sec    1.00    125.2±2.15µs        ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs                                       1.01     65.0±0.95µs        ? ?/sec    1.00     64.3±0.92µs        ? ?/sec
arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs                                           1.02     34.8±1.92µs        ? ?/sec    1.00     34.2±0.84µs        ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs                                          1.00    110.5±1.13µs        ? ?/sec    1.00    110.8±2.15µs        ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs                                            1.01     39.0±1.13µs        ? ?/sec    1.00     38.9±0.39µs        ? ?/sec
arrow_array_reader/ListArray/plain encoded optional strings half NULLs                                     1.01      4.2±0.04ms        ? ?/sec    1.00      4.1±0.03ms        ? ?/sec
arrow_array_reader/ListArray/plain encoded optional strings no NULLs                                       1.01      6.6±0.04ms        ? ?/sec    1.00      6.5±0.06ms        ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs                                     1.01   406.1±19.03µs        ? ?/sec    1.00    403.2±7.59µs        ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs                                    1.12   391.0±24.22µs        ? ?/sec    1.00   349.8±17.00µs        ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs                                      1.00    399.3±8.95µs        ? ?/sec    1.02   406.4±11.74µs        ? ?/sec
arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs                                          1.01   591.7±43.01µs        ? ?/sec    1.00   585.5±35.36µs        ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, half NULLs                                         1.16   461.1±20.71µs        ? ?/sec    1.00   396.9±20.56µs        ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, no NULLs                                           1.02   595.6±27.25µs        ? ?/sec    1.00   585.6±41.04µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs                                1.03    292.0±6.07µs        ? ?/sec    1.00    284.8±2.80µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs                               1.02    333.6±3.46µs        ? ?/sec    1.00    326.1±4.68µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs                                 1.02    294.6±3.21µs        ? ?/sec    1.00    288.4±3.38µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs                                 1.00    108.4±3.67µs        ? ?/sec    1.18    127.4±6.85µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs                                1.00    154.6±8.05µs        ? ?/sec    1.11   172.2±12.16µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs                                  1.00    113.2±3.89µs        ? ?/sec    1.16    131.6±7.17µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs                                      1.00    223.4±5.59µs        ? ?/sec    1.02    227.7±6.06µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs                                     1.01    220.0±8.45µs        ? ?/sec    1.00    217.9±8.67µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs                                       1.00    233.2±7.53µs        ? ?/sec    1.01    235.5±4.78µs        ? ?/sec

@alamb
Copy link
Contributor Author

alamb commented Apr 10, 2025

Thanks @zhuqi-lucas !

Is there any way you could look into creating a benchmark for evaluating filters? I can do so too if you prefer

The idea is to create a benchmarks for evaluating row filters (what @XiangpengHao is trying to optimize) that captures the common use case and is what we are trying to optimize in DataFusion

You add a row filter with this API:
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter

I suggest a benchmarks that

  1. Writes a parquet file with 100K rows and four columns (int64, float64, Utf8View, and Timestamp) into memory
  2. Adds filters + projections
  3. Benchmarks how fast it is to read the data back

For the filters, it is important to capture both selective filters (that select a small number of contiguous ranges) as well as non selective filters (that select rows that are scattered througout the data). Here are suggestions

Filters:

  1. A string filter like col <> '' that selects about 1/2 of the data
  2. String Filter like col = 'const' that is selective and selects only a few rows
  3. Integer filter like col = '' (both selective and non selective)
  4. Timestamp filter like 'ts > time'

For the projections, it is important to capture both when the column with the predicate appears at the output as well as when the column wihtout the predicate appears. Here are suggestions

Projections (which columns are selected out):

  1. All 4 columns
  2. Some column other than the filter column

@zhuqi-lucas
Copy link
Contributor

Thank you @alamb , i submitted a draft version

#7401

And meanwhile, i will submit the comparing testing result soon.

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 10, 2025

The result shows a little improvement for caching page, almost no difference between main branch, may be i am missing something? Or i need to increase batch size or page size for caching?

critcmp main better_decode
group                                                                                                      better_decode                          main
-----                                                                                                      -------------                          ----
arrow_reader_row_filter/filter_case: int64 = 0 project_case: all_columns/                                  1.02    802.1±8.17µs        ? ?/sec    1.00   788.9±12.42µs        ? ?/sec
arrow_reader_row_filter/filter_case: int64 = 0 project_case: exclude_filter_column/                        1.02    731.3±6.36µs        ? ?/sec    1.00   716.1±13.29µs        ? ?/sec
arrow_reader_row_filter/filter_case: int64 even project_case: all_columns/                                 1.02     18.5±0.56ms        ? ?/sec    1.00     18.1±0.09ms        ? ?/sec
arrow_reader_row_filter/filter_case: int64 even project_case: exclude_filter_column/                       1.02     13.6±0.26ms        ? ?/sec    1.00     13.3±0.08ms        ? ?/sec
arrow_reader_row_filter/filter_case: ts > 50_000 project_case: all_columns/                                1.00  1254.5±30.86µs        ? ?/sec    1.01  1263.4±12.15µs        ? ?/sec
arrow_reader_row_filter/filter_case: ts > 50_000 project_case: exclude_filter_column/                      1.00  1075.5±43.17µs        ? ?/sec    1.00   1074.6±9.40µs        ? ?/sec
arrow_reader_row_filter/filter_case: utf8View <> '' project_case: all_columns/                             1.00     18.4±0.22ms        ? ?/sec    1.00     18.3±0.06ms        ? ?/sec
arrow_reader_row_filter/filter_case: utf8View <> '' project_case: exclude_filter_column/                   1.00     15.2±0.18ms        ? ?/sec    1.01     15.4±0.10ms        ? ?/sec
arrow_reader_row_filter/filter_case: utf8View = 'const' project_case: all_columns/                         1.00   893.4±15.74µs        ? ?/sec    1.07    959.3±8.81µs        ? ?/sec
arrow_reader_row_filter/filter_case: utf8View = 'const' project_case: exclude_filter_column/               1.00    851.5±7.54µs        ? ?/sec    1.07   912.2±10.13µs        ? ?/sec

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 11, 2025

I may know the reason, the cache now only used by async read, so the sync read testing case will not hit the cache, i will try to add to the sync read and testing again, or find a good way to use async read.

@alamb
Copy link
Contributor Author

alamb commented Apr 11, 2025

I am starting to check it out...

@zhuqi-lucas
Copy link
Contributor

Thank you @alamb , i already changed the benchmark testing to trigger cache logic, and the testing works well for main branch, but the testing seems stuck when change to the cache page branch. I am trying to debug it.

@alamb
Copy link
Contributor Author

alamb commented Apr 11, 2025

I spent some more time today thinking about what filter patterns are important to test and came up with the following six patterns (to replace the 4 I suggested above).

What do you think @zhuqi-lucas ? @XiangpengHao did I miss any important filter patterns?

 ┌───────────────┐    ┌───────────────┐                                                                     
 │               │    │               │                                                                     
 │               │    │      ...      │                                                                     
 │               │    │               │                                                                     
 │               │    │               │                                                                     
 │     ...       │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                                                                     
 │               │    │               │                                                                     
 │               │    │      ...      │                                                                     
 │               │    │               │                                                                     
 │               │    │               │                                                                     
 └───────────────┘    └───────────────┘                                                                     
                                                                                                            
 "Point Lookup": selects a single row                                                                       
 (1 RowSelection of 1 row)                                                                                  
                                                                                                            
 ┌───────────────┐    ┌───────────────┐                                                                     
 │      ...      │    │               │                                                                     
 │               │    │               │                                                                     
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │               │                                                                     
 │               │    │      ...      │                                                                     
 │               │    │               │                                                                     
 │               │    │               │                                                                     
 │      ...      │    │               │                                                                     
 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                                                                     
 │               │    │               │                                                                     
 └───────────────┘    └───────────────┘                                                                     
 selective (1%) unclustered filter                                                                          
 (1000 RowSelection of 10 rows each)                                                                        
                                                                                                            
                                                                                                            
 ┌───────────────┐    ┌───────────────┐                 ┌───────────────┐    ┌───────────────┐              
 │      ...      │    │               │                 │               │    │               │              
 │               │    │               │                 │               │    │               │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │     ...       │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │               │                 │               │    │               │              
 │               │    │               │                 │     ...       │    │               │              
 │               │    │      ...      │                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │      ...      │    │               │                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │               │    │               │                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 └───────────────┘    └───────────────┘                 └───────────────┘    └───────────────┘              
                                                        moderately selective (10%) clustered filter         
moderately selective (10%) unclustered filter           (10 RowSelections of 10,000 rows each)              
(10000 RowSelection of 10 rows each)                                                                        
 ┌───────────────┐    ┌───────────────┐                 ┌───────────────┐    ┌───────────────┐              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │               │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │               │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │     ...       │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │               │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │     ...       │    │               │              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │               │                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 │               │    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│              
 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│    │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│                 └───────────────┘    └───────────────┘              
 └───────────────┘    └───────────────┘                                                                     
 unselective (99%) unclustered filter                   unselective (90%) clustered filter                  
 (99,000 RowSelections of 10 rows each)                 (99 RowSelection of 10,000 rows each)               
                                                                                                            

@zhuqi-lucas
Copy link
Contributor

Thank you @alamb , LGTM, i will try to address those cases, one more question:

Do we need to change page size to test it and compare since we cache the page?

@alamb
Copy link
Contributor Author

alamb commented Apr 11, 2025

Do we need to change page size to test it and compare since we cache the page?

I think what is important in this case is that that the test parquet file that is created has multiple pages for each column -- I think the default writer's limit is 20k rows per page, so I don't think any changes are needed

however, we should double check the file that it has multiple pages -- maybe using @XiangpengHao 's great https://parquet-viewer.xiangpeng.systems/

@zhuqi-lucas
Copy link
Contributor

Do we need to change page size to test it and compare since we cache the page?

I think what is important in this case is that that the test parquet file that is created has multiple pages for each column -- I think the default writer's limit is 20k rows per page, so I don't think any changes are needed

however, we should double check the file that it has multiple pages -- maybe using @XiangpengHao 's great https://parquet-viewer.xiangpeng.systems/

Thank you @alamb , good to know the default writer's limit is 20k rows per page. And a good tool to know!

https://parquet-viewer.xiangpeng.systems/

@alamb
Copy link
Contributor Author

alamb commented Apr 12, 2025

Those are very nice figures and summarize the common filter patterns I can think of! @alamb
As a side note, not directly related to this issue, but it would be great to also have benchmarks for two+ filters, which will exercise the performance of row selector operations, such as and_then, union, etc.

Thank you @XiangpengHao , this is a good point to add for testing.

I am also thinking it might make more sense to make specific microbenchmarks for RowSelection operations (like and_then and combine) as we will only be able to cover so much using a filter application benchmark

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 13, 2025

Thank you @alamb , i already changed the benchmark testing to trigger cache logic, and the testing works well for main branch, but the testing seems stuck when change to the cache page branch. I am trying to debug it.

@alamb @XiangpengHao
I found the bug is caused by the depending usage for skip page for the cache page branch, it will not skip dict page so the has_next will always in the loop, i created the PR try to fix the issue:

#7409

And why main branch did not trigger it, because currently the skip dict page usage are not triggering besides the benchmark for cache page logic.

@zhuqi-lucas
Copy link
Contributor

When i increase the data set from 100000 to 1000000, there were regressions about the PointLookup:

And i try to fix it with this commit:
0c3aa9b
branch polish_better_decoder
#7428

It seems it will fix the regression about PointLookup with big data set:

critcmp better-decoder polish_better_decoder
group                                                                                 better-decoder                         polish_better_decoder
-----                                                                                 --------------                         ---------------------
arrow_reader_row_filter/Composite/all_columns/async                                   1.03     12.9±0.14ms        ? ?/sec    1.00     12.6±0.12ms        ? ?/sec
arrow_reader_row_filter/Composite/all_columns/sync                                    1.00     14.5±0.10ms        ? ?/sec    1.00     14.6±0.31ms        ? ?/sec
arrow_reader_row_filter/Composite/exclude_filter_column/async                         1.03     12.7±0.44ms        ? ?/sec    1.00     12.4±0.20ms        ? ?/sec
arrow_reader_row_filter/Composite/exclude_filter_column/sync                          1.02     13.3±0.32ms        ? ?/sec    1.00     13.1±0.21ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/all_columns/async                1.00     12.3±0.11ms        ? ?/sec    1.00     12.3±0.16ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/all_columns/sync                 1.00     13.1±0.12ms        ? ?/sec    1.00     13.2±0.14ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/exclude_filter_column/async      1.01     12.3±0.44ms        ? ?/sec    1.00     12.1±0.17ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/exclude_filter_column/sync       1.00     12.7±0.10ms        ? ?/sec    1.01     12.8±0.17ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/all_columns/async              1.01     41.7±0.59ms        ? ?/sec    1.00     41.4±0.51ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/all_columns/sync               1.00     41.7±0.34ms        ? ?/sec    1.00     41.6±0.36ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/exclude_filter_column/async    1.00     33.9±0.28ms        ? ?/sec    1.00     33.8±0.29ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/exclude_filter_column/sync     1.00     34.3±0.28ms        ? ?/sec    1.00     34.2±0.33ms        ? ?/sec
arrow_reader_row_filter/PointLookup/all_columns/async                                 3.89     10.2±0.07ms        ? ?/sec    1.00      2.6±0.03ms        ? ?/sec
arrow_reader_row_filter/PointLookup/all_columns/sync                                  1.02      2.9±0.14ms        ? ?/sec    1.00      2.8±0.04ms        ? ?/sec
arrow_reader_row_filter/PointLookup/exclude_filter_column/async                       3.91     10.2±0.08ms        ? ?/sec    1.00      2.6±0.04ms        ? ?/sec
arrow_reader_row_filter/PointLookup/exclude_filter_column/sync                        1.00      2.8±0.03ms        ? ?/sec    1.00      2.8±0.05ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/all_columns/async                        1.00     15.1±0.56ms        ? ?/sec    1.01     15.2±0.23ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/all_columns/sync                         1.00     16.3±0.21ms        ? ?/sec    1.01     16.5±0.18ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/exclude_filter_column/async              1.00     14.0±0.16ms        ? ?/sec    1.01     14.1±0.15ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/exclude_filter_column/sync               1.01     14.8±0.54ms        ? ?/sec    1.00     14.6±0.13ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/all_columns/async                        1.00     20.6±0.14ms        ? ?/sec    1.03     21.3±0.25ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/all_columns/sync                         1.00     21.8±0.19ms        ? ?/sec    1.03     22.4±0.17ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/exclude_filter_column/async              1.00     19.8±0.24ms        ? ?/sec    1.02     20.2±0.18ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/exclude_filter_column/sync               1.00     20.5±0.22ms        ? ?/sec    1.03     21.1±0.16ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/all_columns/async                      1.00     15.1±0.13ms        ? ?/sec    1.01     15.2±0.17ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/all_columns/sync                       1.00     16.5±0.15ms        ? ?/sec    1.00     16.4±0.14ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/exclude_filter_column/async            1.01     14.1±0.13ms        ? ?/sec    1.00     14.0±0.18ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/exclude_filter_column/sync             1.00     14.6±0.10ms        ? ?/sec    1.00     14.6±0.18ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/async                            1.00     95.9±0.73ms        ? ?/sec    1.00     95.8±0.94ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/sync                             1.01    104.4±0.78ms        ? ?/sec    1.00    103.1±1.08ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/async                  1.01     72.7±0.44ms        ? ?/sec    1.00     71.8±0.57ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/sync                   1.02     73.6±0.94ms        ? ?/sec    1.00     72.1±0.74ms        ? ?/sec

@alamb
Copy link
Contributor Author

alamb commented Apr 19, 2025

Amazing -- thank you @zhuqi-lucas

I suggest we start a new PR with the proposal (e.g. the contents of #6921 and your fixes).

It seems like it is quite close now -- so we should start preparing a PR to merge. (edit -- I see you have done this in #7428)

Also, I will unfortunately be offline until next weekend starting in a few hours so I will likely not be able to respond until then

Maybe @XiangpengHao @Dandandan can help review the changes in the mean time

and again thank you for all your work on this issue

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 19, 2025

Thank you @alamb, i think following steps need:

  1. To check why the CI testing fail.
  2. We will solve the regression about PointLookup with big data set using above fix, but we still need to investigate why benchmark no obvious performance improvement same with the clickbench in datafusion.
critcmp main polish_better_decoder
group                                                                                 main                                   polish_better_decoder
-----                                                                                 ----                                   ---------------------
arrow_reader_row_filter/Composite/all_columns/async                                   1.05     13.1±0.17ms        ? ?/sec    1.00     12.6±0.12ms        ? ?/sec
arrow_reader_row_filter/Composite/all_columns/sync                                    1.00     14.5±0.13ms        ? ?/sec    1.01     14.6±0.31ms        ? ?/sec
arrow_reader_row_filter/Composite/exclude_filter_column/async                         1.02     12.6±0.55ms        ? ?/sec    1.00     12.4±0.20ms        ? ?/sec
arrow_reader_row_filter/Composite/exclude_filter_column/sync                          1.01     13.2±0.17ms        ? ?/sec    1.00     13.1±0.21ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/all_columns/async                1.00     11.7±0.09ms        ? ?/sec    1.05     12.3±0.16ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/all_columns/sync                 1.00     13.1±0.69ms        ? ?/sec    1.00     13.2±0.14ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/exclude_filter_column/async      1.00     11.7±0.56ms        ? ?/sec    1.03     12.1±0.17ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveClustered/exclude_filter_column/sync       1.00     12.5±0.11ms        ? ?/sec    1.03     12.8±0.17ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/all_columns/async              1.01     41.8±1.50ms        ? ?/sec    1.00     41.4±0.51ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/all_columns/sync               1.01     42.0±1.56ms        ? ?/sec    1.00     41.6±0.36ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/exclude_filter_column/async    1.02     34.5±1.28ms        ? ?/sec    1.00     33.8±0.29ms        ? ?/sec
arrow_reader_row_filter/ModeratelySelectiveUnclustered/exclude_filter_column/sync     1.01     34.4±0.67ms        ? ?/sec    1.00     34.2±0.33ms        ? ?/sec
arrow_reader_row_filter/PointLookup/all_columns/async                                 1.00      2.4±0.02ms        ? ?/sec    1.09      2.6±0.03ms        ? ?/sec
arrow_reader_row_filter/PointLookup/all_columns/sync                                  1.00      2.7±0.04ms        ? ?/sec    1.07      2.8±0.04ms        ? ?/sec
arrow_reader_row_filter/PointLookup/exclude_filter_column/async                       1.00      2.5±0.05ms        ? ?/sec    1.06      2.6±0.04ms        ? ?/sec
arrow_reader_row_filter/PointLookup/exclude_filter_column/sync                        1.00      2.7±0.05ms        ? ?/sec    1.04      2.8±0.05ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/all_columns/async                        1.01     15.4±0.91ms        ? ?/sec    1.00     15.2±0.23ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/all_columns/sync                         1.00     15.9±0.15ms        ? ?/sec    1.04     16.5±0.18ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/exclude_filter_column/async              1.00     13.4±0.17ms        ? ?/sec    1.05     14.1±0.15ms        ? ?/sec
arrow_reader_row_filter/SelectiveUnclustered/exclude_filter_column/sync               1.00     14.3±0.66ms        ? ?/sec    1.02     14.6±0.13ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/all_columns/async                        1.01     21.4±0.84ms        ? ?/sec    1.00     21.3±0.25ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/all_columns/sync                         1.01     22.6±0.93ms        ? ?/sec    1.00     22.4±0.17ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/exclude_filter_column/async              1.00     20.0±0.20ms        ? ?/sec    1.01     20.2±0.18ms        ? ?/sec
arrow_reader_row_filter/UnselectiveClustered/exclude_filter_column/sync               1.00     21.0±0.19ms        ? ?/sec    1.01     21.1±0.16ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/all_columns/async                      1.00     15.1±0.13ms        ? ?/sec    1.00     15.2±0.17ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/all_columns/sync                       1.02     16.7±0.80ms        ? ?/sec    1.00     16.4±0.14ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/exclude_filter_column/async            1.00     13.6±0.15ms        ? ?/sec    1.03     14.0±0.18ms        ? ?/sec
arrow_reader_row_filter/UnselectiveUnclustered/exclude_filter_column/sync             1.00     14.5±0.13ms        ? ?/sec    1.01     14.6±0.18ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/async                            1.10    105.4±2.71ms        ? ?/sec    1.00     95.8±0.94ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/sync                             1.02    105.5±3.65ms        ? ?/sec    1.00    103.1±1.08ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/async                  1.04     74.9±1.85ms        ? ?/sec    1.00     71.8±0.57ms        ? ?/sec
arrow_reader_row_filter/Utf8ViewNonEmpty/exclude_filter_column/sync                   1.03     73.9±2.76ms        ? ?/sec    1.00     72.1±0.74ms        ? ?/sec

Updated:

arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/async                            1.10    105.4±2.71ms        ? ?/sec    1.00     95.8±0.94ms        ? ?/sec

arrow_reader_row_filter/Utf8ViewNonEmpty/all_columns/async has 10% improvement, it seems in theory we are right, because all_columns/async are our page cache improvement!

  1. Check datafusion clickbench result with new PR.

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 22, 2025

It looks like the performance still need to be improved, tested locally for my mac, a lot of still slower compared to the older_push_down which without page cache:

./bench.sh compare  older_push_down test_default_parquet_push_down
Comparing older_push_down and test_default_parquet_push_down
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ older_push_down ┃ test_default_parquet_push_down ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 00.29ms │                         0.29ms │     no change │
│ QQuery 150.39ms │                        47.96ms │     no change │
│ QQuery 277.56ms │                        78.10ms │     no change │
│ QQuery 377.95ms │                        85.75ms │  1.10x slower │
│ QQuery 4579.45ms │                       520.55ms │ +1.11x faster │
│ QQuery 5560.83ms │                       583.54ms │     no change │
│ QQuery 60.33ms │                         0.37ms │  1.10x slower │
│ QQuery 763.95ms │                        57.39ms │ +1.11x faster │
│ QQuery 8718.15ms │                       730.24ms │     no change │
│ QQuery 9786.44ms │                       791.21ms │     no change │
│ QQuery 10206.47ms │                       178.59ms │ +1.16x faster │
│ QQuery 11213.81ms │                       205.09ms │     no change │
│ QQuery 12679.51ms │                       611.23ms │ +1.11x faster │
│ QQuery 131002.85ms │                       915.22ms │ +1.10x faster │
│ QQuery 14763.19ms │                       655.43ms │ +1.16x faster │
│ QQuery 15649.33ms │                       642.34ms │     no change │
│ QQuery 161390.12ms │                      1409.99ms │     no change │
│ QQuery 171219.49ms │                      1296.90ms │  1.06x slower │
│ QQuery 182932.03ms │                      2748.39ms │ +1.07x faster │
│ QQuery 1963.44ms │                        72.62ms │  1.14x slower │
│ QQuery 20752.05ms │                       687.70ms │ +1.09x faster │
│ QQuery 21927.06ms │                       724.69ms │ +1.28x faster │
│ QQuery 221558.98ms │                      1284.78ms │ +1.21x faster │
│ QQuery 233402.25ms │                      3753.29ms │  1.10x slower │
│ QQuery 24471.04ms │                       378.47ms │ +1.24x faster │
│ QQuery 25420.43ms │                       317.64ms │ +1.32x faster │
│ QQuery 26522.57ms │                       405.79ms │ +1.29x faster │
│ QQuery 271375.58ms │                      1144.70ms │ +1.20x faster │
│ QQuery 288767.85ms │                      8608.97ms │     no change │
│ QQuery 29458.33ms │                       451.06ms │     no change │
│ QQuery 30689.41ms │                       662.65ms │     no change │
│ QQuery 31758.38ms │                       755.04ms │     no change │
│ QQuery 322981.93ms │                      2424.51ms │ +1.23x faster │
│ QQuery 332886.54ms │                      2747.83ms │     no change │
│ QQuery 343298.54ms │                      3322.28ms │     no change │
│ QQuery 35834.81ms │                       876.42ms │     no change │
│ QQuery 3640.41ms │                        94.12ms │  2.33x slower │
│ QQuery 3735.59ms │                        50.69ms │  1.42x slower │
│ QQuery 3839.18ms │                        92.59ms │  2.36x slower │
│ QQuery 3935.07ms │                       133.18ms │  3.80x slower │
│ QQuery 4036.55ms │                        43.06ms │  1.18x slower │
│ QQuery 4136.18ms │                        42.14ms │  1.16x slower │
│ QQuery 4237.00ms │                        40.48ms │  1.09x slower │
└──────────────┴─────────────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (older_push_down)42401.30ms │
│ Total Time (test_default_parquet_push_down)40673.29ms │
│ Average Time (older_push_down)986.08ms │
│ Average Time (test_default_parquet_push_down)945.89ms │
│ Queries Faster15 │
│ Queries Slower12 │
│ Queries with No Change16 │
└───────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ older_push_down ┃ test_default_parquet_push_down ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 01.12ms │                         1.16ms │     no change │
│ QQuery 125.44ms │                        27.52ms │  1.08x slower │
│ QQuery 254.65ms │                        60.26ms │  1.10x slower │
│ QQuery 356.82ms │                        61.16ms │  1.08x slower │
│ QQuery 4484.52ms │                       482.47ms │     no change │
│ QQuery 5524.38ms │                       551.00ms │  1.05x slower │
│ QQuery 61.24ms │                         1.25ms │     no change │
│ QQuery 743.00ms │                        44.17ms │     no change │
│ QQuery 8632.15ms │                       676.51ms │  1.07x slower │
│ QQuery 9658.82ms │                       692.69ms │  1.05x slower │
│ QQuery 10154.14ms │                       149.79ms │     no change │
│ QQuery 11183.86ms │                       173.48ms │ +1.06x faster │
│ QQuery 12664.23ms │                       619.06ms │ +1.07x faster │
│ QQuery 13916.31ms │                       897.86ms │     no change │
│ QQuery 14717.15ms │                       650.81ms │ +1.10x faster │
│ QQuery 15561.85ms │                       579.71ms │     no change │
│ QQuery 161430.77ms │                      1388.65ms │     no change │
│ QQuery 171249.19ms │                      1150.01ms │ +1.09x faster │
│ QQuery 182739.07ms │                      2768.37ms │     no change │
│ QQuery 1941.63ms │                        53.35ms │  1.28x slower │
│ QQuery 20724.72ms │                       717.24ms │     no change │
│ QQuery 21819.39ms │                       728.06ms │ +1.13x faster │
│ QQuery 221551.09ms │                      1232.23ms │ +1.26x faster │
│ QQuery 233565.46ms │                      3089.17ms │ +1.15x faster │
│ QQuery 24457.95ms │                       365.63ms │ +1.25x faster │
│ QQuery 25381.44ms │                       306.09ms │ +1.25x faster │
│ QQuery 26526.80ms │                       409.19ms │ +1.29x faster │
│ QQuery 271415.36ms │                      1278.68ms │ +1.11x faster │
│ QQuery 288528.35ms │                      8521.98ms │     no change │
│ QQuery 29401.68ms │                       389.67ms │     no change │
│ QQuery 30728.25ms │                       654.12ms │ +1.11x faster │
│ QQuery 31744.53ms │                       784.18ms │  1.05x slower │
│ QQuery 323257.23ms │                      2953.12ms │ +1.10x faster │
│ QQuery 333187.75ms │                      2755.04ms │ +1.16x faster │
│ QQuery 343735.45ms │                      3459.67ms │ +1.08x faster │
│ QQuery 35877.66ms │                       885.68ms │     no change │
│ QQuery 3626.23ms │                        83.86ms │  3.20x slower │
│ QQuery 3723.04ms │                        39.59ms │  1.72x slower │
│ QQuery 3821.55ms │                        80.23ms │  3.72x slower │
│ QQuery 3922.02ms │                       124.36ms │  5.65x slower │
│ QQuery 4021.54ms │                        32.89ms │  1.53x slower │
│ QQuery 4121.64ms │                        30.26ms │  1.40x slower │
│ QQuery 4223.77ms │                        28.83ms │  1.21x slower │
└──────────────┴─────────────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (older_push_down)42203.20ms │
│ Total Time (test_default_parquet_push_down)39979.07ms │
│ Average Time (older_push_down)981.47ms │
│ Average Time (test_default_parquet_push_down)929.75ms │
│ Queries Faster15 │
│ Queries Slower15 │
│ Queries with No Change13 │
└───────────────────────────────────────────────┴────────────┘

And compared with no filter push down, it's more slower cases:

./bench.sh  compare  main test_default_parquet_push_down
Comparing main and test_default_parquet_push_down
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ test_default_parquet_push_down ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 00.32ms │                         0.29ms │ +1.09x faster │
│ QQuery 146.96ms │                        47.96ms │     no change │
│ QQuery 275.59ms │                        78.10ms │     no change │
│ QQuery 374.12ms │                        85.75ms │  1.16x slower │
│ QQuery 4556.03ms │                       520.55ms │ +1.07x faster │
│ QQuery 5563.52ms │                       583.54ms │     no change │
│ QQuery 60.31ms │                         0.37ms │  1.18x slower │
│ QQuery 752.23ms │                        57.39ms │  1.10x slower │
│ QQuery 8720.15ms │                       730.24ms │     no change │
│ QQuery 9741.10ms │                       791.21ms │  1.07x slower │
│ QQuery 10171.95ms │                       178.59ms │     no change │
│ QQuery 11187.66ms │                       205.09ms │  1.09x slower │
│ QQuery 12597.16ms │                       611.23ms │     no change │
│ QQuery 13877.71ms │                       915.22ms │     no change │
│ QQuery 14605.11ms │                       655.43ms │  1.08x slower │
│ QQuery 15630.66ms │                       642.34ms │     no change │
│ QQuery 161422.47ms │                      1409.99ms │     no change │
│ QQuery 171221.90ms │                      1296.90ms │  1.06x slower │
│ QQuery 182773.23ms │                      2748.39ms │     no change │
│ QQuery 1966.30ms │                        72.62ms │  1.10x slower │
│ QQuery 20682.62ms │                       687.70ms │     no change │
│ QQuery 21800.86ms │                       724.69ms │ +1.11x faster │
│ QQuery 221521.09ms │                      1284.78ms │ +1.18x faster │
│ QQuery 234223.95ms │                      3753.29ms │ +1.13x faster │
│ QQuery 24286.83ms │                       378.47ms │  1.32x slower │
│ QQuery 25274.47ms │                       317.64ms │  1.16x slower │
│ QQuery 26320.45ms │                       405.79ms │  1.27x slower │
│ QQuery 27945.72ms │                      1144.70ms │  1.21x slower │
│ QQuery 288206.32ms │                      8608.97ms │     no change │
│ QQuery 29459.59ms │                       451.06ms │     no change │
│ QQuery 30493.35ms │                       662.65ms │  1.34x slower │
│ QQuery 31585.82ms │                       755.04ms │  1.29x slower │
│ QQuery 322436.43ms │                      2424.51ms │     no change │
│ QQuery 332916.52ms │                      2747.83ms │ +1.06x faster │
│ QQuery 342975.16ms │                      3322.28ms │  1.12x slower │
│ QQuery 35866.75ms │                       876.42ms │     no change │
│ QQuery 36104.22ms │                        94.12ms │ +1.11x faster │
│ QQuery 3762.50ms │                        50.69ms │ +1.23x faster │
│ QQuery 38107.57ms │                        92.59ms │ +1.16x faster │
│ QQuery 39167.64ms │                       133.18ms │ +1.26x faster │
│ QQuery 4046.49ms │                        43.06ms │ +1.08x faster │
│ QQuery 4145.49ms │                        42.14ms │ +1.08x faster │
│ QQuery 4242.51ms │                        40.48ms │     no change │
└──────────────┴───────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)39956.84ms │
│ Total Time (test_default_parquet_push_down)40673.29ms │
│ Average Time (main)929.23ms │
│ Average Time (test_default_parquet_push_down)945.89ms │
│ Queries Faster12 │
│ Queries Slower15 │
│ Queries with No Change16 │
└───────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ test_default_parquet_push_down ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 01.10ms │                         1.16ms │  1.05x slower │
│ QQuery 124.27ms │                        27.52ms │  1.13x slower │
│ QQuery 258.03ms │                        60.26ms │     no change │
│ QQuery 358.91ms │                        61.16ms │     no change │
│ QQuery 4478.56ms │                       482.47ms │     no change │
│ QQuery 5549.38ms │                       551.00ms │     no change │
│ QQuery 61.18ms │                         1.25ms │  1.06x slower │
│ QQuery 740.05ms │                        44.17ms │  1.10x slower │
│ QQuery 8645.79ms │                       676.51ms │     no change │
│ QQuery 9671.10ms │                       692.69ms │     no change │
│ QQuery 10133.89ms │                       149.79ms │  1.12x slower │
│ QQuery 11159.81ms │                       173.48ms │  1.09x slower │
│ QQuery 12561.74ms │                       619.06ms │  1.10x slower │
│ QQuery 13750.14ms │                       897.86ms │  1.20x slower │
│ QQuery 14525.03ms │                       650.81ms │  1.24x slower │
│ QQuery 15553.88ms │                       579.71ms │     no change │
│ QQuery 161417.77ms │                      1388.65ms │     no change │
│ QQuery 171104.70ms │                      1150.01ms │     no change │
│ QQuery 183037.46ms │                      2768.37ms │ +1.10x faster │
│ QQuery 1945.81ms │                        53.35ms │  1.16x slower │
│ QQuery 20733.71ms │                       717.24ms │     no change │
│ QQuery 21789.70ms │                       728.06ms │ +1.08x faster │
│ QQuery 221299.84ms │                      1232.23ms │ +1.05x faster │
│ QQuery 233952.24ms │                      3089.17ms │ +1.28x faster │
│ QQuery 24273.40ms │                       365.63ms │  1.34x slower │
│ QQuery 25274.14ms │                       306.09ms │  1.12x slower │
│ QQuery 26320.12ms │                       409.19ms │  1.28x slower │
│ QQuery 27900.06ms │                      1278.68ms │  1.42x slower │
│ QQuery 287812.82ms │                      8521.98ms │  1.09x slower │
│ QQuery 29390.07ms │                       389.67ms │     no change │
│ QQuery 30420.68ms │                       654.12ms │  1.55x slower │
│ QQuery 31571.58ms │                       784.18ms │  1.37x slower │
│ QQuery 322585.80ms │                      2953.12ms │  1.14x slower │
│ QQuery 332621.59ms │                      2755.04ms │  1.05x slower │
│ QQuery 343144.83ms │                      3459.67ms │  1.10x slower │
│ QQuery 35855.71ms │                       885.68ms │     no change │
│ QQuery 3680.10ms │                        83.86ms │     no change │
│ QQuery 3735.71ms │                        39.59ms │  1.11x slower │
│ QQuery 3879.06ms │                        80.23ms │     no change │
│ QQuery 39123.78ms │                       124.36ms │     no change │
│ QQuery 4029.03ms │                        32.89ms │  1.13x slower │
│ QQuery 4127.67ms │                        30.26ms │  1.09x slower │
│ QQuery 4225.98ms │                        28.83ms │  1.11x slower │
└──────────────┴───────────┴────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                             ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)38166.21ms │
│ Total Time (test_default_parquet_push_down)39979.07ms │
│ Average Time (main)887.59ms │
│ Average Time (test_default_parquet_push_down)929.75ms │
│ Queries Faster4 │
│ Queries Slower24 │
│ Queries with No Change15 │
└───────────────────────────────────────────────┴────────────┘

@zhuqi-lucas
Copy link
Contributor

I profiled this command:

./datafusion-cli-filter-pushdown -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM hits WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"

And I see that about 15% of the time is being spent "skipping" records -- aka evaluating the RowSelection in the parquet reader Image

This is likely the same observation that lead @XiangpengHao to propose adding bitmask support in

I need to think about what the right next steps are

I am trying to pick up this PR #6624 , and to apply and then to see the performance.

@alamb
Copy link
Contributor Author

alamb commented Apr 28, 2025

Thanks @zhuqi-lucas -- I am back and plan to spend time on this project this week

@zhuqi-lucas
Copy link
Contributor

Thank you @alamb , welcome back!

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Apr 29, 2025

I created a draft poc of bitmap/range enum select PR, it's not ready for review and try to make all clickbench no regression, but it's independent of page cache PR.

#7454

@alamb
Copy link
Contributor Author

alamb commented Apr 30, 2025

Since our benchmark doesn't seem to track the actual performance of ClickBench in datafusion I am going to try and figure out why:

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented May 5, 2025

@alamb @XiangpengHao
Updated the polish PR with new commit:

d26de88

I found some of the regression comes from the page cache missing, so it will cause more time to decode page even we enable page cache, for example our default batch size for the clickbench is 8192, in Q 27 clickbench benchmark result, it will cause more than 20% page cache missing due to some batch > one page size , with above commit, it's performance will not have regression.

The original flamegraph for q27 clickbench:

#7454 (comment)

Explanation details:

Most cases:

Assumption & observation: each page consists multiple batches.
Then our pipeline looks like this:
Load Page 1
Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1
Load batch 2 -> evaluate predicates -> filter 2 -> load & emit batch 2
Load batch 3 -> evaluate predicates -> filter 3 -> load & emit batch 3
Load Page 2
Load batch 4 -> evaluate predicates -> filter 4 -> load & emit batch 4
Load batch 5 -> evaluate predicates -> filter 5 -> load & emit batch 5


But some cases:
Load Page1
Load Page2
Load Page3
Load batch 1 -> evaluate predicates -> filter 1 -> load & emit batch 1


When we load Page2, the cache will update to Page2, and next time we access the Page1, it will miss.

@XiangpengHao
Copy link
Contributor

so it will cause more time to decode page even we enable page cache, for example our default batch size for the clickbench is 8192, in Q 27 clickbench benchmark result, it will cause more than 20% page cache missing due to some batch > one page size

This is a really nice finding, maybe we should cache 2 data pages instead of one?

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented May 6, 2025

Thank you @XiangpengHao for double check, i tried 2 data pages, it's still regression, and 3 data pages it's ok, 3 pages caching is the smallest for Q27 with no regression, so i add 3 data pages to the above commit. I think it's enough for most cases.

@alamb
Copy link
Contributor Author

alamb commented May 6, 2025

Thank you @XiangpengHao for double check, i tried 2 data pages, it's still regression, and 3 data pages it's ok, 3 pages caching is the smallest for Q27 with no regression, so i add 3 data pages to the above commit. I think it's enough for most cases.

Maybe we can change the cache so it is based on row count (aka caches enough pages for the total output record batch) rather than some fixed number

@zhuqi-lucas
Copy link
Contributor

Thank you @alamb for the suggestion, i agree, this is the perfect solution, but it seems no row count information available for the parquet default V1 version page it only has num_values, i will investigate if we can do it.

#[derive(Clone)]
pub enum Page {
    /// Data page Parquet format v1.
    DataPage {
        /// The underlying data buffer
        buf: Bytes,
        /// Number of values in this page
        num_values: u32,
        /// Encoding for values in this page
        encoding: Encoding,
        /// Definition level encoding
        def_level_encoding: Encoding,
        /// Repetition level encoding
        rep_level_encoding: Encoding,
        /// Optional statistics for this page
        statistics: Option<Statistics>,
    },
    /// Data page Parquet format v2.
    DataPageV2 {
        /// The underlying data buffer
        buf: Bytes,
        /// Number of values in this page
        num_values: u32,
        /// Encoding for values in this page
        encoding: Encoding,
        /// Number of null values in this page
        num_nulls: u32,
        /// Number of rows in this page
        num_rows: u32,
        /// Length of definition levels
        def_levels_byte_len: u32,
        /// Length of repetition levels
        rep_levels_byte_len: u32,
        /// Is this page compressed
        is_compressed: bool,
        /// Optional statistics for this page
        statistics: Option<Statistics>,
    },
    /// Dictionary page.
    DictionaryPage {
        /// The underlying data buffer
        buf: Bytes,
        /// Number of values in this page
        num_values: u32,
        /// Encoding for values in this page
        encoding: Encoding,
        /// Is dictionary page sorted
        is_sorted: bool,
    },
}

@alamb
Copy link
Contributor Author

alamb commented May 7, 2025

I Isn't num_values enough? You could cache pages until you have read the target_batch_size number of pages 🤔

@zhuqi-lucas
Copy link
Contributor

@alamb Let me try num_values, i can see all page types have this field, if it works, it's good!

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented May 8, 2025

I tried it now, and it seems some regression from testing result, the 3 page caches have better performance, i am not sure why, need to further investigate...

commit 3321f73128022e2868af91a7fe87799a2a0b06ad
Author: zhuqi-lucas <821684824@qq.com>
Date:   Thu May 8 14:42:27 2025 +0800

    Add dynamic page cache based on target batch size

diff --git a/parquet/src/arrow/async_reader/arrow_reader.rs b/parquet/src/arrow/async_reader/arrow_reader.rs
index 92e585756..4cfec5552 100644
--- a/parquet/src/arrow/async_reader/arrow_reader.rs
+++ b/parquet/src/arrow/async_reader/arrow_reader.rs
@@ -19,7 +19,6 @@ use std::collections::hash_map::Entry;
 use std::collections::HashMap;
 use std::sync::{Mutex, MutexGuard};
 use std::{collections::VecDeque, sync::Arc};
-
 use arrow_array::ArrayRef;
 use arrow_array::{cast::AsArray, Array, RecordBatch, RecordBatchReader};
 use arrow_schema::{ArrowError, DataType, Schema, SchemaRef};
@@ -220,53 +219,95 @@ impl RecordBatchReader for FilteredParquetRecordBatchReader {

 struct CachedPage {
     dict: Option<(usize, Page)>, // page offset -> page
-    data: Option<(usize, Page)>, // page offset -> page
+    data: VecDeque<(usize, Page)>, // page offset -> page, use dynamic pages according to the batch size
+    total_data_rows:   usize,
 }

 struct PredicatePageCacheInner {
     pages: HashMap<usize, CachedPage>, // col_id (Parquet's leaf column index) -> CachedPage
+    /// How many rows the reader is currently asking for per batch
+    batch_size: usize,
 }

 impl PredicatePageCacheInner {
     pub(crate) fn get_page(&self, col_id: usize, offset: usize) -> Option<Page> {
         self.pages.get(&col_id).and_then(|pages| {
-            pages
-                .dict
-                .iter()
-                .chain(pages.data.iter())
-                .find(|(page_offset, _)| *page_offset == offset)
-                .map(|(_, page)| page.clone())
+
+            if let Some((off, page)) = &pages.dict {
+                if *off == offset {
+                    return Some(page.clone());
+                }
+            }
+
+            pages.data.iter().find(|(off, _)| *off == offset).map(|(_, page)| page.clone())
         })
     }

-    /// Insert a page into the cache.
-    /// Inserting a page will override the existing page, if any.
-    /// This is because we only need to cache 2 pages per column, see below.
+    /// Insert or refresh a page in the per-column cache.
+    ///
+    /// This cache maintains:
+    /// 1. **One dictionary page** (`PageType::DICTIONARY_PAGE`), replacing any prior dict page.
+    /// 2. **A dynamic set of data pages**, each tracked by its file offset and row count,
+    ///    stored in a FIFO `VecDeque<(offset, Page)>`.
+    ///
+    /// As you process batches across multiple pages, we must keep every page
+    /// needed to emit one full batch. After inserting a new data page, we:
+    /// 1. Increase `total_data_rows` by that page’s `num_values()`.
+    /// 2. **Evict the oldest page** only while:
+    ///    - Removing it still leaves **at least** `batch_size` rows in cache, **and**
+    ///    - More than **one** data page remains.
+    ///
+    /// This policy ensures:
+    /// - **Complete coverage**: all pages required to serve one batch stay cached.
+    /// - **Memory boundedness**: total cached rows ≥ `batch_size`, never far above it.
+    /// - **At least one page retained**: even if a single page exceeds `batch_size`.
+    ///
+    /// # Parameters
+    /// - `col_id`: Parquet leaf column index, used as the cache key.
+    /// - `offset`: File offset for this page (unique identifier).
+    /// - `page`:   The decoded `Page` to insert or refresh.
     pub(crate) fn insert_page(&mut self, col_id: usize, offset: usize, page: Page) {
         let is_dict = page.page_type() == PageType::DICTIONARY_PAGE;
+        let rows_in_page = page.num_values() as usize;

-        let cached_pages = self.pages.entry(col_id);
-        match cached_pages {
-            Entry::Occupied(mut entry) => {
+        match self.pages.entry(col_id) {
+            Entry::Occupied(mut occ) => {
+                let cp = occ.get_mut();
                 if is_dict {
-                    entry.get_mut().dict = Some((offset, page));
+                    // refresh dictionary page
+                    cp.dict = Some((offset, page));
                 } else {
-                    entry.get_mut().data = Some((offset, page));
+                    // add new data page
+                    cp.data.push_back((offset, page.clone()));
+                    cp.total_data_rows += rows_in_page;
+
+                    // evict only while the oldest page is not needed to cover batch_size
+                    while cp.data.len() > 1 {
+                        // look at the front (oldest) page’s row count
+                        let &(_, ref oldest) = cp.data.front().unwrap();
+                        let oldest_rows = oldest.num_values() as usize;
+                        // if removing it still leaves enough rows for one batch, pop it
+                        if cp.total_data_rows - oldest_rows >= self.batch_size {
+                            cp.data.pop_front();
+                            cp.total_data_rows -= oldest_rows;
+                        } else {
+                            break;
+                        }
+                    }
                 }
             }
-            Entry::Vacant(entry) => {
-                let cached_page = if is_dict {
-                    CachedPage {
-                        dict: Some((offset, page)),
-                        data: None,
-                    }
+
+            Entry::Vacant(vac) => {
+                // first insertion for this column
+                let mut data = VecDeque::new();
+                let dict = if is_dict {
+                    Some((offset, page))
                 } else {
-                    CachedPage {
-                        dict: None,
-                        data: Some((offset, page)),
-                    }
+                    data.push_back((offset, page.clone()));
+                    None
                 };
-                entry.insert(cached_page);
+                let total_data_rows = if is_dict { 0 } else { rows_in_page };
+                vac.insert(CachedPage { dict, data, total_data_rows });
             }
         }
     }
@@ -328,10 +369,11 @@ pub(crate) struct PredicatePageCache {
 }

 impl PredicatePageCache {
-    pub(crate) fn new(capacity: usize) -> Self {
+    pub(crate) fn new(capacity: usize, batch_size: usize) -> Self {
         Self {
             inner: Mutex::new(PredicatePageCacheInner {
                 pages: HashMap::with_capacity(capacity),
+                batch_size
             }),
         }
     }
@@ -513,7 +555,7 @@ mod tests {
     fn test_predicate_page_cache_basic_operations() {
         use super::*;

-        let cache = PredicatePageCache::new(2);
+        let cache = PredicatePageCache::new(2, 8192);
         let page1 = Page::dummy_page(PageType::DATA_PAGE, 100);
         let page2 = Page::dummy_page(PageType::DICTIONARY_PAGE, 200);

@@ -538,7 +580,7 @@ mod tests {
     fn test_predicate_page_cache_replacement() {
         use super::*;

-        let cache = PredicatePageCache::new(2);
+        let cache = PredicatePageCache::new(2, 8192);
         let data_page1 = Page::dummy_page(PageType::DATA_PAGE, 100);
         let data_page2 = Page::dummy_page(PageType::DATA_PAGE_V2, 200);

@@ -556,7 +598,7 @@ mod tests {
     fn test_predicate_page_cache_multiple_columns() {
         use super::*;

-        let cache = PredicatePageCache::new(2);
+        let cache = PredicatePageCache::new(2, 8192);
         let page1 = Page::dummy_page(PageType::DATA_PAGE, 100);
         let page2 = Page::dummy_page(PageType::DATA_PAGE_V2, 200);

diff --git a/parquet/src/arrow/async_reader/mod.rs b/parquet/src/arrow/async_reader/mod.rs
index a034e927f..3f2936826 100644
--- a/parquet/src/arrow/async_reader/mod.rs
+++ b/parquet/src/arrow/async_reader/mod.rs
@@ -599,6 +599,7 @@ where
             self.metadata.as_ref(),
             offset_index,
             projection_to_cache,
+            batch_size
         );

         let mut selection =
@@ -902,6 +903,7 @@ impl<'a> InMemoryRowGroup<'a> {
         metadata: &'a ParquetMetaData,
         offset_index: Option<&'a [OffsetIndexMetaData]>,
         projection_to_cache: Option<ProjectionMask>,
+        batch_size: usize,
     ) -> Self {
         let rg_metadata = metadata.row_group(row_group_idx);
         let to_cache_column_cnt = projection_to_cache
@@ -920,7 +922,7 @@ impl<'a> InMemoryRowGroup<'a> {
             row_count: rg_metadata.num_rows() as usize,
             metadata,
             row_group_idx,
-            cache: Arc::new(PredicatePageCache::new(to_cache_column_cnt)),
+            cache: Arc::new(PredicatePageCache::new(to_cache_column_cnt, batch_size)),
             projection_to_cache,
         }
     }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

3 participants