[EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition #6672

simonvandel · 2023-06-14T20:22:33Z

Describe the bug

I'm testing performance of querying a number of Parquet files, where I can make some assumptions about the Parquet files.

Each Parquet file is already sorted on the column "timestamp".
No two Parquet files have overlapping values in the column "timestamp". For instance, file A only contains timestamps from 2022, and file B only contains timestamps from 2023.

The schema of the files is:

"timestamp": TimestampMillisecond
"value": Float64

Consider the following query and its query plan:

SELECT timestamp, value 
FROM samples 
ORDER BY timestamp ASC

+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | SortPreservingMergeExec: [timestamp@0 ASC], metrics=[output_rows=1000000, elapsed_compute=572.526968ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |   ParquetExec: file_groups={20 groups: [[0.parquet], [1.parquet], [2.parquet], [3.parquet], [4.parquet], ...]}, projection=[timestamp, value], output_ordering=[timestamp@0 ASC], metrics=[output_rows=1000000, elapsed_compute=20ns, num_predicate_creation_errors=0, predicate_evaluation_errors=0, bytes_scanned=57972, page_index_rows_filtered=0, row_groups_pruned=0, pushdown_rows_filtered=0, time_elapsed_processing=51.918935ms, page_index_eval_time=40ns, time_elapsed_scanning_total=48.94925ms, time_elapsed_opening=2.996325ms, time_elapsed_scanning_until_data=48.311008ms, pushdown_eval_time=40ns] |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The 572 milliseconds spent in SortPreservingMergeExec seem to be the bottleneck in the query, so I would like to optimize it.

Given the assumptions I can make about the Parquet files, I think that the SortPreservingMergeExec can be replaced by what is essentially a concatenation of each of the Parquet files.

What would be the best approach to remove the SortPreservingMergeExec?
My ideas:

Manually re-partition the Parquet files into a single Parquet file using this new API: https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column
I have an idea of implementing a custom PhysicalOptimizerRule that looks for the SortPreservingMergeExec ParquetExec pattern, and replaces it with a concatenation instead.

But I would like to hear if there are any better ways.
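
To illustrate why the merge looks redundant under these assumptions, here is a minimal sketch in plain Python (it does not use any DataFusion API). The three lists are made-up stand-ins for the "timestamp" column of three sorted, non-overlapping files that are already listed in ascending order. Under exactly those assumptions, a k-way merge and a plain concatenation return the same rows, but only the merge has to compare keys across inputs.

```python
import heapq
from itertools import chain

# Hypothetical stand-ins for the "timestamp" column of three Parquet files.
# Each list is sorted, and the value ranges do not overlap.
file_a = [1, 2, 3]
file_b = [4, 5, 6]
file_c = [7, 8, 9]
files = [file_a, file_b, file_c]  # assumed to be listed in ascending order


def merge_sorted(inputs):
    """K-way merge: compares keys across inputs while producing output."""
    return list(heapq.merge(*inputs))


def concatenate(inputs):
    """Emit each input back to back without comparing any keys."""
    return list(chain.from_iterable(inputs))


merged = merge_sorted(files)
concatenated = concatenate(files)
print(merged == concatenated)
```

If the files overlapped, or were listed out of order, the two results would differ and the merge would still be required.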

Infrastructure Tasks 🚧

Major Tasks

Related (though not necessarily required)


alamb · 2023-06-15T14:18:14Z

> Given the assumptions I can make about the Parquet files, I think that the SortPreservingMergeExec can be replaced by what is essentially a concatenation of each of the Parquet files.

I agree

> I have an idea of implementing a custom PhysicalOptimizerRule that looks for the SortPreservingMergeExec ParquetExec pattern, and replaces it with a concatenation instead.

Yes, I think this would work. We do some similar things in IOx (interestingly, also for the time series use case with non-overlapping time ranges).

It was implemented by @crepererum; you can see it in https://github.com/influxdata/influxdb_iox/tree/main/iox_query/src/physical_optimizer

> Manually re-partition the Parquet files into a single Parquet file using this new API: https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column

I think this is likely the solution that would be the fastest for querying because then time predicates could be used to prune out entire row groups and you would have lower file opening overhead

The downside, of course, is that you would need to rewrite the Parquet files.
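
As a rough illustration of the row group pruning idea, here is a sketch in plain Python. `RowGroupStats` and `prune_row_groups` are hypothetical names invented for this example; they are not part of the parquet crate or DataFusion, and the bounds are made up. The assumption is that each row group exposes min/max statistics for "timestamp", so a time predicate can skip every row group whose range cannot match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical summary of one row group's "timestamp" statistics."""
    index: int
    min_ts: int  # inclusive lower bound, e.g. milliseconds since epoch
    max_ts: int  # inclusive upper bound


def prune_row_groups(row_groups, lo, hi):
    """Return indexes of row groups whose range can intersect [lo, hi]."""
    return [
        rg.index
        for rg in row_groups
        if rg.max_ts >= lo and rg.min_ts <= hi
    ]


# Made-up statistics for a single file with three sorted row groups.
row_groups = [
    RowGroupStats(0, 0, 999),
    RowGroupStats(1, 1_000, 1_999),
    RowGroupStats(2, 2_000, 2_999),
]

# Equivalent of: WHERE timestamp BETWEEN 1500 AND 1800
selected = prune_row_groups(row_groups, 1_500, 1_800)
print(selected)
```

Because the data is sorted on "timestamp", the row group ranges are narrow and disjoint, which is what makes this kind of pruning effective.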

alamb · 2023-06-15T14:18:58Z

I am marking this as a question as I am not sure it is really a bug -- though please let me know if you disagree

simonvandel · 2023-06-15T20:01:26Z

> I think this is likely the solution that would be the fastest for querying because then time predicates could be used to prune out entire row groups and you would have lower file opening overhead

Thanks, I'll try this.

> I am marking this as a question as I am not sure it is really a bug -- though please let me know if you disagree

My bad, it was a question.

Although one could argue it is also a feature request for an inbuilt optimization that removes sorts if it can detect non-overlaps using either hints or directly looking at min/max statistics on inputs.
Do you think that is reasonable, or is it too specific for just my use case?
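
A minimal sketch of what such a check could look like, in plain Python and independent of DataFusion. The function name `non_overlapping_order` and the example statistics are assumptions made for illustration; the only input is a (min, max) pair per file for the sort column, such as the values stored in Parquet statistics.

```python
def non_overlapping_order(stats):
    """Return file names in ascending key order, or None if ranges overlap.

    `stats` maps a file name to its (min, max) for the sort column.
    """
    ordered = sorted(stats.items(), key=lambda item: item[1][0])
    for (_, prev), (_, nxt) in zip(ordered, ordered[1:]):
        # Treat a shared boundary value as an overlap to stay conservative.
        if prev[1] >= nxt[0]:
            return None
    return [name for name, _ in ordered]


# Made-up min/max values for three yearly files.
yearly = {
    "2023.parquet": (200, 299),
    "2022.parquet": (100, 199),
    "2024.parquet": (300, 399),
}
print(non_overlapping_order(yearly))
```

When the function returns an order, the files can be read back to back in that order and no sort or merge is needed; when it returns None, the plan has to keep the merge.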

alamb · 2023-06-16T15:59:28Z

> Although one could argue it is also a feature request for an inbuilt optimization that removes sorts if it can detect non-overlaps using either hints or directly looking at min/max statistics on inputs.
>
> Do you think that is reasonable, or is it too specific for just my use case?

I think it is a reasonable request as having data sorted by date is so common, though the trick would be making the API reasonable and general purpose 🤔

suremarc · 2023-06-26T18:39:05Z

I have had a somewhat overlapping (no pun intended) issue where DataFusion abandons the SortPreservingMergeStream and does a global sort if there are multiple files in any file groups. It should be possible for DataFusion to realize that, if the files are non-overlapping, the file groups can be re-ordered to satisfy the required output ordering. We would be partitioning a poset of files into a series of chains, where A < B if they are non-overlapping, and every row in A goes before every row in B. Then each chain becomes one file group in the physical plan, which would be read sequentially. Using statistics and partition columns it should be possible to generate a reasonable execution plan without reading any rows.
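
One way to sketch the chain partitioning described above, in plain Python. This uses a greedy, interval-partitioning style approach chosen for this example; it is not taken from DataFusion, and `partition_into_chains` and the file tuples are hypothetical. Each file is represented only by its (min, max) statistics, so no rows are read.

```python
import heapq


def partition_into_chains(files):
    """Greedily group (name, min, max) files into non-overlapping chains.

    Within a chain, every file ends strictly before the next one starts,
    so reading a chain front to back yields sorted output as long as each
    file is itself sorted on the key.
    """
    chains = []  # chains[i] is a list of file names in read order
    heap = []    # entries are (max key of the chain's last file, chain index)
    for name, lo, hi in sorted(files, key=lambda f: (f[1], f[2])):
        if heap and heap[0][0] < lo:
            # The chain that ends earliest finishes before this file starts.
            _, idx = heapq.heappop(heap)
            chains[idx].append(name)
        else:
            # Every existing chain overlaps this file, so start a new chain.
            idx = len(chains)
            chains.append([name])
        heapq.heappush(heap, (hi, idx))
    return chains


# Made-up statistics: "c" overlaps both "a" and "b".
files = [("a", 0, 9), ("b", 10, 19), ("c", 5, 14), ("d", 20, 29)]
chains = partition_into_chains(files)
print(chains)
```

Each resulting chain would correspond to one file group. The sketch does not cap the number of chains, so a real planner would also have to reconcile the result with its target partition count.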

alamb · 2023-06-26T19:34:40Z

> I have had a somewhat overlapping (no pun intended) issue where DataFusion abandons the SortPreservingMergeStream and does a global sort if there are multiple files in any file groups. It should be possible for DataFusion to realize that, if the files are non-overlapping, the file groups can be re-ordered to satisfy the required output ordering.

Yes, that is correct -- the files within each file group are read back to back, so if a group contains multiple files, the resulting stream is not necessarily ordered even if every input file is.

> We would be partitioning a poset of files into a series of chains, where A < B if they are non-overlapping, and every row in A goes before every row in B.

Indeed, as long as the files in each output group were ordered and non-overlapping in time, the Parquet reader would not need to be changed at all.
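
A small sketch of why the order of files inside a group matters, in plain Python with made-up values. `read_group` is a hypothetical stand-in for a reader that emits the files of one group back to back; it is not DataFusion's Parquet reader.

```python
def read_group(file_group):
    """Emit the rows of each file back to back, in listing order."""
    return [row for rows in file_group for row in rows]


def is_sorted(rows):
    """True if the rows are in non-decreasing order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


# Hypothetical sorted keys for two non-overlapping files.
f2022 = [20220101, 20220615, 20221231]
f2023 = [20230101, 20230615, 20231231]

# Listing order puts the later file first, so the stream is unsorted.
as_listed = read_group([f2023, f2022])

# Ordering the group by each file's minimum key fixes the stream
# without touching the reader itself.
reordered = read_group(sorted([f2023, f2022], key=lambda rows: rows[0]))

print(is_sorted(as_listed), is_sorted(reordered))
```

Only the order of the file list changes between the two calls, which matches the point that the reader itself can stay as it is.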

alamb · 2023-11-27T20:05:45Z

FWIW @NGA-TRAN is working on something similar downstream in InfluxDB IOx

alamb · 2024-04-30T17:36:32Z

> FWIW @NGA-TRAN is working on something similar downstream in InfluxDB IOx

Follow up: Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316

alamb · 2025-04-01T18:04:59Z

There has been quite a bit of work recently related to optimizing this kind of query. Since this ticket lays out the problem and request so well, I am going to start collecting the work needed to complete the optimization as tasks on this issue.

@xudong963 has a POC for this work

POC: Optimize SortPreservingMergeExec to avoid merging non-overlapping partitions xudong963/arrow-datafusion#4

There is a related discord thread here:

https://discord.com/channels/885562378132000778/1356122416258220114/1356122427591098378

simonvandel added the bug Something isn't working label Jun 14, 2023

alamb added question Further information is requested and removed bug Something isn't working labels Jun 15, 2023

trueleo mentioned this issue Jun 25, 2023

Datafusion Optimizations and Integrations parseablehq/parseable#447

Closed

5 tasks

suremarc mentioned this issue Sep 7, 2023

Use file statistics in query planning to avoid sorting when unecessary #7490

Closed

alamb mentioned this issue Nov 15, 2023

Epic: Statistics improvements #8227

Open

20 tasks

NGA-TRAN mentioned this issue Mar 21, 2024

feat: Determine ordering of file groups #9593

Merged

This was referenced Apr 30, 2024

[Epic] A Collection of Sort Based Optimizations #10313

Open

Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316

Open

edmondop mentioned this issue Jan 21, 2025

EPIC: Statistics improvements edmondop/arrow-datafusion#5

Open

alamb mentioned this issue Mar 12, 2025

Analysis to supportSortPreservingMerge --> ProgressiveEval #15191

Open

alamb added enhancement New feature or request performance Make DataFusion faster labels Apr 1, 2025

alamb mentioned this issue Apr 1, 2025

POC: Optimize SortPreservingMergeExec to avoid merging non-overlapping partitions xudong963/arrow-datafusion#4

Open

alamb changed the title ~~Optimization: Avoid sort for already sorted Parquet files that do not overlap values on condition~~ [EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition Apr 1, 2025

alamb mentioned this issue Apr 11, 2025

ListingTable statistics improperly merges statistics when files have different schemas #15689

Closed

xudong963 mentioned this issue Apr 29, 2025

[DISCUSSION] DataFusion Road Map: Q3-Q4 2025 #15878

Open
