# Analysis to support `SortPreservingMerge` --> `ProgressiveEval` (#15191)
Thanks for the writeup. The core idea makes sense, but I have a couple of comments on the design; in particular I think we can do better in a few ways.

### 1. Some but not all partitions overlap

As I understand it, the current implementation gives up if any two partitions are overlapping. This is fine if your target use case involves queries with no overlapping partitions, but we can do better. Let me give an example:

```sql
SELECT * FROM recent_table_1
WHERE time > now() - INTERVAL 1 DAY
UNION ALL
SELECT * FROM recent_table_2
WHERE time > now() - INTERVAL 1 DAY
UNION ALL
SELECT * FROM historic_table
WHERE time <= now() - INTERVAL 1 DAY
ORDER BY time ASC
```

Incidentally, this is the use case I am targeting. Anyway, this query would result in at least 3 partitions, two of which are overlapping. A plan that groups only the non-overlapping partitions could look like this:

```
SortPreservingMergeExec: time ASC
  ProgressiveEval: partitions=[2, 0], [1]
    UnionExec: partitions=[0, 1, 2]
      TableExec: recent_table_1
      TableExec: recent_table_2
      TableExec: historic_table
```

This would concatenate partitions 2 and 0, while partition 1 remains unchanged. A final `SortPreservingMergeExec` then merges the two remaining partitions. The "first fit" algorithm for this kind of grouping has actually already been implemented, and I believe this change could be retrofitted onto the current design.

### 2. Scanning non-overlapping Parquet files

I see that scanning non-overlapping Parquet files is one of the target use cases. Here is the issue I am worried about:

```
> SET datafusion.execution.target_partitions=2;
0 row(s) fetched.
Elapsed 0.000 seconds.

> CREATE EXTERNAL TABLE t1 (id INT, date DATE) STORED AS PARQUET LOCATION './data/' PARTITIONED BY (date) WITH ORDER (id ASC);
0 row(s) fetched.
Elapsed 0.002 seconds.

> INSERT INTO t1 VALUES (4, '2025-03-01'), (3, '2025-3-02'), (2, '2025-03-03'), (1, '2025-03-04');
+-------+
| count |
+-------+
| 4     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

> EXPLAIN SELECT * FROM t1 ORDER BY id ASC;
+---------------+--------------------------------------------------------------------------+
| plan_type     | plan                                                                     |
+---------------+--------------------------------------------------------------------------+
| logical_plan  | Sort: t1.id ASC NULLS LAST                                               |
|               |   TableScan: t1 projection=[id, date]                                    |
| physical_plan | SortPreservingMergeExec: [id@0 ASC NULLS LAST]                           |
|               |   SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[true]     |
|               |   DataSourceExec: file_groups={2 groups: [[./data/date=2025-03-01/9nxILoicy2uUAt7r.parquet, ./data/date=2025-03-02/9nxILoicy2uUAt7r.parquet], [./data/date=2025-03-03/9nxILoicy2uUAt7r.parquet, ./data/date=2025-03-04/9nxILoicy2uUAt7r.parquet]]}, projection=[id, date], file_type=parquet |
+---------------+--------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

> EXPLAIN SELECT * FROM t1 WHERE date > '2025-03-02' ORDER BY id ASC;
+---------------+--------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                 |
+---------------+--------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.id ASC NULLS LAST                                                           |
|               |   TableScan: t1 projection=[id, date], full_filters=[t1.date > Date32("2025-03-02")] |
| physical_plan | SortPreservingMergeExec: [id@0 ASC NULLS LAST]                                       |
|               |   DataSourceExec: file_groups={2 groups: [[./data/date=2025-03-03/9nxILoicy2uUAt7r.parquet], [./data/date=2025-03-04/9nxILoicy2uUAt7r.parquet]]}, projection=[id, date], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet |
+---------------+--------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.
```

When there are more than 2 files in the scan, the partitions get merged in no particular order; in particular, any pre-existing ordering on `id` is lost (note that the first plan has no `output_ordering` and needs a `SortExec`). IMO, for this use case, the file-to-partition assignment needs to take the file statistics into account. I don't mean to tout my own horn too much, but in fact this exact use case is what motivated my earlier work in this area.

### Conclusion

Basically I think the design as-is is good enough to include in DataFusion, though I would like to see it generalized a bit, and I also think it may not completely solve #6672. That said, I think it will solve other problems, including optimizing queries with non-overlapping unions in them.

I apologize if I come off as a bit overbearing 😅 but this issue is near and dear to my heart. Eliminating sorts has been one of the most important things in my team's project, and it sounds like InfluxDB has been dealing with the same issue.
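The "first fit" grouping described above could be sketched like this (a standalone illustration, not the actual DataFusion/IOx implementation; the `Range` tuple representation is made up for the example):

```rust
// Hypothetical sketch of "first fit" grouping: assign each input partition
// (represented by the min/max of its sort key) to the first group whose
// current maximum is <= the partition's minimum, so every group can be
// concatenated instead of merged.

/// (partition index, min sort key, max sort key)
type Range = (usize, i64, i64);

fn first_fit(mut parts: Vec<Range>) -> Vec<Vec<usize>> {
    // Sort partitions by their minimum sort key value
    parts.sort_by_key(|&(_, min, _)| min);

    // Each group carries (partition indices, current maximum sort key)
    let mut groups: Vec<(Vec<usize>, i64)> = Vec::new();
    for (idx, min, max) in parts {
        match groups.iter_mut().find(|(_, gmax)| *gmax <= min) {
            // The partition starts where the group ends: safe to concatenate
            Some(group) => {
                group.0.push(idx);
                group.1 = max;
            }
            // Overlaps every existing group: start a new one
            None => groups.push((vec![idx], max)),
        }
    }
    groups.into_iter().map(|(idxs, _)| idxs).collect()
}

fn main() {
    // Mirrors the example above: partition 2 (historic) ends where
    // partitions 0 and 1 (recent) begin, and 0 and 1 overlap each other.
    let parts = vec![(0, 10, 20), (1, 10, 20), (2, 0, 10)];
    println!("{:?}", first_fit(parts)); // prints [[2, 0], [1]]
}
```

Each resulting group can feed one `ProgressiveEval` input, and only the (hopefully few) remaining groups need a final merge.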
This is a neat idea -- basically a cascade of progressive evals to avoid some (but not all) merging
Our progressive eval implementation is here https://github.com/influxdata/influxdb3_core/blob/26a30bf8d6e2b6b3f1dd905c4ec27e3db6e20d5f/iox_query/src/provider/progressive_eval.rs (there is a PR to add it to Datafusion here: #10490)
You can define an order for an external table using `WITH ORDER`:

```
DataFusion CLI v46.0.1
> copy (values (4, '2025-03-01')) to '/tmp/test/1.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

> copy (values (3, '2025-03-02')) to '/tmp/test/2.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

> create external table test stored as parquet location '/tmp/test' with order (column2 ASC);
0 row(s) fetched.
Elapsed 0.006 seconds.
```

Then the sort is not needed:

```
> explain select * from test order by column2 asc;
+---------------+------------------------------------------------------+
| plan_type     | plan                                                 |
+---------------+------------------------------------------------------+
| logical_plan  | Sort: test.column2 ASC NULLS LAST                    |
|               |   TableScan: test projection=[column1, column2]      |
| physical_plan | SortPreservingMergeExec: [column2@1 ASC NULLS LAST]  |
|               |   DataSourceExec: file_groups={16 groups: [[tmp/test/1.parquet:0..107], [tmp/test/2.parquet:0..107], [tmp/test/1.parquet:107..214], [tmp/test/2.parquet:107..214], [tmp/test/1.parquet:214..321], ...]}, projection=[column1, column2], output_ordering=[column2@1 ASC NULLS LAST], file_type=parquet |
+---------------+------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
```

Though DataFusion will still insert a sort when it is needed:

```
> explain select * from test order by column2 desc;
+---------------+------------------------------------------------------+
| plan_type     | plan                                                 |
+---------------+------------------------------------------------------+
| logical_plan  | Sort: test.column2 DESC NULLS FIRST                  |
|               |   TableScan: test projection=[column1, column2]      |
| physical_plan | SortPreservingMergeExec: [column2@1 DESC]            |
|               |   SortExec: expr=[column2@1 DESC], preserve_partitioning=[true] |
|               |   DataSourceExec: file_groups={16 groups: [[tmp/test/1.parquet:0..107], [tmp/test/2.parquet:0..107], [tmp/test/1.parquet:107..214], [tmp/test/2.parquet:107..214], [tmp/test/1.parquet:214..321], ...]}, projection=[column1, column2], output_ordering=[column2@1 ASC NULLS LAST], file_type=parquet |
+---------------+------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.012 seconds.
```
Not at all! Yes, we have done a lot of this (as have @berkaysynnada, @ozankabak, and @akurmustafa at Synnada). It is a very important optimization.
I agree -- in my mind this is all related. When trying to take maximum advantage of pre-existing orderings, I do think the optimizer should be more careful.
AFAICT that works because the file paths happen to sort in the same order as the data. For a more realistic example where the table order and object store path order don't match, consider a horizontally partitioned table, something like this:

```
DataFusion CLI v46.0.0
> SET datafusion.execution.target_partitions=2;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE EXTERNAL TABLE t1 (time TIMESTAMP, date DATE, shard INT) STORED AS PARQUET LOCATION '/tmp/data/' PARTITIONED BY (date, shard) WITH ORDER (time ASC);
0 row(s) fetched.
Elapsed 0.003 seconds.

> INSERT INTO t1 VALUES
    ('2025-03-01 00:00:01', '2025-03-01', 0),
    ('2025-03-01 00:00:00', '2025-03-01', 1),
    ('2025-03-02 00:00:00', '2025-03-02', 0),
    ('2025-03-02 00:00:02', '2025-03-02', 1);
+-------+
| count |
+-------+
| 4     |
+-------+
1 row(s) fetched.
Elapsed 0.011 seconds.

> EXPLAIN SELECT * FROM t1 ORDER BY time ASC;
+---------------+------------------------------------------------------------------------+
| plan_type     | plan                                                                   |
+---------------+------------------------------------------------------------------------+
| logical_plan  | Sort: t1.time ASC NULLS LAST                                           |
|               |   TableScan: t1 projection=[time, date, shard]                         |
| physical_plan | SortPreservingMergeExec: [time@0 ASC NULLS LAST]                       |
|               |   SortExec: expr=[time@0 ASC NULLS LAST], preserve_partitioning=[true] |
|               |   DataSourceExec: file_groups={2 groups: [[tmp/data/date=2025-03-01/shard=0/8eTZY2WyyhnV7Klv.parquet, tmp/data/date=2025-03-01/shard=1/8eTZY2WyyhnV7Klv.parquet], [tmp/data/date=2025-03-02/shard=0/8eTZY2WyyhnV7Klv.parquet, tmp/data/date=2025-03-02/shard=1/8eTZY2WyyhnV7Klv.parquet]]}, projection=[time, date, shard], file_type=parquet |
+---------------+------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.001 seconds.
```

Sorting by the object store paths gives us a lexsort by `(date, shard)`, which does not match the `time` ordering (note the `SortExec` in the plan above).
The only reason the sort is not needed in your example is because there are fewer files than `target_partitions`:

```
> SET datafusion.execution.target_partitions=1;
0 row(s) fetched.
Elapsed 0.000 seconds.

> EXPLAIN SELECT * FROM test ORDER BY column2 ASC;
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                  |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: test.column2 ASC NULLS LAST                                                                                                     |
|               |   TableScan: test projection=[column1, column2]                                                                                       |
| physical_plan | SortExec: expr=[column2@1 ASC NULLS LAST], preserve_partitioning=[false]                                                              |
|               |   DataSourceExec: file_groups={1 group: [[tmp/test/1.parquet, tmp/test/2.parquet]]}, projection=[column1, column2], file_type=parquet |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------+
```
I reread the codebase, and I also think so. `FileScanConfig::split_groups_by_statistics` can definitely solve the problem; then we can remove the unnecessary sorts. One question: is there something that makes it difficult to turn it on by default?
@xudong963 see #10336; that flag groups Parquet scanning into a single partition.
Got it, thank you.
Hi @alamb @wiedld, how's it going? Can I do something to help?
I left my full answer on that issue as I don't want to take over this issue too much, but TL;DR we need benchmarks for tables with large numbers of files.
We are a bit blocked on some of the overlap analysis -- I am going to try and pitch in and see if we can push something forward.
@alamb FYI: here is a POC: xudong963#4. The PR glues three PRs together.
@wiedld - can you please review the above PRs and work with @xudong963 to make sure our implementations are aligned?
@wiedld has created a PR with a version of the lexical range PR for review |
This is a potential design to support "`SortPreservingMerge` that doesn't actually compare sort keys if the key ranges are ordered" (#10316). It is largely copy/paste from an internal design I wrote for a project at InfluxData. We are planning to propose upstreaming what we do, and @wiedld is working on the `ProgressiveEval` operator (#10490). I purposely wrote it in markdown to make it easier to copy/paste the diagrams and explanation into code.

## Background

📖 The following description uses the DataFusion definition of a partition (how data is divided across files), not the IOx one.
## What is `SortPreservingMerge`?

`SortPreservingMerge` is a DataFusion operator that merges data row by row from multiple sorted input partitions. In order to produce any output rows, the `SortPreservingMerge` must open all its inputs (e.g. must open several parquet files). *(diagram omitted)*

## What is `ProgressiveEval`?

`ProgressiveEval` is a special operator; see the blog post "Making Most Recent Value Queries Hundreds of Times Faster" for more details. `ProgressiveEval` outputs its inputs in order, one after the other. Note that `ProgressiveEval` only starts [2 (configurable)] inputs at a time. *(diagram omitted)*

## Why is `ProgressiveEval` better than `SortPreservingMerge`?

When possible, `ProgressiveEval` should be used instead of `SortPreservingMerge` because, among other reasons, it works better with a `limit` (as it does not start all the input streams at once).

## Under what conditions can `SortPreservingMerge` be converted to `ProgressiveEval`?

In order to convert a `SortPreservingMerge` (SPM) to `ProgressiveEval`, the plans must still produce the same results. We know all input partitions to the SPM are sorted on the sort expressions (this is required for correctness), and the output of the SPM will also be sorted on these expressions.

We define the "Lexical Space" as the space of all possible values of the sort expressions. For example, given data with a sort order of `A ASC, B ASC` (`A` ascending, `B` ascending), the lexical space is all the unique combinations of `(A, B)`. The "range" of an input in this lexical space is the minimum and maximum sort key values. For example, for data with columns `a` and `b`, a lexical range is `min --> max`: `(1,100) --> (3,50)`.

Using a `ProgressiveEval` instead of a `SortPreservingMerge` requires that the lexical ranges of the input partitions do not overlap and that the partitions are arranged in lexical order. When this is the case, concatenating such partitions together results in the same output as a sorted stream, and thus the output of `ProgressiveEval` and `SortPreservingMerge` are the same.

### Example: `ProgressiveEval` can be used instead of a `SortPreservingMerge`

In the following example, the input streams have non-overlapping lexical ranges, in order, and thus `SortPreservingMerge` and `ProgressiveEval` produce the same output:

```
(1,100) --> (2,200)
(2,200) --> (2,200)
(2,300) --> (3,100)
```
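As a concrete illustration of this condition (a standalone sketch, not DataFusion code): Rust tuples compare lexicographically, which matches the lexical space described above, so the check can be written directly over `(min, max)` tuples.

```rust
// Standalone sketch: a lexical range for a two-column sort key `(a, b)`
// is a (min, max) pair of tuples. Tuples compare lexicographically in Rust,
// mirroring the lexical-space ordering described above.

type Key = (i64, i64);

/// True if every range ends at or before the next one begins, so that
/// simple concatenation of the inputs yields sorted output.
fn ordered_and_non_overlapping(ranges: &[(Key, Key)]) -> bool {
    ranges.windows(2).all(|w| w[0].1 <= w[1].0)
}

fn main() {
    // The example ranges from above: in order and non-overlapping
    let good = [((1, 100), (2, 200)), ((2, 200), (2, 200)), ((2, 300), (3, 100))];
    assert!(ordered_and_non_overlapping(&good));

    // The same ranges out of order fail the check
    let out_of_order = [((2, 300), (3, 100)), ((1, 100), (2, 200))];
    assert!(!ordered_and_non_overlapping(&out_of_order));
}
```

Note that touching boundaries (one range ending exactly where the next begins, as in the example) are allowed: equal keys end up adjacent in the concatenated stream, which is still a valid sort order.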
### Counter Example 1: Out-of-order partitions

In the following example, the input partitions still have non-overlapping lexical ranges, but they are NOT in order. Therefore the output of the `ProgressiveEval` (which concatenates the streams) is different from that of `SortPreservingMerge`, and thus in this case we can NOT use `ProgressiveEvalExec`. *(diagram omitted)*

### Counter Example 2: Overlapping partitions

When we have partitions that do overlap in lexical ranges, it is even more clear that the output of the two operators is different: when `ProgressiveEval` appends the input streams together, they will not be sorted. *(figure omitted)*

## Proposed Algorithm
### Step 1: Find min/max values of each sort key column for each input partition to a SortPreservingMerge

We can not get this information reliably from DataFusion statistics yet. However, internally at InfluxData we have a special analysis that works for `time`. We can also use `EquivalenceProperties::constants` to determine min/max values for constants (what is needed for `OrderUnionSortedInputsForConstants`).

If we can't determine all min/max values ==> no transform.

### Step 2: Determine if the lexical spaces overlap

The algorithm converts the mins/maxes to Arrow arrays and then calls the rank kernel on them.

Details: we will need to order the inputs by minimum value and then ensure that the ranges do not overlap, i.e. that the minimum of input `i` is at least the maximum of input `j` for `i` > `j`.

If the lexical spaces overlap ==> no transform.

### Step 3: Reorder inputs, if needed and possible

If the input partitions are non-overlapping, attempt to reorder the input partitions if needed. It is possible to reorder input partitions for certain plans such as `UnionExec` and `ParquetExec`, but it is not possible to reorder the input partitions for others (like `RepartitionExec`).

If we cannot reorder the partitions into non-overlapping lexical order ==> no transform.
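Putting Steps 2 and 3 together, the decision could be sketched roughly as follows (hypothetical types and function names; the real rule would operate on `ExecutionPlan` partitions and use the Arrow rank kernel as described in Step 2, with Step 1 assumed done):

```rust
// Hypothetical end-to-end sketch: given the min/max lexical range of each
// input partition, sort partitions by minimum (Step 3's reorder), verify
// the reordered ranges do not overlap (Step 2), and return the new
// partition order, or None if the transform does not apply.

type Key = (i64, i64); // e.g. min/max of a two-column sort key

fn progressive_eval_order(ranges: &[(Key, Key)]) -> Option<Vec<usize>> {
    // Order partition indices by their range minimum
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    order.sort_by_key(|&i| ranges[i].0);

    // The reordered ranges must not overlap: each must end at or before
    // the next begins, otherwise ==> no transform
    let ok = order
        .windows(2)
        .all(|w| ranges[w[0]].1 <= ranges[w[1]].0);
    ok.then_some(order)
}

fn main() {
    // Out of order but non-overlapping: reordering makes the transform legal
    let ranges = [((2, 300), (3, 100)), ((1, 100), (2, 200))];
    assert_eq!(progressive_eval_order(&ranges), Some(vec![1, 0]));

    // Overlapping ranges: no valid order exists
    let overlapping = [((1, 100), (2, 500)), ((2, 200), (3, 100))];
    assert_eq!(progressive_eval_order(&overlapping), None);
}
```

A caller would then only apply the permutation when the input plan (e.g. `UnionExec`) actually supports reordering its partitions.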