[EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition #6672
Comments
I agree
Yes, I think this would work. We do some similar things in IOx (interestingly, also for the time-series use case with non-overlapping time ranges). It was implemented by @crepererum, and you can see it in https://github.com/influxdata/influxdb_iox/tree/main/iox_query/src/physical_optimizer
I think this is likely the solution that would be the fastest for querying, because time predicates could then be used to prune out entire row groups and you would have lower file-opening overhead. The downside, of course, is that you would need to rewrite the Parquet files.
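To illustrate the pruning idea, here is a minimal sketch assuming each row group exposes min/max statistics for the time column. `RowGroupStats` and `prune_row_groups` are hypothetical names used only for illustration, not DataFusion or parquet crate APIs.

```rust
/// Min/max statistics for the time column of one row group.
/// Hypothetical struct for illustration, not a DataFusion or parquet type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupStats {
    min_time: i64,
    max_time: i64,
}

/// Returns the indices of row groups that may contain rows with
/// `lo <= time <= hi`. Row groups whose range lies entirely outside
/// the predicate range are skipped without being read.
fn prune_row_groups(stats: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max_time >= lo && s.min_time <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

When the files are sorted and non-overlapping on time, a narrow time predicate matches only a few row groups, which is where the speedup would come from.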
I am marking this as a question as I am not sure it is really a bug -- though please let me know if you disagree
Thanks, I'll try this.
My bad, it was a question. Although one could argue it is also a feature request for a built-in optimization that removes sorts when it can detect non-overlapping inputs, using either hints or the min/max statistics of the inputs.
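A minimal sketch of such a non-overlap check, assuming per-file min/max statistics for the sort column are available and exact. `FileRange` and `concat_order` are hypothetical names, not part of DataFusion.

```rust
/// Min/max of the sort column for one file.
/// Hypothetical struct for illustration only.
#[derive(Debug, Clone, Copy)]
struct FileRange {
    min: i64,
    max: i64,
}

/// If the files' key ranges do not overlap, returns the file indices in the
/// order they could be read back to back to produce ascending output.
/// Returns `None` when any two ranges overlap, in which case a merge is
/// still required.
fn concat_order(files: &[FileRange]) -> Option<Vec<usize>> {
    let mut order: Vec<usize> = (0..files.len()).collect();
    // Sort by (min, max) so that ties on `min` are resolved consistently.
    order.sort_by_key(|&i| (files[i].min, files[i].max));
    for pair in order.windows(2) {
        let (prev, next) = (files[pair[0]], files[pair[1]]);
        // Equal boundary values are fine for a non-strict ascending order.
        if prev.max > next.min {
            return None;
        }
    }
    Some(order)
}
```

The check is only as trustworthy as the statistics: with inexact min/max values, the sort could not be removed safely.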
I think it is a reasonable request, as having data sorted by date is so common, though the trick would be making the API reasonable and general purpose 🤔
I have had a somewhat overlapping (no pun intended) issue where DataFusion abandons the
Yes, that is correct -- each partition stream from the parquet reader is produced back to back, so if there are multiple files, the resulting stream is not ordered even if all the input files were.
Indeed, as long as each output group was ordered in non-overlapping time, the parquet reader would not need to be changed at all
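One way to build such groups can be sketched as a greedy assignment over files sorted by their minimum key. This is an illustration under the assumption of exact per-file min/max statistics; `FileRange` and `group_non_overlapping` are hypothetical names, and this is not claimed to be how DataFusion groups files.

```rust
/// Min/max of the sort column for one file.
/// Hypothetical struct for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Greedily assigns files to groups so that, within each group, files are in
/// ascending order and their key ranges do not overlap. Reading each group's
/// files back to back then yields an ordered stream per group.
fn group_non_overlapping(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    files.sort_by_key(|f| (f.min, f.max));
    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // Find the first group whose last file ends at or before this file starts.
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max <= file.min));
        match slot {
            Some(i) => groups[i].push(file),
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

If no files overlap at all, everything lands in a single group and no merge across groups is needed; the more the ranges overlap, the more groups (and merge inputs) remain.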
FWIW @NGA-TRAN is working on something similar downstream in InfluxDB IOx
There has been quite a bit of recent work on optimizing this kind of query. Since this ticket lays out the problem and request so well, I am going to start collecting the work needed to complete the optimization as tasks on this issue.
@xudong963 has a POC for this work.
There is a related Discord thread here:
Describe the bug
I'm testing performance of querying a number of Parquet files, where I can make some assumptions about the Parquet files.
The schema of the files is:
Consider the following query and its query plan:
The 572 milliseconds on the SortPreservingMergeExec seems to be the bottleneck in the query, so I would like to optimize it.
Given the assumptions I can make about the Parquet files, I think that the SortPreservingMergeExec can be replaced by what is essentially a concatenation of each of the Parquet files.
What would be the best approach to remove the SortPreservingMergeExec?
My ideas:
PhysicalOptimizerRule that looks for the SortPreservingMergeExec ParquetExec pattern, and replaces it with a concatenation instead.
But I would like to hear if there are any better ways.
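The core of the idea, replacing a merge with a concatenation, can be sketched independently of DataFusion's plan types. The function below is a hypothetical illustration that works on in-memory sorted runs of `i64` keys, and assumes each run is already sorted internally; it is not an implementation of a `PhysicalOptimizerRule`.

```rust
/// Concatenates sorted runs in order of their first key. Returns `None` if
/// any two runs overlap, in which case a real merge would still be needed.
/// Assumes every run is already sorted in ascending order.
fn concat_sorted_runs(mut runs: Vec<Vec<i64>>) -> Option<Vec<i64>> {
    runs.retain(|r| !r.is_empty());
    // Order runs by (first key, last key); no per-row comparisons are needed.
    runs.sort_by_key(|r| (r[0], r[r.len() - 1]));
    let mut out: Vec<i64> = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    for run in runs {
        if let Some(&last) = out.last() {
            if last > run[0] {
                return None; // ranges overlap, concatenation would be unsorted
            }
        }
        out.extend(run);
    }
    Some(out)
}
```

The result equals what a sort-preserving merge would produce for the same inputs, but the cost is one comparison per run boundary instead of a comparison per row.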
Related
Infrastructure Tasks 🚧
- merge for Distribution #15290
- FileGroup structure for Vec<PartitionedFile> #15379
- SortPreservingMerge --> ProgressiveEval #15191
- PartitionedFile and FileGroup statistics should be inexact/recomputed #15539
Major Tasks
- statistics_by_partition API to ExecutionPlan #15495
- SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316
Related (though not necessarily required)
- split_file_groups_by_statistics by default #10336
- ProgressiveEval operator for optimize SortPreservingMerge #10488