Skip to content

WIP: User defined sorting #15106

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from
Draft

Conversation

tobixdev
Copy link
Contributor

@tobixdev tobixdev commented Mar 9, 2025

Related to: #14828, #12644, #14247 and all tests most probably won't pass (the new test mentioned below passes on my machine though).

This PR is a rough proposal for an implementation of user-defined sorting. The main goal is to get a discussion starting whether this is a direction we want to go.

Some design considerations that help understand the PR:

  • PhysicalSortExpr allows defining a custom sort order, but does not know anything about extension types.
  • "Resolving" logical types happens during the creation of the initial physical plan.
  • LogicalTypePlanningInformation allows parameterizing this creation process
  • Obtaining the logical type is currently done by inspecting the Field whether there is an entry in the metadata with the key EXTENSION_TYPE_NAME_KEY.
    • An extension type registry is required for making use of this information.
    • Currently, this works only for direct access to columns. See Change ReturnTypeInfo to return a Field rather than DataType #14247 for some discussions on this issue.
    • In the future, maybe we can move DFSchema towards including a LogicalType and a DataType (at least as a first step). The remaining approach still works with such an assumption.

This is similar to "Option 1: User defined operators" from the discussion in #14247. Here are some thoughts on that:

  • From a logical standpoint, the expressions themselves do not change. It's still a SortExpr just on a special type.
  • Rewriting all Expr to UDFs via analyzers can be also tricky as we would then, for example, need to parameterize SortExpr with a custom ordering to allow the user-defined sorting use case.
  • This makes the physical plan creation more involved.

There are many changes in tests etc. so here is a list of "highlights" you should check out when taking a look at this PR.
Highlights:

  • datafusion/common/src/sort.rs: new sort data structures and adapted procedures from arrow-rs.
  • datafusion/common/src/types/logical.rs: Extension to logical type to provide a LogicalTypePlanningInformation.
  • datafusion/core/tests/dataframe/test_types.rs: A logical type (IntOrFloatType) that implements a custom order.
  • datafusion/core/tests/dataframe/mod.rs: A test for sorting a on union type.
  • datafusion/core/src/physical_planner.rs: Applying custom order during physical plan creation.

To get (a similar) approach upstream I'd suggest the following steps (splitting up the PR, adding tests etc.):

  1. Add SortOrdering to PhysicalSortExpr and add support for user-defined sorting operations (DynComparator for now)
  2. Add ExtensionTypeRegistry to support registering extension types.
  3. Use said registry to look up whether we have planning information for SortExpr during planning and construct the correct PhysicalSortExpr.
  4. Adapt existing code that relies on the fact that types have a natural ordering (e.g., AggregateExec). This step will be done interatively.

Note that this procedure leads to a state where some parts of DF already make use of user defined sorting, while other do not yet support it. However, I don't think this is a deal breaker.

I'm eager to hear your thoughts!

Further Notes

I think this would also require users to return a Field from a UDF instead of just a data type (#14247) to allow UDFs to return extension types. See TODO note in datafusion/core/src/physical_planner.rs.

I think supported the row-based sorting from arrow-rs can also be supported in a similar fashion by extending CustomOrdering.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate labels Mar 9, 2025
@paleolimbot
Copy link
Member

I'm not sure I'm qualified to review this from an implementation standpoint, but the idea of an extension type registry and a centralized place (here, the LogicalType trait implementation) for behaviour that differs is something I'd like as well, and this PR nicely points out all the places I should be looking 🙂 .

@tobixdev
Copy link
Contributor Author

Happy to hear any kind of feedback on that @paleolimbot . So take a look if you've time. Also good to hear that others also have these requirements in their projects :). Could you think of a way to implement your use case by extending the proposed solution (to get a feeling whether this idea generalizes to other use cases)?

As this is quite a fundamental change that can influence many more features to come, we would definitely need some input from someone more familiar with the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
catalog Related to the catalog crate common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate execution Related to the execution crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants