Skip to content

Refactor the SQL implementation to include the SQLTable trait and add support for parameterized views. #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 14, 2025

Conversation

trueleo
Copy link
Collaborator

@trueleo trueleo commented Apr 7, 2025

Required for this issue

This PR primarily aims to refactor the SQL implementation and additionally define/extend types and traits to support parameterized view.

The idea behind the change is essentially separating SQLExecutor behavior from Table. Allowing tables to define their own state and make changes to logical/ast accordingly. Things like handling 4 token multi-part identifiers, attaching WITH context ..etc can be done via creating a custom SQLTable implementation in datafusion-table-provider.

Changes

  • Adds SQLTable trait which is used by SQLTableSource. SQLTable trait abstracts information about the remote table and allows implementors of the trait to hook into the final stages of federation, where they can change logical plan and the final ast for the sql query that will be used within VirtualExecutionPlan.

  • Adds RemoteTable, a default implementation for SQLTable trait, capable of handing table and parameterized views.

  • Adds RemoteTableRef, a extention to default TableReference capable of storing function args.

  • Provides a default AST Analyzer for rewriting Statement for tables which contain RemoteTableRef with some functional args

  • Extends SqlExecutor trait with logical_optimizer method, this can allow executor to hook into federation planning, allowing for rewriting LogicalPlan and even placement of FederationPlanNode. This is useful for avoiding federating nodes that are only part of datafusion eg. UDF, UDAF.. etc.

  • Refactors and testing related to usage of this feature

…eterized views

This PR primarily aims to refactor the SQL implementation and additionally
extends the traits and their default implementation to support storing
function arguments for table.

- Adds a SQLTable trait that is to be used within SQLTableSource. SQLTable
trait abstracts information about the remote table and allows of the trait
to hook into the final stages where they can change logical plan and the final
AST for sql query that is being federated via VirtualExecutionPlan.

- Adds RemoteTable, a default implementation for SQLTable trait, capable of
handing table and parameterized views.

- Adds RemoteTableRef, a extention to default TableReference capable of
storing function args.

- Provides a default AST Analyzer for rewriting Statement for tables which
contain RemoteTableRef with some functional Args

- Extends SqlExecutor trait with logical_optimizer method, this can allow
executor to hook into federation planning, allowing for rewriting LogicalPlan
and even placement of FederationPlanNode. This is useful for avoiding
federating nodes that are only part of datafusion eg. UDF, UDAF.. etc.

- Refactors and testing related to usage of this feature
@trueleo trueleo force-pushed the parameterized_views branch from 60440c7 to b7b88cd Compare April 11, 2025 05:01
@trueleo trueleo changed the title Define SQLTable trait for extention and implement parameterized views Refactor the SQL implementation to include the SQLTable trait and add support for parameterized views. Apr 11, 2025
@trueleo trueleo marked this pull request as ready for review April 11, 2025 06:23
@backkem
Copy link
Collaborator

backkem commented Apr 11, 2025

Overall, this makes sense to me.

I do want to mention the following: We seem to be building out the "mini optimizer tooling" inside federation with this PR. It is still predominantly used for rewriting table names (I don't mean the AST optimiser, I think that makes sense). It does make me wonder if it wouldn't be more appropriate to use DataFusions optimizer tooling for this purpose. This was briefly discussed in #61 but was abandoned at the time. However, now it seems to be leading to additional wiring (E.g.: SQLTable.table_reference). This would seem avoidable if the table rewrite was already done before federation. Naturally this takes work so I wonder if there is enough apatite for it. I'd like to hear other opinions on this.

If there is no agreement/commitment to the above, I'd be in favor of landing this first and re-visiting the table rewrite topic at later time.

@backkem backkem requested a review from hozan23 April 11, 2025 08:42
@trueleo
Copy link
Collaborator Author

trueleo commented Apr 12, 2025

@backkem

It does make me wonder if it wouldn't be more appropriate to use DataFusions optimizer tooling for this purpose

Any optimization process that is done via SQLExecutor::logical_optimizer or SQLTable::logical_optimizer can very well be done with Datafusion optimizer pipeline. But for such optimizers usually implementers will have to assume no state about the plan and gather state just like in rewrite_table_scan we have a hashmap known_rewrites.

For now I am okay with discarding logical optimizer hooks from SQLTable as I don't really have a use case for them in mind. But given how FederationOptimizer passes the whole logical plan to SQLFederationOptimizer, I do think I have a usecase for SQLExecutor::logical_optimizer. Basically this can allow executor to also define what part of the passed down subtree can it actually federate. For example currently datafusion-json-function or any other UDFs cannot be just be federated and require some custom logic to bubble it out of FederatedPlanNode

@trueleo
Copy link
Collaborator Author

trueleo commented Apr 12, 2025

If there is no agreement/commitment to the above, I'd be in favor of landing this first and re-visiting the table rewrite topic at later time.

I moved, rewrite_table_scan to a separate file but its usage is still the same. I do agree that most of the general transformations should be pushed into logical planning via datafusion's AnalyzerRule/OptimizerRule, as it simply makes it more readable when you run EXPLAIN.

@backkem
Copy link
Collaborator

backkem commented Apr 12, 2025

I suggest we ticket out the optimiser discussion again. Let's not hold up this PR on it.

RE partial federation: that indeed makes sense as a use-case. The way I originally (intended to) support this is with the FederationProvider.Optimizer. The idea is to let remotes that can't execute everything to Self-Select which part of the plan to federate in there.

@hozan23 hozan23 merged commit a67b931 into datafusion-contrib:main Apr 14, 2025
7 checks passed
@github-actions github-actions bot mentioned this pull request Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants