Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51247][SQL] Move SubstituteExecuteImmediate to 'resolution' batch and prepare it for SQL Scripting local variables. #49993

Conversation

dusantism-db
Copy link
Contributor

@dusantism-db dusantism-db commented Feb 18, 2025

What changes were proposed in this pull request?

This PR changes SubstituteExecuteImmediate to analyze it's entire subtree within a scoped context. This will allow us to disable SQL scripting local variables in the subtree, when they are added, which is necessary in order to sandbox the generated plan.

This PR also moves SubstituteExecuteImmediate to resolution batch in the analyzer. This is necessary in order to resolve arguments of EXECUTE IMMEDIATE properly, notably if the EXECUTE IMMEDIATE is the child of a ParameterizedQuery. This ensured proper resolution ordering i.e. first all parameters of EXECUTE IMMEDIATE will be resolved, and only then will the generated query itself be analyzed.

Local variables PR - #49445

Why are the changes needed?

They are necessaty for local variables support in SQL scripting.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests and golden files.

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -1670,6 +1683,9 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
case s: Sort if !s.resolved || s.missingInput.nonEmpty =>
resolveReferencesInSort(s)

// Pass for Execute Immediate as arguments will be resolved by [[SubstituteExecuteImmediate]].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably clean up this later: I don't think we need special arguments resolution in SubstituteExecuteImmediate. EXECUTE IMMEDIATE is a leaf node and it can't resolve arguments to any columns, so it should be fine to go with normal resolution code path and resolve the arguments to session variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my knowledge, can you clarify on this?
IIUC, this comment in the code refers to the parameters specified in the USING clause in EXECUTE IMMEDIATE, i.e. their resolution in the query. In the USING clause we have <expr> AS <alias> syntax that can basically rename any session var and we should be able to use local vars here as well.
How would this work through the regular resolution path? Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I put the pass here is to not resolve targetVariables. For example if we have a session variable x and we want to set it with INTO clause of EXECUTE IMMEDIATE, if we don't have this pass here x will get resolved to it's value. This doesn't make sense as we don't want the variables value, we want it's VariableReference

Copy link
Contributor

@cloud-fan cloud-fan Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see. It's similar to how we resolve SetVariable. We also special-case SetVariable in this rule ResolveReferences

e.copy(args = resolveArguments(expressions))

case ExecuteImmediateQuery(expressions, query, targetVariables)
if expressions.forall(_.resolved) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this always going to be true, considering that we didn't match the previous case at this point?

Copy link
Contributor Author

@dusantism-db dusantism-db Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, i put the additional case to make it more explicit / robust. For example if the order changes or the previous case is moved out of the rule (for example to ResolveReferences), this would still work properly.

@davidm-db
Copy link
Contributor

Please add this PR to the description of the local variables one, and also edit the description there to better explain the new approach.

@cloud-fan
Copy link
Contributor

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in 26febf7 Feb 19, 2025
cloud-fan pushed a commit that referenced this pull request Feb 19, 2025
…tch and prepare it for SQL Scripting local variables

### What changes were proposed in this pull request?
This PR changes `SubstituteExecuteImmediate`  to analyze it's entire subtree within a scoped context. This will allow us to disable SQL scripting local variables in the subtree, when they are added, which is necessary in order to sandbox the generated plan.

This PR also moves `SubstituteExecuteImmediate` to `resolution` batch in the analyzer. This is necessary in order to resolve arguments of EXECUTE IMMEDIATE properly, notably if the EXECUTE IMMEDIATE is the child of a `ParameterizedQuery`. This ensured proper resolution ordering i.e. first all parameters of EXECUTE IMMEDIATE will be resolved, and only then will the generated query itself be analyzed.

Local variables PR - #49445

### Why are the changes needed?
They are necessaty for local variables support in SQL scripting.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests and golden files.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49993 from dusantism-db/execute-immediate-resolution-batch.

Authored-by: Dušan Tišma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 26febf7)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 20, 2025
### What changes were proposed in this pull request?
This pull request introduces support for local variables in SQL scripting.

#### Behavior:

Local variables are declared in the headers of compound bodies, and are bound to it's scope. Variables of the same name are allowed in nested scopes, where the innermost variable will be resolved. Optionally, a local variable can be qualified with the label of the compound body in which it was declared, which would allow accessing variables which are not the innermost in the current scope.

Local variables have resolution priority over session variables, session variable resolution is attempted after local variable resolution. The exception to this is with fully qualified session variables, in the format `system.session.<varName>` or `session.<varName>`. System and session are forbidden for use as compound body labels.

Local variables must not be qualified on declaration, can be set using `SET VAR` and cannot be `DROPPED`.

They also should not be allowed to be declared with `DECLARE OR REPLACE`, however this is not implemented on this PR as `FOR` statement relies on this behavior. `FOR` statement must be updated in a separate PR to use proper local variables, as the current implementation is simulating them using session variables.

#### Implementation notes:

As core depends on catalyst, it's impossible to import code from core(where most of SQL scripting implementation is located) to catalyst. To solve this a trait `VariableManager` is introduced, which is then implemented in core and injected to catalyst. This `VariableManager` is basically a wrapper around `SqlScriptingExecutionContext` and provides methods for getting/setting/creating variables.

This injection is tricky because we want to have one `ScriptingVariableManager` **per script**.
Options considered to achieve this are:
- Pass the manager/context to the analyzer using function calls. If possible, this solution would be ideal because it would allow every run of the analyzer to have it's own scripting context which is automatically cleaned up (AnalysisContext). This would also allow more control over the variable resolution, i.e. for `EXECUTE IMMEDIATE` we could simply not pass in the script context and it would behave as if outside of a script. This is the intended behavior for `EXECUTE IMMEDIATE`. The problem with this approach is it seems hard to implement. The call stack would be as follows: `Analyzer.executeAndCheck` -> `HybridAnalyzer.apply` -> `RuleExecutor.executeAndTrack` -> `Analyzer.execute` (**overridden** from RuleExecutor) -> `Analyzer.withNewAnalysisContext`. Implementing this context propagation would require changing the signatures of all of these methods, including superclass methods like `execute` and `executeAndTrack`.
- Store the context in `CatalogManager`. `CatalogManager's` lifetime is tied to the session, so to allow for multiple scripts to execute in the same time we would need to e.g. have a map `scriptUUID -> VariableManager`, and to have the `scriptUUID` as a `ThreadLocal` variable in the `CatalogManager`. The drawback of this approach is that the script has to clean up it's resources after execution, and also that it's more complicated to e.g. forbid `EXECUTE IMMEDIATE` from accessing local variables.

Currently the second option seems better to me, however I am open to suggestions on how to approach this.

EDIT: An option similar to the second one was chosen, except a ThreadLocal Singleton instance of context is used instead of storing it in `CatalogManager`.

EDIT: Execute Immediate needs to be reworked in order to work properly with local variables. The generated query should not be able to access local variables, which means EXECUTE IMMEDIATE needs to somehow sandbox that query. This is done by analyzing it's entire subtree in SubstituteExecuteImmediate, with context so we know we are in EXECUTE IMMEDIATE. PR for this refactor - #49993

### Why are the changes needed?
Currently, local variables are simulated using session variables in SQL scripting, which is a temporary solution and bad in many ways.

### Does this PR introduce _any_ user-facing change?
Yes, this change introduces multiple new types of errors.

### How was this patch tested?
Tests were added to SqlScriptingExecutionSuite and SqlScriptingParserSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49445 from dusantism-db/scripting-local-variables.

Authored-by: Dušan Tišma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Feb 20, 2025
### What changes were proposed in this pull request?
This pull request introduces support for local variables in SQL scripting.

#### Behavior:

Local variables are declared in the headers of compound bodies, and are bound to it's scope. Variables of the same name are allowed in nested scopes, where the innermost variable will be resolved. Optionally, a local variable can be qualified with the label of the compound body in which it was declared, which would allow accessing variables which are not the innermost in the current scope.

Local variables have resolution priority over session variables, session variable resolution is attempted after local variable resolution. The exception to this is with fully qualified session variables, in the format `system.session.<varName>` or `session.<varName>`. System and session are forbidden for use as compound body labels.

Local variables must not be qualified on declaration, can be set using `SET VAR` and cannot be `DROPPED`.

They also should not be allowed to be declared with `DECLARE OR REPLACE`, however this is not implemented on this PR as `FOR` statement relies on this behavior. `FOR` statement must be updated in a separate PR to use proper local variables, as the current implementation is simulating them using session variables.

#### Implementation notes:

As core depends on catalyst, it's impossible to import code from core(where most of SQL scripting implementation is located) to catalyst. To solve this a trait `VariableManager` is introduced, which is then implemented in core and injected to catalyst. This `VariableManager` is basically a wrapper around `SqlScriptingExecutionContext` and provides methods for getting/setting/creating variables.

This injection is tricky because we want to have one `ScriptingVariableManager` **per script**.
Options considered to achieve this are:
- Pass the manager/context to the analyzer using function calls. If possible, this solution would be ideal because it would allow every run of the analyzer to have it's own scripting context which is automatically cleaned up (AnalysisContext). This would also allow more control over the variable resolution, i.e. for `EXECUTE IMMEDIATE` we could simply not pass in the script context and it would behave as if outside of a script. This is the intended behavior for `EXECUTE IMMEDIATE`. The problem with this approach is it seems hard to implement. The call stack would be as follows: `Analyzer.executeAndCheck` -> `HybridAnalyzer.apply` -> `RuleExecutor.executeAndTrack` -> `Analyzer.execute` (**overridden** from RuleExecutor) -> `Analyzer.withNewAnalysisContext`. Implementing this context propagation would require changing the signatures of all of these methods, including superclass methods like `execute` and `executeAndTrack`.
- Store the context in `CatalogManager`. `CatalogManager's` lifetime is tied to the session, so to allow for multiple scripts to execute in the same time we would need to e.g. have a map `scriptUUID -> VariableManager`, and to have the `scriptUUID` as a `ThreadLocal` variable in the `CatalogManager`. The drawback of this approach is that the script has to clean up it's resources after execution, and also that it's more complicated to e.g. forbid `EXECUTE IMMEDIATE` from accessing local variables.

Currently the second option seems better to me, however I am open to suggestions on how to approach this.

EDIT: An option similar to the second one was chosen, except a ThreadLocal Singleton instance of context is used instead of storing it in `CatalogManager`.

EDIT: Execute Immediate needs to be reworked in order to work properly with local variables. The generated query should not be able to access local variables, which means EXECUTE IMMEDIATE needs to somehow sandbox that query. This is done by analyzing it's entire subtree in SubstituteExecuteImmediate, with context so we know we are in EXECUTE IMMEDIATE. PR for this refactor - #49993

### Why are the changes needed?
Currently, local variables are simulated using session variables in SQL scripting, which is a temporary solution and bad in many ways.

### Does this PR introduce _any_ user-facing change?
Yes, this change introduces multiple new types of errors.

### How was this patch tested?
Tests were added to SqlScriptingExecutionSuite and SqlScriptingParserSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49445 from dusantism-db/scripting-local-variables.

Authored-by: Dušan Tišma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit fb17856)
Signed-off-by: Wenchen Fan <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Feb 20, 2025
### What changes were proposed in this pull request?
This pull request introduces support for local variables in SQL scripting.

#### Behavior:

Local variables are declared in the headers of compound bodies, and are bound to it's scope. Variables of the same name are allowed in nested scopes, where the innermost variable will be resolved. Optionally, a local variable can be qualified with the label of the compound body in which it was declared, which would allow accessing variables which are not the innermost in the current scope.

Local variables have resolution priority over session variables, session variable resolution is attempted after local variable resolution. The exception to this is with fully qualified session variables, in the format `system.session.<varName>` or `session.<varName>`. System and session are forbidden for use as compound body labels.

Local variables must not be qualified on declaration, can be set using `SET VAR` and cannot be `DROPPED`.

They also should not be allowed to be declared with `DECLARE OR REPLACE`, however this is not implemented on this PR as `FOR` statement relies on this behavior. `FOR` statement must be updated in a separate PR to use proper local variables, as the current implementation is simulating them using session variables.

#### Implementation notes:

As core depends on catalyst, it's impossible to import code from core(where most of SQL scripting implementation is located) to catalyst. To solve this a trait `VariableManager` is introduced, which is then implemented in core and injected to catalyst. This `VariableManager` is basically a wrapper around `SqlScriptingExecutionContext` and provides methods for getting/setting/creating variables.

This injection is tricky because we want to have one `ScriptingVariableManager` **per script**.
Options considered to achieve this are:
- Pass the manager/context to the analyzer using function calls. If possible, this solution would be ideal because it would allow every run of the analyzer to have it's own scripting context which is automatically cleaned up (AnalysisContext). This would also allow more control over the variable resolution, i.e. for `EXECUTE IMMEDIATE` we could simply not pass in the script context and it would behave as if outside of a script. This is the intended behavior for `EXECUTE IMMEDIATE`. The problem with this approach is it seems hard to implement. The call stack would be as follows: `Analyzer.executeAndCheck` -> `HybridAnalyzer.apply` -> `RuleExecutor.executeAndTrack` -> `Analyzer.execute` (**overridden** from RuleExecutor) -> `Analyzer.withNewAnalysisContext`. Implementing this context propagation would require changing the signatures of all of these methods, including superclass methods like `execute` and `executeAndTrack`.
- Store the context in `CatalogManager`. `CatalogManager's` lifetime is tied to the session, so to allow for multiple scripts to execute in the same time we would need to e.g. have a map `scriptUUID -> VariableManager`, and to have the `scriptUUID` as a `ThreadLocal` variable in the `CatalogManager`. The drawback of this approach is that the script has to clean up it's resources after execution, and also that it's more complicated to e.g. forbid `EXECUTE IMMEDIATE` from accessing local variables.

Currently the second option seems better to me, however I am open to suggestions on how to approach this.

EDIT: An option similar to the second one was chosen, except a ThreadLocal Singleton instance of context is used instead of storing it in `CatalogManager`.

EDIT: Execute Immediate needs to be reworked in order to work properly with local variables. The generated query should not be able to access local variables, which means EXECUTE IMMEDIATE needs to somehow sandbox that query. This is done by analyzing it's entire subtree in SubstituteExecuteImmediate, with context so we know we are in EXECUTE IMMEDIATE. PR for this refactor - apache/spark#49993

### Why are the changes needed?
Currently, local variables are simulated using session variables in SQL scripting, which is a temporary solution and bad in many ways.

### Does this PR introduce _any_ user-facing change?
Yes, this change introduces multiple new types of errors.

### How was this patch tested?
Tests were added to SqlScriptingExecutionSuite and SqlScriptingParserSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49445 from dusantism-db/scripting-local-variables.

Authored-by: Dušan Tišma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants