[SPARK-48530][SQL] Support for local variables in SQL Scripting
### What changes were proposed in this pull request?
This pull request introduces support for local variables in SQL scripting.

#### Behavior:

Local variables are declared in the headers of compound bodies and are bound to that body's scope. Variables of the same name are allowed in nested scopes, where the innermost variable is resolved. Optionally, a local variable can be qualified with the label of the compound body in which it was declared, which allows accessing variables that are not the innermost in the current scope.
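For illustration, a minimal sketch of the scoping behavior (hypothetical script run through `spark.sql`, assuming SQL scripting is enabled; the labels and names are made up):

```scala
// Nested compound bodies each declare their own `x`.
spark.sql("""
  outer_body: BEGIN
    DECLARE x INT DEFAULT 1;
    inner_body: BEGIN
      DECLARE x INT DEFAULT 2;
      SELECT x;             -- resolves the innermost x -> 2
      SELECT outer_body.x;  -- label-qualified access to the shadowed x -> 1
    END;
  END
""").show()
```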

Local variables take resolution priority over session variables; session variable resolution is attempted only after local variable resolution fails. The exception is fully qualified session variables, in the format `system.session.<varName>` or `session.<varName>`. `system` and `session` are forbidden for use as compound body labels.
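A sketch of the resolution order, assuming `v` exists both as a session variable and as a local:

```scala
spark.sql("DECLARE VARIABLE v INT DEFAULT 10") // session variable
spark.sql("""
  BEGIN
    DECLARE v INT DEFAULT 20;  -- local variable shadows the session one
    SELECT v;                  -- local resolution wins -> 20
    SELECT session.v;          -- fully qualified -> session variable -> 10
    SELECT system.session.v;   -- equivalent fully qualified form -> 10
  END
""").show()
```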

Local variables must not be qualified at declaration, can be set using `SET VAR`, and cannot be dropped (`DROP TEMPORARY VARIABLE` is not supported within SQL scripts).
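For example (sketch; the commented-out statements are the ones that are rejected):

```scala
spark.sql("""
  BEGIN
    DECLARE x INT DEFAULT 0;         -- OK: unqualified declaration
    SET VAR x = x + 1;               -- OK: local variables can be set
    -- DECLARE some_label.x INT;     -- error: qualified local declaration
    -- DROP TEMPORARY VARIABLE x;    -- error: DROP not supported in scripts
    SELECT x;
  END
""").show()
```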

They should also not be allowed to be declared with `DECLARE OR REPLACE`; however, this restriction is not implemented in this PR because the `FOR` statement relies on the replace behavior. The `FOR` statement must be updated in a separate PR to use proper local variables, as the current implementation simulates them using session variables.

#### Implementation notes:

Because core depends on catalyst, it is impossible to import code from core (where most of the SQL scripting implementation is located) into catalyst. To solve this, a trait `VariableManager` is introduced, which is implemented in core and injected into catalyst. This `VariableManager` is essentially a wrapper around `SqlScriptingExecutionContext` and provides methods for getting/setting/creating variables.
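A rough sketch of the trait's shape, inferred from its uses in this diff (`get`, `qualify` and `isEmpty` appear below; the `create`/`set` signatures here are assumptions):

```scala
import org.apache.spark.sql.catalyst.analysis.ResolvedIdentifier
import org.apache.spark.sql.catalyst.catalog.VariableDefinition

// Sketch only; the real trait lives in org.apache.spark.sql.catalyst.catalog.
trait VariableManager {
  // Resolve a variable by its (case-adjusted) name parts.
  def get(nameParts: Seq[String]): Option[VariableDefinition]

  // Assumed: create a variable in the innermost scope.
  def create(name: String, varDef: VariableDefinition): Unit

  // Assumed: update an existing variable (backs SET VAR).
  def set(name: String, varDef: VariableDefinition): Unit

  // Qualify an unqualified name, e.g. with the declaring compound's label.
  def qualify(name: String): ResolvedIdentifier

  // True when no variables are registered (used by ResolverGuard).
  def isEmpty: Boolean
}
```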

This injection is tricky because we want to have one `ScriptingVariableManager` **per script**.
Options considered to achieve this are:
- Pass the manager/context to the analyzer using function calls. If possible, this solution would be ideal, because it would allow every run of the analyzer to have its own scripting context which is automatically cleaned up (`AnalysisContext`). It would also allow more control over variable resolution; for example, for `EXECUTE IMMEDIATE` we could simply not pass in the script context and it would behave as if outside of a script, which is the intended behavior for `EXECUTE IMMEDIATE`. The problem with this approach is that it seems hard to implement. The call stack would be as follows: `Analyzer.executeAndCheck` -> `HybridAnalyzer.apply` -> `RuleExecutor.executeAndTrack` -> `Analyzer.execute` (**overridden** from `RuleExecutor`) -> `Analyzer.withNewAnalysisContext`. Implementing this context propagation would require changing the signatures of all of these methods, including superclass methods like `execute` and `executeAndTrack`.
- Store the context in `CatalogManager`. `CatalogManager`'s lifetime is tied to the session, so to allow multiple scripts to execute at the same time we would need, for example, a map `scriptUUID -> VariableManager`, with the `scriptUUID` kept as a `ThreadLocal` variable in the `CatalogManager`. The drawbacks of this approach are that the script has to clean up its resources after execution, and that it is more complicated to, for example, forbid `EXECUTE IMMEDIATE` from accessing local variables.

Currently the second option seems better to me; however, I am open to suggestions on how to approach this.

EDIT: An option similar to the second one was chosen, except that a `ThreadLocal` singleton instance of the context is used instead of storing it in `CatalogManager`.

EDIT: `EXECUTE IMMEDIATE` needs to be reworked in order to work properly with local variables. The generated query should not be able to access local variables, which means `EXECUTE IMMEDIATE` needs to sandbox that query. This is done by analyzing its entire subtree in `SubstituteExecuteImmediate`, with context so we know we are in `EXECUTE IMMEDIATE`. PR for this refactor: #49993
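A sketch of the intended sandboxing (hypothetical script; after the linked refactor, the generated query resolves only session variables):

```scala
spark.sql("""
  BEGIN
    DECLARE x INT DEFAULT 1;
    SELECT x;                      -- OK: the local variable is visible
    EXECUTE IMMEDIATE 'SELECT x';  -- fails unless a session variable x exists
  END
""")
```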

### Why are the changes needed?
Currently, local variables are simulated using session variables in SQL scripting, which is a temporary solution with many drawbacks.

### Does this PR introduce _any_ user-facing change?
Yes, this change introduces several new error conditions.

### How was this patch tested?
Tests were added to `SqlScriptingExecutionSuite` and `SqlScriptingParserSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49445 from dusantism-db/scripting-local-variables.

Authored-by: Dušan Tišma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit fb17856)
Signed-off-by: Wenchen Fan <[email protected]>
dusantism-db authored and cloud-fan committed Feb 20, 2025
1 parent 60e1d4a commit e140dbb
Showing 30 changed files with 1,788 additions and 327 deletions.
21 changes: 21 additions & 0 deletions common/utils/src/main/resources/error/error-conditions.json
@@ -3580,6 +3580,16 @@
"message" : [
"Variable <varName> can only be declared at the beginning of the compound."
]
},
"QUALIFIED_LOCAL_VARIABLE" : {
"message" : [
"The variable <varName> must be declared without a qualifier, as qualifiers are not allowed for local variable declarations."
]
},
"REPLACE_LOCAL_VARIABLE" : {
"message" : [
"The variable <varName> does not support DECLARE OR REPLACE, as local variables cannot be replaced."
]
}
},
"sqlState" : "42K0M"
@@ -3726,6 +3736,12 @@
],
"sqlState" : "42K0L"
},
"LABEL_NAME_FORBIDDEN" : {
"message" : [
"The label name <label> is forbidden."
],
"sqlState" : "42K0L"
},
"LOAD_DATA_PATH_NOT_EXISTS" : {
"message" : [
"LOAD DATA input path does not exist: <path>."
@@ -5788,6 +5804,11 @@
"SQL Scripting is under development and not all features are supported. SQL Scripting enables users to write procedural SQL including control flow and error handling. To enable existing features set <sqlScriptingEnabled> to `true`."
]
},
"SQL_SCRIPTING_DROP_TEMPORARY_VARIABLE" : {
"message" : [
"DROP TEMPORARY VARIABLE is not supported within SQL scripts. To bypass this, use `EXECUTE IMMEDIATE 'DROP TEMPORARY VARIABLE ...'` ."
]
},
"SQL_SCRIPTING_WITH_POSITIONAL_PARAMETERS" : {
"message" : [
"Positional parameters are not supported with SQL Scripting."
@@ -0,0 +1,66 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.util

/**
 * Helper trait for defining thread locals with lexical scoping. With this helper, the thread local
 * is private and can only be set by the [[Handle]]. The [[Handle]] only exposes the thread local
 * value to functions passed into its runWith method. This pattern allows for
 * the lifetime of the thread local value to be strictly controlled.
 *
 * Rather than calling `tl.set(...)` and `tl.remove()` you would get a handle and execute your code
 * in `handle.runWith { ... }`.
 *
 * Example:
 * {{{
 *   object Credentials extends LexicalThreadLocal[Map[String, String]] {
 *     def create(creds: Map[String, String]) = createHandle(Some(creds))
 *   }
 *   ...
 *   val handle = Credentials.create(Map("key" -> "value"))
 *   assert(Credentials.get() == None)
 *   handle.runWith {
 *     assert(Credentials.get() == Some(Map("key" -> "value")))
 *   }
 * }}}
 */
trait LexicalThreadLocal[T] {
  private val tl = new ThreadLocal[T]

  private def set(opt: Option[T]): Unit = {
    opt match {
      case Some(x) => tl.set(x)
      case None => tl.remove()
    }
  }

  protected def createHandle(opt: Option[T]): Handle = new Handle(opt)

  def get(): Option[T] = Option(tl.get)

  /** Final class representing a handle to a thread local value. */
  final class Handle private[LexicalThreadLocal] (private val opt: Option[T]) {
    def runWith[R](f: => R): R = {
      val old = get()
      set(opt)
      try f finally {
        set(old)
      }
    }
  }
}
@@ -0,0 +1,25 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.sql.catalyst

import org.apache.spark.sql.catalyst.catalog.VariableManager
import org.apache.spark.util.LexicalThreadLocal

object SqlScriptingLocalVariableManager extends LexicalThreadLocal[VariableManager] {
  def create(variableManager: VariableManager): Handle = createHandle(Option(variableManager))
}
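Given the `LexicalThreadLocal` contract above, per-script usage presumably looks like the following sketch (`scriptVariableManager` is a hypothetical value of type `VariableManager`):

```scala
// Bind a per-script VariableManager for the duration of execution.
val handle = SqlScriptingLocalVariableManager.create(scriptVariableManager)
handle.runWith {
  // Code running inside this block (e.g. analysis) sees the script's manager.
  assert(SqlScriptingLocalVariableManager.get().contains(scriptVariableManager))
}
// Outside runWith, the thread local is cleared again.
assert(SqlScriptingLocalVariableManager.get().isEmpty)
```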
@@ -23,8 +23,10 @@ import scala.collection.mutable

import org.apache.spark.internal.Logging
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.SqlScriptingLocalVariableManager
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.SubExprUtils.wrapOuterReference
import org.apache.spark.sql.catalyst.parser.SqlScriptingLabelContext.isForbiddenLabelName
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.trees.CurrentOrigin.withOrigin
import org.apache.spark.sql.catalyst.trees.TreePattern._
@@ -251,6 +253,14 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
}
}

/**
 * Look up a variable by nameParts.
 * If in a SQL script, check local variables first, unless in EXECUTE IMMEDIATE
 * (a query generated by EXECUTE IMMEDIATE cannot access local variables).
 * If not found, fall back to session variables.
 * @param nameParts Name parts of the variable.
 * @return Reference to the variable.
 */
def lookupVariable(nameParts: Seq[String]): Option[VariableReference] = {
// The temp variables live in `SYSTEM.SESSION`, and the name can be qualified or not.
def maybeTempVariableName(nameParts: Seq[String]): Boolean = {
@@ -266,22 +276,41 @@
}
}

if (maybeTempVariableName(nameParts)) {
val variableName = if (conf.caseSensitiveAnalysis) {
nameParts.last
} else {
nameParts.last.toLowerCase(Locale.ROOT)
}
catalogManager.tempVariableManager.get(variableName).map { varDef =>
val namePartsCaseAdjusted = if (conf.caseSensitiveAnalysis) {
nameParts
} else {
nameParts.map(_.toLowerCase(Locale.ROOT))
}

SqlScriptingLocalVariableManager.get()
// If we are in EXECUTE IMMEDIATE lookup only session variables.
.filterNot(_ => AnalysisContext.get.isExecuteImmediate)
// If variable name is qualified with session.<varName> treat it as a session variable.
.filterNot(_ =>
nameParts.length > 2 || (nameParts.length == 2 && isForbiddenLabelName(nameParts.head)))
.flatMap(_.get(namePartsCaseAdjusted))
.map { varDef =>
VariableReference(
nameParts,
FakeSystemCatalog,
Identifier.of(Array(CatalogManager.SESSION_NAMESPACE), variableName),
FakeLocalCatalog,
Identifier.of(Array(varDef.identifier.namespace().last), namePartsCaseAdjusted.last),
varDef)
}
} else {
None
}
.orElse(
if (maybeTempVariableName(nameParts)) {
catalogManager.tempVariableManager
.get(namePartsCaseAdjusted)
.map { varDef =>
VariableReference(
nameParts,
FakeSystemCatalog,
Identifier.of(Array(CatalogManager.SESSION_NAMESPACE), namePartsCaseAdjusted.last),
varDef
)}
} else {
None
}
)
}

// Resolves `UnresolvedAttribute` to its value.
@@ -19,9 +19,13 @@ package org.apache.spark.sql.catalyst.analysis

import scala.jdk.CollectionConverters._

import org.apache.spark.SparkException
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.SqlScriptingLocalVariableManager
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogPlugin, Identifier, LookupCatalog, SupportsNamespaces}
import org.apache.spark.sql.errors.DataTypeErrors.toSQLId
import org.apache.spark.sql.errors.QueryCompilationErrors
import org.apache.spark.util.ArrayImplicits._

@@ -35,10 +39,42 @@ class ResolveCatalogs(val catalogManager: CatalogManager)
// We only support temp variables for now and the system catalog is not properly implemented
// yet. We need to resolve `UnresolvedIdentifier` for variable commands specially.
case c @ CreateVariable(UnresolvedIdentifier(nameParts, _), _, _) =>
val resolved = resolveVariableName(nameParts)
// From scripts we can only create local variables, which must be unqualified,
// and must not be DECLARE OR REPLACE.
val resolved = if (withinSqlScript) {
// TODO [SPARK-50785]: Uncomment this when For Statement starts properly using local vars.
// if (c.replace) {
// throw new AnalysisException(
// "INVALID_VARIABLE_DECLARATION.REPLACE_LOCAL_VARIABLE",
// Map("varName" -> toSQLId(nameParts))
// )
// }

if (nameParts.length != 1) {
throw new AnalysisException(
"INVALID_VARIABLE_DECLARATION.QUALIFIED_LOCAL_VARIABLE",
Map("varName" -> toSQLId(nameParts)))
}

SqlScriptingLocalVariableManager.get()
.getOrElse(throw SparkException.internalError(
"Scripting local variable manager should be present in SQL script."))
.qualify(nameParts.last)
} else {
val resolvedIdentifier = catalogManager.tempVariableManager.qualify(nameParts.last)

assertValidSessionVariableNameParts(nameParts, resolvedIdentifier)
resolvedIdentifier
}

c.copy(name = resolved)
case d @ DropVariable(UnresolvedIdentifier(nameParts, _), _) =>
val resolved = resolveVariableName(nameParts)
if (withinSqlScript) {
throw new AnalysisException(
"UNSUPPORTED_FEATURE.SQL_SCRIPTING_DROP_TEMPORARY_VARIABLE", Map.empty)
}
val resolved = catalogManager.tempVariableManager.qualify(nameParts.last)
assertValidSessionVariableNameParts(nameParts, resolved)
d.copy(name = resolved)

case UnresolvedIdentifier(nameParts, allowTemp) =>
@@ -73,28 +109,34 @@
}
}

private def resolveVariableName(nameParts: Seq[String]): ResolvedIdentifier = {
def ident: Identifier = Identifier.of(Array(CatalogManager.SESSION_NAMESPACE), nameParts.last)
if (nameParts.length == 1) {
ResolvedIdentifier(FakeSystemCatalog, ident)
} else if (nameParts.length == 2) {
if (nameParts.head.equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)) {
ResolvedIdentifier(FakeSystemCatalog, ident)
} else {
throw QueryCompilationErrors.unresolvedVariableError(
nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
}
} else if (nameParts.length == 3) {
if (nameParts(0).equalsIgnoreCase(CatalogManager.SYSTEM_CATALOG_NAME) &&
nameParts(1).equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE)) {
ResolvedIdentifier(FakeSystemCatalog, ident)
} else {
throw QueryCompilationErrors.unresolvedVariableError(
nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
}
} else {
private def withinSqlScript: Boolean =
SqlScriptingLocalVariableManager.get().isDefined && !AnalysisContext.get.isExecuteImmediate

private def assertValidSessionVariableNameParts(
nameParts: Seq[String],
resolvedIdentifier: ResolvedIdentifier): Unit = {
if (!validSessionVariableName(nameParts)) {
throw QueryCompilationErrors.unresolvedVariableError(
nameParts, Seq(CatalogManager.SYSTEM_CATALOG_NAME, CatalogManager.SESSION_NAMESPACE))
nameParts,
Seq(
resolvedIdentifier.catalog.name(),
resolvedIdentifier.identifier.namespace().head)
)
}

def validSessionVariableName(nameParts: Seq[String]): Boolean = nameParts.length match {
case 1 => true

// On declare variable, local variables support only unqualified names.
// On drop variable, local variables are not supported at all.
case 2 if nameParts.head.equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE) => true

// When there are 3 nameParts the variable must be a fully qualified session variable
// i.e. "system.session.<varName>"
case 3 if nameParts(0).equalsIgnoreCase(CatalogManager.SYSTEM_CATALOG_NAME) &&
nameParts(1).equalsIgnoreCase(CatalogManager.SESSION_NAMESPACE) => true

case _ => false
}
}
}
@@ -53,11 +53,12 @@ class ResolveSetVariable(val catalogManager: CatalogManager) extends Rule[Logica
// Names are normalized when the variables are created.
// No need for case insensitive comparison here.
// TODO: we need to group by the qualified variable name once other catalogs support it.
val dups = resolvedVars.groupBy(_.identifier.name).filter(kv => kv._2.length > 1)
val dups = resolvedVars.groupBy(_.identifier).filter(kv => kv._2.length > 1)
if (dups.nonEmpty) {
throw new AnalysisException(
errorClass = "DUPLICATE_ASSIGNMENTS",
messageParameters = Map("nameList" -> dups.keys.map(toSQLId).mkString(", ")))
messageParameters = Map("nameList" ->
dups.keys.map(key => toSQLId(key.name())).mkString(", ")))
}

setVariable.copy(targetVariables = resolvedVars)
@@ -19,7 +19,7 @@ package org.apache.spark.sql.catalyst.analysis.resolver

import java.util.Locale

import org.apache.spark.sql.catalyst.{FunctionIdentifier, SQLConfHelper}
import org.apache.spark.sql.catalyst.{FunctionIdentifier, SQLConfHelper, SqlScriptingLocalVariableManager}
import org.apache.spark.sql.catalyst.analysis.{
FunctionRegistry,
GetViewColumnByNameAndOrdinal,
@@ -266,7 +266,9 @@ class ResolverGuard(catalogManager: CatalogManager) extends SQLConfHelper {
LegacyBehaviorPolicy.withName(conf.getConf(SQLConf.LEGACY_CTE_PRECEDENCE_POLICY)) ==
LegacyBehaviorPolicy.CORRECTED

private def checkVariables() = catalogManager.tempVariableManager.isEmpty
private def checkVariables() =
catalogManager.tempVariableManager.isEmpty &&
SqlScriptingLocalVariableManager.get().forall(_.isEmpty)
}

object ResolverGuard {
@@ -261,3 +261,11 @@ object FakeSystemCatalog extends CatalogPlugin {
  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {}
  override def name(): String = "system"
}

/**
* A fake v2 catalog to hold local variables for SQL scripting.
*/
object FakeLocalCatalog extends CatalogPlugin {
  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {}
  override def name(): String = "local"
}