
Commit ad8222a

LuciferYang authored and dongjoon-hyun committed
[SPARK-51158][YARN][TESTS] No longer restrict the test cases related to connect in the YarnClusterSuite to only run on GitHub Actions
### What changes were proposed in this pull request?

This PR adds more suitable `assume` conditions for the test cases related to `connect` in the `YarnClusterSuite`, so they are no longer mandated to run exclusively on GitHub Actions. The specific changes are as follows:

1. In `SparkBuild.scala`, test compilation dependencies have been added for the `Yarn` module to ensure that `build/sbt package` is executed to collect dependencies into the `assembly` directory before running `test` or `testOnly`.
2. In the `testPySpark` function of `YarnClusterSuite.scala`, two `assume` conditions have been added when `SPARK_API_MODE` is `connect`:
   - Check if `spark-connect_$scalaVersion-$SPARK_VERSION.jar` exists in the `assembly` directory. This condition is primarily for testing scenarios using Maven commands, as it cannot be guaranteed that the relevant dependencies are collected into the `assembly` directory first in such cases.
   - Call the `check_dependencies` function in `pyspark.sql.connect.utils` to ensure that the Connect-related Python packages have been installed.

   The test cases related to `connect` in the `YarnClusterSuite` will only be executed if both of the above conditions are met.
3. Remove the `assume` added in #49848.

### Why are the changes needed?

To no longer restrict the test cases related to `connect` in the `YarnClusterSuite` to only run on GitHub Actions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Manual check:

**SBT**

Run `build/sbt clean "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite" -Pyarn`

1. Python dependencies are not installed or partially installed: the relevant tests will be CANCELED:

```
Traceback (most recent call last):
  File "/Users/yangjie01/SourceCode/git/spark-sbt/python/pyspark/sql/connect/utils.py", line 47, in require_minimum_grpc_version
    import grpc
ModuleNotFoundError: No module named 'grpc'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/yangjie01/SourceCode/git/spark-sbt/python/pyspark/sql/connect/utils.py", line 37, in check_dependencies
    require_minimum_grpc_version()
  File "/Users/yangjie01/SourceCode/git/spark-sbt/python/pyspark/sql/connect/utils.py", line 49, in require_minimum_grpc_version
    raise PySparkImportError(
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] grpcio >= 1.48.1 must be installed; however, it was not found.
[info] - run Python application with Spark Connect in yarn-client mode !!! CANCELED !!! (159 milliseconds)
[info]   checker.isConnectPythonPackagesAvailable was false (YarnClusterSuite.scala:444)
[info]   org.scalatest.exceptions.TestCanceledException:
...
[info] - run Python application with Spark Connect in yarn-cluster mode !!! CANCELED !!! (1 millisecond)
[info]   checker.isConnectPythonPackagesAvailable was false (YarnClusterSuite.scala:444)
[info]   org.scalatest.exceptions.TestCanceledException:
...
[info] Run completed in 4 minutes, 22 seconds.
[info] Total number of tests run: 28
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 28, failed 0, canceled 2, ignored 0, pending 0
[info] All tests passed (excluding canceled).
```

2. Python dependencies are installed: the tests succeed and no tests will be canceled:

```
[info] Run completed in 4 minutes, 51 seconds.
[info] Total number of tests run: 30
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

3. Running `build/sbt clean "yarn/test" -Pyarn` yields similar results.

**Maven**

1. Dependencies not collected into the assembly module:

```
build/mvn clean install -DskipTests -pl resource-managers/yarn -am -Pyarn
build/mvn test -pl resource-managers/yarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Pyarn
```

The relevant tests will be CANCELED:

```
- run Python application with Spark Connect in yarn-client mode !!! CANCELED !!!
  checker.isSparkConnectJarAvailable was false (YarnClusterSuite.scala:443)
- run Python application with Spark Connect in yarn-cluster mode !!! CANCELED !!!
  checker.isSparkConnectJarAvailable was false (YarnClusterSuite.scala:443)
...
Run completed in 4 minutes, 19 seconds.
Total number of tests run: 28
Suites: completed 2, aborted 0
Tests: succeeded 28, failed 0, canceled 2, ignored 0, pending 0
All tests passed (excluding canceled).
```

2. Dependencies collected into the assembly module, but Python dependencies are not installed or partially installed:

```
build/mvn clean install -DskipTests -Pyarn
build/mvn test -pl resource-managers/yarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Pyarn
```

The relevant tests will be CANCELED:

```
Traceback (most recent call last):
  File "/Users/yangjie01/SourceCode/git/spark-maven/python/pyspark/sql/connect/utils.py", line 47, in require_minimum_grpc_version
    import grpc
ModuleNotFoundError: No module named 'grpc'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/yangjie01/SourceCode/git/spark-maven/python/pyspark/sql/connect/utils.py", line 37, in check_dependencies
    require_minimum_grpc_version()
  File "/Users/yangjie01/SourceCode/git/spark-maven/python/pyspark/sql/connect/utils.py", line 49, in require_minimum_grpc_version
    raise PySparkImportError(
pyspark.errors.exceptions.base.PySparkImportError: [PACKAGE_NOT_INSTALLED] grpcio >= 1.48.1 must be installed; however, it was not found.
- run Python application with Spark Connect in yarn-client mode !!! CANCELED !!!
  checker.isConnectPythonPackagesAvailable was false (YarnClusterSuite.scala:444)
- run Python application with Spark Connect in yarn-cluster mode !!! CANCELED !!!
  checker.isConnectPythonPackagesAvailable was false (YarnClusterSuite.scala:444)
Run completed in 4 minutes, 36 seconds.
Total number of tests run: 28
Suites: completed 2, aborted 0
Tests: succeeded 28, failed 0, canceled 2, ignored 0, pending 0
All tests passed (excluding canceled).
```

3. Dependencies collected into the assembly module, and Python dependencies are installed:

```
build/mvn clean install -DskipTests -Pyarn
build/mvn test -pl resource-managers/yarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Pyarn
```

The tests succeed and no tests will be canceled:

```
Run completed in 4 minutes, 40 seconds.
Total number of tests run: 30
Suites: completed 2, aborted 0
Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #49884 from LuciferYang/YarnClusterSuite-reenable.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
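
For context on the CANCELED results shown in the logs above: ScalaTest's `assume` throws a `TestCanceledException` when its condition is false, so the affected tests are reported as canceled rather than failed. Below is a minimal, self-contained sketch of that gating pattern; the suite name, jar path, and condition are illustrative placeholders, not the actual `YarnClusterSuite` code.

```scala
import java.nio.file.{Files, Paths}
import org.scalatest.funsuite.AnyFunSuite

// Sketch only: gate a test on an environment precondition so that an
// unmet precondition cancels the test instead of failing it.
class ConnectGatingExampleSuite extends AnyFunSuite {

  // Hypothetical precondition: a jar that must be present for the test to make sense.
  private def connectJarAvailable: Boolean =
    Files.exists(Paths.get("assembly/target/some-connect.jar"))

  test("connect-dependent behavior") {
    assume(connectJarAvailable) // reported as CANCELED when false
    assert(1 + 1 == 2)          // the real test body would go here
  }
}
```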
1 parent ba3e271 commit ad8222a

2 files changed (+66, -3 lines)

project/SparkBuild.scala

Lines changed: 9 additions & 1 deletion
```diff
@@ -1208,6 +1208,7 @@ object YARN {
     "Generate config.properties which contains a setting whether Hadoop is provided or not")
   val propFileName = "config.properties"
   val hadoopProvidedProp = "spark.yarn.isHadoopProvided"
+  val buildTestDeps = TaskKey[Unit]("buildTestDeps", "Build needed dependencies for test.")
 
   lazy val settings = Seq(
     Compile / unmanagedResources :=
@@ -1223,7 +1224,14 @@ object YARN {
         (Compile / genConfigProperties).value
         c
       }
-    }).value
+    }).value,
+
+    buildTestDeps := {
+      (LocalProject("assembly") / Compile / Keys.`package`).value
+    },
+    test := ((Test / test) dependsOn (buildTestDeps)).value,
+
+    testOnly := ((Test / testOnly) dependsOn (buildTestDeps)).evaluated
   )
 }
 
```
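
The wiring above relies on sbt task dependencies: `buildTestDeps` packages the `assembly` project, and `test`/`testOnly` are redefined to depend on it. A stripped-down sketch of the same pattern in a standalone `build.sbt` follows; the project names (`core`, `it`) are placeholders, and this is not Spark's actual build definition.

```scala
// build.sbt (sketch): run a preparation task before `test` and `testOnly`.
val buildTestDeps = taskKey[Unit]("Build needed dependencies for test.")

lazy val core = project

lazy val it = project.settings(
  buildTestDeps := {
    // Package a sibling project first, mirroring the assembly packaging step above.
    (core / Compile / Keys.`package`).value
  },
  Test / test := ((Test / test) dependsOn buildTestDeps).value,
  Test / testOnly := ((Test / testOnly) dependsOn buildTestDeps).evaluated
)
```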

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala

Lines changed: 57 additions & 2 deletions
```diff
@@ -62,6 +62,16 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
     }
   }
 
+  private var pyConnectDepChecker: PyConnectDepChecker = _
+
+  private def getOrCreatePyConnectDepChecker(
+      python: String, libPath: Seq[String]): PyConnectDepChecker = {
+    if (pyConnectDepChecker == null) {
+      pyConnectDepChecker = new PyConnectDepChecker(python, libPath)
+    }
+    pyConnectDepChecker
+  }
+
   override def newYarnConfig(): YarnConfiguration = new YarnConfiguration()
 
   private val TEST_PYFILE = """
@@ -266,13 +276,11 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
   }
 
   test("run Python application with Spark Connect in yarn-client mode") {
-    assume(sys.env.contains("GITHUB_ACTIONS"))
     testPySpark(
       true, extraConf = Map(SPARK_API_MODE.key -> "connect"), script = TEST_CONNECT_PYFILE)
   }
 
   test("run Python application with Spark Connect in yarn-cluster mode") {
-    assume(sys.env.contains("GITHUB_ACTIONS"))
     testPySpark(
       false, extraConf = Map(SPARK_API_MODE.key -> "connect"), script = TEST_CONNECT_PYFILE)
   }
@@ -430,6 +438,12 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
       "PYSPARK_PYTHON" -> pythonExecutablePath
     ) ++ extraEnv
 
+    if (extraConf.getOrElse(SPARK_API_MODE.key, SPARK_API_MODE.defaultValueString) == "connect") {
+      val checker = getOrCreatePyConnectDepChecker(pythonExecutablePath, pythonPath)
+      assume(checker.isSparkConnectJarAvailable)
+      assume(checker.isConnectPythonPackagesAvailable)
+    }
+
     val moduleDir = {
       val subdir = new File(tempDir, "pyModules")
       subdir.mkdir()
@@ -869,3 +883,44 @@ private object ExecutorEnvTestApp {
   }
 
 }
+
+private class PyConnectDepChecker(python: String, libPath: Seq[String]) {
+
+  import scala.sys.process.Process
+  import scala.util.Try
+  import scala.util.Properties.versionNumberString
+
+  lazy val isSparkConnectJarAvailable: Boolean = {
+    val filePath = s"$sparkHome/assembly/target/$scalaDir/jars/" +
+      s"spark-connect_$scalaVersion-$SPARK_VERSION.jar"
+    java.nio.file.Files.exists(Paths.get(filePath))
+  }
+
+  lazy val isConnectPythonPackagesAvailable: Boolean = Try {
+    Process(
+      Seq(
+        python,
+        "-c",
+        "from pyspark.sql.connect.utils import check_dependencies;" +
+          "check_dependencies('pyspark.sql.connect.fake_module')"),
+      None,
+      "PYTHONPATH" -> libPath.mkString(File.pathSeparator)).!!
+    true
+  }.getOrElse(false)
+
+  private lazy val scalaVersion = {
+    versionNumberString.split('.') match {
+      case Array(major, minor, _*) => major + "." + minor
+      case _ => versionNumberString
+    }
+  }
+
+  private lazy val scalaDir = s"scala-$scalaVersion"
+
+  private lazy val sparkHome: String = {
+    if (!(sys.props.contains("spark.test.home") || sys.env.contains("SPARK_HOME"))) {
+      fail("spark.test.home or SPARK_HOME is not set.")
+    }
+    sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
+  }
+}
```
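
A note on the `isConnectPythonPackagesAvailable` check above: `scala.sys.process`'s `!!` throws when the spawned command exits with a non-zero status, and the surrounding `Try` turns that failure into `false`. Here is a tiny standalone sketch of the same probe pattern; the `python3 -c "import grpc"` command is just a stand-in, not the suite's actual check.

```scala
import scala.sys.process.Process
import scala.util.Try

object ExternalProbe {
  // True only if the command exits with status 0; `!!` throws for
  // non-zero exit codes, which Try converts into a `false` result.
  def commandSucceeds(cmd: Seq[String]): Boolean =
    Try { Process(cmd).!!; true }.getOrElse(false)

  def main(args: Array[String]): Unit = {
    // Example probe: can this Python interpreter import grpc?
    println(s"grpc importable: ${commandSucceeds(Seq("python3", "-c", "import grpc"))}")
  }
}
```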
