
[SPARK-52612][INFRA] Add an env NO_PROVIDED_SPARK_JARS to control collection behavior of sbt/package for spark-avro.jar and spark-protobuf.jar #51321


Closed
wants to merge 6 commits

Conversation

LuciferYang
Contributor

@LuciferYang LuciferYang commented Jun 30, 2025

What changes were proposed in this pull request?

This PR introduces an environment variable named NO_PROVIDED_SPARK_JARS, which controls the behavior of the sbt/package command so that it only collects spark-avro.jar and spark-protobuf.jar into the assembly/target/scala-2.13/jars directory during documentation generation (a minimal sketch of the guard follows the error log below).

Why are the changes needed?

  1. To ensure that, by default, the sbt/package command does not collect jars with a provided scope, such as spark-avro.jar and spark-protobuf.jar, into the assembly/target/scala-2.13/jars directory, maintaining consistency with Maven's behavior.

  2. To ensure that, during documentation generation, the sbt/package command collects the necessary jars into the assembly/target/scala-2.13/jars directory so that no dependencies are missing for that task.

  3. To avoid the following error when executing benchmark tasks using GitHub Actions:

25/06/28 07:03:45 ERROR SparkContext: Failed to add file:///home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar to Spark environment
java.lang.IllegalArgumentException: requirement failed: File spark-avro_2.13-4.1.0-SNAPSHOT.jar was already registered with a different path (old path = /home/runner/work/spark/spark/connector/avro/target/scala-2.13/spark-avro_2.13-4.1.0-SNAPSHOT.jar, new path = /home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar
...
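
A minimal sketch of what this guard could look like in the sbt jar-collection step (the object and method names, the default value, and the surrounding plumbing are assumptions; only the variable name, the affected jars, and the copy destination come from this PR):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Sketch only: NO_PROVIDED_SPARK_JARS is assumed to default to "true", i.e. provided-scope
// jars such as spark-avro and spark-protobuf are NOT copied into
// assembly/target/scala-2.13/jars, matching Maven's behavior. Documentation generation
// would set it to "false" to opt in.
object ProvidedSparkJarsGuard {
  private val noProvidedSparkJars: Boolean =
    sys.env.getOrElse("NO_PROVIDED_SPARK_JARS", "true").toBoolean

  private def isProvidedScopeJar(name: String): Boolean =
    name.contains("spark-avro") || name.contains("spark-protobuf")

  def maybeCopy(jar: Path, destDir: Path): Unit = {
    val name = jar.getFileName.toString
    if (!isProvidedScopeJar(name) || !noProvidedSparkJars) {
      // Mirrors the Files.copy(...) call in the existing SparkBuild logic.
      Files.copy(jar, destDir.resolve(name), StandardCopyOption.REPLACE_EXISTING)
    }
  }
}
```

Presumably the documentation-generation workflow then runs something like NO_PROVIDED_SPARK_JARS=false build/sbt package so that the two jars are collected; the exact invocation is not shown in this excerpt.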

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Passed GitHub Actions.
  • Manually confirmed that benchmark tasks are not affected and that the ERROR log described above no longer appears during benchmark task execution.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the BUILD label Jun 30, 2025
@LuciferYang LuciferYang marked this pull request as draft June 30, 2025 06:23
@LuciferYang
Contributor Author

It seems that failing to gather spark-avro-*.jar into assembly/target/scala-2.13/jars will cause the documentation generation task to fail:

SELECT schema_of_avro('{"type": "record", "name": "struct", "fields": [{"name": "u", "type": ["int", "string"]}]}', map());
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 260, in <module>
    generate_functions_examples_html(jvm, jspark, html_output_dir)
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 245, in generate_functions_examples_html
    examples = _make_pretty_examples(jspark, infos)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 185, in _make_pretty_examples
    query_output = jspark.sql(query).showString(20, 20, False)
                   ^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1.sql.
: org.apache.spark.sql.AnalysisException: [AVRO_NOT_LOADED_SQL_FUNCTIONS_UNUSABLE] Cannot call the SCHEMA_OF_AVRO SQL function because the Avro data source is not loaded.
Please restart your job or session with the 'spark-avro' package loaded, such as by using the --packages argument on the command line, and then retry your query or command again. SQLSTATE: 22KD3
	at org.apache.spark.sql.errors.QueryCompilationErrors$.avroNotLoadedSqlFunctionsUnusable(QueryCompilationErrors.scala:4298)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.liftedTree3$1(avroSqlFunctions.scala:287)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement$lzycompute(avroSqlFunctions.scala:282)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement(avroSqlFunctions.scala:267)
	at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType(Expression.scala:435)
	at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType$(Expression.scala:435)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.dataType(avroSqlFunctions.scala:227)
	at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:205)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
	at scala.collection.immutable.List.collect(List.scala:268)
	at scala.collection.immutable.List.collect(List.scala:79)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.findAliases(DeduplicateRelations.scala:509)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.$anonfun$renewDuplicatedRelations$4(DeduplicateRelations.scala:115)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.deduplicateAndRenew(DeduplicateRelations.scala:306)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.org$apache$spark$sql$catalyst$analysis$DeduplicateRelations$$renewDuplicatedRelations(DeduplicateRelations.scala:116)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:35)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:28)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:249)
	at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:183)
	at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:179)
	at scala.collection.immutable.List.foldLeft(List.scala:79)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:238)
	at scala.collection.immutable.List.foreach(List.scala:334)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:238)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:320)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:316)
	at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:234)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:316)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:277)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:208)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:208)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.resolveInFixedPoint(HybridAnalyzer.scala:233)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.$anonfun$apply$1(HybridAnalyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.withTrackedAnalyzerBridgeState(HybridAnalyzer.scala:130)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.apply(HybridAnalyzer.scala:86)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:310)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:423)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:310)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$2(QueryExecution.scala:110)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:283)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:658)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:283)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:282)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$1(QueryExecution.scala:110)
	at scala.util.Try$.apply(Try.scala:217)
	at org.apache.spark.util.Utils$.doTryWithCallerStacktrace(Utils.scala:1378)
	at org.apache.spark.util.Utils$.getTryWithCallerStacktrace(Utils.scala:1439)
	at org.apache.spark.util.LazyTry.get(LazyTry.scala:58)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:121)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:80)
	at org.apache.spark.sql.classic.Dataset$.$anonfun$ofRows$5(Dataset.scala:139)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.classic.Dataset$.ofRows(Dataset.scala:136)
	at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:499)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
	at jdk.internal.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:840)
	Suppressed: org.apache.spark.util.Utils$OriginalTryStackTraceException: Full stacktrace of original doTryWithCallerStacktrace caller
		at org.apache.spark.sql.errors.QueryCompilationErrors$.avroNotLoadedSqlFunctionsUnusable(QueryCompilationErrors.scala:4298)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.liftedTree3$1(avroSqlFunctions.scala:287)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement$lzycompute(avroSqlFunctions.scala:282)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement(avroSqlFunctions.scala:267)
		at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType(Expression.scala:435)
		at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType$(Expression.scala:435)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.dataType(avroSqlFunctions.scala:227)
		at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:205)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
		at scala.collection.immutable.List.collect(List.scala:268)
		at scala.collection.immutable.List.collect(List.scala:79)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.findAliases(DeduplicateRelations.scala:509)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.$anonfun$renewDuplicatedRelations$4(DeduplicateRelations.scala:115)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.deduplicateAndRenew(DeduplicateRelations.scala:306)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.org$apache$spark$sql$catalyst$analysis$DeduplicateRelations$$renewDuplicatedRelations(DeduplicateRelations.scala:116)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:35)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:28)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:249)
		at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:183)
		at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:179)
		at scala.collection.immutable.List.foldLeft(List.scala:79)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:238)
		at scala.collection.immutable.List.foreach(List.scala:334)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:238)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:320)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:316)
		at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:234)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:316)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:277)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:208)
		at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:208)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.resolveInFixedPoint(HybridAnalyzer.scala:233)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.$anonfun$apply$1(HybridAnalyzer.scala:95)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.withTrackedAnalyzerBridgeState(HybridAnalyzer.scala:130)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.apply(HybridAnalyzer.scala:86)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:310)
		at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:423)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:310)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$2(QueryExecution.scala:110)
		at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:283)
		at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:658)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:283)
		at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
		at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:282)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$1(QueryExecution.scala:110)
		at scala.util.Try$.apply(Try.scala:217)
		at org.apache.spark.util.Utils$.doTryWithCallerStacktrace(Utils.scala:1378)
		at org.apache.spark.util.LazyTry.tryT$lzycompute(LazyTry.scala:46)
		at org.apache.spark.util.LazyTry.tryT(LazyTry.scala:46)
		... 23 more

                    ------------------------------------------------
      Jekyll 4.4.1   Please append `--trace` to the `build` command 
                     for any additional information or backtrace. 
                    ------------------------------------------------
/__w/spark/spark/docs/_plugins/build_api_docs.rb:195:in `build_sql_docs': SQL doc generation failed (RuntimeError)
	from /__w/spark/spark/docs/_plugins/build_api_docs.rb:236:in `<top (required)>'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:57:in `require'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:57:in `block in require_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:55:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:55:in `require_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:96:in `block in require_plugin_files'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:94:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:94:in `require_plugin_files'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:21:in `conscientious_require'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/site.rb:131:in `setup'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/site.rb:36:in `initialize'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/commands/build.rb:30:in `new'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/commands/build.rb:30:in `process'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/command.rb:91:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/commands/build.rb:18:in `block (2 levels) in init_with_program'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/exe/jekyll:15:in `<top (required)>'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/bin/jekyll:25:in `load'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/bin/jekyll:25:in `<main>'
Error: Process completed with exit code 1.
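
For reference, the AnalysisException above suggests loading the spark-avro package into the session as a workaround. A hedged, standalone example of that (the artifact version is an assumption, and this is not how the docs build itself resolves the jar):

```scala
import org.apache.spark.sql.SparkSession

// Standalone illustration of the workaround the error message suggests: pull in the
// spark-avro package so SCHEMA_OF_AVRO and related functions can resolve.
// The version below is an assumption; use the one matching your Spark build.
object SchemaOfAvroExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema_of_avro example")
      .config("spark.jars.packages", "org.apache.spark:spark-avro_2.13:4.0.0")
      .getOrCreate()

    // The same query the docs generator runs for the SCHEMA_OF_AVRO example.
    spark.sql(
      """SELECT schema_of_avro('{"type": "record", "name": "struct", "fields": [{"name": "u", "type": ["int", "string"]}]}', map())"""
    ).show(truncate = false)

    spark.stop()
  }
}
```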

@github-actions github-actions bot added INFRA and removed BUILD labels Jun 30, 2025
@LuciferYang LuciferYang changed the title [SPARK-52612][BUILD] No longer copy spark-avro-*.jar to the assembly/jars/scala-2.13 directory during sbt/package [SPARK-52612][BUILD] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. Jun 30, 2025
@LuciferYang LuciferYang changed the title [SPARK-52612][BUILD] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. [SPARK-52612][INFRA] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. Jun 30, 2025
@github-actions github-actions bot added the BUILD label Jul 1, 2025
@LuciferYang
Contributor Author

After running sbt/package, we observe that spark-avro-*.jar and spark-protobuf-*.jar are collected in the assembly/target/scala-2.13/jars directory, which differs from the result obtained when executing the equivalent command using Maven.

This can lead to several issues:

  1. There are discrepancies in the output results between dev/make-distribution.sh --tgz --sbt-enabled and dev/make-distribution.sh --tgz.
  2. Since some tests rely on the contents of the assembly/target/scala-2.13/jars directory, different results may occur when testing with SBT and Maven.
  3. Currently, running benchmarks via GitHub Actions may trigger an error, although it does not affect the final output (a toy sketch of the underlying check follows the log below):
25/06/28 07:03:45 ERROR SparkContext: Failed to add file:///home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar to Spark environment
java.lang.IllegalArgumentException: requirement failed: File spark-avro_2.13-4.1.0-SNAPSHOT.jar was already registered with a different path (old path = /home/runner/work/spark/spark/connector/avro/target/scala-2.13/spark-avro_2.13-4.1.0-SNAPSHOT.jar, new path = /home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar
...
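
The ERROR in item 3 comes from a requirement check that rejects registering the same jar file name from two different paths. A toy illustration of that behavior (not Spark's actual SparkContext code; the paths are shortened for readability):

```scala
import scala.collection.mutable

// Toy sketch (not Spark's implementation): registering the same jar file name from a
// second, different path fails, which is what happens when spark-avro sits both under
// connector/avro/target and under assembly/target/scala-2.13/jars.
object DuplicateJarRegistration {
  private val registered = mutable.Map.empty[String, String] // jar file name -> path

  def addJar(path: String): Unit = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    registered.get(name) match {
      case Some(oldPath) if oldPath != path =>
        throw new IllegalArgumentException(
          s"requirement failed: File $name was already registered with a different path " +
            s"(old path = $oldPath, new path = $path)")
      case _ => registered(name) = path
    }
  }

  def main(args: Array[String]): Unit = {
    addJar("/work/spark/connector/avro/target/scala-2.13/spark-avro_2.13-4.1.0-SNAPSHOT.jar")
    addJar("/work/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar") // throws
  }
}
```

With the new default, the assembly directory no longer holds a second copy of the jar, so this check is never hit during benchmarks.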

On the other hand, if spark-avro-*.jar and spark-protobuf-*.jar are simply removed from the sbt/package output, documentation generation fails:

SELECT schema_of_avro('{"type": "record", "name": "struct", "fields": [{"name": "u", "type": ["int", "string"]}]}', map());
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 260, in <module>
    generate_functions_examples_html(jvm, jspark, html_output_dir)
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 245, in generate_functions_examples_html
    examples = _make_pretty_examples(jspark, infos)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 185, in _make_pretty_examples
    query_output = jspark.sql(query).showString(20, 20, False)
                   ^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1.sql.
: org.apache.spark.sql.AnalysisException: [AVRO_NOT_LOADED_SQL_FUNCTIONS_UNUSABLE] Cannot call the SCHEMA_OF_AVRO SQL function because the Avro data source is not loaded.
Please restart your job or session with the 'spark-avro' package loaded, such as by using the --packages argument on the command line, and then retry your query or command again. SQLSTATE: 22KD3

Therefore, there are drawbacks regardless of whether the packaging results include or exclude spark-avro-*.jar/spark-protobuf-*.jar.

Currently, I can think of two solutions:

  1. Align the packaging results of sbt/package with those of mvn package, but this requires special handling for documentation generation (in the current PR, I have introduced a new environment variable to collect spark-avro-*.jar and spark-protobuf-*.jar during documentation generation and avoid failures). However, I have not yet considered whether there are other scenarios that require the inclusion of these two JAR files.
  2. Align the packaging results of mvn package with those of sbt/package. In this case, subsequent Spark distributions will include spark-avro-*.jar and spark-protobuf-*.jar by default, which is a noticeable change for users and may require broader discussion.

WDYT? @dongjoon-hyun @HyukjinKwon @yaooqinn @zhengruifeng

@LuciferYang LuciferYang changed the title [SPARK-52612][INFRA] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. [SPARK-52612][INFRA] package issue Jul 1, 2025
@dongjoon-hyun
Member

For me, I prefer (1) because Maven has always been our official build system for the Spark distribution. BTW, do you know when this situation started to happen?

@LuciferYang
Contributor Author

For me, I prefer (1) because Maven has always been our official build system for the Spark distribution. BTW, do you know when this situation started to happen?

The spark-avro_*.jar was accidentally packaged after SPARK-48763.

@github-actions github-actions bot added DOCS and removed INFRA labels Jul 1, 2025
} else if (jar.getName.contains("spark-connect") &&
    !SbtPomKeys.profiles.value.contains("noshade-connect")) {
  Files.copy(fid.toPath, destJar.toPath)
} else if (jar.getName.contains("spark-protobuf") &&
    !SbtPomKeys.profiles.value.contains("noshade-protobuf")) {
Contributor Author

I've checked, and there is no profile named noshade-protobuf in the project, so there's no need to check for it here.

Contributor Author

In the current PR, I've chosen not to touch the profiles that appear to be non-existent. If they can be cleaned up, I'll handle that in a separate PR.

@LuciferYang
Contributor Author

The tests should have passed. The current solution involves relatively minor modifications. Let's run the full benchmarks to verify the effects of the changes:

If everything goes well, I'll update the PR title and description tomorrow.

@dongjoon-hyun
Member

Thank you so much, @LuciferYang . Looking forward to seeing a good result.

@LuciferYang
Contributor Author

LuciferYang commented Jul 2, 2025

The tests should have passed. The current solution involves relatively minor modifications. Let's run the full benchmarks to verify the effects of the changes:

If everything goes well, I'll update the PR title and description tomorrow.

All micro-benchmarks pass successfully, and logs like "Failed to add file:///home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar to Spark environment" no longer appear.


@LuciferYang LuciferYang changed the title [SPARK-52612][INFRA] package issue [SPARK-52612][INFRA] Add an env NO_PROVIDED_SPARK_JARS to control collection behavior of sbt/package for spark-avro.jar and spark-protobuf.jar Jul 2, 2025
@LuciferYang LuciferYang marked this pull request as ready for review July 2, 2025 02:52
@LuciferYang
Contributor Author

Could you help review this pull request if you have time? @dongjoon-hyun Thanks ~

@dongjoon-hyun dongjoon-hyun left a comment
Member

+1, LGTM. Simple and nice! Thank you, @LuciferYang .

dongjoon-hyun pushed a commit that referenced this pull request Jul 2, 2025
…ollection behavior of `sbt/package` for `spark-avro.jar` and `spark-protobuf.jar`

Closes #51321 from LuciferYang/SPARK-52612.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 591e1c3)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

Merged to master/4.0.

@LuciferYang
Contributor Author

Thank you @dongjoon-hyun ~
