
[SPARK-52612][INFRA] Add an env NO_PROVIDED_SPARK_JARS to control collection behavior of sbt/package for spark-avro.jar and spark-protobuf.jar #51321


Closed
wants to merge 6 commits

Conversation

LuciferYang
Contributor

@LuciferYang LuciferYang commented Jun 30, 2025

What changes were proposed in this pull request?

This PR introduces an environment variable named NO_PROVIDED_SPARK_JARS, which controls the behavior of the sbt/package command so that it only collects spark-avro.jar and spark-protobuf.jar into the assembly/target/scala-2.13/jars directory during documentation generation (a minimal sketch of the guard follows the error log below).

Why are the changes needed?

  1. To ensure that, by default, the sbt/package command does not collect jars with a provided scope, such as spark-avro.jar and spark-protobuf.jar, into the assembly/target/scala-2.13/jars directory, maintaining consistency with Maven's behavior.

  2. To ensure that, during documentation generation, the sbt/package command collects the necessary jars into the assembly/target/scala-2.13/jars directory so that no dependencies are missing for that task.

  3. To avoid the following error when executing benchmark tasks using GitHub Actions:

25/06/28 07:03:45 ERROR SparkContext: Failed to add file:///home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar to Spark environment
java.lang.IllegalArgumentException: requirement failed: File spark-avro_2.13-4.1.0-SNAPSHOT.jar was already registered with a different path (old path = /home/runner/work/spark/spark/connector/avro/target/scala-2.13/spark-avro_2.13-4.1.0-SNAPSHOT.jar, new path = /home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar
...
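
A minimal sketch of what this guard could look like in the sbt jar-collection step (the object and method names, the default value, and the surrounding plumbing are assumptions; only the variable name, the affected jars, and the copy destination come from this PR):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Sketch only: NO_PROVIDED_SPARK_JARS is assumed to default to "true", i.e. provided-scope
// jars such as spark-avro and spark-protobuf are NOT copied into
// assembly/target/scala-2.13/jars, matching Maven's behavior. Documentation generation
// would set it to "false" to opt in.
object ProvidedSparkJarsGuard {
  private val noProvidedSparkJars: Boolean =
    sys.env.getOrElse("NO_PROVIDED_SPARK_JARS", "true").toBoolean

  private def isProvidedScopeJar(name: String): Boolean =
    name.contains("spark-avro") || name.contains("spark-protobuf")

  def maybeCopy(jar: Path, destDir: Path): Unit = {
    val name = jar.getFileName.toString
    if (!isProvidedScopeJar(name) || !noProvidedSparkJars) {
      // Mirrors the Files.copy(...) call in the existing SparkBuild logic.
      Files.copy(jar, destDir.resolve(name), StandardCopyOption.REPLACE_EXISTING)
    }
  }
}
```

Presumably the documentation-generation workflow then runs something like NO_PROVIDED_SPARK_JARS=false build/sbt package so that the two jars are collected; the exact invocation is not shown in this excerpt.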

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Passed GitHub Actions.
  • Manually confirmed that benchmark tasks are not affected and that the ERROR log described above no longer appears during benchmark task execution.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the BUILD label Jun 30, 2025
@LuciferYang LuciferYang marked this pull request as draft June 30, 2025 06:23
@LuciferYang
Contributor Author

It seems that failing to gather spark-avro-*.jar into assembly/target/scala-2.13/jars will cause the documentation generation task to fail:

SELECT schema_of_avro('{"type": "record", "name": "struct", "fields": [{"name": "u", "type": ["int", "string"]}]}', map());
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 260, in <module>
    generate_functions_examples_html(jvm, jspark, html_output_dir)
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 245, in generate_functions_examples_html
    examples = _make_pretty_examples(jspark, infos)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 185, in _make_pretty_examples
    query_output = jspark.sql(query).showString(20, 20, False)
                   ^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1.sql.
: org.apache.spark.sql.AnalysisException: [AVRO_NOT_LOADED_SQL_FUNCTIONS_UNUSABLE] Cannot call the SCHEMA_OF_AVRO SQL function because the Avro data source is not loaded.
Please restart your job or session with the 'spark-avro' package loaded, such as by using the --packages argument on the command line, and then retry your query or command again. SQLSTATE: 22KD3
	at org.apache.spark.sql.errors.QueryCompilationErrors$.avroNotLoadedSqlFunctionsUnusable(QueryCompilationErrors.scala:4298)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.liftedTree3$1(avroSqlFunctions.scala:287)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement$lzycompute(avroSqlFunctions.scala:282)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement(avroSqlFunctions.scala:267)
	at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType(Expression.scala:435)
	at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType$(Expression.scala:435)
	at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.dataType(avroSqlFunctions.scala:227)
	at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:205)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
	at scala.collection.immutable.List.collect(List.scala:268)
	at scala.collection.immutable.List.collect(List.scala:79)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.findAliases(DeduplicateRelations.scala:509)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.$anonfun$renewDuplicatedRelations$4(DeduplicateRelations.scala:115)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.deduplicateAndRenew(DeduplicateRelations.scala:306)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.org$apache$spark$sql$catalyst$analysis$DeduplicateRelations$$renewDuplicatedRelations(DeduplicateRelations.scala:116)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:35)
	at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:28)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:249)
	at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:183)
	at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:179)
	at scala.collection.immutable.List.foldLeft(List.scala:79)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:238)
	at scala.collection.immutable.List.foreach(List.scala:334)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:238)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:320)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:316)
	at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:234)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:316)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:277)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:208)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:208)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.resolveInFixedPoint(HybridAnalyzer.scala:233)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.$anonfun$apply$1(HybridAnalyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.withTrackedAnalyzerBridgeState(HybridAnalyzer.scala:130)
	at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.apply(HybridAnalyzer.scala:86)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:310)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:423)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:310)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$2(QueryExecution.scala:110)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:283)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:658)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:283)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:282)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$1(QueryExecution.scala:110)
	at scala.util.Try$.apply(Try.scala:217)
	at org.apache.spark.util.Utils$.doTryWithCallerStacktrace(Utils.scala:1378)
	at org.apache.spark.util.Utils$.getTryWithCallerStacktrace(Utils.scala:1439)
	at org.apache.spark.util.LazyTry.get(LazyTry.scala:58)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:121)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:80)
	at org.apache.spark.sql.classic.Dataset$.$anonfun$ofRows$5(Dataset.scala:139)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.classic.Dataset$.ofRows(Dataset.scala:136)
	at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:499)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
	at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
	at jdk.internal.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:840)
	Suppressed: org.apache.spark.util.Utils$OriginalTryStackTraceException: Full stacktrace of original doTryWithCallerStacktrace caller
		at org.apache.spark.sql.errors.QueryCompilationErrors$.avroNotLoadedSqlFunctionsUnusable(QueryCompilationErrors.scala:4298)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.liftedTree3$1(avroSqlFunctions.scala:287)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement$lzycompute(avroSqlFunctions.scala:282)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.replacement(avroSqlFunctions.scala:267)
		at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType(Expression.scala:435)
		at org.apache.spark.sql.catalyst.expressions.RuntimeReplaceable.dataType$(Expression.scala:435)
		at org.apache.spark.sql.catalyst.expressions.SchemaOfAvro.dataType(avroSqlFunctions.scala:227)
		at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:205)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$$anonfun$findAliases$1.applyOrElse(DeduplicateRelations.scala:509)
		at scala.collection.immutable.List.collect(List.scala:268)
		at scala.collection.immutable.List.collect(List.scala:79)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.findAliases(DeduplicateRelations.scala:509)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.$anonfun$renewDuplicatedRelations$4(DeduplicateRelations.scala:115)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.deduplicateAndRenew(DeduplicateRelations.scala:306)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.org$apache$spark$sql$catalyst$analysis$DeduplicateRelations$$renewDuplicatedRelations(DeduplicateRelations.scala:116)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:35)
		at org.apache.spark.sql.catalyst.analysis.DeduplicateRelations$.apply(DeduplicateRelations.scala:28)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:249)
		at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:183)
		at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:179)
		at scala.collection.immutable.List.foldLeft(List.scala:79)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:238)
		at scala.collection.immutable.List.foreach(List.scala:334)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:238)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:320)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:316)
		at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:234)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:316)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:277)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:208)
		at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
		at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:208)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.resolveInFixedPoint(HybridAnalyzer.scala:233)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.$anonfun$apply$1(HybridAnalyzer.scala:95)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.withTrackedAnalyzerBridgeState(HybridAnalyzer.scala:130)
		at org.apache.spark.sql.catalyst.analysis.resolver.HybridAnalyzer.apply(HybridAnalyzer.scala:86)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:310)
		at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:423)
		at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:310)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$2(QueryExecution.scala:110)
		at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:283)
		at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:658)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:283)
		at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
		at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:282)
		at org.apache.spark.sql.execution.QueryExecution.$anonfun$lazyAnalyzed$1(QueryExecution.scala:110)
		at scala.util.Try$.apply(Try.scala:217)
		at org.apache.spark.util.Utils$.doTryWithCallerStacktrace(Utils.scala:1378)
		at org.apache.spark.util.LazyTry.tryT$lzycompute(LazyTry.scala:46)
		at org.apache.spark.util.LazyTry.tryT(LazyTry.scala:46)
		... 23 more

                    ------------------------------------------------
      Jekyll 4.4.1   Please append `--trace` to the `build` command 
                     for any additional information or backtrace. 
                    ------------------------------------------------
/__w/spark/spark/docs/_plugins/build_api_docs.rb:195:in `build_sql_docs': SQL doc generation failed (RuntimeError)
	from /__w/spark/spark/docs/_plugins/build_api_docs.rb:236:in `<top (required)>'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:57:in `require'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:57:in `block in require_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:55:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/external.rb:55:in `require_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:96:in `block in require_plugin_files'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:94:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:94:in `require_plugin_files'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/plugin_manager.rb:21:in `conscientious_require'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/site.rb:131:in `setup'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/site.rb:36:in `initialize'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/commands/build.rb:30:in `new'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/commands/build.rb:30:in `process'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/command.rb:91:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/lib/jekyll/commands/build.rb:18:in `block (2 levels) in init_with_program'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/gems/jekyll-4.4.1/exe/jekyll:15:in `<top (required)>'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/bin/jekyll:25:in `load'
	from /__w/spark/spark/docs/.local_ruby_bundle/ruby/3.0.0/bin/jekyll:25:in `<main>'
Error: Process completed with exit code 1.
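
For reference, the AnalysisException above suggests loading the spark-avro package into the session as a workaround. A hedged, standalone example of that (the artifact version is an assumption, and this is not how the docs build itself resolves the jar):

```scala
import org.apache.spark.sql.SparkSession

// Standalone illustration of the workaround the error message suggests: pull in the
// spark-avro package so SCHEMA_OF_AVRO and related functions can resolve.
// The version below is an assumption; use the one matching your Spark build.
object SchemaOfAvroExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema_of_avro example")
      .config("spark.jars.packages", "org.apache.spark:spark-avro_2.13:4.0.0")
      .getOrCreate()

    // The same query the docs generator runs for the SCHEMA_OF_AVRO example.
    spark.sql(
      """SELECT schema_of_avro('{"type": "record", "name": "struct", "fields": [{"name": "u", "type": ["int", "string"]}]}', map())"""
    ).show(truncate = false)

    spark.stop()
  }
}
```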

@github-actions github-actions bot added INFRA and removed BUILD labels Jun 30, 2025
@LuciferYang LuciferYang changed the title [SPARK-52612][BUILD] No longer copy spark-avro-*.jar to the assembly/jars/scala-2.13 directory during sbt/package [SPARK-52612][BUILD] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. Jun 30, 2025
@LuciferYang LuciferYang changed the title [SPARK-52612][BUILD] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. [SPARK-52612][INFRA] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. Jun 30, 2025
@github-actions github-actions bot added the BUILD label Jul 1, 2025
@LuciferYang
Contributor Author

After running sbt/package, we observe that spark-avro-*.jar and spark-protobuf-*.jar are collected in the assembly/target/scala-2.13/jars directory, which differs from the result obtained when executing the equivalent command using Maven.

This can lead to several issues:

  1. There are discrepancies in the output results between dev/make-distribution.sh --tgz --sbt-enabled and dev/make-distribution.sh --tgz.
  2. Since some tests rely on the contents of the assembly/target/scala-2.13/jars directory, different results may occur when testing with SBT and Maven.
  3. Currently, running benchmarks via GitHub Actions may trigger an error, although it does not affect the final output (a toy sketch of the underlying check follows the log below):
25/06/28 07:03:45 ERROR SparkContext: Failed to add file:///home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar to Spark environment
java.lang.IllegalArgumentException: requirement failed: File spark-avro_2.13-4.1.0-SNAPSHOT.jar was already registered with a different path (old path = /home/runner/work/spark/spark/connector/avro/target/scala-2.13/spark-avro_2.13-4.1.0-SNAPSHOT.jar, new path = /home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar
...
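
The ERROR in item 3 comes from a requirement check that rejects registering the same jar file name from two different paths. A toy illustration of that behavior (not Spark's actual SparkContext code; the paths are shortened for readability):

```scala
import scala.collection.mutable

// Toy sketch (not Spark's implementation): registering the same jar file name from a
// second, different path fails, which is what happens when spark-avro sits both under
// connector/avro/target and under assembly/target/scala-2.13/jars.
object DuplicateJarRegistration {
  private val registered = mutable.Map.empty[String, String] // jar file name -> path

  def addJar(path: String): Unit = {
    val name = path.substring(path.lastIndexOf('/') + 1)
    registered.get(name) match {
      case Some(oldPath) if oldPath != path =>
        throw new IllegalArgumentException(
          s"requirement failed: File $name was already registered with a different path " +
            s"(old path = $oldPath, new path = $path)")
      case _ => registered(name) = path
    }
  }

  def main(args: Array[String]): Unit = {
    addJar("/work/spark/connector/avro/target/scala-2.13/spark-avro_2.13-4.1.0-SNAPSHOT.jar")
    addJar("/work/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar") // throws
  }
}
```

With the new default, the assembly directory no longer holds a second copy of the jar, so this check is never hit during benchmarks.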

On the other hand, if spark-avro-*.jar and spark-protobuf-*.jar are simply removed from the sbt/package output, documentation generation fails:

SELECT schema_of_avro('{"type": "record", "name": "struct", "fields": [{"name": "u", "type": ["int", "string"]}]}', map());
Traceback (most recent call last):
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 260, in <module>
    generate_functions_examples_html(jvm, jspark, html_output_dir)
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 245, in generate_functions_examples_html
    examples = _make_pretty_examples(jspark, infos)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/sql/gen-sql-functions-docs.py", line 185, in _make_pretty_examples
    query_output = jspark.sql(query).showString(20, 20, False)
                   ^^^^^^^^^^^^^^^^^
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
  File "/__w/spark/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1.sql.
: org.apache.spark.sql.AnalysisException: [AVRO_NOT_LOADED_SQL_FUNCTIONS_UNUSABLE] Cannot call the SCHEMA_OF_AVRO SQL function because the Avro data source is not loaded.
Please restart your job or session with the 'spark-avro' package loaded, such as by using the --packages argument on the command line, and then retry your query or command again. SQLSTATE: 22KD3

Therefore, there are drawbacks regardless of whether the packaging results include or exclude spark-avro-*.jar/spark-protobuf-*.jar.

Currently, I can think of two solutions:

  1. Align the packaging results of sbt/package with those of mvn package, but this requires special handling for documentation generation (in the current PR, I have introduced a new environment variable to collect spark-avro-*.jar and spark-protobuf-*.jar during documentation generation and avoid failures). However, I have not yet considered whether there are other scenarios that require the inclusion of these two JAR files.
  2. Align the packaging results of mvn package with those of sbt/package. In this case, subsequent Spark distributions will include spark-avro-*.jar and spark-protobuf-*.jar by default, which is a noticeable change for users and may require broader discussion.

WDYT? @dongjoon-hyun @HyukjinKwon @yaooqinn @zhengruifeng

@LuciferYang LuciferYang changed the title [SPARK-52612][INFRA] No longer explicitly add the spark-avro-*.jar to the classpath of the benchmark. [SPARK-52612][INFRA] package issue Jul 1, 2025
@dongjoon-hyun
Member

For me, I prefer (1) because Maven has always been our official build system for the Spark distribution. BTW, do you know when this situation started to happen?

@LuciferYang
Contributor Author

For me, I prefer (1) because Maven has always been our official build system for the Spark distribution. BTW, do you know when this situation started to happen?

The spark-avro_*.jar was accidentally packaged after SPARK-48763.

@github-actions github-actions bot added DOCS and removed INFRA labels Jul 1, 2025
} else if (jar.getName.contains("spark-connect") &&
    !SbtPomKeys.profiles.value.contains("noshade-connect")) {
  Files.copy(fid.toPath, destJar.toPath)
} else if (jar.getName.contains("spark-protobuf") &&
    !SbtPomKeys.profiles.value.contains("noshade-protobuf")) {
Contributor Author

I've checked, and there is no profile named noshade-protobuf in the project, so there's no need to check for it here.

Contributor Author

In the current PR, I've chosen not to touch the profiles that appear to be non-existent. If they can be cleaned up, I'll handle that in a separate PR.

@LuciferYang
Contributor Author

The tests should have passed. The current solution involves relatively minor modifications. Let's run the full benchmarks to verify the effects of the changes:

If everything goes well, I'll update the PR title and description tomorrow.

@dongjoon-hyun
Member

Thank you so much, @LuciferYang . Looking forward to seeing a good result.

@LuciferYang
Contributor Author

LuciferYang commented Jul 2, 2025

The tests should have passed. The current solution involves relatively minor modifications. Let's run the full benchmarks to verify the effects of the changes:

If everything goes well, I'll update the PR title and description tomorrow.

All micro-benchmarks pass successfully, and logs like "Failed to add file:///home/runner/work/spark/spark/assembly/target/scala-2.13/jars/spark-avro_2.13-4.1.0-SNAPSHOT.jar to Spark environment" no longer appear.


@LuciferYang LuciferYang changed the title [SPARK-52612][INFRA] package issue [SPARK-52612][INFRA] Add an env NO_PROVIDED_SPARK_JARS to control collection behavior of sbt/package for spark-avro.jar and spark-protobuf.jar Jul 2, 2025
@LuciferYang LuciferYang marked this pull request as ready for review July 2, 2025 02:52
@LuciferYang
Contributor Author

Could you help review this pull request if you have time? @dongjoon-hyun Thanks ~

@dongjoon-hyun dongjoon-hyun left a comment
Member

+1, LGTM. Simple and nice! Thank you, @LuciferYang .

dongjoon-hyun pushed a commit that referenced this pull request Jul 2, 2025
…ollection behavior of `sbt/package` for `spark-avro.jar` and `spark-protobuf.jar`

Closes #51321 from LuciferYang/SPARK-52612.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 591e1c3)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

Merged to master/4.0.

@LuciferYang
Contributor Author

Thank you @dongjoon-hyun ~
