
[SPARK-49428][SQL] Move Connect Scala Client from Connector to SQL #49695

Open · wants to merge 10 commits into master
Conversation

hvanhovell
Contributor

What changes were proposed in this pull request?

This PR moves the connect Scala JVM client project to sql. It also moves the connect/bin and connect/doc to sql.
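A minimal sketch of the equivalent directory moves, assuming the client lands under sql/connect/client/jvm and the bin/docs directories move alongside it (the exact target paths are illustrative assumptions, not a statement of the final layout):

# Illustrative only: relocate the Scala client, scripts, and docs under sql/.
git mv connector/connect/client/jvm sql/connect/client/jvm
git mv connector/connect/bin        sql/connect/bin
git mv connector/connect/docs       sql/connect/docs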

Why are the changes needed?

Connect is part of the sql project now, so it is confusing to keep these directories separate.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@hvanhovell changed the title from "Spark 49428" to "[SPARK-49428][SQL] Move Connect Scala Client from Connector to SQL" on Jan 27, 2025
@LuciferYang
Contributor

LuciferYang commented Jan 28, 2025

./build/mvn $MAVEN_CLI_OPTS -Dtest.exclude.tags="$EXCLUDED_TAGS" -Djava.version=${JAVA_VERSION/-ea} -pl connector/connect/client/jvm,sql/connect/common,sql/connect/server test -fae

[screenshot of the failing Maven test run]

We need to correct some parts of the maven_test.yml file. At the same time, we should consider how to keep it compatible if the code structure now differs between master and branch-4.0, since this yml file is also used for the Maven daily test on branch-4.0.
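For reference, a hedged sketch of how the module list in that Maven invocation might look after the move (assuming the client module ends up at sql/connect/client/jvm; the path is an assumption):

# Illustrative only: same daily-test command with the assumed post-move module path.
./build/mvn $MAVEN_CLI_OPTS -Dtest.exclude.tags="$EXCLUDED_TAGS" \
  -Djava.version=${JAVA_VERSION/-ea} \
  -pl sql/connect/client/jvm,sql/connect/common,sql/connect/server test -fae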

@LuciferYang
Contributor

LuciferYang commented Jan 28, 2025

<argument>${basedir}/../connector/connect/client/jvm/target/connect-repl</argument>

<argument>${basedir}/../connector/connect/client/jvm/target/spark-connect-client-jvm_${scala.binary.version}-${project.version}.jar</argument>

The configuration in assembly/pom.xml also needs to be fixed; otherwise the following error occurs during the build:

[INFO] --- exec:3.2.0:exec (copy-connect-client-repl-jars) @ spark-assembly_2.13 ---
cp: /Users/yangjie01/SourceCode/git/spark-mine-13/assembly/../connector/connect/client/jvm/target/connect-repl: No such file or directory
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:355)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:253)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:884)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:844)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:450)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:328)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:316)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:174)
    at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 (MojoExecutor.java:75)
    at org.apache.maven.lifecycle.internal.MojoExecutor$1.run (MojoExecutor.java:162)
    at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute (DefaultMojosExecutionStrategy.java:39)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:159)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:105)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:73)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:53)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:118)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:261)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:173)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:101)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:906)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:283)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:206)
    at jdk.internal.reflect.DirectMethodHandleAccessor.invoke (DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke (Method.java:580)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:255)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:201)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:361)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:314)
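One way to fix this is to point the copy-connect-client-repl-jars arguments at the new location. A minimal sketch, assuming the client module moves to sql/connect/client/jvm (the path is an assumption, not the final layout):

# Illustrative only: rewrite the stale relative paths in assembly/pom.xml.
sed -i.bak 's|\.\./connector/connect/client/jvm|../sql/connect/client/jvm|g' assembly/pom.xml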

@hvanhovell
Contributor Author

@LuciferYang yeah I was afraid of that. Thanks for the heads-up!

@github-actions github-actions bot added the INFRA label Jan 29, 2025
@hvanhovell
Contributor Author

@LuciferYang do you think this is good to go? I will port this to 4.0 as well.

@LuciferYang
Contributor

Let me double-check that

@LuciferYang
Contributor

There may still be some code that needs to be fixed:

See [Spark Connect Client](https://github.com/apache/spark/tree/master/connector/connect) directory

See [Spark Connect Client](https://github.com/apache/spark/tree/master/connector/connect) directory

[reference](https://github.com/apache/spark/blob/master/connector/connect/docs/client-connection-string.md).

# Copy over the unified ScalaDoc for all projects to api/scala.
# This directory will be copied over to _site when `jekyll` command is run.
copy_and_update_scala_docs("../target/scala-2.13/unidoc", "api/scala")
# copy_and_update_scala_docs("../connector/connect/client/jvm/target/scala-2.13/unidoc", "api/connect/scala")
# Copy over the unified JavaDoc for all projects to api/java.
copy_and_update_java_docs("../target/javaunidoc", "api/java", "api/scala")
# copy_and_update_java_docs("../connector/connect/client/jvm/target/javaunidoc", "api/connect/java", "api/connect/scala")

spark/dev/lint-scala

Lines 37 to 44 in 6bbfa2d

-pl connector/connect/client/jvm \
2>&1 | grep -e "Unformatted files found" \
)
if test ! -z "$ERRORS"; then
echo -e "The scalafmt check failed on sql/connect or connector/connect at following occurrences:\n\n$ERRORS\n"
echo "Before submitting your change, please make sure to format your code using the following command:"
echo "./build/mvn scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl sql/api -pl sql/connect/common -pl sql/connect/server -pl connector/connect/client/jvm"

spark/.github/labeler.yml

Lines 222 to 227 in 6bbfa2d

CONNECT:
- changed-files:
- any-glob-to-any-file: [
'sql/connect/**/*',
'connector/connect/**/*',
'python/**/connect/**/*'

pushd sql/connect/common/src/main &&
echo "Start protobuf breaking changes checking against $BRANCH" &&
buf breaking --against "https://github.com/apache/spark.git#branch=$BRANCH,subdir=connector/connect/common/src/main" &&
echo "Finsh protobuf breaking changes checking: SUCCESS"

connect = Module(
name="connect",
dependencies=[hive, avro, protobuf],
source_file_regexes=[
"sql/connect",
"connector/connect",
],
sbt_test_goals=[
"connect/test",
"connect-client-jvm/test",
],
)

@hvanhovell Could you please review the files above again? Of course, I don't object to fixing some of them in separate pull requests.
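To double-check that nothing is missed, a hedged sketch for locating remaining references to the old layout (illustrative; the output may include references that are intentionally deferred to follow-ups):

# Illustrative only: list files that still mention the old connector/connect paths.
git grep -l "connector/connect" -- docs dev .github assembly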

@LuciferYang
Contributor

(quoting the previous comment: the list of remaining references to connector/connect)

It looks like you've fixed them in the latest commit :) Let's wait for the latest CI results.

@@ -35,7 +35,7 @@ fi

 pushd sql/connect/common/src/main &&
 echo "Start protobuf breaking changes checking against $BRANCH" &&
-buf breaking --against "https://github.com/apache/spark.git#branch=$BRANCH,subdir=connector/connect/common/src/main" &&
+buf breaking --against "https://github.com/apache/spark.git#branch=$BRANCH,subdir=sql/connect/common/src/main" &&
Contributor

Should we perform a branch version check here? If BRANCH is branch-3.x, should we keep comparing it with connector/connect/common/src/main?
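A hedged sketch of what such a branch-aware check could look like (illustrative only; the actual fix, if any, may differ):

# Illustrative only: pick the old or new subdir based on the target branch.
if [[ "$BRANCH" == branch-3.* ]]; then
  SUBDIR="connector/connect/common/src/main"
else
  SUBDIR="sql/connect/common/src/main"
fi
buf breaking --against "https://github.com/apache/spark.git#branch=$BRANCH,subdir=$SUBDIR"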

Contributor Author

Do we run that against 3.5/3.4?

Contributor

@LuciferYang LuciferYang Jan 29, 2025

branch-4.0 will be checked against branch-3.5, right?

Contributor Author

Where do we run that? I grepped through the code base and I can't find where we use it.

Contributor Author

It was used in build_and_test.yml, but not anymore.

@HyukjinKwon is this a dead file?

Contributor

@LuciferYang LuciferYang Jan 29, 2025

It shouldn't be used in GitHub Actions, because GitHub Actions uses the following configuration:

- name: Breaking change detection against branch-4.0
  uses: bufbuild/buf-breaking-action@v1
  with:
    input: sql/connect/common/src/main
    against: 'https://github.com/apache/spark.git#branch=branch-4.0,subdir=sql/connect/common/src/main'

Judging from the commit history, this seems more like a tool for local use by developers. It would be best to have @grundprinzip, @HyukjinKwon, or @zhengruifeng, who have modified this file, confirm it.

Contributor

@LuciferYang left a comment

LGTM
Thanks @hvanhovell

I have locally verified the Maven compilation and tested the three Connect-related modules (connect-common, connect, and connect-client-jvm); they are all OK.
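A hedged sketch of such a local Maven check, assuming the post-move module paths (illustrative; the exact command used is not shown in this thread):

# Illustrative only: build and test the three Connect modules plus their required dependencies.
./build/mvn -pl sql/connect/common,sql/connect/server,sql/connect/client/jvm -am test -fae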
