[SPARK-51223][CONNECT] Always use an ephemeral port for local connect #49965

Open · wants to merge 1 commit into master
Conversation

@Kimahriman (Contributor) commented Feb 15, 2025

What changes were proposed in this pull request?

Always use an ephemeral port when automatically starting a local Spark Connect server. This prevents port conflicts when a Connect server is started purely to back the local Spark environment, both with --remote local and with --conf spark.api.mode=connect in PySpark.
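
To make the mechanics concrete, here is a minimal sketch of the ephemeral-port idea. It is my own illustration rather than this PR's actual code; find_free_port is a hypothetical helper, and it assumes spark.connect.grpc.binding.port is the setting the local Connect server reads for its listening port:

```python
# Illustration only: grab an OS-assigned ephemeral port and hand it to the
# automatically started local Spark Connect server instead of the default 15002.
import socket

def find_free_port() -> int:
    # Binding to port 0 lets the OS pick any currently free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()
# Assumption: the local server honors spark.connect.grpc.binding.port; note the
# small race window between releasing the socket and the server binding it.
overrides = {"spark.connect.grpc.binding.port": str(port)}
print(overrides)
```

An alternative that avoids the race between releasing the probe socket and the server binding it is to let the server itself bind to port 0 and report back the port it actually got.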

Why are the changes needed?

Trying to launch multiple PySpark sessions with either --remote local or spark.api.mode=connect fails with port conflicts. Additionally, using a cluster deploy mode with PySpark would lead to port conflicts if two drivers start on the same node.

Does this PR introduce any user-facing change?

Yes, it allows you to run multiple automatically launched local Spark Connect servers without manually specifying a port for each one.

How was this patch tested?

Existing UTs which were already using ephemeral ports.

Also manually tested by running two simultaneous pyspark --remote local sessions and two simultaneous spark-submit --conf spark.api.mode=connect test.py jobs.
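
For reference, this is the kind of reproduction I had in mind (my own sketch, not the PR's test code): two independent PySpark processes that each auto-start a local Connect server, which before this change could collide on the default port 15002. It assumes a PySpark build with the connect extras installed:

```python
# Reproduction sketch: run two PySpark-Connect processes side by side.
import subprocess
import sys
import textwrap

child = textwrap.dedent(
    """
    from pyspark.sql import SparkSession
    # remote("local") triggers the automatically started local Connect server.
    spark = SparkSession.builder.remote("local").getOrCreate()
    print(spark.range(5).count())
    spark.stop()
    """
)

# Launch both children concurrently; with ephemeral ports both should succeed
# instead of the second failing with an "address already in use" style error.
procs = [subprocess.Popen([sys.executable, "-c", child]) for _ in range(2)]
assert all(p.wait() == 0 for p in procs)
```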

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman (Contributor, Author)

@HyukjinKwon @hvanhovell

The Scala local remote setup works quite differently, so I wasn't sure what, if anything, could be done there. Currently, if you start two spark-shell --remote local sessions, the second one just silently skips creating a new Spark Connect server and connects to the first one instead. Not sure if this is intentional or not. I'm also not sure how the Connect API mode is supposed to work in Scala with a cluster deploy mode, like on YARN, since it uses the start-connect-server script, which I don't think exists in the uploaded artifacts? Not totally sure on that one.

@HyukjinKwon (Member) left a comment

I quite like this change, but can we do this in 4.1? I am actually thinking about replacing the Py4J server with a Spark Connect server - once we land that change, this change alone makes much more sense without having to think about the Spark Connect case.

@HyukjinKwon (Member)

I am also thinking about using a Unix Domain Socket instead - I have a draft here: https://github.com/apache/spark/compare/master...HyukjinKwon:spark:SPARK-51156-2?expand=1 - but this will likely happen in 4.1.
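
For readers unfamiliar with the idea, here is a generic gRPC sketch of what a Unix Domain Socket endpoint looks like; this is not the draft's code, and the socket path and any Spark-side wiring are hypothetical:

```python
# Generic gRPC illustration of a Unix Domain Socket endpoint (socket path is hypothetical).
from concurrent import futures
import grpc

# Server side: listen on a socket file instead of a TCP port, so there is
# no TCP port allocation to conflict on at all.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
server.add_insecure_port("unix:///tmp/spark-connect.sock")
server.start()

# Client side: dial the same socket file.
channel = grpc.insecure_channel("unix:///tmp/spark-connect.sock")
```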

@HyukjinKwon (Member)

@Kimahriman actually, are you interested in picking https://github.com/apache/spark/compare/master...HyukjinKwon:spark:SPARK-51156-2?expand=1 up and opening a PR? I am currently stuck on some other work, so I haven't had time to get it working.

@Kimahriman (Contributor, Author)

I don't have a strong preference about including this. I was mostly thinking that port conflicts in cluster deploy mode would be an unpleasant surprise for users of a distro with the new Connect API mode enabled by default.

I don't have time right now to look into the UDS stuff, but I have been playing around with a slightly different approach to securing local connections: a more generic config that could also be used for remote authentication, if that would help at all.

@HyukjinKwon (Member)

oh if you have another approach, please go ahead and open a PR 👍
