[SPARK-51223][CONNECT] Always use an ephemeral port for local connect #49965
base: master
Conversation
The Scala local remote setup works a lot differently, so I wasn't sure what, if anything, could be done with that. Currently if you start two
I quite like this change, but can we do this in 4.1? I am actually thinking about replacing the Py4J server with a Spark Connect server - when we land that change, this change alone makes much more sense without having to think about the Spark Connect case.
I am also thinking about using a Unix Domain Socket instead - I have a draft here: https://github.com/apache/spark/compare/master...HyukjinKwon:spark:SPARK-51156-2?expand=1 but this will likely happen in 4.1
@Kimahriman actually, are you interested in picking https://github.com/apache/spark/compare/master...HyukjinKwon:spark:SPARK-51156-2?expand=1 up and opening a PR? I am currently stuck on some work, so I haven't had time to work on it.
I don't have a strong preference about including this. I was mostly thinking that port conflicts in cluster deploy mode would be unpleasant for users of a distro with the new Connect API mode enabled by default. I don't have time right now to look into the UDS stuff, but I have been playing around with a slightly different approach to securing local connections, using a more generic config that could also be used for remote authentication, if that would help at all.
oh if you have another approach, please go ahead and open a PR 👍
What changes were proposed in this pull request?
Always use an ephemeral port when automatically starting a local Spark Connect server. This prevents port conflicts when a Connect server is started purely to back the local Spark environment, both with `--remote local` and with `--conf spark.api.mode=connect` in PySpark.

Why are the changes needed?
Trying to launch multiple PySpark sessions with either `--remote local` or `spark.api.mode=connect` fails with port conflicts. Additionally, using a cluster deploy mode with PySpark could lead to port conflicts if two drivers start on the same node.

Does this PR introduce any user-facing change?
Yes; it allows you to run multiple automatically launched local Spark Connect servers without manually specifying a port for each one.
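For context, the mechanism behind this is standard ephemeral-port allocation: binding to port 0 asks the OS to assign an unused port. A minimal sketch using plain sockets (an illustration only, not Spark's actual server code):

```python
import socket

# Binding to port 0 asks the OS to assign an unused ephemeral port.
# While both sockets remain open, the two assigned ports are guaranteed
# to differ, which is why two local Connect servers started this way
# cannot collide.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.bind(("127.0.0.1", 0))

p1 = s1.getsockname()[1]
p2 = s2.getsockname()[1]
print(p1, p2)  # two distinct OS-assigned ports

s1.close()
s2.close()
```

Since each server asks the OS for its own port at startup, no fixed default port needs to be shared between concurrently launched sessions.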
How was this patch tested?
Existing UTs which were already using ephemeral ports.
Also manually ran two simultaneous `pyspark --remote local` sessions and two `spark-submit --conf spark.api.mode=connect test.py` runs.
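The launch pattern being exercised here can be sketched generically: the launcher reserves an ephemeral port, then hands it to a child process. This is a hedged illustration of the general technique; the `DEMO_SERVER_PORT` variable name is hypothetical and not what Spark actually uses:

```python
import os
import socket
import subprocess
import sys

# Reserve an ephemeral port up front. There is a small race window after
# close() before the child binds it, which is generally acceptable for a
# local launcher.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# Hand the chosen port to the "server" child process via the environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_SERVER_PORT'])"],
    env={**os.environ, "DEMO_SERVER_PORT": str(port)},
    capture_output=True,
    text=True,
)
print(child.stdout.strip())  # the child saw the launcher's chosen port
```

Because each launcher instance reserves its own port, two such launchers started at the same time pass different ports to their children, matching the two-simultaneous-sessions test above.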
Was this patch authored or co-authored using generative AI tooling?
No