Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1750] feat(remote merge): Support Spark. #2405

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

zhengchenyu
Copy link
Collaborator

What changes were proposed in this pull request?

Support spark Framework

Why are the changes needed?

#1750

Does this PR introduce any user-facing change?

No.

How was this patch tested?

unit test, integration test, real job in cluster.

Copy link

github-actions bot commented Mar 14, 2025

Test Results

 3 029 files  +18   3 029 suites  +18   6h 41m 32s ⏱️ + 1m 40s
 1 182 tests + 7   1 180 ✅ + 7   2 💤 ±0  0 ❌ ±0 
14 961 runs  +62  14 931 ✅ +62  30 💤 ±0  0 ❌ ±0 

Results for commit 0c89203. ± Comparison against base commit 4300f93.

♻️ This comment has been updated with latest results.

@jerqi jerqi requested a review from zuston March 17, 2025 03:23
Copy link
Member

@zuston zuston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

import java.util
import java.util.Map

object RMSparkSQLTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! . How about renaming the RemoteMergeSparkSQLTest . RM looks Yarn ResourceManager

Copy link
Collaborator Author

@zhengchenyu zhengchenyu Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, when I mention RM, the first thing that comes to my mind is Yarn ResourceManager. But I don't have any other good names.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I will rename RM to RMS or RS. In fact, merge sort is implemented on the server side, but there is no combine, and remote sort is fine.
If so, the previous code and documentation need to be modified. Maybe a new PR is needed.

@zhengchenyu
Copy link
Collaborator Author

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

This test is based on draft pr apache/spark#50248.

@jerqi
Copy link
Contributor

jerqi commented Mar 18, 2025

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

This test is based on draft pr apache/spark#50248.

cc @LuciferYang

@jerqi
Copy link
Contributor

jerqi commented Mar 18, 2025

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

This test is based on draft pr apache/spark#50248.

This will break the code implement of Spark. You would better to insert a new logic plan represents the distribution and partitioning after shuffling. You only need to implement some optimization rules.

@zhengchenyu
Copy link
Collaborator Author

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

This test is based on draft pr apache/spark#50248.

This will break the code implement of Spark. You would better to insert a new logic plan represents the distribution and partitioning after shuffling. You only need to implement some optimization rules.

Are you talking about changes to Spark? My initial idea was also to see if I could add a new rule. Maybe for map side, I could add new rules. But for reduce, adding a new SortExec is determined by determining whether distribution and partitioning match, which is not easy to do by adding a new Rule.
For the draft pr about changes to spark. It is only a draft to verify the feasibility of this proposal. There are still some code architectures that need to be refactored. For example, some partial aggregation in memory logic, add some logic to the rule.

@LuciferYang
Copy link
Contributor

also cc @summaryzb

@LuciferYang
Copy link
Contributor

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

This test is based on draft pr apache/spark#50248.

Does this one depend on SPARK-51398 being merged first?

@jerqi
Copy link
Contributor

jerqi commented Mar 18, 2025

From my sight, this feature now can't be used in Spark SQL. Maybe RDD could use this.

This test is based on draft pr apache/spark#50248.

This will break the code implement of Spark. You would better to insert a new logic plan represents the distribution and partitioning after shuffling. You only need to implement some optimization rules.

Are you talking about changes to Spark? My initial idea was also to see if I could add a new rule. Maybe for map side, I could add new rules. But for reduce, adding a new SortExec is determined by determining whether distribution and partitioning match, which is not easy to do by adding a new Rule. For the draft pr about changes to spark. It is only a draft to verify the feasibility of this proposal. There are still some code architectures that need to be refactored. For example, some partial aggregation in memory logic, add some logic to the rule.

Yes, Meta Cosco ever did similar things. You can see
apache/spark#32944
apache/spark#34702

cc @c21 Excuse me, sorry to bother you. Is it possible that we don't change the code of Spark and only add some rules to implement this feature? Could you give us some suggestion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants