[#2380] Improvement: Eagerly cancel rpc request #2381

Draft
wants to merge 2 commits into
base: master
from

summaryzb
Contributor

What changes were proposed in this pull request?

Make needCancel take effect during RPC retries, so that a pending retry is abandoned as soon as cancellation is requested.

Why are the changes needed?

This helps when the current task is killed because a speculative task attempt has succeeded, but the RPC that sends its data still keeps retrying.

Fix: #2380
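
As an illustration of the idea (not the code in this PR), a retry loop can consult a cancellation supplier before every attempt. The sketch below is self-contained; the class `CancellableRetry`, the method `retry`, and `CancelledException` are hypothetical names, and only the `needCancel`, `retryMax`, and retry-interval concepts come from the request class touched by this change.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical helper, not part of the project: shows a retry loop that
// checks a cancellation supplier before each attempt.
public class CancellableRetry {

  /** Thrown when the caller signals that further attempts are pointless. */
  public static class CancelledException extends RuntimeException {
    public CancelledException(String message) {
      super(message);
    }
  }

  public static <T> T retry(
      Callable<T> call, Supplier<Boolean> needCancel, int retryMax, long retryIntervalMs)
      throws Exception {
    if (retryMax <= 0) {
      throw new IllegalArgumentException("retryMax must be positive");
    }
    Exception last = null;
    for (int attempt = 0; attempt < retryMax; attempt++) {
      // Give up eagerly instead of burning the remaining attempts.
      if (needCancel != null && Boolean.TRUE.equals(needCancel.get())) {
        throw new CancelledException("cancelled before attempt " + attempt);
      }
      try {
        return call.call();
      } catch (Exception e) {
        last = e;
        if (attempt + 1 < retryMax) {
          // Back off before the next attempt; an interrupt propagates to the caller.
          Thread.sleep(retryIntervalMs);
        }
      }
    }
    throw last;
  }
}
```

With a supplier such as `() -> taskKilled.get()` (an assumed flag), a killed task stops retrying at the next attempt boundary rather than after `retryMax` attempts.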

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT

@codecov-commenter

codecov-commenter commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 51.21%. Comparing base (8ad0f8d) to head (7b0f563).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...ffle/client/request/RssSendShuffleDataRequest.java 0.00% 4 Missing ⚠️
...ffle/client/impl/grpc/ShuffleServerGrpcClient.java 0.00% 1 Missing ⚠️
...client/impl/grpc/ShuffleServerGrpcNettyClient.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2381      +/-   ##
============================================
- Coverage     51.34%   51.21%   -0.14%     
+ Complexity     3615     3016     -599     
============================================
  Files           571      481      -90     
  Lines         32892    23193    -9699     
  Branches       2833     2140     -693     
============================================
- Hits          16890    11878    -5012     
+ Misses        14932    10569    -4363     
+ Partials       1070      746     -324     


github-actions bot commented Mar 6, 2025

Test Results

 2 945 files   -  66   2 945 suites   - 66   6h 21m 18s ⏱️ - 23m 19s
 1 147 tests  -  30   1 143 ✅  -  33   1 💤 ±0  0 ❌ ±0   3 🔥 + 3 
14 666 runs   - 248  14 621 ✅  - 278  15 💤 ±0  0 ❌ ±0  30 🔥 +30 

For more details on these errors, see this check.

Results for commit 212a7f9. ± Comparison against base commit 3874f15.

This pull request removes 30 tests.
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testCreateFallback
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testCreateInDriver
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testCreateInDriverDenied
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testCreateInExecutor
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testDefaultIncludeExcludeProperties
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testExcludeProperties
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testIncludeProperties
org.apache.spark.shuffle.DelegationRssShuffleManagerTest ‑ testTryAccessCluster
org.apache.spark.shuffle.FunctionUtilsTests ‑ testOnceFunction0
org.apache.spark.shuffle.RssShuffleManagerTest ‑ testCreateShuffleManagerServer
…

♻️ This comment has been updated with latest results.

@summaryzb
Contributor Author

@jerqi @LuciferYang PTAL

@LuciferYang
Contributor

also cc @advancedxy

@LuciferYang
Contributor

Seems we should add a new test to cover this

@summaryzb summaryzb force-pushed the eagerly_cancel branch 2 times, most recently from 26b3240 to 56fa5ad Compare March 7, 2025 16:21
@summaryzb
Contributor Author

gentle ping @LuciferYang @advancedxy

@@ -29,26 +30,29 @@ public class RssSendShuffleDataRequest {
private int retryMax;
private long retryIntervalMax;
private Map<Integer, Map<Integer, List<ShuffleBlockInfo>>> shuffleIdToBlocks;
private Supplier<Boolean> needCancel;
Contributor

I think this leaks implementation details: it makes RssSendShuffleDataRequest hold a reference to the sending class, which for Spark is DataPusher. I'm not sure this is an elegant way to do it.

Would it be possible for the call
boolean result = ClientUtils.waitUntilDoneOrFail(futures, allowFastFail); in ShuffleWriteClientImpl to be aware of interruption/Spark cancellation and cancel all the sending futures?
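
A rough sketch of what such an interruption-aware wait could look like follows. It is an assumption-laden illustration, not the actual body of ClientUtils.waitUntilDoneOrFail: the class `InterruptAwareWait` and the method `waitAllOrCancel` are hypothetical, and the fast-fail flag is left out for brevity.

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical helper: waits for all sending futures, and cancels every one
// of them if the waiting thread is interrupted or any send fails.
public class InterruptAwareWait {

  public static boolean waitAllOrCancel(List<? extends Future<Boolean>> futures) {
    try {
      for (Future<Boolean> future : futures) {
        if (!Boolean.TRUE.equals(future.get())) {
          cancelAll(futures);
          return false;
        }
      }
      return true;
    } catch (InterruptedException e) {
      // The task was interrupted (for example, killed): stop all pending sends.
      cancelAll(futures);
      // Restore the interrupt flag so callers further up can observe it.
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException | CancellationException e) {
      cancelAll(futures);
      return false;
    }
  }

  private static void cancelAll(List<? extends Future<Boolean>> futures) {
    for (Future<Boolean> future : futures) {
      // true asks the executor to interrupt a send that is already running.
      future.cancel(true);
    }
  }
}
```

The design point is that cancellation is driven by the waiting side, so the request object does not need to carry a reference back to the sender.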

Contributor Author

Actually, the current DataPusher already leaks details to the sending class. This PR does not make that worse, but it achieves eager cancellation at the RPC retry level.
Being aware of interruption/Spark cancellation is a good idea; I'll follow that approach.

@summaryzb summaryzb marked this pull request as draft March 17, 2025 11:25
Successfully merging this pull request may close these issues.

[Improvement] Eagerly cancel rpc request
4 participants