[SPARK-53898][CORE] Shuffle cleanup should not clean MapOutputTrackerMaster.shuffleStatuses in local cluster #52606

Ngone51 · 2025-10-14T04:57:54Z

What changes were proposed in this pull request?

This PR fixes a bug where MapOutputTrackerMaster.shuffleStatuses is mistakenly cleaned up by the Shuffle Cleanup feature in local cluster. The fix avoids invoking mapOutputTracker.unregisterShuffle() in BlockManagerStorageEndpoint when mapOutputTracker is a MapOutputTrackerMaster, which only happens in local cluster (a non-local cluster should use MapOutputTrackerWorker instead).

Why are the changes needed?

MapOutputTrackerMaster.shuffleStatuses should only be cleaned when ContextCleaner considers the shuffle no longer referenced anywhere. Otherwise, any subsequent access to the same shuffle metadata in MapOutputTrackerMaster (by code that still references that shuffle) can lead to a SparkException and crash the SparkContext. Note that this currently only happens in local cluster, because both the driver and the executor use the MapOutputTrackerMaster. E.g., an ongoing subquery could access shuffle metadata that has already been removed after the main query completed. See the detailed discussion at #52213 (comment).
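The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions: `MasterTrackerSketch`, `ShuffleStatusSketch`, and `getNumPartitions` are hypothetical stand-ins rather than Spark classes or methods, and `IllegalStateException` stands in for the `SparkException` mentioned in the description.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for driver-side shuffle metadata; not Spark's real ShuffleStatus.
final case class ShuffleStatusSketch(numPartitions: Int)

// Simplified model of a driver-side tracker whose map is the single source of truth.
class MasterTrackerSketch {
  val shuffleStatuses = new TrieMap[Int, ShuffleStatusSketch]()

  def registerShuffle(shuffleId: Int, numPartitions: Int): Unit =
    shuffleStatuses.put(shuffleId, ShuffleStatusSketch(numPartitions))

  def unregisterShuffle(shuffleId: Int): Unit =
    shuffleStatuses.remove(shuffleId)

  // A lookup for a shuffle whose metadata was already removed fails loudly.
  def getNumPartitions(shuffleId: Int): Int =
    shuffleStatuses.get(shuffleId) match {
      case Some(status) => status.numPartitions
      case None =>
        throw new IllegalStateException(s"Missing metadata for shuffle $shuffleId")
    }
}

object PrematureCleanupDemo {
  // Returns true when a reader that still references the shuffle fails
  // because the metadata was removed too early.
  def readerFailsAfterEarlyCleanup(): Boolean = {
    val tracker = new MasterTrackerSketch
    tracker.registerShuffle(0, 4)
    tracker.unregisterShuffle(0) // models cleanup at the end of the main query
    try {
      tracker.getNumPartitions(0) // models a subquery that still uses shuffle 0
      false
    } catch {
      case _: IllegalStateException => true
    }
  }
}
```

In this model the late reader has no way to recover, which is why the description argues that only ContextCleaner should decide when the driver-side entry goes away.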

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated the existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Ngone51 · 2025-10-14T04:58:21Z

sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala

      case _ =>
    }
+    // Shuffle cleanup should not clean up shuffle metadata on the driver
+    assert(mapOutputTrackerMaster.shuffleStatuses.nonEmpty)


The test fails this assertion before the fix.

karuppayya · 2025-10-14T18:23:39Z

core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala

+          // through `ContextCleaner` when the shuffle is considered no longer referenced anywhere.
+          // Otherwise, we might hit exceptions if there is any subsequent access (which still
+          // reference that shuffle) to that shuffle metadata in `MapOutputTrackerMaster`. E.g.,
+          // an ongoing subquery could access the same shuffle metadata which could have been


I am not sure if we should mention this, since the ideal behavior would be to terminate any subqueries when the main query completes.

Couldn't there still be a race even if we terminate the subquery?

The main query has already ended. The subqueries' results are not going to be used anyway (and letting them continue is also a waste of resources).

I understand that we should cancel the running subquery, and we should do it. My point is that the running subquery could still access MapOutputTrackerMaster even if we cancel it right after the main query ends, because of the race between the two. So I think it's fine to mention it here.

karuppayya · 2025-10-14T18:33:32Z

core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala

    case RemoveShuffle(shuffleId) =>
      doAsync[Boolean](log"removing shuffle ${MDC(SHUFFLE_ID, shuffleId)}", context) {
-        if (mapOutputTracker != null) {
+        if (mapOutputTracker != null && !mapOutputTracker.isInstanceOf[MapOutputTrackerMaster]) {


Should this check live in unregisterShuffle instead, if we don't expect it to be called on the master?
Someone else could call this same method in local mode in the future.

I was thinking about that. That approach requires us to pass isLocal into MapOutputTrackerMaster, which involves more changes. But I also agree it's safer.

So with local mode, we can't clean up shuffle files only?

I think shuffle cleanup still happens in local.
But the shuffle metadata cleanup only happens from ContextCleaner in driver.
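The split described in this thread (files are always removed, driver-side metadata is left alone) can be sketched as follows. All names here (`TrackerSketch`, `MasterSketch`, `WorkerSketch`, `StorageEndpointSketch`) are hypothetical stand-ins; only the shape of the `isInstanceOf` guard is taken from the diff above, and the `files` set is an assumed simplification of on-disk shuffle files.

```scala
import scala.collection.mutable

// Hypothetical stand-ins; the real classes live in org.apache.spark.
sealed trait TrackerSketch {
  def unregisterShuffle(shuffleId: Int): Unit
}

// Models the driver-side tracker: its map is the source of truth.
final class MasterSketch extends TrackerSketch {
  val shuffleStatuses = mutable.Map[Int, String]()
  def unregisterShuffle(shuffleId: Int): Unit = shuffleStatuses.remove(shuffleId)
}

// Models the executor-side tracker: its map only holds cached copies.
final class WorkerSketch extends TrackerSketch {
  val mapStatuses = mutable.Map[Int, String]()
  def unregisterShuffle(shuffleId: Int): Unit = mapStatuses.remove(shuffleId)
}

// Models the RemoveShuffle handler: shuffle files are always deleted,
// but tracker state is only dropped when the tracker is not the master.
final class StorageEndpointSketch(tracker: TrackerSketch, files: mutable.Set[Int]) {
  def removeShuffle(shuffleId: Int): Boolean = {
    if (tracker != null && !tracker.isInstanceOf[MasterSketch]) {
      tracker.unregisterShuffle(shuffleId)
    }
    files.remove(shuffleId) // true if files for this shuffle existed
  }
}
```

With a `MasterSketch` (the local-cluster case) the files disappear while `shuffleStatuses` keeps its entry; with a `WorkerSketch` both the files and the cached statuses are dropped.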

karuppayya · 2025-10-14T18:42:51Z

cc: @cloud-fan

Ngone51 · 2025-10-15T01:17:42Z

cc @jiangxb1987 @bozhang2820

cloud-fan · 2025-10-18T03:01:28Z

core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala

+          // an ongoing subquery could access the same shuffle metadata which could have been
+          // cleaned up after the main query completes. Note this currently only happens in local
+          // cluster where both driver and executor use the `MapOutputTrackerMaster`.
          mapOutputTracker.unregisterShuffle(shuffleId)


So for a non-local cluster, mapOutputTracker lives on the executors, and unregisterShuffle only unregisters the shuffle on the executor side?

On executors, it would call MapOutputTrackerWorker#unregisterShuffle, where, unlike in MapOutputTrackerMaster#unregisterShuffle, the shuffle statuses are not cleaned up.

In local mode, we only have MapOutputTrackerMaster, which ends up cleaning the shuffleStatuses.

MapOutputTrackerMaster

    def unregisterShuffle(shuffleId: Int): Unit = {
      shuffleStatuses.remove(shuffleId).foreach { shuffleStatus =>
        shuffleStatus.invalidateSerializedMapOutputStatusCache()
        shuffleStatus.invalidateSerializedMergeOutputStatusCache()
      }
    }

MapOutputTrackerWorker

    def unregisterShuffle(shuffleId: Int): Unit = {
      mapStatuses.remove(shuffleId)
      mergeStatuses.remove(shuffleId)
      shufflePushMergerLocations.remove(shuffleId)
    }

so how do we clean up shuffle files with local mode?

The shuffle files are cleaned up by the RemoveShuffle RPC, which is sent from the driver to the executors and handled by BlockManagerStorageEndpoint, which deletes the files on disk. The RemoveShuffle RPC can currently be triggered in two ways: 1) by ContextCleaner, and 2) by the Shuffle Cleanup feature at the end of a SQL query. This is the same in all modes.

ah, so shuffle files are already cleaned up before we reach here?

One idea: shall we add a new method clearShuffleStatusCache and call it here? The executor side shuffle status is more like a cache and the driver side one is single source of truth. MapOutputTrackerMaster#clearShuffleStatusCache is noop.

ah, so shuffle files are already cleaned up before we reach here?

No. We're handling the RemoveShuffle RPC in BlockManagerStorageEndpoint right at this point. The files are deleted at line 72 by shuffleManager.unregisterShuffle(shuffleId).

One idea: shall we add a new method clearShuffleStatusCache and call it here? The executor side shuffle status is more like a cache and the driver side one is single source of truth.

Actually, we already do this in non-local mode. In non-local mode, we call MapOutputTrackerWorker.unregisterShuffle() to clean up the statuses (that's exactly what line 68 does). In local mode, we don't use MapOutputTrackerWorker, so there are no cached statuses to clean.

Yeah, it's the same thing, but clearShuffleStatusCache is more explicit and clearer than skipping unregisterShuffle when the instance is MapOutputTrackerMaster.
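The clearShuffleStatusCache idea proposed in this thread could look roughly like the sketch below. It is only an illustration of the proposal as worded in the comments: the trait and class names (`CacheAwareTracker`, `DriverTracker`, `ExecutorTracker`) are hypothetical, and nothing here shows whether the final commit took this shape.

```scala
import scala.collection.mutable

// Hypothetical API shape for the proposal discussed in the review.
trait CacheAwareTracker {
  def clearShuffleStatusCache(shuffleId: Int): Unit
}

// Driver side: the statuses are the single source of truth,
// so clearing the cache is deliberately a no-op.
final class DriverTracker extends CacheAwareTracker {
  val shuffleStatuses = mutable.Map[Int, String]()
  def clearShuffleStatusCache(shuffleId: Int): Unit = ()
}

// Executor side: the statuses are cached copies and are safe to drop.
final class ExecutorTracker extends CacheAwareTracker {
  val mapStatuses = mutable.Map[Int, String]()
  val mergeStatuses = mutable.Map[Int, String]()
  def clearShuffleStatusCache(shuffleId: Int): Unit = {
    mapStatuses.remove(shuffleId)
    mergeStatuses.remove(shuffleId)
  }
}
```

The design trade-off is the one raised in the thread: a dedicated method makes the cache-versus-source-of-truth distinction explicit at the call site, so the handler no longer needs a type check.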

fix

4eaaea9

github-actions bot added SQL CORE labels Oct 14, 2025

Ngone51 mentioned this pull request Oct 14, 2025

[SPARK-53469][SQL] Ability to cleanup shuffle in Thrift server #52213

Open
