Replace HashSet with ConcurrentHashMap.newKeySet #3100

aidar-stripe · 2025-02-15T01:25:04Z

What changes were proposed in this pull request?

Replace the HashSet of PartitionLocations with its concurrent counterpart, ConcurrentHashMap.newKeySet.

Why are the changes needed?

We are seeing race conditions between handleGetReducerFileGroup and tryFinalCommit, where reducers complete without processing a partition even though it has data.

Problematic logs

On the driver side:

25/01/31 14:23:02 {} INFO org.apache.celeborn.client.commit.ReducePartitionCommitHandler: Shuffle 23 commit files complete. File count 23200 using 240180 ms
...
25/01/31 14:23:02 {} INFO org.apache.celeborn.client.commit.ReducePartitionCommitHandler: Shuffle 23 partition 11931-0: primary lost, use replica PartitionLocation[
  id-epoch:11931-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.68.138.242-39557-35555-37139-39685
  mode:REPLICA
  peer:(empty)
  storage hint:StorageInfo{type=SSD, mountPoint='', finalResult=true, filePath=}
  mapIdBitMap:null].
...
25/01/31 14:23:02 {} INFO org.apache.celeborn.client.commit.ReducePartitionCommitHandler: Succeed to handle stageEnd for 23.

On the executor side:

25/01/31 14:23:02 {executorId=92, jobId=28, partitionId=420, stageId=74, taskAttemptId=82047} INFO org.apache.celeborn.client.ShuffleClientImpl: Shuffle 23 request reducer file group success using 59315 ms, result partition size 12000
...
25/01/31 14:40:54 {executorId=92, partitionId=11931, taskAttemptId=93846} INFO org.apache.spark.executor.Executor: Running task 11931.0 in stage 74.0 (TID 93846)
25/01/31 14:40:54 {jobId=28, executorId=92, taskAttemptId=93846, partitionId=11931, stageId=74} INFO org.apache.spark.shuffle.celeborn.SparkShuffleManager: Shuffle 24 write mode is changed to SORT because partition count 12000 is greater than threshold 2000
25/01/31 14:40:54 {executorId=92, jobId=28, partitionId=11931, stageId=74, taskAttemptId=93846} INFO org.apache.spark.shuffle.celeborn.CelebornShuffleReader: BatchOpenStream for 0 cost 0ms
25/01/31 14:40:54 {} WARN org.apache.celeborn.client.ShuffleClientImpl: Shuffle data is empty for shuffle 23 partition 11931.

How was this patch tested?

No additional tests were added. I tried to reproduce the issue, but we have only seen it happen with a high number of nodes and over long execution times.

More explanation on why/how this happens

// write path
 override def setStageEnd(shuffleId: Int): Unit = {
    getReducerFileGroupRequest synchronized {
      stageEndShuffleSet.add(shuffleId)
    }
....

// read path
 override def handleGetReducerFileGroup(context: RpcCallContext, shuffleId: Int): Unit = {
    // Quick return for ended stage, avoid occupy sync lock.
    if (isStageEnd(shuffleId)) {
      replyGetReducerFileGroup(context, shuffleId)
    } else {
      getReducerFileGroupRequest.synchronized {
...

override def isStageEnd(shuffleId: Int): Boolean = {
    stageEndShuffleSet.contains(shuffleId)
  }

The only visibility guarantees between the write path and the read path come from ConcurrentHashMap's volatile values, so there is no guarantee that the reader thread sees the full contents of a plain HashSet stored inside it.
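
A minimal sketch of the proposed swap, written in plain Java because the collections involved are java.util classes. The class and method names (FileGroupSketch, addLocation, locationCount) and the use of String in place of PartitionLocation are assumptions for illustration, not Celeborn code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class FileGroupSketch {
    // partitionId -> locations; String stands in for PartitionLocation (assumption).
    private final ConcurrentHashMap<Integer, Set<String>> fileGroups = new ConcurrentHashMap<>();

    // With `new HashSet<>()` as the inner set, elements added after the set is
    // already in the map carry no visibility guarantee for other threads.
    // ConcurrentHashMap.newKeySet() returns a thread-safe set instead.
    void addLocation(int partitionId, String location) {
        fileGroups
            .computeIfAbsent(partitionId, id -> ConcurrentHashMap.newKeySet())
            .add(location);
    }

    // Number of locations recorded for a partition, or 0 if none were recorded.
    int locationCount(int partitionId) {
        Set<String> locations = fileGroups.get(partitionId);
        return locations == null ? 0 : locations.size();
    }

    public static void main(String[] args) {
        FileGroupSketch sketch = new FileGroupSketch();
        sketch.addLocation(11931, "replica-location");
        System.out.println(sketch.locationCount(11931)); // prints 1
    }
}
```

The call sites stay the same because both HashSet and the key-set view implement java.util.Set; only the constructor expression changes.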

FMX · 2025-02-17T02:44:21Z

@aidar-stripe Hi, PR #2986 "[CELEBORN-1769] Fix packed partition location cause GetReducerFileGroupResponse lose location" fixed the scenario you might have encountered.

Is that PR in your distribution, or can you provide the commit ID of your worker distribution?

aidar-stripe · 2025-02-18T17:18:03Z

@FMX thanks for the link! I think you are absolutely right here: we were running a version of the Celeborn client without that fix (0.5.1 plus some of our own commits for integrity checks, which were disabled).

I can confirm that the PbGetReducerFileGroupResponse conversion code there only takes primaries:

        val fileGroup = pbGetReducerFileGroupResponse.getFileGroupsMap.asScala.map {
          case (partitionId, fileGroup) =>
            (
              partitionId,
              PbSerDeUtils.fromPbPackedPartitionLocationsPair(
                fileGroup.getPartitionLocationsPair)._1.asScala.toSet.asJava)
        }.asJava

This explains the consistency of the failures we've seen much better than the potential concurrency issue with the HashSet. I would still like to merge the PR, though; I think a concurrent hash set is still the more appropriate choice there.
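
To make the failure mode concrete, here is a hedged Java sketch of why reading only the first element of a (primaries, replicas) pair yields an empty file group for a partition whose primary was lost. It is not Celeborn's PbSerDeUtils code: LocationsPair, primariesOnly and primariesAndReplicas are hypothetical names, and primariesAndReplicas is shown only for contrast, not as a description of what #2986 does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PackedPairSketch {
    // Hypothetical stand-in for the decoded (primaries, replicas) pair.
    static final class LocationsPair {
        final List<String> primaries;
        final List<String> replicas;

        LocationsPair(List<String> primaries, List<String> replicas) {
            this.primaries = primaries;
            this.replicas = replicas;
        }
    }

    // Mirrors taking `._1` only: replica locations are silently dropped.
    static Set<String> primariesOnly(LocationsPair pair) {
        return new HashSet<>(pair.primaries);
    }

    // Contrast case: keep both halves of the pair.
    static Set<String> primariesAndReplicas(LocationsPair pair) {
        Set<String> all = new HashSet<>(pair.primaries);
        all.addAll(pair.replicas);
        return all;
    }

    public static void main(String[] args) {
        // Partition 11931-0 after "primary lost, use replica": no primary, one replica.
        LocationsPair lostPrimary =
            new LocationsPair(Collections.emptyList(), Arrays.asList("11931-0@replica"));
        System.out.println(primariesOnly(lostPrimary).size());        // prints 0
        System.out.println(primariesAndReplicas(lostPrimary).size()); // prints 1
    }
}
```

An empty set for the partition on the reader side matches the "Shuffle data is empty for shuffle 23 partition 11931" warning in the executor log.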

FMX · 2025-02-24T02:52:20Z

@aidar-stripe You can refer to the code

celeborn/client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala

Lines 154 to 166 in f094821

    
           if (isStageEnd(shuffleId)) { 
        
             logInfo(s"[handleStageEnd] Shuffle $shuffleId already ended!") 
        
             return false 
        
           } else { 
        
             inProcessStageEndShuffleSet.synchronized { 
        
               if (inProcessStageEndShuffleSet.contains(shuffleId)) { 
        
                 logWarning(s"[handleStageEnd] Shuffle $shuffleId is in process!") 
        
                 return false 
        
               } else { 
        
                 inProcessStageEndShuffleSet.add(shuffleId) 
        
               } 
        
             } 
        
           }

You will see that only one thread writes to the partition location set at a time, so there is no concurrency issue here.

aidar-stripe · 2025-02-25T05:02:17Z

That's correct: synchronization on inProcessStageEndShuffleSet in ReducePartitionCommitHandler#tryFinalCommit ensures that only one thread completes the commit. The reasoning behind the race is:

The commit handler (thread 1) calls collectResult and populates reducerFileGroupsMap: ConcurrentHashMap[Int, ConcurrentHashMap[Integer, util.Set[PartitionLocation]]] with a non-thread-safe HashSet container.
At the same time, if a handleGetReducerFileGroup call (thread 2) comes in, it might not see the elements of the HashSet here:

celeborn/client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala

Line 351 in f094821

replyGetReducerFileGroup(context, shuffleId)

I think the closest analogy is this example: https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#pitfall-volatiles-wrong. I don't think there's a huge risk of this happening: handleGetReducerFileGroup would have to come in at the same time as the commit completes.
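
The two-thread scenario can be sketched in Java under stated assumptions: one writer thread adds an element while a reader thread polls for it. The names (VisibilitySketch, readerSeesWrite) are hypothetical, and this does not reproduce the Celeborn race; it only shows the visibility that ConcurrentHashMap.newKeySet provides and that a plain HashSet does not promise.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class VisibilitySketch {
    // Returns true once the reader thread has observed the writer's element.
    static boolean readerSeesWrite() {
        Set<Integer> locations = ConcurrentHashMap.newKeySet();
        boolean[] seen = new boolean[1];

        Thread reader = new Thread(() -> {
            // ConcurrentHashMap retrievals reflect the most recently completed
            // updates, so this loop ends after the writer's add() completes.
            // With a plain HashSet there is no such guarantee.
            while (!locations.contains(11931)) {
                Thread.yield();
            }
            seen[0] = true;
        });
        Thread writer = new Thread(() -> locations.add(11931));

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join(); // join() makes the reader's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(readerSeesWrite()); // prints true
    }
}
```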

I'm happy to leave this as is, since we've added some additional integrity checks on our side. But changing the HashSet to a concurrent version should be relatively cheap, especially considering that all the other structures are concurrent. Regardless of the decision, thanks for responding and reviewing!

Commit d9c9024: Replace HashSet with ConcurrentHashMap.newKeySet to get rid of race between handleGetReducerFileGroup & tryFinalCommit
