
[CELEBORN-1894] Allow skipping already read chunks during unreplicated shuffle read retried #3132

Closed
wants to merge 14 commits

Conversation

saurabhd336 (Contributor)

What changes were proposed in this pull request?

Whenever a WorkerPartitionReader is recreated (due to Celeborn worker restarts or any other chunk fetch failure), the entire shuffle partition file is re-read from the beginning, with already-read chunks discarded in CelebornInputStream based on the batchIdSet metadata it maintains.

This can be improved (only for cases where shuffle data is unreplicated) by skipping already-read chunk ids, since they would be discarded anyway. This improves overall shuffle read performance (reducer total time, network usage, etc.).
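
To make the idea concrete, here is a minimal hedged sketch (the class and method names are hypothetical stand-ins, not the actual Celeborn classes): the reader records which chunk indices it has already returned, and a reader recreated after a fetch failure receives that set so it can skip those chunks instead of re-fetching data that would be discarded anyway.

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the checkpointing idea; names are stand-ins.
public class ChunkSkipSketch {
  // Chunk indices already handed to the caller by a previous reader.
  private final Set<Integer> returnedChunkIds;
  private int chunkIndex = 0;

  public ChunkSkipSketch(Set<Integer> checkpointedChunkIds) {
    this.returnedChunkIds = new HashSet<>(checkpointedChunkIds);
  }

  // Record a chunk once it has been returned, so a later reader can skip it.
  public void markReturned(int index) {
    returnedChunkIds.add(index);
  }

  // A recreated reader starts from the beginning but skips checkpointed
  // chunks, avoiding re-reads that would be discarded via batchId dedup.
  public boolean shouldSkip(int index) {
    return returnedChunkIds.contains(index);
  }
}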

Why are the changes needed?

To allow skipping already-read shuffle chunks during read retries.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UTs added

@zaynt4606 (Contributor) left a comment

Please run dev/reformat for the [Style check],
and UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite for the configuration change.

@s0nskar (Contributor) left a comment

@saurabhd336 there seem to be some issues with the tests.

shutdownMiniCluster()
}

test(s"test MiniCluster with connection resets, ensure no duplicate reads") {
Contributor

Can we add a negative case to show that there will be duplicated reads if the feature is disabled? Also, a case where replication is enabled.

Contributor Author

Added.

val WORKER_PARTITION_READER_CHECKPOINT_ENABLE: ConfigEntry[Boolean] =
buildConf("celeborn.worker.partition.reader.checkpointEnabled")
.categories("client")
.version("0.5.0")
Contributor

Suggested change
.version("0.5.0")
.version("0.6.0")

Contributor Author

Ack

@SteNicholas (Member)

@saurabhd336, please create a new JIRA ticket for this pull request.

@saurabhd336 (Contributor Author)

@zaynt4606 @s0nskar @SteNicholas Thanks for helping review this. I've fixed the tests and lint issues.
@SteNicholas could you please help me with creating a Celeborn JIRA ticket for this change?

@SteNicholas (Member)

@saurabhd336, you could refer to #1053 for creating a new JIRA ticket.

@s0nskar changed the title from "Allow skipping already read chunks during unreplicated shuffle read retried" to "[CELEBORN-1894] Allow skipping already read chunks during unreplicated shuffle read retried" on Mar 5, 2025
@s0nskar (Contributor)

s0nskar commented Mar 5, 2025

@saurabhd336 had some trouble with the JIRA account, so created one –  CELEBORN-1894

+ " likely by a previous reader for the same partition.",
chunkIndex);
chunkIndex++;
returnedChunks++;
Contributor

Why doesn't toFetch need to decrease when chunkIdsAlreadyReturned.contains(chunkIndex), given that toFetch is fetchMaxReqsInFlight - inFlight + 1?

Contributor Author

The way I read toFetch, it seems to ensure that no more than fetchMaxReqsInFlight requests are submitted at once, while still fetching as many chunks as possible in one go.

My thought process is that if we're skipping certain chunks, we could instead fetch other chunks in the list. WDYT?

Contributor Author

Actually, come to think of it, incrementing toFetch here would be wrong and cause an infinite wait. I added a comment explaining it.

Contributor

Got it.
returnedChunks is increased, and the set of chunks actually fetched does not change.
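
For readers following the thread, a hedged sketch of the bookkeeping under discussion; the variable names follow the diff excerpts above, while the surrounding class and the sendFetchRequest helper are simplified stand-ins rather than the actual WorkerPartitionReader code.

import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the fetch loop discussed above; not the real code.
public class FetchLoopSketch {
  private final Set<Integer> chunkIdsAlreadyReturned = new HashSet<>();
  private int chunkIndex = 0;
  private int endChunkIndex = 0;
  private int returnedChunks = 0;
  private int inFlight = 0;
  private int fetchMaxReqsInFlight = 4;

  void fetchChunks() {
    int toFetch =
        Math.min(fetchMaxReqsInFlight - inFlight + 1, endChunkIndex + 1 - chunkIndex);
    while (toFetch > 0 && chunkIndex <= endChunkIndex) {
      if (chunkIdsAlreadyReturned.contains(chunkIndex)) {
        // Already returned by a previous reader: count it as returned and
        // advance without issuing a fetch. toFetch is deliberately not
        // increased here, since expecting extra in-flight responses that
        // never arrive could make the reader wait forever.
        chunkIndex++;
        returnedChunks++;
      } else {
        sendFetchRequest(chunkIndex); // stand-in for the real chunk fetch RPC
        chunkIndex++;
        toFetch--;
      }
    }
  }

  private void sendFetchRequest(int index) {
    // placeholder: the real reader issues an async fetch and tracks inFlight
  }
}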

@@ -982,6 +982,8 @@ class CelebornConf(loadDefaults: Boolean) extends Cloneable with Logging with Se
def clientFetchTimeoutMs: Long = get(CLIENT_FETCH_TIMEOUT)
def clientFetchBufferSize: Int = get(CLIENT_FETCH_BUFFER_SIZE).toInt
def clientFetchMaxReqsInFlight: Int = get(CLIENT_FETCH_MAX_REQS_IN_FLIGHT)
def isWorkerPartitionReaderCheckpointEnabled: Boolean =
get(WORKER_PARTITION_READER_CHECKPOINT_ENABLE)
Contributor

What about disabling this conf when replication is enabled?
def isWorkerPartitionReaderCheckpointEnabled: Boolean = if (clientPushReplicateEnabled) { false } else { get(WORKER_PARTITION_READER_CHECKPOINT_ENABLE) }

Contributor Author

Makes sense. Ack

Contributor Author

@zaynt4606 It is possible to fall back to reading from the same server even when replication is enabled, e.g. when celeborn.client.adaptive.optimizeSkewedPartitionRead.enabled is set to true and partitions are being split. I'd let CelebornInputStream itself decide whether or not to restore the checkpoint when creating a reader.

results.forEach(ReferenceCounted::release);
results.forEach(
chunk -> {
chunk.getRight().release(); //
Contributor

There is an empty comment (//) left here.

Contributor Author

Fixed.

Exception lastException = null;
PartitionReader reader = null;
Member

Could you move this definition to line 437? The catch block does not use this variable.

Contributor Author

Ack

boolean hasNext();

ByteBuf next() throws IOException, InterruptedException;

void close();

PartitionLocation getLocation();

default T getPartitionReaderCheckpointMetadata() {
Member

Why does this method have a default implementation?

Contributor Author

I initially thought of implementing it only for the WorkerPartitionReader, but I can add a no-op impl for the other types of readers and take up their implementations in a subsequent PR.
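
As an illustration of the no-op approach mentioned here, a hedged sketch of what such a default could look like; the interface is trimmed down and the names are stand-ins, only the default-method shape is the point.

import java.util.Optional;

// Trimmed-down sketch; only the default method shape is illustrative.
interface PartitionReaderCheckpointMetadataSketch {}

interface PartitionReaderSketch {
  // Readers that do not support checkpointing inherit this no-op default
  // and simply report that no checkpoint is available.
  default Optional<PartitionReaderCheckpointMetadataSketch> getPartitionReaderCheckpointMetadata() {
    return Optional.empty();
  }
}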

@@ -4691,6 +4698,15 @@ object CelebornConf extends Logging {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("64k")

val WORKER_PARTITION_READER_CHECKPOINT_ENABLE: ConfigEntry[Boolean] =
buildConf("celeborn.worker.partition.reader.checkpointEnabled")
Member

Suggested change
buildConf("celeborn.worker.partition.reader.checkpointEnabled")
buildConf("celeborn.worker.partition.reader.checkpoint.enabled")

Contributor Author

Ack

@@ -4691,6 +4698,15 @@ object CelebornConf extends Logging {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("64k")

val WORKER_PARTITION_READER_CHECKPOINT_ENABLE: ConfigEntry[Boolean] =
Member

Suggested change
val WORKER_PARTITION_READER_CHECKPOINT_ENABLE: ConfigEntry[Boolean] =
val WORKER_PARTITION_READER_CHECKPOINT_ENABLED: ConfigEntry[Boolean] =

Contributor Author

Ack

import java.util.Set;

/** Checkpoint metadata for a partition reader on the worker side. */
public class WorkerPartitionReaderCheckpointMetadata implements PartitionReaderCheckpointMetadata {
Contributor

This class can be used in worker partition readers, the DFS partition reader, and local partition readers. IMO, there is no need to extract an empty interface for it.

Contributor Author

I initially thought of implementing it only for the WorkerPartitionReader, but I can add a no-op impl for the other types of readers and take up their implementations in a subsequent PR.

if (clientPushReplicateEnabled) {
false
} else
get(WORKER_PARTITION_READER_CHECKPOINT_ENABLE)
Contributor

Save it to a val to avoid repeatedly parsing this config

@saurabhd336 requested a review from zaynt4606 on March 10, 2025, 06:21
@saurabhd336 (Contributor Author)

@zaynt4606 @FMX @SteNicholas @s0nskar Review comments have been addressed. PTAL!

@FMX (Contributor) left a comment

Looks like this optimization is generic for all partition readers.


@Override
public Optional<PartitionReaderCheckpointMetadata> getPartitionReaderCheckpointMetadata() {
// TODO implement similar to {@link WorkerPartitionReader}
Contributor

That implementation is helpful for all partition readers. Why should we leave the implementation as a to-do here?

Contributor Author

I had planned to implement it in a follow-up PR, after testing this for the worker partition reader. Anyway, I've added the implementation for DfsPartitionReader. For LocalPartitionReader, I think the complexity of the implementation outweighs the potential benefits (since for both the worker and DFS readers it's a network call to fetch the chunks, while for the local reader it's just a local file buffer read).

I can take up the impl for the local reader in a follow-up PR. WDYT?

@@ -4691,6 +4694,15 @@ object CelebornConf extends Logging {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("64k")

val PARTITION_READER_CHECKPOINT_ENABLED: ConfigEntry[Boolean] =
buildConf("celeborn.partition.reader.checkpoint.enabled")
Contributor

Maybe rename the configuration to celeborn.client.partition.reader.checkpoint.enabled, since it is only for client use.

Contributor Author

Ack

@@ -176,7 +186,8 @@ public ByteBuf next() throws IOException, InterruptedException {
throw e;
}
returnedChunks++;
return chunk;
chunkIdsAlreadyReturned.add(chunk.getLeft());
Contributor

Only checkpoint when isCheckpointEnabled

Contributor Author

With the new way of checkpointing, this is now handled

int toFetch = Math.min(fetchMaxReqsInFlight - inFlight + 1, endChunkIndex + 1 - chunkIndex);

while (toFetch > 0 && chunkIndex <= endChunkIndex) {
if (chunkIdsAlreadyReturned.contains(chunkIndex)) {
Contributor

isCheckpointEnabled && chunkIdsAlreadyReturned.contains(chunkIndex)

Contributor Author

Now handled

@saurabhd336 requested a review from RexXiong on March 13, 2025, 17:20
@saurabhd336 (Contributor Author)

@FMX I noticed the comment regarding checkpointing only after a chunk is fully read. While I'm not sure in what cases a returned chunk can be not fully read before calling next(), I have nonetheless addressed the comment and now checkpoint the last returned chunk id on each next() call. PTAL!

@FMX (Contributor)

FMX commented Mar 14, 2025

Hi, after some discussion, we found that if a chunk is returned and not fully read, the Spark task will fail, so this won't be a problem here.

@saurabhd336 force-pushed the skipReadChunks branch 2 times, most recently from 162d4f1 to 16483dc on March 14, 2025, 03:00
@saurabhd336 (Contributor Author)

Yes, that's what I had noticed too. I've anyway changed the PR so that the last returned chunk is now checkpointed in the subsequent next() call, ensuring we only checkpoint once the last returned chunk has been fully read. PTAL!

@RexXiong (Contributor) left a comment

LGTM, only a nit

@Override
public ByteBuf next() throws IOException, InterruptedException {
checkException();
checkpoint();
Contributor

nit: IMO checkpoint() can be placed before checkException().

Contributor Author

Yes, makes sense. Ack.
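
A hedged sketch of the resulting next() shape after this change; it is heavily simplified (an int stands in for the real ByteBuf chunk, and the queue handling and fetch callbacks are omitted), but it shows checkpoint() running before checkException() and recording the chunk returned by the previous call, which by then has been fully consumed.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Heavily simplified sketch of the next()/checkpoint() interaction; the real
// reader deals with ByteBuf chunks, an in-flight queue, and fetch callbacks.
public class NextWithCheckpointSketch {
  private final Set<Integer> chunkIdsAlreadyReturned = new HashSet<>();
  private Integer lastReturnedChunkId = null;

  public int next() throws IOException {
    // Checkpoint first: the chunk returned by the previous next() call has
    // been fully consumed by the caller by the time we get here.
    checkpoint();
    checkException();
    int chunkId = fetchNextChunkId(); // stand-in for the real chunk fetch
    lastReturnedChunkId = chunkId;
    return chunkId;
  }

  private void checkpoint() {
    if (lastReturnedChunkId != null) {
      chunkIdsAlreadyReturned.add(lastReturnedChunkId);
    }
  }

  private void checkException() throws IOException {
    // stand-in: the real reader rethrows any async fetch failure here
  }

  private int fetchNextChunkId() {
    return chunkIdsAlreadyReturned.size(); // placeholder progression
  }
}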

@saurabhd336 (Contributor Author)

@zaynt4606 @FMX @SteNicholas @s0nskar PTAL!

@RexXiong closed this in 7571e10 on Mar 18, 2025
@RexXiong (Contributor)

Thanks. Merged to main (v0.6.0).
