
[CELEBORN-1469] Support writing shuffle data to OSS(S3 only) #2579

Closed · wants to merge 45 commits into apache:main from zhaohehuhu:dev-0619

Conversation

zhaohehuhu (Contributor) commented Jun 19, 2024

What changes were proposed in this pull request?

Add support for writing shuffle data to object storage (S3 only), as described in the title.

Why are the changes needed?

Currently, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which is a limitation when moving on-premises servers to AWS and using S3 as the sink for shuffled data.

Does this PR introduce any user-facing change?

No

How was this patch tested?

zhaohehuhu changed the title "support flush disk buffer to OSS(S3 only)" → "Support shuffle data to OSS(S3 only)" on Jun 19, 2024
zhaohehuhu changed the title "Support shuffle data to OSS(S3 only)" → "Support writing shuffle data to OSS(S3 only)" on Jun 19, 2024
FMX changed the title "Support writing shuffle data to OSS(S3 only)" → "[CELEBORN-1463] Support writing shuffle data to OSS(S3 only)" on Jun 19, 2024
FMX (Contributor) commented Jun 19, 2024

Hi @zhaohehuhu, there are some tools that can check PR code locally.
Running ./dev/reformat checks code style and compilation errors.

FMX changed the title "[CELEBORN-1463] Support writing shuffle data to OSS(S3 only)" → "[CELEBORN-1469] Support writing shuffle data to OSS(S3 only)" on Jun 19, 2024
SteNicholas (Member) commented:

@zhaohehuhu, please add a comment with the test result.

FMX (Contributor) commented Jun 19, 2024

UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite
Running this updates the generated docs, since you have changed CelebornConf.

zhaohehuhu (Author) commented:

> Hi @zhaohehuhu, there are some tools that can check PR code locally. Running ./dev/reformat checks code style and compilation errors.

Done. Thanks.

zhaohehuhu (Author) commented:

> UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite
> Running this updates the generated docs, since you have changed CelebornConf.

Done. Thanks.

zhaohehuhu (Author) commented:

> @zhaohehuhu, please add a comment with the test result.

Got it. Thanks.

zhaohehuhu (Author) commented Jun 20, 2024

@SteNicholas:

> please add a comment with the test result.

[Screenshot 2024-06-20 at 16:30:29]

Hi bro, this PR is only part of the S3 storage layer feature, so I can't provide a complete test now. But the screenshot above shows that shuffle data can sink into S3.

FMX (Contributor) commented Jun 21, 2024

Thanks for your effort. I'll complete the review within the next week.

FMX self-requested a review on June 25, 2024 06:05

FMX (Contributor) left a review:

Thanks for your contribution. Based on this PR, I think the Celeborn worker should support using HDFS and S3 concurrently. That will need some changes.

common/pom.xml Outdated
@@ -187,6 +187,16 @@
<artifactId>hadoop-client-runtime</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
FMX (Contributor):

These two dependencies should be put into LICENSE-binary and NOTICE-binary.

zhaohehuhu (Author): Got it. Thanks.


@@ -160,6 +160,10 @@ public boolean HDFSOnly() {
return StorageInfo.HDFSOnly(availableStorageTypes);
}

public static boolean OSSOnly(int availableStorageTypes) {
FMX (Contributor): So I think this can be changed to S3Only.

zhaohehuhu (Author): Got it. Thanks.

@@ -1106,6 +1110,49 @@ class CelebornConf(loadDefaults: Boolean) extends Cloneable with Logging with Se
def partitionSplitMinimumSize: Long = get(WORKER_PARTITION_SPLIT_MIN_SIZE)
def partitionSplitMaximumSize: Long = get(WORKER_PARTITION_SPLIT_MAX_SIZE)

def s3AccessKey: String = get(S3_ACCESS_KEY).map {
FMX (Contributor): get(S3_ACCESS_KEY) won't return an empty string, so this check is redundant.

FMX (Contributor): Using getOrElse("") would be enough.

zhaohehuhu (Author): Got it. Thanks.

}
}.getOrElse("")
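For illustration, a minimal sketch of the simplification suggested above, assuming the underlying getter returns Option[String] (the object and the env-var stand-ins below are hypothetical, not code from CelebornConf):

```scala
// Minimal sketch, not the actual CelebornConf: when the lookup already yields
// Option[String], the explicit map/empty-string check collapses to getOrElse.
object S3ConfSketch {
  // Hypothetical stand-ins for get(S3_ACCESS_KEY) / get(S3_SECRET_KEY).
  private val s3AccessKeyOpt: Option[String] = sys.env.get("AWS_ACCESS_KEY_ID")
  private val s3SecretKeyOpt: Option[String] = sys.env.get("AWS_SECRET_ACCESS_KEY")

  def s3AccessKey: String = s3AccessKeyOpt.getOrElse("")
  def s3SecretKey: String = s3SecretKeyOpt.getOrElse("")
}
```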

def s3SecretKey: String = get(S3_SECRET_KEY).map {
FMX (Contributor): ditto.

zhaohehuhu (Author): Got it. Thanks.

keepBuffer: Boolean) extends FlushTask(buffer, notifier, keepBuffer) {
override def flush(): Unit = {
if (StorageManager.hadoopFs.exists(path)) {
val conf = StorageManager.hadoopFs.getConf
FMX (Contributor): Is there any difference in append behavior between S3 and HDFS? What are the benefits of this logic?

zhaohehuhu (Author) commented Jun 27, 2024:

Yup. The main reason is that S3 doesn't support append mode yet.

maobaolong (Member): @zhaohehuhu Is there another approach to integrating a new filesystem like S3 that cannot support append? This looks inefficient: it copies the old data from S3 to the worker and writes it back to S3 again, which can amplify writes dramatically.

zhaohehuhu (Author): Correct. The current implementation is just a workaround for the limitation that S3 doesn't support append mode. I'm figuring out a better solution to avoid copy-and-rewrite. @maobaolong

maobaolong (Member): @zhaohehuhu That's great; looking forward to learning more about the new solution.

zhaohehuhu (Author): Thanks @maobaolong
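For context, here is a rough sketch of the copy-and-rewrite workaround under discussion, written against the Hadoop FileSystem API (the helper name is hypothetical; this is not code from the PR): because S3 has no append, the existing object is read back in full and rewritten together with the new bytes.

```scala
import java.io.ByteArrayOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper: emulate append on a filesystem without append support
// by reading the old object and rewriting old + new bytes as a fresh object.
def appendByRewrite(fs: FileSystem, path: Path, newBytes: Array[Byte]): Unit = {
  val old = new ByteArrayOutputStream()
  if (fs.exists(path)) {
    val in = fs.open(path)
    try IOUtils.copyBytes(in, old, 4096, false) // read existing object fully
    finally in.close()
  }
  val out = fs.create(path, true) // overwrite = true: replace the object
  try {
    out.write(old.toByteArray)
    out.write(newBytes)
  } finally out.close()
}
```

This makes every flush cost O(current file size), which is exactly the write amplification pointed out above; buffering a partition locally and uploading it once, or using S3 multipart upload, would avoid the re-read.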

@@ -56,7 +56,7 @@ final private[worker] class StorageManager(conf: CelebornConf, workerSource: Abs
// mount point -> file writer
val workingDirWriters =
JavaUtils.newConcurrentHashMap[File, ConcurrentHashMap[String, PartitionDataWriter]]()
val hdfsWriters = JavaUtils.newConcurrentHashMap[String, PartitionDataWriter]()
val dfsWriters = JavaUtils.newConcurrentHashMap[String, PartitionDataWriter]()
FMX (Contributor): Do not merge them into one map. Split this into two writer maps.
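A minimal sketch of the split the reviewer asks for; s3Writers is an assumed name mirroring the existing hdfsWriters:

```scala
// Sketch: keep one writer map per storage backend instead of a merged dfsWriters.
val hdfsWriters = JavaUtils.newConcurrentHashMap[String, PartitionDataWriter]()
val s3Writers = JavaUtils.newConcurrentHashMap[String, PartitionDataWriter]()
```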

throw new IOException("Empty working directory configuration!")
}

DeviceInfo.getDeviceAndDiskInfos(workingDirInfos, conf)
}
val mountPoints = new util.HashSet[String](diskInfos.keySet())
val hdfsDiskInfo =
val dfsDiskInfo =
FMX (Contributor): Just add a new S3 DiskInfo.

val (hdfsFlusher, _totalHdfsFlusherThread) =
if (hasHDFSStorage) {
logInfo(s"Initialize HDFS support with path $hdfsDir")
val (dfsFlusher, _totalDfsFlusherThread) =
FMX (Contributor): ditto.

FMX (Contributor) commented Jul 8, 2024

@zhaohehuhu Thanks for your effort. The review will be done within this week.
BTW, I noticed the binary package of Celeborn has grown to 468 MB, and a single jar, aws-java-sdk-bundle-1.12.367.jar, accounts for 310 MB.
Is there any solution to reduce the binary package size?

pan3793 (Member) commented Jul 8, 2024

The public cloud vendors' client jars are crazy large, especially AWS (v1 300 MiB+, v2 500 MiB+); we should not ship them by default.

FMX (Contributor) commented Jul 8, 2024

You can fix the style issues and license issues using the following commands

./dev/reformat
build/mvn org.apache.rat:apache-rat-plugin:check -Pgoogle-mirror,spark-3.3

zhaohehuhu (Author) commented:

> crazy

got it.

zhaohehuhu (Author) commented:

> You can fix the style issues and license issues using the following commands:
> ./dev/reformat
> build/mvn org.apache.rat:apache-rat-plugin:check -Pgoogle-mirror,spark-3.3

OK. Thanks.

zhaohehuhu force-pushed the dev-0619 branch 2 times, most recently from d40e3d5 to b24d3f2, on July 11, 2024 09:55
FMX (Contributor) left a review:

Thanks for your effort, but this PR still needs further changes.

pom.xml Outdated
@@ -71,6 +71,7 @@

<!-- use hadoop-3 as default -->
<hadoop.version>3.3.6</hadoop.version>
<aws.version>1.12.367</aws.version>
FMX (Contributor): This version can be moved to the hadoop-aws profile.

zhaohehuhu (Author): OK. Done.
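For illustration only, a sketch of what such a profile could look like in the root pom.xml (the actual profile contents in the PR may differ):

```xml
<!-- Hypothetical sketch: keep the AWS SDK version (and its dependencies)
     behind an opt-in hadoop-aws profile so they are not shipped by default. -->
<profile>
  <id>hadoop-aws</id>
  <properties>
    <aws.version>1.12.367</aws.version>
  </properties>
</profile>
```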


BUILD_COMMAND=("$SBT" clean package)
BUILD_COMMAND=("$SBT" clean package "$PROFILE")
FMX (Contributor): These changes can be deleted.

zhaohehuhu (Author): ./build/make-distribution.sh --sbt-enabled -Pspark-3.3,hadoop-aws may not work as expected with the original command above.

@@ -85,6 +87,13 @@ public DfsPartitionReader(

this.metricsCallback = metricsCallback;
this.location = location;
FileSystem hadoopFs = null;
FMX (Contributor): A DFS partition reader reads only one partition location, so the hadoopFs can be cached as a field of this class. That would eliminate the unnecessary condition blocks.

zhaohehuhu (Author): OK. Done.
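The pattern the reviewer asks for, sketched in Scala for brevity (DfsPartitionReader itself is Java; the class and field names below are hypothetical): resolve the FileSystem once from the file's URI and reuse it for every read.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

// Hypothetical sketch: cache the FileSystem for this reader's single partition
// location instead of re-branching on storage type at every read call.
class PartitionReaderSketch(filePath: String, conf: Configuration) {
  private val path = new Path(filePath)
  // Resolved once from the URI scheme (hdfs:// or s3a://) and reused afterwards.
  private val hadoopFs: FileSystem = path.getFileSystem(conf)

  def open(): FSDataInputStream = hadoopFs.open(path)
}
```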

// create a DataStreamer that is a thread.
// If we reuse HDFS output stream, we will exhaust the memory soon.
// If we reuse DFS output stream, we will exhaust the memory soon.
FMX (Contributor): I wonder whether Hadoop's AWS module has some optimization for this point.

this.flusherBufferSize = localFlusherBufferSize;
channel = FileChannelUtils.createWritableFileChannel(this.diskFileInfo.getFilePath());
} else {
this.flusherBufferSize = hdfsFlusherBufferSize;
// We open the stream and close immediately because HDFS output stream will
FileSystem hadoopFs = null;
FMX (Contributor): Keep this hadoopFs as a class variable. There is no need for so many conditionals to fetch the corresponding hadoopFs.

zhaohehuhu (Author): OK. Done.

@@ -673,11 +690,17 @@ class FileSorter {
indexFile.delete();
}
} else {
if (StorageManager.hadoopFs().exists(fileInfo.getHdfsSortedPath())) {
StorageManager.hadoopFs().delete(fileInfo.getHdfsSortedPath(), false);
FileSystem hadoopFs = null;
FMX (Contributor): ditto.

zhaohehuhu (Author): OK. Done.

if (StorageManager.hadoopFs()
.exists(diskFileInfo.getHdfsPeerWriterSuccessPath())) {
StorageManager.hadoopFs().delete(diskFileInfo.getHdfsPath(), false);
if (diskFileInfo.isDFS()) {
FMX (Contributor): ditto.

zhaohehuhu (Author): OK. Done.

@@ -307,6 +308,7 @@ org.slf4j:jcl-over-slf4j
org.webjars:swagger-ui
org.xerial.snappy:snappy-java
org.yaml:snakeyaml
com.amazonaws:aws-java-sdk-bundle
FMX (Contributor): Add the dependency in alphabetical order.

zhaohehuhu (Author): Got it.

FMX (Contributor) left a review:

LGTM, except a nit.

hdfsStream.write(indexBuffer.array());
hdfsStream.close();
} else if (diskFileInfo.isDFS()) {
FSDataOutputStream dfsStream = hadoopFs.append(diskFileInfo.getDfsIndexPath());
FMX (Contributor): Is this safe for S3? As far as I know, S3 doesn't support append.
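One S3-safe alternative, assuming the sorted index is always written in full rather than extended incrementally (hypothetical helper, not code from the PR): create/overwrite the object instead of appending.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: on object stores without append, write the whole index
// buffer via create(overwrite = true) instead of hadoopFs.append(...).
def writeIndex(fs: FileSystem, indexPath: Path, indexBytes: Array[Byte]): Unit = {
  val out = fs.create(indexPath, true) // replaces any previous index object
  try out.write(indexBytes)
  finally out.close()
}
```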

FMX closed this in 7a596bb on Jul 24, 2024
FMX (Contributor) commented Jul 24, 2024

Merged into main (v0.6.0).

zhaohehuhu deleted the dev-0619 branch on July 26, 2024 03:00
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
### What changes were proposed in this pull request?

Add support for writing shuffle data to object storage (S3 only), as described in the title.

### Why are the changes needed?

Currently, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which is a limitation when moving on-premises servers to AWS and using S3 as the sink for shuffled data.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes apache#2579 from zhaohehuhu/dev-0619.

Authored-by: zhaohehuhu <[email protected]>
Signed-off-by: mingji <[email protected]>