[CORE]ZipPartitions for arbitrary number of RDDs. #49659

kyle-winkelman · 2025-01-24T21:09:20Z

What changes were proposed in this pull request?

Add a generic zipPartitions method that takes an arbitrary number of RDDs (for cases where more than 4 RDDs need to be zipped).
Update the JavaRDD API to include the same set of zipPartitions functions as RDD.
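
This excerpt does not show the signature of the new generic method, so the following is only a rough illustration of what zipping N partitions means, modeled on plain Scala iterators rather than RDDs. The `ZipIterators` object and its `zipAll` helper are hypothetical names invented for this sketch and are not part of Spark or of this PR.

```scala
object ZipIterators {
  // Hypothetical helper (not a Spark API): zips any number of iterators
  // element-wise and stops as soon as the shortest one is exhausted.
  def zipAll[T](iterators: Seq[Iterator[T]]): Iterator[Seq[T]] = {
    // Materialize the outer sequence so that mapping over it is strict.
    val its = iterators.toVector
    new Iterator[Seq[T]] {
      def hasNext: Boolean = its.nonEmpty && its.forall(_.hasNext)
      def next(): Seq[T] = its.map(_.next())
    }
  }

  def main(args: Array[String]): Unit = {
    // Three "partitions", one per imaginary RDD, combined in a single pass.
    val sums = zipAll(Seq(Iterator(1, 2, 3), Iterator(10, 20, 30), Iterator(100, 200, 300)))
      .map(_.sum)
      .toList
    println(sums) // List(111, 222, 333)
  }
}
```

A function of roughly this shape, taking a sequence of iterators instead of a fixed number of arguments, is what lets one call handle any number of inputs.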

Why are the changes needed?

RDD.zipPartitions currently allows at most 3 other RDDs to be zipped with the calling RDD. This forces users to call zipPartitions multiple times to zip more than 4 RDDs.

Also, the Java API only allows zipPartitions with 1 other RDD, so this change brings the Java API to parity with the Scala one.
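
For context, the multi-call workaround described above might look like the sketch below, which uses only the existing `zipPartitions` overloads. The object name `NestedZipPartitions`, the helper `sumFive`, and the sample data are assumptions made for illustration. The sketch also assumes every RDD has the same number of partitions and the same number of elements per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object NestedZipPartitions {
  // Element-wise sum of five RDDs. The existing API accepts at most three
  // other RDDs per call, so the fifth RDD needs a second zipPartitions call.
  def sumFive(a: RDD[Int], b: RDD[Int], c: RDD[Int], d: RDD[Int], e: RDD[Int]): RDD[Int] = {
    val firstFour = a.zipPartitions(b, c, d) { (ia, ib, ic, id) =>
      ia.zip(ib).zip(ic).zip(id).map { case (((w, x), y), z) => w + x + y + z }
    }
    firstFour.zipPartitions(e) { (acc, ie) =>
      acc.zip(ie).map { case (sum, v) => sum + v }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("nested-zip-partitions")
    val sc = new SparkContext(conf)
    try {
      // Five RDDs over 1..6 in two partitions, scaled by 1 through 5.
      val rdds = (1 to 5).map(k => sc.parallelize(1 to 6, 2).map(_ * k))
      val result = sumFive(rdds(0), rdds(1), rdds(2), rdds(3), rdds(4)).collect()
      println(result.mkString(", ")) // 15, 30, 45, 60, 75, 90
    } finally {
      sc.stop()
    }
  }
}
```

Each extra call adds an intermediate RDD and forces the per-partition function to be split into stages, which is the inconvenience this PR aims to remove.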

Does this PR introduce any user-facing change?

Yes, new zipPartitions functions in RDD and JavaRDD.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

HyukjinKwon · 2025-01-25T07:38:07Z

core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala

+    def fn: (Iterator[T], Iterator[U1], Iterator[U2]) => Iterator[V] =
+      (t: Iterator[T], u1: Iterator[U1], u2: Iterator[U2]) =>
+        f.call(t.asJava, u1.asJava, u2.asJava).asScala
+    JavaRDD.fromRDD(


This can be easily worked around. I wouldn't add it, especially considering that we're being conservative about the RDD API.

kyle-winkelman · 2025-01-25

Is this comment in regard to the entire PR or just the changes in JavaRDDLike? My long-term goal was to add additional cogroup methods for 3, 4, and N KeyValueGroupedDatasets. I do not need all the logic from this PR for that goal, but thought it was a good small step in that direction.

kyle-winkelman · 2025-01-25

Here is what my long-term goal might look like: master...kyle-winkelman:spark:everything (it might have some noise in it, but it adds additional cogroup methods and SPARK-42349). If you would prefer that I go straight for the big PR that makes all the changes at once, I can repurpose this PR to target those changes.

HyukjinKwon · 2025-01-27

Why don't we use Dataset instead? We're actually promoting it over the RDD API.

github-actions bot added the CORE label Jan 24, 2025

kyle-winkelman force-pushed the zipPartitions branch 3 times, most recently from 72e8f05 to fddfe7c Compare January 24, 2025 21:55

kyle-winkelman changed the title ~~ZipPartitions for arbitrary number of RDDs.~~ [CORE]ZipPartitions for arbitrary number of RDDs. Jan 25, 2025

ZipPartitions for arbitrary number of RDDs.

f005ef7

kyle-winkelman force-pushed the zipPartitions branch from fddfe7c to f005ef7 Compare January 25, 2025 01:03

HyukjinKwon reviewed Jan 25, 2025
