[CORE]ZipPartitions for arbitrary number of RDDs. #49659

Open: kyle-winkelman wants to merge 1 commit into master

kyle-winkelman

What changes were proposed in this pull request?

  • Add a generic zipPartitions method that takes an arbitrary number of RDDs (for cases where more than 4 RDDs need to be zipped).
  • Update the JavaRDD API to include the same set of zipPartitions functions as RDD.

Why are the changes needed?

RDD.zipPartitions currently accepts at most 3 other RDDs, so zipping more than 4 RDDs requires chaining multiple zipPartitions calls.
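
For illustration, the chaining that this limit forces can be sketched as below for five RDDs. This is a minimal sketch, not code from this PR: the object name `ChainedZipPartitions` and the method `sumPerPartition` are hypothetical, and it assumes spark-core is on the classpath and a local master is acceptable. It uses only the existing `zipPartitions` overloads for 3 and 1 other RDDs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of Spark or of this PR.
object ChainedZipPartitions {

  // Sums all elements that share a partition index across five RDDs.
  def sumPerPartition(sc: SparkContext): Array[Long] = {
    // All five RDDs must have the same number of partitions.
    val a: RDD[Int] = sc.parallelize(1 to 10, numSlices = 2)
    val b: RDD[Int] = a.map(_ * 2)
    val c: RDD[Int] = a.map(_ * 3)
    val d: RDD[Int] = a.map(_ * 4)
    val e: RDD[Int] = a.map(_ * 5)

    // Step 1: zip the first four RDDs (the current maximum in one call)
    // and reduce each partition to a single partial sum.
    val firstFour: RDD[Long] = a.zipPartitions(b, c, d) { (ia, ib, ic, id) =>
      Iterator((ia ++ ib ++ ic ++ id).map(_.toLong).sum)
    }

    // Step 2: zip the intermediate RDD with the fifth one.
    val allFive: RDD[Long] = firstFour.zipPartitions(e) { (partial, ie) =>
      Iterator(partial.sum + ie.map(_.toLong).sum)
    }

    allFive.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("chained-zip")
    val sc = new SparkContext(conf)
    try sumPerPartition(sc).foreach(println)
    finally sc.stop()
  }
}
```

The second call can only see whatever the first call chose to emit, so the user function has to be split around an intermediate representation.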

Also, the Java API only supports zipPartitions with 1 other RDD, so this change brings the two APIs to parity.

Does this PR introduce any user-facing change?

Yes, new zipPartitions functions in RDD and JavaRDD.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Jan 24, 2025
@kyle-winkelman kyle-winkelman force-pushed the zipPartitions branch 3 times, most recently from 72e8f05 to fddfe7c on January 24, 2025 21:55
@kyle-winkelman kyle-winkelman changed the title from "ZipPartitions for arbitrary number of RDDs." to "[CORE]ZipPartitions for arbitrary number of RDDs." Jan 25, 2025
def fn: (Iterator[T], Iterator[U1], Iterator[U2]) => Iterator[V] =
  (t: Iterator[T], u1: Iterator[U1], u2: Iterator[U2]) =>
    f.call(t.asJava, u1.asJava, u2.asJava).asScala
JavaRDD.fromRDD(

Member

This can easily be worked around. I wouldn't add it, especially since we're being conservative about the RDD API.
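
One possible shape of such a workaround on the Scala side, built only from the existing two-RDD `zipPartitions`, is sketched below. The helper `zipManyPartitions` and its enclosing object are hypothetical names, not part of Spark or of this PR, and the sketch assumes all input RDDs share an element type and a partition count. It buffers each partition in memory so the partitions can be carried through repeated pairwise zips, which a built-in N-way method would presumably not need to do.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper object; not part of Spark or of this PR.
object ZipManyPartitions {

  // Applies f to the iterators of the i-th partition of every input RDD.
  def zipManyPartitions[T, V: ClassTag](rdds: Seq[RDD[T]])(
      f: Seq[Iterator[T]] => Iterator[V]): RDD[V] = {
    require(rdds.nonEmpty, "at least one RDD is required")

    // Turn each partition into exactly one element: a one-entry Vector
    // holding the fully buffered partition (assumes it fits in memory).
    val buffered: Seq[RDD[Vector[Vector[T]]]] =
      rdds.map(rdd => rdd.mapPartitions(it => Iterator(Vector(it.toVector))))

    // Pairwise zips accumulate one buffered partition per input RDD.
    val combined: RDD[Vector[Vector[T]]] =
      buffered.tail.foldLeft(buffered.head) { (acc, next) =>
        acc.zipPartitions(next) { (l, r) => Iterator(l.next() ++ r.next()) }
      }

    // Hand the buffered partitions to the user function as iterators.
    combined.mapPartitions(groups => f(groups.next().map(_.iterator)))
  }
}
```

A call would look like `ZipManyPartitions.zipManyPartitions(Seq(a, b, c, d, e))(its => Iterator(its.map(_.sum).sum))` for five `RDD[Int]` values with matching partition counts.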

Author

Is this comment about the entire PR or just the changes in JavaRDDLike? My long-term goal is to add cogroup methods for 3, 4, and N KeyValueGroupedDatasets. I do not need all the logic from this PR for that goal, but thought it was a good small step in that direction.

Author

Here is what my long-term goal might look like: master...kyle-winkelman:spark:everything (it might have some noise in it, but it adds additional cogroup methods and SPARK-42349). If you would prefer that I go straight for the big PR that makes all the changes at once, I can repurpose this PR to target those changes.

Member

Why don't we use Dataset instead? We're actually promoting it over the RDD API.

2 participants