-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CORE]ZipPartitions for arbitrary number of RDDs. #49659
base: master
Are you sure you want to change the base?
Conversation
72e8f05
to
fddfe7c
Compare
fddfe7c
to
f005ef7
Compare
def fn: (Iterator[T], Iterator[U1], Iterator[U2]) => Iterator[V] = | ||
(t: Iterator[T], u1: Iterator[U1], u2: Iterator[U2]) => | ||
f.call(t.asJava, u1.asJava, u2.asJava).asScala | ||
JavaRDD.fromRDD( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be easily worked around. I wouldn't add this also considering that we're being conservative on RDD API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment in regards to the entire PR or just the changes in JavaRDDLike? My long term goal was to add additional cogroup methods for 3, 4, and N number of KeyValueGroupedDatasets. I do not need all the logic from this PR for that goal, but thought it was a good small step in that direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is what my long term might look like master...kyle-winkelman:spark:everything (might have some noise in it, but it adds additional cogroup methods and SPARK-42349). If you would prefer I attempt to go straight for the big PR that does all the changes at once, I can repurpose this PR to target those changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use Dataset instead? We're promoting it over RDD API actually.
What changes were proposed in this pull request?
Why are the changes needed?
RDD.zipPartitions currently only allows up to 3 other RDDs to be zipped. This forces users to zipPartitions multiple times to zip more than 4 RDDs.
Also, the Java API only allows zipPartitions with 1 other RDD, so this brings parity between the two.
Does this PR introduce any user-facing change?
Yes, new zipPartitions functions in RDD and JavaRDD.
How was this patch tested?
Added unit tests.
Was this patch authored or co-authored using generative AI tooling?
No