Add repartition by maximum number of rows per block #50179
base: master
Conversation
Signed-off-by: Srinath Krishnamachari <[email protected]>
python/ray/data/dataset.py (Outdated)
@@ -1319,11 +1320,18 @@ def filter(
 @PublicAPI(api_group=SSR_API_GROUP)
 def repartition(
     self,
-    num_blocks: int,
+    num_blocks: Optional[int] = None,
Randomly came across this. I think we do not need `num_blocks` to be defaulted to `None`; it's uncommon for default argument values to lead to an error path.
It would also be helpful if the docstring listed all the invalid argument combinations (e.g., in a `Raises` section).
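A `Raises` section along those lines might look like the sketch below. This is written as a free function for illustration; the parameter names follow the diff, but the validation logic and docstring wording are assumptions, not the PR's actual implementation.

```python
from typing import Optional


def repartition(
    num_blocks: Optional[int] = None,
    max_num_rows_per_block: Optional[int] = None,
) -> None:
    """Repartition the data into blocks.

    Args:
        num_blocks: Target number of output blocks.
        max_num_rows_per_block: Maximum number of rows per output block.

    Raises:
        ValueError: If both ``num_blocks`` and ``max_num_rows_per_block``
            are specified, or if neither is specified.
    """
    if (num_blocks is None) == (max_num_rows_per_block is None):
        raise ValueError(
            "Exactly one of num_blocks or max_num_rows_per_block must be set."
        )
```

Documenting the invalid combinations this way makes the single-argument contract visible without the caller having to hit the error path first.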
Signed-off-by: srinathk10 <[email protected]>
# Determine the slice range for the next partition.
end_idx = start_idx + min(remaining_rows, max_num_rows_per_block - cur_rows)
cur_block_builder.add_block(accessor.slice(start_idx, end_idx, copy=True))
Can we do `copy=False`? I don't think we need to copy the data here.
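For intuition on what `copy=False` buys: a zero-copy slice shares the underlying buffers with its parent instead of materializing new ones. The stdlib `memoryview` analogue below is purely illustrative (Ray's `BlockAccessor.slice` operates on Arrow or pandas blocks, not byte buffers):

```python
buf = bytearray(range(10))

# Zero-copy slice: the view shares memory with buf.
view = memoryview(buf)[2:5]

# Copying slice: an independent buffer is materialized.
snapshot = bytes(buf[2:5])

buf[2] = 99
assert view[0] == 99     # mutation is visible through the zero-copy view
assert snapshot[0] == 2  # the copy is unaffected
```

The trade-off is the usual one: zero-copy avoids allocation and memcpy cost, but the slice keeps the whole parent buffer alive.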
cur_block_builder.add_block(accessor.slice(start_idx, end_idx, copy=True))

# If the current block reaches the size limit, finalize and store it.
if cur_block_builder.num_rows() == max_num_rows_per_block:
This condition isn't needed, because the current block will either have exactly `max_num_rows_per_block` rows or be the last block.
Also, the builder isn't needed either. Since we are not combining slices across multiple input blocks, we can just return the sliced block.
if cur_block_builder.num_rows() > 0:
    block_list.append(cur_block_builder.build())

return block_list
Better to yield blocks as soon as they are sliced.
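Combining the two suggestions above (drop the builder, yield eagerly), the slicing loop might reduce to a sketch like this. Plain Python lists stand in for blocks here; `max_num_rows_per_block` mirrors the argument in the diff, everything else is illustrative:

```python
from typing import Iterator, List, Sequence


def slice_by_max_rows(
    blocks: Sequence[List[int]], max_num_rows_per_block: int
) -> Iterator[List[int]]:
    """Yield row-capped slices as soon as they are produced.

    Each input block is cut into consecutive slices of at most
    max_num_rows_per_block rows. Every slice except possibly the last
    one of each block is full, so no builder is needed, and nothing is
    combined across input blocks.
    """
    for block in blocks:
        for start in range(0, len(block), max_num_rows_per_block):
            yield block[start : start + max_num_rows_per_block]
```

With `blocks=[[1, 2, 3, 4, 5], [6, 7]]` and a cap of 2, this yields `[1, 2]`, `[3, 4]`, `[5]`, `[6, 7]` without buffering anything.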
@@ -520,6 +584,17 @@ def transform_fn(blocks: Iterable[Block], _: TaskContext) -> Iterable[Block]:
     return transform_fn


+def _generate_transform_fn_for_repartition_block(
+    fn: UserDefinedFunction,
+) -> MapTransformCallable[Block, Block]:
Maybe just inline this function; it's simple and won't be reused.
I just realized that there is a simpler solution.
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
map_transformer = MapTransformer(transform_fns)
map_transformer.set_target_max_block_size(
    target_max_block_size=op.max_num_rows_per_block
)
`target_max_block_size` is based on the data size in bytes, not the number of rows.
We also need to make `OutputBuffer` support a row limit.
Btw, let's also update the `target_max_block_size` arg to a struct that accepts either a size-in-bytes or a num-rows-based limit.
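One way to picture a row-based output buffer: accumulate incoming blocks and emit an output block whenever the configured row count is reached. The class below is a minimal stand-in, not Ray's `BlockOutputBuffer` API; the name and the use of Python lists as blocks are assumptions for illustration.

```python
from typing import List


class RowCappedOutputBuffer:
    """Hypothetical buffer that caps output blocks by row count."""

    def __init__(self, target_max_rows_per_block: int):
        self._target = target_max_rows_per_block
        self._rows: List[int] = []  # buffered rows (a real buffer holds blocks)
        self._finalized = False

    def add_block(self, block: List[int]) -> None:
        assert not self._finalized
        self._rows.extend(block)

    def finalize(self) -> None:
        self._finalized = True

    def has_next(self) -> bool:
        # A full block is ready, or we are finalized with leftover rows.
        return len(self._rows) >= self._target or (
            self._finalized and len(self._rows) > 0
        )

    def next(self) -> List[int]:
        assert self.has_next()
        out, self._rows = self._rows[: self._target], self._rows[self._target :]
        return out
```

The byte-based buffer follows the same add/has_next/next shape, which is what makes a single struct carrying either limit attractive.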
I see. Missed this one.
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
@@ -31,8 +33,11 @@ class BlockOutputBuffer:
     ... yield output.next() # doctest: +SKIP
     """

-    def __init__(self, target_max_block_size: int):
+    def __init__(
+        self, target_max_block_size: int, target_max_rows_per_block: int = None
Should check that only one of `target_max_block_size` and `target_max_rows_per_block` is non-None.
Maybe introduce such a struct to make the code cleaner:

@dataclass
class OutputBlockSizeOption:
    target_max_block_size: Optional[int]
    target_max_rows_per_block: Optional[int]

    def __post_init__(self):
        # Exactly one of the two limits must be set.
        assert (self.target_max_block_size is None) != (
            self.target_max_rows_per_block is None
        )
Please fix the type annotation as well (it should be `Optional[int]`, not `int`, since the default is `None`).
@@ -57,14 +62,33 @@ def finalize(self) -> None:
     assert not self._finalized
     self._finalized = True

+    def _buffer_row_limit(self) -> bool:
Suggested change:
-    def _buffer_row_limit(self) -> bool:
+    def _exceeds_buffer_row_limit(self) -> bool:
target_num_rows_by_rows = (
    self._target_max_rows_per_block or block.num_rows()
)
target_num_rows = min(target_num_rows_by_size, target_num_rows_by_rows)
To simplify the logic, I think we can just consider one factor at a time, i.e., only one factor can be non-None.
+1
    and self._buffer.num_rows() > self._target_max_rows_per_block
)

def _buffer_size_limit(self) -> bool:
Same as above
Ditto everywhere
Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
assert not (
    self.target_max_block_size is not None
    and self.target_max_rows_per_block is not None
), "Only one of target_max_block_size or target_max_rows_per_block should be set."
They cannot both be None either.
A cleaner way to check that one is None and the other is non-None: `assert (x is None) != (y is None)`.
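Combining that `(x is None) != (y is None)` trick with the `OutputBlockSizeOption` struct suggested earlier might look like the sketch below. The field names follow the discussion; the struct itself is a proposal in this thread, not existing Ray code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputBlockSizeOption:
    target_max_block_size: Optional[int] = None
    target_max_rows_per_block: Optional[int] = None

    def __post_init__(self):
        # XOR of the None-checks: true only when exactly one limit is set,
        # rejecting both the both-set and the both-None cases in one line.
        assert (self.target_max_block_size is None) != (
            self.target_max_rows_per_block is None
        ), (
            "Exactly one of target_max_block_size or "
            "target_max_rows_per_block must be set."
        )
```

Constructing it with exactly one limit succeeds; constructing it with both or neither fails the assertion immediately.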
@@ -63,6 +63,7 @@ def __init__(
     self._output_type = output_type
     self._category = category
     self._target_max_block_size = None
+    self._target_max_rows_per_block = None
Nit: also use `OutputBlockSizeOption` here for simplicity.
and same for the map physical operator.
Signed-off-by: Srinath Krishnamachari <[email protected]>
Why are these changes needed?
Add repartition by maximum number of rows per block
Addresses #36724
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If adding a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.