
Add repartition by maximum number of rows per block #50179

Open · wants to merge 25 commits into base: master
Conversation

@srinathk10 srinathk10 commented Feb 2, 2025

Why are these changes needed?

Add repartition by maximum number of rows per block
Addresses #36724

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
@srinathk10 srinathk10 requested a review from a team as a code owner February 2, 2025 07:23
@@ -1319,11 +1320,18 @@ def filter(
@PublicAPI(api_group=SSR_API_GROUP)
def repartition(
self,
num_blocks: int,
num_blocks: Optional[int] = None,
Contributor

Randomly came across this. I don't think we need num_blocks to default to None; it's uncommon for a default argument value to lead to the error path.

It will be helpful if the docstring includes all the invalid argument combinations (e.g., a Raises section).


# Determine the slice range for the next partition.
end_idx = start_idx + min(remaining_rows, max_num_rows_per_block - cur_rows)
cur_block_builder.add_block(accessor.slice(start_idx, end_idx, copy=True))
Contributor

Can we do copy=False? I don't think we need to copy the data here.
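For context on the copy=False suggestion: a slice can be a view over the same underlying buffer instead of a new allocation. A stdlib memoryview analogy (not Ray's actual block accessor, just an illustration of view-vs-copy semantics):

```python
data = bytearray(range(10))

# Zero-copy: slicing a memoryview shares data's underlying buffer.
view = memoryview(data)[2:5]

# Copying: taking bytes of a slice allocates new storage.
copied = bytes(data[2:5])

data[2] = 99  # mutate the source

print(view.tolist())  # [99, 3, 4] -- the view sees the change
print(list(copied))   # [2, 3, 4]  -- the copy does not
```

Avoiding the copy is safe here as long as the sliced block is not mutated afterwards.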

cur_block_builder.add_block(accessor.slice(start_idx, end_idx, copy=True))

# If the current block reaches the size limit, finalize and store it.
if cur_block_builder.num_rows() == max_num_rows_per_block:
Contributor

This condition isn't needed, because the current block will either have max_num_rows_per_block rows or be the last block.
Also, the builder isn't needed, because we are not combining slices across multiple input blocks. We can just return the sliced block.

if cur_block_builder.num_rows() > 0:
block_list.append(cur_block_builder.build())

return block_list
Contributor

Better to yield blocks as soon as they are sliced.
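Taken together, the suggestions above (slice without copying, drop the builder, yield eagerly) amount to a simple generator. A sketch using plain Python lists as stand-ins for block accessors (repartition_by_rows is a hypothetical name, not the PR's actual helper):

```python
from typing import Iterable, Iterator, List

Block = List[int]  # simplified stand-in for a real Ray block


def repartition_by_rows(
    blocks: Iterable[Block], max_num_rows_per_block: int
) -> Iterator[Block]:
    """Yield slices of at most max_num_rows_per_block rows as soon as they are cut."""
    for block in blocks:
        start_idx = 0
        while start_idx < len(block):
            end_idx = min(start_idx + max_num_rows_per_block, len(block))
            # In the real code this would be accessor.slice(start_idx, end_idx,
            # copy=False): no builder, no buffering, yield immediately.
            yield block[start_idx:end_idx]
            start_idx = end_idx
```

Yielding eagerly lets downstream operators start before the whole input block is processed.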

@@ -520,6 +584,17 @@ def transform_fn(blocks: Iterable[Block], _: TaskContext) -> Iterable[Block]:
return transform_fn


def _generate_transform_fn_for_repartition_block(
fn: UserDefinedFunction,
) -> MapTransformCallable[Block, Block]:
Contributor

maybe just inline this function, it's simple and won't be reused.

@raulchen
Contributor

raulchen commented Feb 4, 2025

I just realized that there is a simpler solution.
We should just extend OutputBuffer to support num-rows-based target block size.
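A minimal sketch of what a num-rows-aware output buffer could look like; the class and method names loosely mirror BlockOutputBuffer but are assumptions, and plain lists stand in for blocks:

```python
from typing import Any, List


class RowCountOutputBuffer:
    """Buffers rows and emits blocks of at most target_max_rows_per_block rows."""

    def __init__(self, target_max_rows_per_block: int):
        self._target = target_max_rows_per_block
        self._rows: List[Any] = []
        self._finalized = False

    def add_block(self, block: List[Any]) -> None:
        assert not self._finalized
        self._rows.extend(block)

    def finalize(self) -> None:
        self._finalized = True

    def has_next(self) -> bool:
        # A full block is ready, or we are finalized and draining leftovers.
        return len(self._rows) >= self._target or (
            self._finalized and len(self._rows) > 0
        )

    def next(self) -> List[Any]:
        out = self._rows[: self._target]
        self._rows = self._rows[self._target :]
        return out
```

Feeding it input blocks and draining with `while buffer.has_next(): yield buffer.next()` reproduces the repartition semantics without a separate slicing pass.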

srinathk10 and others added 6 commits February 4, 2025 05:49
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
map_transformer = MapTransformer(transform_fns)
map_transformer.set_target_max_block_size(
target_max_block_size=op.max_num_rows_per_block
)
Contributor

target_max_block_size is based on the data size in bytes, not num rows.
We also need to make OutputBuffer to support num rows.
Btw, let's also update the target_max_block_size arg to a struct that accepts either size-bytes or num-rows-based arguments.

Contributor Author

I see. Missed this one.

@@ -31,8 +33,11 @@ class BlockOutputBuffer:
... yield output.next() # doctest: +SKIP
"""

def __init__(self, target_max_block_size: int):
def __init__(
self, target_max_block_size: int, target_max_rows_per_block: int = None
Contributor

Should check that only one of target_max_block_size and target_max_rows_per_block is non-None.
Maybe introduce such a struct to make code cleaner.

@dataclass
class OutputBlockSizeOption:

    target_max_block_size: Optional[int]
    target_max_rows_per_block: Optional[int]

    def __post_init__(self):
        # Exactly one of the two targets should be set.
        assert (self.target_max_block_size is None) != (
            self.target_max_rows_per_block is None
        )

Contributor

Please fix the type as well

@@ -57,14 +62,33 @@ def finalize(self) -> None:
assert not self._finalized
self._finalized = True

def _buffer_row_limit(self) -> bool:
Contributor

Suggested change
def _buffer_row_limit(self) -> bool:
def _exceeds_buffer_row_limit(self) -> bool:

target_num_rows_by_rows = (
self._target_max_rows_per_block or block.num_rows()
)
target_num_rows = min(target_num_rows_by_size, target_num_rows_by_rows)
Contributor

To simplify the logic, I think we can just consider one factor at a time, i.e., only one factor can be non-None.
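The "one factor at a time" simplification could look like this helper (a sketch; the function name and parameters are assumptions, not the PR's code):

```python
from typing import Optional


def target_num_rows(
    block_num_rows: int,
    target_max_rows_per_block: Optional[int],
    target_num_rows_by_size: int,
) -> int:
    """Pick the slice size from exactly one active limit."""
    if target_max_rows_per_block is not None:
        # The row-count limit is active; the size-based estimate is ignored.
        return min(target_max_rows_per_block, block_num_rows)
    return min(target_num_rows_by_size, block_num_rows)
```

With the invariant that only one limit is non-None, the min() over both factors is never needed.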

Contributor

+1

python/ray/data/_internal/output_buffer.py (resolved)

and self._buffer.num_rows() > self._target_max_rows_per_block
)

def _buffer_size_limit(self) -> bool:
Contributor

Same as above

Contributor

Ditto everywhere


python/ray/data/dataset.py (resolved)
@srinathk10 srinathk10 added the go add ONLY when ready to merge, run all tests label Feb 7, 2025
assert not (
self.target_max_block_size is not None
and self.target_max_rows_per_block is not None
), "Only one of target_max_block_size or target_max_rows_per_block should be set."
Contributor

They cannot both be None either.
A cleaner way to check that one is None and the other is non-None: assert (x is None) != (y is None)

@@ -63,6 +63,7 @@ def __init__(
self._output_type = output_type
self._category = category
self._target_max_block_size = None
self._target_max_rows_per_block = None
Contributor

nit, also use OutputBlockSizeOption here for simplicity.

Contributor

and same for the map physical operator.

Signed-off-by: Srinath Krishnamachari <[email protected]>
srinathk10 and others added 4 commits February 7, 2025 21:43
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
4 participants