
Conversation

@HwVanICI
Collaborator

Description

Currently, AReaL implements a static resource allocation scheme, where fixed numbers of devices are allocated to the rollout and training stages. This PR introduces rollout scale-up: rollout engines can be added to a running training job. This feature benefits workloads where the optimal resource allocation ratio changes during training.

Design Summary

  • Usage: A user (operator) can request adding rollout workers via an HTTP endpoint while training is running.
  • A Ray launcher starts an independent scaling_controller that exposes HTTP endpoints for scaling actions.
  • On a scale-up request, scaling_controller launches additional generation servers (validated with a vLLM-Ascend backend in our tests, but the design is engine-agnostic).
  • Trainers call handle_scale_up() (from utils/scaling.py) at the end of each step:
    1. Check if there is a scale_up_request.
    2. Wait until the new generation servers are ready.
    3. When ready, scaling_controller posts a scale_up_done signal via name_resolve.
    4. The trainer calls actor(fsdp_engine)._re_init_weight_update_from_distributed(), which in turn invokes actor._init_weight_update_from_distributed(). The actor then recreates the process group with the new world size to include the newly added generation servers.
    5. The rollout engine executes init_weights_update_group() to recreate the process group to include both previous and newly added generation servers.
    6. remote_inf_engine.py::refresh_addresses() refreshes endpoints to old + new server addresses.
    7. The request dispatcher (choose_server) is adjusted so that at the start of a step it includes newly added servers when routing requests.
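The end-of-step sequence above can be sketched as a small hook. This is a hypothetical illustration only: the helper names on `name_resolve` (`try_get`, `wait`) and the exact method signatures on the actor and rollout engine are assumptions, not the real AReaL API.

```python
# Hypothetical sketch of the trainer-side end-of-step hook described above.
# name_resolve.try_get / name_resolve.wait and the method names are
# illustrative assumptions, not the exact AReaL interfaces.

def handle_scale_up(name_resolve, actor, rollout_engine) -> bool:
    """Run by each trainer at the end of a step; returns True if scaling occurred."""
    # Step 1: check whether a scale-up has been requested.
    if name_resolve.try_get("scale_up_request") is None:
        return False
    # Steps 2-3: block until scaling_controller posts scale_up_done.
    name_resolve.wait("scale_up_done")
    # Steps 4-5: recreate the process group with the new world size; the
    # rollout side correspondingly runs init_weights_update_group().
    actor._re_init_weight_update_from_distributed()
    # Steps 6-7: refresh endpoints so choose_server() can route requests
    # to both old and newly added generation servers.
    rollout_engine.refresh_addresses()
    return True
```

The hook is deliberately a no-op when no request is pending, so calling it unconditionally at every step boundary is cheap.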

Workflow (rollout engine scale-up)

  1. POST /scale_up { "scaled_k": K } received by scaling_controller.
  2. scaling_controller allocates resources and launches K new generation servers.
  3. New servers are registered and their readiness is tracked.
  4. When all K generation servers are ready, scaling_controller signals scale_up_done via name_resolve.
  5. At the next end-of-step boundary, trainers run handle_scale_up() to perform distributed re-initialization and address refresh.
  6. Next step begins with a scaled pool of rollout servers.
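The controller side of this workflow can be condensed into a small state machine. The sketch below is a minimal illustration under stated assumptions: in the real PR the `scale_up` entry point is exposed over HTTP (FastAPI) and servers are launched via Ray, and the `launch_server` / `signal_scale_up_done` callables here are hypothetical stand-ins for those pieces.

```python
import threading

class ScalingController:
    """Minimal, hypothetical sketch of the scale-up workflow above; the real
    service exposes scale_up over HTTP (FastAPI) and launches servers via Ray."""

    def __init__(self, launch_server, signal_scale_up_done):
        self._lock = threading.Lock()  # serialize concurrent scale-up requests
        self._launch_server = launch_server                 # e.g. a Ray launcher
        self._signal_scale_up_done = signal_scale_up_done   # name_resolve post
        self.servers = []

    def scale_up(self, scaled_k: int) -> int:
        """Handle POST /scale_up {"scaled_k": K}; returns the total server count."""
        if scaled_k < 1:
            raise ValueError("scaled_k must be >= 1")
        with self._lock:
            # Step 2: allocate resources and launch K new generation servers.
            new_servers = [self._launch_server() for _ in range(scaled_k)]
            # Step 3: track readiness of each newly registered server.
            for server in new_servers:
                server.wait_ready()
            self.servers.extend(new_servers)
            # Step 4: signal scale_up_done once all K servers are ready.
            self._signal_scale_up_done()
            return len(self.servers)
```

Holding a lock around the whole operation is one way to avoid the race condition on concurrent scale-up requests noted in the review below.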

Instructions

  1. Use the provided config: examples/math/gsm8k_grpo_npu_scale.yaml
    • This version has been validated with Ray + FSDP training backend + vLLM-Ascend generation backend.
  2. Multi-node setup:
    • Example: 3 nodes × 16 devices each, for a total of 48 devices.
    • Start with DP=16 for rollout and DP=16 for training across 2 nodes in the AReaL YAML config.
    • Initialize Ray on all 3 nodes, leaving 1 node (16 devices) as headroom for future scaling.
  3. Start training with the YAML above and ensure scaling_controller_port is set.
  4. Trigger scale-up at any time:
    curl -X POST http://localhost:8899/scale_up \
         -H "Content-Type: application/json" \
         -d '{"scaled_k": 1}'
    Replace 8899 with the configured scaling_controller_port if applicable.
  5. Scaling K > 1:
    • Multiple rollout servers can be added at once via scaled_k.
    • The trainer will block at the end-of-step in handle_scale_up() until all K new servers are ready. For larger K, expect more startup time before training proceeds.
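For scripted use, the same request can be issued from Python instead of curl. The sketch below mirrors the curl example above; the default port 8899 is taken from that example and should be replaced with the configured scaling_controller_port.

```python
import json
from urllib.request import Request, urlopen

def request_scale_up(scaled_k: int, port: int = 8899) -> Request:
    """Build the POST /scale_up request; scaled_k is the number of new servers."""
    payload = json.dumps({"scaled_k": scaled_k}).encode("utf-8")
    return Request(
        f"http://localhost:{port}/scale_up",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending is left to the caller, e.g.: urlopen(request_scale_up(4))
```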

Compatibility & Assumptions

  • Engine: The mechanism is validated with FSDP and vLLM (with vLLM-Ascend) on NPU in this MR.
  • Ray-based orchestration: Ray is assumed for process launch, name resolution, and resource accounting.
  • Reconfiguration point: Re-init happens between steps to keep on-step logic simple and deterministic.

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

N/A

@gemini-code-assist
Contributor

Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new feature to AReaL: dynamic scale-up support for rollout engines during active training. This enhancement moves away from static resource allocation, enabling operators to add more inference servers on the fly via an HTTP endpoint. The system is designed to seamlessly integrate new servers into the existing training process by coordinating their launch, updating server address lists, and reconfiguring distributed communication groups within the trainers, ensuring continuous and adaptable resource utilization.

Highlights

  • Dynamic Rollout Scale-Up: Introduces the ability to dynamically add more rollout (inference) engines to a running training job, allowing for flexible resource allocation based on changing workload demands.
  • Scaling Controller Service: A new scaling_controller service is implemented using FastAPI, which exposes an HTTP endpoint to receive scale-up requests and orchestrate the launch of additional generation servers via Ray.
  • Trainer Reconfiguration: Trainers are updated to detect scale-up requests at the end of each training step. Upon detection, they wait for new servers to be ready, then re-initialize their distributed communication groups and refresh their list of available inference server addresses.
  • Configuration and Example: A new ScalingConfig dataclass is added for managing scaling parameters, and a new example YAML configuration (gsm8k_grpo_npu_scale.yaml) demonstrates how to enable and use the scale-up feature.
  • Distributed Group Management: Modifications to FSDPPPOActor and VLLMBackend ensure that distributed process groups are correctly re-initialized with the updated world size and unique group names when new servers are added.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: dynamic scaling of rollout workers. The implementation is comprehensive, touching configuration, launchers, core inference engines, and the training loop. The design using a separate scaling_controller with an HTTP endpoint is a solid approach for manual scaling. The changes to handle process group re-initialization are well-thought-out, especially the use of unique group names for each re-initialization.

My review focuses on a few areas for improvement regarding correctness, maintainability, and logging practices. I've identified a bug in the launcher logic, a minor inconsistency in the scaling controller, and an opportunity to reduce log noise in the scaling utility.

Overall, this is a great addition to the project.

@HwVanICI
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: dynamic scaling of rollout workers. The implementation adds a scaling_controller that exposes an HTTP endpoint for triggering scale-up operations. The changes are well-structured, touching configuration, the remote inference engine, the FSDP engine, and the Ray launcher to support this new functionality. The core logic for handling scale-up requests on the trainer side is encapsulated in a new scaling.py utility.

My review focuses on the new scaling_controller.py file, where I've identified a few issues related to thread safety, correctness in state handling, and code style. These include a race condition in handling concurrent scale-up requests, an incorrect calculation of existing servers during scale-up, and some opportunities for code quality improvements. Addressing these points will make the new scaling feature more robust and maintainable.

@HwVanICI HwVanICI changed the title [WIP][Feat] Add rollout scale-up support [Feat] Add rollout scale-up support Nov 20, 2025
@dazhuangzhuang1024
Collaborator

Wouldn't it be more appropriate to call this feature "scale out"?

@garrett4wade
Collaborator

Or we can just call it "rollout scaling" since the boundary is not very clear in the context of RL.

FYI #611 is about to merge. We can implement the scaler within RolloutController.
