
Conversation

@HwVanICI
Collaborator

Description

Currently, AReaL implements a static resource allocation scheme, where fixed numbers of devices are allocated to the rollout and training stages. This PR introduces rollout scale-up: rollout engines can be added to a running training job. This feature benefits workloads where the optimal resource allocation ratio changes during training.

Design Summary

  • Usage: A user (operator) can request adding rollout workers via an HTTP endpoint while training is running.
  • A Ray launcher starts an independent scaling_controller that exposes HTTP endpoints for scaling actions.
  • On a scale-up request, scaling_controller launches additional generation servers (validated with a vLLM-Ascend backend in our tests, but the design is engine-agnostic).
  • Trainers call handle_scale_up() (from utils/scaling.py) at the end of each step:
    1. Check if there is a scale_up_request.
    2. Wait until the new generation servers are ready.
    3. When ready, scaling_controller posts a scale_up_done signal via name_resolve.
    4. The trainer calls actor(fsdp_engine)._re_init_weight_update_from_distributed(), which in turn invokes actor._init_weight_update_from_distributed(). The actor then recreates the process group with the new world size to include the newly added generation servers.
    5. The rollout engine executes init_weights_update_group() to recreate the process group to include both previous and newly added generation servers.
    6. remote_inf_engine.py::refresh_addresses() refreshes endpoints to old + new server addresses.
    7. The request dispatcher (choose_server) is adjusted so that at the start of a step it includes newly added servers when routing requests.
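The end-of-step sequence above can be sketched as a small hook. This is a hypothetical illustration only: the helper names on `name_resolve` (`try_get`, `wait`) and the exact method signatures on the actor and rollout engine are assumptions, not the real AReaL API.

```python
# Hypothetical sketch of the trainer-side end-of-step hook described above.
# name_resolve.try_get / name_resolve.wait and the method names are
# illustrative assumptions, not the exact AReaL interfaces.

def handle_scale_up(name_resolve, actor, rollout_engine) -> bool:
    """Run by each trainer at the end of a step; returns True if scaling occurred."""
    # Step 1: check whether a scale-up has been requested.
    if name_resolve.try_get("scale_up_request") is None:
        return False
    # Steps 2-3: block until scaling_controller posts scale_up_done.
    name_resolve.wait("scale_up_done")
    # Steps 4-5: recreate the process group with the new world size; the
    # rollout side correspondingly runs init_weights_update_group().
    actor._re_init_weight_update_from_distributed()
    # Steps 6-7: refresh endpoints so choose_server() can route requests
    # to both old and newly added generation servers.
    rollout_engine.refresh_addresses()
    return True
```

The hook is deliberately a no-op when no request is pending, so calling it unconditionally at every step boundary is cheap.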

Workflow (rollout engine scale-up)

  1. POST /scale_up { "scaled_k": K } received by scaling_controller.
  2. scaling_controller allocates resources and launches K new generation servers.
  3. New servers are registered and their readiness is tracked.
  4. When all K generation servers are ready, scaling_controller signals scale_up_done via name_resolve.
  5. At the next end-of-step boundary, trainers run handle_scale_up() to perform distributed re-initialization and address refresh.
  6. Next step begins with a scaled pool of rollout servers.
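The controller side of this workflow can be condensed into a small state machine. The sketch below is a minimal illustration under stated assumptions: in the real PR the `scale_up` entry point is exposed over HTTP (FastAPI) and servers are launched via Ray, and the `launch_server` / `signal_scale_up_done` callables here are hypothetical stand-ins for those pieces.

```python
import threading

class ScalingController:
    """Minimal, hypothetical sketch of the scale-up workflow above; the real
    service exposes scale_up over HTTP (FastAPI) and launches servers via Ray."""

    def __init__(self, launch_server, signal_scale_up_done):
        self._lock = threading.Lock()  # serialize concurrent scale-up requests
        self._launch_server = launch_server                 # e.g. a Ray launcher
        self._signal_scale_up_done = signal_scale_up_done   # name_resolve post
        self.servers = []

    def scale_up(self, scaled_k: int) -> int:
        """Handle POST /scale_up {"scaled_k": K}; returns the total server count."""
        if scaled_k < 1:
            raise ValueError("scaled_k must be >= 1")
        with self._lock:
            # Step 2: allocate resources and launch K new generation servers.
            new_servers = [self._launch_server() for _ in range(scaled_k)]
            # Step 3: track readiness of each newly registered server.
            for server in new_servers:
                server.wait_ready()
            self.servers.extend(new_servers)
            # Step 4: signal scale_up_done once all K servers are ready.
            self._signal_scale_up_done()
            return len(self.servers)
```

Holding a lock around the whole operation is one way to avoid the race condition on concurrent scale-up requests noted in the review below.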

Instructions

  1. Use the provided config: examples/math/gsm8k_grpo_npu_scale.yaml
    • This version has been validated with Ray + FSDP training backend + vLLM-Ascend generation backend.
  2. Multi-node setup:
    • Example: 3 nodes × 16 devices each, for a total of 48 devices.
    • Start with DP=16 for rollout and DP=16 for training across 2 nodes in the AReaL YAML config.
    • Initialize Ray on all 3 nodes, leaving 1 node (16 devices) as headroom for future scaling.
  3. Start training with the YAML above and ensure scaling_controller_port is set.
  4. Trigger scale-up at any time:
    curl -X POST http://localhost:8899/scale_up \
         -H "Content-Type: application/json" \
         -d '{"scaled_k": 1}'
    Replace 8899 with the configured scaling_controller_port if applicable.
  5. Scaling K > 1:
    • Multiple rollout servers can be added at once via scaled_k.
    • The trainer will block at the end-of-step in handle_scale_up() until all K new servers are ready. For larger K, expect more startup time before training proceeds.
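For scripted use, the same request can be issued from Python instead of curl. The sketch below mirrors the curl example above; the default port 8899 is taken from that example and should be replaced with the configured scaling_controller_port.

```python
import json
from urllib.request import Request, urlopen

def request_scale_up(scaled_k: int, port: int = 8899) -> Request:
    """Build the POST /scale_up request; scaled_k is the number of new servers."""
    payload = json.dumps({"scaled_k": scaled_k}).encode("utf-8")
    return Request(
        f"http://localhost:{port}/scale_up",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending is left to the caller, e.g.: urlopen(request_scale_up(4))
```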

Compatibility & Assumptions

  • Engine: The mechanism is validated with FSDP and vLLM (with vLLM-Ascend) on NPU in this MR.
  • Ray-based orchestration: Ray is assumed for process launch, name resolution, and resource accounting.
  • Reconfiguration point: Re-init happens between steps to keep on-step logic simple and deterministic.

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

N/A

@gemini-code-assist
Contributor

Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant new feature to AReaL: dynamic scale-up support for rollout engines during active training. This enhancement moves away from static resource allocation, enabling operators to add more inference servers on the fly via an HTTP endpoint. The system is designed to seamlessly integrate new servers into the existing training process by coordinating their launch, updating server address lists, and reconfiguring distributed communication groups within the trainers, ensuring continuous and adaptable resource utilization.

Highlights

  • Dynamic Rollout Scale-Up: Introduces the ability to dynamically add more rollout (inference) engines to a running training job, allowing for flexible resource allocation based on changing workload demands.
  • Scaling Controller Service: A new scaling_controller service is implemented using FastAPI, which exposes an HTTP endpoint to receive scale-up requests and orchestrate the launch of additional generation servers via Ray.
  • Trainer Reconfiguration: Trainers are updated to detect scale-up requests at the end of each training step. Upon detection, they wait for new servers to be ready, then re-initialize their distributed communication groups and refresh their list of available inference server addresses.
  • Configuration and Example: A new ScalingConfig dataclass is added for managing scaling parameters, and a new example YAML configuration (gsm8k_grpo_npu_scale.yaml) demonstrates how to enable and use the scale-up feature.
  • Distributed Group Management: Modifications to FSDPPPOActor and VLLMBackend ensure that distributed process groups are correctly re-initialized with the updated world size and unique group names when new servers are added.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: dynamic scaling of rollout workers. The implementation is comprehensive, touching configuration, launchers, core inference engines, and the training loop. The design using a separate scaling_controller with an HTTP endpoint is a solid approach for manual scaling. The changes to handle process group re-initialization are well-thought-out, especially the use of unique group names for each re-initialization.

My review focuses on a few areas for improvement regarding correctness, maintainability, and logging practices. I've identified a bug in the launcher logic, a minor inconsistency in the scaling controller, and an opportunity to reduce log noise in the scaling utility.

Overall, this is a great addition to the project.

@HwVanICI
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature: dynamic scaling of rollout workers. The implementation adds a scaling_controller that exposes an HTTP endpoint for triggering scale-up operations. The changes are well-structured, touching configuration, the remote inference engine, the FSDP engine, and the Ray launcher to support this new functionality. The core logic for handling scale-up requests on the trainer side is encapsulated in a new scaling.py utility.

My review focuses on the new scaling_controller.py file, where I've identified a few issues related to thread safety, correctness in state handling, and code style. These include a race condition in handling concurrent scale-up requests, an incorrect calculation of existing servers during scale-up, and some opportunities for code quality improvements. Addressing these points will make the new scaling feature more robust and maintainable.

@HwVanICI HwVanICI changed the title [WIP][Feat] Add rollout scale-up support [Feat] Add rollout scale-up support Nov 20, 2025
@dazhuangzhuang1024
Collaborator

Wouldn't it be more appropriate to call this feature "scale out"?

@garrett4wade
Collaborator

Or we can just call it "rollout scaling" since the boundary is not very clear in the context of RL.

FYI #611 is about to merge. We can implement the scaler within RolloutController.
