[Feat] Add rollout scale-up support #592
Conversation
Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant new feature to AReaL: dynamic scale-up support for rollout engines during active training. This enhancement moves away from static resource allocation, enabling operators to add more inference servers on the fly via an HTTP endpoint. The system is designed to seamlessly integrate new servers into the existing training process by coordinating their launch, updating server address lists, and reconfiguring distributed communication groups within the trainers, ensuring continuous and adaptable resource utilization.
Code Review
This pull request introduces a significant new feature: dynamic scaling of rollout workers. The implementation is comprehensive, touching configuration, launchers, core inference engines, and the training loop. The design using a separate scaling_controller with an HTTP endpoint is a solid approach for manual scaling. The changes to handle process group re-initialization are well-thought-out, especially the use of unique group names for each re-initialization.
My review focuses on a few areas for improvement regarding correctness, maintainability, and logging practices. I've identified a bug in the launcher logic, a minor inconsistency in the scaling controller, and an opportunity to reduce log noise in the scaling utility.
Overall, this is a great addition to the project.
/gemini review
Code Review
This pull request introduces a significant new feature: dynamic scaling of rollout workers. The implementation adds a scaling_controller that exposes an HTTP endpoint for triggering scale-up operations. The changes are well-structured, touching configuration, the remote inference engine, the FSDP engine, and the Ray launcher to support this new functionality. The core logic for handling scale-up requests on the trainer side is encapsulated in a new scaling.py utility.
My review focuses on the new scaling_controller.py file, where I've identified a few issues related to thread safety, correctness in state handling, and code style. These include a race condition in handling concurrent scale-up requests, an incorrect calculation of existing servers during scale-up, and some opportunities for code quality improvements. Addressing these points will make the new scaling feature more robust and maintainable.
Wouldn't it be more appropriate to call this feature "scale out"?
Or we can just call it "rollout scaling" since the boundary is not very clear in the context of RL. FYI #611 is about to merge. We can implement the scaler within
Description
Currently, AReaL uses a static resource allocation scheme, where a fixed number of devices is assigned to the rollout and training stages. This PR introduces rollout scale-up: additional rollout engines can be added to a running training job. This benefits workloads whose optimal resource allocation ratio changes during training.
Design Summary
- A new `scaling_controller` that exposes HTTP endpoints for scaling actions.
- On a scale-up request, the `scaling_controller` launches additional generation servers (validated with a vLLM-Ascend backend in our tests, but the design is engine-agnostic).
- Trainers call `handle_scale_up()` (from `utils/scaling.py`) at the end of each step when there is a pending `scale_up_request` (see the sketch after this list):
  - Once the new servers are up, the `scaling_controller` posts a `scale_up_done` signal via `name_resolve`.
  - The trainer calls `actor (fsdp_engine)._re_init_weight_update_from_distributed()`, which in turn invokes `actor._init_weight_update_from_distributed()`. The actor then recreates the process group with the new world size to include the newly added generation servers.
  - Generation servers recreate the process group via `init_weights_update_group()` to include both previous and newly added generation servers.
  - `remote_inf_engine.py::refresh_addresses()` refreshes endpoints to old + new server addresses.
  - Server selection (`choose_server`) is adjusted so that at the start of a step it includes newly added servers when routing requests.
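For illustration, here is a minimal sketch of the trainer-side flow described above. It is not the actual `utils/scaling.py` implementation: the `name_resolve` method names and keys used here are assumptions, and only `handle_scale_up()`, `refresh_addresses()`, and `_re_init_weight_update_from_distributed()` come from this PR.

```python
# Illustrative sketch only; name_resolve method names and keys are assumed.
def handle_scale_up(actor, inf_engine, name_resolve) -> bool:
    """Apply a pending scale-up at the end of a training step.

    Returns True if a scale-up was applied, False otherwise.
    """
    done_key = "scale_up_done"  # hypothetical name_resolve key
    if not name_resolve.exists(done_key):  # assumed API: exists/get/delete
        return False

    # Refresh the endpoint list to old + new generation server addresses.
    new_addresses = name_resolve.get("rollout/server_addresses")  # hypothetical key
    inf_engine.refresh_addresses(new_addresses)

    # Recreate the weight-update process group with the new world size so
    # weight broadcasts also reach the newly added generation servers.
    actor._re_init_weight_update_from_distributed()

    # Clear the signal so the next step does not re-apply the same request.
    name_resolve.delete(done_key)
    return True
```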
Workflow (rollout engine scale-up)
1. `POST /scale_up { "scaled_k": K }` received by `scaling_controller` (a minimal controller sketch follows this list).
2. `scaling_controller` allocates resources and launches K new generation servers.
3. `scaling_controller` signals `scale_up_done` via `name_resolve`.
4. Trainers call `handle_scale_up()` to perform distributed re-initialization and address refresh.
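The sketch below shows what the scale-up endpoint could look like. It is illustrative only: the real `scaling_controller.py` may use a different HTTP stack, and `launch_generation_servers()` and `signal_scale_up_done()` are hypothetical stand-ins for the launcher and `name_resolve` logic.

```python
# Illustrative only; the two helpers are hypothetical stand-ins.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ScaleUpRequest(BaseModel):
    scaled_k: int  # number of generation servers to add


def launch_generation_servers(k: int) -> list[str]:
    # Stub: the real controller would allocate resources (e.g. via the Ray
    # launcher) and start k new generation servers, returning their addresses.
    return [f"127.0.0.1:{30000 + i}" for i in range(k)]


def signal_scale_up_done(addresses: list[str]) -> None:
    # Stub: the real controller would publish the refreshed address list and
    # a scale_up_done signal via name_resolve for trainers to consume.
    print("scale_up_done:", addresses)


@app.post("/scale_up")
def scale_up(req: ScaleUpRequest):
    new_addrs = launch_generation_servers(req.scaled_k)
    signal_scale_up_done(new_addrs)
    return {"status": "ok", "new_servers": new_addrs}
```

With such a controller listening on `scaling_controller_port`, the curl command in the Instructions below would hit this endpoint.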
Instructions
- Use the example config `examples/math/gsm8k_grpo_npu_scale.yaml`, which sets `DP=16` for rollout and `DP=16` for training across 2 nodes in the AReaL YAML config.
- Make sure `scaling_controller_port` is set.
- Trigger a scale-up with:

  ```bash
  curl -X POST http://localhost:8899/scale_up \
    -H "Content-Type: application/json" \
    -d '{"scaled_k": 1}'
  ```

  Replace `8899` with the configured `scaling_controller_port` if applicable. The number of servers to add is given by `scaled_k`.
- Training blocks in `handle_scale_up()` until all K new servers are ready. For larger K, expect more startup time before training proceeds.

Compatibility & Assumptions
Related Issue
N/A
Type of Change
New feature (non-breaking change that adds functionality)
Checklist
- `jb build docs/`
- `/gemini review`

Breaking Change Details (if applicable):
N/A
Additional Context
N/A