Description
Checklist
- This feature will maintain backward compatibility with the current APIs in
areal/api/. If not, please raise a refactor issue first.
Motivation
Currently, AReaL uses a static resource allocation scheme, where a fixed number of devices is allocated to the rollout and training stages. However, users sometimes start a job with fewer (or more) rollout workers than needed and only later recognize that it requires more or less generation capacity. In such scenarios, users should be able to add or remove rollout workers mid-run without restarting.
Also, as workloads evolve (for example, with algorithms like dynamic sampling), pipeline bubbles can appear even in configurations that were empirically tuned for optimal stage balance at the start of training. Scaling up or down at step boundaries helps reduce these bubbles and can lower the total training cost.
We propose a rollout scaling feature that allows users to scale rollout workers up or down dynamically after an RL fine-tuning job has started, without pausing or restarting it.
Proposed Change
At a high level, we propose adding support for dynamic rollout scaling at the instance level:
- Dynamically scale up rollout workers during training
- Dynamically scale down rollout workers during training
- Introduce an auto-scaling monitor that adjusts based on rollout utilization and training bubbles
This will be delivered across multiple milestones to enable full elastic scaling support in AReaL.
Milestone 1: Scale-Up Functionality
- Add the ability to create new vLLM instances during training.
- New rollout engines load initial weights from disk and synchronize via XCCL weight transfer on the next step.
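A minimal sketch of how the scale-up flow could look from the trainer's side, assuming a hypothetical `RolloutPool` interface (the `pool` object, `spawn_vllm_server`, `register_for_weight_sync`, and `enable_routing` are illustrative names, not AReaL's actual API):

```python
# Hypothetical scale-up flow (illustrative only; not AReaL's actual API).
from dataclasses import dataclass


@dataclass
class ScaleUpRequest:
    num_new_instances: int   # how many vLLM servers to add
    checkpoint_dir: str      # latest on-disk weights for initial load


def scale_up(pool, req: ScaleUpRequest, global_step: int):
    """Add new rollout instances without pausing training."""
    new_servers = []
    for _ in range(req.num_new_instances):
        # 1. Launch a fresh vLLM server; it loads initial weights from disk.
        server = pool.spawn_vllm_server(model_path=req.checkpoint_dir)
        new_servers.append(server)

    # 2. Register the servers so the next step's XCCL weight broadcast
    #    includes them; until then they hold the checkpoint weights.
    for server in new_servers:
        pool.register_for_weight_sync(server, effective_step=global_step + 1)

    # 3. Only start routing rollout requests once weights are in sync.
    pool.enable_routing(new_servers, after_step=global_step + 1)
    return new_servers
```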
Milestone 2: Fine-Grained Control in Ray Launcher
- Currently, LLM servers are initialized via placement groups. When scaling down, the released resources can't be reused because they remain reserved by the placement group.
- We'll lift this limitation to allow flexible resource reuse at runtime.
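One possible direction, sketched with standard Ray APIs: allocate one placement group per rollout instance instead of a single job-wide group, so an individual instance's GPUs can be returned to the cluster with `remove_placement_group` and re-acquired later. The per-instance layout is an assumed design, not the current AReaL launcher behavior.

```python
# Sketch: per-instance placement groups so resources can be released at runtime.
# Uses standard Ray APIs; the per-instance layout is an assumption.
import ray
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()


@ray.remote
class RolloutServer:                 # stand-in for a vLLM server wrapper
    def ping(self):
        return "ok"


def acquire_instance_pg(gpus_per_instance: int = 1):
    """Reserve resources for a single rollout instance."""
    pg = placement_group([{"GPU": 1, "CPU": 2}] * gpus_per_instance,
                         strategy="PACK")
    ray.get(pg.ready())              # block until the bundles are reserved
    return pg


def launch_on_pg(actor_cls, pg):
    """Pin an actor to its own placement group."""
    return actor_cls.options(
        num_gpus=1,
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
    ).remote()


def release_instance(pg):
    """Return the instance's GPUs to the cluster for reuse (scale-down / M3)."""
    remove_placement_group(pg)


pg = acquire_instance_pg()
server = launch_on_pg(RolloutServer, pg)
print(ray.get(server.ping.remote()))
release_instance(pg)                 # the GPUs are now free for other workloads
```

Because each instance owns its own group, removing one instance frees exactly its resources without disturbing the rest of the job.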
Milestone 3: Scale-Down Functionality
- Add the ability to remove vLLM instances without pausing training.
- Ensure removed resources can be reused for future scale-ups or other workloads.
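A hedged sketch of a drain-then-remove sequence for scale-down, using the same hypothetical `pool` interface as the M1 sketch (all method names are assumptions):

```python
# Hypothetical scale-down flow (illustrative; method names are assumptions).
import time


def scale_down(pool, num_to_remove: int, drain_timeout_s: float = 300.0):
    """Remove rollout instances without pausing training."""
    # 1. Pick the least-utilized instances as removal candidates.
    victims = pool.least_utilized(num_to_remove)

    # 2. Stop routing new rollout requests to them, but let in-flight
    #    generations finish (or be requeued onto surviving instances).
    pool.disable_routing(victims)
    deadline = time.time() + drain_timeout_s
    while any(pool.inflight_requests(v) > 0 for v in victims):
        if time.time() > deadline:
            pool.requeue_inflight(victims)   # hand unfinished prompts back
            break
        time.sleep(1.0)

    # 3. Shut the servers down and release their placement groups (M2)
    #    so the GPUs can be reused for future scale-ups or other workloads.
    for v in victims:
        pool.shutdown(v)
        pool.release_resources(v)
```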
Milestone 4: Monitoring & Optimization
- Implement an auto-scaling monitor that decides when to scale up or down (building on M1 and M3).
- Patch vLLM to skip disk reloads on restarts.
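A simple threshold-based policy the monitor could start from, assuming per-step metrics for rollout-worker utilization and the trainer's idle ("bubble") fraction are available; the metric names and thresholds below are illustrative, not measured values:

```python
# Illustrative auto-scaling policy; metric names and thresholds are assumptions.
from dataclasses import dataclass


@dataclass
class StepMetrics:
    rollout_utilization: float   # mean fraction of time rollout GPUs are busy
    trainer_bubble_frac: float   # fraction of the step the trainer waits on data


def scaling_decision(m: StepMetrics,
                     high_util: float = 0.90,
                     low_util: float = 0.50,
                     bubble_limit: float = 0.15) -> int:
    """Return +1 to scale up, -1 to scale down, 0 to hold.

    Evaluated only at step boundaries, so no overhead is added inside a step.
    """
    if m.trainer_bubble_frac > bubble_limit and m.rollout_utilization > high_util:
        return +1   # trainer is starved and rollout is saturated -> add instances
    if m.rollout_utilization < low_util and m.trainer_bubble_frac < bubble_limit:
        return -1   # rollout capacity is idle -> remove instances
    return 0
```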
Milestone Overview
| Milestone | Description |
|---|---|
| M1 — Dynamic Rollout Scale-Up | Add new vLLM instances during training with XCCL weight updates |
| M2 — Ray Launcher Control Improvements | Enable fine-grained resource allocation and reuse |
| M3 — Dynamic Rollout Scale-Down | Remove rollout instances safely and free up resources |
| M4 — Monitoring & Optimization | Optimize total training cost by monitoring when to scale up/down |
Potential Solution
The design must remain compatible with workloads where static allocation is already optimal: no overhead should be introduced when scaling is disabled.
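For example, elasticity could be gated behind a single opt-in config flag so the static path is untouched when it is off (the field names below are hypothetical):

```python
# Hypothetical config sketch: scaling is opt-in, so the static allocation
# path carries zero overhead when the feature is disabled.
from dataclasses import dataclass


@dataclass
class ElasticRolloutConfig:
    enabled: bool = False           # off by default: current static behavior
    min_instances: int = 1
    max_instances: int = 8
    check_interval_steps: int = 10  # how often the monitor (M4) is consulted
```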
Additional Information
References
- Cosmos RL Elastic Scaling: https://nvidia-cosmos.github.io/cosmos-rl/elastic/overview.html
- Dynamic Sampling: DAPO: An Open-Source LLM Reinforcement Learning System at Scale