Description
Checklist
- This feature will maintain backward compatibility with the current APIs in
areal/api/. If not, please raise a refactor issue first.
Motivation
Currently, AReaL uses a static resource allocation scheme, where a fixed number of devices is allocated to the rollout and training stages. However, users sometimes start a job with fewer (or more) rollout workers than needed and only later recognize that it requires more or less generation capacity. In such scenarios, users should be able to add or remove rollout workers mid-run without restarting.
Also, as workloads evolve (for example, with algorithms like dynamic sampling), pipeline bubbles can appear even in configurations that were empirically tuned for optimal stage balance at the start of training. Scaling up or down at step boundaries helps reduce these bubbles and can lower the total training cost.
We propose a rollout scaling feature that allows users to scale rollout workers up or down dynamically after an RL fine-tuning job has started, without pausing or restarting it.
Proposed Change
At a high level, we propose adding support for dynamic rollout scaling at the instance level:
- Dynamically scale up rollout workers during training
- Dynamically scale down rollout workers during training
- Introduce an auto-scaling monitor that adjusts based on rollout utilization and training bubbles
This will be delivered across multiple milestones to enable full elastic scaling support in AReaL.
Milestone 1: Scale-Up Functionality
- Add the ability to create new vLLM instances during training.
- New rollout engines load initial weights from disk and synchronize via XCCL weight transfer on the next step.
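A minimal sketch of how the scale-up flow could look from the trainer's side, assuming a hypothetical `RolloutPool` interface (the `pool` object, `spawn_vllm_server`, `register_for_weight_sync`, and `enable_routing` are illustrative names, not AReaL's actual API):

```python
# Hypothetical scale-up flow (illustrative only; not AReaL's actual API).
from dataclasses import dataclass


@dataclass
class ScaleUpRequest:
    num_new_instances: int   # how many vLLM servers to add
    checkpoint_dir: str      # latest on-disk weights for initial load


def scale_up(pool, req: ScaleUpRequest, global_step: int):
    """Add new rollout instances without pausing training."""
    new_servers = []
    for _ in range(req.num_new_instances):
        # 1. Launch a fresh vLLM server; it loads initial weights from disk.
        server = pool.spawn_vllm_server(model_path=req.checkpoint_dir)
        new_servers.append(server)

    # 2. Register the servers so the next step's XCCL weight broadcast
    #    includes them; until then they hold the checkpoint weights.
    for server in new_servers:
        pool.register_for_weight_sync(server, effective_step=global_step + 1)

    # 3. Only start routing rollout requests once weights are in sync.
    pool.enable_routing(new_servers, after_step=global_step + 1)
    return new_servers
```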
Milestone 2: Fine-Grained Control in Ray Launcher
- Currently, LLM servers are initialized via placement groups. When scaling down, the released resources can't be reused because they remain reserved by the placement group.
- We'll lift this limitation to allow flexible resource reuse at runtime.
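One possible direction, sketched with standard Ray APIs: allocate one placement group per rollout instance instead of a single job-wide group, so an individual instance's GPUs can be returned to the cluster with `remove_placement_group` and re-acquired later. The per-instance layout is an assumed design, not the current AReaL launcher behavior.

```python
# Sketch: per-instance placement groups so resources can be released at runtime.
# Uses standard Ray APIs; the per-instance layout is an assumption.
import ray
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()


@ray.remote
class RolloutServer:                 # stand-in for a vLLM server wrapper
    def ping(self):
        return "ok"


def acquire_instance_pg(gpus_per_instance: int = 1):
    """Reserve resources for a single rollout instance."""
    pg = placement_group([{"GPU": 1, "CPU": 2}] * gpus_per_instance,
                         strategy="PACK")
    ray.get(pg.ready())              # block until the bundles are reserved
    return pg


def launch_on_pg(actor_cls, pg):
    """Pin an actor to its own placement group."""
    return actor_cls.options(
        num_gpus=1,
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
    ).remote()


def release_instance(pg):
    """Return the instance's GPUs to the cluster for reuse (scale-down / M3)."""
    remove_placement_group(pg)


pg = acquire_instance_pg()
server = launch_on_pg(RolloutServer, pg)
print(ray.get(server.ping.remote()))
release_instance(pg)                 # the GPUs are now free for other workloads
```

Because each instance owns its own group, removing one instance frees exactly its resources without disturbing the rest of the job.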
Milestone 3: Scale-Down Functionality
- Add the ability to remove vLLM instances without pausing training.
- Ensure removed resources can be reused for future scale-ups or other workloads.
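A hedged sketch of a drain-then-remove sequence for scale-down, using the same hypothetical `pool` interface as the M1 sketch (all method names are assumptions):

```python
# Hypothetical scale-down flow (illustrative; method names are assumptions).
import time


def scale_down(pool, num_to_remove: int, drain_timeout_s: float = 300.0):
    """Remove rollout instances without pausing training."""
    # 1. Pick the least-utilized instances as removal candidates.
    victims = pool.least_utilized(num_to_remove)

    # 2. Stop routing new rollout requests to them, but let in-flight
    #    generations finish (or be requeued onto surviving instances).
    pool.disable_routing(victims)
    deadline = time.time() + drain_timeout_s
    while any(pool.inflight_requests(v) > 0 for v in victims):
        if time.time() > deadline:
            pool.requeue_inflight(victims)   # hand unfinished prompts back
            break
        time.sleep(1.0)

    # 3. Shut the servers down and release their placement groups (M2)
    #    so the GPUs can be reused for future scale-ups or other workloads.
    for v in victims:
        pool.shutdown(v)
        pool.release_resources(v)
```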
Milestone 4: Monitoring & Optimization
- Implement an auto-scaling monitor that decides when to scale up or down (building on M1 and M3).
- Patch vLLM to skip disk reloads on restarts.
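A simple threshold-based policy the monitor could start from, assuming per-step metrics for rollout-worker utilization and the trainer's idle ("bubble") fraction are available; the metric names and thresholds below are illustrative, not measured values:

```python
# Illustrative auto-scaling policy; metric names and thresholds are assumptions.
from dataclasses import dataclass


@dataclass
class StepMetrics:
    rollout_utilization: float   # mean fraction of time rollout GPUs are busy
    trainer_bubble_frac: float   # fraction of the step the trainer waits on data


def scaling_decision(m: StepMetrics,
                     high_util: float = 0.90,
                     low_util: float = 0.50,
                     bubble_limit: float = 0.15) -> int:
    """Return +1 to scale up, -1 to scale down, 0 to hold.

    Evaluated only at step boundaries, so no overhead is added inside a step.
    """
    if m.trainer_bubble_frac > bubble_limit and m.rollout_utilization > high_util:
        return +1   # trainer is starved and rollout is saturated -> add instances
    if m.rollout_utilization < low_util and m.trainer_bubble_frac < bubble_limit:
        return -1   # rollout capacity is idle -> remove instances
    return 0
```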
Milestone Overview
| Milestone | Description |
|---|---|
| M1 — Dynamic Rollout Scale-Up | Add new vLLM instances during training with XCCL weight updates |
| M2 — Ray Launcher Control Improvements | Enable fine-grained resource allocation and reuse |
| M3 — Dynamic Rollout Scale-Down | Remove rollout instances safely and free up resources |
| M4 — Monitoring & Optimization | Optimize total training cost by monitoring when to scale up/down |
Potential Solution
The design must remain compatible with workloads where static allocation is already optimal: no overhead should be introduced when scaling is disabled.
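For example, elasticity could be gated behind a single opt-in config flag so the static path is untouched when it is off (the field names below are hypothetical):

```python
# Hypothetical config sketch: scaling is opt-in, so the static allocation
# path carries zero overhead when the feature is disabled.
from dataclasses import dataclass


@dataclass
class ElasticRolloutConfig:
    enabled: bool = False           # off by default: current static behavior
    min_instances: int = 1
    max_instances: int = 8
    check_interval_steps: int = 10  # how often the monitor (M4) is consulted
```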
Additional Information
References
- Cosmos RL Elastic Scaling: https://nvidia-cosmos.github.io/cosmos-rl/elastic/overview.html
- Dynamic Sampling: DAPO: An Open-Source LLM Reinforcement Learning System at Scale