
[Feature] Dynamic rollout instance scaling #606

@HwVanICI

Description


Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Motivation

Currently, AReaL uses a static resource allocation scheme: a fixed number of devices is assigned to the rollout and training stages for the lifetime of a job. However, users may start with fewer (or more) rollout workers than needed and only later recognize that the job requires more or less generation capacity. In such scenarios, users should be able to add or remove rollout workers mid-run without restarting the job.

Also, as workloads evolve (for example, with algorithms like dynamic sampling), pipeline bubbles can appear even in configurations that were empirically tuned for optimal stage balance at the start of the job. Scaling up or down at step boundaries helps reduce these bubbles and can lower the total training cost.

We propose a rollout scaling feature that lets users dynamically scale rollout workers up or down after an RL fine-tuning job has started, without pausing or restarting the job.

Proposed Change

At a high level, we propose adding support for dynamic rollout scaling at the instance level:

  • Dynamically scale up rollout workers during training
  • Dynamically scale down rollout workers during training
  • Introduce an auto-scaling monitor that adjusts the number of rollout workers based on rollout utilization and training bubbles

This will be delivered across multiple milestones to enable full elastic scaling support in AReaL.
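
For illustration, here is a minimal sketch of what a step-boundary scaling interface could look like. All names (ScalingRequest, RolloutScaler, request_scaling, apply_pending) are hypothetical and are not part of the current areal/api/ surface; the point is only that scaling requests are queued and applied at training step boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class ScalingRequest:
    delta: int           # >0 adds rollout workers, <0 removes them
    apply_at_step: int   # change takes effect at this training step boundary


@dataclass
class RolloutScaler:
    num_workers: int
    pending: list = field(default_factory=list)

    def request_scaling(self, delta: int, apply_at_step: int) -> None:
        """Queue a change; nothing happens until the target step boundary."""
        self.pending.append(ScalingRequest(delta, apply_at_step))

    def apply_pending(self, step: int) -> int:
        """Called once per step by the training loop; returns the net delta
        actually applied so the launcher can spawn or tear down servers."""
        due = [r for r in self.pending if r.apply_at_step <= step]
        self.pending = [r for r in self.pending if r.apply_at_step > step]
        new_count = max(1, self.num_workers + sum(r.delta for r in due))
        applied = new_count - self.num_workers
        self.num_workers = new_count
        return applied
```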

Milestone 1: Scale-Up Functionality

  • Add the ability to create new vLLM instances during training.
  • New rollout engines load initial weights from disk and synchronize via XCCL weight transfer on the next step.
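
A rough sketch of the intended M1 flow is below, assuming hypothetical spawn_vllm_server and rebuild_weight_update_group helpers on the launcher side; the actual entry points in AReaL may differ.

```python
def scale_up_rollout(launcher, rollout_servers, ckpt_path, num_new):
    """Add num_new vLLM instances without pausing training (hypothetical API)."""
    for _ in range(num_new):
        # 1. The new instance bootstraps from the latest on-disk checkpoint so it
        #    can serve requests immediately, even if the weights are slightly stale.
        server = launcher.spawn_vllm_server(model_path=ckpt_path)
        rollout_servers.append(server)

    # 2. Re-form the XCCL (NCCL-style) weight-update group to include the new
    #    ranks; on the next training step they receive fresh weights via the
    #    same broadcast as the existing rollout workers.
    launcher.rebuild_weight_update_group(rollout_servers)
    return rollout_servers
```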

Milestone 2: Fine-Grained Control in Ray Launcher

  • Currently, LLM servers are launched inside Ray placement groups. When scaling down, the released devices cannot be reused because they remain reserved by the placement group.
  • We will lift this limitation so that resources can be flexibly reused at runtime.
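
One possible direction, sketched below with public Ray APIs: give each rollout server its own placement group instead of one job-wide group, so that tearing down an instance actually returns its GPUs to the Ray scheduler. The RolloutServer actor is a stand-in for the real vLLM server actor, not the existing AReaL class.

```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote(num_gpus=1)
class RolloutServer:  # placeholder for the actual vLLM server actor
    def ping(self):
        return "ok"


def launch_instance():
    # One placement group per instance, so it can be released independently.
    pg = placement_group([{"GPU": 1, "CPU": 4}], strategy="PACK")
    ray.get(pg.ready())
    actor = RolloutServer.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
    return actor, pg


def release_instance(actor, pg):
    ray.kill(actor)
    remove_placement_group(pg)  # the GPUs become schedulable for other workloads
```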

Milestone 3: Scale-Down Functionality

  • Add the ability to remove vLLM instances without pausing training.
  • Ensure removed resources can be reused for future scale-ups or other workloads.
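
A minimal sketch of a safe scale-down sequence, assuming hypothetical per-server methods (num_inflight_requests, shutdown); the real vLLM server interface in AReaL may differ.

```python
import time


def scale_down_rollout(rollout_servers, num_remove, drain_timeout_s=120.0):
    """Remove num_remove instances without pausing training (hypothetical API)."""
    victims = rollout_servers[-num_remove:]
    del rollout_servers[-num_remove:]        # 1. stop routing new requests to them

    deadline = time.time() + drain_timeout_s
    for server in victims:
        # 2. let in-flight generations finish, with a timeout as a safety net
        while server.num_inflight_requests() > 0 and time.time() < deadline:
            time.sleep(1.0)
        # 3. terminate the instance; its devices are released via the M2 launcher
        #    changes so they can back a later scale-up or another workload
        server.shutdown()
    return victims
```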

Milestone 4: Monitoring & Optimization

  • Implement an auto-scaling monitor that decides when to scale up or down (building on M1 and M3).
  • Patch vLLM to skip disk reloads on restarts.
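
To make the monitor's decision rule concrete, a toy threshold policy is sketched below; the thresholds and the utilization metric (for example, the fraction of a step the trainer spends waiting on generation) are assumptions for illustration, not measured values or existing AReaL APIs.

```python
def decide_scaling(rollout_utilization: float, num_workers: int,
                   high: float = 0.9, low: float = 0.4,
                   min_workers: int = 1, max_workers: int = 64) -> int:
    """Return +1 to add a rollout worker, -1 to remove one, or 0 to keep the
    current allocation. Evaluated once per training step (or every N steps)."""
    if rollout_utilization > high and num_workers < max_workers:
        return 1
    if rollout_utilization < low and num_workers > min_workers:
        return -1
    return 0
```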

Milestone Overview

| Milestone | Description |
| --- | --- |
| M1 — Dynamic Rollout Scale-Up | Add new vLLM instances during training with XCCL weight updates |
| M2 — Ray Launcher Control Improvements | Enable fine-grained resource allocation and reuse |
| M3 — Dynamic Rollout Scale-Down | Remove rollout instances safely and free up resources |
| M4 — Monitoring & Optimization | Optimize total training cost by monitoring when to scale up/down |

Potential Solution

The feature must remain compatible with workloads where static allocation is already optimal; no overhead should be introduced when scaling is disabled.
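
As one way to guarantee this, scaling could be gated behind an opt-in config block (the field names below are illustrative), so jobs that keep the default static allocation never enter the new code path.

```python
from dataclasses import dataclass


@dataclass
class RolloutScalingConfig:
    enabled: bool = False           # off by default: identical to today's behavior
    min_workers: int = 1
    max_workers: int = 8
    check_interval_steps: int = 10  # how often the monitor evaluates utilization
```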

