Tencent-Hunyuan/SAGE-GRPO

Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Mingzhe Zheng*1,2, Weijie Kong*2, Yue Wu‡2, Dengyang Jiang1, Yue Ma1, Xuanhua He1, Bin Lin2, Kaixiong Gong2, Zhao Zhong2, Liefeng Bo2, Qifeng Chen†1, Harry Yang†1

1HKUST   2Tencent Hunyuan
*Equal contribution   †Corresponding Authors   ‡Project Leader
Work done during internship at Tencent Hunyuan

SAGE-GRPO is an open-source post-training framework for aligning video generation models via GRPO, built on top of HunyuanVideo-1.5. It features a precise manifold-aware SDE for exploration, dual trust-region KL regularization, gradient norm equalization, and scalable multi-node multi-GPU training with sequence parallelism and FSDP.

Figure 1. Illustration of SAGE-GRPO. (Left) (a.1) At higher noise regions, Euler-style discretization introduces extra energy (discretization error) beyond the true integral. (a.2) Our precise SDE removes unnecessary noise energy in high-noise regions, enabling more precise exploration and a better-learned data manifold. (Right) (b) Our method with improved exploration yields more stable and better-aligned generations compared with DanceGRPO, FlowGRPO, and CPS.

Highlights

We formulate GRPO for video generation as a manifold-constrained exploration problem:

Figure 2. Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, causing off-manifold drift and temporal jitter. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise stays close to the flow trajectory and the video manifold.

  • Core Problem: We show that the ODE-to-SDE conversions used in existing video GRPO methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
  • Micro-level: We constrain exploration with a Precise Manifold-Aware SDE and a Gradient Norm Equalizer, so that sampling noise stays manifold-consistent and updates are balanced across timesteps.
  • Macro-level: We constrain long-horizon exploration with a Dual Trust Region using moving anchors and step-wise constraints, so that the trust region tracks more manifold-consistent checkpoints and prevents drift.
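The excess-noise claim can be illustrated numerically. The sketch below is not the SAGE-GRPO derivation: it assumes an illustrative noise schedule sigma(t)^2 = t / (1 - t) with t = 1 as pure noise, and compares the variance an Euler step injects (schedule evaluated at the start of the step) with the exact integral over the same step. The function names are hypothetical.

```python
import math


def sigma_sq(t: float) -> float:
    """Illustrative schedule sigma(t)^2 = t / (1 - t); an assumption, not taken from the repo."""
    return t / (1.0 - t)


def euler_energy(t: float, dt: float) -> float:
    """Noise variance injected by one Euler step from t down to t - dt."""
    return sigma_sq(t) * dt


def exact_energy(t: float, dt: float) -> float:
    """Closed-form integral of s / (1 - s) over [t - dt, t]; note the logarithm."""
    return -dt + math.log((1.0 - t + dt) / (1.0 - t))


dt = 0.05
for t in (0.9, 0.5, 0.2):
    e, x = euler_energy(t, dt), exact_energy(t, dt)
    print(f"t={t:.1f}  euler={e:.4f}  exact={x:.4f}  excess={e - x:.4f}")
```

Under this assumed schedule the Euler estimate exceeds the exact integral at every step, and the gap is largest near t = 1, which matches the qualitative picture in Figure 1 (a.1).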

Abstract

Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment.

To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable.

We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift.

We evaluate SAGE-GRPO on HunyuanVideo-1.5 using VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.

Installation

1. Clone the repository

git clone <your-fork-or-public-url>
cd SAGE-GRPO

2. Install Python dependencies

pip install -r requirements.txt

3. Download the reward model weights with the helper script

bash download_weights.sh

4. Download the remaining HunyuanVideo checkpoints

After running download_weights.sh, follow checkpoints-download.md to download the remaining base model, text encoder, and vision encoder weights.

Checkpoint Preparation

SAGE-GRPO expects both the HunyuanVideo-1.5 base checkpoints and the VideoReward reward model to be available under ./ckpts.

Useful references:

  • Base model documentation: README_HYVideo.md
  • Detailed checkpoint download instructions: checkpoints-download.md
  • Reward checkpoint helper: download_weights.sh

Expected Checkpoint Layout

ckpts/
├── assets
├── config.json
├── LICENSE
├── NOTICE
├── README.md
├── README_CN.md
├── scheduler
├── text_encoder
│   ├── byt5-small
│   ├── Glyph-SDXL-v2
│   └── llm
├── transformer
├── upsampler
├── vae
├── VideoReward
│   ├── checkpoint-11352
│   ├── model_config.json
│   └── README.md
└── vision_encoder
    └── siglip

If your local structure differs substantially from the above, training usually fails during model or reward initialization.
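Because a mismatched layout tends to surface only at initialization, it can help to check the directory before launching. The following is a minimal sketch of such a check; the helper is hypothetical (it is not part of the repository), and the entry list is copied from the expected layout above.

```python
from pathlib import Path

# Entries taken from the expected layout above; the helper itself is hypothetical.
EXPECTED_ENTRIES = [
    "config.json",
    "scheduler",
    "text_encoder/byt5-small",
    "text_encoder/Glyph-SDXL-v2",
    "text_encoder/llm",
    "transformer",
    "upsampler",
    "vae",
    "VideoReward/checkpoint-11352",
    "VideoReward/model_config.json",
    "vision_encoder/siglip",
]


def missing_checkpoint_entries(root: str = "./ckpts") -> list[str]:
    """Return the expected entries that do not exist under root."""
    base = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (base / entry).exists()]


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    if missing:
        print("Missing under ./ckpts:")
        for entry in missing:
            print(f"  {entry}")
    else:
        print("Checkpoint layout looks complete.")
```

The check only confirms that the paths exist; it does not validate the contents of any checkpoint.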

Post-Training

Hardware Recommendation

| Requirement | Recommended |
| --- | --- |
| GPU memory | 80 GB per GPU |
| GPU count | 64 GPUs (8 nodes x 8) |
| OS | Linux |
| PyTorch | 2.6+ |

Single-node multi-GPU

For a single machine with 8 GPUs:

bash run_post_train.sh

This launches post_train.py with the default GRPO configuration via torchrun --nproc_per_node=8.

Multi-node multi-GPU

For multi-node training:

bash run_post_train_multinode.sh

The multi-node entry script internally calls:

bash scripts/post_train/pdsh_train.sh "scripts/post_train/train_grpo.sh"

Before starting, edit or export the node list and the rendezvous-related environment variables expected by your cluster launcher.

Key Training Parameters

Distributed Training

The three most important distributed-training knobs are sp_size, batch_size, and num_generations.

dp_degree = world_size / sp_size

There is a validity constraint:

(batch_size * dp_degree) % num_generations == 0

| Parameter | Default | Description |
| --- | --- | --- |
| sp_size | 8 | Sequence parallel degree. Must evenly divide world_size. |
| batch_size | 2 | Per-rank video micro-batch size. |
| num_generations | 4 | Number of rollout samples per prompt in GRPO group. |
| learning_rate | 1e-5 | Learning rate. |
| max_steps | 10000 | Maximum training steps. |
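The two rules above (sp_size must divide world_size, and the global batch must be divisible by num_generations) can be checked before launching. This is a minimal sketch; the function name grpo_layout is hypothetical and not part of the repository.

```python
def grpo_layout(world_size: int, sp_size: int, batch_size: int, num_generations: int) -> dict:
    """Derive dp_degree and batch sizes, raising if the configuration is invalid."""
    if world_size % sp_size != 0:
        raise ValueError(f"sp_size={sp_size} does not evenly divide world_size={world_size}")
    dp_degree = world_size // sp_size
    global_video_batch = batch_size * dp_degree
    if global_video_batch % num_generations != 0:
        raise ValueError(
            f"batch_size * dp_degree = {global_video_batch} is not divisible "
            f"by num_generations={num_generations}"
        )
    return {
        "dp_degree": dp_degree,
        "global_video_batch": global_video_batch,
        "num_prompt_groups": global_video_batch // num_generations,
    }


# Single node, 8 GPUs, sp_size=2: dp_degree is 4 and the constraint holds.
print(grpo_layout(world_size=8, sp_size=2, batch_size=2, num_generations=4))
```

Note that on a single 8-GPU node the table default sp_size=8 gives dp_degree = 1, so batch_size * dp_degree = 2 is not divisible by num_generations = 4 and the check raises.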

SAGE-GRPO Method Parameters

These are the core parameters that distinguish SAGE-GRPO from other video GRPO methods:

Exploration (Micro-level)

| Parameter | Default | Description |
| --- | --- | --- |
| sde_type | sage_grpo | SDE type for GRPO rollout. Choices: sage_grpo, dance_grpo, flow_grpo, cps. |
| use_grad_balancing | True | Enable gradient norm equalizer across timesteps. |
| enable_timestep_permutation | True | Enable timestep permutation for training. |
Trust Region (Macro-level)

| Parameter | Default | Description |
| --- | --- | --- |
| kl_weight | 1e-5 | KL regularization weight. |
| kl_coef | 1e-7 | Initial KL coefficient. |
| kl_min_coef | 1e-7 | Lower bound for adaptive KL coefficient. |
| use_moving_KL | True | Enable periodic ref-model update (moving anchor). |
| update_ref_model_step | 10 | Ref-model update interval (optimizer update steps). |
| use_dual_kl | True | Enable dual KL: moving/fixed + step-wise constraints. |
| dual_kl_moving_weight | 1.0 | Weight for moving/fixed KL term. |
| dual_kl_step_weight | 0.1 | Weight for step-wise KL term. |
Reward & Validation

| Parameter | Default | Description |
| --- | --- | --- |
| validate_at_step0 | False | Run sample validation at step 0. |
| validate_video_length | 81 | Number of frames for validation videos. |
| validation_timestep_shift | 5.0 | Timestep shift for validation sampling. |
| reference_mode_offload | False | Offload KL reference model to CPU when not in use. |

Recommended 64-GPU Default

The default recommended large-scale setting:

world_size = 64     sp_size = 2     batch_size = 2     num_generations = 4

From this:

dp_degree           = 64 / 2              = 32
global_video_batch  = 2 * 32              = 64
num_prompt_groups   = 64 / 4              = 16

  • 32 effective data-parallel replicas
  • 64 rollout videos per GRPO sampling round
  • 16 prompts grouped globally when num_generations=4
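The grouping matters because GRPO scores each rollout relative to the other rollouts for the same prompt. The sketch below shows the standard group-relative normalization on a flat reward list. It assumes that consecutive num_generations entries belong to the same prompt and uses the population standard deviation; the repository's actual ordering and normalization details may differ, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], num_generations: int,
                              eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and std of its own prompt group."""
    if len(rewards) % num_generations != 0:
        raise ValueError("number of rewards must be a multiple of num_generations")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)  # population std; eps guards against zero spread
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompt groups of four rollouts each.
print(group_relative_advantages([1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0], num_generations=4))
```

In the 64-GPU setting above, the same function would turn 64 rewards into 16 independently normalized groups. A group whose rollouts all receive the same reward yields zero advantages and so contributes no learning signal.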

Default single-node entry

The current single-node helper (run_post_train.sh) uses:

torchrun --nproc_per_node=8 post_train.py \
  --pretrained_model_root ./ckpts \
  --learning_rate 1e-5 \
  --batch_size 2 \
  --num_generations 4 \
  --max_steps 10000 \
  --output_dir ./outputs \
  --enable_fsdp \
  --enable_gradient_checkpointing \
  --sp_size 2 \
  --sde_type "sage_grpo" \
  --use_grad_balancing True \
  --enable_timestep_permutation True \
  --kl_weight 1e-5 \
  --kl_coef 1e-7 \
  --use_moving_KL True \
  --update_ref_model_step 10 \
  --use_dual_kl True \
  --dual_kl_moving_weight 1.0 \
  --dual_kl_step_weight 0.1 \
  --reference_mode_offload True

Practical notes

  • sp_size=2 is the recommended starting point. The default in argparse is 8 but the launch script overrides it to 2.
  • batch_size=2 and num_generations=4 are the default GRPO-friendly settings.
  • If you scale down GPU count, re-check dp_degree and the divisibility constraint before launching.
  • reference_mode_offload is helpful when KL reference model memory becomes a bottleneck.
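When scaling down, one way to re-check the constraint is to list every sp_size that satisfies it for a given GPU count. The sketch below does only that arithmetic; valid_sp_sizes is a hypothetical helper, and it says nothing about whether a given sp_size fits in GPU memory.

```python
def valid_sp_sizes(world_size: int, batch_size: int = 2, num_generations: int = 4) -> list[int]:
    """List sp_size values that divide world_size and satisfy the divisibility constraint."""
    valid = []
    for sp_size in range(1, world_size + 1):
        if world_size % sp_size != 0:
            continue  # sp_size must evenly divide world_size
        dp_degree = world_size // sp_size
        if (batch_size * dp_degree) % num_generations == 0:
            valid.append(sp_size)
    return valid


for gpus in (8, 16, 32, 64):
    print(f"world_size={gpus}: valid sp_size values = {valid_sp_sizes(gpus)}")
```

With batch_size=2 and num_generations=4, any choice that leaves dp_degree odd is rejected, which is why sp_size equal to world_size never appears in the output.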

Visualization Gallery

All visual results are under assets/Visual_Results/.
For a cleaner and fully curated presentation, please visit the project webpage: SAGE-GRPO Webpage.

1. Compare with Baseline

Case 1

  • HunyuanVideo-1.5 (Baseline): case1_baseline.mp4
  • SAGE-GRPO (Ours): case1_ours_full.mp4
  • Prompt: The scene opens on a medium, low-angle shot of a teenage boy on an empty, red-surfaced running track during sunset. He is positioned on the right third of the frame, having just completed an intense sprint. He wears a striking neon green athletic jacket, unzipped to reveal a dark shirt underneath, and black running shorts. His body is bent sharply at the waist, his hands pressed firmly onto his knees for support as he struggles to recover. His dark, curly hair is damp with sweat, which also beads on his forehead and temples. His chest rises and falls rapidly and deeply, and with each ragged exhalation, a faint mist of his breath is visible in the cooling air, illuminated by the strong backlight from the setting sun. The sun, low on the horizon, casts long shadows and bathes the scene in a warm, orange glow, creating a cinematic lens flare that streaks across the frame. After a few moments of labored breathing, he slowly and painfully straightens his posture, his eyes remaining fixed on the track ahead with a look of fierce determination mixed with utter exhaustion.

Case 2

  • HunyuanVideo-1.5 (Baseline): case2_baseline.mp4
  • SAGE-GRPO (Ours): case2_ours_full.mp4
  • Prompt: The scene opens on a tranquil, sun-drenched meadow in the late afternoon. An eye-level full shot frames Isaac Newton, a man with long hair dressed in simple 17th-century clothing, sitting at the base of a large, gnarled apple tree. He leans against the trunk, positioned according to the rule of thirds, creating a sense of balance and space. Dappled sunlight streams through the leafy canopy, casting soft, moving shadows on the ground. Newton is completely absorbed in thought, his gaze distant and unfocused. A gentle breeze rustles the leaves. High above him, a ripe red apple loosens from its stem. It drops silently at first, then lands with a distinct 'thump' on top of Newton's head. He flinches, startled out of his deep thoughts, and instinctively raises a hand to the point of impact. His eyes dart upwards towards the branches, then scan the ground around him. He spots the offending red apple lying in the grass. His initial annoyance gives way to curiosity as he reaches down, picks it up, and holds it in his palm. He turns it over, examining it, and his expression slowly transforms into one of profound, dawning realization, the genesis of a revolutionary idea.

Case 3

  • HunyuanVideo-1.5 (Baseline): case3_baseline.mp4
  • SAGE-GRPO (Ours): case3_ours_full.mp4
  • Prompt: The scene opens with a stunning wide shot, filmed in slow motion from a low angle. Five children, a diverse group of boys and girls aged between six and ten, are running exuberantly across a vast field. The field is filled with tall, golden-yellow grass that sways gently in the breeze and reaches their waists. It's the golden hour, and the setting sun, positioned behind the children, creates a brilliant backlight. This light forms a radiant halo around their hair and outlines their bodies, separating them from the lush background. Dust motes and pollen kicked up by their running feet dance and sparkle in the sunbeams. The children are spread out, yet moving together as a group from right to left across the frame. Their faces are alight with pure joy; mouths are open in laughter, and their eyes are bright with excitement. One girl with long blonde pigtails leads the pack, looking back over her shoulder with a wide grin. A boy in a red t-shirt leaps playfully into the air. The slow-motion effect accentuates every detail: the bounce of their hair, the flowing fabric of their clothes, and the effortless grace of their youthful movements. The sky above is a soft, clear blue, providing a cool contrast to the warm tones of the field below. The atmosphere is overwhelmingly joyful, nostalgic, and evocative of the perfect, endless days of summer childhood.

2. Compare with Other Methods (20 steps)

| Case | DanceGRPO | FlowGRPO | CPS | Ours |
| --- | --- | --- | --- | --- |
| Showcase 1 | showcase1_dancegrpo.mp4 | showcase1_flowgrpo.mp4 | showcase1_cps.mp4 | showcase1_ours.mp4 |
| Showcase 2 | showcase2_dancegrpo.mp4 | showcase2_flowgrpo.mp4 | showcase2_cps.mp4 | showcase2_ours.mp4 |
| Showcase 3 | showcase3_dancegrpo.mp4 | showcase3_flowgrpo.mp4 | showcase3_cps.mp4 | showcase3_ours.mp4 |
| Showcase 4 | showcase4_dancegrpo.mp4 | showcase4_flowgrpo.mp4 | showcase4_cps.mp4 | showcase4_ours.mp4 |

3. Compare with Other Methods (40 steps)

| Case | DanceGRPO | FlowGRPO | CPS | Ours |
| --- | --- | --- | --- | --- |
| Case 1 | case1_dancegrpo.mp4 | case1_flowgrpo.mp4 | case1_cps.mp4 | case1_ours.mp4 |
| Case 2 | case2_dancegrpo.mp4 | case2_flowgrpo.mp4 | case2_cps.mp4 | case2_ours.mp4 |

4. KL Ablation

| Case | No KL | Standard KL | Stepwise | Moving KL | Dual Moving KL |
| --- | --- | --- | --- | --- | --- |
| Case 1 | case1_no_kl.mp4 | case1_std_kl.mp4 | case1_stepwise.mp4 | case1_moving_kl.mp4 | case1_dual_mov_kl.mp4 |
| Case 2 | case2_no_kl.mp4 | case2_std_kl.mp4 | case2_stepwise.mp4 | case2_moving_kl.mp4 | case2_dual_mov_kl.mp4 |
| Case 3 | case3_no_kl.mp4 | case3_std_kl.mp4 | case3_stepwise.mp4 | case3_moving_kl.mp4 | case3_dual_mov_kl.mp4 |
| Case 4 | case4_no_kl.mp4 | case4_std_kl.mp4 | case4_stepwise.mp4 | case4_moving_kl.mp4 | case4_dual_mov_kl.mp4 |

Acknowledgements

Citation

If you find our work useful, please consider citing:

@article{zheng2026sagegrpo,
  title={Manifold-Aware Exploration for Reinforcement Learning in Video Generation},
  author={Zheng, Mingzhe and Kong, Weijie and Wu, Yue and Jiang, Dengyang and Ma, Yue and He, Xuanhua and Lin, Bin and Gong, Kaixiong and Zhong, Zhao and Bo, Liefeng and Chen, Qifeng and Yang, Harry},
  journal={arXiv preprint arXiv:2603.21872},
  year={2026}
}

About

Official implementation of SAGE-GRPO: Manifold-Aware Exploration for Reinforcement Learning in Video Generation
