GitHub - Tencent-Hunyuan/SAGE-GRPO: Official Implementation of SAGE-GRPO: Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Mingzhe Zheng¹,²*, Weijie Kong²*, Yue Wu²‡, Dengyang Jiang¹, Yue Ma¹, Xuanhua He¹, Bin Lin², Kaixiong Gong², Zhao Zhong², Liefeng Bo², Qifeng Chen¹†, Harry Yang¹†

¹HKUST ²Tencent Hunyuan
*Equal contribution †Corresponding authors ‡Project leader
Work done during internship at Tencent Hunyuan

SAGE-GRPO is an open-source post-training framework for aligning video generation models via GRPO, built on top of HunyuanVideo-1.5. It features a precise manifold-aware SDE for exploration, dual trust-region KL regularization, gradient norm equalization, and scalable multi-node multi-GPU training with sequence parallelism and FSDP.

Figure 1. Illustration of SAGE-GRPO. (Left) (a.1) At higher noise regions, Euler-style discretization introduces extra energy (discretization error) beyond the true integral. (a.2) Our precise SDE removes unnecessary noise energy in high-noise regions, enabling more precise exploration and a better-learned data manifold. (Right) (b) Our method with improved exploration yields more stable and better-aligned generations compared with DanceGRPO, FlowGRPO, and CPS.

Highlights

We formulate GRPO for video generation as a manifold-constrained exploration problem:

Figure 2. Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, causing off-manifold drift and temporal jitter. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise stays close to the flow trajectory and the video manifold.

Core Problem: We show that the ODE-to-SDE conversions used in existing video GRPO methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
Micro-level: We constrain exploration with a Precise Manifold-Aware SDE and a Gradient Norm Equalizer, so that sampling noise stays manifold-consistent and updates are balanced across timesteps.
Macro-level: We constrain long-horizon exploration with a Dual Trust Region using moving anchors and step-wise constraints, so that the trust region tracks more manifold-consistent checkpoints and prevents drift.

Abstract

Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise. The excess noise lowers rollout quality and makes reward estimates less reliable, which in turn destabilizes post-training alignment.

To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable.

We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift.

We evaluate SAGE-GRPO on HunyuanVideo-1.5 using VideoAlign as the reward model and observe consistent gains over previous methods on the reward dimensions VQ (visual quality), MQ (motion quality), and TA (text alignment), as well as on CLIPScore and PickScore, indicating improvements in both reward maximization and overall video quality.

Installation

1. Clone the repository

git clone <your-fork-or-public-url>
cd SAGE-GRPO

2. Install Python dependencies

pip install -r requirements.txt

3. Download the reward model weights with the helper script

bash download_weights.sh

4. Download the remaining HunyuanVideo checkpoints

After running download_weights.sh, follow checkpoints-download.md to download the remaining base model, text encoder, and vision encoder weights.

Checkpoint Preparation

SAGE-GRPO expects both the HunyuanVideo-1.5 base checkpoints and the VideoReward reward model to be available under ./ckpts.

Useful references:

Base model documentation: README_HYVideo.md
Detailed checkpoint download instructions: checkpoints-download.md
Reward checkpoint helper: download_weights.sh

Expected Checkpoint Layout

ckpts/
├── assets
├── config.json
├── LICENSE
├── NOTICE
├── README.md
├── README_CN.md
├── scheduler
├── text_encoder
│   ├── byt5-small
│   ├── Glyph-SDXL-v2
│   └── llm
├── transformer
├── upsampler
├── vae
├── VideoReward
│   ├── checkpoint-11352
│   ├── model_config.json
│   └── README.md
└── vision_encoder
    └── siglip

If your local structure differs substantially from the above, training usually fails during model or reward initialization.
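
A quick pre-flight check can catch layout problems before a long launch. The sketch below is not part of the repository: the helper name `missing_checkpoint_paths` is hypothetical, and the path list simply mirrors the layout shown above (only existence is checked, since the tree does not say which entries are files and which are directories).

```python
from pathlib import Path

# Entries copied from the expected layout above; edit if your layout changes.
EXPECTED_PATHS = [
    "config.json",
    "scheduler",
    "text_encoder/byt5-small",
    "text_encoder/Glyph-SDXL-v2",
    "text_encoder/llm",
    "transformer",
    "upsampler",
    "vae",
    "VideoReward/checkpoint-11352",
    "VideoReward/model_config.json",
    "vision_encoder/siglip",
]


def missing_checkpoint_paths(root="./ckpts"):
    """Return expected entries that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    problems = missing_checkpoint_paths()
    if problems:
        print("Missing under ./ckpts:")
        for rel in problems:
            print("  " + rel)
    else:
        print("Checkpoint layout looks complete.")
```

Running this from the repository root before training lists any missing entries; an empty result only means the paths exist, not that the weights inside them are valid.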

Post-Training

Hardware Recommendation

Requirement	Recommended
GPU memory	80 GB per GPU
GPU count	64 GPUs (8 nodes x 8)
OS	Linux
PyTorch	2.6+

Single-node multi-GPU

For a single machine with 8 GPUs:

bash run_post_train.sh

This launches post_train.py with the default GRPO configuration via torchrun --nproc_per_node=8.

Multi-node multi-GPU

For multi-node training:

bash run_post_train_multinode.sh

The multi-node entry internally calls:

bash scripts/post_train/pdsh_train.sh "scripts/post_train/train_grpo.sh"

Before starting, edit or export the node list and the rendezvous-related environment variables that your cluster launcher expects.

Key Training Parameters

Distributed Training

The three most important distributed-training knobs are sp_size, batch_size, and num_generations.

dp_degree = world_size / sp_size

There is a validity constraint:

(batch_size * dp_degree) % num_generations == 0

Parameter	Default	Description
`sp_size`	8	Sequence parallel degree. Must evenly divide `world_size`.
`batch_size`	2	Per-rank video micro-batch size.
`num_generations`	4	Number of rollout samples per prompt in GRPO group.
`learning_rate`	1e-5	Learning rate.
`max_steps`	10000	Maximum training steps.
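
The relation between these knobs can be checked before launching. The following is a minimal sketch of the dp_degree formula and the validity constraint stated above; the function name `grpo_layout` is hypothetical and the repository may perform this check differently.

```python
def grpo_layout(world_size, sp_size, batch_size, num_generations):
    """Derive the data-parallel layout and check the GRPO divisibility constraint."""
    if world_size % sp_size != 0:
        raise ValueError(f"sp_size={sp_size} must evenly divide world_size={world_size}")
    dp_degree = world_size // sp_size
    global_video_batch = batch_size * dp_degree
    if global_video_batch % num_generations != 0:
        raise ValueError(
            f"(batch_size * dp_degree) = {global_video_batch} is not a multiple "
            f"of num_generations = {num_generations}"
        )
    return {
        "dp_degree": dp_degree,
        "global_video_batch": global_video_batch,
        "num_prompt_groups": global_video_batch // num_generations,
    }


# Single node with 8 GPUs and the launch-script settings.
print(grpo_layout(world_size=8, sp_size=2, batch_size=2, num_generations=4))
# {'dp_degree': 4, 'global_video_batch': 8, 'num_prompt_groups': 2}
```

A configuration that raises `ValueError` here would leave at least one GRPO group incomplete, so it is worth fixing before any GPUs are allocated.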

SAGE-GRPO Method Parameters

These are the core parameters that distinguish SAGE-GRPO from other video GRPO methods:

Exploration (Micro-level)

Parameter	Default	Description
`sde_type`	`sage_grpo`	SDE type for GRPO rollout. Choices: `sage_grpo`, `dance_grpo`, `flow_grpo`, `cps`.
`use_grad_balancing`	`True`	Enable gradient norm equalizer across timesteps.
`enable_timestep_permutation`	`True`	Enable timestep permutation for training.

Trust Region (Macro-level)

Parameter	Default	Description
`kl_weight`	1e-5	KL regularization weight.
`kl_coef`	1e-7	Initial KL coefficient.
`kl_min_coef`	1e-7	Lower bound for adaptive KL coefficient.
`use_moving_KL`	`True`	Enable periodic ref-model update (moving anchor).
`update_ref_model_step`	10	Ref-model update interval (optimizer update steps).
`use_dual_kl`	`True`	Enable dual KL: moving/fixed + step-wise constraints.
`dual_kl_moving_weight`	1.0	Weight for moving/fixed KL term.
`dual_kl_step_weight`	0.1	Weight for step-wise KL term.
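
To make the interplay of these settings concrete, here is a minimal sketch of one plausible reading of the table: the two KL terms are combined as a weighted sum, and the reference model is refreshed every `update_ref_model_step` optimizer steps. This is an assumption for illustration, not the logic in post_train.py. Both function names are hypothetical, and the actual loss may scale or combine the terms differently (for example through `kl_weight` or `kl_coef`).

```python
def dual_kl_penalty(kl_to_anchor, kl_stepwise, moving_weight=1.0, step_weight=0.1):
    """Weighted sum of the anchor KL and the step-wise KL (assumed combination rule)."""
    return moving_weight * kl_to_anchor + step_weight * kl_stepwise


def should_refresh_anchor(optimizer_step, update_ref_model_step=10):
    """True on optimizer steps where the moving anchor would be re-synced (assumed schedule)."""
    return optimizer_step > 0 and optimizer_step % update_ref_model_step == 0


# With the default interval, the anchor moves at steps 10, 20, 30, ...
refresh_steps = [s for s in range(1, 41) if should_refresh_anchor(s)]
print(refresh_steps)  # [10, 20, 30, 40]

# With the default weights, the step-wise term contributes one tenth of its raw value.
print(dual_kl_penalty(kl_to_anchor=0.2, kl_stepwise=0.5))
```

Under this reading, a shorter `update_ref_model_step` lets the anchor follow the policy more closely, while a larger `dual_kl_step_weight` tightens the per-step constraint.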

Reward & Validation

Parameter	Default	Description
`validate_at_step0`	`False`	Run sample validation at step 0.
`validate_video_length`	81	Number of frames for validation videos.
`validation_timestep_shift`	5.0	Timestep shift for validation sampling.
`reference_mode_offload`	`False`	Offload KL reference model to CPU when not in use.

Recommended 64-GPU Default

The recommended default for large-scale training:

world_size = 64     sp_size = 2     batch_size = 2     num_generations = 4

From this:

dp_degree           = 64 / 2              = 32
global_video_batch  = 2 * 32              = 64
num_prompt_groups   = 64 / 4              = 16

32 effective data-parallel replicas
64 rollout videos per GRPO sampling round
16 prompts grouped globally when num_generations=4
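
The grouping matters because GRPO scores each rollout relative to the other rollouts of the same prompt. The sketch below applies the usual group-wise mean and standard-deviation normalization from GRPO to 64 rewards split into 16 groups of 4. It assumes rewards are ordered so that each consecutive run of `num_generations` values belongs to one prompt, which may not match how the repository gathers rewards across ranks; the function name, the use of the population standard deviation, and `eps` are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_generations, eps=1e-6):
    """Normalize each reward against the other rollouts of the same prompt."""
    if len(rewards) % num_generations != 0:
        raise ValueError("number of rewards must be a multiple of num_generations")
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        sigma = pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# 64 rollout rewards -> 16 prompt groups of 4 (made-up reward values).
rewards = [0.1, 0.4, 0.5, 0.8] * 16
advantages = group_relative_advantages(rewards, num_generations=4)
print(len(advantages), len(advantages) // 4)  # 64 16
```

Within each group the advantages sum to roughly zero, so a rollout is only rewarded for being better than its siblings, not for the prompt being easy.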

Default single-node entry

The current single-node helper (run_post_train.sh) uses:

torchrun --nproc_per_node=8 post_train.py \
  --pretrained_model_root ./ckpts \
  --learning_rate 1e-5 \
  --batch_size 2 \
  --num_generations 4 \
  --max_steps 10000 \
  --output_dir ./outputs \
  --enable_fsdp \
  --enable_gradient_checkpointing \
  --sp_size 2 \
  --sde_type "sage_grpo" \
  --use_grad_balancing True \
  --enable_timestep_permutation True \
  --kl_weight 1e-5 \
  --kl_coef 1e-7 \
  --use_moving_KL True \
  --update_ref_model_step 10 \
  --use_dual_kl True \
  --dual_kl_moving_weight 1.0 \
  --dual_kl_step_weight 0.1 \
  --reference_mode_offload True

Practical notes

sp_size=2 is the recommended starting point. The argparse default is 8, but the launch script overrides it to 2.
batch_size=2 and num_generations=4 are the default GRPO-friendly settings.
If you scale down the GPU count, re-check dp_degree and the divisibility constraint before launching.
reference_mode_offload helps when the memory used by the KL reference model becomes a bottleneck. Its default is False, but run_post_train.sh enables it.
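
When scaling down, it can help to list which sp_size values are arithmetically valid for a given GPU count. The helper below is a hypothetical sketch that applies only the two rules stated in this README (sp_size divides world_size, and (batch_size * dp_degree) % num_generations == 0); it says nothing about whether a given sp_size fits in GPU memory.

```python
def valid_sp_sizes(world_size, batch_size=2, num_generations=4):
    """List sp_size values that divide world_size and satisfy the GRPO constraint."""
    valid = []
    for sp_size in range(1, world_size + 1):
        if world_size % sp_size != 0:
            continue  # sp_size must evenly divide world_size
        dp_degree = world_size // sp_size
        if (batch_size * dp_degree) % num_generations == 0:
            valid.append(sp_size)
    return valid


print(valid_sp_sizes(8))   # [1, 2, 4]
print(valid_sp_sizes(16))  # [1, 2, 4, 8]
```

Note that on a single 8-GPU node, sp_size=8 gives dp_degree=1 and (2 * 1) % 4 != 0, so the argparse default would not satisfy the constraint with batch_size=2 and num_generations=4; the launch script's sp_size=2 does.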

Visualization Gallery

All visual results are under assets/Visual_Results/.
For a cleaner and fully curated presentation, please visit the project webpage: SAGE-GRPO Webpage.

1. Compare with Baseline

Case	HunyuanVideo-1.5 (Baseline)	SAGE-GRPO (Ours)
Case 1	case1_baseline.mp4	case1_ours_full.mp4
Prompt: The scene opens on a medium, low-angle shot of a teenage boy on an empty, red-surfaced running track during sunset. He is positioned on the right third of the frame, having just completed an intense sprint. He wears a striking neon green athletic jacket, unzipped to reveal a dark shirt underneath, and black running shorts. His body is bent sharply at the waist, his hands pressed firmly onto his knees for support as he struggles to recover. His dark, curly hair is damp with sweat, which also beads on his forehead and temples. His chest rises and falls rapidly and deeply, and with each ragged exhalation, a faint mist of his breath is visible in the cooling air, illuminated by the strong backlight from the setting sun. The sun, low on the horizon, casts long shadows and bathes the scene in a warm, orange glow, creating a cinematic lens flare that streaks across the frame. After a few moments of labored breathing, he slowly and painfully straightens his posture, his eyes remaining fixed on the track ahead with a look of fierce determination mixed with utter exhaustion.
Case 2	case2_baseline.mp4	case2_ours_full.mp4
Prompt: The scene opens on a tranquil, sun-drenched meadow in the late afternoon. An eye-level full shot frames Isaac Newton, a man with long hair dressed in simple 17th-century clothing, sitting at the base of a large, gnarled apple tree. He leans against the trunk, positioned according to the rule of thirds, creating a sense of balance and space. Dappled sunlight streams through the leafy canopy, casting soft, moving shadows on the ground. Newton is completely absorbed in thought, his gaze distant and unfocused. A gentle breeze rustles the leaves. High above him, a ripe red apple loosens from its stem. It drops silently at first, then lands with a distinct 'thump' on top of Newton's head. He flinches, startled out of his deep thoughts, and instinctively raises a hand to the point of impact. His eyes dart upwards towards the branches, then scan the ground around him. He spots the offending red apple lying in the grass. His initial annoyance gives way to curiosity as he reaches down, picks it up, and holds it in his palm. He turns it over, examining it, and his expression slowly transforms into one of profound, dawning realization, the genesis of a revolutionary idea.
Case 3	case3_baseline.mp4	case3_ours_full.mp4
Prompt: The scene opens with a stunning wide shot, filmed in slow motion from a low angle. Five children, a diverse group of boys and girls aged between six and ten, are running exuberantly across a vast field. The field is filled with tall, golden-yellow grass that sways gently in the breeze and reaches their waists. It's the golden hour, and the setting sun, positioned behind the children, creates a brilliant backlight. This light forms a radiant halo around their hair and outlines their bodies, separating them from the lush background. Dust motes and pollen kicked up by their running feet dance and sparkle in the sunbeams. The children are spread out, yet moving together as a group from right to left across the frame. Their faces are alight with pure joy; mouths are open in laughter, and their eyes are bright with excitement. One girl with long blonde pigtails leads the pack, looking back over her shoulder with a wide grin. A boy in a red t-shirt leaps playfully into the air. The slow-motion effect accentuates every detail: the bounce of their hair, the flowing fabric of their clothes, and the effortless grace of their youthful movements. The sky above is a soft, clear blue, providing a cool contrast to the warm tones of the field below. The atmosphere is overwhelmingly joyful, nostalgic, and evocative of the perfect, endless days of summer childhood.

2. Compare with Other Methods (20 steps)

Case	DanceGRPO	FlowGRPO	CPS	Ours
Showcase 1	showcase1_dancegrpo.mp4	showcase1_flowgrpo.mp4	showcase1_cps.mp4	showcase1_ours.mp4
Showcase 2	showcase2_dancegrpo.mp4	showcase2_flowgrpo.mp4	showcase2_cps.mp4	showcase2_ours.mp4
Showcase 3	showcase3_dancegrpo.mp4	showcase3_flowgrpo.mp4	showcase3_cps.mp4	showcase3_ours.mp4
Showcase 4	showcase4_dancegrpo.mp4	showcase4_flowgrpo.mp4	showcase4_cps.mp4	showcase4_ours.mp4

3. Compare with Other Methods (40 steps)

Case	DanceGRPO	FlowGRPO	CPS	Ours
Case 1	case1_dancegrpo.mp4	case1_flowgrpo.mp4	case1_cps.mp4	case1_ours.mp4
Case 2	case2_dancegrpo.mp4	case2_flowgrpo.mp4	case2_cps.mp4	case2_ours.mp4

4. KL Ablation

Case	No KL	Standard KL	Stepwise	Moving KL	Dual Moving KL
Case 1	case1_no_kl.mp4	case1_std_kl.mp4	case1_stepwise.mp4	case1_moving_kl.mp4	case1_dual_mov_kl.mp4
Case 2	case2_no_kl.mp4	case2_std_kl.mp4	case2_stepwise.mp4	case2_moving_kl.mp4	case2_dual_mov_kl.mp4
Case 3	case3_no_kl.mp4	case3_std_kl.mp4	case3_stepwise.mp4	case3_moving_kl.mp4	case3_dual_mov_kl.mp4
Case 4	case4_no_kl.mp4	case4_std_kl.mp4	case4_stepwise.mp4	case4_moving_kl.mp4	case4_dual_mov_kl.mp4

Acknowledgements

Base model and inference/training foundation: HunyuanVideo-1.5
Reward model: VideoAlign
Baseline algorithms: FlowGRPO, DanceGRPO, CPS

Citation

If you find our work useful, please consider citing:

@article{zheng2026sagegrpo,
  title={Manifold-Aware Exploration for Reinforcement Learning in Video Generation},
  author={Zheng, Mingzhe and Kong, Weijie and Wu, Yue and Jiang, Dengyang and Ma, Yue and He, Xuanhua and Lin, Bin and Gong, Kaixiong and Zhong, Zhao and Bo, Liefeng and Chen, Qifeng and Yang, Harry},
  journal={arXiv preprint arXiv:2603.21872},
  year={2026}
}
