Self Forcing trains autoregressive video diffusion models by simulating the inference process during training, performing autoregressive rollout with KV caching. It resolves the train-test distribution mismatch and enables real-time, streaming video generation on a single RTX 4090 while matching the quality of state-of-the-art diffusion models.
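For context, here is a minimal, self-contained toy sketch (PyTorch; the class, names, and shapes are illustrative only, not the repo's actual code) of the generation pattern Self Forcing simulates during training: chunks are produced autoregressively, each attending to a KV cache built from the model's own previous outputs. Diffusion timesteps, losses, and gradient-truncation details are omitted.

```python
import torch

torch.manual_seed(0)
DIM, FRAMES_PER_CHUNK, NUM_CHUNKS = 64, 3, 7  # illustrative sizes only

class ToyDenoiser(torch.nn.Module):
    """Stand-in for the causal video diffusion backbone."""
    def __init__(self, dim):
        super().__init__()
        self.to_kv = torch.nn.Linear(dim, 2 * dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, noisy_chunk, kv_cache):
        # The current chunk attends to the cached keys/values of all previous
        # chunks plus its own tokens (block-causal attention).
        k_new, v_new = self.to_kv(noisy_chunk).chunk(2, dim=-1)
        k = torch.cat([kv_cache["k"], k_new], dim=1)
        v = torch.cat([kv_cache["v"], v_new], dim=1)
        attn = torch.softmax(noisy_chunk @ k.transpose(1, 2) / DIM ** 0.5, dim=-1)
        return self.out(attn @ v)

model = ToyDenoiser(DIM)
kv_cache = {"k": torch.zeros(1, 0, DIM), "v": torch.zeros(1, 0, DIM)}
video = []
for _ in range(NUM_CHUNKS):
    noise = torch.randn(1, FRAMES_PER_CHUNK, DIM)
    denoised = model(noise, kv_cache)  # the few-step denoising loop is collapsed into one call
    # Cache the finished chunk's keys/values so later chunks can attend to it:
    # the rollout is conditioned on the model's OWN outputs, exactly as at test time.
    k, v = model.to_kv(denoised).chunk(2, dim=-1)
    kv_cache["k"] = torch.cat([kv_cache["k"], k], dim=1)
    kv_cache["v"] = torch.cat([kv_cache["v"], v], dim=1)
    video.append(denoised)
print(torch.cat(video, dim=1).shape)  # (1, NUM_CHUNKS * FRAMES_PER_CHUNK, DIM)
```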
TL;DR: (1) Treat the initially generated sink_size frames as the V-sink; (2) incorporate the V-sink into training; (3) apply the RoPE operation after retrieving the KV cache.
We found that the Self-Forcing repository already implements sink_size, but its specific usage is not discussed in their paper. By printing the attention maps, we did not observe a phenomenon similar to the attention sink in LLMs. The experimental configuration for the attention maps below can be found in configs/self_forcing_dmd_frame1.yaml.
Attention maps at frame 5, frame 10, frame 20, and frame 40.
When sink_size = 1 and inference exceeds the training length (> 20 frames), the model assigns a larger attention weight to the sink frame only in the final block. However, in our experiments we found that even without an LLM-like attention-sink pattern, using the first sink_size frames as sinks significantly improves visual quality.
We interpret this as the first sink_size frames providing a longer memory context, allowing the model to access earlier memories, which helps mitigate exposure bias (a.k.a. drift).
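As a rough illustration of the check described above, the hypothetical helper below sums the softmax attention mass that the current frame's queries place on each cached frame (the shapes and the tokens-per-frame grouping are assumptions, not the repo's API). An LLM-style sink would show up as a spike at frame index 0.

```python
import torch

def per_frame_attention_mass(attn, tokens_per_frame):
    """attn: (num_queries, num_keys) softmax weights for the current frame's queries.
    Returns one averaged attention weight per cached frame."""
    num_frames = attn.shape[-1] // tokens_per_frame
    mass = attn[:, : num_frames * tokens_per_frame]
    mass = mass.reshape(attn.shape[0], num_frames, tokens_per_frame).sum(dim=-1)
    return mass.mean(dim=0)  # average over queries -> one weight per cached frame

# Example with random weights: 16 queries attending to 5 cached frames of 8 tokens each.
attn = torch.softmax(torch.randn(16, 5 * 8), dim=-1)
print(per_frame_attention_mass(attn, tokens_per_frame=8))
```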
Moreover, the Self-Forcing implementation applies the RoPE operation before storing the KV cache. We observed that when the inference length grows too long, the effectiveness of the sink diminishes. Therefore, following an approach similar to StreamingLLM, we incorporate the sink frame (which we call the V-sink) into training and move the RoPE operation to after the KV cache is retrieved. The specific implementation can be found in the CausalWanSelfAttention class in wan/modules/causal_model.py.
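The following is a minimal, self-contained sketch of that retrieval path, not the repo's exact code: the function names, shapes, rolling-window eviction, and cache-relative position assignment are assumptions made for illustration (the real logic lives in CausalWanSelfAttention). Keys are cached without rotary embedding; at attention time we keep the V-sink frames plus the most recent frames and only then apply RoPE with positions relative to the retained cache, in the spirit of StreamingLLM, so the sink always sits at the earliest positions no matter how long generation has run.

```python
import torch

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (tokens, dim)."""
    dim = x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]  # (tokens, half)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def retrieve_with_vsink(cached_k, cached_v, sink_size, window, tokens_per_frame):
    """Keep the V-sink frames plus the most recent `window` frames, then RoPE the keys."""
    num_frames = cached_k.shape[0] // tokens_per_frame
    keep = list(range(min(sink_size, num_frames)))                        # V-sink frames
    keep += list(range(max(sink_size, num_frames - window), num_frames))  # recent frames
    idx = torch.cat([torch.arange(f * tokens_per_frame, (f + 1) * tokens_per_frame)
                     for f in keep])
    k, v = cached_k[idx], cached_v[idx]
    # Positions are assigned WITHIN the retained cache (an assumption of this sketch),
    # so the V-sink stays at the earliest positions regardless of absolute frame index.
    positions = torch.arange(k.shape[0])
    return rope(k, positions), v, positions

# Example: 40 cached frames of 4 tokens each; keep 1 sink frame + 8 recent frames.
cached_k = torch.randn(40 * 4, 64)
cached_v = torch.randn(40 * 4, 64)
k, v, pos = retrieve_with_vsink(cached_k, cached_v, sink_size=1, window=8, tokens_per_frame=4)
print(k.shape, pos[:6])  # (36, 64), positions 0..5
```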
V-sink differs from attention sink in LLMs. V-sink is a complete frame, whereas in LLMs the sink is the first token of the sequence (typically <bos>). Thus, their working mechanisms are distinct.
We compared the inference performance of three methods:
- the original Self-Forcing implementation (Left)
- Self-Forcing w/ V-sink (Mid)
- Infinite-Forcing (Right)
Since Infinite-Forcing / Self-Forcing ultimately produces a causal autoregressive video generation model, we can modify the text prompt during generation to control the video output in real time. In the demos below, we start generation with the initial prompt and switch to the interaction prompt midway through inference.
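A hypothetical sketch of this loop is shown below (all names and shapes are stand-ins rather than the repo's inference API): because each chunk is conditioned on the current text embedding while the KV cache carries the visual context forward, switching the prompt mid-rollout transitions the scene instead of restarting it.

```python
import torch

def encode_prompt(prompt: str) -> torch.Tensor:
    """Stand-in for the real text encoder; returns a dummy embedding."""
    return torch.randn(1, 77, 64)

def generate_chunk(text_emb: torch.Tensor, kv_cache: list) -> torch.Tensor:
    """Stand-in for one chunk of the causal diffusion rollout."""
    new_chunk = torch.randn(1, 3, 64) + 0.01 * text_emb.mean()
    kv_cache.append(new_chunk)  # earlier visual context persists across the prompt switch
    return new_chunk

initial_prompt = "Wide shot of an elegant urban café ..."
interaction_prompt = "Medium close-up in a medieval tavern ..."

kv_cache, chunks = [], []
num_chunks, switch_at = 20, 10
for i in range(num_chunks):
    # Swap the text conditioning halfway through; the KV cache is NOT reset,
    # so the video transitions from the first scene into the second.
    prompt = initial_prompt if i < switch_at else interaction_prompt
    chunks.append(generate_chunk(encode_prompt(prompt), kv_cache))

print(torch.cat(chunks, dim=1).shape)  # (1, 60, 64)
```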
- demo1
  - Initial prompt: Wide shot of an elegant urban café: A sharp-dressed businessman in a navy suit focused on a sleek silver laptop, steam rising from a white ceramic coffee cup beside him. Soft morning light streams through large windows, casting warm reflections on polished wooden tables. Background features blurred patrons and baristas in motion, creating a sophisticated yet bustling atmosphere. Cinematic shallow depth of field, muted earth tones, 4K realism.
  - Interaction prompt: Medium close-up in a medieval tavern: An elderly wizard with a long grey beard studies an ancient leather-bound spellbook under flickering candlelight. A luminous purple crystal ball pulses on the rough oak table, casting dancing shadows on stone walls. Smoky atmosphere with floating dust particles, barrels and copper mugs in background. Dark fantasy style, volumetric lighting, mystical glow, detailed texture of wrinkled hands and aged parchment, 35mm film grain.
- demo2
  - Initial prompt: A cinematic wide shot of a serene forest river, its crystal-clear water flowing gently over smooth stones. Sunlight filters through the canopy, creating dancing caustics on the riverbed. The camera tracks slowly alongside the flow.
  - Interaction prompt: In the same wide shot, a wave of freezing energy passes through the frame. The flowing water instantly crystallizes into a solid, glistening sheet of ice, trapping air bubbles inside. The camera continues its track, now revealing a solitary red fox cautiously stepping onto the frozen surface, its breath visible in the suddenly cold air.
- demo3
  - Initial prompt: A dynamic drone shot circles a bustling medieval town square at high noon. People in colorful period clothing crowd the market stalls. The sun is bright, casting sharp, short shadows. Flags flutter in a gentle breeze.
  - Interaction prompt: From the same drone perspective, day rapidly shifts to a deep, starry night. The scene is now illuminated by the warm glow of torches in iron sconces and a cool, full moon. The bustling crowd is replaced by a few mysterious, cloaked figures moving through the long, dramatic shadows. The earlier gentle breeze is now a visible mist of cold breath.
- demo4
  - Initial prompt: A macro, still-life shot of a half-full glass of water on a rustic wooden table. Morning light streams in from a window, highlighting the condensation on the glass. A few coffee beans and a newspaper are also on the table.
  - Interaction prompt: In the same macro shot, gravity reverses. The water elegantly pulls upwards out of the glass, morphing into a perfect, wobbling sphere that hovers mid-air. The coffee beans and a corner of the newspaper also begin to float upwards in a slow, graceful ballet. The condensation droplets on the glass now drift away like tiny planets.
Create a conda environment and install dependencies:
conda create -n self_forcing python=3.10 -y
conda activate self_forcing
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop
Download the Wan2.1-T2V-1.3B base model and the Infinite-Forcing checkpoint:
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download SOTAMak1r/Infinite-Forcing --local-dir checkpoints/
Example inference script using the chunk-wise autoregressive checkpoint trained with DMD:
python inference.py \
--config_path configs/self_forcing_dmd_vsink1.yaml \
--output_folder videos/self_forcing_dmd_vsink1 \
--checkpoint_path path/to/your/pt/checkpoint.pt \
--data_path prompts/MovieGenVideoBench_extended.txt \
--use_ema
Follow the Self-Forcing repository to prepare for training, then launch:
torchrun --nnodes=2 --nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint $MASTER_ADDR \
train.py \
--config_path configs/self_forcing_dmd_vsink1.yaml \
--logdir logs/self_forcing_dmd_vsink1 \
--disable-wandb
Due to resource constraints, we trained the model using 16 A800 GPUs with a gradient accumulation of 4 to simulate the original Self-Forcing configuration.
- [2025.9.30] We observed that incorporating the V-sink reduces dynamic motion in the generated frames, and this issue worsens as training steps increase. We hypothesize that this occurs because the base model's target distribution includes a portion of static scenes, causing the model to "cut corners" by learning that static videos yield lower loss values, ultimately converging to a suboptimal sub-distribution. Additionally, introducing the V-sink tends to make subsequent videos overly resemble the initial frames (brainstorm: could this serve as a memory mechanism for video-generation-based world models?).
- [2025.10.22] Thanks to Xianglong for helping validate the effectiveness of our method on matrix-game-v2. Compared to their Self-Forcing-based baseline, it shows less error accumulation and maintains consistency in scene style! (Video at 4x speed.)
We will continue to update! Stay tuned!
This codebase is built on top of the open-source Self-Forcing repository.
If you find this codebase useful for your research, please kindly cite:
@article{huang2025selfforcing,
title={Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
author={Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
journal={arXiv preprint arXiv:2506.08009},
year={2025}
}




