
Infinite-Forcing: Towards Infinite-Long Video Generation

huggingface weights 

ztotal_fast_zip.mp4

👀 Preliminary: Self-Forcing

Self Forcing trains autoregressive video diffusion models by simulating the inference process during training, performing autoregressive rollout with KV caching. It resolves the train-test distribution mismatch and enables real-time, streaming video generation on a single RTX 4090 while matching the quality of state-of-the-art diffusion models.
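For intuition, the rollout can be pictured with the sketch below; init_kv_cache, denoise, update_kv_cache, and latent_shape are illustrative placeholders rather than the repository's actual API.

import torch

def self_forcing_rollout(model, text_emb, num_chunks, chunk_len, sigmas):
    """Training-time rollout that mirrors inference: generate chunk by chunk,
    reusing the KV cache of everything generated so far (illustrative sketch)."""
    kv_cache = model.init_kv_cache()                    # empty cache, grows per chunk
    chunks = []
    for _ in range(num_chunks):
        # each chunk starts from pure noise, exactly as at inference time
        x = torch.randn(1, chunk_len, *model.latent_shape, device=text_emb.device)
        for sigma in sigmas:                            # few-step denoising schedule
            x = model.denoise(x, sigma, text_emb, kv_cache=kv_cache)
        kv_cache = model.update_kv_cache(kv_cache, x)   # append this chunk's K/V
        chunks.append(x)
    return torch.cat(chunks, dim=1)                     # full rollout fed to the loss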

🦄 Difference with Self-Forcing: V-sink

TL;DR: (1) Treat the initially generated sink_size frames as V-sink; (2) Incorporate V-sink into training; (3) Apply RoPE operation after retrieving KV cache.

We found that the Self-Forcing repository already implements sink_size, but its intended usage is not discussed in the paper. When we printed the attention maps, we did not observe a phenomenon similar to the attention sink in LLMs. The experimental configuration for the attention maps below is configs/self_forcing_dmd_frame1.yaml.

(Attention maps at frame 5, frame 10, frame 20, and frame 40)

When sink_size = 1 and inference exceeds the training length (> 20 frames), the model assigns a noticeably larger attention weight to the sink frame only in the final block. However, in our experiments we found that, even without an LLM-style attention-sink pattern, using the first sink_size frames as sinks significantly improves visual quality.

We interpret this as the first sink_size frames providing a larger memory context, enabling the model to access earlier memories, which helps mitigate exposure bias (AKA drift).
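For reference, this kind of measurement can be reproduced with a small helper that sums, for each query frame, the attention probability mass falling on the first sink_size frames. The tensor layout and per-frame token count are assumptions, and the intra-chunk causal mask is omitted for brevity.

import torch

def sink_attention_mass(q, k, tokens_per_frame, sink_size=1):
    """Fraction of attention probability that each query frame places on the
    first `sink_size` frames. q, k: [batch, heads, seq_len, head_dim]."""
    # causal masking within the chunk is omitted for brevity
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    sink_tokens = sink_size * tokens_per_frame
    mass = attn[..., :sink_tokens].sum(dim=-1)            # [batch, heads, seq_len]
    mass = mass.mean(dim=1)                                # average over heads
    return mass.view(mass.shape[0], -1, tokens_per_frame).mean(dim=-1)  # per query frame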

Moreover, the Self-Forcing implementation applies the RoPE operation before storing the KV cache. We observed that when the inference length grows too long, the effectiveness of the sink diminishes. Therefore, following an approach similar to StreamingLLM, we incorporate the sink frames (which we refer to as V-sink) into the training process and move the RoPE operation to after the KV cache is retrieved. The specific implementation can be found in the CausalWanSelfAttention class in wan/modules/causal_model.py.
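The sketch below illustrates the idea rather than the actual CausalWanSelfAttention code: un-rotated keys are cached, the first sink_size frames (V-sink) plus a rolling window of recent frames are retained, and a simplified 1D RoPE (the real model uses its own rotary embedding) is applied only after the cache is read back, so the retained entries can be re-indexed to contiguous positions.

import torch

def apply_rope(x, positions):
    """Simplified 1D RoPE, for illustration only. x: [batch, heads, seq, dim]."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = positions.float()[:, None] * freqs[None, :]              # [seq, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def attend_with_vsink(q, k_new, v_new, cache, sink_frames, window_frames, tokens_per_frame):
    """One attention step with a V-sink KV cache (illustrative sketch, not the repo code)."""
    # 1. append the new (un-rotated) keys/values to the cache
    cache["k"] = torch.cat([cache["k"], k_new], dim=2)
    cache["v"] = torch.cat([cache["v"], v_new], dim=2)

    # 2. always keep the first sink_frames (V-sink) plus a rolling window of recent frames
    sink_len, window_len = sink_frames * tokens_per_frame, window_frames * tokens_per_frame
    if cache["k"].shape[2] > sink_len + window_len:
        keep = lambda t: torch.cat([t[:, :, :sink_len], t[:, :, -window_len:]], dim=2)
        cache["k"], cache["v"] = keep(cache["k"]), keep(cache["v"])

    # 3. RoPE after retrieval: positions are assigned within the retained cache
    total = cache["k"].shape[2]
    k = apply_rope(cache["k"], torch.arange(total, device=q.device))
    q = apply_rope(q, torch.arange(total - q.shape[2], total, device=q.device))

    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ cache["v"]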

V-sink differs from attention sink in LLMs. V-sink is a complete frame, whereas in LLMs the sink is the first token of the sequence (typically <bos>). Thus, their working mechanisms are distinct.

Comparison

We compared the inference performance of three methods:

  • the original Self-Forcing implementation (Left)
  • Self-Forcing w/ V-sink (Mid)
  • Infinite-Forcing (Right)

(Five side-by-side comparison videos: output_zip.mp4)

📹️ Gallery

(Gallery videos: output.mp4 × 11, output_zip.mp4, 1.mp4 – 10.mp4)

Application: interactive video generation

Since Infinite-Forcing / Self-Forcing ultimately produces a causal autoregressive video generation model, we can modify the text prompt during generation to control the video output in real time.

For each demo below, generation begins with the initial prompt and switches to the interaction prompt midway through inference.
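A minimal sketch of how such a prompt switch can be wired into the chunk-wise generation loop; the encoder and generation function names are placeholders, not the repository's API.

import torch

@torch.no_grad()
def interactive_generate(model, text_encoder, initial_prompt, interaction_prompt,
                         num_chunks, switch_at):
    """Chunk-wise autoregressive generation with a mid-stream prompt switch (sketch)."""
    cond = text_encoder(initial_prompt)
    kv_cache = model.init_kv_cache()
    chunks = []
    for i in range(num_chunks):
        if i == switch_at:                              # user interaction: swap the prompt
            cond = text_encoder(interaction_prompt)
        chunk, kv_cache = model.generate_next_chunk(cond, kv_cache)
        chunks.append(chunk)                            # KV cache (incl. V-sink) carries
    return torch.cat(chunks, dim=1)                     # scene memory across the switch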

  • demo1

    • Initial prompt: Wide shot of an elegant urban café: A sharp-dressed businessman in a navy suit focused on a sleek silver laptop, steam rising from a white ceramic coffee cup beside him. Soft morning light streams through large windows, casting warm reflections on polished wooden tables. Background features blurred patrons and baristas in motion, creating a sophisticated yet bustling atmosphere. Cinematic shallow depth of field, muted earth tones, 4K realism.

    • Interaction prompt: Medium close-up in a medieval tavern: An elderly wizard with a long grey beard studies an ancient leather-bound spellbook under flickering candlelight. A luminous purple crystal ball pulses on the rough oak table, casting dancing shadows on stone walls. Smoky atmosphere with floating dust particles, barrels and copper mugs in background. Dark fantasy style, volumetric lighting, mystical glow, detailed texture of wrinkled hands and aged parchment, 35mm film grain.

output1.mp4

  • demo2

    • Initial prompt: A cinematic wide shot of a serene forest river, its crystal-clear water flowing gently over smooth stones. Sunlight filters through the canopy, creating dancing caustics on the riverbed. The camera tracks slowly alongside the flow.

    • Interaction prompt: In the same wide shot, a wave of freezing energy passes through the frame. The flowing water instantly crystallizes into a solid, glistening sheet of ice, trapping air bubbles inside. The camera continues its track, now revealing a solitary red fox cautiously stepping onto the frozen surface, its breath visible in the suddenly cold air.

output1.mp4

  • demo3

    • Initial prompt: A dynamic drone shot circles a bustling medieval town square at high noon. People in colorful period clothing crowd the market stalls. The sun is bright, casting sharp, short shadows. Flags flutter in a gentle breeze.

    • Interaction prompt: From the same drone perspective, day rapidly shifts to a deep, starry night. The scene is now illuminated by the warm glow of torches in iron sconces and a cool, full moon. The bustling crowd is replaced by a few mysterious, cloaked figures moving through the long, dramatic shadows. The earlier gentle breeze is now a visible mist of cold breath.

output1.mp4

  • demo4

    • Initial prompt: A macro, still-life shot of a half-full glass of water on a rustic wooden table. Morning light streams in from a window, highlighting the condensation on the glass. A few coffee beans and a newspaper are also on the table.

    • Interaction prompt: In the same macro shot, gravity reverses. The water elegantly pulls upwards out of the glass, morphing into a perfect, wobbling sphere that hovers mid-air. The coffee beans and a corner of the newspaper also begin to float upwards in a slow, graceful ballet. The condensation droplets on the glass now drift away like tiny planets.

output1.mp4

🛠️ Installation

Create a conda environment and install dependencies:

conda create -n self_forcing python=3.10 -y
conda activate self_forcing
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
python setup.py develop

🚀 Quick Start

Download checkpoints

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir-use-symlinks False --local-dir wan_models/Wan2.1-T2V-1.3B
huggingface-cli download SOTAMak1r/Infinite-Forcing --local-dir checkpoints/

CLI Inference

Example inference script using the chunk-wise autoregressive checkpoint trained with DMD:

python inference.py \
    --config_path configs/self_forcing_dmd_vsink1.yaml \
    --output_folder videos/self_forcing_dmd_vsink1 \
    --checkpoint_path path/to/your/pt/checkpoint.pt \
    --data_path prompts/MovieGenVideoBench_extended.txt \
    --use_ema

🚂 Training

Download text prompts and ODE initialized checkpoint

Follow the instructions in the Self-Forcing repository.

Infinite Forcing Training with V-sink

torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint $MASTER_ADDR \
  train.py \
  --config_path configs/self_forcing_dmd_vsink1.yaml \
  --logdir logs/self_forcing_dmd_vsink1 \
  --disable-wandb

Due to resource constraints, we trained the model using 16 A800 GPUs with a gradient accumulation of 4 to simulate the original Self-Forcing configuration.

💬 Discussion

  • [2025.9.30] We observed that incorporating V-sink results in a reduction of dynamic motion in the generated frames, and this issue worsens as training steps increase. We hypothesize that this occurs because the base model's target distribution includes a portion of static scenes, causing the model to "cut corners" by learning that static videos yield lower loss values—ultimately converging to a suboptimal sub-distribution. Additionally, the introduction of V-sink tends to make subsequent videos overly resemble the initial frames (Brainstorm: Could this potentially serve as a memory mechanism for video-generation-based world models?).

  • [2025.10.22] Thanks to Xianglong for helping validate the effectiveness of our method on matrix-game-v2. Compared to their Self-Forcing-based baseline, it shows less error accumulation and maintains a consistent scene style! (Video below is at 4x speed.)

mg2.mp4

We will continue to update! Stay tuned!

Acknowledgements

This codebase is built on top of the open-source implementation in the Self-Forcing repository.

Citation

If you find this codebase useful for your research, please kindly cite:

@article{huang2025selfforcing,
  title={Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion},
  author={Huang, Xun and Li, Zhengqi and He, Guande and Zhou, Mingyuan and Shechtman, Eli},
  journal={arXiv preprint arXiv:2506.08009},
  year={2025}
}
