|
| 1 | +<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved. |
| 2 | +# |
| 3 | +# Licensed under the Apache License, Version 2.0 (the "License"); |
| 4 | +# you may not use this file except in compliance with the License. |
| 5 | +# You may obtain a copy of the License at |
| 6 | +# |
| 7 | +# http://www.apache.org/licenses/LICENSE-2.0 |
| 8 | +# |
| 9 | +# Unless required by applicable law or agreed to in writing, software |
| 10 | +# distributed under the License is distributed on an "AS IS" BASIS, |
| 11 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 12 | +# See the License for the specific language governing permissions and |
| 13 | +# limitations under the License. --> |
| 14 | + |
| 15 | +# SanaVideoPipeline |
| 16 | + |
| 17 | +<div class="flex flex-wrap space-x-1"> |
| 18 | + <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/> |
| 19 | +</div> |
| 20 | + |
| 21 | +[SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie. |
| 22 | + |
| 23 | +The abstract from the paper is: |
| 24 | + |
| 25 | +*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. [this https URL](https://github.com/NVlabs/SANA).* |
| 26 | + |
| 27 | +This pipeline was contributed by the SANA Team. The original codebase is available at [NVlabs/Sana](https://github.com/NVlabs/Sana), and the original weights are collected under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video). |
| 28 | + |
| 29 | +Available models: |
| 30 | + |
| 31 | +| Model | Recommended dtype | |
| 32 | +|:-----:|:-----------------:| |
| 33 | +| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p_diffusers) | `mindspore.bfloat16` | |
| 34 | + |
| 35 | +Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information. |
| 36 | + |
| 37 | +Note: The recommended dtype applies to the transformer weights. The text encoder and VAE weights must stay in `mindspore.bfloat16` or `mindspore.float32` for the model to work correctly. To load the weights in the recommended dtype, pass `mindspore_dtype=mindspore.bfloat16` to `from_pretrained`. |
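The dtype note above can be illustrated with a minimal inference sketch. It is not taken from the SANA-Video authors. It assumes that `SanaVideoPipeline` follows the usual `mindone.diffusers` conventions: `from_pretrained` accepts `mindspore_dtype`, the call accepts `num_inference_steps`, `guidance_scale`, and `return_dict`, and `export_to_video` is available from `mindone.diffusers.utils`. The helper name `generate`, the step count, the guidance scale, the output path, and the frame rate are all hypothetical choices for illustration.

```python
MODEL_ID = "Efficient-Large-Model/SANA-Video_2B_480p_diffusers"


def generate(prompt: str, output_path: str = "sana_video.mp4", fps: int = 16) -> str:
    """Generate one video for `prompt` and write it to `output_path` (hypothetical helper)."""
    # Imported lazily so this module can be loaded without MindSpore installed.
    import mindspore as ms
    from mindone.diffusers import SanaVideoPipeline
    from mindone.diffusers.utils import export_to_video

    # Load the weights in the dtype recommended in the table above.
    pipe = SanaVideoPipeline.from_pretrained(MODEL_ID, mindspore_dtype=ms.bfloat16)

    # With return_dict=False the pipeline returns a tuple; the first element is
    # assumed to hold the batch of generated videos, so [0][0] is the first video.
    # The sampling settings below are illustrative assumptions, not tuned values.
    frames = pipe(
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=6.0,
        return_dict=False,
    )[0][0]

    # fps=16 is an assumed playback rate; adjust it to match the checkpoint.
    export_to_video(frames, output_path, fps=fps)
    return output_path
```

Calling `generate("A red panda walking through a snowy bamboo forest")` would download the checkpoint on first use and write `sana_video.mp4` to the working directory. Resolution and frame count are left at the pipeline defaults here because their argument names are not stated on this page.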
| 38 | + |
| 39 | +::: mindone.diffusers.SanaVideoPipeline |
| 40 | + |
| 41 | +::: mindone.diffusers.pipelines.sana.pipeline_sana_video.pipeline_output.SanaVideoPipelineOutput |