Commit 071654f: add sana_video (1 parent: 4be9653)

File tree: 13 files changed, +2845 -4 lines

docs/diffusers/_toctree.yml

Lines changed: 4 additions & 0 deletions

@@ -293,6 +293,8 @@
       title: QwenImageTransformer2DModel
     - local: api/models/sana_transformer2d
       title: SanaTransformer2DModel
+    - local: api/models/sana_video_transformer3d
+      title: SanaVideoTransformer3DModel
     - local: api/models/sd3_transformer2d
       title: SD3Transformer2DModel
     - local: api/models/skyreels_v2_transformer_3d
@@ -489,6 +491,8 @@
       title: Sana
     - local: api/pipelines/sana_sprint
       title: Sana Sprint
+    - local: api/pipelines/sana_video
+      title: Sana Video
     - local: api/pipelines/self_attention_guidance
       title: Self-Attention Guidance
     - local: api/pipelines/semantic_stable_diffusion
docs/diffusers/api/models/sana_video_transformer3d.md (new file)

Lines changed: 31 additions & 0 deletions

<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# SanaVideoTransformer3DModel

A Diffusion Transformer model for 3D (video) data, introduced in [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) by NVIDIA and MIT HAN Lab: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.

The abstract from the paper is:

*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.*

The model can be loaded with the following code snippet.

```python
import mindspore as ms

from mindone.diffusers import SanaVideoTransformer3DModel

transformer = SanaVideoTransformer3DModel.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", mindspore_dtype=ms.float16
)
```

::: mindone.diffusers.SanaVideoTransformer3DModel

::: mindone.diffusers.models.modeling_outputs.Transformer2DModelOutput
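The constant-memory state described in the abstract follows from the associativity of linear attention: the global key/value summary can be accumulated block by block at a fixed memory cost instead of keeping a growing KV cache. The following NumPy sketch illustrates only that accumulation idea; the feature map, shapes, and block size here are assumptions for illustration, not the model's actual kernel.

```python
import numpy as np


def phi(x):
    # Simple positive feature map (ReLU plus epsilon); a placeholder choice.
    return np.maximum(x, 0.0) + 1e-6


def full_linear_attention(Q, K, V):
    # O = phi(Q) @ (phi(K)^T V) / (phi(Q) @ sum_j phi(k_j))
    qf, kf = phi(Q), phi(K)
    S = kf.T @ V            # (d, dv) summary of all keys/values
    z = kf.sum(axis=0)      # (d,) normalizer state
    return (qf @ S) / (qf @ z)[:, None]


def blockwise_linear_attention(Q, K, V, block=4):
    # Process K/V in blocks, carrying only the constant-size state (S, z).
    d, dv = Q.shape[1], V.shape[1]
    S, z = np.zeros((d, dv)), np.zeros(d)
    for start in range(0, K.shape[0], block):
        kf = phi(K[start:start + block])
        S += kf.T @ V[start:start + block]   # accumulate key/value summary
        z += kf.sum(axis=0)                  # accumulate normalizer
    qf = phi(Q)
    return (qf @ S) / (qf @ z)[:, None]


rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 4))
# Block-wise accumulation matches the full computation, at fixed state size.
assert np.allclose(full_linear_attention(Q, K, V), blockwise_linear_attention(Q, K, V))
```

Because the state `(S, z)` has a fixed size independent of sequence length, extending the computation to more blocks (longer videos) costs no additional memory, which is the property the paper's block linear attention relies on.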

docs/diffusers/api/pipelines/sana_sprint.md

Lines changed: 0 additions & 4 deletions

@@ -24,10 +24,6 @@ The abstract from the paper is:
 
 *This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.*
 
-!!! tip
-
-    Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
-
 This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj), [shuchen Xue](https://github.com/scxue) and [Enze Xie](https://github.com/xieenze). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model/).
 
 Available models:
docs/diffusers/api/pipelines/sana_video.md (new file)

Lines changed: 41 additions & 0 deletions

<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# SanaVideoPipeline

<div class="flex flex-wrap space-x-1">
  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) is from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.

The abstract from the paper is:

*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code: [https://github.com/NVlabs/SANA](https://github.com/NVlabs/SANA).*

This pipeline was contributed by the SANA Team. The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p_diffusers) | `mindspore.bfloat16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information.

Note: The recommended dtype mentioned above is for the transformer weights. The text encoder and VAE weights must stay in `mindspore.bfloat16` or `mindspore.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

::: mindone.diffusers.SanaVideoPipeline

::: mindone.diffusers.pipelines.sana.pipeline_sana_video.pipeline_output.SanaVideoPipelineOutput
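The dtype note above calls for an inference example. A minimal loading sketch consistent with that note follows; it assumes the mindone API mirrors the diffusers `from_pretrained`/callable-pipeline pattern, and the prompt, `num_frames` parameter, and `.frames` output attribute are illustrative assumptions rather than confirmed API details.

```python
import mindspore as ms

from mindone.diffusers import SanaVideoPipeline

# Load with the recommended transformer dtype from the table above;
# text encoder and VAE remain in bfloat16/float32 per the note.
pipe = SanaVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    mindspore_dtype=ms.bfloat16,
)

# Placeholder prompt; `num_frames` and `.frames` follow the usual
# diffusers video-pipeline convention and are assumptions here.
video = pipe(prompt="a timelapse of clouds over snowy mountains", num_frames=81).frames[0]
```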

mindone/diffusers/__init__.py

Lines changed: 6 additions & 0 deletions

@@ -111,6 +111,7 @@
     "QwenImageTransformer2DModel",
     "SanaControlNetModel",
     "SanaTransformer2DModel",
+    "SanaVideoTransformer3DModel",
     "SD3ControlNetModel",
     "SD3MultiControlNetModel",
     "SD3Transformer2DModel",
@@ -275,10 +276,12 @@
     "QwenImageEditInpaintPipeline",
     "ReduxImageEncoder",
     "SanaControlNetPipeline",
+    "SanaImageToVideoPipeline",
     "SanaPAGPipeline",
     "SanaPipeline",
     "SanaSprintImg2ImgPipeline",
     "SanaSprintPipeline",
+    "SanaVideoPipeline",
     "SemanticStableDiffusionPipeline",
     "ShapEImg2ImgPipeline",
     "ShapEPipeline",
@@ -497,6 +500,7 @@
     QwenImageTransformer2DModel,
     SanaControlNetModel,
     SanaTransformer2DModel,
+    SanaVideoTransformer3DModel,
     SD3ControlNetModel,
     SD3MultiControlNetModel,
     SD3Transformer2DModel,
@@ -672,10 +676,12 @@
     QwenImagePipeline,
     ReduxImageEncoder,
     SanaControlNetPipeline,
+    SanaImageToVideoPipeline,
     SanaPAGPipeline,
     SanaPipeline,
     SanaSprintImg2ImgPipeline,
     SanaSprintPipeline,
+    SanaVideoPipeline,
     SemanticStableDiffusionPipeline,
     ShapEImg2ImgPipeline,
     ShapEPipeline,

mindone/diffusers/models/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -86,6 +86,7 @@
     "transformers.transformer_mochi": ["MochiTransformer3DModel"],
     "transformers.transformer_omnigen": ["OmniGenTransformer2DModel"],
     "transformers.transformer_qwenimage": ["QwenImageTransformer2DModel"],
+    "transformers.transformer_sana_video": ["SanaVideoTransformer3DModel"],
     "transformers.transformer_sd3": ["SD3Transformer2DModel"],
     "transformers.transformer_skyreels_v2": ["SkyReelsV2Transformer3DModel"],
     "transformers.transformer_temporal": ["TransformerTemporalModel"],
@@ -173,6 +174,7 @@
     PriorTransformer,
     QwenImageTransformer2DModel,
     SanaTransformer2DModel,
+    SanaVideoTransformer3DModel,
     SD3Transformer2DModel,
     SkyReelsV2Transformer3DModel,
     StableAudioDiTModel,

mindone/diffusers/models/transformers/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -31,6 +31,7 @@
 from .transformer_mochi import MochiTransformer3DModel
 from .transformer_omnigen import OmniGenTransformer2DModel
 from .transformer_qwenimage import QwenImageTransformer2DModel
+from .transformer_sana_video import SanaVideoTransformer3DModel
 from .transformer_sd3 import SD3Transformer2DModel
 from .transformer_skyreels_v2 import SkyReelsV2Transformer3DModel
 from .transformer_temporal import TransformerTemporalModel
