
Conversation

@Cui-yshoho
Contributor

What does this PR do?

Fixes # (issue)

Adds # (feature)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g., record bug fixes or new features in What's New. Here are the documentation guidelines.
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (It is best to record this for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@Cui-yshoho Cui-yshoho requested a review from vigo999 as a code owner December 8, 2025 07:25
@gemini-code-assist
Contributor

Summary of Changes

Hello @Cui-yshoho, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the video generation capabilities of the mindone/diffusers library by integrating the SANA-Video model. The primary goal is to provide users with efficient tools for generating high-quality videos from text prompts or initial images. The underlying architecture is designed for speed and memory efficiency, making it suitable for demanding video synthesis tasks. This addition enhances the library's offering in the domain of generative AI for multimedia content.

Highlights

  • New SANA-Video Model Integration: This pull request introduces the SANA-Video model, a small yet efficient diffusion model for high-resolution, high-quality, and long video generation, into the mindone/diffusers library.
  • Text-to-Video and Image-to-Video Pipelines: Two new pipelines, SanaVideoPipeline and SanaImageToVideoPipeline, have been added to enable both text-to-video and image-to-video generation using the SANA-Video model (see the usage sketch after this list).
  • Core Architectural Components: The integration includes the SanaVideoTransformer3DModel, which leverages Linear DiT and a Constant-Memory KV cache for Block Linear Attention to achieve efficient video processing.
  • Comprehensive Documentation and Module Exposure: New documentation files for the SANA-Video transformer and pipelines have been added, and the necessary modules are exposed through updated __init__.py files for seamless integration and usability.
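
As a quick orientation for reviewers, here is a minimal text-to-video sketch built from the pipeline and checkpoint names added in this PR. The call arguments and output handling are assumptions modeled on other diffusers video pipelines, not the confirmed final API:

```python
import mindspore as ms

from mindone.diffusers import SanaVideoPipeline
from mindone.diffusers.utils import export_to_video

# Checkpoint name and the recommended bfloat16 transformer dtype are
# taken from the documentation added in this PR.
pipe = SanaVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    mindspore_dtype=ms.bfloat16,
)

# `num_frames` and the tuple-style output indexing below are assumptions
# based on other diffusers video pipelines; adjust to the final API.
video = pipe(
    prompt="a panda playing guitar in a bamboo forest",
    height=480,
    width=832,
    num_frames=81,
    return_dict=False,
)[0][0]
export_to_video(video, "sana_video.mp4", fps=16)
```
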
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the SANA-Video model and its associated pipelines (SanaVideoPipeline and SanaImageToVideoPipeline) into the diffusers library. The implementation is comprehensive, covering the model, pipelines, documentation, and the necessary __init__ updates. While the core logic appears sound, I've identified a few areas for improvement, primarily in the documentation for clarity and in the code for better organization and maintainability. My feedback focuses on correcting some inaccuracies in the documentation and suggesting structural improvements to the code.

Comment on lines 1 to 41
<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# SanaVideoPipeline

<div class="flex flex-wrap space-x-1">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.

The abstract from the paper is:

*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. [this https URL](https://github.com/NVlabs/SANA).*

This pipeline was contributed by SANA Team. The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/ANA-Video_2B_480p_diffusers) | `mindspore.bfloat16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information.

Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `mindspore.bfloat16` or `mindspore.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

::: mindone.diffusers.SanaVideoPipeline

::: mindone.diffusers.pipelines.sana.pipeline_sana_video.SanaVideoPipelineOutput

Severity: medium

I've found a few small issues in this documentation file:

  • The copyright header from lines 2-13 uses a mix of HTML and markdown comments. It would be cleaner to use a consistent style, preferably HTML comments (<!-- ... -->).
  • In the abstract on line 25, the link to the original codebase is malformed: [this https URL](...). It should be [this URL](...).
  • In the 'Available models' table on line 33, the link to the Hugging Face model has a typo: ANA-Video_2B_480p_diffusers instead of SANA-Video_2B_480p_diffusers.
  • On line 41, the path to the output class seems incorrect. It should likely be mindone.diffusers.pipelines.sana_video.SanaVideoPipelineOutput.

Comment on lines +133 to +144
```python
def apply_rotary_emb(
    hidden_states: ms.Tensor,
    freqs_cos: ms.Tensor,
    freqs_sin: ms.Tensor,
):
    x1, x2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1)
    cos = freqs_cos[..., 0::2]
    sin = freqs_sin[..., 1::2]
    out = mint.empty_like(hidden_states)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out.type_as(hidden_states)
```

Severity: medium

The apply_rotary_emb function is defined inside the __call__ method. For better code organization, readability, and potential reuse, consider defining it as a static method of the SanaLinearAttnProcessor3_0 class.
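
A sketch of the suggested refactor, with the class name taken from this review comment and the body being the snippet above moved unchanged:

```python
import mindspore as ms
from mindspore import mint


class SanaLinearAttnProcessor3_0:
    @staticmethod
    def apply_rotary_emb(
        hidden_states: ms.Tensor,
        freqs_cos: ms.Tensor,
        freqs_sin: ms.Tensor,
    ) -> ms.Tensor:
        # Interleaved RoPE: split the last dimension into (even, odd)
        # pairs, rotate by the cos/sin tables, and re-interleave.
        x1, x2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1)
        cos = freqs_cos[..., 0::2]
        sin = freqs_sin[..., 1::2]
        out = mint.empty_like(hidden_states)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out.type_as(hidden_states)
```

`__call__` would then invoke `self.apply_rotary_emb(...)` rather than redefining the helper on every call, which also makes the function independently testable.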

```python
if not return_dict:
    return (output,)

return Transformer2DModelOutput(sample=output)
```

Severity: medium

The SanaVideoTransformer3DModel returns a Transformer2DModelOutput. This is misleading, as this is a 3D model that outputs a 5D video tensor. The name Transformer2DModelOutput and its docstring (which describes a 4D tensor) can cause confusion. Consider defining and using a Transformer3DModelOutput for 3D models to improve clarity and API consistency.
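
As a minimal sketch of what such a class could look like, assuming the `BaseOutput` dataclass pattern that diffusers model outputs follow (the class name and docstring are illustrative; no such class exists in the PR as reviewed):

```python
from dataclasses import dataclass

import mindspore as ms

from mindone.diffusers.utils import BaseOutput


@dataclass
class Transformer3DModelOutput(BaseOutput):
    """
    Output of a 3D transformer model.

    Args:
        sample (`ms.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
            The denoised 5D video latent produced by the model.
    """

    sample: ms.Tensor
```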

@Cui-yshoho Cui-yshoho force-pushed the sana_video_1208 branch 4 times, most recently from 5c9a41a to 071654f Compare December 8, 2025 09:32