
Commit 486376a

Merge branch 'main' into reorder-pipelines

2 parents: 06bfdaa + 84e1657


51 files changed: +2726 −95 lines

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions

```diff
@@ -323,6 +323,8 @@
         title: AllegroTransformer3DModel
       - local: api/models/aura_flow_transformer2d
         title: AuraFlowTransformer2DModel
+      - local: api/models/transformer_bria_fibo
+        title: BriaFiboTransformer2DModel
       - local: api/models/bria_transformer
         title: BriaTransformer2DModel
       - local: api/models/chroma_transformer
@@ -469,6 +471,8 @@
         title: BLIP-Diffusion
       - local: api/pipelines/bria_3_2
         title: Bria 3.2
+      - local: api/pipelines/bria_fibo
+        title: Bria Fibo
       - local: api/pipelines/chroma
         title: Chroma
       - local: api/pipelines/cogview3
@@ -527,6 +531,8 @@
         title: Kandinsky 2.2
       - local: api/pipelines/kandinsky3
         title: Kandinsky 3
+      - local: api/pipelines/kandinsky5
+        title: Kandinsky 5
       - local: api/pipelines/kolors
         title: Kolors
       - local: api/pipelines/latent_consistency_models
```

docs/source/en/api/models/chroma_transformer.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
 
 # ChromaTransformer2DModel
 
-A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma)
+A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma1-HD)
 
 ## ChromaTransformer2DModel
```

docs/source/en/api/models/transformer_bria_fibo.md

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# BriaFiboTransformer2DModel
+
+A modified flux Transformer model from [Bria](https://huggingface.co/briaai/FIBO)
+
+## BriaFiboTransformer2DModel
+
+[[autodoc]] BriaFiboTransformer2DModel
```
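
For context, a minimal sketch of loading the new transformer class on its own. The `subfolder="transformer"` layout is an assumption based on the standard diffusers checkpoint structure, not something this diff confirms:

```python
import torch

from diffusers import BriaFiboTransformer2DModel

# Assumption: the FIBO checkpoint keeps its transformer weights in a
# "transformer" subfolder, following the usual diffusers pipeline layout.
transformer = BriaFiboTransformer2DModel.from_pretrained(
    "briaai/FIBO", subfolder="transformer", torch_dtype=torch.bfloat16
)
```
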
docs/source/en/api/pipelines/bria_fibo.md

Lines changed: 45 additions & 0 deletions

````diff
@@ -0,0 +1,45 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Bria Fibo
+
+Text-to-image models have mastered imagination - but not control. FIBO changes that.
+
+FIBO is trained on structured JSON captions of up to 1,000+ words and is designed to understand and control different visual parameters such as lighting, composition, color, and camera settings, enabling precise and reproducible outputs.
+
+With only 8 billion parameters, FIBO provides a new level of image quality, prompt adherence, and professional control.
+
+FIBO is trained exclusively on structured prompts and will not work with freeform text prompts.
+You can use the [FIBO-VLM-prompt-to-JSON](https://huggingface.co/briaai/FIBO-VLM-prompt-to-JSON) model or the [FIBO-gemini-prompt-to-JSON](https://huggingface.co/briaai/FIBO-gemini-prompt-to-JSON) model to convert a freeform text prompt into a structured JSON prompt.
+
+Using freeform text prompts directly with FIBO is not recommended, as it will not produce the best results.
+
+You can learn more about FIBO on the [Bria Fibo Hugging Face page](https://huggingface.co/briaai/FIBO).
+
+## Usage
+
+_As the model is gated, before using it with diffusers you first need to go to the [Bria Fibo Hugging Face page](https://huggingface.co/briaai/FIBO), fill in the form, and accept the gate. Once you are in, you need to log in so that your system knows you've accepted the gate._
+
+Use the command below to log in:
+
+```bash
+hf auth login
+```
+
+## BriaFiboPipeline
+
+[[autodoc]] BriaFiboPipeline
+- all
+- __call__
````
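
To make the page's gating and JSON-prompt notes concrete, here is a minimal, hypothetical sketch of invoking the new pipeline. The exact prompt schema is defined on the FIBO model card; the field names below are illustrative only:

```python
import json

import torch

from diffusers import BriaFiboPipeline

# Assumes the gate on briaai/FIBO has been accepted and `hf auth login` has been run.
pipe = BriaFiboPipeline.from_pretrained("briaai/FIBO", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# FIBO expects a structured JSON caption, not freeform text. These fields are
# made up for illustration; consult the model card for the real schema.
structured_prompt = json.dumps(
    {
        "subject": "a lighthouse on a rocky coast at dusk",
        "lighting": "golden hour, soft rim light",
        "composition": "rule of thirds, wide shot",
        "camera": {"lens": "35mm", "aperture": "f/8"},
    }
)

image = pipe(prompt=structured_prompt).images[0]
image.save("fibo.png")
```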

docs/source/en/api/pipelines/chroma.md

Lines changed: 7 additions & 6 deletions

````diff
@@ -19,20 +19,21 @@ specific language governing permissions and limitations under the License.
 
 Chroma is a text-to-image generation model based on Flux.
 
-Original model checkpoints for Chroma can be found [here](https://huggingface.co/lodestones/Chroma).
+Original model checkpoints for Chroma can be found here:
+* High-resolution finetune: [lodestones/Chroma1-HD](https://huggingface.co/lodestones/Chroma1-HD)
+* Base model: [lodestones/Chroma1-Base](https://huggingface.co/lodestones/Chroma1-Base)
+* Original repo with progress checkpoints: [lodestones/Chroma](https://huggingface.co/lodestones/Chroma) (loading this repo with `from_pretrained` will load a Diffusers-compatible version of the `unlocked-v37` checkpoint)
 
 > [!TIP]
 > Chroma can use all the same optimizations as Flux.
 
 ## Inference
 
-The Diffusers version of Chroma is based on the [`unlocked-v37`](https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors) version of the original model, which is available in the [Chroma repository](https://huggingface.co/lodestones/Chroma).
-
 ```python
 import torch
 from diffusers import ChromaPipeline
 
-pipe = ChromaPipeline.from_pretrained("lodestones/Chroma", torch_dtype=torch.bfloat16)
+pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-HD", torch_dtype=torch.bfloat16)
 pipe.enable_model_cpu_offload()
 
 prompt = [
@@ -63,10 +64,10 @@ Then run the following example
 import torch
 from diffusers import ChromaTransformer2DModel, ChromaPipeline
 
-model_id = "lodestones/Chroma"
+model_id = "lodestones/Chroma1-HD"
 dtype = torch.bfloat16
 
-transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors", torch_dtype=dtype)
+transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma1-HD/blob/main/Chroma1-HD.safetensors", torch_dtype=dtype)
 
 pipe = ChromaPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=dtype)
 pipe.enable_model_cpu_offload()
````
docs/source/en/api/pipelines/kandinsky5.md

Lines changed: 149 additions & 0 deletions

````diff
@@ -0,0 +1,149 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Kandinsky 5.0
+
+Kandinsky 5.0 is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov.
+
+Kandinsky 5.0 is a family of diffusion models for video and image generation. Kandinsky 5.0 T2V Lite is a lightweight (2B parameter) video generation model that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.
+
+The model introduces several key innovations:
+- **Latent diffusion pipeline** with **Flow Matching** for improved training stability
+- **Diffusion Transformer (DiT)** as the main generative backbone with cross-attention to text embeddings
+- Dual text encoding using **Qwen2.5-VL** and **CLIP** for comprehensive text understanding
+- **HunyuanVideo 3D VAE** for efficient video encoding and decoding
+- **Sparse attention mechanisms** (NABLA) for efficient long-sequence processing
+
+The original codebase can be found at [ai-forever/Kandinsky-5](https://github.com/ai-forever/Kandinsky-5).
+
+> [!TIP]
+> Check out the [AI Forever](https://huggingface.co/ai-forever) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
+
+## Available Models
+
+Kandinsky 5.0 T2V Lite comes in several variants optimized for different use cases:
+
+| model_id | Description | Use Cases |
+|------------|-------------|-----------|
+| **ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers** | 5-second supervised fine-tuned model | Highest generation quality |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers** | 10-second supervised fine-tuned model | Highest generation quality |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers** | 5-second classifier-free-guidance-distilled model | 2× faster inference |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers** | 10-second classifier-free-guidance-distilled model | 2× faster inference |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers** | 5-second model diffusion-distilled to 16 steps | 6× faster inference, minimal quality loss |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers** | 10-second model diffusion-distilled to 16 steps | 6× faster inference, minimal quality loss |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers** | 5-second base pretrained model | Research and fine-tuning |
+| **ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers** | 10-second base pretrained model | Research and fine-tuning |
+
+All models are available in 5-second and 10-second video generation versions.
+
+## Kandinsky5T2VPipeline
+
+[[autodoc]] Kandinsky5T2VPipeline
+- all
+- __call__
+
+## Usage Examples
+
+### Basic Text-to-Video Generation
+
+```python
+import torch
+from diffusers import Kandinsky5T2VPipeline
+from diffusers.utils import export_to_video
+
+# Load the pipeline
+model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
+pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+pipe = pipe.to("cuda")
+
+# Generate video
+prompt = "A cat and a dog baking a cake together in a kitchen."
+negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
+
+output = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    height=512,
+    width=768,
+    num_frames=121,  # ~5 seconds at 24fps
+    num_inference_steps=50,
+    guidance_scale=5.0,
+).frames[0]
+
+export_to_video(output, "output.mp4", fps=24, quality=9)
+```
+
+### 10-second Models
+
+**⚠️ Warning!** All 10-second models should be used with the Flex attention backend and `max-autotune-no-cudagraphs` compilation:
+
+```python
+import torch
+from diffusers import Kandinsky5T2VPipeline
+from diffusers.utils import export_to_video
+
+pipe = Kandinsky5T2VPipeline.from_pretrained(
+    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
+    torch_dtype=torch.bfloat16,
+)
+pipe = pipe.to("cuda")
+
+pipe.transformer.set_attention_backend("flex")  # <--- Set attention backend to Flex
+pipe.transformer.compile(
+    mode="max-autotune-no-cudagraphs",
+    dynamic=True,
+)  # <--- Compile with max-autotune-no-cudagraphs
+
+prompt = "A cat and a dog baking a cake together in a kitchen."
+negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"
+
+output = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    height=512,
+    width=768,
+    num_frames=241,
+    num_inference_steps=50,
+    guidance_scale=5.0,
+).frames[0]
+
+export_to_video(output, "output.mp4", fps=24, quality=9)
+```
+
+### Diffusion-Distilled Models
+
+**⚠️ Warning!** All no-CFG and diffusion-distilled models should be run without CFG (`guidance_scale=1.0`):
+
+```python
+import torch
+from diffusers import Kandinsky5T2VPipeline
+from diffusers.utils import export_to_video
+
+model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers"
+pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+pipe = pipe.to("cuda")
+
+output = pipe(
+    prompt="A beautiful sunset over mountains",
+    num_inference_steps=16,  # <--- Model is distilled in 16 steps
+    guidance_scale=1.0,  # <--- no CFG
+).frames[0]
+
+export_to_video(output, "output.mp4", fps=24, quality=9)
+```
+
+## Citation
+
+```bibtex
+@misc{kandinsky2025,
+    author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and Denis Koposov and
+              Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and Julia Agafonova and Olga Kim and
+              Anastasiia Kargapoltseva and Nikita Kiselev and Vladimir Arkhipkin and Vladimir Korviakov and
+              Nikolai Gerasimenko and Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
+              Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and Vladimir Polovnikov and
+              Yury Kolabushin and Alexander Belykh and Mikhail Mamaev and Anastasia Aliaskina and
+              Tatiana Nikulina and Polina Gavrilova and Denis Dimitrov},
+    title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
+    howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
+    year = 2025
+}
+```
````

docs/source/en/optimization/attention_backends.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -21,6 +21,7 @@ Refer to the table below for an overview of the available attention families and
 | attention family | main feature |
 |---|---|
 | FlashAttention | minimizes memory reads/writes through tiling and recomputation |
+| AI Tensor Engine for ROCm | FlashAttention implementation optimized for AMD ROCm accelerators |
 | SageAttention | quantizes attention to int8 |
 | PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) |
 | xFormers | memory-efficient attention with support for various attention kernels |
@@ -139,6 +140,7 @@ Refer to the table below for a complete list of available attention backends and
 | `_native_xla` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | XLA-optimized attention |
 | `flash` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 |
 | `flash_varlen` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention |
+| `aiter` | [AI Tensor Engine for ROCm](https://github.com/ROCm/aiter) | FlashAttention for AMD ROCm |
 | `_flash_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 |
 | `_flash_varlen_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 |
 | `_flash_3_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 from kernels |
```

src/diffusers/__init__.py

Lines changed: 4 additions & 0 deletions

```diff
@@ -198,6 +198,7 @@
         "AutoencoderOobleck",
         "AutoencoderTiny",
         "AutoModel",
+        "BriaFiboTransformer2DModel",
         "BriaTransformer2DModel",
         "CacheMixin",
         "ChromaTransformer2DModel",
@@ -430,6 +431,7 @@
         "AuraFlowPipeline",
         "BlipDiffusionControlNetPipeline",
         "BlipDiffusionPipeline",
+        "BriaFiboPipeline",
         "BriaPipeline",
         "ChromaImg2ImgPipeline",
         "ChromaPipeline",
@@ -901,6 +903,7 @@
         AutoencoderOobleck,
         AutoencoderTiny,
         AutoModel,
+        BriaFiboTransformer2DModel,
         BriaTransformer2DModel,
         CacheMixin,
         ChromaTransformer2DModel,
@@ -1103,6 +1106,7 @@
         AudioLDM2UNet2DConditionModel,
         AudioLDMPipeline,
         AuraFlowPipeline,
+        BriaFiboPipeline,
         BriaPipeline,
         ChromaImg2ImgPipeline,
         ChromaPipeline,
```
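
The effect of these registrations is simply that both new classes become importable from the package root; a one-line smoke test, assuming a build that includes this commit:

```python
# Both names resolve from the top-level namespace after this commit.
from diffusers import BriaFiboPipeline, BriaFiboTransformer2DModel

print(BriaFiboPipeline.__name__, BriaFiboTransformer2DModel.__name__)
```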

src/diffusers/loaders/lora_conversion_utils.py

Lines changed: 25 additions & 5 deletions

```diff
@@ -1977,14 +1977,34 @@ def get_alpha_scales(down_weight, alpha_key):
             "time_projection.1.diff_b"
         )
 
-    if any("head.head" in k for k in state_dict):
-        converted_state_dict["proj_out.lora_A.weight"] = original_state_dict.pop(
-            f"head.head.{lora_down_key}.weight"
-        )
-        converted_state_dict["proj_out.lora_B.weight"] = original_state_dict.pop(f"head.head.{lora_up_key}.weight")
+    if any("head.head" in k for k in original_state_dict):
+        if any(f"head.head.{lora_down_key}.weight" in k for k in state_dict):
+            converted_state_dict["proj_out.lora_A.weight"] = original_state_dict.pop(
+                f"head.head.{lora_down_key}.weight"
+            )
+        if any(f"head.head.{lora_up_key}.weight" in k for k in state_dict):
+            converted_state_dict["proj_out.lora_B.weight"] = original_state_dict.pop(
+                f"head.head.{lora_up_key}.weight"
+            )
         if "head.head.diff_b" in original_state_dict:
             converted_state_dict["proj_out.lora_B.bias"] = original_state_dict.pop("head.head.diff_b")
 
+    # Notes: https://huggingface.co/lightx2v/Wan2.2-Distill-Loras
+    # This is my (sayakpaul) assumption that this particular key belongs to the down matrix.
+    # Since for this particular LoRA, we don't have the corresponding up matrix, I will use
+    # an identity.
+    if any("head.head" in k and k.endswith(".diff") for k in state_dict):
+        if f"head.head.{lora_down_key}.weight" in state_dict:
+            logger.info(
+                f"The state dict seems to have both `head.head.diff` and `head.head.{lora_down_key}.weight` keys, which is unexpected."
+            )
+        converted_state_dict["proj_out.lora_A.weight"] = original_state_dict.pop("head.head.diff")
+        down_matrix_head = converted_state_dict["proj_out.lora_A.weight"]
+        up_matrix_shape = (down_matrix_head.shape[0], converted_state_dict["proj_out.lora_B.bias"].shape[0])
+        converted_state_dict["proj_out.lora_B.weight"] = torch.eye(
+            *up_matrix_shape, dtype=down_matrix_head.dtype, device=down_matrix_head.device
+        ).T
+
     for text_time in ["text_embedding", "time_embedding"]:
         if any(text_time in k for k in original_state_dict):
             for b_n in [0, 2]:
```
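
A self-contained sketch (with hypothetical shapes) of the identity-fallback trick in the new branch above: when a LoRA ships only a down matrix, pairing it with an identity-like up matrix still yields a well-formed `up @ down` delta:

```python
import torch

# Hypothetical dimensions standing in for the real checkpoint's shapes.
rank, in_features, out_features = 32, 1536, 5120

down = torch.randn(rank, in_features)  # plays the role of `head.head.diff`
bias = torch.zeros(out_features)       # plays the role of `proj_out.lora_B.bias`

# Mirrors the diff: eye(rank, out_features).T produces an (out_features, rank)
# up matrix that copies the down projection into the first `rank` output rows
# and leaves the remaining rows zero.
up = torch.eye(down.shape[0], bias.shape[0]).T

delta = up @ down  # LoRA weight delta with the expected (out, in) shape
assert delta.shape == (out_features, in_features)
```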

src/diffusers/models/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -84,6 +84,7 @@
     _import_structure["transformers.transformer_2d"] = ["Transformer2DModel"]
     _import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
     _import_structure["transformers.transformer_bria"] = ["BriaTransformer2DModel"]
+    _import_structure["transformers.transformer_bria_fibo"] = ["BriaFiboTransformer2DModel"]
     _import_structure["transformers.transformer_chroma"] = ["ChromaTransformer2DModel"]
     _import_structure["transformers.transformer_cogview3plus"] = ["CogView3PlusTransformer2DModel"]
     _import_structure["transformers.transformer_cogview4"] = ["CogView4Transformer2DModel"]
@@ -174,6 +175,7 @@
     from .transformers import (
         AllegroTransformer3DModel,
         AuraFlowTransformer2DModel,
+        BriaFiboTransformer2DModel,
         BriaTransformer2DModel,
         ChromaTransformer2DModel,
         CogVideoXTransformer3DModel,
```
