
Flex Block Attn

Introduction

Flex-Block-Attn is an efficient block sparse attention computation library designed for Hunyuan Video. It supports various sparse attention strategies, including STA, MoBA, and SSTA (Selective and Sliding Tile Attention, a hybrid of STA and MoBA), for both training and inference. Built upon the attention demo implementation from ThunderKittens, this library delivers arbitrary block sparse attention computation optimized for Hopper architecture GPUs. It features PyTorch-like mask expressions that keep the API easy to use while enabling efficient sparse mask generation.


Project Updates

[2025-11-19] We have released the Flex-Block-Attn implementation along with comprehensive benchmark results. We welcome the community to test and provide feedback!

🛠️ Quick start

Requirements

  • Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
  • Python 3.8 or later
  • CUDA Toolkit 12.8

Installation

git submodule update --init --recursive
python setup.py install
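
To verify the build, a minimal sanity check (the import name below is taken from the usage example in the next section; the check itself is not part of the official install steps):

# post-install sanity check: this import should succeed once installation finishes
from flex_block_attn import flex_block_attn_func
print("flex_block_attn imported:", flex_block_attn_func is not None)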

🔑 Usage

Custom kernel

import torch

from flex_block_attn import flex_block_attn_func
from benchmark.utils.utils import create_sparse_mask

# Take a (block_size * 2) x (block_size * 2) attention map as an example.
# block_size can be 64, 128, 192, ...
block_size = 64
q_block_size = k_block_size = block_size
# Select the two off-diagonal blocks of the 2 x 2 block grid.
selected_blocks = [[0, 1], [1, 0]]

# query/key/value are the attention inputs; the head dim must be 128 and the
# sequence length must be divisible by the tile size (see Notes). The
# [batch, head_num, seq_len, head_dim] layout and bfloat16 dtype below are
# assumptions for illustration only.
query = torch.randn(1, 1, block_size * 2, 128, dtype=torch.bfloat16, device="cuda")
key = torch.randn_like(query)
value = torch.randn_like(query)

# create the block mask from the selected blocks
block_mask = create_sparse_mask(query, block_size, selected_blocks)

'''
sparse mask: torch.tensor([[0,1],[1,0]])
For example, if block_size=64, the shape of the equivalent element-wise torch mask is [128, 128].

our sparse mask:
[[0,1],
 [1,0]]

original torch mask:
[[0,0,0...,0],[1,1,1...,1],
 [0,0,0...,0],[1,1,1...,1],
 ...         , ...        ,
 [0,0,0...,0],[1,1,1...,1],
 [1,1,1...,1],[0,0,0...,0],
 [1,1,1...,1],[0,0,0...,0],
 ...         , ...        ,
 [1,1,1...,1],[0,0,0...,0]]
'''

# compute block sparse attention
output = flex_block_attn_func(query, key, value, q_block_size, k_block_size, block_mask)

SSTA kernel

SSTA is a novel attention mechanism that combines the sparse attention patterns of MoBA and STA. It has been used in both the training and inference of Hunyuan Video. We will be open-sourcing all related code in the near future – stay tuned!

❗️ Notes

  • The head dim must be 128.
  • The q tile_size can be any multiple of 16 and the k/v tile_size any multiple of 64; 384 is recommended, since we have performed additional optimizations for this size.
  • The sequence lengths of q and kv must be divisible by their respective tile sizes (see the configuration sketch after this list).
  • The attention mask only supports block-level masking; block_mask accepts two shapes: [seq_len, seq_len] or [batch, head_num, seq_len, seq_len].
  • Within selected blocks, full attention computation is performed.
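
As a quick illustration of these constraints, the sketch below builds a configuration using the recommended 384 tile size. The function names come from the usage example above; the tensor layout, dtype, and shapes are illustrative assumptions, not part of the documented API.

import torch

from flex_block_attn import flex_block_attn_func
from benchmark.utils.utils import create_sparse_mask

head_dim = 128                      # head dim must be 128
q_block_size = k_block_size = 384   # multiple of 16 (q) and of 64 (k/v); 384 is the recommended size
seq_len = 384 * 4                   # must be divisible by both tile sizes

assert seq_len % q_block_size == 0 and seq_len % k_block_size == 0

# keep only the diagonal blocks of the 4 x 4 block grid;
# full attention is computed inside each selected block
selected_blocks = [[i, i] for i in range(seq_len // k_block_size)]

# assumed layout: [batch, head_num, seq_len, head_dim]; bfloat16 is also an assumption
query = torch.randn(1, 1, seq_len, head_dim, dtype=torch.bfloat16, device="cuda")
key = torch.randn_like(query)
value = torch.randn_like(query)

block_mask = create_sparse_mask(query, k_block_size, selected_blocks)
output = flex_block_attn_func(query, key, value, q_block_size, k_block_size, block_mask)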

🚀 Performance

We provide performance comparisons in the benchmark folder, including mask creation time, forward/backward execution time, and GPU memory usage for the following attention types: full attention, sparse static attention, and sparse dynamic attention. We also provide the complete results for all three attention types from testing on the H800 and H20 GPUs.

Sparse dynamic attention

In sparse dynamic attention tasks, the attention mask is generated randomly with a specified sparsity ratio. We show the FlexBlockAttn speedup with the following parameters:

  • Sequence lengths: 11520, 19200, 30720, 38400, 46080, 53760, 61440, 69120
  • Block size: 384
  • Sparsity ratio: 0.6
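
For reference, a random block mask like the one used in this benchmark could be generated along the following lines. This is a minimal sketch: the [q_block, kv_block] pair format matches the usage example above, while the helper name random_selected_blocks and the interpretation of the sparsity ratio as the fraction of dropped blocks are assumptions.

import random

# hypothetical helper: pick random blocks to keep for a given sparsity ratio
def random_selected_blocks(num_q_blocks, num_kv_blocks, sparsity=0.6, seed=0):
    """Return [q_block, kv_block] pairs, keeping roughly (1 - sparsity) of all blocks."""
    rng = random.Random(seed)
    all_blocks = [[i, j] for i in range(num_q_blocks) for j in range(num_kv_blocks)]
    keep = max(1, int(len(all_blocks) * (1.0 - sparsity)))
    return rng.sample(all_blocks, keep)

# e.g. seq_len 11520 with block_size 384 gives a 30 x 30 block grid, of which 60% is dropped
selected_blocks = random_selected_blocks(11520 // 384, 11520 // 384, sparsity=0.6)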

The end-to-end performance of Flex Block Attention (mask creation, forward, and backward combined) is better than that of mainstream sparse attention libraries.

H800 Dynamic Attention Speedup

FlexBlockAttn speedup on H800

H20 Dynamic Attention Speedup

FlexBlockAttn speedup on H20

Full attention

In full attention tasks, Flex Block Attention continues to deliver robust performance.

H800 Full Attention Speedup

FlexBlockAttn full attn speedup on H800

H20 Full Attention Speedup

FlexBlockAttn full attn speedup on H20

🙏 Acknowledgments

This project stands on the shoulders of the following amazing projects and resources. We extend our sincere gratitude to:

  • ThunderKittens: Our project extends its computational engine, building additional logic layers while leveraging its core calculation capabilities. The underlying computational power is entirely provided by its excellent infrastructure.
  • STA (Sliding Tile Attention), MoBA: In training our video model, we drew inspiration from these projects' innovative contributions to sparse attention computation.
  • flex attention, flash-attention, MagiAttention, SpargeAttn, Triton: These projects have been pivotal in advancing efficient and flexible attention mechanisms and high-performance GPU programming. Their collective work on long-sequence processing, sparsity optimization, and efficient computational backends has been a crucial source of inspiration, performance baselines, and validation for our design and implementation.

We are grateful to the entire open-source community for their invaluable contributions.

🔗 Citation

If you use this codebase or otherwise find our work valuable, please cite:

@misc{flex_block_attn2025,
      title={flex-block-attn: an efficient block sparse attention computation library},
      author={Yuanbo Peng*, Penghao Zhao*, Jiangfeng Xiong, Fang Yang, Songtao Liu, Jianbing Wu,
              Zhao Zhong, Key, Linus, Peng Chen, Jie Jiang},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/Tencent-Hunyuan/flex-block-attn}},
}
