
Flex Block Attn

Introduction

Flex-Block-Attn is an efficient block sparse attention computation library designed for Hunyuan Video. It supports various sparse attention strategies, including STA, MoBA, and SSTA (Selective and Sliding Tile Attention, a hybrid of STA and MoBA), for both training and inference. Built upon the attention demo implementation from ThunderKittens, this library delivers arbitrary block sparse attention computation optimized for Hopper architecture GPUs. It features PyTorch-like mask expressions that keep the API easy to use while enabling efficient sparse mask generation.


Project Updates

[2025-11-19] We have released the Flex-Block-Attn implementation along with comprehensive benchmark results. We welcome the community to test and provide feedback!

🛠️ Quick start

Requirements

  • Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
  • Python 3.8 or later
  • CUDA Toolkit 12.8

Installation

git submodule update --init --recursive
python setup.py install
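
To verify the build, a minimal sanity check (the import name below is taken from the usage example in the next section; the check itself is not part of the official install steps):

# post-install sanity check: this import should succeed once installation finishes
from flex_block_attn import flex_block_attn_func
print("flex_block_attn imported:", flex_block_attn_func is not None)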

🔑 Usage

Custom kernel

import torch

from flex_block_attn import flex_block_attn_func
from benchmark.utils.utils import create_sparse_mask

# Take a (block_size * 2) x (block_size * 2) attention map as an example.
# block_size can be 64, 128, 192, ...
block_size = 64
q_block_size = k_block_size = block_size
# Select the two off-diagonal blocks of the 2 x 2 block grid.
selected_blocks = [[0, 1], [1, 0]]

# query/key/value are the attention inputs; the head dim must be 128 and the
# sequence length must be divisible by the tile size (see Notes). The
# [batch, head_num, seq_len, head_dim] layout and bfloat16 dtype below are
# assumptions for illustration only.
query = torch.randn(1, 1, block_size * 2, 128, dtype=torch.bfloat16, device="cuda")
key = torch.randn_like(query)
value = torch.randn_like(query)

# create the block mask from the selected blocks
block_mask = create_sparse_mask(query, block_size, selected_blocks)

'''
sparse mask: torch.tensor([[0,1],[1,0]])
For example, if block_size=64, the shape of the equivalent element-wise torch mask is [128, 128].

our sparse mask:
[[0,1],
 [1,0]]

original torch mask:
[[0,0,0...,0],[1,1,1...,1],
 [0,0,0...,0],[1,1,1...,1],
 ...         , ...        ,
 [0,0,0...,0],[1,1,1...,1],
 [1,1,1...,1],[0,0,0...,0],
 [1,1,1...,1],[0,0,0...,0],
 ...         , ...        ,
 [1,1,1...,1],[0,0,0...,0]]
'''

# compute block sparse attention
output = flex_block_attn_func(query, key, value, q_block_size, k_block_size, block_mask)

SSTA kernel

SSTA is a novel attention mechanism that combines the sparse attention patterns of MoBA and STA. It has been used in both the training and inference of Hunyuan Video. We will be open-sourcing all related code in the near future – stay tuned!

❗️ Notes

  • The head dim must be 128.
  • The q tile_size can be any multiple of 16 and the k/v tile_size any multiple of 64; 384 is recommended, since we have performed additional optimizations for this size.
  • The sequence lengths of q and kv must be divisible by their respective tile sizes (see the configuration sketch after this list).
  • The attention mask only supports block-level masking; block_mask accepts two shapes: [seq_len, seq_len] or [batch, head_num, seq_len, seq_len].
  • Within selected blocks, full attention computation is performed.
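
As a quick illustration of these constraints, the sketch below builds a configuration using the recommended 384 tile size. The function names come from the usage example above; the tensor layout, dtype, and shapes are illustrative assumptions, not part of the documented API.

import torch

from flex_block_attn import flex_block_attn_func
from benchmark.utils.utils import create_sparse_mask

head_dim = 128                      # head dim must be 128
q_block_size = k_block_size = 384   # multiple of 16 (q) and of 64 (k/v); 384 is the recommended size
seq_len = 384 * 4                   # must be divisible by both tile sizes

assert seq_len % q_block_size == 0 and seq_len % k_block_size == 0

# keep only the diagonal blocks of the 4 x 4 block grid;
# full attention is computed inside each selected block
selected_blocks = [[i, i] for i in range(seq_len // k_block_size)]

# assumed layout: [batch, head_num, seq_len, head_dim]; bfloat16 is also an assumption
query = torch.randn(1, 1, seq_len, head_dim, dtype=torch.bfloat16, device="cuda")
key = torch.randn_like(query)
value = torch.randn_like(query)

block_mask = create_sparse_mask(query, k_block_size, selected_blocks)
output = flex_block_attn_func(query, key, value, q_block_size, k_block_size, block_mask)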

🚀 Performance

We provide performance comparisons in the benchmark folder, including mask creation time, forward/backward execution time, and GPU memory usage for the following attention types: full attention, sparse static attention, and sparse dynamic attention. We also provide the complete results for all three attention types from testing on the H800 and H20 GPUs.

Sparse dynamic attention

In sparse dynamic attention tasks, the attention mask is generated randomly with a specified sparsity ratio. We show the FlexBlockAttn speedup with the following parameters:

  • Sequence lengths: 11520, 19200, 30720, 38400, 46080, 53760, 61440, 69120
  • Block size: 384
  • Sparsity ratio: 0.6
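
For reference, a random block mask like the one used in this benchmark could be generated along the following lines. This is a minimal sketch: the [q_block, kv_block] pair format matches the usage example above, while the helper name random_selected_blocks and the interpretation of the sparsity ratio as the fraction of dropped blocks are assumptions.

import random

# hypothetical helper: pick random blocks to keep for a given sparsity ratio
def random_selected_blocks(num_q_blocks, num_kv_blocks, sparsity=0.6, seed=0):
    """Return [q_block, kv_block] pairs, keeping roughly (1 - sparsity) of all blocks."""
    rng = random.Random(seed)
    all_blocks = [[i, j] for i in range(num_q_blocks) for j in range(num_kv_blocks)]
    keep = max(1, int(len(all_blocks) * (1.0 - sparsity)))
    return rng.sample(all_blocks, keep)

# e.g. seq_len 11520 with block_size 384 gives a 30 x 30 block grid, of which 60% is dropped
selected_blocks = random_selected_blocks(11520 // 384, 11520 // 384, sparsity=0.6)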

The end-to-end performance of Flex Block Attention (mask creation, forward, and backward combined) is better than that of mainstream sparse attention libraries.

H800 Dynamic Attention Speedup

FlexBlockAttn speedup on H800

H20 Dynamic Attention Speedup

FlexBlockAttn speedup on H20

Full attention

In full attention tasks, Flex Block Attention continues to deliver robust performance.

H800 Full Attention Speedup

FlexBlockAttn full attn speedup on H800

H20 Full Attention Speedup

FlexBlockAttn full attn speedup on H20

🙏 Acknowledgments

This project stands on the shoulders of the following amazing projects and resources. We extend our sincere gratitude to:

  • ThunderKittens: Our project extends its computational engine, building additional logic layers while leveraging its core calculation capabilities. The underlying computational power is entirely provided by its excellent infrastructure.
  • STA (Sliding Tile Attention), MoBA: In training our video model, we drew inspiration from these projects' innovative contributions to sparse attention computation.
  • flex attention, flash-attention, MagiAttention, SpargeAttn, Triton: These projects have been pivotal in advancing efficient and flexible attention mechanisms and high-performance GPU programming. Their collective work on long-sequence processing, sparsity optimization, and efficient computational backends has been a crucial source of inspiration, performance baselines, and validation for our design and implementation.

We are grateful to the entire open-source community for their invaluable contributions.

🔗 Citation

If you use this codebase or otherwise find our work valuable, please cite:

@misc{flex_block_attn2025,
      title={flex-block-attn: an efficient block sparse attention computation library},
      author={Yuanbo Peng*, Penghao Zhao*, Jiangfeng Xiong, Fang Yang, Songtao Liu, Jianbing Wu,
              Zhao Zhong, Key, Linus, Peng Chen, Jie Jiang},
      year={2025},
      publisher = {GitHub},
      howpublished = {\url{https://github.com/Tencent-Hunyuan/flex-block-attn}},
}
