Butterfly

Butterfly Network Attention (BNA, packaged as bna) is a training-free sparse-attention runtime for long-context inference. It is aimed at engineers who want measurable speed or memory wins without retraining the model.

What this repo contains

A PyTorch package, bna, for sparse-attention research and integration work
CUDA and MLX benchmark scripts for Qwen, GLM, GPT-2, and related paths
Measured benchmark artifacts under benchmarks/, results/, and notes/
Older docs and scripts that still use the legacy names Wayfinder and HCSA

Public naming note: Butterfly / BNA is the current public project name. Wayfinder / HCSA are legacy names still present in deeper docs, scripts, benchmark artifact paths, and archived research material.

Status

Tier	What to trust	Evidence
Validated	GLM-4.7-Flash-4bit on MLX at the public stable profile	docs/FIRST_RELEASE.md
Experimental	Qwen 3.5 CUDA block-sparse path and long-context scaling work	`scripts/bench_qwen35_cuda_wayfinder.py`, `benchmarks/cuda/qwen35_wayfinder/`
Experimental	Qwen 3.5 MLX / Apple Silicon path	`scripts/bench_qwen_consumer_mlx.py`, `results/benchmarks/`
Research / archive	Older Wayfinder/HCSA docs, prompts, and exploratory runs	`docs/`, `notes/`, `archive/`

If you are new to the project, start with the validated GLM path. The Qwen work is promising, but it should still be read as active engineering rather than a locked public release.

How it works

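Dense causal attention does O(T^2) work per layer, where T is the sequence length. Butterfly replaces that with a bounded sparse pattern over fixed-size token blocks, so each block attends to a limited set of other blocks instead of the full prefix.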
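As a rough cost model (a sketch under stated assumptions, not a measurement from this repo), let T be the sequence length, B the block size, and k a hypothetical upper bound on the number of key blocks each query block attends to. Counting attention-score entries per head, and ignoring causal masking inside a block:

```latex
\begin{align*}
\text{dense causal:} \quad & \frac{T(T+1)}{2} = O(T^2) \\
\text{block-sparse:} \quad & \frac{T}{B} \cdot k \cdot B^2 = k\,B\,T = O(T) \quad \text{for fixed } k \text{ and } B
\end{align*}
```

With T = 131,072, B = 128, and an illustrative k = 16, that is about 8.6 x 10^9 dense entries against about 2.7 x 10^8 sparse entries. The value of k is an assumption chosen for the arithmetic; the actual bound depends on the backend.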

At a high level, each block attends to:

its local neighborhood
a small number of deterministic long-range partners
optional global or anchor-style connections, depending on the backend

The exact sparse pattern differs across code paths. Older Wayfinder/HCSA integrations describe it as window + cycle + landmarks; this README uses the simpler butterfly-partner framing. In both cases the goal is the same: keep attention neighborhoods explicit, bounded, and cheap enough to help at long context.
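To make the block-level idea concrete, here is a minimal sketch of one possible pattern: a causal local window plus XOR-stride partners, in the style of a classic butterfly network. The function name, the `window` parameter, and the XOR rule are all assumptions for illustration; the pattern in `bna` may differ, and optional global or anchor connections are left out.

```python
def butterfly_block_neighbors(num_blocks: int, window: int = 1) -> list[list[int]]:
    """Return, for each query block, the sorted key blocks it may attend to.

    Hypothetical pattern, for illustration only:
      - local: the block itself and up to `window` preceding blocks
      - long-range: partners at index i XOR 2**s, kept only if causal
    """
    neighbors = []
    for i in range(num_blocks):
        # Local neighborhood, clipped at the start of the sequence.
        keys = set(range(max(0, i - window), i + 1))
        stride = 1
        while stride < num_blocks:
            partner = i ^ stride
            if partner < i:  # causal: never attend to a future block
                keys.add(partner)
            stride *= 2
        neighbors.append(sorted(keys))
    return neighbors


if __name__ == "__main__":
    num_blocks = 8
    pattern = butterfly_block_neighbors(num_blocks, window=1)
    for i, keys in enumerate(pattern):
        print(f"block {i}: attends to {keys}")
    dense_pairs = num_blocks * (num_blocks + 1) // 2
    sparse_pairs = sum(len(keys) for keys in pattern)
    print(f"causal block pairs: dense={dense_pairs}, sparse={sparse_pairs}")
```

In this sketch each block has at most `window + 1 + log2(num_blocks)` key blocks, which is what keeps the neighborhood bounded as the context grows.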

For contributor-facing implementation details, see docs/ARCHITECTURE.md.

Measured evidence

Validated public path: GLM on MLX

The clearest in-repo release evidence today is the GLM-4.7-Flash-4bit stable profile documented in docs/FIRST_RELEASE.md.

At seq_len=8192 and decode_len=32 on the validated MLX path:

Mode	E2E	Prefill	Decode tok/s	Peak memory
Dense	17.15s	16.36s	40.58	20.66 GB
Butterfly	10.56s	9.75s	39.85	20.07 GB
Delta vs dense	-38.44%	-40.38%	-1.79%	-2.85%

That is the safest benchmark slice to cite publicly from this tree today.
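The "Delta vs dense" row reads as a relative change against the dense baseline. The sketch below recomputes it from the rounded values in the table; the helper name is hypothetical, and because the inputs are already rounded, the results can differ from the published row in the second decimal place.

```python
def relative_delta_pct(dense: float, sparse: float) -> float:
    """Percentage change of `sparse` relative to the `dense` baseline."""
    return (sparse - dense) / dense * 100.0


# (dense, Butterfly) pairs copied from the GLM table above.
GLM_ROWS = {
    "E2E (s)": (17.15, 10.56),
    "Prefill (s)": (16.36, 9.75),
    "Decode tok/s": (40.58, 39.85),
    "Peak memory (GB)": (20.66, 20.07),
}

if __name__ == "__main__":
    for metric, (dense, sparse) in GLM_ROWS.items():
        print(f"{metric}: {relative_delta_pct(dense, sparse):+.2f}%")
```

A negative delta is an improvement for time and memory, and a regression for decode tok/s.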

Experimental CUDA path: Qwen 3.5 9B

The repo also contains experimental CUDA benchmark results for a Triton block-sparse path on Qwen 3.5 9B, where 8 of 32 layers are replaced and the remaining DeltaNet layers stay untouched.

Context	Dense tok/s	Butterfly tok/s	Top-1 agreement
4,096	—	—	99.88%
8,192	1,651	1,698	—
16,384	—	—	94.44%
32,768	1,585	1,688	—
65,536	1,475	1,724	—
98,304	1,413	1,660	—
131,072	1,365	1,667	—
262,144	1,257	1,712	—

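In this table, dense throughput falls from 1,651 tok/s at 8,192 tokens to 1,257 tok/s at 262,144 tokens, while Butterfly stays between 1,660 and 1,724 tok/s. That suggests flatter throughput than dense attention at long context, but this path should still be treated as experimental until the quality and support boundaries are documented as tightly as the GLM release path.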
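One way to quantify "flatter" is to compare the max/min spread of each throughput column, alongside the per-context speedup. The sketch below does that for the rows of the Qwen 3.5 9B table where both columns are reported; the helper names are hypothetical and the numbers are copied from the table, not re-measured.

```python
# (context length, dense tok/s, Butterfly tok/s) for rows with both values.
QWEN_9B_CUDA = [
    (8_192, 1651, 1698),
    (32_768, 1585, 1688),
    (65_536, 1475, 1724),
    (98_304, 1413, 1660),
    (131_072, 1365, 1667),
    (262_144, 1257, 1712),
]


def speedups(rows):
    """Butterfly throughput divided by dense throughput, per context length."""
    return {ctx: sparse / dense for ctx, dense, sparse in rows}


def spread(values):
    """Max/min ratio of a throughput column; 1.0 means perfectly flat."""
    values = list(values)
    return max(values) / min(values)


if __name__ == "__main__":
    for ctx, ratio in speedups(QWEN_9B_CUDA).items():
        print(f"{ctx:>7}: {ratio:.2f}x")
    print(f"dense spread:     {spread(d for _, d, _ in QWEN_9B_CUDA):.2f}x")
    print(f"Butterfly spread: {spread(s for _, _, s in QWEN_9B_CUDA):.2f}x")
```

On these rows the dense column varies by about 1.31x from best to worst, against about 1.04x for Butterfly, and the speedup at 262,144 tokens is about 1.36x.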

Experimental CUDA path: Qwen 3.5 35B A3B FP8

Context	Dense tok/s	Butterfly tok/s
8,192	931	954
32,768	1,280	1,301
65,536	1,241	1,326
131,072	1,131	1,331
163,840	—	1,306
196,608	—	1,364
229,376	—	1,233

Experimental Apple Silicon path: Qwen 3.5 9B on M4 Max

MLX permute-window path with K6 fused Metal kernel, window=64. As in the CUDA experiments, 8 of 32 layers are replaced. Model: mlx-community/Qwen3.5-9B-MLX-4bit.

Context	Dense TTFT	Butterfly TTFT	Dense tok/s	Butterfly tok/s	Peak memory
2,048	71 ms	49 ms	62.2	62.0	7.1 GB
8,192	116 ms	86 ms	57.2	58.8	9.9 GB
32,768	100 ms	99 ms	49.6	47.1	13.7 GB
65,536	160 ms	202 ms	41.5	39.8	18.9 GB
98,304	2.0 s	1.2 s	17.2	22.4	24.0 GB
131,072	6.9 s	7.5 s	7.3	6.8	29.1 GB
163,840	26.8 s	21.5 s	2.2	2.7	34.2 GB

This MLX path uses chunked-gather plus native SDPA for prefill and a fused Metal kernel for decode. In the table above, Butterfly lowers TTFT at 2K and 8K and again at 98K and 163K, but it is slower than dense at 65K and 131K. Treat it as an experimental path rather than a validated public release.

Top-1 agreement in the Qwen 3.5 9B CUDA experiments is 99.88% at 4K and 94.44% at 16K. Perplexity and downstream evaluation are still in progress, so do not treat these tables as universal quality-parity claims.
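For readers unfamiliar with the metric, top-1 agreement is commonly defined as the fraction of positions where the dense and sparse runs pick the same highest-scoring token. This README does not spell out the exact definition used by the benchmark scripts, so the sketch below is an assumption, written in plain Python with hypothetical names and toy logits.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(dense_logits, sparse_logits) -> float:
    """Fraction of positions where both runs select the same top-1 token."""
    if len(dense_logits) != len(sparse_logits):
        raise ValueError("both runs must cover the same positions")
    if not dense_logits:
        raise ValueError("need at least one position")
    matches = sum(
        argmax(d) == argmax(s) for d, s in zip(dense_logits, sparse_logits)
    )
    return matches / len(dense_logits)


if __name__ == "__main__":
    # Toy example: 4 positions over a 3-token vocabulary.
    dense = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 1.0, 0.2]]
    sparse = [[0.2, 1.8, 0.1], [1.4, 0.3, 0.2], [0.1, 0.0, 2.5], [1.1, 1.0, 0.2]]
    print(f"top-1 agreement: {top1_agreement(dense, sparse):.2%}")
```

Agreement on the argmax says nothing about how close the full distributions are, which is one reason perplexity and downstream evaluation still matter.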

Quick start

CUDA (NVIDIA GPU)

git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[dev,kernels]"

Validated public path:

./scripts/run_public_stable_profile_glm.sh

Experimental Qwen CUDA benchmark:

python scripts/bench_qwen35_cuda_wayfinder.py \
    --model-path <path-to-Qwen3.5-9B> \
    --path block_sparse \
    --engine triton \
    --block-size 128 \
    --seq-lens 4096 8192 16384 32768

MLX (Apple Silicon)

git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[mlx]"
pip install mlx-lm zmlx

Environment check:

python scripts/env_check_mlx.py

Experimental Qwen MLX benchmark:

python scripts/bench_qwen_consumer_mlx.py \
    --model-path mlx-community/Qwen3.5-9B-MLX-4bit \
    --mode wayfinder \
    --seq-lens 2048 8192 32768 \
    --decode-len 256 \
    --repeats 3 \
    --out-dir results/benchmarks/my_run

The --mode dense flag runs the stock attention baseline for comparison. Add --skip-quality to benchmark only throughput.
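To compare the two modes with identical settings, it can help to build both command lines from one place. The sketch below only assembles and prints the commands, using the flags shown above; the helper name and the per-mode output directories are hypothetical.

```python
import shlex


def bench_command(mode, model_path, seq_lens, out_dir,
                  decode_len=256, repeats=3, skip_quality=False):
    """Build the argv list for scripts/bench_qwen_consumer_mlx.py."""
    cmd = [
        "python", "scripts/bench_qwen_consumer_mlx.py",
        "--model-path", model_path,
        "--mode", mode,
        "--seq-lens", *[str(n) for n in seq_lens],
        "--decode-len", str(decode_len),
        "--repeats", str(repeats),
        "--out-dir", out_dir,
    ]
    if skip_quality:
        cmd.append("--skip-quality")
    return cmd


if __name__ == "__main__":
    model = "mlx-community/Qwen3.5-9B-MLX-4bit"
    for mode in ("dense", "wayfinder"):
        # Hypothetical layout: one output directory per mode.
        out_dir = f"results/benchmarks/my_run_{mode}"
        cmd = bench_command(mode, model, [2048, 8192, 32768], out_dir)
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repo root to execute the benchmark.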

Basic checks

pytest
ruff check bna tests

Repo map

Path	What it is
`bna/`	Core package and backend integrations
`scripts/`	Benchmarks, diagnostics, serving helpers, and figure generation
`docs/`	Contributor-facing architecture, release evidence, and research notes
`benchmarks/`, `results/`	Raw benchmark outputs and summaries
`notes/`	Lab notebook, experiment log, handoff prompts, and planning material
`archive/`	Older exploratory code and preserved artifacts
