Butterfly Network Attention (bna) is a training-free sparse-attention runtime for long-context inference. It is aimed at engineers who want measurable speed or memory wins without retraining the model.
- A PyTorch package,
bna, for sparse-attention research and integration work - CUDA and MLX benchmark scripts for Qwen, GLM, GPT-2, and related paths
- Measured benchmark artifacts under
benchmarks/,results/, andnotes/ - Older docs and scripts that still use the legacy names
WayfinderandHCSA
Public naming note: Butterfly / BNA is the current public project name. Wayfinder / HCSA are legacy names still present in deeper docs, scripts, benchmark artifact paths, and archived research material.
| Tier | What to trust | Evidence |
|---|---|---|
| Validated | GLM-4.7-Flash-4bit on MLX at the public stable profile | docs/FIRST_RELEASE.md |
| Experimental | Qwen 3.5 CUDA block-sparse path and long-context scaling work | scripts/bench_qwen35_cuda_wayfinder.py, benchmarks/cuda/qwen35_wayfinder/ |
| Experimental | Qwen 3.5 MLX / Apple Silicon path | scripts/bench_qwen_consumer_mlx.py, results/benchmarks/ |
| Research / archive | Older Wayfinder/HCSA docs, prompts, and exploratory runs | docs/, notes/, archive/ |
If you are new to the project, start from the validated GLM path first. The Qwen work is promising, but it should still be read as active engineering rather than a locked public release.
Dense causal attention does O(T^2) work per layer. Butterfly replaces that with a bounded sparse pattern over fixed-size token blocks.
At a high level, each block attends to:
- its local neighborhood
- a small number of deterministic long-range partners
- optional global or anchor-style connections, depending on the backend
The exact sparse pattern differs across code paths. Older Wayfinder/HCSA integrations describe this as window + cycle + landmarks; the current Butterfly README uses the simpler butterfly-partner framing. In both cases the goal is the same: keep attention neighborhoods explicit, bounded, and cheap enough to help at long context.
For contributor-facing implementation details, see docs/ARCHITECTURE.md.
The clearest in-repo release evidence today is the GLM-4.7-Flash-4bit stable profile documented in docs/FIRST_RELEASE.md.
At seq_len=8192 and decode_len=32 on the validated MLX path:
| Mode | E2E | Prefill | Decode tok/s | Peak memory |
|---|---|---|---|---|
| Dense | 17.15s | 16.36s | 40.58 | 20.66 GB |
| Butterfly | 10.56s | 9.75s | 39.85 | 20.07 GB |
| Delta vs dense | -38.44% | -40.38% | -1.79% | -2.85% |
That is the safest benchmark slice to cite publicly from this tree today.
The repo also contains experimental CUDA benchmark results for a Triton block-sparse path on Qwen 3.5 9B, where 8 of 32 layers are replaced and the remaining DeltaNet layers stay untouched.
| Context | Dense tok/s | Butterfly tok/s | Top-1 agreement |
|---|---|---|---|
| 4,096 | — | — | 99.88% |
| 8,192 | 1,651 | 1,698 | — |
| 16,384 | — | — | 94.44% |
| 32,768 | 1,585 | 1,688 | — |
| 65,536 | 1,475 | 1,724 | — |
| 98,304 | 1,413 | 1,660 | — |
| 131,072 | 1,365 | 1,667 | — |
| 262,144 | 1,257 | 1,712 | — |
These numbers suggest flatter throughput than dense attention at long context, but this path should still be treated as experimental until the quality and support boundaries are documented as tightly as the GLM release path.
| Context | Dense tok/s | Butterfly tok/s |
|---|---|---|
| 8,192 | 931 | 954 |
| 32,768 | 1,280 | 1,301 |
| 65,536 | 1,241 | 1,326 |
| 131,072 | 1,131 | 1,331 |
| 163,840 | — | 1,306 |
| 196,608 | — | 1,364 |
| 229,376 | — | 1,233 |
MLX permute-window path with K6 fused Metal kernel, window=64. 8 of 32 attention layers are replaced. Model: mlx-community/Qwen3.5-9B-MLX-4bit.
| Context | Dense TTFT | Butterfly TTFT | Dense tok/s | Butterfly tok/s | Peak memory |
|---|---|---|---|---|---|
| 2,048 | 71 ms | 49 ms | 62.2 | 62.0 | 7.1 GB |
| 8,192 | 116 ms | 86 ms | 57.2 | 58.8 | 9.9 GB |
| 32,768 | 100 ms | 99 ms | 49.6 | 47.1 | 13.7 GB |
| 65,536 | 160 ms | 202 ms | 41.5 | 39.8 | 18.9 GB |
| 98,304 | 2.0 s | 1.2 s | 17.2 | 22.4 | 24.0 GB |
| 131,072 | 6.9 s | 7.5 s | 7.3 | 6.8 | 29.1 GB |
| 163,840 | 26.8 s | 21.5 s | 2.2 | 2.7 | 34.2 GB |
This MLX path uses chunked-gather plus native SDPA for prefill and a fused Metal kernel for decode. It shows wins at short context and again near the memory wall, but it is still an experimental path rather than a validated public release.
Top-1 agreement in the Qwen 9B experiments is 99.88% at 4K and 94.44% at 16K. Perplexity and downstream evaluation are still in progress, so avoid treating these tables as universal quality-parity claims.
git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[dev,kernels]"Validated public path:
./scripts/run_public_stable_profile_glm.shExperimental Qwen CUDA benchmark:
python scripts/bench_qwen35_cuda_wayfinder.py \
--model-path <path-to-Qwen3.5-9B> \
--path block_sparse \
--engine triton \
--block-size 128 \
--seq-lens 4096 8192 16384 32768git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[mlx]"
pip install mlx-lm zmlxEnvironment check:
python scripts/env_check_mlx.pyExperimental Qwen MLX benchmark:
python scripts/bench_qwen_consumer_mlx.py \
--model-path mlx-community/Qwen3.5-9B-MLX-4bit \
--mode wayfinder \
--seq-lens 2048 8192 32768 \
--decode-len 256 \
--repeats 3 \
--out-dir results/benchmarks/my_runThe --mode dense flag runs the stock attention baseline for comparison. Add --skip-quality to benchmark only throughput.
pytest
ruff check bna tests| Path | What it is |
|---|---|
bna/ |
Core package and backend integrations |
scripts/ |
Benchmarks, diagnostics, serving helpers, and figure generation |
docs/ |
Contributor-facing architecture, release evidence, and research notes |
benchmarks/, results/ |
Raw benchmark outputs and summaries |
notes/ |
Lab notebook, experiment log, handoff prompts, and planning material |
archive/ |
Older exploratory code and preserved artifacts |
- docs/FIRST_RELEASE.md: validated benchmark slice and reproduction commands
- docs/ARCHITECTURE.md: contributor-facing implementation map
- CONTRIBUTING.md: expectations for docs, claims, and performance changes
MIT
