feat: Metal fused attention kernels (ForgeAttention) by user-23xyz · Pull Request #1 · aivrar/multi-turboquant

user-23xyz · 2026-04-11T00:03:45Z

Summary

Fills the empty kernels/metal/ directory with fused 3-bit KV dequantization inside the attention dot product for Apple Silicon.

Instead of decompressing the KV cache and then running SDPA (the standard Metal path), these kernels read packed 3-bit data directly inside the QK dot product and the SV accumulation, so the FP16 intermediate is never materialized in device memory.
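To make the fused idea concrete, here is a minimal pure-Python sketch of reading 3-bit codes straight out of packed bytes inside the dot product. It is not the PR's Metal code: the codebook values, the little-endian bit layout, the single per-key `scale`, and every function name are assumptions for illustration, and the rotation step is left out.

```python
# Hypothetical 8-level codebook; the real kernels use the PlanarQuant/IsoQuant constants.
CODEBOOK = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]

def pack_3bit(codes):
    """Pack 3-bit codes (0..7) into bytes, little-endian bit order (assumed layout)."""
    bits = 0
    for i, c in enumerate(codes):
        bits |= (c & 0b111) << (3 * i)
    nbytes = (3 * len(codes) + 7) // 8
    return bits.to_bytes(nbytes, "little")

def read_code(packed, i):
    """Read the i-th 3-bit code directly from the packed bytes."""
    byte, shift = divmod(3 * i, 8)
    # A code can straddle two bytes, so read a 16-bit window.
    lo = packed[byte]
    hi = packed[byte + 1] if byte + 1 < len(packed) else 0
    return ((lo | (hi << 8)) >> shift) & 0b111

def fused_qk(q, packed_k, scale):
    """Dot product of q with a packed key; no dequantized key vector is stored."""
    acc = 0.0
    for i, qi in enumerate(q):
        acc += qi * CODEBOOK[read_code(packed_k, i)]
    return acc * scale

def reference_qk(q, codes, scale):
    """Baseline: decompress the whole key first, then take the dot product."""
    k = [CODEBOOK[c] * scale for c in codes]
    return sum(qi * ki for qi, ki in zip(q, k))
```

Both functions give the same score up to floating-point rounding; the fused version simply never builds the list `k`, which is the part that would otherwise live in device memory as FP16.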

What's added

- kernels/metal/fused_attention.py: 6 Metal kernels via mx.fast.metal_kernel
  - Fused QK (scores from packed K, FP16 math)
  - Tiled SV (256-token tiles)
  - Flash decode (single-pass QK+softmax+SV)
  - Sparse SV (skips zero-prob tokens)
  - Phase 1/2 fused sparse (tile-level early exit)
- kernels/metal/calibration.py: per-head budget calibration
- tests/test_metal_fused.py: 2 tests
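The "single-pass QK+softmax+SV" item is usually done with a running (online) softmax. The sketch below shows that general technique in plain Python and checks it against a two-pass reference. It is an illustration only: the function names are hypothetical, the scores are taken as given rather than computed from packed K, and nothing here reflects the tiling or threadgroup layout of the actual kernel.

```python
import math

def flash_decode(scores, values):
    """One pass over tokens, keeping a running max, normalizer, and weighted sum."""
    m = -math.inf                    # running max of the scores seen so far
    l = 0.0                          # running sum of exp(score - m)
    acc = [0.0] * len(values[0])     # running weighted sum of value vectors
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # Rescale earlier contributions to the new max (exp(-inf) is 0.0 on the first step).
        correction = math.exp(m - m_new)
        p = math.exp(s - m_new)
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        l = l * correction + p
        m = m_new
    return [a / l for a in acc]

def two_pass_reference(scores, values):
    """Standard softmax followed by the weighted sum, for comparison."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]
```

Because the softmax weights are never stored per token, a kernel built this way only needs the running triple `(m, l, acc)` per query, which is what makes a single pass over the packed cache possible.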

Results (M4 Mini 16GB)

82% per-layer KV memory reduction
0.99x baseline decode speed
NIAH (needle-in-a-haystack) retrieval to 300K context
Compatible with PlanarQuant/IsoQuant rotation (swappable constants)
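As a rough sanity check on the memory figure, the arithmetic for 3-bit versus FP16 storage can be worked out per vector. The PR does not say how the 82% number was measured, so the head dimension of 128 and the optional per-vector metadata below are assumptions, and the helper name is hypothetical.

```python
def kv_reduction(head_dim, overhead_bytes_per_vector=0):
    """Fractional size reduction of one K or V vector: packed 3-bit vs FP16."""
    fp16_bytes = 2 * head_dim                       # 16 bits per element
    packed_bytes = (3 * head_dim + 7) // 8          # 3 bits per element, rounded up to bytes
    return 1.0 - (packed_bytes + overhead_bytes_per_vector) / fp16_bytes

ideal = kv_reduction(128)                           # 0.8125 with no metadata
with_scale = kv_reduction(128, overhead_bytes_per_vector=4)
```

Ideal 3-bit packing gives 81.25%, which is in the same range as the reported 82%; any per-vector metadata such as a scale lowers the figure, so the exact number depends on details the PR text does not spell out.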

Code: github.com/user-23xyz/forgeattention

Fills the empty kernels/metal/ directory with fused 3-bit KV dequantization inside the attention dot product for Apple Silicon. The packed 3-bit values are read, rotated, and dot-producted entirely in threadgroup shared memory, with no FP16 intermediate in device memory.

New files:
- kernels/metal/fused_attention.py: 6 Metal kernels via mx.fast.metal_kernel
- kernels/metal/calibration.py: per-head budget calibration
- tests/test_metal_fused.py: 2 tests (fused QK correctness, sparse SV)

Results: 82% memory reduction, 0.99x baseline speed, NIAH to 300K.

aivrar · 2026-05-11T20:35:50Z

Hi @user-23xyz / Sabowsla — first off, I owe you a real apology. This PR has been sitting here for a month, and that's entirely on me. I don't get GitHub email notifications, and this is honestly the first PR anyone has ever opened on one of my repos, so I just didn't see it. I'm sorry for the silence — that's not the welcome a contribution of this quality deserves.

I looked through the code today. The Metal kernels are excellent: real mx.fast.metal_kernel MSL with threadgroup shared memory, half-precision QK with float accumulation, the fused Givens unrotate-in-place trick, and sparse SV with early exit. The calibration approach (per-head entropy → per-head budget) is a nice angle too. The tests skip cleanly on non-Apple platforms, so they won't break my CI. Keeping it compatible with the existing PlanarQuant/IsoQuant rotations and codebooks instead of forking the API is exactly the right call.
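One way to picture the "per-head entropy → per-head budget" idea is to treat exp(entropy) as the effective number of tokens a head attends to and scale that into a token budget. The thread does not describe the actual mapping in calibration.py, so the formula, the `margin` parameter, and both function names below are assumptions.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one head's attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def head_budget(probs, margin=2.0, min_budget=1):
    """Hypothetical budget: a multiple of the effective token count, clamped to the context."""
    effective_tokens = math.exp(attention_entropy(probs))
    budget = math.ceil(margin * effective_tokens)
    return max(min_budget, min(len(probs), budget))

peaked = [1.0] + [0.0] * 7      # a head that attends to a single token
uniform = [1.0 / 8] * 8         # a head that spreads attention evenly
```

Under this sketch a sharply peaked head gets a small budget while a diffuse head keeps the full context, which is the kind of per-head difference a sparse SV kernel could exploit.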

One honest constraint on my side: I only have an RTX 3090 (Windows). I have no Apple Silicon hardware, so I cannot independently verify the 82% KV reduction or 0.99× decode speed numbers, and I won't be able to debug Metal-specific issues for users. With that in mind, here's what I'd like to do:

- Merge this PR into main.
- Add a UserWarning in multi_turboquant/kernels/metal/__init__.py when imported on Apple Silicon, noting it's a community-contributed path and that bug reports should ping you. I'd word it as "experimental / community-maintained," not as a warning against using it.
- Ask a favor (no pressure, totally fine to say no): would you be willing to add a small GitHub Actions workflow using macos-14 runners (free for public repos) that just runs pytest tests/test_metal_fused.py? That would let me merge future Metal changes with actual signal from CI rather than just my eyeballs.
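The import-time warning from the second item could look roughly like the sketch below. The message text, the function names, and the platform check are assumptions; the real `__init__.py` may word and gate it differently.

```python
import platform
import warnings

def is_apple_silicon():
    """True on macOS running on an arm64 CPU (assumed detection rule)."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"

def warn_if_community_path():
    """Emit an informational UserWarning on Apple Silicon; return whether it fired."""
    if is_apple_silicon():
        warnings.warn(
            "The Metal kernels are experimental / community-maintained. "
            "Please tag the contributor on Metal/MLX bug reports.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_community_path()` at the top of a package's `__init__.py` surfaces the note on import, and users who prefer silence can filter it with the standard `warnings.filterwarnings` mechanism.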

I'll also link your user-23xyz/forgeattention repo from the README and credit you in the next release notes.

Again — really sorry for the wait. This is genuinely great work and I'm glad to have it in the project.

Follow-up to PR #1 (Metal fused attention kernels).

- The merged PR landed kernel code, but the metal subpackage docstring still claimed "not used directly from Python"; corrected.
- Adds an informational UserWarning on `import multi_turboquant.kernels.metal` so users know they are on the community-maintained path.
- Credits the contributor and links the sibling forgeattention project in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

aivrar · 2026-05-11T20:41:38Z

Merged in 57b2010. Follow-up commit 1b0361c lands:

- Updated the multi_turboquant/kernels/metal/__init__.py docstring (the old one still said "not used directly from Python"; fixed) and added an informational UserWarning on import noting the path is community-maintained and tagging you for Metal/MLX issues.
- Added a Community Contributors section to the README crediting you and linking user-23xyz/forgeattention as the sibling project.
- Updated the Platform Support row for macOS to mention the fused MLX kernels.

Existing test suite confirmed clean — 81 passed, 2 skipped (your two Metal tests cleanly skip on non-Apple, as designed).
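The "skipped on non-Apple" behaviour can be reproduced with a platform guard on the test class. The PR's tests run under pytest, where `pytest.mark.skipif` plays this role; the sketch below uses the standard-library `unittest` equivalent so it is self-contained, and the class name, test body, and detection rule are placeholders rather than the contents of tests/test_metal_fused.py.

```python
import platform
import unittest

# Assumed detection rule: macOS on an arm64 CPU.
ON_APPLE_SILICON = platform.system() == "Darwin" and platform.machine() == "arm64"

@unittest.skipUnless(ON_APPLE_SILICON, "Metal kernels need Apple Silicon")
class MetalFusedTests(unittest.TestCase):
    def test_placeholder(self):
        # A real test would compare fused kernel output against a reference.
        self.assertTrue(ON_APPLE_SILICON)

def run_suite():
    """Run the guarded tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MetalFusedTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a Windows or Linux machine the run reports the test as skipped rather than failed, so the overall suite still counts as clean.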

If you ever want to add the GH Actions workflow with macos-14 runners I mentioned, just open another PR — I'll be watching this time. Thanks again for the contribution.

aivrar merged commit 57b2010 into aivrar:main May 11, 2026

aivrar merged 1 commit into aivrar:main from user-23xyz:feat/forgeattention-fused-metal
