feat: Metal fused attention kernels (ForgeAttention) #1

Merged
aivrar merged 1 commit into aivrar:main from user-23xyz:feat/forgeattention-fused-metal
May 11, 2026

@user-23xyz
Contributor

Summary

Fills the empty kernels/metal/ directory with Metal kernels for Apple Silicon that fuse 3-bit KV dequantization into the attention dot product.

Instead of the standard Metal path (decompress, then SDPA), these kernels read packed 3-bit data directly inside the QK dot product and the SV accumulation, so the FP16 intermediate is never materialized.
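To make the fused idea concrete, here is a minimal pure-Python sketch, not the Metal kernel itself. It assumes a little-endian 3-bit packing, an 8-entry codebook, and a single per-vector scale. All three are illustrative assumptions, as are the function names, and the rotation step is omitted. The point is that each key element is decoded and consumed inside the dot-product loop, so a dequantized K vector is never built.

```python
from typing import Sequence


def pack_3bit(codes: Sequence[int]) -> bytes:
    """Pack integers in [0, 7] into a little-endian bitstream, 3 bits each.

    The bit layout is an assumption made for this sketch.
    """
    acc = 0
    for i, code in enumerate(codes):
        if not 0 <= code <= 7:
            raise ValueError("3-bit codes must lie in [0, 7]")
        acc |= code << (3 * i)
    return acc.to_bytes((3 * len(codes) + 7) // 8, "little")


def read_code(packed: bytes, i: int) -> int:
    """Extract the i-th 3-bit code without unpacking the whole buffer."""
    byte, shift = divmod(3 * i, 8)
    lo = packed[byte]
    # A code can straddle a byte boundary, so read a 16-bit window.
    hi = packed[byte + 1] if byte + 1 < len(packed) else 0
    return ((lo | (hi << 8)) >> shift) & 0b111


def fused_qk_score(
    q: Sequence[float],
    packed_k: bytes,
    codebook: Sequence[float],
    scale: float,
) -> float:
    """Dot product of q with a packed key, decoding one element at a time."""
    acc = 0.0
    for i, q_i in enumerate(q):
        # Decode and multiply in the same step: no dequantized K is stored.
        acc += q_i * codebook[read_code(packed_k, i)]
    # The scale is constant per vector here, so it is applied once at the end.
    return acc * scale
```

Calling `fused_qk_score(q, pack_3bit(codes), codebook, scale)` gives the same value, up to floating-point rounding, as first expanding the key to `[codebook[c] * scale for c in codes]` and then taking the dot product with `q`.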

What's added

  • kernels/metal/fused_attention.py — 6 Metal kernels via mx.fast.metal_kernel
    • Fused QK (scores from packed K, FP16 math)
    • Tiled SV (256-token tiles)
    • Flash decode (single-pass QK+softmax+SV)
    • Sparse SV (skips zero-prob tokens)
    • Phase 1/2 fused sparse (tile-level early exit)
  • kernels/metal/calibration.py — per-head budget calibration
  • tests/test_metal_fused.py — 2 tests
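
The flash decode kernel in the list above is described as single-pass QK+softmax+SV. A common way to get a single pass is the online-softmax recurrence, sketched below in plain Python. Whether the Metal kernel uses exactly this recurrence is an assumption, and `flash_decode` here is a hypothetical reference function that takes precomputed scores rather than packed keys.

```python
import math
from typing import List, Sequence


def flash_decode(
    scores: Sequence[float],
    values: Sequence[Sequence[float]],
) -> List[float]:
    """Compute softmax(scores) @ values in one pass over the tokens."""
    if not scores or len(scores) != len(values):
        raise ValueError("scores and values must be non-empty and equal length")
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        if new_max == -math.inf:
            continue  # every token so far is masked out
        # Rescale what was accumulated under the old maximum.
        # math.exp(-inf) is 0.0, which covers the first unmasked token.
        rescale = math.exp(running_max - new_max)
        p = math.exp(s - new_max)
        denom = denom * rescale + p
        acc = [a * rescale + p * v_j for a, v_j in zip(acc, v)]
        running_max = new_max
    if denom == 0.0:
        raise ValueError("all scores are -inf")
    return [a / denom for a in acc]
```

The result matches a two-pass softmax followed by a weighted sum of the value rows, without ever storing the full probability vector.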

Results (M4 Mini 16GB)

  • 82% per-layer KV memory reduction
  • 0.99x baseline decode speed
  • NIAH to 300K context
  • Compatible with PlanarQuant/IsoQuant rotation (swappable constants)

Code: github.com/user-23xyz/forgeattention

Fills the empty kernels/metal/ directory with fused 3-bit KV
dequantization inside the attention dot product for Apple Silicon.

The packed 3-bit values are read, rotated, and dot-producted entirely
in threadgroup shared memory — no FP16 intermediate in device memory.

New files:
- kernels/metal/fused_attention.py: 6 Metal kernels via mx.fast.metal_kernel
- kernels/metal/calibration.py: per-head budget calibration
- tests/test_metal_fused.py: 2 tests (fused QK correctness, sparse SV)

Results: 82% memory reduction, 0.99x baseline speed, NIAH to 300K.
@aivrar
Owner

aivrar commented May 11, 2026

Hi @user-23xyz / Sabowsla — first off, I owe you a real apology. This PR has been sitting here for a month, and that's entirely on me. I don't get GitHub email notifications, and this is honestly the first PR anyone has ever opened on one of my repos, so I just didn't see it. I'm sorry for the silence — that's not the welcome a contribution of this quality deserves.

I looked through the code today. The Metal kernels are excellent — real mlx.fast.metal_kernel MSL with threadgroup shared memory, half-precision QK with float accumulation, the fused Givens unrotate-in-place trick, sparse SV with early exit. The calibration approach (per-head entropy → per-head budget) is a nice angle too. The tests cleanly skip on non-Apple platforms so they won't break my CI. The fact that you kept it compatible with the existing PlanarQuant/IsoQuant rotations and codebooks instead of forking the API is exactly the right call.
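
For readers unfamiliar with the entropy-to-budget idea mentioned above, here is a rough sketch of one way it could work. It is not taken from kernels/metal/calibration.py. The linear mapping, the `min_budget` and `max_budget` parameters, and both function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def attention_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one head's attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def head_budget(probs: Sequence[float], min_budget: int, max_budget: int) -> int:
    """Map a head's entropy to a token budget (hypothetical linear rule).

    Sharply peaked heads (low entropy) get a budget near min_budget.
    Diffuse heads (high entropy) get a budget near max_budget.
    """
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 0.0
    if max_entropy == 0.0:
        return min_budget
    # Clamp to guard against rounding slightly above 1.0.
    frac = min(1.0, attention_entropy(probs) / max_entropy)
    return min_budget + round(frac * (max_budget - min_budget))
```

With this rule a one-hot head receives `min_budget` and a uniform head receives `max_budget`. A real calibration pass would presumably average over many prompts before fixing a budget per head.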

One honest constraint on my side: I only have an RTX 3090 (Windows). I have no Apple Silicon hardware, so I cannot independently verify the 82% KV reduction or 0.99× decode speed numbers, and I won't be able to debug Metal-specific issues for users. With that in mind, here's what I'd like to do:

  1. Merge this PR into main.
  2. Add a UserWarning in multi_turboquant/kernels/metal/__init__.py when imported on Apple Silicon, noting it's a community-contributed path and that bug reports should ping you. I'd word it as "experimental / community-maintained," not as a warning against using it.
  3. Ask a favor (no pressure, totally fine to say no): would you be willing to add a small GitHub Actions workflow using macos-14 runners (free for public repos) that just runs pytest tests/test_metal_fused.py? That would let me merge future Metal changes with actual signal from CI rather than just my eyeballs.
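
As a rough sketch of item 2, the warning could look something like the following. The message text and the helper names are hypothetical, and the Apple Silicon check (`Darwin` plus `arm64`) is one common heuristic rather than what the project necessarily uses. In a real `__init__.py` the call would run at import time.

```python
import platform
import warnings
from typing import Optional


def is_apple_silicon(
    system: Optional[str] = None,
    machine: Optional[str] = None,
) -> bool:
    """Heuristic platform check. Arguments exist only to make it testable."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


def warn_if_community_path(
    system: Optional[str] = None,
    machine: Optional[str] = None,
) -> bool:
    """Emit an informational UserWarning on Apple Silicon. Returns True if warned."""
    if not is_apple_silicon(system, machine):
        return False
    warnings.warn(
        "The Metal kernels are experimental and community-maintained. "
        "Please tag the contributor in Metal/MLX bug reports.",
        UserWarning,
        stacklevel=2,
    )
    return True
```

Users who do not want to see the message can silence it with the standard `warnings.filterwarnings("ignore", category=UserWarning)` call, ideally narrowed with the `module` argument.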

I'll also link your user-23xyz/forgeattention repo from the README and credit you in the next release notes.

Again — really sorry for the wait. This is genuinely great work and I'm glad to have it in the project.

@aivrar aivrar merged commit 57b2010 into aivrar:main May 11, 2026
aivrar added a commit that referenced this pull request May 11, 2026
Follow-up to PR #1 (Metal fused attention kernels). The merged PR
landed kernel code but the metal subpackage docstring still claimed
"not used directly from Python" — corrected. Adds an informational
UserWarning on `import multi_turboquant.kernels.metal` so users know
they are on the community-maintained path. Credits contributor and
links the sibling forgeattention project in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@aivrar
Owner

aivrar commented May 11, 2026

Merged in 57b2010. Follow-up commit 1b0361c lands:

  • Updated multi_turboquant/kernels/metal/__init__.py docstring (the old one still said "not used directly from Python" — fixed) and added an informational UserWarning on import noting the path is community-maintained and tagging you for Metal/MLX issues.
  • Added a Community Contributors section to the README crediting you and linking user-23xyz/forgeattention as the sibling project.
  • Updated the Platform Support row for macOS to mention the fused MLX kernels.

Existing test suite confirmed clean — 81 passed, 2 skipped (your two Metal tests cleanly skip on non-Apple, as designed).
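
The skip behaviour described above can be illustrated with a small standard-library example. The project's tests use pytest, so this `unittest` version is only an analogue, and the class name, skip condition, and `run_suite` helper are all hypothetical.

```python
import importlib.util
import io
import platform
import unittest

ON_APPLE_SILICON = platform.system() == "Darwin" and platform.machine() == "arm64"
HAS_MLX = importlib.util.find_spec("mlx") is not None


@unittest.skipUnless(
    ON_APPLE_SILICON and HAS_MLX,
    "requires Apple Silicon with MLX installed",
)
class MetalSmokeTest(unittest.TestCase):
    def test_mlx_imports(self) -> None:
        # Only reached on Apple Silicon with MLX available.
        import mlx.core as mx

        self.assertIsNotNone(mx)


def run_suite() -> unittest.TestResult:
    """Run the smoke test quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MetalSmokeTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

On a Windows or Linux machine `run_suite()` reports the test as skipped rather than failed, which is the property that keeps a non-Apple CI run green.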

If you ever want to add the GH Actions workflow with macos-14 runners I mentioned, just open another PR — I'll be watching this time. Thanks again for the contribution.
