feat: Metal fused attention kernels (ForgeAttention) #1
Fills the empty kernels/metal/ directory with fused 3-bit KV dequantization inside the attention dot product for Apple Silicon. The packed 3-bit values are read, rotated, and dot-producted entirely in threadgroup shared memory, with no FP16 intermediate in device memory.

New files:
- kernels/metal/fused_attention.py: 6 Metal kernels via mx.fast.metal_kernel
- kernels/metal/calibration.py: per-head budget calibration
- tests/test_metal_fused.py: 2 tests (fused QK correctness, sparse SV)

Results: 82% memory reduction, 0.99x baseline speed, NIAH to 300K.
Hi @user-23xyz / Sabowsla, first off, I owe you a real apology. This PR has been sitting here for a month, and that's entirely on me. I don't get GitHub email notifications, and this is honestly the first PR anyone has ever opened on one of my repos, so I just didn't see it. I'm sorry for the silence; that's not the welcome a contribution of this quality deserves.

I looked through the code today. The Metal kernels are excellent: real […]

One honest constraint on my side: I only have an RTX 3090 (Windows). I have no Apple Silicon hardware, so I cannot independently verify the 82% KV reduction or 0.99× decode speed numbers, and I won't be able to debug Metal-specific issues for users. With that in mind, here's what I'd like to do:
I'll also link your […]

Again, really sorry for the wait. This is genuinely great work and I'm glad to have it in the project.
Follow-up to PR #1 (Metal fused attention kernels).

The merged PR landed the kernel code, but the metal subpackage docstring still claimed "not used directly from Python"; that is now corrected. Adds an informational UserWarning on `import multi_turboquant.kernels.metal` so users know they are on the community-maintained path. Credits the contributor and links the sibling forgeattention project in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
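For readers wondering what an import-time notice of this kind might look like, here is a minimal sketch using only the standard `warnings` module. The helper name `warn_community_maintained` and the message text are hypothetical and chosen for illustration; the follow-up commit's actual wording and structure may differ.

```python
import warnings

# Hypothetical message text; the wording in the actual package may differ.
_COMMUNITY_NOTICE = (
    "{name} is a community-maintained backend and is not verified "
    "by the maintainer on Apple Silicon hardware."
)


def warn_community_maintained(name):
    """Emit an informational UserWarning naming the imported subpackage."""
    # stacklevel=2 attributes the warning to the caller, not this helper.
    warnings.warn(_COMMUNITY_NOTICE.format(name=name), UserWarning, stacklevel=2)


# In a package __init__.py, a call such as
#     warn_community_maintained(__name__)
# would sit at module top level so that it runs once on import.
```

Users who have read the notice could silence it with something like `warnings.filterwarnings("ignore", message=".*community-maintained")` before importing the subpackage.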
Merged in 57b2010. Follow-up commit 1b0361c lands:
Existing test suite confirmed clean: 81 passed, 2 skipped (your two Metal tests cleanly skip on non-Apple, as designed). If you ever want to add the GH Actions workflow with […]
Summary
Fills the empty kernels/metal/ directory with fused 3-bit KV dequantization inside the attention dot product for Apple Silicon.

Instead of decompress → SDPA (the standard Metal path), these kernels read packed 3-bit data directly inside the QK dot product and SV accumulation. The FP16 intermediate never exists.
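To make the difference between the two paths concrete, here is a rough CPU-side sketch in plain Python. It is not the Metal kernel code from this PR. The affine quantization (`scale`, `zero`), the 8-codes-per-3-bytes layout, and all function names are assumptions made for illustration, and the rotation step is omitted.

```python
# Illustrative only: affine 3-bit quantization (scale, zero) and the bit
# layout are assumptions; the rotation step the kernels apply is omitted.

def pack3(codes):
    """Pack 3-bit codes (ints in 0..7) into bytes: 8 codes per 3 bytes."""
    if len(codes) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= (code & 0x7) << (3 * j)  # code j occupies bits 3j..3j+2
        out += word.to_bytes(3, "little")
    return bytes(out)


def unpack3(packed):
    """Inverse of pack3: yield the 3-bit codes in their original order."""
    for g in range(0, len(packed), 3):
        word = int.from_bytes(packed[g:g + 3], "little")
        for j in range(8):
            yield (word >> (3 * j)) & 0x7


def decompress_then_dot(q, packed, scale, zero):
    """Two-pass path: materialize the dequantized key, then take the dot."""
    k = [scale * code + zero for code in unpack3(packed)]  # full intermediate
    return sum(qi * ki for qi, ki in zip(q, k))


def fused_dot(q, packed, scale, zero):
    """Fused path: dequantize each code inside the accumulation loop."""
    acc = 0.0
    for qi, code in zip(q, unpack3(packed)):
        acc += qi * (scale * code + zero)  # no dequantized key is stored
    return acc


codes = [0, 1, 2, 3, 4, 5, 6, 7, 7, 6, 5, 4, 3, 2, 1, 0]
q = [0.25 * i for i in range(16)]
packed = pack3(codes)
```

In this toy layout the 16 codes occupy 6 bytes, against 32 bytes for the same vector in FP16 (ignoring the scale and zero metadata). Both functions compute the same value; only `decompress_then_dot` builds the intermediate list that the fused approach avoids.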
What's added
- kernels/metal/fused_attention.py: 6 Metal kernels via mx.fast.metal_kernel
- kernels/metal/calibration.py: per-head budget calibration
- tests/test_metal_fused.py: 2 tests

Results (M4 Mini 16GB)
Code: github.com/user-23xyz/forgeattention