Support for Poolside LagunaXS open source coding model in nvfp4 #1334
tzachicohen wants to merge 1 commit into
nastya236
left a comment
Thank you for your contribution.
If I understood everything correctly, you requantize the weights into nvfp4 format, which folds the per-tensor fp32 global scale into the block scales because Metal qqmm doesn't accept a global scale yet. Block-scaled quantization in MLX is actively evolving and we are planning to add global-scale support. For now I'd strongly suggest the following:
1. Requantize once offline using _dequantize_compressed_tensors + mx.quantize(..., mode="nvfp4") and store the converted weights.
2. Remove the format-conversion code from sanitize, leaving only the MoE expert stacking.
Happy to iterate on this with you.
        self.assertTrue(mx.allclose(y, y_gt, rtol=1e-4, atol=1e-4))
        self.assertTrue(mx.allclose(st, st_gt, rtol=1e-4, atol=1e-3))

    def test_laguna(self):
Is there a reason I'm missing for the dedicated test_laguna method? I think your testing can be fully covered by adding laguna to the pool in test_all_models. If you're planning to add laguna-specific assertions, then keeping the dedicated method makes sense; otherwise I'd move it into test_all_models.
This patch introduces support for the LagunaXS model with Poolside's native nvfp4 quantized checkpoint (https://huggingface.co/poolside/Laguna-XS.2-NVFP4).
The model definition is built from standard primitives.
Poolside's checkpoint defines two scaling factors: a global, per-tensor scale in fp32 and a local scale per 16-element tile.
Since MLX supports only a single scale factor, the two are coalesced into one bf16 scale per tile.
This patch was also tested in integration with "vllm-metal", to enable seamless serving of LagunaXS with vLLM on macOS on Apple silicon.
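As a rough illustration of the scale coalescing described above, here is a minimal pure-Python sketch. It is not the code from this patch. It assumes dequantization has the form `w = code * local_scale * global_scale` (the checkpoint's actual convention may differ, for example by dividing by the global scale), and the helper names `to_bf16`, `coalesce_scales`, and `dequantize` are hypothetical.

```python
import struct

TILE = 16  # elements per local scale, as stated in the description


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that masking the low 16 bits rounds to nearest even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def coalesce_scales(global_scale: float, local_scales: list[float]) -> list[float]:
    """Fold the per-tensor scale into each per-tile scale (assumed multiplicative)."""
    return [to_bf16(global_scale * s) for s in local_scales]


def dequantize(codes: list[float], scales: list[float]) -> list[float]:
    """Apply one scale per TILE consecutive elements."""
    return [c * scales[i // TILE] for i, c in enumerate(codes)]


# Toy data: 32 already-decoded 4-bit values (two tiles) and power-of-two
# scales, chosen so that the bf16 rounding is exact in this example.
codes = [0.5, 1.0, -1.5, 6.0] * 8
local_scales = [0.25, 2.0]
global_scale = 0.125

merged = coalesce_scales(global_scale, local_scales)
two_level = [c * local_scales[i // TILE] * global_scale for i, c in enumerate(codes)]
one_level = dequantize(codes, merged)
```

With scales that are not exactly representable in bf16, `one_level` would differ slightly from `two_level`, which is the precision trade-off raised in the review comment about collapsing the global scale.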