Support for Poolside LagunaXS open source coding model in nvfp4 #1334

Open
tzachicohen wants to merge 1 commit into ml-explore:main from tzachicohen:poolside/laguna-xs-support

@tzachicohen tzachicohen commented May 31, 2026

This patch adds support for the LagunaXS model using Poolside's native nvfp4-quantized checkpoint (https://huggingface.co/poolside/Laguna-XS.2-NVFP4).
The model definition is built from standard primitives.
Poolside's checkpoint defines two scaling factors: a global, per-tensor factor in fp32 and a local factor for each 16-element tile.
Since MLX supports only a single scale factor, the two are coalesced into one bf16 scale per tile.
The patch was also tested in integration with "vllm-metal" to enable seamless serving of LagunaXS with vLLM on Apple silicon Macs running macOS.
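The scale coalescing described above can be sketched in plain Python. This is an illustration only, not the code in the patch: the helper names (`to_bf16`, `coalesce_scales`, `dequantize`) are hypothetical, and it assumes the global factor is applied multiplicatively. If the checkpoint stores the reciprocal of the global scale, the fold would divide instead.

```python
import struct

TILE = 16  # elements sharing one local scale, per the description above


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def coalesce_scales(local_scales, global_scale):
    """Fold the per-tensor scale into every per-tile scale (hypothetical helper)."""
    return [to_bf16(s * global_scale) for s in local_scales]


def dequantize(codes, scales, tile=TILE):
    """Dequantize with a single scale per tile."""
    return [c * scales[i // tile] for i, c in enumerate(codes)]


def dequantize_two_level(codes, local_scales, global_scale, tile=TILE):
    """Reference dequantization with separate local and global scales."""
    return [c * local_scales[i // tile] * global_scale for i, c in enumerate(codes)]


if __name__ == "__main__":
    # Toy data: 32 values drawn from the FP4 (E2M1) magnitudes, two tiles.
    codes = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 0.0] * 4
    local_scales = [0.5, 2.0]
    global_scale = 0.25

    combined = coalesce_scales(local_scales, global_scale)
    same = dequantize(codes, combined) == dequantize_two_level(
        codes, local_scales, global_scale
    )
    print("combined scales:", combined, "matches two-level:", same)
```

The toy scales are powers of two, so the bf16 rounding is exact here. With arbitrary scales, the product of the two factors generally needs more mantissa bits than bf16 keeps, so the coalesced result can differ slightly from the two-level reference.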

@tzachicohen tzachicohen changed the title Support for Poolside LagunaXS checkpoint in nvfp4 Support for Poolside LagunaXS open source coding model in nvfp4 Jun 1, 2026
@nastya236 nastya236 self-assigned this Jun 3, 2026
@nastya236 nastya236 self-requested a review June 3, 2026 14:22
@nastya236 nastya236 removed their assignment Jun 3, 2026
Collaborator

@nastya236 nastya236 left a comment

Thank you for your contribution.

If I understood everything correctly, you requantize the weights into nvfp4 format, which collapses the per‑tensor fp32 global scale because Metal qqmm doesn't accept it yet. Block‑scaled quantization in MLX is actively evolving and we are planning to add global‑scale support. For now I'd strongly suggest the following: 1) requantize once offline using _dequantize_compressed_tensors + mx.quantize(..., mode="nvfp4") and store the converted weights, 2) remove the format‑conversion code from sanitize leaving only the MoE expert stacking.

Happy to iterate on this with you.
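As a rough illustration of the second suggestion, a sanitize step reduced to MoE expert stacking might look like the sketch below. The key layout (`{prefix}.experts.{e}.{name}.{suffix}` mapped to `{prefix}.switch_mlp.{name}.{suffix}`), the function name `stack_experts`, and its parameters are assumptions made for this example, not taken from the patch. The `stack` argument defaults to `list` so the sketch runs without MLX; with real arrays one would pass `mx.stack`.

```python
def stack_experts(weights, prefix, num_experts, names,
                  suffixes=("weight",), stack=list):
    """Replace per-expert tensors with one stacked tensor per projection.

    Hypothetical key layout:
      {prefix}.experts.{e}.{name}.{suffix} -> {prefix}.switch_mlp.{name}.{suffix}
    """
    for name in names:
        for suffix in suffixes:
            first = f"{prefix}.experts.0.{name}.{suffix}"
            if first not in weights:
                continue  # already stacked, or this projection is absent
            per_expert = [
                weights.pop(f"{prefix}.experts.{e}.{name}.{suffix}")
                for e in range(num_experts)
            ]
            weights[f"{prefix}.switch_mlp.{name}.{suffix}"] = stack(per_expert)
    return weights


if __name__ == "__main__":
    # Toy checkpoint: three experts with one projection each, plus a router.
    prefix = "model.layers.0.mlp"
    weights = {f"{prefix}.experts.{e}.up_proj.weight": [[e]] for e in range(3)}
    weights[f"{prefix}.gate.weight"] = "router"

    stacked = stack_experts(weights, prefix, 3, ["up_proj", "down_proj"])
    print(sorted(stacked))
```

Because the converted checkpoint would already hold nvfp4 weights and scales, a function of this shape only renames and stacks tensors. Passing `suffixes=("weight", "scales")` would cover the quantized scales too, assuming they are stored under that suffix.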

Comment thread tests/test_models.py
self.assertTrue(mx.allclose(y, y_gt, rtol=1e-4, atol=1e-4))
self.assertTrue(mx.allclose(st, st_gt, rtol=1e-4, atol=1e-3))

def test_laguna(self):
Collaborator

Is there a reason I'm missing for the dedicated test_laguna method? I think that your testing can be fully covered by adding laguna to the pool (test_all_models). If you're planning to add laguna-specific assertions, then keeping the dedicated method makes sense, otherwise I'd move it into test_all_models.
