Fix nemotron_h MoEGate breaking load with per-path quantization #1282
Open
YBJ0000 wants to merge 1 commit into
MoEGate stores its router weights as a raw mx.zeros tensor and does not subclass nn.Linear, so it has no to_quantized() method. When a model's quantization config lists per-layer entries for the gate paths, the class_predicate in load_model() returns a truthy params dict for those paths and nn.quantize() then raises "Unable to quantize model of type MoEGate". Add a no-op to_quantized() so the gate can be referenced by per-path quantization configs while staying unquantized; its weights are tiny and path-dependent in the routing computation.
Summary
Loading a pre-quantized nemotron_h MoE model fails with Unable to quantize model of type <class 'mlx_lm.models.nemotron_h.MoEGate'> when the model's quantization config includes per-layer entries for the mixer.gate paths.
Closes #1266.
Root cause
MoEGate stores its router weights as a raw mx.zeros tensor and does not subclass nn.Linear, so it has no to_quantized() method.
In load_model() (mlx_lm/utils.py), class_predicate returns the per-path params dict, which is truthy, whenever a path is present in config["quantization"], and only afterwards falls through to a hasattr check.
For a gate path the first branch returns before the hasattr check, so nn.quantize() proceeds to call m.to_quantized(**params) and raises because MoEGate has no such method.
This stays dormant for standard mlx-lm quantization (gates are left out of the config dict) but surfaces for any pre-quantized config that explicitly lists mixer.gate paths.
Fix
Add a no-op to_quantized() to MoEGate. The gate can now be referenced by per-path quantization configs while staying unquantized: its router weights are tiny and path-dependent in the routing computation, so leaving them in full precision is the expected behavior.
Test
Added test_nemotron_h_moe_gate_quantization in tests/test_models.py: it builds a tiny nemotron_h model with an MoE block and runs nn.quantize() with a class_predicate that returns a params dict for the gate path (mirroring load_model()'s behavior). The test fails before the fix with the exact ValueError from the issue and passes after.
Verified locally with python -m unittest for the new test (against mlx 0.29.3, since nn.quantize's to_quantized contract is unchanged across versions); CI will exercise it on a supported Python.
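For readers who want to see the interaction without installing MLX, the failure and the fix can be modeled roughly as follows. This is only a pure-Python sketch under stated assumptions: Linear, QuantizedLinear, GateWithoutFix, GateWithFix, make_predicate, quantize, and the two layer paths are hypothetical stand-ins written for illustration. They are not the mlx or mlx-lm implementations, and the real predicate and nn.quantize() do more than is shown here.

```python
# Hypothetical stand-ins only; none of this is mlx or mlx-lm code.

class QuantizedLinear:
    """Stand-in for the result of quantizing a layer."""
    def __init__(self, group_size, bits):
        self.group_size = group_size
        self.bits = bits


class Linear:
    """Stand-in for a layer that supports quantization."""
    def to_quantized(self, group_size=64, bits=4, **kwargs):
        return QuantizedLinear(group_size, bits)


class GateWithoutFix:
    """Stand-in for MoEGate before the change: no to_quantized()."""


class GateWithFix:
    """Stand-in for MoEGate after the change."""
    def to_quantized(self, **kwargs):
        # No-op: accept any quantization params and stay unquantized.
        return self


def make_predicate(quant_config):
    """Simplified model of the predicate described under Root cause."""
    def class_predicate(path, module):
        # Per-path entries win, and this branch returns before hasattr.
        if path in quant_config:
            return quant_config[path]
        return hasattr(module, "to_quantized")
    return class_predicate


def quantize(modules, class_predicate):
    """Simplified model of a quantize pass over a {path: module} dict."""
    out = {}
    for path, module in modules.items():
        params = class_predicate(path, module)
        if not params:
            out[path] = module  # predicate declined: leave untouched
        elif hasattr(module, "to_quantized"):
            kwargs = params if isinstance(params, dict) else {}
            out[path] = module.to_quantized(**kwargs)
        else:
            raise ValueError(f"Unable to quantize model of type {type(module)}")
    return out


GATE_PATH = "layers.0.mixer.gate"  # hypothetical path
PROJ_PATH = "layers.0.mixer.proj"  # hypothetical path
predicate = make_predicate({GATE_PATH: {"group_size": 64, "bits": 8}})

# Before the fix: the config lists the gate path, so quantize() raises.
try:
    quantize({GATE_PATH: GateWithoutFix(), PROJ_PATH: Linear()}, predicate)
except ValueError as err:
    print("before fix:", err)

# After the fix: the gate is returned unchanged, other layers are quantized.
gate = GateWithFix()
result = quantize({GATE_PATH: gate, PROJ_PATH: Linear()}, predicate)
print("gate unchanged:", result[GATE_PATH] is gate)
```

In this model, the config entry for the gate path is what forces the quantize pass to attempt the gate. Returning self from to_quantized() satisfies the pass without changing the gate's weights.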