Fix nemotron_h MoEGate breaking load with per-path quantization #1282

Open
YBJ0000 wants to merge 1 commit into ml-explore:main from YBJ0000:fix/nemotron-h-moe-gate-quantization
YBJ0000 commented May 18, 2026

Summary

Loading a pre-quantized nemotron_h MoE model fails with Unable to quantize model of type <class 'mlx_lm.models.nemotron_h.MoEGate'> when the model's quantization config includes per-layer entries for the mixer.gate paths.

Closes #1266.

Root cause

MoEGate stores its router weights as a raw mx.zeros tensor and does not subclass nn.Linear, so it has no to_quantized() method.

In load_model() (mlx_lm/utils.py), class_predicate returns a truthy params dict whenever a path is present in config["quantization"]:

def class_predicate(p, m):
    if p in config["quantization"]:
        return config["quantization"][p]   # truthy for gate paths
    if not hasattr(m, "to_quantized"):
        return False
    return f"{p}.scales" in weights

For a gate path, the first branch returns before the hasattr check is reached, so nn.quantize() treats the module as selected for quantization and raises a ValueError because MoEGate has no to_quantized() method.

This stays dormant for standard mlx-lm quantization (gates are left out of the config dict) but surfaces for any pre-quantized config that explicitly lists mixer.gate paths.
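The branch ordering can be reproduced without mlx installed. The sketch below is a simplified model, not the actual mlx or mlx-lm code: `FakeGate`, `toy_quantize`, and the gate path string are hypothetical stand-ins, and only the predicate body mirrors the snippet quoted above. It shows the load succeeding when the gate path is absent from the config and failing when it is listed.

```python
# Simplified stand-ins for illustration; these are NOT the real mlx classes.

class FakeGate:
    """Stand-in for MoEGate before the fix: it has no to_quantized()."""


def make_predicate(config, weights):
    # Same branch order as the class_predicate quoted above.
    def class_predicate(p, m):
        if p in config["quantization"]:
            return config["quantization"][p]  # returns before the hasattr check
        if not hasattr(m, "to_quantized"):
            return False
        return f"{p}.scales" in weights

    return class_predicate


def toy_quantize(modules, class_predicate):
    """Toy model of a quantize walk over a {path: module} dict (assumed behavior)."""
    out = {}
    for path, m in modules.items():
        params = class_predicate(path, m)
        if not params:
            out[path] = m  # predicate said "skip": leave the module as-is
        elif not hasattr(m, "to_quantized"):
            raise ValueError(f"Unable to quantize model of type {type(m)}")
        else:
            kwargs = params if isinstance(params, dict) else {}
            out[path] = m.to_quantized(**kwargs)
    return out


GATE = "backbone.layers.0.mixer.gate"  # hypothetical path, for illustration only
modules = {GATE: FakeGate()}


def try_load(config):
    try:
        toy_quantize(modules, make_predicate(config, weights={}))
        return "ok"
    except ValueError as e:
        return str(e)


# Gate path listed explicitly: the first branch returns a truthy dict.
listed = try_load({"quantization": {GATE: {"group_size": 64, "bits": 8}}})
# Gate path left out: the hasattr check returns False and the gate is skipped.
unlisted = try_load({"quantization": {}})
```

Here `unlisted` corresponds to the dormant case and `listed` to the pre-quantized config that triggers the error.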

Fix

Add a no-op to_quantized() to MoEGate. The gate can now be referenced by per-path quantization configs while staying unquantized. Its router weights are tiny and feed directly into the routing computation, so leaving them in full precision is the expected behavior.
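A minimal sketch of what a no-op to_quantized() could look like, assuming the method accepts whatever quantization parameters it is handed and returns the module unchanged. `GateSketch` is a hypothetical plain-Python stand-in, not the MoEGate class from the diff, and the exact signature used in the PR may differ.

```python
class GateSketch:
    """Hypothetical stand-in for a router gate holding a small weight matrix."""

    def __init__(self, n_experts, hidden_size):
        # Plain nested lists stand in for the raw weight tensor.
        self.weight = [[0.0] * hidden_size for _ in range(n_experts)]

    def to_quantized(self, group_size=None, bits=None, **kwargs):
        # No-op: ignore every quantization parameter and hand back the same
        # object, so the weights stay in full precision.
        return self


gate = GateSketch(n_experts=4, hidden_size=8)
same = gate.to_quantized(group_size=64, bits=4, mode="affine")
```

Returning the same object means a caller that replaces each module with the result of to_quantized() ends up with the original gate in place.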

Test

Added test_nemotron_h_moe_gate_quantization in tests/test_models.py: it builds a tiny nemotron_h model with an MoE block and runs nn.quantize() with a class_predicate that returns a params dict for the gate path (mirroring load_model()'s behavior). The test fails before the fix with the exact ValueError from the issue and passes after.
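The shape of such a regression test can be sketched with the standard library alone. This is not the test added in tests/test_models.py: `toy_quantize`, `GateBeforeFix`, `GateAfterFix`, and the gate path are hypothetical stand-ins, assumed only to show the before/after behavior that a predicate returning a params dict would exercise.

```python
import unittest

GATE_PATH = "backbone.layers.0.mixer.gate"  # hypothetical path


def toy_quantize(modules, class_predicate):
    """Toy quantize walk over {path: module}; a stand-in, not mlx's nn.quantize."""
    out = {}
    for path, m in modules.items():
        params = class_predicate(path, m)
        if not params:
            out[path] = m
        elif hasattr(m, "to_quantized"):
            out[path] = m.to_quantized(**params)
        else:
            raise ValueError(f"Unable to quantize model of type {type(m)}")
    return out


def predicate(path, module):
    # Returns a params dict for the gate path, as a per-path config would.
    return {"group_size": 64, "bits": 4} if path == GATE_PATH else False


class GateBeforeFix:
    pass


class GateAfterFix:
    def to_quantized(self, **kwargs):
        return self  # no-op


class TestGateQuantization(unittest.TestCase):
    def test_fails_without_to_quantized(self):
        with self.assertRaisesRegex(ValueError, "Unable to quantize"):
            toy_quantize({GATE_PATH: GateBeforeFix()}, predicate)

    def test_passes_with_noop_to_quantized(self):
        gate = GateAfterFix()
        out = toy_quantize({GATE_PATH: gate}, predicate)
        self.assertIs(out[GATE_PATH], gate)  # gate is left unquantized


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGateQuantization)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` runs both cases and returns a result object whose `wasSuccessful()` reports whether they passed.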

Verified locally with python -m unittest for the new test (against mlx 0.29.3, since nn.quantize's to_quantized contract is unchanged across versions); CI will exercise it on a supported Python.

MoEGate stores its router weights as a raw mx.zeros tensor and does not
subclass nn.Linear, so it has no to_quantized() method. When a model's
quantization config lists per-layer entries for the gate paths, the
class_predicate in load_model() returns a truthy params dict for those
paths and nn.quantize() then raises "Unable to quantize model of type
MoEGate".

Add a no-op to_quantized() so the gate can be referenced by per-path
quantization configs while staying unquantized; its weights are tiny
and feed directly into the routing computation.
Development

Successfully merging this pull request may close these issues.

MoEGate.to_quantized() missing in nemotron_h.py causes load failure when quantization config includes gate paths
