Fix nemotron_h MoEGate breaking load with per-path quantization by YBJ0000 · Pull Request #1282 · ml-explore/mlx-lm

YBJ0000 · 2026-05-18T13:20:26Z

Summary

Loading a pre-quantized nemotron_h MoE model fails with "Unable to quantize model of type <class 'mlx_lm.models.nemotron_h.MoEGate'>" when the model's quantization config includes per-layer entries for the mixer.gate paths.

Closes #1266.

Root cause

MoEGate stores its router weights as a raw mx.zeros tensor and does not subclass nn.Linear, so it has no to_quantized() method.

In load_model() (mlx_lm/utils.py), class_predicate returns a truthy params dict whenever a path is present in config["quantization"]:

def class_predicate(p, m):
    if p in config["quantization"]:
        return config["quantization"][p]   # truthy for gate paths
    if not hasattr(m, "to_quantized"):
        return False
    return f"{p}.scales" in weights

For a gate path the first branch returns before the hasattr check is reached, so nn.quantize() treats the gate as a module to quantize and raises the ValueError above, because MoEGate has no to_quantized() method to call with those params.

This stays dormant for standard mlx-lm quantization (gates are left out of the config dict) but surfaces for any pre-quantized config that explicitly lists mixer.gate paths.
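The failure mode can be reproduced without mlx. The following is a minimal sketch, not the real implementation: LinearSketch, GateSketch, quantize_sketch(), and the module paths are hypothetical stand-ins that only imitate the dispatch described above, while class_predicate is copied from the snippet.

```python
# Stand-ins only: these classes and quantize_sketch() are hypothetical and
# merely imitate the dispatch described above. They are not mlx code.

class LinearSketch:
    def to_quantized(self, **params):
        return ("quantized", params)


class GateSketch:
    """Like MoEGate before the fix: no to_quantized() method."""


def make_class_predicate(config, weights):
    def class_predicate(p, m):
        if p in config["quantization"]:
            return config["quantization"][p]  # returns before the hasattr check
        if not hasattr(m, "to_quantized"):
            return False
        return f"{p}.scales" in weights
    return class_predicate


def quantize_sketch(modules, class_predicate):
    """Quantize every module the predicate selects; fail if one cannot be."""
    out = {}
    for path, m in modules.items():
        params = class_predicate(path, m)
        if not params:
            out[path] = m  # left untouched
        elif hasattr(m, "to_quantized"):
            kwargs = params if isinstance(params, dict) else {}
            out[path] = m.to_quantized(**kwargs)
        else:
            raise ValueError(f"Unable to quantize model of type {type(m)}")
    return out


GATE = "layers.0.mixer.gate"  # hypothetical path
modules = {GATE: GateSketch(), "layers.0.mixer.proj": LinearSketch()}
weights = {"layers.0.mixer.proj.scales": None}

# Gate absent from the config: the hasattr branch skips it.
implicit = {"quantization": {}}
quantize_sketch(modules, make_class_predicate(implicit, weights))

# Gate listed explicitly: the first branch selects it and quantization fails.
explicit = {"quantization": {GATE: {"group_size": 64, "bits": 4}}}
try:
    quantize_sketch(modules, make_class_predicate(explicit, weights))
except ValueError as err:
    print(err)
```

With the implicit config the gate is passed through unchanged. With the explicit config the predicate hands back a params dict for a module that cannot accept it.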

Fix

Add a no-op to_quantized() to MoEGate. The gate can now be referenced by per-path quantization configs while staying unquantized — its router weights are tiny and path-dependent in the routing computation, so leaving them in full precision is the expected behavior.
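As a rough illustration of what "no-op" means here, the sketch below uses a hypothetical MoEGateSketch class with plain Python lists in place of the mx array. The **kwargs signature and the return of self are assumptions for illustration and may differ from the method added in this PR.

```python
class MoEGateSketch:
    """Stand-in for the gate; the real class holds an mx array, not lists."""

    def __init__(self, num_experts, hidden_size):
        # Full-precision router weights (plain nested lists here).
        self.weight = [[0.0] * hidden_size for _ in range(num_experts)]

    def to_quantized(self, **kwargs):
        # No-op (assumed shape): accept whatever quantization params the
        # config supplies and return the module unchanged.
        return self


gate = MoEGateSketch(num_experts=4, hidden_size=8)
same = gate.to_quantized(group_size=64, bits=4)
assert same is gate  # still the same, unquantized module
```

Returning the module itself matters in this sketch because a quantization pass that substitutes the return value for the original module would otherwise drop the gate.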

Test

Added test_nemotron_h_moe_gate_quantization in tests/test_models.py: it builds a tiny nemotron_h model with an MoE block and runs nn.quantize() with a class_predicate that returns a params dict for the gate path (mirroring load_model()'s behavior). The test fails before the fix with the exact ValueError from the issue and passes after.
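The before/after shape of such a regression test can be sketched with the standard unittest module. This is not the test added in tests/test_models.py: Gate, FixedGate, quantize_one(), and the path are hypothetical stand-ins, and no mlx model is built.

```python
import unittest


class Gate:
    """Stand-in for the gate before the fix: no to_quantized()."""


class FixedGate(Gate):
    def to_quantized(self, **kwargs):
        return self  # no-op, as assumed for the fix


def quantize_one(path, module, class_predicate):
    """Hypothetical single-module stand-in for the quantization dispatch."""
    params = class_predicate(path, module)
    if not params:
        return module
    if not hasattr(module, "to_quantized"):
        raise ValueError(f"Unable to quantize model of type {type(module)}")
    return module.to_quantized(**params)


class GateQuantizationSketch(unittest.TestCase):
    PATH = "layers.0.mixer.gate"  # hypothetical path

    @staticmethod
    def predicate(p, m):
        # Like the first branch of class_predicate: a params dict for the gate.
        return {"group_size": 64, "bits": 4} if p.endswith("mixer.gate") else False

    def test_fails_without_to_quantized(self):
        with self.assertRaises(ValueError):
            quantize_one(self.PATH, Gate(), self.predicate)

    def test_passes_with_noop(self):
        gate = FixedGate()
        self.assertIs(quantize_one(self.PATH, gate, self.predicate), gate)
```

The first case corresponds to the state before the fix and the second to the state after it.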

Verified locally with python -m unittest for the new test (against mlx 0.29.3, since nn.quantize's to_quantized contract is unchanged across versions); CI will exercise it on a supported Python.

MoEGate stores its router weights as a raw mx.zeros tensor and does not subclass nn.Linear, so it has no to_quantized() method. When a model's quantization config lists per-layer entries for the gate paths, the class_predicate in load_model() returns a truthy params dict for those paths and nn.quantize() then raises "Unable to quantize model of type MoEGate". Add a no-op to_quantized() so the gate can be referenced by per-path quantization configs while staying unquantized; its weights are tiny and path-dependent in the routing computation.

YBJ0000 wants to merge 1 commit into ml-explore:main from YBJ0000:fix/nemotron-h-moe-gate-quantization
