Add Mellum (Mellum 2) model support #1339
Conversation
|
Thanks for adding a new model. I tried the thinking variant, and the output loops on <|im_end|> / </tool_call> after the assistant turn ends. I think the probable reason is that Mellum2's generation_config.json has "eos_token_id": 0 (<|endoftext|>), so mlx-lm |
|
Would you mind rebasing this branch? |
Mellum 2 is a Qwen3-lineage Mixture-of-Experts model with MoE on every layer, per-head QK-norm, a hybrid of sliding-window and full-attention layers driven by layer_types, and per-layer RoPE: full-attention layers use YaRN scaling and sliding-window layers use plain RoPE. The rope settings come from the config's rope_parameters mapping, keyed by attention type.
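To make the per-layer selection concrete, here is a minimal sketch of how layer_types and rope_parameters could drive the choice of window and rope type for each layer. It is not the code from mellum.py: the layer-type strings ("full_attention", "sliding_attention"), the "rope_type" key, the three-layer layout, and the names LayerPlan and plan_layers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, trimmed-down config. The field names layer_types,
# rope_parameters and sliding_window come from the description above;
# the layer-type strings and rope_type values are assumed, not confirmed.
config = {
    "sliding_window": 1024,
    "layer_types": ["sliding_attention", "sliding_attention", "full_attention"],
    "rope_parameters": {
        "full_attention": {"rope_type": "yarn"},
        "sliding_attention": {"rope_type": "default"},
    },
}


@dataclass
class LayerPlan:
    index: int
    layer_type: str
    rope_type: str
    window: Optional[int]  # None means the layer attends over the full context


def plan_layers(cfg: dict) -> List[LayerPlan]:
    """Pick the rope settings and attention window for every layer."""
    plans = []
    for i, layer_type in enumerate(cfg["layer_types"]):
        # The rope config is looked up by attention type, not by layer index.
        rope = cfg["rope_parameters"][layer_type]
        is_sliding = layer_type == "sliding_attention"
        window = cfg["sliding_window"] if is_sliding else None
        plans.append(LayerPlan(i, layer_type, rope["rope_type"], window))
    return plans


plans = plan_layers(config)
for plan in plans:
    print(plan)
```

In a real model the window would decide which mask and cache type a layer gets (a rotating cache for sliding layers, a regular one for full-attention layers), and the rope entry would be handed to whatever constructs the rotary embedding.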
|
Rebased! |
|
And the MLX models at https://huggingface.co/jedisct1/models already have the tool calling fix. |
Add support for the Mellum (Mellum 2) architecture
This adds a mellum.py model implementation so mlx-lm can load and run JetBrains' Mellum 2 models, e.g. JetBrains/Mellum2-12B-A2.5B-Thinking and JetBrains/Mellum2-12B-A2.5B-Instruct (model_type: mellum, MellumForCausalLM).

Architecture
Mellum 2 is a Mixture-of-Experts model in the Qwen3 lineage with a couple of distinctive features:
- MoE on every layer, with renormalized gating (same block shape as qwen3_moe).
- Per-head QK-norm.
- A hybrid of sliding-window and full-attention layers, driven by the config's layer_types list (sliding_window = 1024). Implemented with the same global/sliding mask split and KVCache/RotatingKVCache pairing as gemma3_text.
- Per-layer RoPE: full-attention layers use YaRN scaling and sliding-window layers use plain RoPE. Both rope configs come from the config's rope_parameters mapping keyed by attention type.

The implementation composes the well-tested qwen3_moe MoE/attention block with the gemma3_text hybrid-mask scheme, plus per-layer rope selection via initialize_rope. The YaRN attention_factor in the published config (1.2772588722239782) matches the YarnRoPE default mscale exactly.

Testing
Converted JetBrains/Mellum2-12B-A2.5B-Thinking to MLX (bf16) and confirmed weights load with no missing or unexpected keys and that generation is coherent on reasoning and coding prompts.
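The attention_factor noted under Architecture can also be sanity-checked by hand. The sketch below assumes the standard YaRN attention scaling, 0.1 * mscale * ln(scale) + 1 with mscale = 1, and assumes that this is what the YarnRoPE default computes. The helper name yarn_mscale is hypothetical, and the scaling factor of 16 is inferred from the published number rather than stated in this PR.

```python
import math


def yarn_mscale(scale: float, mscale: float = 1.0) -> float:
    """Standard YaRN attention scaling; 1.0 when no context extension is applied."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


# attention_factor quoted from the published config in the PR description.
published = 1.2772588722239782

# Invert the formula to see which scaling factor the published value implies.
implied_scale = math.exp((published - 1.0) / 0.1)
print(f"implied YaRN scaling factor: {implied_scale:.6f}")

# Recompute the attention factor from the inferred factor (assumed to be 16).
recomputed = yarn_mscale(16.0)
print(f"recomputed attention factor: {recomputed!r}")
```

If the assumption holds, the implied factor comes out at approximately 16 and the recomputed value agrees with the published one to floating-point precision, which is consistent with the attention_factor being the default rather than a hand-tuned value.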