Add Mellum (Mellum 2) model support #1339

Merged
nastya236 merged 1 commit into ml-explore:main from jedisct1:add-mellum on Jun 6, 2026

jedisct1 commented Jun 2, 2026
Contributor

Add support for the Mellum (Mellum 2) architecture

This adds a mellum.py model implementation so mlx-lm can load and run JetBrains'
Mellum 2 models, e.g. JetBrains/Mellum2-12B-A2.5B-Thinking
and JetBrains/Mellum2-12B-A2.5B-Instruct
(model_type: mellum, MellumForCausalLM).

Architecture

Mellum 2 is a Mixture-of-Experts model in the Qwen3 lineage with a couple of distinctive features:

  • MoE every layer — 64 experts, 8 active per token, with a linear router and softmax/top-k
    renormalized gating (same block shape as qwen3_moe).
  • Per-head QK-norm RMSNorm on the query/key heads (Qwen3 style).
  • Hybrid attention — each layer is either sliding-window or full attention, driven by the
    config's layer_types list (sliding_window = 1024). Implemented with the same
    global/sliding mask split and KVCache / RotatingKVCache pairing as gemma3_text.
  • Per-layer RoPE — full-attention layers use YaRN scaling, sliding-window layers use plain
    RoPE. Both rope configs come from the config's rope_parameters mapping keyed by attention type.
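
The gating in the first bullet can be illustrated with a minimal, dependency-free sketch. It assumes the order used by the qwen3_moe block (softmax over all experts, then top-k selection, then renormalization of the kept weights). The function name `route_token` and the toy logits are hypothetical and are not taken from mellum.py.

```python
import math


def route_token(router_logits, top_k=8):
    """Pick top_k experts for one token and return (indices, weights).

    Hypothetical helper: softmax over all experts, keep the top_k,
    then renormalize so the kept weights sum to 1.
    """
    # Numerically stable softmax over every expert's router logit.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k most probable experts.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:top_k]

    # Renormalize the selected probabilities.
    kept = sum(probs[i] for i in top)
    weights = [probs[i] / kept for i in top]
    return top, weights


# Toy example: 64 experts, 8 active per token. The logits are made up.
logits = [((i * 37) % 64) / 10.0 for i in range(64)]
experts, weights = route_token(logits, top_k=8)
```

Each token's output would then be the weighted sum of the selected experts' outputs, using `weights` as the mixing coefficients.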

The implementation composes the well-tested qwen3_moe MoE/attention block with the gemma3_text
hybrid-mask scheme, plus per-layer rope selection via initialize_rope. The YaRN attention_factor
in the published config (1.2772588722239782) matches the YarnRoPE default mscale exactly.
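
As a rough illustration of the per-layer selection and of the attention_factor remark, here is a small sketch in plain Python. The layer-type strings, the `rope_parameters` layout, and the helper names are assumptions made for the example, not the published config or the mellum.py code. The YaRN factor of 16 is also an assumption: it is not stated in this PR, it is simply the value for which the usual default formula `0.1 * ln(factor) + 1` reproduces 1.2772588722239782.

```python
import math


def yarn_mscale(factor, mscale=1.0):
    """Usual default YaRN attention scaling: 0.1 * mscale * ln(factor) + 1."""
    if factor <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(factor) + 1.0


def plan_layers(layer_types, rope_parameters):
    """Map each layer type to a cache kind and a rope type (hypothetical helper)."""
    plan = []
    for kind in layer_types:
        sliding = kind == "sliding_attention"
        plan.append(
            {
                # Sliding-window layers only need a bounded, rotating cache.
                "cache": "RotatingKVCache" if sliding else "KVCache",
                # Rope settings are looked up by attention type.
                "rope": rope_parameters[kind]["rope_type"],
            }
        )
    return plan


# Hypothetical config fragment; the real values live in the model's config.json.
layer_types = ["sliding_attention", "sliding_attention", "full_attention"]
rope_parameters = {
    "sliding_attention": {"rope_type": "default"},
    "full_attention": {"rope_type": "yarn", "factor": 16.0},  # assumed factor
}

plan = plan_layers(layer_types, rope_parameters)
attention_factor = yarn_mscale(rope_parameters["full_attention"]["factor"])
```

Under these assumptions `attention_factor` agrees with the published value to within floating-point rounding, which is why passing it explicitly or relying on the default would give the same scaling.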

Testing

Converted JetBrains/Mellum2-12B-A2.5B-Thinking to MLX (bf16) and confirmed weights load with no
missing or unexpected keys and that generation is coherent on reasoning and coding prompts.

@nastya236 nastya236 added the enhancement New feature or request label Jun 6, 2026
@nastya236 nastya236 self-requested a review June 6, 2026 09:54
nastya236 commented
Collaborator

Thanks for adding a new model.

I tried the thinking variant and the output loops on <|im_end|> / </tool_call> after the assistant turn ends:

mlx_lm.chat --model JetBrains/Mellum2-12B-A2.5B-Thinking --max-tokens 4096

>> Are you an instruct model?
<think>
[reasoning omitted for brevity]
</think>

Yes, I am an instruct model. I'm designed to follow instructions, answer questions, and assist with a wide range of tasks based on user input. My purpose is to provide helpful, accurate, and context-aware responses. Let me know how I can assist you!<|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call><|im_end|>
</tool_call>
</think>

I think the probable reason is that Mellum2's generation_config.json has "eos_token_id": 0 (<|endoftext|>), so mlx-lm never stops on the chat-end token.
I don't think this blocks merging the model, but if you plan to use it I'd recommend updating generation_config.json upstream so that eos_token_id is a list that includes <|im_end|> (id 28), e.g. "eos_token_id": [0, 28]. Qwen3.6 does the same: it adds the <|im_end|> token to generation_config.json so that generation stops when <|im_end|> is hit.
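
A minimal sketch of that config change, using only the standard library. The helper name `add_eos_token` is hypothetical, and the JSON string stands in for the contents of generation_config.json; only the ids 0 and 28 come from the discussion above.

```python
import json


def add_eos_token(generation_config, extra_id):
    """Return a copy of the config whose eos_token_id is a list including extra_id."""
    cfg = dict(generation_config)
    eos = cfg.get("eos_token_id", [])
    # A single integer id becomes a one-element list; an existing list is copied.
    eos = [eos] if isinstance(eos, int) else list(eos)
    if extra_id not in eos:
        eos.append(extra_id)
    cfg["eos_token_id"] = eos
    return cfg


# Stand-in for the relevant part of generation_config.json.
original = json.loads('{"eos_token_id": 0}')
patched = add_eos_token(original, 28)  # 28 is the <|im_end|> id mentioned above
print(json.dumps(patched))  # prints {"eos_token_id": [0, 28]}
```

The function leaves the input untouched and is safe to run twice, so it could be applied to a config that already lists both ids.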

nastya236 commented
Collaborator

Would you mind rebasing this branch?

Mellum 2 is a Qwen3-lineage Mixture-of-Experts model with MoE on every
layer, per-head QK-norm, a hybrid of sliding-window and full-attention
layers driven by layer_types, and per-layer RoPE where full-attention
layers use YaRN scaling and sliding-window layers use plain RoPE.

The rope settings come from the config's rope_parameters mapping keyed
by attention type.

jedisct1 commented Jun 6, 2026
Contributor Author

Rebased!

jedisct1 commented Jun 6, 2026
Contributor Author

And the MLX models at https://huggingface.co/jedisct1/models already have the tool calling fix.

@nastya236 nastya236 merged commit e476a22 into ml-explore:main Jun 6, 2026
2 checks passed