mtplx doctor --json
Hi, I’m testing MTPLX 0.3.5, installed via Homebrew, on a Mac Studio M3 Ultra (512 GB RAM).
Environment
- MTPLX 0.3.5
- Apple M3 Ultra
- 512 GiB unified memory
- Python 3.13.13 arm64
- MLX 0.31.2
- mlx_lm 0.31.3
Doctor
I ran mtplx doctor --json and the environment looks healthy overall:
- native arm64 Python: pass
- MLX import/device: pass
- estimated runtime memory well below available memory
- no low power mode
- no thermal warnings
The only warnings are unrelated to this problem (default Qwen cache missing, port 8000 already open, no ThermalForge).
Model
Local model:
/Users/macstudio/.lmstudio/models/mlx-community/GLM-4.7-PRISM-8bit-gs64-mlx
Initial behavior
The model originally showed:
No MTP head · Glm4MoeForCausalLM
I found that:
- the converted config had
"num_nextn_predict_layers": 0
- the original BF16 model had
"num_nextn_predict_layers": 1
After adding mtp.safetensors to the converted model directory and changing the converted config to:
"num_nextn_predict_layers": 1
the output of mtplx inspect changed to:
- recognized: true
- can_run: true
- runtime_compatibility: native-family-gated
- mtp_layers: 1
- mtp_tensors_present: 502
So inspect now considers the model runnable.
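For anyone wanting to check the same thing on their own conversion, the config comparison described above can be sketched in a few lines of Python. This is only an illustration: the helper names `mtp_layer_count` and `mtp_config_mismatch` are made up for this example and are not part of MTPLX, and it assumes both model directories contain a plain `config.json` with the key at the top level.

```python
import json
from pathlib import Path


def mtp_layer_count(model_dir):
    """Return num_nextn_predict_layers from <model_dir>/config.json, or None if the key is absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("num_nextn_predict_layers")


def mtp_config_mismatch(converted_dir, original_dir):
    """Return True when the converted config disagrees with the original about MTP layers."""
    return mtp_layer_count(converted_dir) != mtp_layer_count(original_dir)
```

In my case the converted directory reported 0 and the original BF16 directory reported 1, so a check like this would have flagged the mismatch before loading the model.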
Runtime result
The model loads successfully in the sustained profile and enters:
Generation mode: MTP
Native-MTP speed path: draft-only LM head is active
But on the first prompt, generation crashes with:
AttributeError: 'LanguageModel' object has no attribute 'fa_idx'
The stack trace points into:
generate_mtpk
forward_ar_capture
forward_with_gdn_capture
gdn_capture.py
where it accesses:
cache[inner.fa_idx]
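To make the failure mode concrete, here is a minimal sketch with stand-in classes. It is not MTPLX code: `FakeInner` and `pick_cache_entry` are hypothetical names, and I am assuming `fa_idx` is meant to be an integer index into a list-like cache. It only shows how an attribute lookup like the one above fails on a model object that never defines `fa_idx`, and how a guard could turn that into a clearer error.

```python
class FakeInner:
    """Hypothetical stand-in for a model object that never defines fa_idx."""


def pick_cache_entry(cache, inner):
    """Index the cache by inner.fa_idx, raising a clearer error when the attribute is missing."""
    fa_idx = getattr(inner, "fa_idx", None)
    if fa_idx is None:
        # Assumption: a missing fa_idx means this capture path does not apply to the model.
        raise NotImplementedError(
            f"{type(inner).__name__} does not define fa_idx; "
            "this capture path is not available for this model family"
        )
    return cache[fa_idx]
```

If the capture path is specific to another model family, an explicit error of this kind (or an inspect result of can_run: false) would be easier to diagnose than the raw AttributeError.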
Additional observation
If I run with /mtp off, the model does generate, but it is extremely slow and GPU usage is unstable compared with direct MLX inference.
Question
Is GLM-4.7 PRISM / Glm4MoeForCausalLM expected to work in MTPLX 0.3.5 with MTP, or is this currently a known issue in the GLM runtime path?
doctor_output.json
Exact command
mtplx start --model /Users/macstudio/.lmstudio/models/mlx-community/GLM-4.7-PRISM-8bit-gs64-mlx
Model path or repo id
/Users/macstudio/.lmstudio/models/mlx-community/GLM-4.7-PRISM-8bit-gs64-mlx
Chip, RAM, macOS version
Apple M3 Ultra, 512 GB RAM, macOS 26.4.1