fix(load): VLM model loading fixes for oQ-quantized checkpoints #1247

Merged
jundot merged 1 commit into jundot:main from a4501150:pr/vlm-load-fixes
May 14, 2026

Conversation

@a4501150
Contributor

Summary

  • Expand per-layer quantization config keys for VLM model-tree paths so quantization config matches the MLX model parameter hierarchy (e.g. language_model.model.layers.N vs model.layers.N)
  • Centralise pre-load patches in oQ _measure_sensitivity so MTP/nested-visual patches are active during sensitivity measurement
  • Remap nested visual keys (language_model.model.visual.* → vision_tower.*) for MLX-format VLM models where mlx-vlm skips Model.sanitize
  • Fix nested-visual patch idempotency: use function-attribute marker instead of module-level flag so the wrap can re-apply if another patch (e.g. MTP runtime) overwrites Model.sanitize
  • Add inline nested-visual post-fixup in all three MTP sanitize functions
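
To illustrate the idempotency point in the summary, here is a minimal sketch of a function-attribute marker. The marker name `_nested_visual_wrapped`, the `apply_nested_visual_patch` helper and the stand-in `Model` class are assumptions for illustration and are not taken from the diff. The idea is that the marker lives on the wrapper function itself, so if another patch replaces `Model.sanitize`, the marker disappears with it and the wrap can be applied again.

```python
# Hypothetical marker name; the PR does not show the attribute it uses.
_WRAP_MARKER = "_nested_visual_wrapped"
NESTED_PREFIX = "language_model.model.visual."
TARGET_PREFIX = "vision_tower."


class Model:
    """Stand-in for a VLM model class that exposes a sanitize hook."""

    def sanitize(self, weights):
        return dict(weights)


def apply_nested_visual_patch(model_cls):
    """Wrap model_cls.sanitize once; return True if a wrap was applied."""
    current = model_cls.sanitize
    if getattr(current, _WRAP_MARKER, False):
        return False  # the live sanitize is already our wrapper

    def wrapped(self, weights):
        out = current(self, weights)
        # Post-fixup: move nested visual keys under the vision tower.
        return {
            (TARGET_PREFIX + k[len(NESTED_PREFIX):]
             if k.startswith(NESTED_PREFIX) else k): v
            for k, v in out.items()
        }

    # The marker is stored on the function, not in a module-level flag.
    setattr(wrapped, _WRAP_MARKER, True)
    model_cls.sanitize = wrapped
    return True
```

With a module-level flag, the second call would be skipped even after another patch had overwritten `Model.sanitize`, which would leave the remap inactive.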

Background on nested visual bug

Qwen3.6-35B-A3B nests ViT weights at model.language_model.visual.*. mlx-vlm's sanitize uses if/elif that matches model.language_model first, rewriting to language_model.model.visual.* — the model.visual → vision_tower branch never fires. For non-MLX-format models, the existing qwen3_6_nested_visual sanitize wrap catches this. But mlx-vlm skips sanitize entirely for MLX-format checkpoints (format=mlx in safetensors metadata), so oQ output models fail with "333 parameters not in model". The new _remap_nested_visual_on_load context manager intercepts nn.Module.load_weights during the scoped vlm_load() call to remap keys before they reach the model.
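
The two mechanisms described above can be sketched roughly as follows. This is a simplified illustration, not the code from the diff: `sanitize_key` is a reduced stand-in for the if/elif chain in mlx-vlm's sanitize, `Module` is a stand-in for `nn.Module`, and `remap_nested_visual_on_load` here only handles weights passed as a list of `(key, value)` pairs.

```python
from contextlib import contextmanager

NESTED_PREFIX = "language_model.model.visual."
TARGET_PREFIX = "vision_tower."


def sanitize_key(key):
    """Reduced if/elif chain: the first matching branch wins."""
    if key.startswith("model.language_model"):
        return key.replace("model.language_model", "language_model.model", 1)
    elif key.startswith("model.visual"):
        return key.replace("model.visual", "vision_tower", 1)
    return key


class Module:
    """Stand-in for nn.Module; it only records what it is given."""

    def __init__(self):
        self.loaded = {}

    def load_weights(self, weights, strict=True):
        self.loaded.update(dict(weights))
        return self


@contextmanager
def remap_nested_visual_on_load(module_cls):
    """Temporarily patch load_weights so nested visual keys are remapped."""
    original = module_cls.load_weights

    def patched(self, weights, *args, **kwargs):
        remapped = [
            (TARGET_PREFIX + k[len(NESTED_PREFIX):]
             if k.startswith(NESTED_PREFIX) else k, v)
            for k, v in weights
        ]
        return original(self, remapped, *args, **kwargs)

    module_cls.load_weights = patched
    try:
        yield
    finally:
        # Always restore, so the patch stays scoped to the load call.
        module_cls.load_weights = original
```

Scoping the patch with a context manager keeps the interception limited to the load call, so later `load_weights` calls elsewhere are unaffected.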

Test plan

  • pytest tests/test_oq.py -v
  • Server loads Qwen3.6-35B-A3B-uncensored-heretic (non-MLX format) — nested-visual sanitize wrap fires
  • Server loads Qwen3.6-35B-A3B-uncensored-heretic-oQ6-mtp (MLX format) — _remap_nested_visual_on_load remaps 333 keys

🤖 Generated with Claude Code

- Expand per-layer quant keys for VLM model-tree paths so quantization
  config matches the MLX model parameter hierarchy
- Centralise pre-load patches in oQ _measure_sensitivity
- Remap nested visual keys (language_model.model.visual.* -> vision_tower.*)
  for MLX-format VLM models where mlx-vlm skips Model.sanitize
- Fix nested-visual patch idempotency: use function-attribute marker
  instead of module-level flag
- Add inline nested-visual post-fixup in MTP sanitize functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jundot
Owner

jundot commented May 14, 2026

Thanks for this, the load-path fixes look solid and the nested-visual / quant-key remapping all check out.

Merging this with two follow-ups from me, no action needed on your side:

  1. Dropping the new TestBuiltinCalibration class. It expects mixed/chat/bartowski categories and 3000+ samples, but the oq_calibration_data.json in the repo has 7 categories / ~600 samples — looks like the updated JSON wasn't part of the diff. The reasoning addition in _load_builtin_calibration is fine since that key already exists in the shipped JSON.
  2. _measure_sensitivity now goes through maybe_apply_pre_load_patches, which leaves mtp_active=False and doesn't apply the mlx-vlm runtime patch. For an MLX-format VLM checkpoint with MTP heads, mlx-vlm skips sanitize entirely, so the language_model.mtp.* weights stay in the dict but no MTP head is attached — load_weights then rejects them and sensitivity silently returns empty. The text path is fine (the patched qwen35_model.sanitize self-consistently strips mtp.* when no head). I'll restore the head attachment for the VLM path in a follow-up.
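
A rough sketch of the failure mode in item 2, under stated assumptions: `StrictModel` is a hypothetical stand-in for a model whose `load_weights` rejects unexpected keys, and `strip_unattached_mtp` is a hypothetical helper that mirrors what the text-path sanitize is described as doing. Neither is taken from the codebase.

```python
class StrictModel:
    """Stand-in model: load_weights rejects keys it has no parameter for."""

    def __init__(self, expected_keys):
        self.expected = set(expected_keys)
        self.loaded = {}

    def load_weights(self, weights):
        extra = [k for k, _ in weights if k not in self.expected]
        if extra:
            raise ValueError(f"{len(extra)} parameters not in model")
        self.loaded = dict(weights)


def strip_unattached_mtp(weights, has_mtp_head, prefix="mtp."):
    """Drop MTP weights when no head is attached (text-path behaviour)."""
    if has_mtp_head:
        return list(weights)
    return [(k, v) for k, v in weights if not k.startswith(prefix)]


def try_load(model, weights):
    """Return True if the strict load accepted every key."""
    try:
        model.load_weights(weights)
    except ValueError:
        return False
    return True
```

In this sketch, a model without an MTP head fails to load weights that still contain `language_model.mtp.*` keys, while a model whose expected keys include the head loads them. The follow-up described above takes the second route for the VLM path: it attaches the head rather than stripping the weights.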

@jundot jundot merged commit ccfba1d into jundot:main May 14, 2026
jundot added a commit that referenced this pull request May 14, 2026
…on tests

PR #1247 routed _measure_sensitivity through maybe_apply_pre_load_patches,
which leaves mtp_active False. For MLX-format VLM checkpoints mlx-vlm skips
sanitize, so the language_model.mtp.* weights stay in the dict but no head
gets attached. load_weights then rejects them and sensitivity silently
returns {}. Re-apply the mlx-vlm runtime MTP patch and set mtp_active True
for the VLM load when the source declares MTP heads. The text path is
unchanged since the patched qwen35_model.sanitize already strips mtp.* when
no head is attached.

Also drop TestBuiltinCalibration. It expects calibration categories and
sample counts that the shipped oq_calibration_data.json does not have (the
updated JSON was not part of #1247). And update the #1204 discovery-failure
test for the new sensitivity-before-discovery ordering.
@a4501150
Contributor Author

thanks @jundot

For the first one: that belongs to PR #1246; I cherry-picked the wrong files. If we decide to merge that one, a follow-up could move TestBuiltinCalibration into PR #1246.

blightbow added a commit to blightbow/omlx that referenced this pull request May 15, 2026
Pulls in 8 upstream commits, most relevantly:

- 386e16f fix(tests): repair pre-existing upstream test failures
  and import guards (jundot#1244) — restores list-shaped GitHub
  releases payload in test_admin_update_check / test_admin_auth
  fixtures.  Was committed upstream 2026-05-14 10:21, after this
  branch's previous merge of main and before the next.  Branch
  was unknowingly running with these 4 tests failing the entire
  time.
- 4fe004d feat: add Hermes Agent quick launch (jundot#1250)
- ccfba1d fix(load): VLM model loading fixes for oQ-quantized
  checkpoints (jundot#1247)
- 51907f0 fix(oq): restore MTP head attach for VLM sensitivity
- and others

Without jundot#1244 we keep inheriting the broken admin-auth /
update-check tests as branch-only baseline failures.  The fix
landed 8 hours before today's MRU work and was never picked up
because the branch hadn't merged main since.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
panwudi pushed a commit to panwudi/flyto-mlx that referenced this pull request May 16, 2026
…ot#1247)

- Expand per-layer quantization config keys for VLM model-tree paths so `quantization` config matches the MLX model parameter hierarchy (e.g. `language_model.model.layers.N` vs `model.layers.N`)
- Centralise pre-load patches in oQ `_measure_sensitivity` so MTP/nested-visual patches are active during sensitivity measurement
- Remap nested visual keys (`language_model.model.visual.*` → `vision_tower.*`) for MLX-format VLM models where mlx-vlm skips `Model.sanitize`
- Fix nested-visual patch idempotency: use function-attribute marker instead of module-level flag so the wrap can re-apply if another patch (e.g. MTP runtime) overwrites `Model.sanitize`
- Add inline nested-visual post-fixup in all three MTP sanitize functions

Qwen3.6-35B-A3B nests ViT weights at `model.language_model.visual.*`. mlx-vlm's sanitize uses if/elif that matches `model.language_model` first, rewriting to `language_model.model.visual.*` — the `model.visual → vision_tower` branch never fires. For non-MLX-format models, the existing `qwen3_6_nested_visual` sanitize wrap catches this. But mlx-vlm skips sanitize entirely for MLX-format checkpoints (`format=mlx` in safetensors metadata), so oQ output models fail with "333 parameters not in model". The new `_remap_nested_visual_on_load` context manager intercepts `nn.Module.load_weights` during the scoped `vlm_load()` call to remap keys before they reach the model.

- [x] `pytest tests/test_oq.py -v`
- [x] Server loads `Qwen3.6-35B-A3B-uncensored-heretic` (non-MLX format) — nested-visual sanitize wrap fires
- [x] Server loads `Qwen3.6-35B-A3B-uncensored-heretic-oQ6-mtp` (MLX format) — `_remap_nested_visual_on_load` remaps 333 keys

🤖 Generated with [Claude Code](https://claude.com/claude-code)
panwudi pushed a commit to panwudi/flyto-mlx that referenced this pull request May 18, 2026
…on tests

PR jundot#1247 routed _measure_sensitivity through maybe_apply_pre_load_patches,
which leaves mtp_active False. For MLX-format VLM checkpoints mlx-vlm skips
sanitize, so the language_model.mtp.* weights stay in the dict but no head
gets attached. load_weights then rejects them and sensitivity silently
returns {}. Re-apply the mlx-vlm runtime MTP patch and set mtp_active True
for the VLM load when the source declares MTP heads. The text path is
unchanged since the patched qwen35_model.sanitize already strips mtp.* when
no head is attached.

Also drop TestBuiltinCalibration. It expects calibration categories and
sample counts that the shipped oq_calibration_data.json does not have (the
updated JSON was not part of jundot#1247). And update the jundot#1204 discovery-failure
test for the new sensitivity-before-discovery ordering.

(cherry picked from commit 51907f0)
panwudi pushed a commit to panwudi/flyto-mlx that referenced this pull request May 18, 2026
test_is_mtp_eligible_requires_mtp_forward_and_solo_batch was written
against a pre-jundot#1247 contract where head presence implied MTP active.
Commit 23ca7dc decoupled head attachment from inference-time MTP for
the VLM load path and added an is_mtp_active() gate. Set the flag
around the True-expected assertion, restore in finally, and add a new
"head attached but flag off" case to lock in the post-23ca7dc semantics.

test_patch_wraps_target_processors stubbed the wrong module path and
class name for dots_ocr (mlx_vlm.models.dots_ocr.processing /
DotsOcrProcessor), but the patcher imports
mlx_vlm.models.dots_ocr.processing_dots_ocr and looks up
DotsVLProcessor (per commit a1987ed, where the test was added). The
fake_import branch never matched, the dots branch silently fell
through to the except clause, and the wrap-marker assertion failed.
Align the test's stubs with the actual (module_path, cls_name) tuples
in vlm.py.

Both are test-only fixes. Refs jundot#1259.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 6057304)
jundot pushed a commit that referenced this pull request May 19, 2026
test_is_mtp_eligible_requires_mtp_forward_and_solo_batch was written
against a pre-#1247 contract where head presence implied MTP active.
Commit 23ca7dc decoupled head attachment from inference-time MTP for
the VLM load path and added an is_mtp_active() gate. Set the flag
around the True-expected assertion, restore in finally, and add a new
"head attached but flag off" case to lock in the post-23ca7dc semantics.

test_patch_wraps_target_processors stubbed the wrong module path and
class name for dots_ocr (mlx_vlm.models.dots_ocr.processing /
DotsOcrProcessor), but the patcher imports
mlx_vlm.models.dots_ocr.processing_dots_ocr and looks up
DotsVLProcessor (per commit a1987ed, where the test was added). The
fake_import branch never matched, the dots branch silently fell
through to the except clause, and the wrap-marker assertion failed.
Align the test's stubs with the actual (module_path, cls_name) tuples
in vlm.py.

Both are test-only fixes. Refs #1259.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
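
The "set the flag, assert, restore in finally" pattern from the first test fix might look roughly like the sketch below. Only `is_mtp_active` and the test name come from the commit message; `set_mtp_active`, `FakeModel`, and the body of `is_mtp_eligible` are assumptions made up for illustration.

```python
_MTP_ACTIVE = False  # hypothetical module-level gate


def is_mtp_active():
    return _MTP_ACTIVE


def set_mtp_active(value):
    """Hypothetical setter for the gate."""
    global _MTP_ACTIVE
    _MTP_ACTIVE = bool(value)


class FakeModel:
    """Stand-in model with an MTP head attached."""

    def mtp_forward(self):
        return None


def is_mtp_eligible(model, batch_size):
    """Assumed contract: gate on, head present, and a solo batch."""
    has_head = callable(getattr(model, "mtp_forward", None))
    return is_mtp_active() and has_head and batch_size == 1


def test_is_mtp_eligible_requires_mtp_forward_and_solo_batch():
    model = FakeModel()
    previous = is_mtp_active()
    set_mtp_active(True)
    try:
        assert is_mtp_eligible(model, batch_size=1)
        assert not is_mtp_eligible(model, batch_size=2)
        assert not is_mtp_eligible(object(), batch_size=1)
    finally:
        # Restore the gate so other tests see the original state.
        set_mtp_active(previous)
    # Head attached but flag off: no longer eligible.
    assert not is_mtp_eligible(model, batch_size=1)
```

Restoring in `finally` matters because the gate is shared state: a failing assertion would otherwise leave it switched on for every test that runs afterwards.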
ziya32 pushed a commit to ziya32/omlx that referenced this pull request May 19, 2026
…lm_sync

Re-applies pieces from main commits ccfba1d (jundot#1247) + a6781db (jundot#209) lost
when vlm.py took --ours in the v0.3.9rc1 merge.

(1) _remap_nested_visual_on_load context manager: mlx-vlm's load_model
skips Model.sanitize when safetensors metadata declares format=mlx. oQ
output is MLX-format, so the nested-visual key fixup that sanitize
normally applies never fires. The wrapper intercepts Module.load_weights
and remaps 'language_model.model.visual.*' -> 'vision_tower.*'. Without
this, Qwen3.6-35B-A3B oQ variants fail to load with '333 parameters not
in model'.

(2) maybe_load_custom_quantization dispatch: ParoQuant checkpoints
require a non-standard loader. Returns (model, processor) on match;
falls through to standard vlm_load() otherwise. Without this, ParoQuant
VLM checkpoints fail at load with mlx-vlm's standard pipeline.
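
The dispatch in (2) can be sketched as a try-custom-then-fall-through pattern. The signature below is hypothetical: the commit message only states that `maybe_load_custom_quantization` returns `(model, processor)` on a match, so returning `None` on no match, the `custom_loaders` argument, and the `load_vlm` wrapper are all assumptions.

```python
def maybe_load_custom_quantization(path, custom_loaders):
    """Return (model, processor) from the first matching loader, else None.

    custom_loaders is a list of (matches, loader) callables; this shape
    is an assumption for the sketch.
    """
    for matches, loader in custom_loaders:
        if matches(path):
            return loader(path)
    return None


def load_vlm(path, custom_loaders, standard_load):
    """Try custom quantization loaders first, then the standard pipeline."""
    result = maybe_load_custom_quantization(path, custom_loaders)
    if result is not None:
        return result
    return standard_load(path)
```

Keeping the custom check ahead of the standard loader means checkpoints that need a non-standard loader never reach a pipeline that cannot read them, while all other checkpoints take the unchanged path.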
ziya32 pushed a commit to ziya32/omlx that referenced this pull request May 19, 2026
…rphaned (jundot#1247)

Re-applies the per-layer quant expansion missing from feature's vlm.py
(--ours took feature, lost main's ccfba1d). Flattens nested per-layer
quant configs (e.g. language_model.model.layers.N vs model.layers.N)
into a uniform schema so oQ quantization reserves bits per actual model
layer rather than per top-level config key.

Without the expansion, oQ-quantized VLM checkpoints with nested model
hierarchies may load with wrong/missing per-layer quant attributes.
No-op for configs without per-layer quant data.
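
As a minimal sketch of the per-layer key expansion, assume the quantization config stores global settings as scalars and per-layer overrides as dict values keyed by module path. The helper name `expand_per_layer_quant_keys`, the `language_model.` prefix argument and the direction of the aliasing are assumptions; the actual implementation in vlm.py may differ.

```python
def expand_per_layer_quant_keys(quantization, prefix="language_model."):
    """Add prefixed aliases for per-layer entries; keep the originals.

    Per-layer entries are assumed to be dict-valued, e.g.
    {"model.layers.0.mlp": {"bits": 6, "group_size": 64}}.
    """
    expanded = dict(quantization)
    for key, value in quantization.items():
        if isinstance(value, dict) and not key.startswith(prefix):
            # Do not overwrite an entry that already uses the VLM path.
            expanded.setdefault(prefix + key, value)
    return expanded
```

For a config with only global `bits` and `group_size`, the function returns an equal dict, which matches the no-op behaviour described above.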
@a4501150 a4501150 deleted the pr/vlm-load-fixes branch May 20, 2026 15:24