refactor(aliases): explicit quantization suffix on every alias (BREAKING) #547
Merged
…uffix
BREAKING — every legacy short alias has been renamed to its canonical
explicit form. ``rapid-mlx serve qwen3.5-4b`` no longer works; use
``rapid-mlx serve qwen3.5-4b-4bit``. ``rapid-mlx models`` lists the
72 new names.
Naming template (now documented in README "Naming convention"):
<family>-<version>-<params>-<modality?>-<technique?>-<quant>
The quantization suffix is mandatory — mirrors LM Studio's
``…-MLX-4bit`` / ``…-MLX-8bit`` HuggingFace convention so the bit
width is readable off the alias instead of hidden in ``hf_path``.
Aliases renamed: 51 (e.g. ``qwen3.5-4b`` → ``qwen3.5-4b-4bit``,
``gemma-4-12b-qat`` → ``gemma-4-12b-qat-4bit``).
Aliases already explicit: 23 (e.g. ``qwen3.5-4b-8bit``,
``deepseek-v4-flash-2bit``).
Aliases added: 1 (``phi-4-mini-4bit``, separating the 4B mini from
the real Phi-4 14B — see phi4-14b fix below).
Codename aliases dropped: 3
* ``deepseek-v4-flash`` — duplicate hf_path of
``deepseek-v4-flash-8bit``; references now resolve to that name.
* ``gemma4`` — duplicate hf_path of ``gemma-4-12b-qat-4bit``;
references resolve to that name. (The ``gemma4`` *parser*
identifier is untouched — that's the tool/reasoning parser
family, not an alias.)
* ``nemotron-nano`` — duplicate hf_path of ``nemotron-30b-4bit``;
references resolve to that name.
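
The three dropped codenames were all second names for an ``hf_path`` that an explicit alias already covered. A minimal sketch of how such duplicates could be detected is below. It assumes each registry entry is a dict with an ``hf_path`` key (the real ``aliases.json`` schema may differ); ``find_shared_hf_paths`` is a hypothetical helper and the ``example-org/...`` paths are placeholders, not real repositories.

```python
from collections import defaultdict


def find_shared_hf_paths(aliases):
    """Return {hf_path: [alias, ...]} for every hf_path used by two or more aliases."""
    by_path = defaultdict(list)
    for alias, entry in aliases.items():
        by_path[entry["hf_path"]].append(alias)
    return {path: sorted(names) for path, names in by_path.items() if len(names) > 1}


# Hypothetical pre-rename excerpt; hf_path values are placeholders.
legacy = {
    "nemotron-30b-4bit": {"hf_path": "example-org/nemotron-30b-4bit"},
    "nemotron-nano": {"hf_path": "example-org/nemotron-30b-4bit"},
    "qwen3.5-4b-4bit": {"hf_path": "example-org/qwen3.5-4b-4bit"},
}
duplicates = find_shared_hf_paths(legacy)
```

An empty result from a check like this is what makes a reverse lookup from ``hf_path`` to alias unambiguous.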
Schema bug fixed: ``phi4-14b`` was pointing at
``mlx-community/phi-4-mini-instruct-4bit`` (a ~4B model). It is
now ``phi-4-14b-4bit`` pointing at ``mlx-community/phi-4-4bit``
(the real 14B). The old mini target moves to ``phi-4-mini-4bit``.
Non-bit-width quant suffixes formalised: ``-mxfp4``, ``-mxfp4-q8``,
``-dwq``, ``-ud``, ``-3bit``, ``-6bit``, ``-unpacked`` (Bonsai's
no-quantization tier). Picked from HF community / mlx-community /
LM Studio conventions.
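
To illustrate the "quantization suffix is mandatory" rule, here is a small sketch that checks whether an alias ends in one of the formalised suffixes. The suffix list is taken from this PR's description; ``QUANT_SUFFIXES`` and ``quant_suffix`` are hypothetical names, not part of rapid-mlx.

```python
from typing import Optional

# Suffix vocabulary as listed in this PR; longest entries first for readability.
QUANT_SUFFIXES = (
    "-mxfp4-q8", "-unpacked", "-mxfp4",
    "-2bit", "-3bit", "-4bit", "-6bit", "-8bit",
    "-dwq", "-ud",
)


def quant_suffix(alias: str) -> Optional[str]:
    """Return the quant suffix the alias ends with, or None if it has none."""
    for suffix in QUANT_SUFFIXES:
        if alias.endswith(suffix):
            return suffix
    return None


explicit = quant_suffix("qwen3.5-4b-4bit")   # suffix present
legacy = quant_suffix("qwen3.5-4b")          # legacy short form, no suffix
```

A registry-level contract test could assert that ``quant_suffix`` is not ``None`` for every key in the alias file.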
Scope of the sweep (107 files):
* vllm_mlx/aliases.json — the 51 renames + 3 drops + 1 new.
* README.md — new "Naming convention" section + family lineup
table rewritten with the explicit names.
* 100 active source files — tests, docs, scripts, install.sh,
Issue templates, harness/scorecard/latest.md.
* tests/fixtures/generation_configs/*.json — file basenames
renamed alongside.
* harness/baselines/full-qwen3.5-35b.json gains ``-8bit``; the qwen3.6-35b baseline gains ``-4bit``.
* vllm_mlx/cli.py: Alias column widened 22 → 24 to fit the
longest new name (``deepseek-v4-flash-8bit``).
Deliberately NOT swept (historical snapshots whose ``model`` field
records the alias under which the benchmark was originally run —
rewriting them is rewriting history):
* evals/results/*.json (83 files)
* harness/runs/** (timestamped doctor harness runs)
* reports/mhi/*.json (timestamped MHI reports)
* reports/benchmarks/** (README-refresh bench snapshots)
Tools introduced for the rename (gitted so the operation is
reproducible and future renames can reuse the machinery):
* scripts/rename_aliases.py — generates the new aliases.json
+ scripts/rename_map.json from a small declarative rule set.
* scripts/sweep_alias_refs.py — applies the rename map to every
active source file, with the EXCLUDED prefixes above.
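
The sweep described above can be sketched roughly as follows. This is not the code in ``scripts/sweep_alias_refs.py``; it is an illustration under two assumptions: the rename map is a flat ``{legacy: canonical}`` dict, and a legacy name only counts as a match when it is not part of a longer identifier or a repo path. The function names are hypothetical; the excluded prefixes are the four listed in this commit.

```python
import re

EXCLUDED_PREFIXES = (
    "evals/results/",
    "harness/runs/",
    "reports/mhi/",
    "reports/benchmarks/",
)


def is_excluded(relative_path: str) -> bool:
    """True for historical snapshot paths that the sweep must not rewrite."""
    return relative_path.startswith(EXCLUDED_PREFIXES)


def sweep_text(text: str, rename_map: dict) -> str:
    """Replace whole legacy alias names in a single regex pass."""
    if not rename_map:
        return text
    # Longest names first so a short alias never shadows a longer one.
    names = sorted(rename_map, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # Lookbehind rejects names inside a longer token or an org/repo path;
    # lookahead rejects names that continue (e.g. an existing "-8bit" suffix).
    pattern = re.compile(rf"(?<![\w./-])(?:{alternation})(?![\w-])")
    return pattern.sub(lambda match: rename_map[match.group(0)], text)


rename_map = {
    "qwen3.5-4b": "qwen3.5-4b-4bit",
    "nemotron-nano": "nemotron-30b-4bit",
}
before = "serve qwen3.5-4b and qwen3.5-4b-8bit; see evil-org/nemotron-nano"
after = sweep_text(before, rename_map)
```

Because already-suffixed names fail the lookahead, running the sweep twice leaves the text unchanged. The path lookbehind is one possible way to keep third-party repo ids from being rewritten.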
Tests: 4793 passed, 11 skipped, 7 xfailed (no regressions).
Lint: ``ruff check`` clean. Format: ``ruff format`` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #547 validation scorecard
Title: refactor(aliases): explicit quantization suffix on every alias (BREAKING)
Verdict: MERGE-SAFE
Codex BLOCKING (round 1, pr_validate):
- ``tests/test_aliases_contract.py:455-457`` had duplicate
``"nemotron-30b-4bit"`` keys after the sweep collapsed both
``nemotron-30b`` and ``nemotron-nano`` to the same canonical name.
Python silently overwrote the first key with the second, so the
test no longer pinned both pre-rename cases. Collapsed to a single
entry (the post-rename registry now contains only one nemotron
alias).
- ``tests/test_model_profiles_ssot.py:90-91`` had the same
duplicate-in-tuple-iteration pattern. Same fix.
- ``test_reverse_lookup_for_shared_hf_path_is_deterministic`` and
``test_reverse_lookup_handles_deepseek_v4_flash_duplicate`` were
pinning the tie-break for two now-removed duplicate-hf_path pairs
(``nemotron-30b`` / ``nemotron-nano`` and ``deepseek-v4-flash`` /
``deepseek-v4-flash-8bit``). After the rename, no pair of aliases
shares an hf_path, so the tie-break is unreachable from the live
registry. Removed both tests with a comment pointing at the
remaining reverse-lookup mechanism test
(``test_reverse_lookup_index_built_once_after_first_load``).
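
The duplicate-key problem in the first item is easy to miss because Python accepts a dict literal with repeated keys and keeps only the last value. A minimal sketch of a detector using the standard ``ast`` module is below; ``duplicate_literal_keys`` and ``SAMPLE`` are hypothetical, and the sample only mimics the shape of the affected test data.

```python
import ast


def duplicate_literal_keys(source: str):
    """Return (line, key) for each repeated string key in any dict literal."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # node.keys holds None for ``**mapping`` entries; skip non-string keys.
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                if key.value in seen:
                    found.append((key.lineno, key.value))
                seen.add(key.value)
    return found


SAMPLE = '''
cases = {
    "nemotron-30b-4bit": "was nemotron-30b",
    "nemotron-30b-4bit": "was nemotron-nano",
}
'''
repeats = duplicate_literal_keys(SAMPLE)

# The literal itself evaluates without error and silently keeps the last value.
namespace = {}
exec(SAMPLE, namespace)
surviving = namespace["cases"]
```

Running a check like this over the swept test files after a bulk rename would flag collapsed keys before they weaken a test.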
Lint (ruff check + ruff format):
- Auto-fixable F401 / F541 / I001 across the 4 PR-touched scripts
(``bench_engine_parity.py``, ``bench_readme_refresh.py``,
``local_bench_vs_ollama.py``, ``mhi_eval.py``). These were
pre-existing issues the sweep re-surfaced.
- Manual fixes inside ``scripts/mhi_eval.py``:
* ``from tau_bench.types import EnvRunResult`` is an availability
probe — annotated ``# noqa: F401``.
* E741 single-letter ``l`` rebound to ``ch``.
- Ruff format applied to the 7 touched .py files.
Full-unit (e2e):
- ``test_weather_with_fallback`` + ``test_multi_step_tool_chain``
failed once on the initial run, both passed on local rerun
against the same live qwen3.5-4b-4bit server. These are
model-behaviour tests (which tool name the model picks for a
given prompt) and are inherently flaky; the failures are not caused by this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex BLOCKING (round 2, pr_validate):
- ``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:702``:
the sweep rewrote the canonical spoofing example
``evil-org/gpt-oss-20b`` into ``evil-org/gpt-oss-20b-mxfp4-q8``. The
spoof shape (third-party org publishing under the same bare repo name
OpenAI uses) is the exact case this matcher must reject; the
alias-suffixed variant tests a strictly easier case. Restored the
canonical form. (Adding the suffixed variant separately would be
redundant: the matcher already covers the broader shape.)
- Same file, line 716: the sweep also rewrote ``openai/gpt-oss-20b``
(OpenAI's real bare repo id on HuggingFace) into the rapid-mlx alias
``openai/gpt-oss-20b-mxfp4-q8``. The bare repo id is what the matcher
actually sees from upstream tokenizers, so dropping the unsuffixed
form would have left a gap. Restored the bare repo id and the
bare-suffix variants of the local-path examples
(``/models/gpt-oss-20b``, etc.) alongside the alias-suffixed cases
the sweep wrote.
Codex NIT (round 2, pr_validate):
- ``harness/README.md:104`` previously said the ``full`` tier's two
baselines are "both 8-bit". After the rename, ``qwen3.6-35b`` is the
4-bit variant. Reworded to call out each model's quant.
- ``scripts/rename_aliases.py``: the ``dropped`` counter always
printed 0 because dropped codename aliases store their redirect
target as a non-None string in ``rename_map``. Reworked the three
counters (renamed / dropped / kept) to compute from the input data's
perspective so they always sum back to the input alias count and
``dropped`` is the real number of MANUAL ``drop=True`` specs the
script processed. Verified: against ``main``'s aliases.json the
script prints ``48 renamed, 3 dropped, 23 kept`` (= 74).
- ``scripts/sweep_alias_refs.py``: the comment promised a
"hand-written pass below" for ``gemma4`` that did not exist (the
sweep deliberately leaves every ``gemma4`` occurrence alone because
the literal is also the parser ID). Reworked the comment to make the
no-op intent and reason explicit so a future maintainer doesn't hunt
for a missing implementation.
Tests: 4789 passed, 11 skipped, 7 xfailed. ``test_simple_exec`` /
``test_multi_step_tool_chain`` flaked again (model-behaviour pick of
tool name varies run-to-run); rerunning against the same live server
passes both, same as round 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
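
The counter fix for ``scripts/rename_aliases.py`` amounts to classifying every input alias exactly once. The sketch below shows that idea only; it assumes the map holds one entry per input alias (identity entries for kept names) and that drops are tracked as a separate set. ``summarize`` and both data shapes are hypothetical, not the script's actual structures.

```python
def summarize(rename_map: dict, dropped: set) -> dict:
    """Count renamed / dropped / kept so the three always sum to the input size."""
    counts = {"renamed": 0, "dropped": 0, "kept": 0}
    for old, new in rename_map.items():
        if old in dropped:
            # A dropped alias still has a redirect target, so testing the
            # target for None would never count it.
            counts["dropped"] += 1
        elif old == new:
            counts["kept"] += 1
        else:
            counts["renamed"] += 1
    assert sum(counts.values()) == len(rename_map)
    return counts


rename_map = {
    "qwen3.5-4b": "qwen3.5-4b-4bit",        # renamed
    "qwen3.5-4b-8bit": "qwen3.5-4b-8bit",   # already explicit, kept
    "nemotron-nano": "nemotron-30b-4bit",   # dropped codename with redirect
}
counts = summarize(rename_map, dropped={"nemotron-nano"})
```

Checking the drop set first is the design point: classification depends on how the input was specified, not on what the output value looks like.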
Codex round-3 NIT: the checked-in ``rename_map.json`` was the result
of an idempotent rerun against the already-renamed ``aliases.json``,
so every entry was an identity mapping (e.g. ``qwen3.5-4b-4bit`` →
``qwen3.5-4b-4bit``). A maintainer running
``scripts/sweep_alias_refs.py`` from a pre-rename checkout to verify
the operation is reproducible would see the sweep do nothing because
no legacy name (``qwen3.5-4b``, ``gemma4``, ``nemotron-nano``, …) was
in the map. Regenerated from ``main``'s ``vllm_mlx/aliases.json`` so
the map now contains the real 74-entry legacy → canonical mapping
plus the three dropped codename redirects (``deepseek-v4-flash`` →
``deepseek-v4-flash-8bit``, ``gemma4`` → ``gemma-4-12b-qat-4bit``,
``nemotron-nano`` → ``nemotron-30b-4bit``). The current
``aliases.json`` is untouched; only the auxiliary map file used by
the sweep tool changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… multi-tool
PydanticAI defaults max_tokens to ~1024. On verbose 4B-class models
(qwen3.5-4b-4bit) the multi-turn and sequential-tool-call test paths
spill past the cap and PydanticAI raises
``Model token limit (provider default) exceeded`` before any response
is generated.
That ceiling is a client-side default, not a rapid-mlx server contract,
so the SDK integration test should bypass it: pass
``model_settings={"max_tokens": 2048}`` on tests 5 and 6.
release-check-m3 G7 PydanticAI now 6/6 PASS on qwen3.5-4b-4bit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Major release — every alias in ``vllm_mlx/aliases.json`` now carries
an explicit quantization suffix (``-4bit`` / ``-8bit`` / ``-mxfp4`` /
``-dwq`` / etc.), no implicit-quant short forms remain. The three
legacy codename aliases (``deepseek-v4-flash``, ``gemma4``,
``nemotron-nano``) were dropped; the ``phi4-14b`` schema bug (name
claimed 14B but hf_path pointed at phi-4-mini ~4B) was fixed by
renaming to ``phi-4-14b-4bit`` AND swapping hf_path to the real Phi-4
14B; ``phi-4-mini-4bit`` was added to preserve the small-model entry.
README now documents the 6-segment naming template
``<family>-<version>-<params>-<modality?>-<technique?>-<quant>`` and
the canonical quant-suffix table.
Total: 74 → 72 aliases. Old short names are not deprecated; they are
simply gone, per user direction ("there aren't many users").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Every alias in ``aliases.json`` now carries an explicit quantization suffix: ``qwen3.5-4b-4bit`` instead of ``qwen3.5-4b``, ``gemma-4-12b-qat-4bit`` instead of ``gemma-4-12b-qat``. The bit width is readable off the alias rather than hidden in ``hf_path``.
BREAKING: legacy short names no longer resolve. Migration: append the canonical quant suffix (``rapid-mlx models`` lists the new names).

Naming template
Documented in a new README "Naming convention" section:
<family>-<version>-<params>-<modality?>-<technique?>-<quant>
Mirrors LM Studio's ``…-MLX-4bit`` / ``…-MLX-8bit`` HuggingFace convention. Quant suffix is mandatory; technique (``-qat``, ``-distill``) stacks before quant.
Quant vocabulary formalised: ``-4bit``, ``-8bit``, ``-2bit``, ``-3bit``, ``-6bit``, ``-mxfp4``, ``-mxfp4-q8``, ``-dwq``, ``-ud``, ``-unpacked``.

Schema bug fixed
``phi4-14b`` was pointing at ``mlx-community/phi-4-mini-instruct-4bit`` (a ~4B model). Now:
* ``phi-4-14b-4bit`` → ``mlx-community/phi-4-4bit`` (real 14B)
* ``phi-4-mini-4bit`` → ``mlx-community/phi-4-mini-instruct-4bit`` (new alias for the mini)

Codename aliases dropped (duplicate hf_path of an explicit entry)
* ``deepseek-v4-flash`` → ``deepseek-v4-flash-8bit``
* ``gemma4`` → ``gemma-4-12b-qat-4bit``
* ``nemotron-nano`` → ``nemotron-30b-4bit``
The ``gemma4`` parser identifier (registered in ``gemma4_tool_parser.py``) is untouched: that's a parser family, not an alias.

Sweep scope
Deliberately not swept (historical snapshots whose ``model`` field records the alias the bench was originally run under):
* ``evals/results/*.json`` (83 files)
* ``harness/runs/**``
* ``reports/mhi/*.json``
* ``reports/benchmarks/**``

Tooling (gitted so the operation is reproducible)
* ``scripts/rename_aliases.py``: regenerates ``aliases.json`` + ``rename_map.json`` from a small rule set.
* ``scripts/sweep_alias_refs.py``: applies the rename map to every active file with the exclusion list above.

Test plan
* ``pytest tests/`` → 4793 passed, 11 skipped, 7 xfailed (no regressions)
* ``ruff check`` clean
* ``ruff format`` clean
* ``rapid-mlx models`` lists 72 aliases, all explicit
* ``rapid-mlx info gemma-4-12b-qat-4bit`` resolves to ``mlx-community/gemma-4-12B-it-qat-4bit``
* ``rapid-mlx info phi-4-14b-4bit`` resolves to ``mlx-community/phi-4-4bit`` (real 14B)
* ``pr_validate`` SOP run (round 1 + 2)
* ``make check`` SKIPPED: alias rename is surface-touching, not inference-path

🤖 Generated with Claude Code