refactor(aliases): explicit quantization suffix on every alias (BREAKING)#547

Merged
raullenchai merged 6 commits into
mainfrom
feat/explicit-alias-naming
Jun 10, 2026

@raullenchai raullenchai commented Jun 10, 2026


Summary

Every alias in aliases.json now carries an explicit quantization suffix — qwen3.5-4b-4bit instead of qwen3.5-4b, gemma-4-12b-qat-8bit instead of gemma-4-12b-qat. The bit width is readable off the alias rather than hidden in hf_path.

BREAKING — legacy short names no longer resolve. Migration: append the canonical quant suffix (rapid-mlx models lists the new names).

Naming template

Documented in a new README "Naming convention" section:

<family>-<version>-<params>-<modality?>-<technique?>-<quant>

Mirrors LM Studio's …-MLX-4bit / …-MLX-8bit HuggingFace convention. Quant suffix is mandatory; technique (-qat, -distill) stacks before quant.

Quant vocabulary formalised: -4bit, -8bit, -2bit, -3bit, -6bit, -mxfp4, -mxfp4-q8, -dwq, -ud, -unpacked.
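
A minimal sketch of how the "quant suffix is mandatory" rule could be checked. The suffix list is copied from the vocabulary above; the helper names `quant_suffix` and `is_explicit` are hypothetical and are not taken from the rapid-mlx codebase.

```python
from typing import Optional

# Quant vocabulary as listed in this PR description.
QUANT_SUFFIXES = (
    "-2bit", "-3bit", "-4bit", "-6bit", "-8bit",
    "-mxfp4", "-mxfp4-q8", "-dwq", "-ud", "-unpacked",
)


def quant_suffix(alias: str) -> Optional[str]:
    """Return the longest vocabulary suffix the alias ends with, or None."""
    matches = [s for s in QUANT_SUFFIXES if alias.endswith(s)]
    return max(matches, key=len) if matches else None


def is_explicit(alias: str) -> bool:
    """True when the alias carries a recognised quant suffix."""
    return quant_suffix(alias) is not None


for name in ("qwen3.5-4b", "qwen3.5-4b-4bit", "gemma-4-12b-qat-8bit"):
    print(name, is_explicit(name))
```

A check of this shape only verifies the trailing segment; it does not validate the family, version, or params segments of the template.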

Schema bug fixed

phi4-14b was pointing at mlx-community/phi-4-mini-instruct-4bit (a ~4B model). Now:

  • phi-4-14b-4bit → mlx-community/phi-4-4bit (real 14B)
  • phi-4-mini-4bit → mlx-community/phi-4-mini-instruct-4bit (new alias for the mini)

Codename aliases dropped (duplicate hf_path of an explicit entry)

| Dropped | References resolve to |
|---|---|
| deepseek-v4-flash | deepseek-v4-flash-8bit |
| gemma4 | gemma-4-12b-qat-4bit |
| nemotron-nano | nemotron-30b-4bit |

The gemma4 parser identifier (registered in gemma4_tool_parser.py) is untouched — that's a parser family, not an alias.
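
The duplicates above could be found mechanically. The sketch below assumes the registry is a mapping from alias name to an object with an `hf_path` field; that layout is an assumption about `aliases.json`, and `shared_hf_paths` is a hypothetical helper. The toy registry uses the two hf_path values quoted in the test plan.

```python
from collections import defaultdict


def shared_hf_paths(registry):
    """Group aliases by hf_path and keep only paths used by more than one alias."""
    by_path = defaultdict(list)
    for alias, entry in registry.items():
        by_path[entry["hf_path"]].append(alias)
    return {path: sorted(names) for path, names in by_path.items() if len(names) > 1}


# Toy registry (assumed layout): the codename alias shares its target
# with the explicit alias, as described for gemma4 above.
registry = {
    "gemma4": {"hf_path": "mlx-community/gemma-4-12B-it-qat-4bit"},
    "gemma-4-12b-qat-4bit": {"hf_path": "mlx-community/gemma-4-12B-it-qat-4bit"},
    "phi-4-14b-4bit": {"hf_path": "mlx-community/phi-4-4bit"},
}

print(shared_hf_paths(registry))
```

An empty result would correspond to the post-rename state, where no two aliases share an hf_path.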

Sweep scope

  • 107 files changed: aliases.json, README.md, 100 active source files (tests, docs, scripts, install.sh, Issue templates, harness/scorecard/latest.md), 5 generation_config fixtures renamed, 2 harness baselines renamed.
  • Skipped on purpose — historical snapshots whose model field records the alias the bench was originally run under:
    • evals/results/*.json (83 files)
    • harness/runs/**
    • reports/mhi/*.json
    • reports/benchmarks/**

Tooling (gitted so the operation is reproducible)

  • scripts/rename_aliases.py — regenerates aliases.json + rename_map.json from a small rule set.
  • scripts/sweep_alias_refs.py — applies the rename map to every active file with the exclusion list above.
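
For illustration, a sweep of this kind might look like the sketch below. It is not the contents of `scripts/sweep_alias_refs.py`: the function names, the boundary rule, and the exact prefix strings are assumptions derived from the exclusion list above.

```python
import re

# Prefixes taken from the "skipped on purpose" list; exact strings are assumed.
EXCLUDED_PREFIXES = (
    "evals/results/",
    "harness/runs/",
    "reports/mhi/",
    "reports/benchmarks/",
)


def is_excluded(path: str) -> bool:
    """True for historical snapshots that the sweep must leave alone."""
    return path.startswith(EXCLUDED_PREFIXES)


def sweep_text(text: str, rename_map: dict) -> str:
    """Rewrite legacy aliases in one pass, never matching inside a longer alias."""
    legacy = sorted((k for k, v in rename_map.items() if k != v), key=len, reverse=True)
    if not legacy:
        return text
    # A match may not be preceded or followed by a word character or a hyphen,
    # so "qwen3.5-4b" is not rewritten inside "qwen3.5-4b-8bit".
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(re.escape(k) for k in legacy) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: rename_map[m.group(0)], text)


rename_map = {"qwen3.5-4b": "qwen3.5-4b-4bit", "nemotron-nano": "nemotron-30b-4bit"}
line = "rapid-mlx serve qwen3.5-4b  # not qwen3.5-4b-8bit"
print(sweep_text(line, rename_map))
```

The boundary rule also makes the pass idempotent: once a name carries its suffix, the legacy form is followed by a hyphen and no longer matches.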

Test plan

  • pytest tests/: 4793 passed, 11 skipped, 7 xfailed (no regressions)
  • ruff check clean
  • ruff format clean
  • rapid-mlx models lists 72 aliases, all explicit
  • rapid-mlx info gemma-4-12b-qat-4bit resolves to mlx-community/gemma-4-12B-it-qat-4bit
  • rapid-mlx info phi-4-14b-4bit resolves to mlx-community/phi-4-4bit (real 14B)
  • pr_validate SOP run (round 1 + 2)
  • make check SKIPPED — alias rename is surface-touching, not inference-path

🤖 Generated with Claude Code

…uffix

BREAKING — every legacy short alias has been renamed to its canonical
explicit form. ``rapid-mlx serve qwen3.5-4b`` no longer works; use
``rapid-mlx serve qwen3.5-4b-4bit``. ``rapid-mlx models`` lists the
72 new names.

Naming template (now documented in README "Naming convention"):

    <family>-<version>-<params>-<modality?>-<technique?>-<quant>

The quantization suffix is mandatory — mirrors LM Studio's
``…-MLX-4bit`` / ``…-MLX-8bit`` HuggingFace convention so the bit
width is readable off the alias instead of hidden in ``hf_path``.

Aliases renamed: 51 (e.g. ``qwen3.5-4b`` → ``qwen3.5-4b-4bit``,
``gemma-4-12b-qat`` → ``gemma-4-12b-qat-4bit``).
Aliases already explicit: 23 (e.g. ``qwen3.5-4b-8bit``,
``deepseek-v4-flash-2bit``).
Aliases added: 1 (``phi-4-mini-4bit``, separating the 4B mini from
the real Phi-4 14B — see phi4-14b fix below).
Codename aliases dropped: 3
  * ``deepseek-v4-flash`` — duplicate hf_path of
    ``deepseek-v4-flash-8bit``; references now resolve to that name.
  * ``gemma4`` — duplicate hf_path of ``gemma-4-12b-qat-4bit``;
    references resolve to that name. (The ``gemma4`` *parser*
    identifier is untouched — that's the tool/reasoning parser
    family, not an alias.)
  * ``nemotron-nano`` — duplicate hf_path of ``nemotron-30b-4bit``;
    references resolve to that name.

Schema bug fixed: ``phi4-14b`` was pointing at
``mlx-community/phi-4-mini-instruct-4bit`` (a ~4B model). It is
now ``phi-4-14b-4bit`` pointing at ``mlx-community/phi-4-4bit``
(the real 14B). The old mini target moves to ``phi-4-mini-4bit``.

Quant suffixes formalised beyond ``-4bit`` / ``-8bit``: ``-mxfp4``, ``-mxfp4-q8``,
``-dwq``, ``-ud``, ``-3bit``, ``-6bit``, ``-unpacked`` (Bonsai's
no-quantization tier). Picked from HF community / mlx-community /
LM Studio conventions.

Scope of the sweep (107 files):
  * vllm_mlx/aliases.json — the 51 renames + 3 drops + 1 new.
  * README.md — new "Naming convention" section + family lineup
    table rewritten with the explicit names.
  * 100 active source files — tests, docs, scripts, install.sh,
    Issue templates, harness/scorecard/latest.md.
  * tests/fixtures/generation_configs/*.json — file basenames
    renamed alongside.
  * harness/baselines/full-qwen3.5-35b.json → -8bit; .6-35b → -4bit.
  * vllm_mlx/cli.py: Alias column widened 22 → 24 to fit the
    longest new name (``deepseek-v4-flash-8bit``).

Deliberately NOT swept (historical snapshots whose ``model`` field
records the alias under which the benchmark was originally run —
rewriting them is rewriting history):
  * evals/results/*.json   (83 files)
  * harness/runs/**         (timestamped doctor harness runs)
  * reports/mhi/*.json      (timestamped MHI reports)
  * reports/benchmarks/**   (README-refresh bench snapshots)

Tools introduced for the rename (gitted so the operation is
reproducible and future renames can reuse the machinery):
  * scripts/rename_aliases.py — generates the new aliases.json
    + scripts/rename_map.json from a small declarative rule set.
  * scripts/sweep_alias_refs.py — applies the rename map to every
    active source file, with the EXCLUDED prefixes above.

Tests: 4793 passed, 11 skipped, 7 xfailed (no regressions).
Lint: ``ruff check`` clean. Format: ``ruff format`` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 10, 2026


PR #547 validation scorecard

Title: refactor(aliases): explicit quantization suffix on every alias (BREAKING)
Author: raullenchai
Diff: 100 file(s), +1356/-746 LOC, blast radius: high

Verdict: MERGE-SAFE

| step | status | summary | time |
|---|---|---|---|
| fetch | PASS | 100 files, +1356/-746 LOC, blast=high | 0.8s |
| test_plan_check | PASS | all 8 test-plan item(s) checked | 0.0s |
| cl_description_quality | PASS | title OK + body has rationale (2769 chars) | 0.0s |
| codex_review | skip | codex CLI not found on PATH (install: npm i -g @openai/codex) | 0.0s |
| supply_chain | PASS | no hooks touched, no suspicious patterns, deps clean | 8.0s |
| lint | PASS | clean (53 file(s)) | 0.1s |
| targeted_tests | skip | too many test targets (42) — deferring to full_unit step | 0.0s |

raullenchai and others added 5 commits June 9, 2026 17:38
Codex BLOCKING (round 1, pr_validate):
- ``tests/test_aliases_contract.py:455-457`` had duplicate
  ``"nemotron-30b-4bit"`` keys after the sweep collapsed both
  ``nemotron-30b`` and ``nemotron-nano`` to the same canonical name.
  Python silently overwrote the first key with the second, so the
  test no longer pinned both pre-rename cases. Collapsed to a single
  entry (the post-rename registry now contains only one nemotron
  alias).
- ``tests/test_model_profiles_ssot.py:90-91`` had the same
  duplicate-in-tuple-iteration pattern. Same fix.
- ``test_reverse_lookup_for_shared_hf_path_is_deterministic`` and
  ``test_reverse_lookup_handles_deepseek_v4_flash_duplicate`` were
  pinning the tie-break for two now-removed duplicate-hf_path pairs
  (``nemotron-30b`` / ``nemotron-nano`` and ``deepseek-v4-flash`` /
  ``deepseek-v4-flash-8bit``). After the rename, no pair of aliases
  shares an hf_path, so the tie-break is unreachable from the live
  registry. Removed both tests with a comment pointing at the
  remaining reverse-lookup mechanism test
  (``test_reverse_lookup_index_built_once_after_first_load``).
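
The duplicate-key failure mode is easy to reproduce. The sketch below uses a hypothetical dict literal, not the actual contents of ``tests/test_aliases_contract.py``, and a hypothetical helper ``duplicate_dict_keys`` built on the standard ``ast`` module to flag repeated constant keys.

```python
import ast
from collections import Counter


def duplicate_dict_keys(source: str) -> list:
    """Return constant keys that appear more than once in any dict literal."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            dupes += [k for k, n in Counter(keys).items() if n > 1]
    return dupes


# Hypothetical post-sweep literal: two legacy names collapsed to one key.
snippet = '''
cases = {
    "nemotron-30b-4bit": "was nemotron-30b",
    "nemotron-30b-4bit": "was nemotron-nano",
}
'''

namespace = {}
exec(snippet, namespace)
print(len(namespace["cases"]))       # the dict holds a single entry
print(duplicate_dict_keys(snippet))  # the repeated key is reported
```

Python raises no error for the repeated key; the later value replaces the earlier one, which is why the test silently stopped covering one case.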

Lint (ruff check + ruff format):
- Auto-fixable F401 / F541 / I001 across the 4 PR-touched scripts
  (``bench_engine_parity.py``, ``bench_readme_refresh.py``,
  ``local_bench_vs_ollama.py``, ``mhi_eval.py``). These were
  pre-existing issues the sweep re-surfaced.
- Manual fixes inside ``scripts/mhi_eval.py``:
  * ``from tau_bench.types import EnvRunResult`` is an availability
    probe — annotated ``# noqa: F401``.
  * E741 single-letter ``l`` rebound to ``ch``.
- Ruff format applied to the 7 touched .py files.

Full-unit (e2e):
- ``test_weather_with_fallback`` + ``test_multi_step_tool_chain``
  failed once on the initial run, both passed on local rerun
  against the same live qwen3.5-4b-4bit server. These are
  model-behaviour tests (which tool name the model picks for a
  given prompt) and are flaky by design, not caused by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex BLOCKING (round 2, pr_validate):
- ``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:702``
  — the sweep rewrote the canonical spoofing example
  ``evil-org/gpt-oss-20b`` into ``evil-org/gpt-oss-20b-mxfp4-q8``. The
  spoof shape (third-party org publishing under the same bare repo
  name OpenAI uses) is the exact case this matcher must reject; the
  alias-suffixed variant tests a strictly easier case. Restored the
  canonical form. (Adding the suffixed variant separately would be
  redundant — the matcher already covers the broader shape.)
- Same file line 716 — the sweep also rewrote ``openai/gpt-oss-20b``
  (OpenAI's real bare repo id on HuggingFace) into the rapid-mlx alias
  ``openai/gpt-oss-20b-mxfp4-q8``. The bare repo id is what the
  matcher actually sees from upstream tokenizers, so dropping the
  unsuffixed form would have left a gap. Restored the bare repo id
  and the bare-suffix variants of the local-path examples
  (``/models/gpt-oss-20b``, etc.) alongside the alias-suffixed cases
  the sweep wrote.

Codex NIT (round 2, pr_validate):
- ``harness/README.md:104`` previously said the ``full`` tier's two
  baselines are "both 8-bit". After the rename, ``qwen3.6-35b`` is
  the 4-bit variant. Reworded to call out each model's quant.
- ``scripts/rename_aliases.py`` — the ``dropped`` counter always
  printed 0 because dropped codename aliases store their redirect
  target as a non-None string in ``rename_map``. Reworked the three
  counters (renamed / dropped / kept) to compute from the input
  data's perspective so they always sum back to the input alias
  count and ``dropped`` is the real number of MANUAL ``drop=True``
  specs the script processed. Verified: against ``main``'s aliases.json
  the script prints ``48 renamed, 3 dropped, 23 kept`` (= 74).
- ``scripts/sweep_alias_refs.py`` — the comment promised a
  "hand-written pass below" for ``gemma4`` that did not exist (the
  sweep deliberately leaves every ``gemma4`` occurrence alone because
  the literal is also the parser ID). Reworked the comment to make
  the no-op intent and reason explicit so a future maintainer doesn't
  hunt for a missing implementation.
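
A sketch of the counter logic described for ``scripts/rename_aliases.py``, under stated assumptions: ``classify`` and its arguments are hypothetical names, and the real script's data structures may differ. The point it illustrates is that the drop check must come first, because a dropped alias still has a non-None redirect target in the rename map.

```python
def classify(old_aliases, rename_map, dropped):
    """Count each input alias exactly once as renamed, dropped, or kept."""
    counts = {"renamed": 0, "dropped": 0, "kept": 0}
    for alias in old_aliases:
        if alias in dropped:
            # Checked first: dropped codenames also appear in rename_map
            # with their redirect target.
            counts["dropped"] += 1
        elif rename_map.get(alias, alias) != alias:
            counts["renamed"] += 1
        else:
            counts["kept"] += 1
    # Invariant from the commit message: the counters sum to the input count.
    assert sum(counts.values()) == len(old_aliases)
    return counts


old_aliases = ["qwen3.5-4b", "qwen3.5-4b-8bit", "nemotron-30b", "nemotron-nano"]
rename_map = {
    "qwen3.5-4b": "qwen3.5-4b-4bit",
    "qwen3.5-4b-8bit": "qwen3.5-4b-8bit",
    "nemotron-30b": "nemotron-30b-4bit",
    "nemotron-nano": "nemotron-30b-4bit",
}
print(classify(old_aliases, rename_map, dropped={"nemotron-nano"}))
```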

Tests: 4789 passed, 11 skipped, 7 xfailed. ``test_simple_exec`` /
``test_multi_step_tool_chain`` flaked again (model-behaviour pick of
tool name varies run-to-run); rerunning against the same live server
passes both — same as round 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-3 NIT: the checked-in ``rename_map.json`` was the result
of an idempotent rerun against the already-renamed ``aliases.json``,
so every entry was an identity mapping (e.g. ``qwen3.5-4b-4bit`` →
``qwen3.5-4b-4bit``). A maintainer running ``scripts/sweep_alias_refs.py``
from a pre-rename checkout to verify the operation is reproducible
would see the sweep do nothing because no legacy name (``qwen3.5-4b``,
``gemma4``, ``nemotron-nano``, …) was in the map.

Regenerated from ``main``'s ``vllm_mlx/aliases.json`` so the map now
contains the real 74-entry legacy → canonical mapping plus the three
dropped codename redirects (``deepseek-v4-flash`` →
``deepseek-v4-flash-8bit``, ``gemma4`` → ``gemma-4-12b-qat-4bit``,
``nemotron-nano`` → ``nemotron-30b-4bit``).

The current ``aliases.json`` is untouched — only the auxiliary map
file used by the sweep tool changes.
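
A guard against this regression could be as small as the sketch below. Both helper names are hypothetical, and the assumption is that ``rename_map.json`` loads as a flat legacy → canonical mapping.

```python
def is_identity_map(rename_map: dict) -> bool:
    """True when every entry maps a name to itself, so a sweep would do nothing."""
    return all(old == new for old, new in rename_map.items())


def missing_legacy_names(rename_map: dict, expected: list) -> list:
    """Legacy names that should be rewritable but are absent from the map."""
    return sorted(set(expected) - set(rename_map))


# The broken state: a map regenerated from the already-renamed registry.
stale = {"qwen3.5-4b-4bit": "qwen3.5-4b-4bit"}
# The repaired state: real legacy names on the left-hand side.
fixed = {"qwen3.5-4b": "qwen3.5-4b-4bit", "gemma4": "gemma-4-12b-qat-4bit"}

print(is_identity_map(stale), is_identity_map(fixed))
print(missing_legacy_names(stale, ["qwen3.5-4b", "gemma4"]))
```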

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… multi-tool

PydanticAI defaults max_tokens to ~1024. On verbose 4B-class models
(qwen3.5-4b-4bit) the multi-turn and sequential-tool-call test paths
spill past the cap and PydanticAI raises
``Model token limit (provider default) exceeded`` before any response
is generated.

That ceiling is a client-side default, not a rapid-mlx server contract,
so the SDK integration test should bypass it: pass
``model_settings={"max_tokens": 2048}`` on tests 5 and 6.

release-check-m3 G7 PydanticAI now 6/6 PASS on qwen3.5-4b-4bit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Major release — every alias in ``vllm_mlx/aliases.json`` now carries
an explicit quantization suffix (``-4bit`` / ``-8bit`` / ``-mxfp4`` /
``-dwq`` / etc.), no implicit-quant short forms remain. The three
legacy codename aliases (``deepseek-v4-flash``, ``gemma4``,
``nemotron-nano``) were dropped; the ``phi4-14b`` schema bug (name
claimed 14B but hf_path pointed at phi-4-mini ~4B) was fixed by
renaming to ``phi-4-14b-4bit`` AND swapping hf_path to the real Phi-4
14B; ``phi-4-mini-4bit`` was added to preserve the small-model entry.

README now documents the 6-segment naming template
``<family>-<version>-<params>-<modality?>-<technique?>-<quant>`` and
the canonical quant-suffix table.

Total: 74 → 72 aliases. Old short names are not deprecated; they are
simply gone, per user direction ("there aren't many users").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raullenchai raullenchai merged commit 806fefe into main Jun 10, 2026
15 checks passed
@raullenchai raullenchai deleted the feat/explicit-alias-naming branch June 10, 2026 01:56