From 8c424a0cfed15cb1286d83845640aa8b76bdc5c7 Mon Sep 17 00:00:00 2001
From: Raullen Chai <raullenchai@gmail.com>
Date: Tue, 9 Jun 2026 17:09:06 -0700
Subject: [PATCH 1/6] refactor(aliases): every alias now carries an explicit
 quantization suffix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

BREAKING — every legacy short alias has been renamed to its canonical
explicit form. ``rapid-mlx serve qwen3.5-4b`` no longer works; use
``rapid-mlx serve qwen3.5-4b-4bit``. ``rapid-mlx models`` lists the
72 new names.

Naming template (now documented in README "Naming convention"):

    <family>-<version>-<params>-<modality?>-<technique?>-<quant>

The quantization suffix is mandatory — mirrors LM Studio's
``…-MLX-4bit`` / ``…-MLX-8bit`` HuggingFace convention so the bit
width is readable off the alias instead of hidden in ``hf_path``.

Aliases renamed: 51 (e.g. ``qwen3.5-4b`` → ``qwen3.5-4b-4bit``,
``gemma-4-12b-qat`` → ``gemma-4-12b-qat-4bit``).
Aliases already explicit: 23 (e.g. ``qwen3.5-4b-8bit``,
``deepseek-v4-flash-2bit``).
Aliases added: 1 (``phi-4-mini-4bit``, separating the 4B mini from
the real Phi-4 14B — see phi4-14b fix below).
Codename aliases dropped: 3
  * ``deepseek-v4-flash`` — duplicate hf_path of
    ``deepseek-v4-flash-8bit``; references now resolve to that name.
  * ``gemma4`` — duplicate hf_path of ``gemma-4-12b-qat-4bit``;
    references resolve to that name. (The ``gemma4`` *parser*
    identifier is untouched — that's the tool/reasoning parser
    family, not an alias.)
  * ``nemotron-nano`` — duplicate hf_path of ``nemotron-30b-4bit``;
    references resolve to that name.

Schema bug fixed: ``phi4-14b`` was pointing at
``mlx-community/phi-4-mini-instruct-4bit`` (a ~4B model). It is
now ``phi-4-14b-4bit`` pointing at ``mlx-community/phi-4-4bit``
(the real 14B). The old mini target moves to ``phi-4-mini-4bit``.

Quant suffixes beyond plain ``-4bit`` / ``-8bit`` formalised:
``-mxfp4``, ``-mxfp4-q8``, ``-dwq``, ``-ud``, ``-3bit``, ``-6bit``,
``-unpacked`` (Bonsai's no-quantization tier). Picked from
HF community / mlx-community / LM Studio conventions.
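
A minimal sketch of how the mandatory suffix could be checked, for
illustration only. ``QUANT_SUFFIXES`` and ``split_quant`` are
hypothetical names, not code from this repository; the suffix list is
assembled from this message and the README "Naming convention" table.

```python
# Hypothetical helper for illustration; not part of rapid-mlx.
# Suffix list assumed from the commit message and README table.
QUANT_SUFFIXES = (
    "2bit", "3bit", "4bit", "6bit", "8bit",
    "mxfp4", "mxfp4-q8", "dwq", "ud", "unpacked",
)


def split_quant(alias: str) -> tuple[str, str]:
    """Split an alias into (stem, quant); raise if the quant is missing."""
    # Check longer suffixes first so a short suffix never shadows a long one.
    for suffix in sorted(QUANT_SUFFIXES, key=len, reverse=True):
        tail = "-" + suffix
        # Require a non-empty stem in front of the suffix.
        if alias.endswith(tail) and len(alias) > len(tail):
            return alias[: -len(tail)], suffix
    raise ValueError(f"alias {alias!r} has no explicit quantization suffix")
```

Under these assumptions ``split_quant("gpt-oss-20b-mxfp4-q8")`` yields
the stem ``gpt-oss-20b`` with quant ``mxfp4-q8``, while a legacy name
such as ``qwen3.5-4b`` raises ``ValueError``.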

Scope of the sweep (107 files):
  * vllm_mlx/aliases.json — the 51 renames + 3 drops + 1 new.
  * README.md — new "Naming convention" section + family lineup
    table rewritten with the explicit names.
  * 100 active source files — tests, docs, scripts, install.sh,
    Issue templates, harness/scorecard/latest.md.
  * tests/fixtures/generation_configs/*.json — file basenames
    renamed alongside.
  * harness/baselines/full-qwen3.5-35b.json → full-qwen3.5-35b-8bit.json;
    full-qwen3.6-35b.json → full-qwen3.6-35b-4bit.json.
  * vllm_mlx/cli.py: Alias column widened 22 → 24 to fit the
    longest new name (``deepseek-v4-flash-8bit``).

Deliberately NOT swept (historical snapshots whose ``model`` field
records the alias under which the benchmark was originally run —
rewriting them is rewriting history):
  * evals/results/*.json   (83 files)
  * harness/runs/**         (timestamped doctor harness runs)
  * reports/mhi/*.json      (timestamped MHI reports)
  * reports/benchmarks/**   (README-refresh bench snapshots)

Tools introduced for the rename (committed so the operation is
reproducible and future renames can reuse the machinery):
  * scripts/rename_aliases.py — generates the new aliases.json
    + scripts/rename_map.json from a small declarative rule set.
  * scripts/sweep_alias_refs.py — applies the rename map to every
    active source file, with the EXCLUDED prefixes above.
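
The sweep described above can be sketched roughly as follows. This is
an assumption-laden illustration, not the contents of
scripts/sweep_alias_refs.py: the function names, the regex, and the
exact prefix strings are hypothetical, with the prefixes inferred from
the "NOT swept" list in this message.

```python
import re

# Assumed from the "Deliberately NOT swept" list; the real script may differ.
EXCLUDED_PREFIXES = (
    "evals/results/",
    "harness/runs/",
    "reports/mhi/",
    "reports/benchmarks/",
)


def is_excluded(path: str) -> bool:
    """True for repo-relative paths under a historical-snapshot prefix."""
    return path.startswith(EXCLUDED_PREFIXES)


def sweep_text(text: str, rename_map: dict[str, str]) -> str:
    """Replace whole-token occurrences of old aliases with their new names."""
    if not rename_map:
        return text
    # Longest old names first, so overlapping names resolve predictably.
    names = sorted(rename_map, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # The lookarounds skip matches embedded in a longer token, e.g. the
    # "qwen3.5-4b" inside "qwen3.5-4b-8bit" or an already-renamed alias.
    pattern = re.compile(rf"(?<![\w.-])(?:{alternation})(?![\w-])")
    return pattern.sub(lambda m: rename_map[m.group(0)], text)
```

Because an old name followed by ``-`` is never matched, running the
sweep twice leaves already-renamed text unchanged. Identifiers that
share a spelling with a dropped alias (such as the ``gemma4`` parser
name) would still need separate handling that this sketch omits.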

Tests: 4793 passed, 11 skipped, 7 xfailed (no regressions).
Lint: ``ruff check`` clean. Format: ``ruff format`` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/ISSUE_TEMPLATE/benchmark_report.yml   |   2 +-
 .github/ISSUE_TEMPLATE/bug_report.yml         |   4 +-
 CONTRIBUTING.md                               |   4 +-
 README.md                                     | 105 ++++++---
 benchmark_all_prompt_lookup.py                |   2 +-
 docs/benchmarks/README.md                     |   4 +-
 docs/benchmarks/image.md                      |   2 +-
 docs/benchmarks/llm.md                        |   8 +-
 docs/development/contributing.md              |   4 +-
 docs/development/pr_merge_sop.md              |   8 +-
 docs/development/releasing.md                 |   6 +-
 docs/getting-started/installation.md          |   4 +-
 docs/getting-started/quickstart.md            |  18 +-
 docs/guides/continuous-batching.md            |   6 +-
 docs/guides/mcp-tools.md                      |   2 +-
 docs/guides/multimodal.md                     |   4 +-
 docs/guides/server.md                         |   4 +-
 docs/reference/cli.md                         |  34 +--
 harness/README.md                             |  20 +-
 ....5-35b.json => full-qwen3.5-35b-8bit.json} |   2 +-
 ....6-35b.json => full-qwen3.6-35b-4bit.json} |   2 +-
 harness/scorecard/latest.md                   |  68 +++---
 install.sh                                    |   8 +-
 pyproject.toml                                |   2 +-
 scripts/bench_all_models.sh                   |   4 +-
 scripts/bench_engine_parity.py                |   8 +-
 scripts/bench_readme_refresh.py               |  16 +-
 scripts/bench_vs_ollama.py                    |   6 +-
 scripts/local_bench_vs_ollama.py              |   6 +-
 scripts/mhi_batch.sh                          |   4 +-
 scripts/mhi_eval.py                           |   2 +-
 scripts/pr_validate/golden_models.yaml        |   8 +-
 scripts/rename_aliases.py                     | 167 ++++++++++++++
 scripts/rename_map.json                       |  76 +++++++
 scripts/run_dogfood_mvp.sh                    |   4 +-
 scripts/sweep_alias_refs.py                   | 215 ++++++++++++++++++
 ...vstral-24b.json => devstral-24b-4bit.json} |   0
 ...mma-3n-e4b.json => gemma-3n-e4b-4bit.json} |   0
 .../{gemma3-27b.json => gemma3-27b-4bit.json} |   0
 .../{glm4.5-air.json => glm4.5-air-4bit.json} |   0
 .../{glm4.7-9b.json => glm4.7-9b-4bit.json}   |   0
 .../test_issue_444_harmony_tool_call_leak.py  |   6 +-
 ...t_issue_448_hermes_function_prefix_leak.py |   2 +-
 ...sue_455_harmony_commentary_tool_channel.py |   2 +-
 ...8_tool_choice_required_harmony_compound.py |   2 +-
 ...est_issue_513_harmony_streamable_parser.py |  16 +-
 tests/test_aliases_contract.py                |  84 +++----
 tests/test_anthropic_stop_sequences.py        |   2 +-
 tests/test_api_validation_bundle.py           |  18 +-
 tests/test_batched_engine_output_router.py    |   2 +-
 tests/test_bench_vs_ollama.py                 |  34 +--
 tests/test_chat_logprobs_channel_routing.py   |   2 +-
 tests/test_chat_route_tool_tag_leak.py        |   2 +-
 tests/test_chat_streaming_spec.py             |   6 +-
 tests/test_cli_argcomplete.py                 |  26 +--
 tests/test_cli_chat.py                        |  50 ++--
 tests/test_cli_info.py                        |   8 +-
 tests/test_cli_models.py                      |  26 +--
 tests/test_dflash_eligibility.py              |  10 +-
 tests/test_dflash_integration.py              |   8 +-
 tests/test_doctor_baseline.py                 |   2 +-
 tests/test_doctor_runner.py                   |   2 +-
 tests/test_engine_router_non_stream.py        |   2 +-
 tests/test_finalize_harmony_raw_text.py       |   2 +-
 tests/test_harmony_parsers.py                 |  12 +-
 tests/test_memory_capacity_check.py           |   6 +-
 tests/test_model_aliases.py                   |  51 ++---
 tests/test_model_profiles_ssot.py             |  48 ++--
 tests/test_postprocessor.py                   |   2 +-
 tests/test_prefix_boundary_path_parity.py     |   4 +-
 tests/test_sampling_params_passthrough.py     |  12 +-
 tests/test_share_cli.py                       |  50 ++--
 tests/test_smoke_matrix.sh                    |   2 +-
 tests/test_stop_string_enforcement.py         |   2 +-
 tests/test_suffix_bench_methodology.py        |   2 +-
 tests/test_suffix_decoding_tier.py            |   4 +-
 tests/test_telemetry_emit.py                  |  20 +-
 tests/test_telemetry_redact.py                |   4 +-
 vllm_mlx/_completion.py                       |   2 +-
 vllm_mlx/_download_gate.py                    |   2 +-
 vllm_mlx/agents/__init__.py                   |   2 +-
 vllm_mlx/agents/profiles/aider.yaml           |   8 +-
 vllm_mlx/agents/profiles/cline.yaml           |   6 +-
 vllm_mlx/agents/profiles/codex.yaml           |   6 +-
 vllm_mlx/agents/profiles/generic.yaml         |   6 +-
 vllm_mlx/agents/profiles/goose.yaml           |   6 +-
 vllm_mlx/agents/profiles/hermes.yaml          |   8 +-
 vllm_mlx/agents/profiles/langchain.yaml       |   6 +-
 vllm_mlx/agents/profiles/openclaude.yaml      |   8 +-
 vllm_mlx/agents/profiles/opencode.yaml        |   8 +-
 vllm_mlx/agents/profiles/openhands.yaml       |   6 +-
 vllm_mlx/agents/profiles/pydanticai.yaml      |   6 +-
 vllm_mlx/agents/profiles/smolagents.yaml      |   6 +-
 vllm_mlx/aliases.json                         | 135 +++++------
 vllm_mlx/api/utils.py                         |   2 +-
 vllm_mlx/cli.py                               |  54 ++---
 vllm_mlx/doctor/__init__.py                   |   4 +-
 vllm_mlx/doctor/baseline.py                   |   2 +-
 vllm_mlx/doctor/cli.py                        |   8 +-
 vllm_mlx/engine/batched.py                    |   8 +-
 vllm_mlx/model_aliases.py                     |  30 +--
 vllm_mlx/model_auto_config.py                 |   2 +-
 vllm_mlx/output_router_harmony.py             |   6 +-
 vllm_mlx/reasoning/harmony_parser.py          |   2 +-
 vllm_mlx/routes/anthropic.py                  |   2 +-
 vllm_mlx/runtime/model_registry.py            |   6 +-
 vllm_mlx/service/postprocessor.py             |   4 +-
 vllm_mlx/share/cli.py                         |   8 +-
 vllm_mlx/telemetry/redact.py                  |   4 +-
 vllm_mlx/tool_parsers/harmony_tool_parser.py  |   2 +-
 110 files changed, 1098 insertions(+), 639 deletions(-)
 rename harness/baselines/{full-qwen3.5-35b.json => full-qwen3.5-35b-8bit.json} (95%)
 rename harness/baselines/{full-qwen3.6-35b.json => full-qwen3.6-35b-4bit.json} (95%)
 create mode 100644 scripts/rename_aliases.py
 create mode 100644 scripts/rename_map.json
 create mode 100644 scripts/sweep_alias_refs.py
 rename tests/fixtures/generation_configs/{devstral-24b.json => devstral-24b-4bit.json} (100%)
 rename tests/fixtures/generation_configs/{gemma-3n-e4b.json => gemma-3n-e4b-4bit.json} (100%)
 rename tests/fixtures/generation_configs/{gemma3-27b.json => gemma3-27b-4bit.json} (100%)
 rename tests/fixtures/generation_configs/{glm4.5-air.json => glm4.5-air-4bit.json} (100%)
 rename tests/fixtures/generation_configs/{glm4.7-9b.json => glm4.7-9b-4bit.json} (100%)
diff --git a/.github/ISSUE_TEMPLATE/benchmark_report.yml b/.github/ISSUE_TEMPLATE/benchmark_report.yml
index c10869e2..a4df4527 100644
--- a/.github/ISSUE_TEMPLATE/benchmark_report.yml
+++ b/.github/ISSUE_TEMPLATE/benchmark_report.yml
@@ -17,7 +17,7 @@ body:
     id: model
     attributes:
       label: Model
-      placeholder: "e.g., qwen3.5-9b or mlx-community/Qwen3.5-9B-4bit"
+      placeholder: "e.g., qwen3.5-9b-4bit or mlx-community/Qwen3.5-9B-4bit"
     validations:
       required: true
   - type: textarea
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml
index e2963ccb..55f2377b 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.yml
+++ b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -44,7 +44,7 @@ body:
     id: model
     attributes:
       label: Model
-      placeholder: "e.g. qwen3.5-4b or mlx-community/Qwen3.5-4B-MLX-4bit"
+      placeholder: "e.g. qwen3.5-4b-4bit or mlx-community/Qwen3.5-4B-MLX-4bit"
     validations:
       required: true
   - type: input
@@ -52,7 +52,7 @@ body:
     attributes:
       label: Full serve command
       description: Every flag matters — `--tool-call-parser`, `--reasoning-parser`, `--enable-prefix-cache`, etc.
-      placeholder: "rapid-mlx serve qwen3.5-4b --enable-auto-tool-choice --tool-call-parser qwen3 ..."
+      placeholder: "rapid-mlx serve qwen3.5-4b-4bit --enable-auto-tool-choice --tool-call-parser qwen3 ..."
     validations:
       required: true
   - type: dropdown
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 84cb8b27..69994968 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -14,7 +14,7 @@ pip install -e .
 pip install pytest ruff        # dev tools for testing and linting
 
 # Start a dev server
-rapid-mlx serve qwen3.5-4b --port 8000
+rapid-mlx serve qwen3.5-4b-4bit --port 8000
 ```
 
 **Requirements:** Python 3.11+, macOS with Apple Silicon (M1/M2/M3/M4).
@@ -132,7 +132,7 @@ The easiest contribution — no model download needed!
 }
 ```
 
-That's it. Find the MLX model on [HuggingFace mlx-community](https://huggingface.co/mlx-community) and add the mapping. Convention: `<family>-<size>` in lowercase (e.g., `qwen3.5-9b`, `gemma-4-26b`).
+That's it. Find the MLX model on [HuggingFace mlx-community](https://huggingface.co/mlx-community) and add the mapping. Convention: `<family>-<size>-<quant>` in lowercase, with a mandatory quantization suffix (e.g., `qwen3.5-9b-4bit`, `gemma-4-26b-4bit`).
 
 ## How to Add Parser Auto-Detection
 
diff --git a/README.md b/README.md
index 633e50bc..4403ba4d 100644
--- a/README.md
+++ b/README.md
@@ -82,15 +82,15 @@ curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash
 ```bash
 rapid-mlx chat
 ```
-Defaults to `qwen3.5-4b`. First run downloads the model (~2.5 GB) — you'll see a progress bar. Drops you into a REPL when it's ready. Type `/help` for slash commands, `/exit` to quit. Pass `--think` to surface chain-of-thought.
+Defaults to `qwen3.5-4b-4bit`. First run downloads the model (~2.5 GB) — you'll see a progress bar. Drops you into a REPL when it's ready. Type `/help` for slash commands, `/exit` to quit. Pass `--think` to surface chain-of-thought.
 
 **Step 2b — Or serve a model for use from other apps:**
 ```bash
-rapid-mlx serve qwen3.5-4b
+rapid-mlx serve qwen3.5-4b-4bit
 ```
 Same model, same download — but this starts an OpenAI-compatible HTTP server instead of a REPL. Wait for `Ready: http://localhost:8000/v1`.
 
-> Want vision? `pip install 'rapid-mlx[vision]'` then `rapid-mlx serve gemma-4-26b` (~14 GB).
+> Want vision? `pip install 'rapid-mlx[vision]'` then `rapid-mlx serve gemma-4-26b-4bit` (~14 GB).
 
 **Step 3 — Hit the API** (from a second terminal tab):
 ```bash
@@ -113,7 +113,7 @@ The default chat surface is our hosted Big-AGI fork (tool calling, personas, voi
 
 > **Want a Claude Code-like TUI?** Rapid-MLX is the *backend* — pair it with an open-source agent CLI like [OpenCode](https://github.com/sst/opencode) or [codex](https://github.com/openai/codex) for the full slash-commands / tool-use / multi-turn experience. Run `rapid-mlx agents opencode --setup` (or `codex --setup`) to wire it up automatically.
 
-> **Tip:** Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b` (~5 GB).
+> **Tip:** Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b-4bit` (~5 GB).
 
 <details>
 <summary>More install options</summary>
@@ -221,7 +221,7 @@ Run `rapid-mlx agents` to see all supported agents and `python3 scripts/mhi_eval
 ```
 OpenAI API Base:  http://localhost:8000/v1
 API Key:          not-needed
-Model name:       default          (or qwen3.5-9b — either works)
+Model name:       default          (or qwen3.5-9b-4bit — either works)
 ```
 Cursor's agent/composer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed.
 
@@ -405,27 +405,64 @@ The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monit
 
 > **4bit vs 8bit:** 4bit models are compressed to use less memory (recommended for most users). 8bit models are higher quality but need more RAM. "mxfp4" is a high-quality 4bit format.
 
+### Naming convention
+
+Every alias follows the same template so you can read off the model family, parameter count, training technique, and quantization at a glance:
+
+`<family>-<version>-<params>-<modality?>-<technique?>-<quant>`
+
+| Segment | Meaning | Examples |
+|---|---|---|
+| **family** | Model family | `gemma`, `qwen`, `llama`, `mistral`, `deepseek`, `phi` |
+| **version** | Major version | `-4`, `3.5`, `3.6`, `-r1`, `-v4-flash` |
+| **params** | Parameter count (MoE includes the active count) | `12b`, `27b`, `35b-a3b` (35B total / 3B active) |
+| **modality** *(optional)* | Non-text variants | `-vl` (vision), `-coder` (code) |
+| **technique** *(optional)* | Training-time modifier | `-qat` (Quantization-Aware Training), `-distill`, `-thinking` |
+| **quant** *(mandatory)* | Quantization tier (see below) | `-4bit`, `-8bit`, `-mxfp4`, `-dwq`, … |
+
+The **quantization suffix is mandatory on every alias** — `qwen3.5-4b-4bit` not `qwen3.5-4b`, `gemma-4-12b-qat-4bit` not `gemma-4-12b-qat`. This mirrors LM Studio's `…-MLX-4bit` / `…-MLX-8bit` HuggingFace convention so you never have to guess the bit width.
+
+| Suffix | Meaning |
+|---|---|
+| `-4bit` | Standard MLX 4-bit (most common) |
+| `-8bit` | Standard MLX 8-bit (higher quality, ~2× RAM) |
+| `-2bit`, `-3bit`, `-6bit` | Other bit widths |
+| `-mxfp4` | Microscaling FP4 (high-quality 4-bit) |
+| `-mxfp4-q8` | MXFP4 weights + Q8 head (GPT-OSS style) |
+| `-dwq` | Dynamic Weight Quantization (mlx-community) |
+| `-ud` | Unsloth Dynamic (mixed-precision per-layer) |
+| `-unpacked` | Original FP16 / BF16 weights, no quantization |
+
+`-qat` is a *technique* suffix, not a quant — it stacks before the quant. So a QAT-trained Gemma 4 12B in 4-bit is `gemma-4-12b-qat-4bit`, and the 8-bit variant is `gemma-4-12b-qat-8bit`.
+
+Decoded examples:
+
+- `gemma-4-12b-qat-4bit` = Gemma 4 · 12B params · QAT-trained · 4-bit quant
+- `qwen3.5-35b-8bit` = Qwen 3.5 · 35B params (3B active MoE) · 8-bit quant
+- `gpt-oss-20b-mxfp4-q8` = GPT-OSS · 20B params · MXFP4 weights + Q8 head
+- `bonsai-1.7b-unpacked` = Bonsai · 1.7B params · no quantization
+
 ### Full model lineup
 
-66 short aliases across 13 families ship today. Run `rapid-mlx models` for the live list with quant tier, MoE / hybrid flags, and DFlash eligibility.
+72 explicit aliases across 13 families ship today. Run `rapid-mlx models` for the live list with parser, hybrid / MoE flags, and DFlash eligibility.
 
 <details>
-<summary><strong>Show all 66 aliases by family</strong></summary>
+<summary><strong>Show all 72 aliases by family</strong></summary>
 
 | Family | Aliases | Notable |
 |---|---|---|
-| **Qwen3.5** | `qwen3.5-4b`, `-4b-8bit`, `-9b`, `-9b-8bit`, `-27b`, `-27b-8bit` ✨, `-35b`, `-35b-4bit`, `-122b`, `-122b-8bit` | DeltaNet hybrid; **27b-8bit DFlash-eligible** |
-| **Qwen3.6** | `qwen3.6-27b`, `-27b-8bit` ✨, `-27b-ud`, `-35b`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud` | 262K ctx, 256 MoE experts; **27b-8bit DFlash-eligible** |
-| **Qwen3** | `qwen3-0.6b-8bit`, `-4b-8bit`, `-8b-8bit`, `qwen3-coder`, `qwen3-coder-30b`, `qwen3-vl-4b`, `-8b`, `-30b` | Coding + vision |
-| **Qwopus** | `qwopus-9b`, `qwopus-27b`, `qwopus-27b-8bit` | 92 MHI on tool calling |
-| **DeepSeek** | `deepseek-r1-8b`, `-32b`, `deepseek-v4-flash` (2/4/8-bit) | R1 reasoning + V4 Flash 158B-A13B day-0 |
-| **Gemma** | `gemma-3n-e4b`, `gemma-4-26b`, `-31b`, `-31b-8bit`, `gemma3-1b`, `-12b`, `-27b` | Vision-capable (gemma-4) |
-| **Llama / Hermes** | `llama3-1b`, `-3b`, `llama-3.1-8b-8bit`, `hermes3-8b`, `hermes4-70b` | |
-| **GLM** | `glm4.5-air`, `glm4.7-9b` | |
-| **GPT-OSS** | `gpt-oss-20b` | Harmony native |
-| **MiniMax / Kimi** | `minimax-m2.5`, `minimax-m2.7`, `kimi-48b`, `kimi-k2.5` | |
-| **Mistral / Devstral** | `mistral-24b`, `devstral-24b`, `devstral-v2-24b`, `ministral-3b` | |
-| **Other** | `phi4-14b`, `smollm3-3b`, `nemotron-30b` / `-nano`, `bonsai-1.7b/4b/8b`, `granite4-tiny` | |
+| **Qwen3.5** | `qwen3.5-4b-4bit`, `-4b-8bit`, `-9b-4bit`, `-9b-8bit`, `-27b-4bit`, `-27b-8bit` ✨, `-35b-4bit`, `-35b-8bit`, `-122b-mxfp4`, `-122b-8bit` | DeltaNet hybrid; **27b-8bit DFlash-eligible** |
+| **Qwen3.6** | `qwen3.6-27b-4bit`, `-27b-8bit` ✨, `-27b-ud`, `-35b-4bit`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud` | 262K ctx, 256 MoE experts; **27b-8bit DFlash-eligible** |
+| **Qwen3** | `qwen3-0.6b-8bit`, `-4b-8bit`, `-8b-8bit`, `qwen3-coder-4bit`, `qwen3-coder-30b-4bit`, `qwen3-vl-4b-4bit`, `-8b-4bit`, `-30b-4bit` | Coding + vision |
+| **Qwopus** | `qwopus-9b-4bit`, `qwopus-27b-4bit`, `qwopus-27b-8bit` | 92 MHI on tool calling |
+| **DeepSeek** | `deepseek-r1-8b-4bit`, `-32b-4bit`, `deepseek-v4-flash-2bit`, `-4bit`, `-8bit` | R1 reasoning + V4 Flash 158B-A13B day-0 |
+| **Gemma** | `gemma-3n-e4b-4bit`, `gemma-4-12b-4bit`, `-12b-qat-4bit`, `-12b-qat-8bit`, `-26b-4bit`, `-26b-qat-4bit`, `-31b-4bit`, `-31b-8bit`, `-31b-qat-4bit`, `-31b-qat-8bit`, `gemma3-1b-4bit`, `-12b-4bit`, `-27b-4bit` | Vision-capable; QAT variants |
+| **Llama / Hermes** | `llama3-1b-4bit`, `-3b-4bit`, `llama-3.1-8b-8bit`, `hermes3-8b-4bit`, `hermes4-70b-4bit` | |
+| **GLM** | `glm4.5-air-4bit`, `glm4.7-9b-4bit` | |
+| **GPT-OSS** | `gpt-oss-20b-mxfp4-q8` | Harmony native |
+| **MiniMax / Kimi** | `minimax-m2.5-4bit`, `minimax-m2.7-mxfp4`, `kimi-48b-4bit`, `kimi-k2.5-3bit` | |
+| **Mistral / Devstral** | `mistral-24b-4bit`, `devstral-24b-4bit`, `devstral-v2-24b-4bit`, `ministral-3b-4bit` | |
+| **Other** | `phi-4-14b-4bit`, `phi-4-mini-4bit`, `smollm3-3b-4bit`, `nemotron-30b-4bit`, `bonsai-1.7b-unpacked`, `-4b-unpacked`, `-8b-unpacked`, `granite4-tiny-4bit` | |
 
 ✨ = DFlash speculative decoding supported (opt in with `--enable-dflash`). `rapid-mlx info <alias>` shows per-alias capabilities.
 
@@ -433,38 +470,38 @@ The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monit
 
 ### Copy-paste commands
 
-Pick the one that matches your Mac. Short aliases work — run `rapid-mlx models` to see all available models.
+Pick the one that matches your Mac. Run `rapid-mlx models` to see all available aliases.
 
 ```bash
 # 16 GB — lightweight, fast
-rapid-mlx serve qwen3.5-4b --port 8000
+rapid-mlx serve qwen3.5-4b-4bit --port 8000
 
 # 24 GB — best small model
-rapid-mlx serve qwen3.5-9b --port 8000
+rapid-mlx serve qwen3.5-9b-4bit --port 8000
 
 # 32 GB — solid coding model
-rapid-mlx serve qwen3.5-27b --port 8000
+rapid-mlx serve qwen3.5-27b-4bit --port 8000
 
 # 32 GB — Gemma 4 12B (vision-capable, 64 tok/s)
-rapid-mlx serve gemma-4-12b --port 8000
+rapid-mlx serve gemma-4-12b-4bit --port 8000
 
 # 32 GB — GPT-OSS 20B (harmony-native, 100% tool calling, 119 tok/s)
-rapid-mlx serve gpt-oss-20b --port 8000
+rapid-mlx serve gpt-oss-20b-mxfp4-q8 --port 8000
 
 # 32+ GB — Qwen 3.6 35B-A3B (256 experts, 262K context, 93 tok/s)
-rapid-mlx serve qwen3.6-35b --port 8000
+rapid-mlx serve qwen3.6-35b-4bit --port 8000
 
 # 48+ GB — sweet spot (Qwen3.5-35B-A3B 8bit, 80 tok/s)
-rapid-mlx serve qwen3.5-35b --prefill-step-size 8192 --port 8000  # faster first response
+rapid-mlx serve qwen3.5-35b-8bit --prefill-step-size 8192 --port 8000  # faster first response
 
 # 96+ GB — frontier (Qwen3.5-122B mxfp4)
-rapid-mlx serve qwen3.5-122b --prefill-step-size 8192 --port 8000
+rapid-mlx serve qwen3.5-122b-mxfp4 --prefill-step-size 8192 --port 8000
 
 # Coding agent — fast MoE, great for Claude Code / Cursor
-rapid-mlx serve qwen3-coder --prefill-step-size 8192 --port 8000  # MoE = only uses part of the model, so it's fast
+rapid-mlx serve qwen3-coder-4bit --prefill-step-size 8192 --port 8000  # MoE = only uses part of the model, so it's fast
 
 # Vision — image understanding (see note below)
-rapid-mlx serve qwen3-vl-4b --mllm --port 8000
+rapid-mlx serve qwen3-vl-4b-4bit --mllm --port 8000
 ```
 
 > **Vision deps:** Install into the same environment where rapid-mlx lives:
@@ -530,7 +567,7 @@ Reproduce the throughput table:
 
 ```bash
 python3.12 scripts/bench_readme_refresh.py \
-  --models qwen3.5-4b,qwen3.5-9b,qwen3.5-27b,gemma-4-12b,gpt-oss-20b,qwen3.6-35b,qwen3.5-35b \
+  --models qwen3.5-4b-4bit,qwen3.5-9b-4bit,qwen3.5-27b-4bit,gemma-4-12b-4bit,gpt-oss-20b-mxfp4-q8,qwen3.6-35b-4bit,qwen3.5-35b-8bit \
   --engines rapid-mlx,mlx-lm,ollama
 ```
 
@@ -800,7 +837,7 @@ Rapid-MLX **can** send anonymous usage data to help us prioritise the right mode
 ### What we collect (only if you opt in)
 
 - Subcommand names (`serve` / `chat` / `agents` / `bench` / `doctor`)
-- Model alias names (`qwen3.5-9b`) or canonical HF repo IDs (`mlx-community/...`) — local paths are redacted to `<local>`
+- Model alias names (`qwen3.5-9b-4bit`) or canonical HF repo IDs (`mlx-community/...`) — local paths are redacted to `<local>`
 - Bucketed counts: prompt/completion tokens, TTFT, tokens/sec — never exact values
 - Error categories + a hash fingerprint of the failure site (exception class name + per-frame `file:function:lineno` only — never the message text or absolute paths)
 - OS, arch, Apple chip name, RAM (rounded to GB), Python major.minor
@@ -828,8 +865,8 @@ rapid-mlx telemetry reset      # delete consent + client-id files (re-prompts on
 Either of these always wins, regardless of stored consent:
 
 ```bash
-RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b
-rapid-mlx --no-telemetry serve qwen3.5-9b
+RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b-4bit
+rapid-mlx --no-telemetry serve qwen3.5-9b-4bit
 ```
 
 There is intentionally **no env-var equivalent for force-on** — opting in must be an explicit one-time `rapid-mlx telemetry enable`. CI agents will never silently contribute.
diff --git a/benchmark_all_prompt_lookup.py b/benchmark_all_prompt_lookup.py
index 597a8120..ea0ba06a 100644
--- a/benchmark_all_prompt_lookup.py
+++ b/benchmark_all_prompt_lookup.py
@@ -48,7 +48,7 @@
         False,
     ),
     (
-        "gpt-oss-20b",
+        "gpt-oss-20b-mxfp4-q8",
         "/Users/raullenstudio/.lmstudio/models/mlx-community/gpt-oss-20b-MXFP4-Q8",
         False,
     ),
diff --git a/docs/benchmarks/README.md b/docs/benchmarks/README.md
index 089bf163..13f88756 100644
--- a/docs/benchmarks/README.md
+++ b/docs/benchmarks/README.md
@@ -12,7 +12,7 @@ Performance benchmarks for rapid-mlx on Apple Silicon.
 
 ```bash
 # LLM benchmark — short aliases work
-rapid-mlx bench qwen3.5-4b
+rapid-mlx bench qwen3.5-4b-4bit
 
 # Or by full HF repo (vision/multimodal benches live in scripts/ — they are
 # dev-only and not shipped with `pip install rapid-mlx`)
@@ -55,7 +55,7 @@ Results will vary on different Apple Silicon chips.
 If you have a different Apple Silicon chip, please share your results:
 
 ```bash
-rapid-mlx bench qwen3.5-4b | tee results.txt
+rapid-mlx bench qwen3.5-4b-4bit | tee results.txt
 ```
 
 Open an issue with your results at [GitHub Issues](https://github.com/raullenchai/Rapid-MLX/issues).
diff --git a/docs/benchmarks/image.md b/docs/benchmarks/image.md
index 2ace9771..4c027f78 100644
--- a/docs/benchmarks/image.md
+++ b/docs/benchmarks/image.md
@@ -8,7 +8,7 @@ scripts under `scripts/` (not packaged with `pip install rapid-mlx`) — clone
 the repo if you want to reproduce them.
 
 ```bash
-rapid-mlx serve gemma-4-26b --mllm --port 8000   # then exercise the VLM via /v1/chat/completions
+rapid-mlx serve gemma-4-26b-4bit --mllm --port 8000   # then exercise the VLM via /v1/chat/completions
 ```
 
 ## Results - Qwen3-VL-8B-Instruct-4bit (M4 Max, 128GB)
diff --git a/docs/benchmarks/llm.md b/docs/benchmarks/llm.md
index b9d105a0..9c337a27 100644
--- a/docs/benchmarks/llm.md
+++ b/docs/benchmarks/llm.md
@@ -3,7 +3,7 @@
 ## Running LLM Benchmarks
 
 ```bash
-rapid-mlx bench qwen3.5-4b --num-prompts 5 --max-tokens 256
+rapid-mlx bench qwen3.5-4b-4bit --num-prompts 5 --max-tokens 256
 ```
 
 ## Results (M4 Max, 128GB)
@@ -271,13 +271,13 @@ The streaming detokenizer is **not currently viable** for per-request usage due
 
 ```bash
 # Basic benchmark — short alias works
-rapid-mlx bench qwen3.5-4b
+rapid-mlx bench qwen3.5-4b-4bit
 
 # With more prompts
-rapid-mlx bench qwen3.5-4b --num-prompts 10
+rapid-mlx bench qwen3.5-4b-4bit --num-prompts 10
 
 # Save results
-rapid-mlx bench qwen3.5-4b | tee results.txt
+rapid-mlx bench qwen3.5-4b-4bit | tee results.txt
 
 # Continuous batching test
 python tests/test_continuous_batching.py
diff --git a/docs/development/contributing.md b/docs/development/contributing.md
index 1c096706..b3a91e10 100644
--- a/docs/development/contributing.md
+++ b/docs/development/contributing.md
@@ -73,7 +73,7 @@ ruff format --check .
 
 ```bash
 # LLM benchmark — short alias works
-rapid-mlx bench qwen3.5-4b
+rapid-mlx bench qwen3.5-4b-4bit
 
 # Or by full HF repo
 rapid-mlx bench mlx-community/Qwen3.5-9B-4bit
@@ -109,7 +109,7 @@ See [Architecture](architecture.md) for details on the codebase structure.
 If you have access to different Apple Silicon chips (M1, M2, M3, M4), benchmark results are valuable:
 
 ```bash
-rapid-mlx bench qwen3.5-4b | tee results_m4.txt
+rapid-mlx bench qwen3.5-4b-4bit | tee results_m4.txt
 ```
 
 ## Questions?
diff --git a/docs/development/pr_merge_sop.md b/docs/development/pr_merge_sop.md
index 516e76f7..65bdd803 100644
--- a/docs/development/pr_merge_sop.md
+++ b/docs/development/pr_merge_sop.md
@@ -158,7 +158,7 @@ Skip rule:
 - **Touch inference code** → run, even if it takes ~10 min:
 
   ```bash
-  # make check runs against the default model (qwen3.5-4b) — ~10 min
+  # make check runs against the default model (qwen3.5-4b-4bit) — ~10 min
   make check
   # make full runs across multiple models (~1-2 hr) — only when changes affect generation correctness
   make full
@@ -166,7 +166,7 @@ Skip rule:
   python3 -m vllm_mlx.cli doctor check --model <alias>
   ```
 
-The bar is **0 regressions vs the per-model baseline in `harness/baselines/`** *for models that have committed baselines* (currently `qwen3.5-35b` and `qwen3.6-35b`). For models without baselines, document the chosen ad-hoc reference (e.g., "compared against output on commit X", "manual eyeball vs main"). Pre-existing fails (Test 10 streaming usage, `<|im_end|>` leak, thinking-toggle on qwen3.5-4b) are documented; new fails block merge.
+The bar is **0 regressions vs the per-model baseline in `harness/baselines/`** *for models that have committed baselines* (currently `qwen3.5-35b-8bit` and `qwen3.6-35b-4bit`). For models without baselines, document the chosen ad-hoc reference (e.g., "compared against output on commit X", "manual eyeball vs main"). Pre-existing fails (Test 10 streaming usage, `<|im_end|>` leak, thinking-toggle on qwen3.5-4b-4bit) are documented; new fails block merge.
 
 ## Step 9 — Anthropic-compat round-trip (gated on parser/router PRs)
 
@@ -174,11 +174,11 @@ If the diff touches `vllm_mlx/parsers/`, `vllm_mlx/reasoning/`, `vllm_mlx/routes
 
 ```bash
 # in one shell:
-rapid-mlx serve qwen3.5-4b
+rapid-mlx serve qwen3.5-4b-4bit
 # in another:
 curl -s http://localhost:8000/anthropic/v1/messages \
   -H 'content-type: application/json' \
-  -d '{"model":"qwen3.5-4b","max_tokens":64,"messages":[{"role":"user","content":"say hi"}]}'
+  -d '{"model":"qwen3.5-4b-4bit","max_tokens":64,"messages":[{"role":"user","content":"say hi"}]}'
 ```
 
 Output must be a non-empty Anthropic-shaped response, no `!!!!!!` token-id-0 corruption, no streaming-think misroute. The `/anthropic` surface shares router-level code with `/v1/chat/completions` but diverges at the streaming-think router; multiple historical regressions (#288, #289) shipped with green OpenAI-compat smoke and broken `/anthropic`.
diff --git a/docs/development/releasing.md b/docs/development/releasing.md
index a2eb9fd5..9a1ae5d4 100644
--- a/docs/development/releasing.md
+++ b/docs/development/releasing.md
@@ -2,7 +2,7 @@
 
 This page documents the **end-to-end release flow** and the **safety nets** that catch the common failure modes.
 
-The historical pain point: between v0.6.14 (2026-05-05) and v0.6.16, several PRs added 30+ new model aliases (`granite4-tiny`, `smollm3-3b`, `deepseek-v4-flash`, `qwen3.6-*`, etc), but no version was bumped — leaving brew/PyPI users with a stale `rapid-mlx models` list. The safety nets below are designed to make that exact failure impossible to repeat without explicit human override.
+The historical pain point: between v0.6.14 (2026-05-05) and v0.6.16, several PRs added 30+ new model aliases (`granite4-tiny-4bit`, `smollm3-3b-4bit`, `deepseek-v4-flash-8bit`, `qwen3.6-*`, etc), but no version was bumped — leaving brew/PyPI users with a stale `rapid-mlx models` list. The safety nets below are designed to make that exact failure impossible to repeat without explicit human override.
 
 ## Quick reference
 
@@ -170,8 +170,8 @@ This is the rule. No exceptions. CI doesn't fake-inference with a tiny model on
 ### M3 local — one command before pushing the bump commit
 
 ```bash
-make release-check-m3              # uses MODEL=qwen3.5-4b (default)
-MODEL=qwen3.6-27b make release-check-m3   # override
+make release-check-m3              # uses MODEL=qwen3.5-4b-4bit (default)
+MODEL=qwen3.6-27b-4bit make release-check-m3   # override
 ```
 
 Wrapped by [`scripts/release_check_m3.sh`](../../scripts/release_check_m3.sh). It boots `rapid-mlx serve` once on port 8000, then runs G5 (stress) + G7 (anthropic + pydantic_ai + smolagents) + G6 (parallel-tool-call cap repro) + G9 (10-seq latency) + G8b (parser microbench, M3 perf baseline) sequentially. The server is killed on exit.
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 56ca5e39..2512f28d 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -58,7 +58,7 @@ rapid-mlx version
 rapid-mlx doctor
 
 # Smallest interactive smoke test (downloads ~2.5 GB on first run)
-rapid-mlx chat qwen3.5-4b
+rapid-mlx chat qwen3.5-4b-4bit
 ```
 
 ## Troubleshooting
@@ -81,7 +81,7 @@ huggingface-cli login
 
 Use a smaller quantized model:
 ```bash
-rapid-mlx serve qwen3.5-4b
+rapid-mlx serve qwen3.5-4b-4bit
 ```
 
 ### `brew install` fails with `Operation not permitted`
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index dae94199..d50b19a1 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -3,12 +3,12 @@
 ## Option 1: Interactive Chat (fastest first taste)
 
 The shortest path to talking to a model — `chat` spawns its own server,
-downloads the model on first run (~2.5 GB for the default `qwen3.5-4b`), and
+downloads the model on first run (~2.5 GB for the default `qwen3.5-4b-4bit`), and
 drops you into a REPL.
 
 ```bash
-rapid-mlx chat                  # defaults to qwen3.5-4b
-rapid-mlx chat qwen3.5-9b       # a larger model (5 GB)
+rapid-mlx chat                  # defaults to qwen3.5-4b-4bit
+rapid-mlx chat qwen3.5-9b-4bit  # a larger model (5 GB)
 rapid-mlx chat --think          # surface chain-of-thought reasoning
 ```
 
@@ -21,7 +21,7 @@ In-REPL: `/help`, `/reset`, `/save <path>`, `/model <alias>`, `/exit`. Type
 Start the server:
 
 ```bash
-rapid-mlx serve qwen3.5-4b --port 8000
+rapid-mlx serve qwen3.5-4b-4bit --port 8000
 ```
 
 Use with the OpenAI Python SDK:
@@ -61,7 +61,7 @@ For image / video understanding, use a VLM (requires the `[vision]` extra —
 `pip install 'rapid-mlx[vision]'`):
 
 ```bash
-rapid-mlx serve gemma-4-26b --mllm --port 8000
+rapid-mlx serve gemma-4-26b-4bit --mllm --port 8000
 ```
 
 ```python
@@ -85,7 +85,7 @@ chain-of-thought into a separate `reasoning_content` field, leaving `content`
 clean.
 
 ```bash
-rapid-mlx serve qwen3.5-9b --port 8000   # qwen3 reasoning parser auto-detected
+rapid-mlx serve qwen3.5-9b-4bit --port 8000   # qwen3 reasoning parser auto-detected
 ```
 
 ```python
@@ -103,7 +103,7 @@ Generate text embeddings for semantic search and RAG (install the
 `[embeddings]` extra first):
 
 ```bash
-rapid-mlx serve qwen3.5-4b --embedding-model mlx-community/multilingual-e5-small-mlx
+rapid-mlx serve qwen3.5-4b-4bit --embedding-model mlx-community/multilingual-e5-small-mlx
 ```
 
 ```python
@@ -119,13 +119,13 @@ Tool/function calling is on by default for supported model families (Qwen3.x,
 GLM-4.7, GPT-OSS, Llama, Mistral, etc.) — the right parser is auto-detected:
 
 ```bash
-rapid-mlx serve qwen3.5-9b --port 8000
+rapid-mlx serve qwen3.5-9b-4bit --port 8000
 ```
 
 If you need to pin the parser manually:
 
 ```bash
-rapid-mlx serve devstral-24b \
+rapid-mlx serve devstral-24b-4bit \
   --enable-auto-tool-choice --tool-call-parser hermes
 ```
 
diff --git a/docs/guides/continuous-batching.md b/docs/guides/continuous-batching.md
index 3f517432..5ae6b87e 100644
--- a/docs/guides/continuous-batching.md
+++ b/docs/guides/continuous-batching.md
@@ -7,7 +7,7 @@ for back-compat but is a no-op.
 ## Default Behaviour
 
 ```bash
-rapid-mlx serve qwen3.5-4b
+rapid-mlx serve qwen3.5-4b-4bit
 ```
 
 ## With Paged Cache
@@ -15,7 +15,7 @@ rapid-mlx serve qwen3.5-4b
 For memory-efficient prefix sharing:
 
 ```bash
-rapid-mlx serve qwen3.5-4b --use-paged-cache
+rapid-mlx serve qwen3.5-4b-4bit --use-paged-cache
 ```
 
 ## How It Works
@@ -164,7 +164,7 @@ python tests/test_prefix_cache.py
 ## Production Setup
 
 ```bash
-rapid-mlx serve qwen3.5-9b \
+rapid-mlx serve qwen3.5-9b-4bit \
   --use-paged-cache \
   --port 8000
 ```
diff --git a/docs/guides/mcp-tools.md b/docs/guides/mcp-tools.md
index 68f0f6c9..b906972e 100644
--- a/docs/guides/mcp-tools.md
+++ b/docs/guides/mcp-tools.md
@@ -51,7 +51,7 @@ Create `mcp.json`:
 ### 2. Start Server with MCP
 
 ```bash
-rapid-mlx serve qwen3.5-4b --mcp-config mcp.json
+rapid-mlx serve qwen3.5-4b-4bit --mcp-config mcp.json
 ```
 
 ### 3. Verify MCP Status
diff --git a/docs/guides/multimodal.md b/docs/guides/multimodal.md
index abeb1cfc..a861db04 100644
--- a/docs/guides/multimodal.md
+++ b/docs/guides/multimodal.md
@@ -232,7 +232,7 @@ benches live in the dev-only `scripts/` directory (source checkout only). For
 a quick text-only sanity bench against a VLM, you can still run:
 
 ```bash
-rapid-mlx bench qwen3-vl-4b
+rapid-mlx bench qwen3-vl-4b-4bit
 ```
 
 ## MLLM Cache
@@ -307,7 +307,7 @@ For an interactive multimodal session, start a server and use any OpenAI-
 compatible web UI (Open WebUI, LibreChat, etc.) pointed at it:
 
 ```bash
-rapid-mlx serve qwen3-vl-4b --mllm --port 8000
+rapid-mlx serve qwen3-vl-4b-4bit --mllm --port 8000
 ```
 
 The shipped `rapid-mlx chat` REPL is text-only. The optional Gradio web UI
diff --git a/docs/guides/server.md b/docs/guides/server.md
index 77765a12..12bd1e45 100644
--- a/docs/guides/server.md
+++ b/docs/guides/server.md
@@ -9,7 +9,7 @@ a no-op for back-compat.
 ### Default
 
 ```bash
-rapid-mlx serve qwen3.5-4b --port 8000
+rapid-mlx serve qwen3.5-4b-4bit --port 8000
 ```
 
 Short aliases (see `rapid-mlx models`) work everywhere a model name is
@@ -20,7 +20,7 @@ accepted. Full HuggingFace repo IDs (`mlx-community/...`) work too.
 Memory-efficient caching for production / shared system prompts:
 
 ```bash
-rapid-mlx serve qwen3.5-9b --port 8000 --use-paged-cache
+rapid-mlx serve qwen3.5-9b-4bit --port 8000 --use-paged-cache
 ```
 
 ## Server Options
diff --git a/docs/reference/cli.md b/docs/reference/cli.md
index 47e07540..ca962b83 100644
--- a/docs/reference/cli.md
+++ b/docs/reference/cli.md
@@ -63,37 +63,37 @@ rapid-mlx serve <model> [options]
 
 ```bash
 # Default — continuous batching is on by default; short aliases work
-rapid-mlx serve qwen3.5-4b
+rapid-mlx serve qwen3.5-4b-4bit
 
 # A larger general-purpose model (5 GB)
-rapid-mlx serve qwen3.5-9b --port 8000
+rapid-mlx serve qwen3.5-9b-4bit --port 8000
 
 # Paged KV cache (memory-efficient prefix sharing)
-rapid-mlx serve qwen3.5-9b --use-paged-cache --port 8000
+rapid-mlx serve qwen3.5-9b-4bit --use-paged-cache --port 8000
 
 # With MCP tools
-rapid-mlx serve qwen3.5-9b --mcp-config mcp.json
+rapid-mlx serve qwen3.5-9b-4bit --mcp-config mcp.json
 
 # Multimodal (vision) model — requires the [vision] extra
-rapid-mlx serve gemma-4-26b --mllm
+rapid-mlx serve gemma-4-26b-4bit --mllm
 
 # Reasoning model — parser is auto-detected, but you can pin it
-rapid-mlx serve qwen3.5-9b --reasoning-parser qwen3
+rapid-mlx serve qwen3.5-9b-4bit --reasoning-parser qwen3
 
 # DeepSeek reasoning model
-rapid-mlx serve deepseek-r1-8b --reasoning-parser deepseek_r1
+rapid-mlx serve deepseek-r1-8b-4bit --reasoning-parser deepseek_r1
 
 # Tool calling with Mistral/Devstral
-rapid-mlx serve devstral-24b --enable-auto-tool-choice --tool-call-parser hermes
+rapid-mlx serve devstral-24b-4bit --enable-auto-tool-choice --tool-call-parser hermes
 
 # DFlash speculative decoding (single-user, single supported alias)
 rapid-mlx serve qwen3.5-27b-8bit --enable-dflash --port 8000
 
 # API key authentication
-rapid-mlx serve qwen3.5-9b --api-key your-secret-key
+rapid-mlx serve qwen3.5-9b-4bit --api-key your-secret-key
 
 # Production setup with security options
-rapid-mlx serve qwen3.5-9b \
+rapid-mlx serve qwen3.5-9b-4bit \
   --api-key your-secret-key \
   --rate-limit 60 \
   --timeout 120
@@ -137,7 +137,7 @@ rapid-mlx bench <model> [options]
 
 | Option | Description | Default |
 |--------|-------------|---------|
-| `<model>` | Model alias (e.g. `qwen3.5-4b`) or HF repo (positional) | *(required)* |
+| `<model>` | Model alias (e.g. `qwen3.5-4b-4bit`) or HF repo (positional) | *(required)* |
 | `--num-prompts` | Number of prompts | 5 |
 | `--max-tokens` | Max tokens per prompt | 256 |
 | `--enable-prefix-cache` / `--disable-prefix-cache` | Toggle prefix caching | enabled |
@@ -150,7 +150,7 @@ Run `rapid-mlx bench --help` for the full list (memory limits, batch sizes, etc.
 
 ```bash
 # Quick LLM benchmark using a short alias
-rapid-mlx bench qwen3.5-4b
+rapid-mlx bench qwen3.5-4b-4bit
 
 # Bench a vision-language model by full HF repo
 rapid-mlx bench mlx-community/Qwen3-VL-8B-Instruct-4bit
@@ -172,7 +172,7 @@ rapid-mlx chat [model] [options]
 
 | Option | Description | Default |
 |--------|-------------|---------|
-| `model` | Model alias or HF repo (positional, optional) | `qwen3.5-4b` |
+| `model` | Model alias or HF repo (positional, optional) | `qwen3.5-4b-4bit` |
 | `--system` | System prompt prepended to the conversation | *(none)* |
 | `--think` / `--no-think` | Enable / disable reasoning output in the REPL | off |
 | `--max-tokens` | Max tokens per assistant response | 2048 |
@@ -189,18 +189,18 @@ rapid-mlx chat [model] [options]
 ### Examples
 
 ```bash
-# Fastest path — defaults to qwen3.5-4b, spawns its own server
+# Fastest path — defaults to qwen3.5-4b-4bit, spawns its own server
 rapid-mlx chat
 
 # A reasoning model with thinking surfaced
-rapid-mlx chat qwen3.5-9b --think
+rapid-mlx chat qwen3.5-9b-4bit --think
 
 # Attach to a server you're already running on :8000
-rapid-mlx serve qwen3.5-27b --port 8000 &
+rapid-mlx serve qwen3.5-27b-4bit --port 8000 &
 rapid-mlx chat --port 8000
 
 # Pin a system prompt
-rapid-mlx chat qwen3.5-4b --system "You are a terse, friendly Mac shell tutor."
+rapid-mlx chat qwen3.5-4b-4bit --system "You are a terse, friendly Mac shell tutor."
 ```
 
 In-REPL slash commands: `/help`, `/reset` (alias `/clear`), `/model <alias>`,
diff --git a/harness/README.md b/harness/README.md
index acbf04fb..75ebf703 100644
--- a/harness/README.md
+++ b/harness/README.md
@@ -4,7 +4,7 @@ A four-tier "code health checkup" for Rapid-MLX:
 
 ```
 rapid-mlx doctor smoke       # ~2 min,  no model         — pre-commit
-rapid-mlx doctor check       # ~15 min, qwen3.5-35b      — pre-PR / big change
+rapid-mlx doctor check       # ~15 min, qwen3.5-35b-8bit — pre-PR / big change
 rapid-mlx doctor full        # ~2-3 hr, 3 models         — pre-release / refactor
 rapid-mlx doctor benchmark   # overnight, all models     — periodic / promo material
 ```
@@ -27,7 +27,7 @@ it needs `tests/`, `harness/`, and `pyproject.toml`):
 # Pre-commit — no model required
 make smoke                            # or: rapid-mlx doctor smoke
 
-# Pre-PR — boots qwen3.5-35b, runs API + perf checks, diffs vs baseline.
+# Pre-PR — boots qwen3.5-35b-8bit, runs API + perf checks, diffs vs baseline.
 # 35B 8-bit is the smallest model we trust to ~never err on the eval
 # suite, so failures cleanly attribute to rapid-mlx bugs.
 HF_HUB_CACHE=... make check           # or: rapid-mlx doctor check
@@ -77,9 +77,9 @@ Designed to be invoked from a pre-commit hook or `make` target.
 | `cli_sanity` | `rapid-mlx --help / models / agents` actually run |
 | `pytest` | Full unit suite (~45s, ~2070 tests) excluding `tests/integrations/` and `test_event_loop.py` |
 
-### `check` (~15 min, qwen3.5-35b)
+### `check` (~15 min, qwen3.5-35b-8bit)
 
-Spins up a real server with `qwen3.5-35b` (Qwen3.5-35B-A3B-8bit — A3B
+Spins up a real server with `qwen3.5-35b-8bit` (Qwen3.5-35B-A3B-8bit — A3B
 MoE so decode is fast despite the 35B param count), runs API + perf
 checks, diffs against `harness/baselines/check-qwen3.5-35b.json`.
 
@@ -96,11 +96,11 @@ the old default and made bug triage ambiguous.
 | `autoresearch` | `scripts/autoresearch_bench.py --json` (13 perf metrics) |
 | `baseline_diff` | Compare metrics, flag regressions per `harness/thresholds.yaml` |
 
-Override the model with `--model qwen3.6-35b` (will need its own baseline).
+Override the model with `--model qwen3.6-35b-4bit` (will need its own baseline).
 
 ### `full` (~2-3 hr, 3 models × 11 agent profiles)
 
-Loops the check tier across `qwen3.5-35b` and `qwen3.6-35b`
+Loops the check tier across `qwen3.5-35b-8bit` and `qwen3.6-35b-4bit`
 (real-capacity Qwen lines — both 8-bit, both go through the Hermes
 parser path that most users hit). For each model, also runs all 11
 agent profiles' auto-generated test plans.
@@ -115,7 +115,7 @@ agent profiles' auto-generated test plans.
 Override the model list:
 
 ```bash
-rapid-mlx doctor full --models qwen3.5-35b,qwen3.6-35b
+rapid-mlx doctor full --models qwen3.5-35b-8bit,qwen3.6-35b-4bit
 ```
 
 ### `benchmark` (overnight, all local models)
@@ -128,7 +128,7 @@ scorecard markdown:
 HF_HUB_CACHE=... rapid-mlx doctor benchmark
 
 # Or be explicit (forces inclusion even if cache probe misses):
-rapid-mlx doctor benchmark --models qwen3.5-35b,qwen3.6-35b
+rapid-mlx doctor benchmark --models qwen3.5-35b-8bit,qwen3.6-35b-4bit
 ```
 
 Output:
@@ -165,7 +165,7 @@ Baseline file shape:
 {
   "captured_at": "2026-04-15T21:36:32",
   "rapid_mlx_version": "0.5.1",
-  "model": "qwen3.5-35b",
+  "model": "qwen3.5-35b-8bit",
   "metrics": {
     "decode_tps": 49.67,
     "cold_ttft_ms": 313.63,
@@ -198,7 +198,7 @@ git diff harness/baselines/
 
 # 3. If the change is justified, commit; otherwise revert + investigate
 git commit harness/baselines/check-qwen3.5-35b.json -m \
-  "chore(doctor): bump qwen3.5-35b decode_tps baseline (mlx 0.31 SDPA gains)"
+  "chore(doctor): bump qwen3.5-35b-8bit decode_tps baseline (mlx 0.31 SDPA gains)"
 ```
 
 ## Thresholds
diff --git a/harness/baselines/full-qwen3.5-35b.json b/harness/baselines/full-qwen3.5-35b-8bit.json
similarity index 95%
rename from harness/baselines/full-qwen3.5-35b.json
rename to harness/baselines/full-qwen3.5-35b-8bit.json
index 124faf60..0bbab958 100644
--- a/harness/baselines/full-qwen3.5-35b.json
+++ b/harness/baselines/full-qwen3.5-35b-8bit.json
@@ -1,7 +1,7 @@
 {
   "captured_at": "2026-05-05T06:16:05",
   "rapid_mlx_version": "0.6.10",
-  "model": "qwen3.5-35b",
+  "model": "qwen3.5-35b-8bit",
   "metrics": {
     "cold_ttft_ms": 123.91637498512864,
     "cold_tps": 69.78493344229089,
diff --git a/harness/baselines/full-qwen3.6-35b.json b/harness/baselines/full-qwen3.6-35b-4bit.json
similarity index 95%
rename from harness/baselines/full-qwen3.6-35b.json
rename to harness/baselines/full-qwen3.6-35b-4bit.json
index 0c4f3e5b..743c3178 100644
--- a/harness/baselines/full-qwen3.6-35b.json
+++ b/harness/baselines/full-qwen3.6-35b-4bit.json
@@ -1,7 +1,7 @@
 {
   "captured_at": "2026-05-05T06:21:25",
   "rapid_mlx_version": "0.6.10",
-  "model": "qwen3.6-35b",
+  "model": "qwen3.6-35b-4bit",
   "metrics": {
     "cold_ttft_ms": 167.38729202188551,
     "cold_tps": 89.2618750454324,
diff --git a/harness/scorecard/latest.md b/harness/scorecard/latest.md
index ef51f998..54793d67 100644
--- a/harness/scorecard/latest.md
+++ b/harness/scorecard/latest.md
@@ -4,42 +4,42 @@ _Generated: 2026-04-16T07:10:38_
 
 | Model | Decode TPS | Cold TTFT | Cached TTFT | Tool % | Score | Status |
 | --- | ---: | ---: | ---: | ---: | ---: | --- |
-| deepseek-r1-32b | 8.6 | 1111ms | 418ms | 0% | 51.8 | OK |
-| llama3-3b | 34.9 | 258ms | 189ms | 0% | 130.1 | OK |
-| qwen3-vl-8b | 12.2 | 456ms | 505ms | 100% | 59.3 | OK |
-| qwen3.5-27b | — | — | — | — | — | FAIL — server boot failed: server exited with code 1 before becoming healthy |
-| qwen3.5-35b | 10.9 | 1091ms | 1063ms | 0% | 26.5 | OK |
-| qwen3.5-4b | 25.1 | 448ms | 460ms | 100% | 78.5 | OK |
-| qwen3.5-9b | 20.7 | 539ms | 563ms | 100% | 68.1 | OK |
-| qwopus-27b | 8.8 | 1165ms | 1145ms | 100% | 41.9 | OK |
+| deepseek-r1-32b-4bit | 8.6 | 1111ms | 418ms | 0% | 51.8 | OK |
+| llama3-3b-4bit | 34.9 | 258ms | 189ms | 0% | 130.1 | OK |
+| qwen3-vl-8b-4bit | 12.2 | 456ms | 505ms | 100% | 59.3 | OK |
+| qwen3.5-27b-4bit | — | — | — | — | — | FAIL — server boot failed: server exited with code 1 before becoming healthy |
+| qwen3.5-35b-8bit | 10.9 | 1091ms | 1063ms | 0% | 26.5 | OK |
+| qwen3.5-4b-4bit | 25.1 | 448ms | 460ms | 100% | 78.5 | OK |
+| qwen3.5-9b-4bit | 20.7 | 539ms | 563ms | 100% | 68.1 | OK |
+| qwopus-27b-4bit | 8.8 | 1165ms | 1145ms | 100% | 41.9 | OK |
 | qwopus-27b-8bit | — | — | — | — | — | FAIL — server boot failed: server exited with code 1 before becoming healthy |
 
 ## Skipped
 
-- **deepseek-r1-8b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **devstral-24b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **devstral-v2-24b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gemma-3n-e4b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gemma-4-26b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gemma-4-31b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gemma3-12b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gemma3-1b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gemma3-27b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **glm4.5-air** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **glm4.7-9b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **gpt-oss-20b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **hermes3-8b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **hermes4-70b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **kimi-48b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **kimi-k2.5** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **minimax-m2.5** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **ministral-3b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **mistral-24b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **phi4-14b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **qwen3-coder** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **qwen3-coder-30b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **qwen3-vl-30b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **qwen3-vl-4b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **qwen3.5-122b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **deepseek-r1-8b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **devstral-24b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **devstral-v2-24b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gemma-3n-e4b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gemma-4-26b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gemma-4-31b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gemma3-12b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gemma3-1b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gemma3-27b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **glm4.5-air-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **glm4.7-9b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **gpt-oss-20b-mxfp4-q8** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **hermes3-8b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **hermes4-70b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **kimi-48b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **kimi-k2.5-3bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **minimax-m2.5-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **ministral-3b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **mistral-24b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **phi-4-14b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **qwen3-coder-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **qwen3-coder-30b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **qwen3-vl-30b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **qwen3-vl-4b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **qwen3.5-122b-mxfp4** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
 - **qwen3.5-122b-8bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
-- **qwopus-9b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
+- **qwopus-9b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio
diff --git a/install.sh b/install.sh
index 9a0a6f7c..dd43768c 100755
--- a/install.sh
+++ b/install.sh
@@ -85,10 +85,10 @@ fi
 # ── 2. Detect RAM → recommend model ──────────────────────────────────────────
 
 RAM_GB=$(sysctl -n hw.memsize 2>/dev/null | awk '{printf "%d", $1/1073741824}')
-if   [ "$RAM_GB" -ge 96 ]; then RECOMMENDED_MODEL="qwen3.5-122b"; RAM_TIER="96+ GB"
-elif [ "$RAM_GB" -ge 48 ]; then RECOMMENDED_MODEL="qwen3.5-35b";  RAM_TIER="48-95 GB"
-elif [ "$RAM_GB" -ge 24 ]; then RECOMMENDED_MODEL="qwen3.5-9b";   RAM_TIER="24-47 GB"
-else                            RECOMMENDED_MODEL="qwen3.5-4b";   RAM_TIER="8-23 GB"
+if   [ "$RAM_GB" -ge 96 ]; then RECOMMENDED_MODEL="qwen3.5-122b-mxfp4"; RAM_TIER="96+ GB"
+elif [ "$RAM_GB" -ge 48 ]; then RECOMMENDED_MODEL="qwen3.5-35b-8bit";   RAM_TIER="48-95 GB"
+elif [ "$RAM_GB" -ge 24 ]; then RECOMMENDED_MODEL="qwen3.5-9b-4bit";    RAM_TIER="24-47 GB"
+else                            RECOMMENDED_MODEL="qwen3.5-4b-4bit";    RAM_TIER="8-23 GB"
 fi
 
 dim "macOS $(sw_vers -productVersion) · Apple Silicon · ${RAM_GB} GB RAM"
diff --git a/pyproject.toml b/pyproject.toml
index 85c75ce7..64e12c99 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -80,7 +80,7 @@ dependencies = [
 # Required for any model with vision input. Text-only models work without this.
 # 0.5.0+ also unlocks DFlash speculative decoding (see [dflash] extras).
 vision = [
-    "mlx-vlm>=0.6.1",  # 0.6.1: gemma4_unified architecture (gemma-4-12b/12b-8bit aliases require it). 0.5.0 baseline: gemma4 multi-image + tool-parser fixes, TurboQuant race-condition fix, continuous-batching guard. Also unlocks DFlash spec-decode hooks (see [dflash] extras).
+    "mlx-vlm>=0.6.1",  # 0.6.1: gemma4_unified architecture (gemma-4-12b-4bit/12b-8bit aliases require it). 0.5.0 baseline: gemma4 multi-image + tool-parser fixes, TurboQuant race-condition fix, continuous-batching guard. Also unlocks DFlash spec-decode hooks (see [dflash] extras).
     "opencv-python>=4.8.0",
     "torch>=2.3.0",
     "torchvision>=0.18.0",
diff --git a/scripts/bench_all_models.sh b/scripts/bench_all_models.sh
index b357ca45..2dadfa27 100755
--- a/scripts/bench_all_models.sh
+++ b/scripts/bench_all_models.sh
@@ -16,8 +16,8 @@ mkdir -p "$RESULTS_DIR"
 MODELS=(
   "phi4-mini-14b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Phi-4-mini-reasoning-MLX-4bit|hermes|"
   "mistral-small-24b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit|hermes|"
-  "gemma3-12b|/Users/raullenstudio/.lmstudio/models/mlx-community/gemma-3-12b-it-qat-4bit|hermes|"
-  "gpt-oss-20b|/Users/raullenstudio/.lmstudio/models/mlx-community/gpt-oss-20b-MXFP4-Q8|seed_oss|"
+  "gemma3-12b-4bit|/Users/raullenstudio/.lmstudio/models/mlx-community/gemma-3-12b-it-qat-4bit|hermes|"
+  "gpt-oss-20b-mxfp4-q8|/Users/raullenstudio/.lmstudio/models/mlx-community/gpt-oss-20b-MXFP4-Q8|seed_oss|"
   "glm47-9b|/Users/raullenstudio/.lmstudio/models/mlx-community/GLM-4.7-4bit|glm47|"
   "qwen35-122b-a10b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Qwen3.5-122B-A10B-Text-mxfp4-mlx|hermes|qwen3"
   "qwen3-coder-next-80b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-MLX-6bit|hermes|"
diff --git a/scripts/bench_engine_parity.py b/scripts/bench_engine_parity.py
index 4258bf21..f2d6404e 100644
--- a/scripts/bench_engine_parity.py
+++ b/scripts/bench_engine_parity.py
@@ -8,10 +8,10 @@
 
 Usage:
     # Start server with SimpleEngine (default)
-    rapid-mlx serve qwen3.5-4b --port 8000
+    rapid-mlx serve qwen3.5-4b-4bit --port 8000
 
     # Start server with BatchedEngine
-    rapid-mlx serve qwen3.5-4b --port 8001 --continuous-batching
+    rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching
 
     # Run benchmark
     python3 scripts/bench_engine_parity.py
@@ -289,8 +289,8 @@ def main():
         except Exception as e:
             print(f"  {name} ({url}): NOT AVAILABLE — {e}")
             print(f"\nPlease start both servers:")
-            print(f"  Terminal 1: rapid-mlx serve qwen3.5-4b --port 8000")
-            print(f"  Terminal 2: rapid-mlx serve qwen3.5-4b --port 8001 --continuous-batching")
+            print(f"  Terminal 1: rapid-mlx serve qwen3.5-4b-4bit --port 8000")
+            print(f"  Terminal 2: rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching")
             sys.exit(1)
 
     model_simple = detect_model(SIMPLE_URL)
diff --git a/scripts/bench_readme_refresh.py b/scripts/bench_readme_refresh.py
index dcbf942d..50a2fb32 100644
--- a/scripts/bench_readme_refresh.py
+++ b/scripts/bench_readme_refresh.py
@@ -22,7 +22,7 @@
 
 Usage:
     python3.12 scripts/bench_readme_refresh.py                     # full sweep
-    python3.12 scripts/bench_readme_refresh.py --models qwen3.5-4b # one model
+    python3.12 scripts/bench_readme_refresh.py --models qwen3.5-4b-4bit # one model
     python3.12 scripts/bench_readme_refresh.py --engines rapid-mlx,mlx-lm
 """
 
@@ -84,40 +84,40 @@ class ModelSpec:
 
 MODELS: list[ModelSpec] = [
     ModelSpec(
-        "qwen3.5-4b",
+        "qwen3.5-4b-4bit",
         "mlx-community/Qwen3.5-4B-MLX-4bit",
         "qwen3:4b",
         "Ollama Qwen3 (not Qwen3.5; DeltaNet arch unavailable on llama.cpp)",
     ),
     ModelSpec(
-        "qwen3.5-9b",
+        "qwen3.5-9b-4bit",
         "mlx-community/Qwen3.5-9B-4bit",
         "qwen3:8b",
         "Ollama Qwen3 8B (not Qwen3.5 9B; closest available)",
     ),
     ModelSpec(
-        "qwen3.5-27b",
+        "qwen3.5-27b-4bit",
         "mlx-community/Qwen3.5-27B-4bit",
         "qwen3:32b",
         "Ollama Qwen3 32B Q4_K_M (closest dense 27-32B; Qwen3.5 DeltaNet not on llama.cpp; Unsloth Qwen3.6-27B GGUF fails to load in Ollama 0.24)",
     ),
     ModelSpec(
-        "gemma-4-12b",
+        "gemma-4-12b-4bit",
         "mlx-community/gemma-4-12B-it-4bit",
         "gemma3:12b",
         "Ollama Gemma 3 12B (Gemma 4 not yet on llama.cpp)",
     ),
     ModelSpec(
-        "gpt-oss-20b", "mlx-community/gpt-oss-20b-MXFP4-Q8", "gpt-oss:20b", "Same arch"
+        "gpt-oss-20b-mxfp4-q8", "mlx-community/gpt-oss-20b-MXFP4-Q8", "gpt-oss:20b", "Same arch"
     ),
     ModelSpec(
-        "qwen3.6-35b",
+        "qwen3.6-35b-4bit",
         "mlx-community/Qwen3.6-35B-A3B-4bit",
         "qwen3:30b-a3b",
         "Ollama Qwen3 30B-A3B (not Qwen3.6; closest MoE A3B)",
     ),
     ModelSpec(
-        "qwen3.5-35b",
+        "qwen3.5-35b-8bit",
         "mlx-community/Qwen3.5-35B-A3B-8bit",
         "qwen3:30b-a3b",
         "Ollama Qwen3 30B-A3B 4bit (not Qwen3.5-35B 8bit; closest MoE)",
diff --git a/scripts/bench_vs_ollama.py b/scripts/bench_vs_ollama.py
index 78696146..a717fa5b 100644
--- a/scripts/bench_vs_ollama.py
+++ b/scripts/bench_vs_ollama.py
@@ -8,7 +8,7 @@
 
 Manual usage:
     python scripts/bench_vs_ollama.py
-    python scripts/bench_vs_ollama.py --model-pair qwen3.5-4b=qwen3.5:4b --runs 1
+    python scripts/bench_vs_ollama.py --model-pair qwen3.5-4b-4bit=qwen3.5:4b --runs 1
     python scripts/bench_vs_ollama.py --no-pull --no-download --runs 1
 """
 
@@ -93,8 +93,8 @@ class ParsedStream:
 
 def default_model_pairs() -> list[ModelPair]:
     return [
-        ModelPair("qwen3.5-4b", "qwen3.5:4b"),
-        ModelPair("qwen3.5-9b", "qwen3.5:9b"),
+        ModelPair("qwen3.5-4b-4bit", "qwen3.5:4b"),
+        ModelPair("qwen3.5-9b-4bit", "qwen3.5:9b"),
     ]
 
 
diff --git a/scripts/local_bench_vs_ollama.py b/scripts/local_bench_vs_ollama.py
index c1483125..e8dd23cf 100644
--- a/scripts/local_bench_vs_ollama.py
+++ b/scripts/local_bench_vs_ollama.py
@@ -96,7 +96,7 @@ def ollama_model_name(model: str) -> str:
     if ":" in model:
         return model
     known = {
-        "qwen3.5-4b": "qwen3:4b",
+        "qwen3.5-4b-4bit": "qwen3:4b",
         "qwen3.5-8b": "qwen3:8b",
         "qwen3.5-14b": "qwen3:14b",
         "qwen3.5-32b": "qwen3:32b",
@@ -106,7 +106,7 @@ def ollama_model_name(model: str) -> str:
         "phi4-4b": "phi4:4b",
         "phi4-mini": "phi4-mini:latest",
         "gemma3-4b": "gemma3:4b",
-        "gemma3-12b": "gemma3:12b",
+        "gemma3-12b-4bit": "gemma3:12b",
     }
     if model in known:
         return known[model]
@@ -683,7 +683,7 @@ def summary_row(metric: str, ratio: float, desc: str) -> None:
 # ── Main ──────────────────────────────────────────────────────────────────────
 def main() -> int:
     parser = argparse.ArgumentParser(description="Benchmark Rapid-MLX vs Ollama")
-    parser.add_argument("--model", default="qwen3.5-4b", help="Rapid-MLX model name")
+    parser.add_argument("--model", default="qwen3.5-4b-4bit", help="Rapid-MLX model name")
     parser.add_argument(
         "--ollama-model",
         default=None,
diff --git a/scripts/mhi_batch.sh b/scripts/mhi_batch.sh
index c3c9d6cc..cb4219dd 100644
--- a/scripts/mhi_batch.sh
+++ b/scripts/mhi_batch.sh
@@ -9,8 +9,8 @@ cd "$PROJECT_DIR"
 
 # Model definitions: name|path|tool_parser
 MODELS=(
-    "qwopus-27b|/Users/raullenstudio/.cache/huggingface/hub/models--Jackrong--MLX-Qwopus3.5-27B-v3-4bit/snapshots/d399209470abffa6b45678c53a910f869b18b2f2|hermes"
-    "deepseek-r1-32b|/Users/raullenstudio/.cache/huggingface/hub/models--mlx-community--DeepSeek-R1-Distill-Qwen-32B-4bit/snapshots/4e0d3848a0ad8f9fb54638891e4928f04fcca978|hermes"
+    "qwopus-27b-4bit|/Users/raullenstudio/.cache/huggingface/hub/models--Jackrong--MLX-Qwopus3.5-27B-v3-4bit/snapshots/d399209470abffa6b45678c53a910f869b18b2f2|hermes"
+    "deepseek-r1-32b-4bit|/Users/raullenstudio/.cache/huggingface/hub/models--mlx-community--DeepSeek-R1-Distill-Qwen-32B-4bit/snapshots/4e0d3848a0ad8f9fb54638891e4928f04fcca978|hermes"
     "llama-70b|/Volumes/Extreme SSD/Models/Llama-3.3-70B-Instruct-4bit|llama"
 )
 
diff --git a/scripts/mhi_eval.py b/scripts/mhi_eval.py
index 7e55460b..77f7843f 100644
--- a/scripts/mhi_eval.py
+++ b/scripts/mhi_eval.py
@@ -17,7 +17,7 @@
     python3 scripts/mhi_eval.py --base-url http://localhost:8000/v1 --suite tau
 
     # Custom model name
-    python3 scripts/mhi_eval.py --base-url http://localhost:8000/v1 --model qwopus-27b
+    python3 scripts/mhi_eval.py --base-url http://localhost:8000/v1 --model qwopus-27b-4bit
 """
 
 import argparse
diff --git a/scripts/pr_validate/golden_models.yaml b/scripts/pr_validate/golden_models.yaml
index 033e3aaf..cd5d1afa 100644
--- a/scripts/pr_validate/golden_models.yaml
+++ b/scripts/pr_validate/golden_models.yaml
@@ -127,7 +127,7 @@ families:
       # absent from this matrix when Bug A shipped, so no PR could have
       # caught the regression at the integration layer. Gemma 4 remains
       # excluded (see note below — model-side instruction-following
-      # weaknesses produce agent-test flake), but gpt-oss-20b
+      # weaknesses produce agent-test flake), but gpt-oss-20b-mxfp4-q8
       # (Harmony format) does NOT have those issues and gives us live
       # OutputRouter coverage for the second allowlist family.
       #
@@ -182,8 +182,8 @@ families:
 #     single fail (pydantic_ai single tool call) is a model-quality
 #     edge case (off-topic Java/Spring response). Not flaky enough
 #     to block merges, but adding here would gate every PR on it.
-# Both families remain available via ``rapid-mlx serve smollm3-3b``
-# and ``rapid-mlx serve granite4-tiny`` (auto-detected parsers).
+# Both families remain available via ``rapid-mlx serve smollm3-3b-4bit``
+# and ``rapid-mlx serve granite4-tiny-4bit`` (auto-detected parsers).
 
 # Per-model overrides: extra CLI args (e.g. parser overrides for
 # tool-calling tests). Keep small — most models don't need this.
@@ -211,7 +211,7 @@ overrides:
     args: ["--enable-auto-tool-choice", "--tool-call-parser", "hermes"]
   # Harmony parser handles both <|channel|>commentary tool-call blocks and
   # the final-channel content for gpt-oss-20b. The auto-detect registry
-  # picks it via gpt-oss-20b alias profile, but the explicit override
+  # picks it via gpt-oss-20b-mxfp4-q8 alias profile, but the explicit override
   # documents intent at the validation surface (mirrors the qwen3.5/3.6
   # convention above).
   "mlx-community/gpt-oss-20b-MXFP4-Q8":
diff --git a/scripts/rename_aliases.py b/scripts/rename_aliases.py
new file mode 100644
index 00000000..a33fa8a5
--- /dev/null
+++ b/scripts/rename_aliases.py
@@ -0,0 +1,167 @@
+#!/usr/bin/env python3.12
+"""One-shot rename of every alias in vllm_mlx/aliases.json to the canonical
+explicit form ``<family>-<version>-<params>-<modality?>-<technique?>-<quant>``.
+
+Drops three legacy short-form codename aliases that violate the spec
+(``deepseek-v4-flash``, ``gemma4``, ``nemotron-nano``) and fixes the
+``phi4-14b`` schema bug where the alias name claimed 14B but the hf_path
+pointed at phi-4-mini (~4B).
+
+Also dumps ``rename_map.json`` so the repo-wide reference sweep can
+mechanically rewrite occurrences in tests, docs, scripts.
+
+Run from repo root:
+
+    python3.12 scripts/rename_aliases.py
+"""
+
+import json
+import re
+from collections import OrderedDict
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+ALIASES_PATH = ROOT / "vllm_mlx" / "aliases.json"
+RENAME_MAP_PATH = ROOT / "scripts" / "rename_map.json"
+
+
+def detect_quant(hf: str) -> str | None:
+    """Inspect an hf_path and return the canonical quant suffix.
+
+    Order matters: longer / more specific markers first so we don't
+    misclassify ``mxfp4-q8`` as ``mxfp4`` or ``-4bit-DWQ`` as ``-4bit``.
+    """
+    h = hf.lower()
+    if "mxfp4-q8" in h or "mxfp4_q8" in h:
+        return "mxfp4-q8"
+    if "mxfp4" in h:
+        return "mxfp4"
+    if "dwq" in h:
+        return "dwq"
+    if "ud-mlx" in h or "-ud-" in h:
+        return "ud"
+    if m := re.search(r"-(\d+)bit", h):
+        return f"{m.group(1)}bit"
+    if "unpacked" in h:
+        return "unpacked"
+    if "bf16" in h:
+        return "bf16"
+    if "fp16" in h:
+        return "fp16"
+    return None
+
+
+# 1. Aliases whose hf_path is unambiguous and only need the quant suffix added.
+# These are determined automatically by detect_quant.
+
+# 2. Manual overrides: aliases that need name changes beyond just adding a suffix,
+#    or that need their hf_path corrected. ``redirect_to`` is the equivalent
+#    explicit alias the repo-wide sweep should rewrite references to.
+MANUAL: dict[str, dict[str, str] | None] = {
+    # Codename aliases — duplicate hf_path of an explicit entry. Drop entirely,
+    # but tell the sweep where to redirect references.
+    "deepseek-v4-flash": {"drop": True, "redirect_to": "deepseek-v4-flash-8bit"},
+    "gemma4": {"drop": True, "redirect_to": "gemma-4-12b-qat-4bit"},
+    "nemotron-nano": {"drop": True, "redirect_to": "nemotron-30b-4bit"},
+    # phi4-14b: schema bug. hf_path was pointing at phi-4-mini (~4B).
+    # Fix: rename to phi-4-14b-4bit AND swap hf_path to the real Phi-4 14B.
+    # The 4B mini variant moves to its own new alias (added below).
+    "phi4-14b": {
+        "new_name": "phi-4-14b-4bit",
+        "new_hf_path": "mlx-community/phi-4-4bit",
+    },
+}
+
+# 3. Brand-new aliases to add at the end (preserves old phi-4-mini coverage).
+NEW_ALIASES: list[tuple[str, dict]] = [
+    (
+        "phi-4-mini-4bit",
+        {
+            "hf_path": "mlx-community/phi-4-mini-instruct-4bit",
+            "tool_call_parser": "hermes",
+            "reasoning_parser": None,
+            "is_hybrid": False,
+            "is_moe": False,
+            "supports_spec_decode": True,
+            "suffix_decoding_tier": "unknown",
+        },
+    ),
+]
+
+
+def compute_new_name(old: str, hf: str) -> str:
+    quant = detect_quant(hf)
+    if quant is None:
+        raise ValueError(f"alias {old!r}: cannot detect quant from {hf!r}")
+    # If the old name already ends in a known quant suffix, replace it.
+    stem = re.sub(
+        r"-(2bit|3bit|4bit|6bit|8bit|mxfp4-q8|mxfp4|dwq|ud|unpacked|bf16|fp16)$",
+        "",
+        old,
+    )
+    return f"{stem}-{quant}"
+
+
+def main() -> None:
+    with open(ALIASES_PATH) as fp:
+        data = json.load(fp, object_pairs_hook=OrderedDict)
+
+    new_data: OrderedDict[str, object] = OrderedDict()
+    rename_map: dict[str, str | None] = {}
+
+    for old, profile in data.items():
+        # Handle manual overrides first.
+        if old in MANUAL:
+            spec = MANUAL[old]
+            if spec is None:
+                # Drop entirely (legacy form — kept for the type checker).
+                rename_map[old] = None
+                continue
+            if spec.get("drop"):
+                # Codename alias — drop from aliases.json but tell the sweep
+                # where to point references.
+                rename_map[old] = spec["redirect_to"]
+                continue
+            new_name = spec["new_name"]
+            if "new_hf_path" in spec:
+                profile = OrderedDict(profile)
+                profile["hf_path"] = spec["new_hf_path"]
+            new_data[new_name] = profile
+            rename_map[old] = new_name
+            continue
+
+        # Default path: keep the entry, rewrite the key.
+        hf = profile["hf_path"] if isinstance(profile, dict) else profile
+        new_name = compute_new_name(old, hf)
+        if new_name in new_data:
+            raise ValueError(f"rename collision: {old!r} -> {new_name!r} already used")
+        new_data[new_name] = profile
+        rename_map[old] = new_name
+
+    # Append brand-new aliases (skip if already present so a re-run leaves them
+    # untouched; note a re-run still rebuilds rename_map.json from current keys).
+    for name, profile in NEW_ALIASES:
+        if name not in new_data:
+            new_data[name] = profile
+
+    # Write back.
+    with open(ALIASES_PATH, "w") as fp:
+        json.dump(new_data, fp, indent=2)
+        fp.write("\n")
+
+    with open(RENAME_MAP_PATH, "w") as fp:
+        json.dump(rename_map, fp, indent=2, sort_keys=True)
+        fp.write("\n")
+
+    renamed = sum(1 for o, n in rename_map.items() if n and o != n)
+    dropped = sum(1 for n in rename_map.values() if n is None)
+    kept = sum(1 for o, n in rename_map.items() if n and o == n)
+    print(f"  renamed (incl. redirected drops): {renamed}")
+    print(f"  dropped (no redirect): {dropped}")
+    print(f"  kept (already explicit): {kept}")
+    print(f"  new aliases added: {len(NEW_ALIASES)}")
+    print(f"  total: {len(new_data)}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/rename_map.json b/scripts/rename_map.json
new file mode 100644
index 00000000..810983c0
--- /dev/null
+++ b/scripts/rename_map.json
@@ -0,0 +1,76 @@
+{
+  "bonsai-1.7b": "bonsai-1.7b-unpacked",
+  "bonsai-4b": "bonsai-4b-unpacked",
+  "bonsai-8b": "bonsai-8b-unpacked",
+  "deepseek-r1-32b": "deepseek-r1-32b-4bit",
+  "deepseek-r1-8b": "deepseek-r1-8b-4bit",
+  "deepseek-v4-flash": "deepseek-v4-flash-8bit",
+  "deepseek-v4-flash-2bit": "deepseek-v4-flash-2bit",
+  "deepseek-v4-flash-4bit": "deepseek-v4-flash-4bit",
+  "deepseek-v4-flash-8bit": "deepseek-v4-flash-8bit",
+  "devstral-24b": "devstral-24b-4bit",
+  "devstral-v2-24b": "devstral-v2-24b-4bit",
+  "gemma-3n-e4b": "gemma-3n-e4b-4bit",
+  "gemma-4-12b": "gemma-4-12b-4bit",
+  "gemma-4-12b-8bit": "gemma-4-12b-8bit",
+  "gemma-4-12b-qat": "gemma-4-12b-qat-4bit",
+  "gemma-4-12b-qat-8bit": "gemma-4-12b-qat-8bit",
+  "gemma-4-26b": "gemma-4-26b-4bit",
+  "gemma-4-26b-qat": "gemma-4-26b-qat-4bit",
+  "gemma-4-31b": "gemma-4-31b-4bit",
+  "gemma-4-31b-8bit": "gemma-4-31b-8bit",
+  "gemma-4-31b-qat": "gemma-4-31b-qat-4bit",
+  "gemma-4-31b-qat-8bit": "gemma-4-31b-qat-8bit",
+  "gemma3-12b": "gemma3-12b-4bit",
+  "gemma3-1b": "gemma3-1b-4bit",
+  "gemma3-27b": "gemma3-27b-4bit",
+  "gemma4": "gemma-4-12b-qat-4bit",
+  "glm4.5-air": "glm4.5-air-4bit",
+  "glm4.7-9b": "glm4.7-9b-4bit",
+  "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8",
+  "granite4-tiny": "granite4-tiny-4bit",
+  "hermes3-8b": "hermes3-8b-4bit",
+  "hermes4-70b": "hermes4-70b-4bit",
+  "kimi-48b": "kimi-48b-4bit",
+  "kimi-k2.5": "kimi-k2.5-3bit",
+  "llama-3.1-8b-8bit": "llama-3.1-8b-8bit",
+  "llama3-1b": "llama3-1b-4bit",
+  "llama3-3b": "llama3-3b-4bit",
+  "minimax-m2.5": "minimax-m2.5-4bit",
+  "minimax-m2.7": "minimax-m2.7-mxfp4",
+  "ministral-3b": "ministral-3b-4bit",
+  "mistral-24b": "mistral-24b-4bit",
+  "nemotron-30b": "nemotron-30b-4bit",
+  "nemotron-nano": "nemotron-30b-4bit",
+  "phi4-14b": "phi-4-14b-4bit",
+  "qwen3-0.6b-8bit": "qwen3-0.6b-8bit",
+  "qwen3-4b-8bit": "qwen3-4b-8bit",
+  "qwen3-8b-8bit": "qwen3-8b-8bit",
+  "qwen3-coder": "qwen3-coder-4bit",
+  "qwen3-coder-30b": "qwen3-coder-30b-4bit",
+  "qwen3-vl-30b": "qwen3-vl-30b-4bit",
+  "qwen3-vl-4b": "qwen3-vl-4b-4bit",
+  "qwen3-vl-8b": "qwen3-vl-8b-4bit",
+  "qwen3.5-122b": "qwen3.5-122b-mxfp4",
+  "qwen3.5-122b-8bit": "qwen3.5-122b-8bit",
+  "qwen3.5-27b": "qwen3.5-27b-4bit",
+  "qwen3.5-27b-8bit": "qwen3.5-27b-8bit",
+  "qwen3.5-35b": "qwen3.5-35b-8bit",
+  "qwen3.5-35b-4bit": "qwen3.5-35b-4bit",
+  "qwen3.5-4b": "qwen3.5-4b-4bit",
+  "qwen3.5-4b-8bit": "qwen3.5-4b-8bit",
+  "qwen3.5-9b": "qwen3.5-9b-4bit",
+  "qwen3.5-9b-8bit": "qwen3.5-9b-8bit",
+  "qwen3.6-27b": "qwen3.6-27b-4bit",
+  "qwen3.6-27b-8bit": "qwen3.6-27b-8bit",
+  "qwen3.6-27b-ud": "qwen3.6-27b-ud",
+  "qwen3.6-35b": "qwen3.6-35b-4bit",
+  "qwen3.6-35b-6bit": "qwen3.6-35b-6bit",
+  "qwen3.6-35b-8bit": "qwen3.6-35b-8bit",
+  "qwen3.6-35b-dwq": "qwen3.6-35b-dwq",
+  "qwen3.6-35b-ud": "qwen3.6-35b-ud",
+  "qwopus-27b": "qwopus-27b-4bit",
+  "qwopus-27b-8bit": "qwopus-27b-8bit",
+  "qwopus-9b": "qwopus-9b-4bit",
+  "smollm3-3b": "smollm3-3b-4bit"
+}
diff --git a/scripts/run_dogfood_mvp.sh b/scripts/run_dogfood_mvp.sh
index da638b58..24ab50d4 100755
--- a/scripts/run_dogfood_mvp.sh
+++ b/scripts/run_dogfood_mvp.sh
@@ -11,7 +11,7 @@
 #     scripts/run_dogfood_mvp.sh status
 #
 #   Env vars:
-#     MODEL            alias to serve (default: qwen3.5-35b)
+#     MODEL            alias to serve (default: qwen3.5-35b-8bit)
 #     PORT             local port (default: 8765)
 #     API_KEY          bearer token (default: random 24 hex bytes)
 #     RAPID_MLX_CMD    serve command (default: auto — editable `python3.12 -m
@@ -114,7 +114,7 @@ fi
 
 # Pick the serve command. Prefer the editable repo CLI when we're inside
 # vllm-mlx — the brew-installed `rapid-mlx` ships an older aliases.json
-# and won't see recent additions like `minimax-m2.7`.
+# and won't see recent additions like `minimax-m2.7-mxfp4`.
 if [ -z "${RAPID_MLX_CMD:-}" ]; then
   if [ -f "$(git rev-parse --show-toplevel 2>/dev/null)/vllm_mlx/cli.py" ] \
        && python3.12 -c "import vllm_mlx" >/dev/null 2>&1; then
diff --git a/scripts/sweep_alias_refs.py b/scripts/sweep_alias_refs.py
new file mode 100644
index 00000000..918f2ca7
--- /dev/null
+++ b/scripts/sweep_alias_refs.py
@@ -0,0 +1,215 @@
+#!/usr/bin/env python3.12
+"""Repo-wide rewrite of legacy alias names to canonical explicit names.
+
+Loads ``scripts/rename_map.json`` (generated by ``rename_aliases.py``) and
+applies the mapping inside every active source file.
+
+EXCLUDED on purpose:
+  * ``evals/results/``, ``harness/runs/``, ``reports/benchmarks/``,
+    ``reports/mhi/``: historical snapshots that record *which alias a
+    run used at the time*. Rewriting these is rewriting history.
+  * ``CHANGELOG.md`` (if present) — historical release notes; same logic.
+  * ``.git/``, ``.build/``, ``.venv/``, ``__pycache__/``, ``node_modules/``
+  * ``vllm_mlx/aliases.json`` — already correctly written by
+    ``rename_aliases.py``.
+  * ``scripts/rename_aliases.py``, ``scripts/rename_map.json``,
+    ``scripts/sweep_alias_refs.py`` — these files document the rename
+    itself; legacy names must appear in them by construction.
+
+Run from repo root:
+
+    python3.12 scripts/sweep_alias_refs.py [--dry-run]
+"""
+
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+RENAME_MAP_PATH = ROOT / "scripts" / "rename_map.json"
+
+# File extensions we sweep.
+EXTENSIONS = {
+    ".py",
+    ".md",
+    ".sh",
+    ".yaml",
+    ".yml",
+    ".toml",
+    ".cfg",
+    ".rst",
+    ".mdx",
+    ".json",
+    ".txt",
+}
+
+# Directory names to ignore wherever they appear in a path.
+SKIP_DIR_PARTS = {
+    ".git",
+    ".build",
+    ".venv",
+    "venv",
+    "__pycache__",
+    "node_modules",
+    ".pytest_cache",
+    ".mypy_cache",
+    ".ruff_cache",
+    "dist",
+    "build",
+    ".tox",
+    ".claude",
+}
+
+# Specific files to leave alone — see module docstring.
+SKIP_FILES = {
+    ROOT / "vllm_mlx" / "aliases.json",
+    ROOT / "scripts" / "rename_aliases.py",
+    ROOT / "scripts" / "rename_map.json",
+    ROOT / "scripts" / "sweep_alias_refs.py",
+    ROOT / "CHANGELOG.md",
+}
+
+# Specific subtrees to leave alone.
+SKIP_TREES = (
+    ROOT / "evals" / "results",
+    # Historical doctor harness runs — directory names are
+    # ``YYYY-MM-DD-<seq>-<tier>`` and the contents are point-in-time
+    # snapshots of what each alias produced. Rewriting them is
+    # rewriting history. The ``harness/scorecard/latest.md`` pointer
+    # OUTSIDE this tree is intentionally not skipped — it's the live
+    # dashboard and will be regenerated on the next ``make check``.
+    ROOT / "harness" / "runs",
+    # Historical README-refresh benchmark snapshots — captured to
+    # justify a release-time README edit; rewriting changes what we
+    # claimed at the time.
+    ROOT / "reports" / "benchmarks",
+    # Historical Model Harness Index reports — filenames have a
+    # timestamp suffix (``alias_YYYYMMDD_HHMMSS.json``); rewriting
+    # the ``model`` field inside changes which alias the historical
+    # MHI score was attributed to.
+    ROOT / "reports" / "mhi",
+)
+
+
+def should_skip(path: Path) -> bool:
+    if path.is_dir():
+        return any(part in SKIP_DIR_PARTS for part in path.parts)
+    if path in SKIP_FILES:
+        return True
+    if any(part in SKIP_DIR_PARTS for part in path.parts):
+        return True
+    for tree in SKIP_TREES:
+        try:
+            path.relative_to(tree)
+            return True
+        except ValueError:
+            pass
+    if path.suffix not in EXTENSIONS:
+        return True
+    return False
+
+
+def build_pattern(rename_map: dict[str, str]) -> re.Pattern:
+    """Build one big regex matching any legacy alias on a word boundary.
+
+    Sort by length descending so longer names are tried before any prefix
+    of theirs (e.g. ``qwen3.5-4b-8bit`` is tried before ``qwen3.5-4b``).
+    """
+    keys = sorted(rename_map.keys(), key=len, reverse=True)
+    # Custom boundary: the names contain ``-`` and ``.`` so the stdlib
+    # ``\b`` doesn't apply cleanly. Use a negative lookbehind and lookahead
+    # on ``[A-Za-z0-9._-]`` so a match cannot extend on either side.
+    parts = [re.escape(k) for k in keys]
+    return re.compile(
+        r"(?<![A-Za-z0-9._-])(" + "|".join(parts) + r")(?![A-Za-z0-9._-])"
+    )
+
+
+def rewrite_file(
+    path: Path, pattern: re.Pattern, rename_map: dict[str, str]
+) -> tuple[bool, int, str]:
+    try:
+        original = path.read_text(encoding="utf-8")
+    except (UnicodeDecodeError, PermissionError):
+        return False, 0, ""
+    count = 0
+
+    def sub(m: re.Match) -> str:
+        nonlocal count
+        count += 1
+        return rename_map[m.group(1)]
+
+    new = pattern.sub(sub, original)
+    if new != original:
+        return True, count, new
+    return False, 0, original
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="report files that would change, don't write",
+    )
+    args = parser.parse_args()
+
+    rename_map = json.loads(RENAME_MAP_PATH.read_text())
+    # Filter to entries with a real target (drops + renames).
+    rename_map = {o: n for o, n in rename_map.items() if n}
+    # High-ambiguity legacy aliases that collide with non-alias usage
+    # elsewhere in the codebase. ``gemma4`` is the parser ID
+    # (registered in ``gemma4_tool_parser.py``, referenced from
+    # ``model_auto_config.py``, ``output_router.py``, etc.), so
+    # auto-rewriting it would corrupt the parser registry. These are
+    # excluded from the sweep and must be rewritten by hand.
+    HAND_HANDLED = {"gemma4"}
+    rename_map = {o: n for o, n in rename_map.items() if o not in HAND_HANDLED}
+
+    pattern = build_pattern(rename_map)
+
+    changed_files = []
+    total_replacements = 0
+
+    for path in ROOT.rglob("*"):
+        if should_skip(path):
+            continue
+        if not path.is_file():
+            continue
+        try:
+            original = path.read_text(encoding="utf-8")
+        except (UnicodeDecodeError, PermissionError):
+            continue
+        count = 0
+
+        def sub(m: re.Match) -> str:
+            nonlocal count
+            count += 1
+            return rename_map[m.group(1)]
+
+        new = pattern.sub(sub, original)
+        if new != original:
+            changed_files.append((path, count))
+            total_replacements += count
+            if not args.dry_run:
+                path.write_text(new, encoding="utf-8")
+
+    rel_root = ROOT
+    print(f"  files changed: {len(changed_files)}")
+    print(f"  total replacements: {total_replacements}")
+    if args.dry_run:
+        print("  (dry-run — no files written)")
+    print()
+    for path, count in sorted(changed_files, key=lambda x: -x[1])[:20]:
+        rel = path.relative_to(rel_root)
+        print(f"    {count:4d}  {rel}")
+    if len(changed_files) > 20:
+        print(f"    ... +{len(changed_files) - 20} more")
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
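
Reviewer sketch (not part of the patch): the custom boundary in ``build_pattern`` decides which occurrences the sweep rewrites. The snippet below rebuilds the same regex over a two-entry map taken from ``rename_map.json`` and runs it on a few strings. The ``sweep`` helper is a hypothetical name used only here.

```python
import re

# Two entries copied from scripts/rename_map.json.
rename_map = {
    "qwen3.5-4b": "qwen3.5-4b-4bit",
    "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8",
}

# Same construction as build_pattern(): longest keys first, and no
# alias-like character allowed directly before or after the match.
keys = sorted(rename_map, key=len, reverse=True)
pattern = re.compile(
    r"(?<![A-Za-z0-9._-])("
    + "|".join(re.escape(k) for k in keys)
    + r")(?![A-Za-z0-9._-])"
)


def sweep(text: str) -> str:
    """Rewrite every legacy alias in `text` to its explicit form."""
    return pattern.sub(lambda m: rename_map[m.group(1)], text)


samples = [
    "rapid-mlx serve qwen3.5-4b",          # rewritten
    "rapid-mlx serve qwen3.5-4b-8bit",     # unchanged: "-" follows the match
    "my-not-gpt-oss-20b",                  # unchanged: "-" precedes the match
    "mlx-community/gpt-oss-20b-MXFP4-Q8",  # unchanged: "-" follows the match
    "openai/gpt-oss-20b",                  # rewritten: "/" is not a boundary char
    "Use qwen3.5-4b.",                     # unchanged: trailing "." blocks the match
]
for s in samples:
    print(f"{s!r} -> {sweep(s)!r}")
```

Two consequences are worth checking in the sweep's output: a name after a ``/`` is treated as an alias (which is how ``openai/gpt-oss-20b`` changed in the test fixtures), and an alias followed by a sentence-ending period is left untouched.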
diff --git a/tests/fixtures/generation_configs/devstral-24b.json b/tests/fixtures/generation_configs/devstral-24b-4bit.json
similarity index 100%
rename from tests/fixtures/generation_configs/devstral-24b.json
rename to tests/fixtures/generation_configs/devstral-24b-4bit.json
diff --git a/tests/fixtures/generation_configs/gemma-3n-e4b.json b/tests/fixtures/generation_configs/gemma-3n-e4b-4bit.json
similarity index 100%
rename from tests/fixtures/generation_configs/gemma-3n-e4b.json
rename to tests/fixtures/generation_configs/gemma-3n-e4b-4bit.json
diff --git a/tests/fixtures/generation_configs/gemma3-27b.json b/tests/fixtures/generation_configs/gemma3-27b-4bit.json
similarity index 100%
rename from tests/fixtures/generation_configs/gemma3-27b.json
rename to tests/fixtures/generation_configs/gemma3-27b-4bit.json
diff --git a/tests/fixtures/generation_configs/glm4.5-air.json b/tests/fixtures/generation_configs/glm4.5-air-4bit.json
similarity index 100%
rename from tests/fixtures/generation_configs/glm4.5-air.json
rename to tests/fixtures/generation_configs/glm4.5-air-4bit.json
diff --git a/tests/fixtures/generation_configs/glm4.7-9b.json b/tests/fixtures/generation_configs/glm4.7-9b-4bit.json
similarity index 100%
rename from tests/fixtures/generation_configs/glm4.7-9b.json
rename to tests/fixtures/generation_configs/glm4.7-9b-4bit.json
diff --git a/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py b/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py
index 6f4d902e..9b18c71e 100644
--- a/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py
+++ b/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py
@@ -9,7 +9,7 @@
     raw body ``...to=functions.get_weather...{"city": "Tokyo"}...``
     leaks into ``delta.content``.
 
-  * #480 (2026-05-28, gpt-oss-20b, ``tool_choice="auto"``): raw body
+  * #480 (2026-05-28, gpt-oss-20b-mxfp4-q8, ``tool_choice="auto"``): raw body
     ``commentary to=functions.get_weather json{"city":"Paris"}``
     leaks into ``delta.content``. User-facing duplicate of #444 with
     a different prompt — covered here to ensure the fix applies
@@ -31,7 +31,7 @@
 
 Test cases sourced verbatim from the issue body's repro section.
 
-Scope caveat (post-live-verification on gpt-oss-20b 2026-06-04):
+Scope caveat (post-live-verification on gpt-oss-20b-mxfp4-q8 2026-06-04):
 this file exercises the ``HarmonyToolParser`` streaming entry point
 in isolation against the FULL markered text (`<|channel|>commentary
 to=functions.X<|message|>{body}<|call|>`). The parser-level fix
@@ -85,7 +85,7 @@ class _Case:
 
 
 # Verbatim from the repro in issue #444 (and adjacent harmony commentary
-# formats observed on gpt-oss-20b). Each case is the FULL model output
+# formats observed on gpt-oss-20b-mxfp4-q8). Each case is the FULL model output
 # from the ``<|channel|>commentary`` token through the closing ``<|call|>``,
 # i.e. what the model emits when invoking exactly one tool.
 TEST_CASES: list[_Case] = [
diff --git a/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py b/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py
index b8734751..212bdfde 100644
--- a/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py
+++ b/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py
@@ -2,7 +2,7 @@
 """Regression guard for #448 — hermes streaming leaks `<function` prefix.
 
 Reported 2026-05-23 on ``mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit``
-(aliased via ``qwen3-coder-30b`` to the hermes tool parser). Qwen3-Coder
+(aliased via ``qwen3-coder-30b-4bit`` to the hermes tool parser). Qwen3-Coder
 emits the legacy ``<function=name>{...}</function>`` wire format. The
 hermes non-stream regex handles both ``<tool_call>...`` and bare
 ``<function=...`` blocks, but the streaming branch only short-circuits
diff --git a/tests/parsers/regressions/test_issue_455_harmony_commentary_tool_channel.py b/tests/parsers/regressions/test_issue_455_harmony_commentary_tool_channel.py
index 31aa3c70..925237be 100644
--- a/tests/parsers/regressions/test_issue_455_harmony_commentary_tool_channel.py
+++ b/tests/parsers/regressions/test_issue_455_harmony_commentary_tool_channel.py
@@ -285,7 +285,7 @@ def test_harmony_router_sanity(case: _SanityCase, router):
         "an optional `` json`` constrain directive, then the body. The "
         "router falls into the default branch, transitions to CONTENT, "
         "and leaks ``commentary`` + recipient + body as content text. "
-        "Live verification on gpt-oss-20b (2026-06-04) revealed an "
+        "Live verification on gpt-oss-20b-mxfp4-q8 (2026-06-04) revealed an "
         "additional constraint not modeled by these BUG_CASES synthetic "
         "vocab: production ``commentary`` is TWO tokens — ``comment`` "
         "(12606) + ``ary`` (815) — so naive single-token channel-type "
diff --git a/tests/parsers/regressions/test_issue_468_tool_choice_required_harmony_compound.py b/tests/parsers/regressions/test_issue_468_tool_choice_required_harmony_compound.py
index 977f5fca..510abf5f 100644
--- a/tests/parsers/regressions/test_issue_468_tool_choice_required_harmony_compound.py
+++ b/tests/parsers/regressions/test_issue_468_tool_choice_required_harmony_compound.py
@@ -186,7 +186,7 @@ def _stringify_structured(entry: object) -> str:
         "Issue #468 (router-level portion) — compound analysis + "
         "commentary sequence leaks the commentary block as CONTENT "
         "text. Same family-wide gap as #455. Live verification on "
-        "gpt-oss-20b (2026-06-04) confirmed the symptom AND surfaced "
+        "gpt-oss-20b-mxfp4-q8 (2026-06-04) confirmed the symptom AND surfaced "
         "the deeper constraint that breaks naive single-token fixes: "
         "production ``commentary`` is two tokens (``comment``+``ary``). "
         "Eventual fix must lookahead-decode the channel-type word or "
diff --git a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py
index b322bcb5..0657adab 100644
--- a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py
+++ b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py
@@ -8,7 +8,7 @@
 ``commentary`` + ``to=functions.<name>`` + optional
 ``<|constrain|>json`` + body + ``<|call|>``. PR #514 confirmed
 ``commentary`` is multi-token (``comment``+``ary``) on production
-gpt-oss-20b, which the custom token-ID-match state machine could
+gpt-oss-20b-mxfp4-q8, which the custom token-ID-match state machine could
 never identify.
 
 PR #515 lands the SOTA fix: delegate harmony state tracking to
@@ -80,7 +80,7 @@ def router(encoding):
 def _encode(encoding, text: str) -> list[int]:
     """Wrap encode with allowed_special=all so structural markers
     (``<|channel|>`` etc.) round-trip as single token IDs the way
-    a real gpt-oss-20b would emit them.
+    a real gpt-oss-20b-mxfp4-q8 would emit them.
     """
     return encoding.encode(text, allowed_special="all")
 
@@ -699,7 +699,7 @@ def encode(self, text, add_special_tokens=False):
         "my-not-gpt-oss-20b",
         "notgpt-oss-fake",
         "some-user/gpt-oss-remapped",
-        "evil-org/gpt-oss-20b",
+        "evil-org/gpt-oss-20b-mxfp4-q8",
         "anonymous/gpt-oss",
     )
     for name in rejected_names:
@@ -713,15 +713,15 @@ class _Fake(_CompatTokenizerBase):
         )
 
     accepted_names = (
-        "openai/gpt-oss-20b",
+        "openai/gpt-oss-20b-mxfp4-q8",
         "mlx-community/gpt-oss-20b-MXFP4-Q8",
         "unsloth/gpt-oss-20b-MLX-8bit",
-        "gpt-oss-20b",
+        "gpt-oss-20b-mxfp4-q8",
         "gpt-oss",
-        "/models/gpt-oss-20b",
-        "~/lmstudio-models/gpt-oss-20b",
+        "/models/gpt-oss-20b-mxfp4-q8",
+        "~/lmstudio-models/gpt-oss-20b-mxfp4-q8",
         "./gpt-oss-20b-quantized",
-        "../models/gpt-oss-20b",
+        "../models/gpt-oss-20b-mxfp4-q8",
     )
     for name in accepted_names:
 
diff --git a/tests/test_aliases_contract.py b/tests/test_aliases_contract.py
index e4784d4b..c007dfe4 100644
--- a/tests/test_aliases_contract.py
+++ b/tests/test_aliases_contract.py
@@ -436,27 +436,27 @@ def test_negative_control_dflash_missing_drafter_is_caught() -> None:
 
 def test_audit_batch_reasoning_parser_wirings() -> None:
     """Pin the Model Onboarding SOP audit fixes for reasoning_parser
-    on nemotron / kimi-k2.5 / hermes4 aliases. Each was previously
+    on nemotron / kimi-k2.5-3bit / hermes4 aliases. Each was previously
     ``null`` despite the model emitting ``<think>``/``</think>``
     blocks — without the parser, those blocks leak into
     ``message.content``.
 
     Parser choice rationale:
-    - nemotron-30b/nano + kimi-k2.5 use a Qwen3-style template that
+    - nemotron-30b-4bit + kimi-k2.5-3bit use a Qwen3-style template that
       INJECTS ``<think>`` into the prompt (gated by ``enable_thinking``
       / ``thinking`` flag). ``qwen3`` parser's ``finalize_streaming``
       correction handles the "no </think> ever appeared → emit as
       content" case correctly.
-    - hermes4-70b: the chat template does NOT inject ``<think>``;
+    - hermes4-70b-4bit: the chat template does NOT inject ``<think>``;
       the model decides autonomously. Same contract as GLM-4 → reuse
       ``glm4`` parser (no-tags-yet → content semantics).
     """
     profiles = list_profiles()
     expected = {
-        "nemotron-30b": "qwen3",
-        "nemotron-nano": "qwen3",
-        "kimi-k2.5": "qwen3",
-        "hermes4-70b": "glm4",
+        "nemotron-30b-4bit": "qwen3",
+        # nemotron-nano was dropped; it now resolves to nemotron-30b-4bit above.
+        "kimi-k2.5-3bit": "qwen3",
+        "hermes4-70b-4bit": "glm4",
     }
     for alias, parser in expected.items():
         assert alias in profiles, f"{alias} missing from aliases.json"
@@ -482,7 +482,7 @@ def test_bonsai_family_wires_glm4_reasoning_parser() -> None:
     zero behavioural downside for non-thinking turns.
     """
     profiles = list_profiles()
-    for alias in ("bonsai-1.7b", "bonsai-4b", "bonsai-8b"):
+    for alias in ("bonsai-1.7b-unpacked", "bonsai-4b-unpacked", "bonsai-8b-unpacked"):
         assert alias in profiles, f"{alias} missing from aliases.json"
         assert profiles[alias].reasoning_parser == "glm4", (
             f"{alias}: reasoning_parser must be 'glm4' per audit. "
@@ -499,7 +499,7 @@ def test_audit_batch_bonsai_tool_call_parser_wired() -> None:
     https://huggingface.co/prism-ml/Bonsai-1.7B-unpacked.
     """
     profiles = list_profiles()
-    for alias in ("bonsai-1.7b", "bonsai-4b", "bonsai-8b"):
+    for alias in ("bonsai-1.7b-unpacked", "bonsai-4b-unpacked", "bonsai-8b-unpacked"):
         assert alias in profiles, f"{alias} missing from aliases.json"
         assert profiles[alias].tool_call_parser == "hermes", (
             f"{alias}: tool_call_parser must be 'hermes' per audit. "
@@ -519,7 +519,7 @@ def test_deepseek_v4_flash_family_wires_deepseek_r1_reasoning_parser() -> None:
     """
     profiles = list_profiles()
     family = [
-        "deepseek-v4-flash",
+        # deepseek-v4-flash (codename) was dropped; see the 8-bit entry below.
         "deepseek-v4-flash-2bit",
         "deepseek-v4-flash-4bit",
         "deepseek-v4-flash-8bit",
@@ -545,44 +545,44 @@ def test_aliases_with_known_broken_hf_paths_stay_fixed() -> None:
     change" commit doesn't quietly restore the broken path.
     """
     profiles = list_profiles()
-    # qwen3-vl-4b: stale ``-MLX-`` suffix not used by upstream uploads
-    assert "MLX-4bit" not in profiles["qwen3-vl-4b"].hf_path, (
-        "qwen3-vl-4b previously pointed at "
+    # qwen3-vl-4b-4bit: stale ``-MLX-`` suffix not used by upstream uploads
+    assert "MLX-4bit" not in profiles["qwen3-vl-4b-4bit"].hf_path, (
+        "qwen3-vl-4b-4bit previously pointed at "
         "mlx-community/Qwen3-VL-4B-Instruct-MLX-4bit which 404s; the "
         "current upload is Qwen3-VL-4B-Instruct-4bit (no '-MLX-' suffix)."
     )
-    # devstral-24b: ``2503`` snapshot was never re-uploaded as MLX-4bit;
+    # devstral-24b-4bit: ``2503`` snapshot was never re-uploaded as MLX-4bit;
     # 2505/2507 are the canonical Devstral-Small v1 releases.
-    assert "2503" not in profiles["devstral-24b"].hf_path, (
-        "devstral-24b previously pointed at Devstral-Small-2503-MLX-4bit "
+    assert "2503" not in profiles["devstral-24b-4bit"].hf_path, (
+        "devstral-24b-4bit previously pointed at Devstral-Small-2503-MLX-4bit "
         "which 404s. Use the 2507 (or 2505) MLX 4-bit upload."
     )
-    # glm4.5-air: ``-0111-`` date suffix was a community-only tag that
+    # glm4.5-air-4bit: ``-0111-`` date suffix was a community-only tag that
     # got rolled into the default release.
-    assert "0111" not in profiles["glm4.5-air"].hf_path, (
-        "glm4.5-air previously pointed at GLM-4.5-Air-0111-4bit which "
+    assert "0111" not in profiles["glm4.5-air-4bit"].hf_path, (
+        "glm4.5-air-4bit previously pointed at GLM-4.5-Air-0111-4bit which "
         "404s. The current canonical upload is GLM-4.5-Air-4bit."
     )
-    # glm4.7-9b previously pointed at the full GLM-4.7 (355B MoE,
+    # glm4.7-9b-4bit previously pointed at the full GLM-4.7 (355B MoE,
     # ~185 GB at 4-bit) — the alias name implies a 9B model. The
     # correct upload is the Flash variant (~16 GB).
-    assert "Flash" in profiles["glm4.7-9b"].hf_path, (
-        "glm4.7-9b must point at the GLM-4.7-Flash upload, not the full "
+    assert "Flash" in profiles["glm4.7-9b-4bit"].hf_path, (
+        "glm4.7-9b-4bit must point at the GLM-4.7-Flash upload, not the full "
         "GLM-4.7 (355B MoE) which is ~12x larger and won't fit on most "
         "user disks."
     )
-    # gpt-oss-20b previously pointed at mlx-community/GPT-OSS-20B-4bit
+    # gpt-oss-20b-mxfp4-q8 previously pointed at mlx-community/GPT-OSS-20B-4bit
     # which 404s; the canonical mlx-community release uses the
     # MXFP4-Q8 hybrid quantization.
-    assert profiles["gpt-oss-20b"].hf_path != "mlx-community/GPT-OSS-20B-4bit", (
-        "gpt-oss-20b must not regress to the 404 path; current canonical "
+    assert profiles["gpt-oss-20b-mxfp4-q8"].hf_path != "mlx-community/GPT-OSS-20B-4bit", (
+        "gpt-oss-20b-mxfp4-q8 must not regress to the 404 path; current canonical "
         "upload is mlx-community/gpt-oss-20b-MXFP4-Q8."
     )
-    # kimi-48b previously pointed at mlx-community/Kimi-K2-Instruct-Q4_0-MLX
+    # kimi-48b-4bit previously pointed at mlx-community/Kimi-K2-Instruct-Q4_0-MLX
     # (404). The replacement Kimi-K2-Instruct-4bit is large
     # (~540 GB) but is the actual mlx-community Kimi K2 Instruct release.
-    assert "Q4_0" not in profiles["kimi-48b"].hf_path, (
-        "kimi-48b must not regress to the Q4_0 path which 404s."
+    assert "Q4_0" not in profiles["kimi-48b-4bit"].hf_path, (
+        "kimi-48b-4bit must not regress to the Q4_0 path which 404s."
     )
 
 
@@ -603,46 +603,46 @@ def test_aliases_with_known_broken_hf_paths_stay_fixed() -> None:
     # Devstral 1.x — Mistral code-tuned model card example uses 0.15
     # for interactive coding (see model card on huggingface.co/mistralai).
     # Devstral 2.x ships the same empty stub; same pattern applies.
-    "devstral-24b": {"temperature": 0.15},
-    "devstral-v2-24b": {"temperature": 0.15},
+    "devstral-24b-4bit": {"temperature": 0.15},
+    "devstral-v2-24b-4bit": {"temperature": 0.15},
     # Gemma 3 family — Google's Gemma docs recommend
     # (temperature=1.0, top_p=0.95, top_k=64) for the chat-tuned models.
     # All of gemma-3-1b / gemma-3-12b / gemma-3-27b ship an empty stub
     # locally (`_from_model_config: true` plus eos/pad tokens only).
-    "gemma3-1b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
-    "gemma3-12b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
-    "gemma3-27b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma3-1b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma3-12b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma3-27b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     # gemma-3n-E4B ships top_p=0.95 and top_k=64 upstream but no
     # temperature. We bake in the full triple anyway (matches the
     # rest of the Gemma family) so a future mlx-community re-quant
     # that drops generation_config.json doesn't silently regress to
     # the framework fallback (0.7 / 0.9).
-    "gemma-3n-e4b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-3n-e4b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     # Gemma 4 — official Google sampling guidance hasn't been
     # published yet at the time of writing; we extrapolate from the
     # Gemma 3 family card. Revisit when an official Gemma 4 doc lands.
-    "gemma-4-12b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-4-12b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     "gemma-4-12b-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
-    "gemma-4-26b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
-    "gemma-4-31b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-4-26b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-4-31b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     "gemma-4-31b-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     # Gemma 4 QAT variants — same sampling as PTQ siblings. QAT changes
     # weight distribution (training with simulated quantization) not the
     # decoding distribution, so Google's chat sampling guidance applies
     # unchanged.
-    "gemma-4-12b-qat": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-4-12b-qat-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     "gemma-4-12b-qat-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
-    "gemma-4-26b-qat": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
-    "gemma-4-31b-qat": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-4-26b-qat-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
+    "gemma-4-31b-qat-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     "gemma-4-31b-qat-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0},
     # GLM-4.5-Air — THUDM publishes two recommendations: temperature=0.6
     # for *thinking* mode, ~1.0 for non-thinking. The alias has
     # reasoning_parser=glm4 → thinking IS the default response path,
     # so 0.6 is the right pick. (Users who want non-thinking can pass
     # temperature explicitly per-request.)
-    "glm4.5-air": {"temperature": 0.6, "top_p": 0.95},
+    "glm4.5-air-4bit": {"temperature": 0.6, "top_p": 0.95},
     # GLM-4.7-Flash ships temperature=1.0 upstream; we add only top_p.
-    "glm4.7-9b": {"top_p": 0.95},
+    "glm4.7-9b-4bit": {"top_p": 0.95},
 }
 
 
diff --git a/tests/test_anthropic_stop_sequences.py b/tests/test_anthropic_stop_sequences.py
index ece2579b..e8ea6223 100644
--- a/tests/test_anthropic_stop_sequences.py
+++ b/tests/test_anthropic_stop_sequences.py
@@ -6,7 +6,7 @@
 NOT ``stop`` — so ``stop_sequences`` from the request flowed through
 ``anthropic_to_openai`` into ``openai_request.stop`` and then died at the
 route boundary. Engine ran uncapped, model emitted past the user's stop
-tokens. Surfaced by the iter8 onboarding sweep on gpt-oss-20b: same
+tokens. Surfaced by the iter8 onboarding sweep on gpt-oss-20b-mxfp4-q8: same
 prompt + ``stop_sequences:["STOPHERE"]`` returned full text including
 "STOPHERE", finish_reason=end_turn. Identical prompt via /v1/chat/
 completions with ``stop:["STOPHERE"]`` stopped correctly. Fix: include
diff --git a/tests/test_api_validation_bundle.py b/tests/test_api_validation_bundle.py
index 440b76dd..595f7aa0 100644
--- a/tests/test_api_validation_bundle.py
+++ b/tests/test_api_validation_bundle.py
@@ -541,7 +541,7 @@ def test_array_of_strings_still_works(self):
 
 class TestPsCommandPortParsing:
     """``rapid-mlx ps`` used to break on the first positional argument,
-    so ``serve qwen3.5-4b --port 8005`` showed port=8000 (the default).
+    so ``serve qwen3.5-4b-4bit --port 8005`` showed port=8000 (the default).
     Verify the parser keeps scanning for flags after capturing the
     positional model."""
 
@@ -575,28 +575,28 @@ def _parse_serve(self, cmd_words):
 
     def test_port_after_positional_model(self):
         model, port = self._parse_serve(
-            ["rapid-mlx", "serve", "qwen3.5-4b", "--port", "8005"]
+            ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--port", "8005"]
         )
-        assert model == "qwen3.5-4b"
+        assert model == "qwen3.5-4b-4bit"
         assert port == "8005"
 
     def test_port_before_positional_model(self):
         model, port = self._parse_serve(
-            ["rapid-mlx", "serve", "--port", "8005", "qwen3.5-4b"]
+            ["rapid-mlx", "serve", "--port", "8005", "qwen3.5-4b-4bit"]
         )
-        assert model == "qwen3.5-4b"
+        assert model == "qwen3.5-4b-4bit"
         assert port == "8005"
 
     def test_port_equals_form(self):
         model, port = self._parse_serve(
-            ["rapid-mlx", "serve", "qwen3.5-4b", "--port=9000"]
+            ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--port=9000"]
         )
-        assert model == "qwen3.5-4b"
+        assert model == "qwen3.5-4b-4bit"
         assert port == "9000"
 
     def test_no_port_uses_default(self):
-        model, port = self._parse_serve(["rapid-mlx", "serve", "qwen3.5-4b"])
-        assert model == "qwen3.5-4b"
+        model, port = self._parse_serve(["rapid-mlx", "serve", "qwen3.5-4b-4bit"])
+        assert model == "qwen3.5-4b-4bit"
         assert port == "8000"
 
 
diff --git a/tests/test_batched_engine_output_router.py b/tests/test_batched_engine_output_router.py
index 41a4c5ab..348d7bba 100644
--- a/tests/test_batched_engine_output_router.py
+++ b/tests/test_batched_engine_output_router.py
@@ -426,7 +426,7 @@ async def test_router_tool_call_body_preserved_single_token_flush(family):
     chunks reuse the scheduler's detokenized ``output.new_text`` would, if
     applied to TOOL_CALL events, override the accumulated body with just the
     end-marker token's text, silently dropping the call body. Caught
-    post-v0.6.61 on gemma-4-26b — non-stream extracted a valid tool call
+    post-v0.6.61 on gemma-4-26b-4bit — non-stream extracted a valid tool call
     from the same generation that streaming returned as bare content.
 
     Parametrized over every router-allowlist family that emits TOOL_CALL
diff --git a/tests/test_bench_vs_ollama.py b/tests/test_bench_vs_ollama.py
index d37c9fcb..9a0d8b29 100644
--- a/tests/test_bench_vs_ollama.py
+++ b/tests/test_bench_vs_ollama.py
@@ -29,8 +29,8 @@ def test_default_model_pairs():
     pairs = bench.default_model_pairs()
 
     assert pairs == [
-        bench.ModelPair("qwen3.5-4b", "qwen3.5:4b"),
-        bench.ModelPair("qwen3.5-9b", "qwen3.5:9b"),
+        bench.ModelPair("qwen3.5-4b-4bit", "qwen3.5:4b"),
+        bench.ModelPair("qwen3.5-9b-4bit", "qwen3.5:9b"),
     ]
 
 
@@ -46,7 +46,7 @@ def test_parse_model_pair_rejects_missing_separator():
     bench = load_bench_module()
 
     with pytest.raises(ValueError, match="RAPID=OLLAMA"):
-        bench.parse_model_pair("qwen3.5-9b")
+        bench.parse_model_pair("qwen3.5-9b-4bit")
 
 
 def test_parse_args_replaces_default_model_pairs():
@@ -285,13 +285,13 @@ def test_build_rapid_mlx_payload_is_deterministic_no_thinking():
     bench = load_bench_module()
 
     payload = bench.build_rapid_mlx_payload(
-        model="qwen3.5-9b",
+        model="qwen3.5-9b-4bit",
         messages=[{"role": "user", "content": "hi"}],
         max_tokens=32,
         stream=True,
     )
 
-    assert payload["model"] == "qwen3.5-9b"
+    assert payload["model"] == "qwen3.5-9b-4bit"
     assert payload["temperature"] == 0
     assert payload["enable_thinking"] is False
     assert payload["stream"] is True
@@ -414,7 +414,7 @@ def test_render_markdown_includes_model_table_and_speedups():
         },
         "model_pairs": [
             {
-                "rapid_mlx_model": "qwen3.5-9b",
+                "rapid_mlx_model": "qwen3.5-9b-4bit",
                 "ollama_model": "qwen3.5:9b",
                 "rapid-mlx": {
                     "summary": {
@@ -437,7 +437,7 @@ def test_render_markdown_includes_model_table_and_speedups():
     markdown = bench.render_markdown(result)
 
     assert "# Rapid-MLX vs Ollama Benchmark" in markdown
-    assert "qwen3.5-9b vs qwen3.5:9b" in markdown
+    assert "qwen3.5-9b-4bit vs qwen3.5:9b" in markdown
     assert "| Decode tok/s | 120.0 | 40.0 | 3.00x |" in markdown
     assert "| TTFT | 100.0 ms | 250.0 ms | 2.50x |" in markdown
     assert "- Rapid-MLX: `rapid-mlx 0.2.0`" in markdown
@@ -454,7 +454,7 @@ def test_render_markdown_surfaces_engine_errors():
         "config": {"runs": 1, "concurrency": [1]},
         "model_pairs": [
             {
-                "rapid_mlx_model": "qwen3.5-9b",
+                "rapid_mlx_model": "qwen3.5-9b-4bit",
                 "ollama_model": "qwen3.5:9b",
                 "rapid-mlx": {"error": "boom"},
                 "ollama": {"summary": {"stream": {"decode_tok_s": 40.0}}},
@@ -474,7 +474,7 @@ def test_render_markdown_surfaces_workload_errors():
         "config": {"runs": 1, "concurrency": [1]},
         "model_pairs": [
             {
-                "rapid_mlx_model": "qwen3.5-9b",
+                "rapid_mlx_model": "qwen3.5-9b-4bit",
                 "ollama_model": "qwen3.5:9b",
                 "rapid-mlx": {
                     "errors": [
@@ -515,7 +515,7 @@ def test_render_markdown_tolerates_none_engine_payloads():
         "config": {"runs": 1, "concurrency": [1]},
         "model_pairs": [
             {
-                "rapid_mlx_model": "qwen3.5-9b",
+                "rapid_mlx_model": "qwen3.5-9b-4bit",
                 "ollama_model": "qwen3.5:9b",
                 "rapid-mlx": None,
                 "ollama": {"error": "ollama down"},
@@ -560,13 +560,13 @@ def test_build_rapid_mlx_command_includes_explicit_benchmark_settings():
     bench = load_bench_module()
 
     cmd = bench.build_rapid_mlx_command(
-        "qwen3.5-9b", 9123, ["--prefill-step-size", "4096"]
+        "qwen3.5-9b-4bit", 9123, ["--prefill-step-size", "4096"]
     )
 
     assert cmd == [
         "rapid-mlx",
         "serve",
-        "qwen3.5-9b",
+        "qwen3.5-9b-4bit",
         "--host",
         "127.0.0.1",
         "--port",
@@ -642,9 +642,9 @@ def test_build_engine_success_result_shape():
 
     result = bench.build_engine_success_result(
         engine="rapid-mlx",
-        model="qwen3.5-9b",
+        model="qwen3.5-9b-4bit",
         port=9123,
-        command=["rapid-mlx", "serve", "qwen3.5-9b"],
+        command=["rapid-mlx", "serve", "qwen3.5-9b-4bit"],
         raw_runs={"stream": [{"ttft_ms": 100.0}]},
         summary={"stream": {"ttft_ms": 100.0}},
         errors=[],
@@ -653,13 +653,13 @@ def test_build_engine_success_result_shape():
     )
 
     assert result["engine"] == "rapid-mlx"
-    assert result["model"] == "qwen3.5-9b"
+    assert result["model"] == "qwen3.5-9b-4bit"
     assert result["port"] == 9123
-    assert result["command"] == ["rapid-mlx", "serve", "qwen3.5-9b"]
+    assert result["command"] == ["rapid-mlx", "serve", "qwen3.5-9b-4bit"]
     assert result["server"]["url"] == "http://127.0.0.1:9123"
     assert result["runtime"]["prepared"] is True
     assert result["metadata"]["engine"] == "rapid-mlx"
-    assert result["metadata"]["model"] == "qwen3.5-9b"
+    assert result["metadata"]["model"] == "qwen3.5-9b-4bit"
     assert result["raw_runs"]["stream"] == [{"ttft_ms": 100.0}]
     assert result["summary"]["stream"] == {"ttft_ms": 100.0}
     assert result["errors"] == []
diff --git a/tests/test_chat_logprobs_channel_routing.py b/tests/test_chat_logprobs_channel_routing.py
index de24d761..a3b68f9d 100644
--- a/tests/test_chat_logprobs_channel_routing.py
+++ b/tests/test_chat_logprobs_channel_routing.py
@@ -11,7 +11,7 @@
 back to the text-regex parser, which leaked analysis-channel content into
 ``message.content`` and dropped ``reasoning_content`` entirely.
 
-Surfaced by the iter7 onboarding sweep on gpt-oss-20b: identical request
+Surfaced by the iter7 onboarding sweep on gpt-oss-20b-mxfp4-q8: identical request
 with vs without ``logprobs:true`` produced different channel routing — same
 shape as #442 (PR #443) but on the logprobs codepath instead of truncated
 output. Fix: accumulate ``new_text`` by ``channel`` while iterating the
diff --git a/tests/test_chat_route_tool_tag_leak.py b/tests/test_chat_route_tool_tag_leak.py
index 14049964..e9f9e9e3 100644
--- a/tests/test_chat_route_tool_tag_leak.py
+++ b/tests/test_chat_route_tool_tag_leak.py
@@ -133,7 +133,7 @@ def test_no_tool_call_path_preserves_content(self):
         _assert_no_leak(content)
 
     def test_parser_finds_nothing_preserves_existing_cleaned_text(self):
-        # Regression for the v0.6.64 gpt-oss-20b empty-TextBlock bug:
+        # Regression for the v0.6.64 gpt-oss-20b-mxfp4-q8 empty-TextBlock bug:
         # ``engine.generate()`` runs ``clean_output_text`` on harmony
         # output, which strips channel markup and returns just the
         # final-channel content ("4"). The non-streaming route then
diff --git a/tests/test_chat_streaming_spec.py b/tests/test_chat_streaming_spec.py
index b600e3c1..37182f9f 100644
--- a/tests/test_chat_streaming_spec.py
+++ b/tests/test_chat_streaming_spec.py
@@ -4,7 +4,7 @@
 Companion to ``tests/test_chat_streaming_guided.py``. PR #422 pinned these
 invariants on the guided helper (``stream_chat_completion_guided``) but
 the regular ``stream_chat_completion`` path retained two spec violations
-that the 2026-05-20 ≥20B onboarding sweep caught on qwen3.5-35b
+that the 2026-05-20 ≥20B onboarding sweep caught on qwen3.5-35b-8bit
 (see knowledge/guided_generation_gaps_2026-05-20.md, "Bug B"):
 
 1. **``created`` drift** — content chunks share one timestamp but the
@@ -130,7 +130,7 @@ def test_non_guided_streaming_pins_single_created_timestamp(monkeypatch):
     without ``created=...`` and inherited the default factory's fresh
     ``time.time()`` per construction. On slow MoE models the gap
     between first content chunk and finish chunk was 5-7s (Agent A7,
-    qwen3.5-35b, 2026-05-20 sweep).
+    qwen3.5-35b-8bit, 2026-05-20 sweep).
 
     Patches ``time.time`` to advance one second per call so the bug is
     deterministically observable in a unit test (real wall-clock under
@@ -264,7 +264,7 @@ class _GapStreamParser(ToolParser):
     call (returns plain content), but the non-stream ``extract_tool_calls``
     catches it via a fallback pattern.
 
-    Mirrors the gemma-4-26b case from the 2026-05-20 sweep where
+    Mirrors the gemma-4-26b-4bit case from the 2026-05-20 sweep where
     streaming dropped a tool call that the non-stream parser handled.
     """
 
diff --git a/tests/test_cli_argcomplete.py b/tests/test_cli_argcomplete.py
index ac4f6be9..b034c1cd 100644
--- a/tests/test_cli_argcomplete.py
+++ b/tests/test_cli_argcomplete.py
@@ -10,7 +10,7 @@
 2. ``alias_completer`` returns aliases filtered by prefix (the actual
    contract argcomplete invokes per keystroke).
 3. ``alias_csv_completer`` correctly carries the comma-separated
-   prefix forward so ``--models qwen3.5-4b,gem<TAB>`` expands the
+   prefix forward so ``--models qwen3.5-4b-4bit,gem<TAB>`` expands the
    trailing token only.
 """
 
@@ -69,10 +69,10 @@ def test_alias_completer_filters_by_prefix() -> None:
     # membership catches a silent regression where the QAT entries get
     # dropped from aliases.json without the test failing.
     qat_aliases = {
-        "gemma-4-12b-qat",
+        "gemma-4-12b-qat-4bit",
         "gemma-4-12b-qat-8bit",
-        "gemma-4-26b-qat",
-        "gemma-4-31b-qat",
+        "gemma-4-26b-qat-4bit",
+        "gemma-4-31b-qat-4bit",
         "gemma-4-31b-qat-8bit",
     }
     missing = qat_aliases - set(result)
@@ -120,13 +120,13 @@ def test_alias_csv_completer_first_token() -> None:
 
 
 def test_alias_csv_completer_appends_to_existing_csv() -> None:
-    """``--models qwen3.5-4b,gem<TAB>`` should expand only the
+    """``--models qwen3.5-4b-4bit,gem<TAB>`` should expand only the
     trailing token but emit the full re-assembled value so the shell
-    inserts ``qwen3.5-4b,gemma-4-12b`` rather than dropping the head."""
-    result = alias_csv_completer("qwen3.5-4b,gemma-4-")
-    assert all(m.startswith("qwen3.5-4b,gemma-4-") for m in result), (
+    inserts ``qwen3.5-4b-4bit,gemma-4-12b-4bit`` rather than dropping the head."""
+    result = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-")
+    assert all(m.startswith("qwen3.5-4b-4bit,gemma-4-") for m in result), (
         f"csv completer dropped the head before the comma: "
-        f"{[m for m in result if not m.startswith('qwen3.5-4b,')]}"
+        f"{[m for m in result if not m.startswith('qwen3.5-4b-4bit,')]}"
     )
     assert len(result) >= 5, "should match at least 5 gemma-4-* tokens"
 
@@ -135,8 +135,8 @@ def test_alias_csv_completer_multiple_commas() -> None:
     """``--models a,b,c<TAB>`` only completes ``c``; ``a,b,`` is
     carried through unchanged. Lock this in because rpartition vs
     partition is an easy-to-flip bug."""
-    result = alias_csv_completer("qwen3.5-4b,gemma-4-12b,qwen3.6-")
-    assert all(m.startswith("qwen3.5-4b,gemma-4-12b,qwen3.6-") for m in result), (
+    result = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-")
+    assert all(m.startswith("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") for m in result), (
         "csv completer must preserve all prior csv tokens"
     )
 
@@ -156,8 +156,8 @@ def test_alias_csv_completer_handles_whitespace_after_comma() -> None:
     must match that contract so users who naturally type the
     human-friendly ``a, b, c`` shape get suggestions instead of an
     empty list."""
-    spaced = alias_csv_completer("qwen3.5-4b, gemma-4-")
-    tight = alias_csv_completer("qwen3.5-4b,gemma-4-")
+    spaced = alias_csv_completer("qwen3.5-4b-4bit, gemma-4-")
+    tight = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-")
     assert len(spaced) == len(tight), (
         "csv completer must produce the same number of matches whether "
         "the user typed `a,b` or `a, b`"
diff --git a/tests/test_cli_chat.py b/tests/test_cli_chat.py
index 3aa4e781..3718449e 100644
--- a/tests/test_cli_chat.py
+++ b/tests/test_cli_chat.py
@@ -107,8 +107,8 @@ def test_chat_no_model_defaults_to_qwen35_4b():
     # signals the default plumbed through. The canonical alias is the one
     # we documented as the default; confirm via the round-trip name.
     assert (
-        args.model == "qwen3.5-4b"
-        or getattr(args, "_original_alias", None) == "qwen3.5-4b"
+        args.model == "qwen3.5-4b-4bit"
+        or getattr(args, "_original_alias", None) == "qwen3.5-4b-4bit"
     )
 
 
@@ -116,15 +116,15 @@ def test_chat_with_alias_overrides_default():
     """`rapid-mlx chat <alias>` uses the user-supplied alias, not the default."""
     captured: list = []
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "chat", "smollm3-3b"]),
+        patch.object(sys, "argv", ["rapid-mlx", "chat", "smollm3-3b-4bit"]),
         patch.object(cli, "chat_command", side_effect=captured.append),
     ):
         cli.main()
     assert len(captured) == 1
     args = captured[0]
     assert (
-        args.model == "smollm3-3b"
-        or getattr(args, "_original_alias", None) == "smollm3-3b"
+        args.model == "smollm3-3b-4bit"
+        or getattr(args, "_original_alias", None) == "smollm3-3b-4bit"
     )
 
 
@@ -209,7 +209,7 @@ def test_chat_command_repl_multi_turn(monkeypatch, capsys):
         ns.temperature = 0.0
         ns.ready_timeout = 5
         ns.response_timeout = 5
-        ns.model = "qwen3.5-4b"
+        ns.model = "qwen3.5-4b-4bit"
 
         cli.chat_command(ns)
 
@@ -236,7 +236,7 @@ def test_chat_command_system_prompt_prepended(monkeypatch):
         ns.temperature = 0.0
         ns.ready_timeout = 5
         ns.response_timeout = 5
-        ns.model = "qwen3.5-4b"
+        ns.model = "qwen3.5-4b-4bit"
         cli.chat_command(ns)
     assert payloads[0]["messages"][0] == {"role": "system", "content": "be terse"}
     assert payloads[0]["messages"][1] == {"role": "user", "content": "q1"}
@@ -246,7 +246,7 @@ def test_chat_command_default_thinking_off_sends_enable_thinking_false(monkeypat
     """Chat REPL defaults to thinking OFF.
 
     Reasoning models like Qwen3.5 otherwise leak raw chain-of-thought into
-    the user-visible REPL output, and on the default qwen3.5-4b model
+    the user-visible REPL output, and on the default qwen3.5-4b-4bit model
     degenerate into infinite repetition until max-tokens — producing zero
     usable output for a brand-new user. Pinning the default here so a
     refactor doesn't silently restore the broken behavior shipped in 0.6.26.
@@ -264,7 +264,7 @@ def test_chat_command_default_thinking_off_sends_enable_thinking_false(monkeypat
         ns.temperature = 0.0
         ns.ready_timeout = 5
         ns.response_timeout = 5
-        ns.model = "qwen3.5-4b"
+        ns.model = "qwen3.5-4b-4bit"
         cli.chat_command(ns)
     assert payloads[0].get("enable_thinking") is False
     # The unsupported nested form must NOT be present.
@@ -288,7 +288,7 @@ def test_chat_command_explicit_think_omits_enable_thinking_field(monkeypatch):
         ns.temperature = 0.0
         ns.ready_timeout = 5
         ns.response_timeout = 5
-        ns.model = "qwen3.5-4b"
+        ns.model = "qwen3.5-4b-4bit"
         cli.chat_command(ns)
     assert "enable_thinking" not in payloads[0]
 
@@ -330,7 +330,7 @@ def test_chat_command_survives_connection_failure(monkeypatch, capsys):
     ns.temperature = 0.0
     ns.ready_timeout = 1
     ns.response_timeout = 2
-    ns.model = "qwen3.5-4b"
+    ns.model = "qwen3.5-4b-4bit"
     # Should not raise — REPL prints "Request failed" and continues to "exit".
     cli.chat_command(ns)
     captured = capsys.readouterr()
@@ -387,7 +387,7 @@ def _capture(self):
         ns.temperature = 0.0
         ns.ready_timeout = 5
         ns.response_timeout = 5
-        ns.model = "qwen3.5-4b"
+        ns.model = "qwen3.5-4b-4bit"
         cli.chat_command(ns)
 
         _ErrHandler.do_POST = orig  # type: ignore[assignment]
@@ -412,7 +412,7 @@ def _ns_for_chat(port: int, **overrides) -> object:
     ns.temperature = 0.0
     ns.ready_timeout = 5
     ns.response_timeout = 5
-    ns.model = "qwen3.5-4b"
+    ns.model = "qwen3.5-4b-4bit"
     for k, v in overrides.items():
         setattr(ns, k, v)
     return ns
@@ -756,7 +756,7 @@ def test_chat_command_save_refuses_on_empty_conversation(monkeypatch, tmp_path,
 
 
 def test_stream_chat_response_aborts_on_no_whitespace_repetition(monkeypatch):
-    """The new char-level guard must fire on the qwen3.5-4b regression
+    """The new char-level guard must fire on the qwen3.5-4b-4bit regression
     where the model emits ``BarleyBarleyBarley...`` with NO whitespace
     separator. The whitespace-token guard cannot catch this — a single
     chunk of 6000 chars splits to one token whose count is 1.
@@ -1105,7 +1105,7 @@ def _fake_wait(base_url, proc, timeout_s):
         port = fake_port
         inputs = iter(["first turn", "/model bogus", "second turn", "exit"])
         monkeypatch.setattr("builtins.input", lambda _p="": next(inputs))
-        ns = _ns_for_chat(fake_port, model="qwen3.5-4b")
+        ns = _ns_for_chat(fake_port, model="qwen3.5-4b-4bit")
         ns.base_url = None
         ns.port = None
         cli.chat_command(ns)
@@ -1142,7 +1142,7 @@ def test_chat_command_slash_command_dispatch_uses_exact_match(
         inputs = iter(
             [
                 f"/savefoo {target}",
-                "/modelfoo qwen3.5-4b",
+                "/modelfoo qwen3.5-4b-4bit",
                 "exit",
             ]
         )
@@ -1221,12 +1221,12 @@ def test_stream_chat_response_repetition_truncates_at_cutoff_in_one_chunk(
 def test_chat_think_bumps_max_tokens_default_to_4096():
     """``--think`` with no explicit ``--max-tokens`` raises the default
     from 2048 to 4096 so reasoning + final answer fit a small-model
-    budget. Round-1 finding: ``chat qwen3.5-4b --think`` consumed the
+    budget. Round-1 finding: ``chat qwen3.5-4b-4bit --think`` consumed the
     full 2048 budget with reasoning alone and emitted an empty answer
     with ``finish_reason='length'``."""
     captured: list = []
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b", "--think"]),
+        patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b-4bit", "--think"]),
         patch.object(cli, "chat_command", side_effect=captured.append),
     ):
         cli.main()
@@ -1392,7 +1392,7 @@ def test_chat_port_unbound_exits_with_friendly_error(capsys, monkeypatch):
     ns.temperature = 0.0
     ns.ready_timeout = 1
     ns.response_timeout = 1
-    ns.model = "qwen3.5-4b"
+    ns.model = "qwen3.5-4b-4bit"
     with pytest.raises(SystemExit) as exc:
         cli.chat_command(ns)
     assert exc.value.code == 1
@@ -1411,7 +1411,7 @@ def test_run_is_alias_for_chat(monkeypatch):
     same args as ``rapid-mlx chat <model>``."""
     captured: list = []
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "run", "qwen3.5-4b"]),
+        patch.object(sys, "argv", ["rapid-mlx", "run", "qwen3.5-4b-4bit"]),
         patch.object(cli, "chat_command", side_effect=captured.append),
     ):
         cli.main()
@@ -1432,7 +1432,7 @@ def test_run_alias_accepts_chat_flags():
         patch.object(
             sys,
             "argv",
-            ["rapid-mlx", "run", "qwen3.5-4b", "--think", "--max-tokens", "1024"],
+            ["rapid-mlx", "run", "qwen3.5-4b-4bit", "--think", "--max-tokens", "1024"],
         ),
         patch.object(cli, "chat_command", side_effect=captured.append),
     ):
@@ -1534,7 +1534,7 @@ def test_serve_accepts_no_think_as_alias_for_no_thinking():
     ``no_thinking=True`` destination as ``serve --no-thinking``."""
     captured: list = []
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b", "--no-think"]),
+        patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-think"]),
         patch.object(cli, "serve_command", side_effect=captured.append),
     ):
         cli.main()
@@ -1944,7 +1944,7 @@ def getsockname(self):
     monkeypatch.setattr("subprocess.Popen", _FakePopen)
 
     log_path = tmp_path / "fake.log"
-    proc, base_url = cli._spawn_chat_server("qwen3.5-4b", str(log_path))
+    proc, base_url = cli._spawn_chat_server("qwen3.5-4b-4bit", str(log_path))
 
     assert captured["env"] is not None
     assert captured["env"].get("RAPID_MLX_CHAT_SPAWN") == "1"
@@ -2235,7 +2235,7 @@ def test_chat_allow_abbrev_disabled_rejects_ambiguous_no_thi(capsys):
     With ``allow_abbrev=False`` argparse must reject the ambiguous form
     instead of silently resolving it to whichever flag was added first."""
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b", "--no-thi"]),
+        patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b-4bit", "--no-thi"]),
         pytest.raises(SystemExit),
     ):
         cli.main()
@@ -2247,7 +2247,7 @@ def test_serve_allow_abbrev_disabled_rejects_ambiguous_no_thi(capsys):
     """Same as the chat case — ``serve`` also got the hidden cross-alias
     and the same ambiguity must be reported, not silently resolved."""
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b", "--no-thi"]),
+        patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-thi"]),
         pytest.raises(SystemExit),
     ):
         cli.main()
diff --git a/tests/test_cli_info.py b/tests/test_cli_info.py
index 13e0e359..aff94c38 100644
--- a/tests/test_cli_info.py
+++ b/tests/test_cli_info.py
@@ -28,11 +28,11 @@ def _run_info(model_name: str) -> str:
 
 
 def test_info_resolves_alias_to_hf_path() -> None:
-    """Typing the alias ``qwen3.5-4b`` must show the resolution arrow
-    ``qwen3.5-4b → mlx-community/Qwen3.5-4B-MLX-4bit``. A refactor that
+    """Typing the alias ``qwen3.5-4b-4bit`` must show the resolution arrow
+    ``qwen3.5-4b-4bit → mlx-community/Qwen3.5-4B-MLX-4bit``. A refactor that
     bypasses ``resolve_model`` would drop the alias signal."""
-    out = _run_info("qwen3.5-4b")
-    assert "qwen3.5-4b" in out
+    out = _run_info("qwen3.5-4b-4bit")
+    assert "qwen3.5-4b-4bit" in out
     assert "→" in out, f"expected alias-resolution arrow in output, got:\n{out}"
     assert "mlx-community/Qwen3.5-4B-MLX-4bit" in out, (
         f"expected resolved HF path in output, got:\n{out}"
diff --git a/tests/test_cli_models.py b/tests/test_cli_models.py
index b9e2c1c9..6bd4c2c2 100644
--- a/tests/test_cli_models.py
+++ b/tests/test_cli_models.py
@@ -41,20 +41,20 @@ def test_models_command_shows_capability_columns():
 
 
 def test_models_command_renders_hybrid_marker_for_qwen35():
-    """Hybrid models (e.g. qwen3.5-4b) must show '✗ hybrid' + tier 'n/a'.
+    """Hybrid models (e.g. qwen3.5-4b-4bit) must show '✗ hybrid' + tier 'n/a'.
 
     The point of the column is to spare users an `info` round-trip when
     deciding whether spec-decode/suffix-decode will help. Trust the gate.
     """
     out = _capture_models_output()
     profiles = list_profiles()
-    qwen35_4b = profiles.get("qwen3.5-4b")
-    assert qwen35_4b is not None, "qwen3.5-4b alias missing — fixture drift"
-    assert qwen35_4b.is_hybrid, "qwen3.5-4b should still be is_hybrid=True"
+    qwen35_4b = profiles.get("qwen3.5-4b-4bit")
+    assert qwen35_4b is not None, "qwen3.5-4b-4bit alias missing — fixture drift"
+    assert qwen35_4b.is_hybrid, "qwen3.5-4b-4bit should still be is_hybrid=True"
 
-    # Find the qwen3.5-4b row and confirm the hybrid markers.
-    matches = [line for line in out.splitlines() if "qwen3.5-4b " in line]
-    assert matches, "no row found for qwen3.5-4b"
+    # Find the qwen3.5-4b-4bit row and confirm the hybrid markers.
+    matches = [line for line in out.splitlines() if "qwen3.5-4b-4bit " in line]
+    assert matches, "no row found for qwen3.5-4b-4bit"
     row = matches[0]
     assert "✗ hybrid" in row, f"expected '✗ hybrid' marker in row: {row!r}"
     assert "n/a" in row, f"expected suffix tier 'n/a' in row: {row!r}"
@@ -67,15 +67,15 @@ def test_models_command_renders_parser_for_hermes3_8b():
     alias registry, (2) the suffix-tier cell shows the tier currently
     recorded in ``aliases.json``. Reading the expected tier from the
     registry (not hardcoding it) means a future bench re-sweep that
-    reclassifies hermes3-8b doesn't break this test, while a *display*
+    reclassifies hermes3-8b-4bit doesn't break this test, while a *display*
     regression (tier dropped from the row entirely) still does.
     """
     out = _capture_models_output()
-    matches = [line for line in out.splitlines() if "hermes3-8b " in line]
-    assert matches, "no row found for hermes3-8b"
+    matches = [line for line in out.splitlines() if "hermes3-8b-4bit " in line]
+    assert matches, "no row found for hermes3-8b-4bit"
     row = matches[0]
-    profile = list_profiles().get("hermes3-8b")
-    assert profile is not None, "hermes3-8b alias missing — fixture drift"
+    profile = list_profiles().get("hermes3-8b-4bit")
+    assert profile is not None, "hermes3-8b-4bit alias missing — fixture drift"
     assert (profile.tool_call_parser or "") in row, (
         f"expected tool parser {profile.tool_call_parser!r} in row: {row!r}"
     )
@@ -194,7 +194,7 @@ def test_models_default_view_unchanged(monkeypatch, capsys):
 
 def test_cached_view_renders_alias_for_known_repo(tmp_path, monkeypatch, capsys):
     """A cached HF repo whose path matches an alias should render under
-    the alias name (e.g. ``qwen3.5-4b``), not the raw HF path."""
+    the alias name (e.g. ``qwen3.5-4b-4bit``), not the raw HF path."""
     from vllm_mlx.model_aliases import list_profiles
 
     profiles = list_profiles()
diff --git a/tests/test_dflash_eligibility.py b/tests/test_dflash_eligibility.py
index 0d1458bc..7a4abf29 100644
--- a/tests/test_dflash_eligibility.py
+++ b/tests/test_dflash_eligibility.py
@@ -103,7 +103,7 @@ def test_check_rejects_4bit_main_model() -> None:
         dflash_draft_model="z-lab/Qwen3.5-27B-DFlash",
     )
     with pytest.raises(DFlashUnavailable, match="4-bit"):
-        check(p, alias="qwen3.5-27b")
+        check(p, alias="qwen3.5-27b-4bit")
 
 
 def test_check_message_lists_eligible_aliases() -> None:
@@ -133,7 +133,7 @@ def test_report_collects_all_failures() -> None:
         is_moe=True,
         # supports_dflash=False (default) → 3 reasons total
     )
-    r = report(bad, alias="qwen3.6-35b")
+    r = report(bad, alias="qwen3.6-35b-4bit")
     assert len(r.reasons) == 3, f"expected 3 reasons, got: {r.reasons}"
     joined = " ".join(r.reasons)
     assert "MoE" in joined
@@ -168,20 +168,20 @@ def test_qwen3_5_27b_8bit_alias_passes_check() -> None:
 
 
 def test_default_qwen3_5_27b_alias_fails_check_with_4bit_reason() -> None:
-    """The default ``qwen3.5-27b`` alias points at the 4-bit variant —
+    """The default ``qwen3.5-27b-4bit`` alias points at the 4-bit variant —
     eligibility must reject it with a clear 4-bit hint, not the
     generic 'not enabled' message (since supports_dflash=False).
     Confirms users get the right pointer when they pick the wrong
     quantization."""
     from vllm_mlx.model_aliases import resolve_profile
 
-    profile = resolve_profile("qwen3.5-27b")
+    profile = resolve_profile("qwen3.5-27b-4bit")
     assert profile is not None
     # Match-string: capture both reasons (4-bit + not-opted-in). The
     # bare ``raises`` would pass even if the gate silently degraded to
     # the generic message, defeating the point of this regression test.
     with pytest.raises(DFlashUnavailable) as excinfo:
-        check(profile, alias="qwen3.5-27b")
+        check(profile, alias="qwen3.5-27b-4bit")
     msg = str(excinfo.value)
     assert "4-bit" in msg, (
         f"4-bit hint missing from DFlashUnavailable message; got:\n{msg}"
diff --git a/tests/test_dflash_integration.py b/tests/test_dflash_integration.py
index c9cfacd1..f5b5ae06 100644
--- a/tests/test_dflash_integration.py
+++ b/tests/test_dflash_integration.py
@@ -95,11 +95,11 @@ def test_info_dflash_block_skipped_for_unknown_alias(capsys) -> None:
 
 
 def test_info_dflash_marks_4bit_alias_ineligible(capsys) -> None:
-    """The default ``qwen3.5-27b`` alias points at the 4-bit variant and
+    """The default ``qwen3.5-27b-4bit`` alias points at the 4-bit variant and
     must surface as ineligible with the right gate failing."""
     from vllm_mlx.cli import info_command
 
-    args = type("Args", (), {"model": "qwen3.5-27b"})()
+    args = type("Args", (), {"model": "qwen3.5-27b-4bit"})()
     info_command(args)
     captured = capsys.readouterr()
     assert "DFlash eligibility" in captured.out
@@ -153,10 +153,10 @@ def test_models_listing_renders_dflash_column(capsys) -> None:
 
     # A non-DFlash alias renders — in the DFlash column.
     ineligible_row = next(
-        (line for line in lines if "qwen3.5-4b " in line),
+        (line for line in lines if "qwen3.5-4b-4bit " in line),
         None,
     )
-    assert ineligible_row is not None, "qwen3.5-4b row missing"
+    assert ineligible_row is not None, "qwen3.5-4b-4bit row missing"
     assert "—" in ineligible_row, f"DFlash column should be —: {ineligible_row!r}"
 
 
diff --git a/tests/test_doctor_baseline.py b/tests/test_doctor_baseline.py
index 777e23f1..4fa35e13 100644
--- a/tests/test_doctor_baseline.py
+++ b/tests/test_doctor_baseline.py
@@ -195,7 +195,7 @@ def test_distinct_inputs_produce_distinct_slugs(self):
 
     def test_simple_aliases_unchanged(self):
         # Common case: an alias with no special chars stays human-readable.
-        assert safe_model_slug("qwen3.5-4b") == "qwen3.5-4b"
+        assert safe_model_slug("qwen3.5-4b-4bit") == "qwen3.5-4b-4bit"
 
     def test_hf_path_round_trips_via_unquote(self):
         import urllib.parse
diff --git a/tests/test_doctor_runner.py b/tests/test_doctor_runner.py
index 83f747fc..69d92edc 100644
--- a/tests/test_doctor_runner.py
+++ b/tests/test_doctor_runner.py
@@ -221,7 +221,7 @@ def crashing():
 
 class TestDefaultBootTimeout:
     """Single generous default beats heuristics that miss models like
-    'qwen3-coder' (80B, no param-count hint in alias)."""
+    'qwen3-coder-4bit' (80B, no param-count hint in alias)."""
 
     def test_default_is_generous(self):
         from vllm_mlx.doctor.cli import DEFAULT_BOOT_TIMEOUT_S
diff --git a/tests/test_engine_router_non_stream.py b/tests/test_engine_router_non_stream.py
index b6af4af0..b7566141 100644
--- a/tests/test_engine_router_non_stream.py
+++ b/tests/test_engine_router_non_stream.py
@@ -55,7 +55,7 @@ def decode(self, ids):
         return "".join(self._id_to_text.get(i, f"<UNK:{i}>") for i in ids)
 
 
-# Harmony token IDs from openai/gpt-oss-20b (same constants the real
+# Harmony token IDs from openai/gpt-oss-20b (served here as gpt-oss-20b-mxfp4-q8; same constants the real
 # router reads). Keep in sync with tests/test_output_router.py.
 _HARMONY_VOCAB = {
     "<|return|>": 200002,
diff --git a/tests/test_finalize_harmony_raw_text.py b/tests/test_finalize_harmony_raw_text.py
index f719c50d..5d589828 100644
--- a/tests/test_finalize_harmony_raw_text.py
+++ b/tests/test_finalize_harmony_raw_text.py
@@ -32,7 +32,7 @@
 from vllm_mlx.reasoning.qwen3_parser import Qwen3ReasoningParser
 from vllm_mlx.service.helpers import _finalize_content_and_reasoning
 
-# A realistic gpt-oss-20b harmony non-stream response: analysis channel
+# A realistic gpt-oss-20b-mxfp4-q8 harmony non-stream response: analysis channel
 # (CoT) followed by final channel (answer), terminated with <|return|>.
 _HARMONY_RAW = (
     "<|channel|>analysis<|message|>"
diff --git a/tests/test_harmony_parsers.py b/tests/test_harmony_parsers.py
index 77b6cda4..3d06ba58 100644
--- a/tests/test_harmony_parsers.py
+++ b/tests/test_harmony_parsers.py
@@ -263,8 +263,8 @@ def test_extract_analysis_and_final(self, parser):
     def test_extract_final_terminated_by_end_token(self, parser):
         """Final channel terminated by ``<|end|>`` (not ``<|return|>``).
 
-        Regression for the v0.6.64 gpt-oss-20b empty-TextBlock flake:
-        gpt-oss-20b emits ``<|end|>`` after the final channel for a
+        Regression for the v0.6.64 gpt-oss-20b-mxfp4-q8 empty-TextBlock flake:
+        gpt-oss-20b-mxfp4-q8 emits ``<|end|>`` after the final channel for a
         sizeable fraction of non-streaming responses, and the prior
         ``<|return|>``-only regex silently dropped that content. The
         streaming path already accepts both terminators; the
@@ -306,7 +306,7 @@ def test_literal_end_token_in_content_end_only_terminator(self, parser):
         DeepSeek round-2 follow-up: the ``<|return|>``-first preference
         only covers the case where the real terminator is
         ``<|return|>``. When the model emits ``<|end|>`` as the
-        message terminator (gpt-oss-20b's common case) AND the
+        message terminator (gpt-oss-20b-mxfp4-q8's common case) AND the
         answer contains a literal ``<|end|>``, the non-greedy
         fallback would still truncate. Greedy ``(.*)`` in
         ``_FINAL_PATTERN_END`` now consumes up to the LAST
@@ -696,7 +696,7 @@ def test_skips_non_function_types(self):
 class TestHarmonyEnginePipeline:
     """End-to-end through the engine layer: clean_output_text → tool parser.
 
-    Reproduces the v0.6.64 bug where gpt-oss-20b's commentary-only tool
+    Reproduces the v0.6.64 bug where gpt-oss-20b-mxfp4-q8's commentary-only tool
     calls came back as plain text instead of structured ``tool_calls``.
     Root cause: ``_clean_gpt_oss_output`` in ``api/utils.py`` only matched
     a ``<|channel|>final<|message|>`` block; commentary-only output fell
@@ -712,7 +712,7 @@ class TestHarmonyEnginePipeline:
     """
 
     def test_commentary_only_output_extracts_tool_call(self):
-        """Real gpt-oss-20b output for a single tool call.
+        """Real gpt-oss-20b-mxfp4-q8 output for a single tool call.
 
         Captured verbatim from ``mlx-community/gpt-oss-20b-MXFP4-Q8`` via
         ``/v1/chat/completions`` with ``tools=[get_weather]`` (2026-05-22).
@@ -826,7 +826,7 @@ def test_tool_parser_unterminated_call_is_now_parsed(self):
         """Commentary block without trailing ``<|call|>`` IS parsed now.
 
         Earlier behavior treated a missing ``<|call|>`` terminator as
-        "incomplete". Empirically (gpt-oss-20b via /v1/chat/completions,
+        "incomplete". Empirically (gpt-oss-20b-mxfp4-q8 via /v1/chat/completions,
         2026-05-22) ``<|call|>`` is part of the harmony stop-token set,
         so the engine consumes it and ``output_text`` ends with the
         JSON args alone. Refusing to parse meant zero tool calls
diff --git a/tests/test_memory_capacity_check.py b/tests/test_memory_capacity_check.py
index ecbcdc40..54b67a7a 100644
--- a/tests/test_memory_capacity_check.py
+++ b/tests/test_memory_capacity_check.py
@@ -62,7 +62,7 @@ def test_hard_warning_fires_on_24gb_mac_with_14gb_model_realistic_load(
     actually hit this gets the strongest message, not a soft hint."""
     _patch_size_bytes(monkeypatch, size_gb=14.0)
     with patch.dict("sys.modules", {"psutil": _fake_psutil(24.0, used_gb=6.0)}):
-        _check_memory_capacity("/local/path/to/gemma-4-26b")
+        _check_memory_capacity("/local/path/to/gemma-4-26b-4bit")
     out = capsys.readouterr().out
     assert "kernel panic" in out, (
         f"the very case that filed the issue must hit the hard tier: {out!r}"
@@ -77,7 +77,7 @@ def test_hard_warning_still_fires_at_fresh_boot_on_24gb_mac(monkeypatch, capsys)
     """
     _patch_size_bytes(monkeypatch, size_gb=14.0)
     with patch.dict("sys.modules", {"psutil": _fake_psutil(24.0, used_gb=0.0)}):
-        _check_memory_capacity("/local/path/to/gemma-4-26b")
+        _check_memory_capacity("/local/path/to/gemma-4-26b-4bit")
     out = capsys.readouterr().out
     assert "Memory pressure" in out, f"expected warning, got: {out!r}"
     # 0 + 21 = 21 / 24 = 87.5% → HARD tier
@@ -208,7 +208,7 @@ def test_warning_includes_actionable_recommendations(monkeypatch, capsys):
     Pins the actionability of the message."""
     _patch_size_bytes(monkeypatch, size_gb=14.0)
     with patch.dict("sys.modules", {"psutil": _fake_psutil(24.0, used_gb=0.0)}):
-        _check_memory_capacity("/local/path/to/gemma-4-26b")
+        _check_memory_capacity("/local/path/to/gemma-4-26b-4bit")
     out = capsys.readouterr().out
     assert "--gpu-memory-utilization" in out
 
diff --git a/tests/test_model_aliases.py b/tests/test_model_aliases.py
index 75ded31d..1b32e6c6 100644
--- a/tests/test_model_aliases.py
+++ b/tests/test_model_aliases.py
@@ -9,8 +9,8 @@
 
 
 def test_known_alias_resolves():
-    assert resolve_model("qwen3.5-9b") == "mlx-community/Qwen3.5-9B-4bit"
-    assert resolve_model("llama3-3b") == "mlx-community/Llama-3.2-3B-Instruct-4bit"
+    assert resolve_model("qwen3.5-9b-4bit") == "mlx-community/Qwen3.5-9B-4bit"
+    assert resolve_model("llama3-3b-4bit") == "mlx-community/Llama-3.2-3B-Instruct-4bit"
 
 
 def test_full_path_passes_through():
@@ -24,12 +24,12 @@ def test_unknown_name_passes_through():
 
 def test_local_path_takes_priority_over_alias(tmp_path):
     """A local directory matching an alias name should win."""
-    local_dir = tmp_path / "qwen3.5-9b"
+    local_dir = tmp_path / "qwen3.5-9b-4bit"
     local_dir.mkdir()
     old_cwd = os.getcwd()
     try:
         os.chdir(tmp_path)
-        assert resolve_model("qwen3.5-9b") == "qwen3.5-9b"
+        assert resolve_model("qwen3.5-9b-4bit") == "qwen3.5-9b-4bit"
     finally:
         os.chdir(old_cwd)
 
@@ -37,14 +37,14 @@ def test_local_path_takes_priority_over_alias(tmp_path):
 def test_list_aliases_nonempty():
     aliases = list_aliases()
     assert len(aliases) >= 15
-    assert "qwen3.5-9b" in aliases
+    assert "qwen3.5-9b-4bit" in aliases
 
 
 def test_hermes_alias_not_llama():
     """Hermes-3 should be under its own name, not llama3-8b."""
     aliases = list_aliases()
     assert "llama3-8b" not in aliases
-    assert "hermes3-8b" in aliases
+    assert "hermes3-8b-4bit" in aliases
 
 
 def test_suggest_similar_stays_within_family():
@@ -61,11 +61,11 @@ def test_suggest_similar_stays_within_family():
 
 
 def test_suggest_similar_correctly_typo_for_close_size():
-    """Typing ``qwen3.5-30b`` (typo for ``qwen3.5-35b``) should rank the
+    """Typing ``qwen3.5-30b`` (typo for ``qwen3.5-35b-8bit``) should rank the
     correct alias first."""
     suggestions = suggest_similar("qwen3.5-30b")
     assert suggestions, "expected at least one suggestion"
-    assert suggestions[0] == "qwen3.5-35b", suggestions
+    assert suggestions[0] == "qwen3.5-35b-8bit", suggestions
 
 
 def test_suggest_similar_empty_for_nonsense():
@@ -89,10 +89,10 @@ def test_suggest_similar_one_letter_no_match():
 
 def test_suggest_similar_matches_partial_family_token():
     """A bare family name like ``hermes`` should suggest aliases that
-    share that prefix (``hermes3-8b``), not return [] just because there's
+    share that prefix (``hermes3-8b-4bit``), not return [] just because there's
     no exact ``hermes-foo`` separator pattern."""
     suggestions = suggest_similar("hermes")
-    assert "hermes3-8b" in suggestions, suggestions
+    assert "hermes3-8b-4bit" in suggestions, suggestions
 
 
 # --- Letter-only fallback (separator-mismatched names) ----------------
@@ -101,8 +101,8 @@ def test_suggest_similar_matches_partial_family_token():
 def test_suggest_similar_letter_fallback_handles_separator_mismatch():
     """Real bug from the field: ``rapid-mlx chat gemma4-27b`` returned
     zero suggestions because the strict family parser sees ``gemma4`` and
-    no alias starts with ``gemma4`` (we have ``gemma-4-26b`` and
-    ``gemma3-27b``). The letter-only fallback must catch this — extract
+    no alias starts with ``gemma4`` (we have ``gemma-4-26b-4bit`` and
+    ``gemma3-27b-4bit``). The letter-only fallback must catch this — extract
     ``gemma`` and match the whole gemma family."""
     suggestions = suggest_similar("gemma4-27b")
     assert suggestions, "letter-only fallback must produce gemma family suggestions"
@@ -111,34 +111,17 @@ def test_suggest_similar_letter_fallback_handles_separator_mismatch():
         assert s.startswith("gemma"), s
 
 
-def test_suggest_similar_short_alias_does_not_shadow_sized_variants():
-    """When a short alias (e.g. ``gemma4``) exists AND the user types
-    a size-qualified name (``gemma4-26b``), the strict-family pass must
-    NOT short-circuit to just ``[gemma4]``. Otherwise users hunting
-    for the 26B variant get bait-and-switched onto the 12B default.
-    Fall through to the letter-only pass so size-specific aliases
-    surface."""
-    suggestions = suggest_similar("gemma4-26b")
-    assert suggestions, "size-qualified typo must produce suggestions"
-    # The 26B variant must be among the suggestions — that's what the
-    # user actually wanted.
-    assert "gemma-4-26b" in suggestions, suggestions
-    # Sanity: the bare ``gemma4`` short alias should NOT be the only
-    # suggestion (the whole point of this regression test).
-    assert suggestions != ["gemma4"], suggestions
-
-
 def test_suggest_similar_letter_fallback_collapsed_separator():
     """User collapses our hyphen — ``mistral24b`` should still suggest
-    ``mistral-24b``, not return []."""
-    assert "mistral-24b" in suggest_similar("mistral24b")
+    ``mistral-24b-4bit``, not return []."""
+    assert "mistral-24b-4bit" in suggest_similar("mistral24b")
 
 
 def test_suggest_similar_letter_fallback_skips_legit_looking_names():
     """When the input has no size/quant suffix tokens (i.e., looks
     structurally like a legit single-segment HF repo ID), suggest_similar
-    must return [] — not bait-and-switch ``gpt2`` to ``gpt-oss-20b`` or
-    ``qwen-coder`` to ``qwen3-coder``. The CLI layer's POPULAR_ALIASES
+    must return [] — not bait-and-switch ``gpt2`` to ``gpt-oss-20b-mxfp4-q8`` or
+    ``qwen-coder`` to ``qwen3-coder-4bit``. The CLI layer's POPULAR_ALIASES
     fallback handles those cases at presentation time."""
     # ``gpt2`` has been pinned by test_suggest_similar_lets_legitimate_hf_ids_through;
     # this case adds the partial-family equivalent.
@@ -152,7 +135,7 @@ def test_suggest_similar_letter_fallback_skips_legit_looking_names():
         ("Gemma4-27b", "gemma"),  # lowercased
         ("gemma_4-27b", "gemma"),  # stops at non-letter
         ("mistral24b", "mistral"),
-        ("qwen3.5-4b", "qwen"),  # stops at first digit
+        ("qwen3.5-4b-4bit", "qwen"),  # stops at first digit
         ("123abc", ""),  # leading non-letter → empty
         ("", ""),  # empty input
         ("ab", "ab"),  # short prefix (caller enforces ≥3 minimum)
diff --git a/tests/test_model_profiles_ssot.py b/tests/test_model_profiles_ssot.py
index 8866907f..49169c92 100644
--- a/tests/test_model_profiles_ssot.py
+++ b/tests/test_model_profiles_ssot.py
@@ -83,12 +83,12 @@ def test_orphan_aliases_now_covered() -> None:
     """Pin the 6 specific aliases that were orphans before this PR to
     catch a regression where someone deletes their profile."""
     for orphan in (
-        "bonsai-1.7b",
-        "bonsai-4b",
-        "bonsai-8b",
-        "ministral-3b",
-        "nemotron-30b",
-        "nemotron-nano",
+        "bonsai-1.7b-unpacked",
+        "bonsai-4b-unpacked",
+        "bonsai-8b-unpacked",
+        "ministral-3b-4bit",
+        "nemotron-30b-4bit",
+        # ``nemotron-nano`` was dropped; it resolved to ``nemotron-30b-4bit`` above.
     ):
         profile = resolve_profile(orphan)
         assert profile is not None, f"{orphan} regressed to orphan"
@@ -108,14 +108,14 @@ def test_list_aliases_returns_legacy_string_view() -> None:
     aliases = list_aliases()
     assert len(aliases) >= 65
     assert all(isinstance(p, str) for p in aliases.values())
-    assert aliases["qwen3.5-4b"] == "mlx-community/Qwen3.5-4B-MLX-4bit"
+    assert aliases["qwen3.5-4b-4bit"] == "mlx-community/Qwen3.5-4B-MLX-4bit"
     assert aliases["qwen3-0.6b-8bit"] == "mlx-community/Qwen3-0.6B-8bit"
 
 
 def test_list_profiles_returns_rich_dataclass_view() -> None:
     profiles = list_profiles()
     assert len(profiles) >= 65
-    p = profiles["qwen3.5-4b"]
+    p = profiles["qwen3.5-4b-4bit"]
     assert isinstance(p, AliasProfile)
     assert p.hf_path == "mlx-community/Qwen3.5-4B-MLX-4bit"
     assert p.tool_call_parser == "hermes"
@@ -138,7 +138,7 @@ def test_list_profiles_returns_rich_dataclass_view() -> None:
 
 def test_resolve_model_unchanged_for_callers() -> None:
     """Existing callers of ``resolve_model`` must keep getting a string."""
-    assert resolve_model("qwen3.5-4b") == "mlx-community/Qwen3.5-4B-MLX-4bit"
+    assert resolve_model("qwen3.5-4b-4bit") == "mlx-community/Qwen3.5-4B-MLX-4bit"
     assert (
         resolve_model("mlx-community/Qwen3.5-4B-MLX-4bit")
         == "mlx-community/Qwen3.5-4B-MLX-4bit"
@@ -150,7 +150,7 @@ def test_resolve_model_unchanged_for_callers() -> None:
 
 
 def test_resolve_profile_by_alias_name() -> None:
-    p = resolve_profile("qwen3.5-4b")
+    p = resolve_profile("qwen3.5-4b-4bit")
     assert p is not None
     assert p.tool_call_parser == "hermes"
 
@@ -173,11 +173,11 @@ def test_resolve_profile_returns_none_for_unknown() -> None:
 
 
 def test_detect_model_config_prefers_alias_profile_over_regex() -> None:
-    """``qwen3.5-4b`` (alias) and the matching qwen3.5 regex pattern
+    """``qwen3.5-4b-4bit`` (alias) and the matching qwen3.5 regex pattern
     happen to agree today, but the alias path is the one we contract on
     — pin a known field that exists on the alias profile so a future
     regex change can't silently take over."""
-    cfg = detect_model_config("qwen3.5-4b")
+    cfg = detect_model_config("qwen3.5-4b-4bit")
     assert cfg is not None
     assert cfg.tool_call_parser == "hermes"
     assert cfg.is_hybrid is True
@@ -212,7 +212,7 @@ def test_detect_model_config_alias_wins_over_regex_when_they_disagree() -> None:
     import vllm_mlx.model_aliases as ma
     from vllm_mlx.model_aliases import AliasProfile
 
-    real = ma._aliases["qwen3.5-4b"]
+    real = ma._aliases["qwen3.5-4b-4bit"]
     forged = AliasProfile(
         hf_path=real.hf_path,
         tool_call_parser="ALIAS_WINS",  # the regex would say "hermes"
@@ -220,8 +220,8 @@ def test_detect_model_config_alias_wins_over_regex_when_they_disagree() -> None:
         is_hybrid=real.is_hybrid,
         supports_spec_decode=real.supports_spec_decode,
     )
-    with patch.dict(ma._aliases, {"qwen3.5-4b": forged}):
-        cfg = detect_model_config("qwen3.5-4b")
+    with patch.dict(ma._aliases, {"qwen3.5-4b-4bit": forged}):
+        cfg = detect_model_config("qwen3.5-4b-4bit")
     assert cfg is not None
     assert cfg.tool_call_parser == "ALIAS_WINS", (
         "regex shadowed the alias profile — alias-first lookup is broken"
@@ -329,20 +329,20 @@ def test_per_alias_schema_allows_independent_overrides() -> None:
     even if they map to the same family. This is what we couldn't do
     before, and it's the architectural reason for the refactor."""
     profiles = list_profiles()
-    p1 = profiles["qwen3.5-4b"]
+    p1 = profiles["qwen3.5-4b-4bit"]
     # Object identity check would be wrong; equality on a value-typed
     # dataclass is what we actually want — separate AliasProfile
     # instances per alias means we can mutate one without touching the
     # other. (Mutation isn't supported because the dataclass is frozen,
     # but a re-load with edited JSON would work.)
-    assert p1 is not profiles["qwen3.5-9b"]
+    assert p1 is not profiles["qwen3.5-9b-4bit"]
 
 
 # ---- Reverse-lookup behaviour with shared hf_paths -----------------------
 
 
 def test_reverse_lookup_for_shared_hf_path_is_deterministic() -> None:
-    """Two aliases (``nemotron-30b`` and ``nemotron-nano``) point at the
+    """``nemotron-30b-4bit`` absorbed the dropped ``nemotron-nano`` codename, which pointed at the
     same MLX repo. Reverse lookup by HF path should return the
     JSON-insertion-order-first alias's profile, deterministically.
 
@@ -352,23 +352,23 @@ def test_reverse_lookup_for_shared_hf_path_is_deterministic() -> None:
     about who's the canonical alias).
     """
     profiles = list_profiles()
-    nemotron_30b = profiles["nemotron-30b"]
-    nemotron_nano = profiles["nemotron-nano"]
+    nemotron_30b = profiles["nemotron-30b-4bit"]
+    nemotron_nano = profiles["nemotron-30b-4bit"]
     assert nemotron_30b.hf_path == nemotron_nano.hf_path
 
-    # nemotron-30b appears first in aliases.json, so reverse lookup
-    # by the shared HF path returns nemotron-30b's profile object.
+    # nemotron-30b-4bit appears first in aliases.json, so reverse lookup
+    # by the shared HF path returns nemotron-30b-4bit's profile object.
     via_path = resolve_profile(nemotron_30b.hf_path)
     assert via_path is not None
     assert via_path is nemotron_30b
 
 
 def test_reverse_lookup_handles_deepseek_v4_flash_duplicate() -> None:
-    """``deepseek-v4-flash`` and ``deepseek-v4-flash-8bit`` share
+    """``deepseek-v4-flash-8bit`` and the dropped ``deepseek-v4-flash`` codename shared
     ``mlx-community/DeepSeek-V4-Flash-8bit`` — same regression guard
     pattern as the nemotron pair, different family."""
     profiles = list_profiles()
-    flash = profiles["deepseek-v4-flash"]
+    flash = profiles["deepseek-v4-flash-8bit"]
     flash_8bit = profiles["deepseek-v4-flash-8bit"]
     assert flash.hf_path == flash_8bit.hf_path
     via_path = resolve_profile(flash.hf_path)
diff --git a/tests/test_postprocessor.py b/tests/test_postprocessor.py
index 00458485..8c3b5d4e 100644
--- a/tests/test_postprocessor.py
+++ b/tests/test_postprocessor.py
@@ -133,7 +133,7 @@ def test_channel_routed_accumulators_populated(self):
         emitted events to the client but never updated the per-processor
         accumulators that ``_build_usage`` reads to compute the
         reasoning/content split. Confirmed by parallel onboarding agents
-        on gemma-4-26b and gpt-oss-20b.
+        on gemma-4-26b-4bit and gpt-oss-20b-mxfp4-q8.
         """
         cfg = _make_cfg()
         pp = StreamingPostProcessor(cfg)
diff --git a/tests/test_prefix_boundary_path_parity.py b/tests/test_prefix_boundary_path_parity.py
index 4236e95d..cefffb2b 100644
--- a/tests/test_prefix_boundary_path_parity.py
+++ b/tests/test_prefix_boundary_path_parity.py
@@ -173,12 +173,12 @@ async def _drain():
 
 
 def test_non_hybrid_model_skips_boundary_both_paths(monkeypatch):
-    """Pure Transformer models (gpt-oss-20b, qwen3-coder, etc.) must NOT
+    """Pure Transformer models (gpt-oss-20b-mxfp4-q8, qwen3-coder-4bit, etc.) must NOT
     take the boundary-split path even if a multi-message conversation
     would otherwise produce ``prefix_boundary > 0``.
 
     Why: ``BatchGenerator.insert_segments`` empirically corrupts harmony
-    tool-call channel state across multi-turn-with-tools on gpt-oss-20b
+    tool-call channel state across multi-turn-with-tools on gpt-oss-20b-mxfp4-q8
     (pydantic_ai 6_multi_tool drops from 6/6 to 5/6 — agent loops on
     ``add(3,4)`` until ``request_limit`` exhausts). Pure Transformers
     don't need the boundary save anyway — trim+supersequence reuse
diff --git a/tests/test_sampling_params_passthrough.py b/tests/test_sampling_params_passthrough.py
index 1432ae98..7779aa73 100644
--- a/tests/test_sampling_params_passthrough.py
+++ b/tests/test_sampling_params_passthrough.py
@@ -27,7 +27,7 @@
 
 # A realistic payload — Qwen3.6 published coding-tuned sampling.
 QWEN36_CODING_PAYLOAD = {
-    "model": "qwen3.6-35b",
+    "model": "qwen3.6-35b-4bit",
     "messages": [{"role": "user", "content": "hi"}],
     "temperature": 0.6,
     "top_p": 0.95,
@@ -69,7 +69,7 @@ def test_chat_completion_request_defaults_to_none_when_unset():
     'client explicitly chose a value'. Mixing them would make us override
     SamplingParams defaults even when the client wanted defaults."""
     req = ChatCompletionRequest(
-        model="qwen3.5-4b",
+        model="qwen3.5-4b-4bit",
         messages=[{"role": "user", "content": "hi"}],
     )
 
@@ -83,7 +83,7 @@ def test_chat_completion_request_defaults_to_none_when_unset():
 def test_completion_request_preserves_extended_sampling_params():
     """Mirror of the chat-request test for /v1/completions."""
     payload = {
-        "model": "qwen3.6-35b",
+        "model": "qwen3.6-35b-4bit",
         "prompt": "hi",
         "temperature": 0.6,
         "top_p": 0.95,
@@ -155,7 +155,7 @@ def test_chat_kwargs_omits_extended_params_when_client_silent():
     NOT contain them — otherwise we'd override the engine's defaults with
     None and break the SamplingParams contract."""
     req = ChatCompletionRequest(
-        model="qwen3.5-4b",
+        model="qwen3.5-4b-4bit",
         messages=[{"role": "user", "content": "hi"}],
     )
     chat_kwargs = _build_chat_kwargs(req)
@@ -198,7 +198,7 @@ async def stream_generate(self, **kw):
             yield _FakeOutput()
 
     req = CompletionRequest(
-        model="qwen3.6-35b",
+        model="qwen3.6-35b-4bit",
         prompt="hi",
         temperature=0.6,
         top_p=0.95,
@@ -246,7 +246,7 @@ def test_completion_route_omits_extended_params_when_client_silent():
     """Mirror of the chat-route variant: legacy /v1/completions clients that
     don't set these fields must not see them leaked as None into engine
     kwargs (which would override SamplingParams defaults)."""
-    req = CompletionRequest(model="qwen3.5-4b", prompt="hi")
+    req = CompletionRequest(model="qwen3.5-4b-4bit", prompt="hi")
     extended_kwargs: dict = {}
     for name in (
         "top_k",
diff --git a/tests/test_share_cli.py b/tests/test_share_cli.py
index 4a58ea0e..6fb0227e 100644
--- a/tests/test_share_cli.py
+++ b/tests/test_share_cli.py
@@ -40,7 +40,7 @@ def _isolated_state_dir(tmp_path, monkeypatch):
 
 def _make_args(**overrides):
     defaults = dict(
-        model="qwen3.5-4b",
+        model="qwen3.5-4b-4bit",
         port=18765,  # explicit so the env-var fallback path isn't exercised
         thinking=False,  # default: forward --no-thinking to serve
         cors_origins=None,  # None → CLI default allowlist
@@ -124,7 +124,7 @@ def test_share_command_happy_path(capsys):
         patch.object(share_cli.ws_tunnel, "wait_for_public_url", return_value=True),
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
@@ -227,9 +227,9 @@ def test_register_adds_share_to_subparsers():
     parser = argparse.ArgumentParser()
     subparsers = parser.add_subparsers(dest="command")
     share_cli.register(subparsers)
-    args = parser.parse_args(["share", "qwen3.5-4b"])
+    args = parser.parse_args(["share", "qwen3.5-4b-4bit"])
     assert args.command == "share"
-    assert args.model == "qwen3.5-4b"
+    assert args.model == "qwen3.5-4b-4bit"
 
 
 def test_share_command_rejects_garbage_port_env(monkeypatch):
@@ -277,7 +277,7 @@ def test_spawn_serve_passes_loopback_host():
     access at ``http://<lan-ip>:<port>``."""
     with patch("subprocess.Popen") as mock_popen:
         share_cli._spawn_serve(
-            alias="qwen3.5-4b",
+            alias="qwen3.5-4b-4bit",
             port=18765,
             api_key="K",
             log_path=MagicMock(),
@@ -293,7 +293,7 @@ def test_spawn_serve_passes_api_key_via_env_not_argv():
     appear in argv where ``ps`` / shell history would leak it."""
     with patch("subprocess.Popen") as mock_popen:
         share_cli._spawn_serve(
-            alias="qwen3.5-4b",
+            alias="qwen3.5-4b-4bit",
             port=18765,
             api_key="SECRET_KEY_HERE",
             log_path=MagicMock(),
@@ -466,7 +466,7 @@ def test_share_command_exits_nonzero_when_serve_crashes(capsys):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", return_value=None),
         pytest.raises(SystemExit) as exc_info,
@@ -496,7 +496,7 @@ def test_share_command_exits_nonzero_when_serve_exits_cleanly(capsys):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", return_value=None),
         pytest.raises(SystemExit) as exc_info,
@@ -533,7 +533,7 @@ def fake_sleep(*_a, **_k):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=fake_sleep),
         pytest.raises(SystemExit) as exc_info,
@@ -558,7 +558,7 @@ def test_share_command_ctrl_c_keeps_exit_zero(capsys):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
@@ -591,7 +591,7 @@ def raise_sigterm_handler_once(*_a, **_k):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=raise_sigterm_handler_once),
     ):
@@ -617,7 +617,7 @@ def test_register_share_cors_origins_accepts_multiple_values():
     args = parser.parse_args(
         [
             "share",
-            "qwen3.5-4b",
+            "qwen3.5-4b-4bit",
             "--cors-origins",
             "https://a.com",
             "https://b.com",
@@ -647,7 +647,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args):  # noqa: ARG001
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
@@ -694,7 +694,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args):  # noqa: ARG001
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ]
@@ -804,7 +804,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args):  # noqa: ARG001
         spawn_argv.append(alias)
         return serve_proc
 
-    args = _make_args(model="qwen3.5-4b")
+    args = _make_args(model="qwen3.5-4b-4bit")
     # Simulate argparse having rewritten the alias.
     args._original_alias = "Qwen3.5-4B"
 
@@ -817,7 +817,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args):  # noqa: ARG001
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
@@ -845,12 +845,12 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args):  # noqa: ARG001
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
-        share_cli.share_command(_make_args(model="qwen3.5-4b"))
-    assert spawn_argv == ["qwen3.5-4b"]
+        share_cli.share_command(_make_args(model="qwen3.5-4b-4bit"))
+    assert spawn_argv == ["qwen3.5-4b-4bit"]
 
 
 # ─────────────────────────── download-gate behavior ─────────────────────────
@@ -873,7 +873,7 @@ def test_share_command_runs_download_gate_for_uncached_hf_repo():
 
 
 def test_share_command_skips_download_gate_for_local_alias():
-    """Aliases without ``/`` (e.g. ``qwen3.5-4b``) are NOT HF repo ids —
+    """Aliases without ``/`` (e.g. ``qwen3.5-4b-4bit``) are NOT HF repo ids —
     the gate short-circuits before any HF API call. Verified by
     asserting ``is_repo_cached`` was never called."""
     with (
@@ -881,7 +881,7 @@ def test_share_command_skips_download_gate_for_local_alias():
         patch("vllm_mlx._download_gate.is_repo_cached") as cached,
         pytest.raises(SystemExit),
     ):
-        share_cli.share_command(_make_args(model="qwen3.5-4b"))
+        share_cli.share_command(_make_args(model="qwen3.5-4b-4bit"))
     cached.assert_not_called()
 
 
@@ -1104,7 +1104,7 @@ def test_register_share_chat_frontend_default_is_none():
     parser = argparse.ArgumentParser()
     subparsers = parser.add_subparsers(dest="command")
     share_cli.register(subparsers)
-    args = parser.parse_args(["share", "qwen3.5-4b"])
+    args = parser.parse_args(["share", "qwen3.5-4b-4bit"])
     assert args.chat_frontend is None
 
 
@@ -1113,7 +1113,7 @@ def test_register_share_chat_frontend_accepts_value():
     subparsers = parser.add_subparsers(dest="command")
     share_cli.register(subparsers)
     args = parser.parse_args(
-        ["share", "qwen3.5-4b", "--chat-frontend", "https://my-fork.example"]
+        ["share", "qwen3.5-4b-4bit", "--chat-frontend", "https://my-fork.example"]
     )
     assert args.chat_frontend == "https://my-fork.example"
 
@@ -1150,7 +1150,7 @@ def test_share_command_forwards_chat_frontend_to_banner(capsys):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
@@ -1179,7 +1179,7 @@ def test_share_command_omits_chat_line_when_frontend_disabled(capsys):
         patch.object(share_cli, "_pick_port", return_value=18765),
         patch.object(share_cli, "_maybe_confirm_download"),
         patch.object(
-            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b"
+            share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit"
         ),
         patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()),
     ):
diff --git a/tests/test_smoke_matrix.sh b/tests/test_smoke_matrix.sh
index e5c45902..00befac9 100755
--- a/tests/test_smoke_matrix.sh
+++ b/tests/test_smoke_matrix.sh
@@ -72,7 +72,7 @@ print(text)
 # test 4 to verify the reasoning parser is actually splitting thinking
 # tokens into ``reasoning_content`` — combined length is unreliable because
 # some models compensate for disabled thinking by writing longer answers
-# in ``content`` (e.g. qwen3.5-4b on simple math drops thinking ratio
+# in ``content`` (e.g. qwen3.5-4b-4bit on simple math drops thinking ratio
 # below the previous 1.5x heuristic).
 stream_chat_split() {
     local body="$1"
diff --git a/tests/test_stop_string_enforcement.py b/tests/test_stop_string_enforcement.py
index 699121f9..5a923211 100644
--- a/tests/test_stop_string_enforcement.py
+++ b/tests/test_stop_string_enforcement.py
@@ -8,7 +8,7 @@
 scheduler has to scan the decoded output itself.
 
 This was missing on the text path (MLLMScheduler had it, Scheduler did
-not), which surfaced as 4 failing regression-suite tests on qwen3.5-4b:
+not), which surfaced as 4 failing regression-suite tests on qwen3.5-4b-4bit:
 tests 1 (newline), 2 (literal word), 4 (Unicode), 5 (streaming).
 
 These unit tests exercise ``Scheduler._process_batch_responses`` directly with
diff --git a/tests/test_suffix_bench_methodology.py b/tests/test_suffix_bench_methodology.py
index e9c0fed9..5ad7d16c 100644
--- a/tests/test_suffix_bench_methodology.py
+++ b/tests/test_suffix_bench_methodology.py
@@ -53,7 +53,7 @@ def test_negative_tokens_rejected(self):
 
     def test_decode_time_floor_rejects_short_window(self):
         # 80 tokens generated in 0.04s → 2000 tok/s. This is the exact
-        # failure mode that burned smollm3-3b's code_edit run in v2 —
+        # failure mode that burned smollm3-3b-4bit's code_edit run in v2 —
         # technically >32 tokens (the old guard) but still meaningless.
         wr = bench._classify_run(completion_tokens=80, decode_time=0.04, total_time=0.6)
         assert wr.tps is None
diff --git a/tests/test_suffix_decoding_tier.py b/tests/test_suffix_decoding_tier.py
index da1309d0..14bb64d3 100644
--- a/tests/test_suffix_decoding_tier.py
+++ b/tests/test_suffix_decoding_tier.py
@@ -191,7 +191,7 @@ def test_avoid_shows_worst_workload(self):
         assert "0.78" in table
 
     def test_long_avoid_note_fits_box_when_max_width_set(self):
-        """Regression: long ``avoid`` notes (e.g. ``gemma-4-26b``) used
+        """Regression: long ``avoid`` notes (e.g. ``gemma-4-26b-4bit``) used
         to overflow the right ``│`` border. Truncation must keep the
         tier word + numeric speedup whole while shortening the trailing
         rationale to ``…)``."""
@@ -217,7 +217,7 @@ def test_short_tier_note_is_not_wrongly_truncated(self):
     def test_table_rows_all_same_width_for_long_avoid(self):
         """Box-frame alignment invariant: every bordered row must end at
         the same column. Pre-fix the ``Suffix tier`` row for an alias
-        like ``gemma-4-26b`` would render past the right ``│``."""
+        like ``gemma-4-26b-4bit`` would render past the right ``│``."""
         cfg = ModelConfig(
             suffix_decoding_tier="avoid",
             suffix_bench_speedup={"json_array": 0.20},
diff --git a/tests/test_telemetry_emit.py b/tests/test_telemetry_emit.py
index 86ba1d6d..92b86ed7 100644
--- a/tests/test_telemetry_emit.py
+++ b/tests/test_telemetry_emit.py
@@ -80,7 +80,7 @@ def test_request_no_op_when_disabled(fake_home, stub_queue):
 
     emit.request(
         endpoint="/v1/chat/completions",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=100,
@@ -147,7 +147,7 @@ def test_cli_kill_switch_overrides_opt_in(opted_in, stub_queue):
         emit.session_end(subcommand="serve", duration_seconds=42)
         emit.request(
             endpoint="/v1/chat/completions",
-            model_alias="qwen3.5-9b",
+            model_alias="qwen3.5-9b-4bit",
             stream=True,
             tool_call_used=False,
             prompt_tokens=100,
@@ -289,7 +289,7 @@ def test_request_buckets_not_raw_numbers(opted_in, stub_queue):
 
     emit.request(
         endpoint="/v1/chat/completions",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=137,
@@ -469,7 +469,7 @@ def test_request_endpoint_constrained_to_allowlist(opted_in, stub_queue):
     # Allowed endpoint round-trips verbatim (after strip).
     emit.request(
         endpoint="/v1/chat/completions",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=10,
@@ -483,7 +483,7 @@ def test_request_endpoint_constrained_to_allowlist(opted_in, stub_queue):
     # Query string + fragment stripped before allowlist match.
     emit.request(
         endpoint="/v1/chat/completions?api_key=sk-PROD-SECRET#anchor",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=10,
@@ -502,7 +502,7 @@ def test_request_endpoint_constrained_to_allowlist(opted_in, stub_queue):
     # string leaks into the payload.
     emit.request(
         endpoint="/internal/dump?path=/Users/alice/secrets.txt",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=10,
@@ -530,7 +530,7 @@ def test_request_endpoint_normalizes_full_url_to_path(opted_in, stub_queue):
 
     emit.request(
         endpoint="https://api.example.com/v1/chat/completions",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=10,
@@ -549,7 +549,7 @@ def test_request_endpoint_normalizes_full_url_to_path(opted_in, stub_queue):
     # Combined: full URL + query + fragment still resolves correctly.
     emit.request(
         endpoint="https://host/v1/chat/completions?key=sk-PROD-LEAK#frag",
-        model_alias="qwen3.5-9b",
+        model_alias="qwen3.5-9b-4bit",
         stream=True,
         tool_call_used=False,
         prompt_tokens=10,
@@ -661,7 +661,7 @@ def test_safe_does_not_swallow_signature_mismatch(opted_in, stub_queue):
     with pytest.raises(TypeError):
         emit.request(
             # endpoint missing
-            model_alias="qwen3.5-9b",
+            model_alias="qwen3.5-9b-4bit",
             stream=True,
             tool_call_used=False,
             prompt_tokens=10,
@@ -715,7 +715,7 @@ def test_flag_values_never_cross_telemetry_boundary(opted_in, stub_queue):
     prompt = "summarize this confidential email about Q3 numbers"
     argv = [
         "serve",
-        "qwen3.5-9b",
+        "qwen3.5-9b-4bit",
         "--api-key",
         secret,
         "--auth-header",
diff --git a/tests/test_telemetry_redact.py b/tests/test_telemetry_redact.py
index f7c9eaa2..86549eff 100644
--- a/tests/test_telemetry_redact.py
+++ b/tests/test_telemetry_redact.py
@@ -126,8 +126,8 @@ def test_normalize_model_path_redacts_local(raw):
 
 def test_normalize_model_path_bare_alias_passes():
     """Bare alias names (no slash) are public + harmless."""
-    assert normalize_model_path("qwen3.5-9b") == "qwen3.5-9b"
-    assert normalize_model_path("hermes3-8b") == "hermes3-8b"
+    assert normalize_model_path("qwen3.5-9b-4bit") == "qwen3.5-9b-4bit"
+    assert normalize_model_path("hermes3-8b-4bit") == "hermes3-8b-4bit"
 
 
 def test_normalize_model_path_empty():
diff --git a/vllm_mlx/_completion.py b/vllm_mlx/_completion.py
index 3a9a0ff7..2ba65c6c 100644
--- a/vllm_mlx/_completion.py
+++ b/vllm_mlx/_completion.py
@@ -116,7 +116,7 @@ def alias_csv_completer(prefix: str = "", **_: Any) -> list[str]:
     """Comma-separated-list variant for ``doctor --models a,b,c``.
 
     The user-visible prefix at completion time contains everything
-    typed for this flag — e.g. ``qwen3.5-4b,gem`` when partway through
+    typed for this flag — e.g. ``qwen3.5-4b-4bit,gem`` when partway through
     the second entry. We split on the last comma, strip whitespace
     around the tail so ``--models a, gem<TAB>`` works the same as
     ``--models a,gem<TAB>`` (the runtime ``split + strip`` accepts
diff --git a/vllm_mlx/_download_gate.py b/vllm_mlx/_download_gate.py
index 4e14cbd4..ec997a73 100644
--- a/vllm_mlx/_download_gate.py
+++ b/vllm_mlx/_download_gate.py
@@ -3,7 +3,7 @@
 
 Persona-3 ("Ollama switcher") feedback (2026-05): running
 
-    rapid-mlx chat qwen3-coder
+    rapid-mlx chat qwen3-coder-4bit
 
 against an alias that wasn't yet cached silently kicked off a 41.8 GB
 download with no ``[Y/n]`` prompt. The download itself ran fine, but
diff --git a/vllm_mlx/agents/__init__.py b/vllm_mlx/agents/__init__.py
index 10835740..92ff5718 100644
--- a/vllm_mlx/agents/__init__.py
+++ b/vllm_mlx/agents/__init__.py
@@ -4,7 +4,7 @@
     from vllm_mlx.agents import get_profile, list_profiles
 
     profile = get_profile("hermes")
-    config = profile.render_config("http://localhost:8000/v1", "qwen3.5-9b")
+    config = profile.render_config("http://localhost:8000/v1", "qwen3.5-9b-4bit")
 """
 
 from __future__ import annotations
diff --git a/vllm_mlx/agents/profiles/aider.yaml b/vllm_mlx/agents/profiles/aider.yaml
index 3a8cfb6e..261b5be4 100644
--- a/vllm_mlx/agents/profiles/aider.yaml
+++ b/vllm_mlx/agents/profiles/aider.yaml
@@ -11,10 +11,10 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
-    - "deepseek-r1-8b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
+    - "deepseek-r1-8b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/cline.yaml b/vllm_mlx/agents/profiles/cline.yaml
index 641e32e2..08b5e011 100644
--- a/vllm_mlx/agents/profiles/cline.yaml
+++ b/vllm_mlx/agents/profiles/cline.yaml
@@ -13,9 +13,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "gemma-4-26b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "gemma-4-26b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/codex.yaml b/vllm_mlx/agents/profiles/codex.yaml
index b3c47b1c..a2bdf438 100644
--- a/vllm_mlx/agents/profiles/codex.yaml
+++ b/vllm_mlx/agents/profiles/codex.yaml
@@ -17,9 +17,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/generic.yaml b/vllm_mlx/agents/profiles/generic.yaml
index 39ef387d..4ee271f2 100644
--- a/vllm_mlx/agents/profiles/generic.yaml
+++ b/vllm_mlx/agents/profiles/generic.yaml
@@ -10,9 +10,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-4b"
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
+    - "qwen3.5-4b-4bit"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
 
 streaming:
   extra_tool_tags: []
diff --git a/vllm_mlx/agents/profiles/goose.yaml b/vllm_mlx/agents/profiles/goose.yaml
index 4287f07f..2e49a6b3 100644
--- a/vllm_mlx/agents/profiles/goose.yaml
+++ b/vllm_mlx/agents/profiles/goose.yaml
@@ -11,9 +11,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/hermes.yaml b/vllm_mlx/agents/profiles/hermes.yaml
index 849f0506..afbd05e7 100644
--- a/vllm_mlx/agents/profiles/hermes.yaml
+++ b/vllm_mlx/agents/profiles/hermes.yaml
@@ -18,11 +18,11 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-4b"
-    - "qwen3.5-9b"
-    - "gemma-4-26b"
+    - "qwen3.5-4b-4bit"
+    - "qwen3.5-9b-4bit"
+    - "gemma-4-26b-4bit"
     - "qwen3.5-35b-a3b"
-    - "qwen3.6-35b"
+    - "qwen3.6-35b-4bit"
   parser_override: null  # use auto-detect per model
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/langchain.yaml b/vllm_mlx/agents/profiles/langchain.yaml
index 55e10ded..8f8e9062 100644
--- a/vllm_mlx/agents/profiles/langchain.yaml
+++ b/vllm_mlx/agents/profiles/langchain.yaml
@@ -11,9 +11,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
 
 streaming:
   max_tools: 10
diff --git a/vllm_mlx/agents/profiles/openclaude.yaml b/vllm_mlx/agents/profiles/openclaude.yaml
index be8a890b..7d10bcca 100644
--- a/vllm_mlx/agents/profiles/openclaude.yaml
+++ b/vllm_mlx/agents/profiles/openclaude.yaml
@@ -13,10 +13,10 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
-    - "gemma-4-26b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
+    - "gemma-4-26b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/opencode.yaml b/vllm_mlx/agents/profiles/opencode.yaml
index 28db5c2b..0143eeab 100644
--- a/vllm_mlx/agents/profiles/opencode.yaml
+++ b/vllm_mlx/agents/profiles/opencode.yaml
@@ -33,10 +33,10 @@ config:
 
 models:
   recommended:
-    - "qwen3-coder-30b"
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3-coder-30b-4bit"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/openhands.yaml b/vllm_mlx/agents/profiles/openhands.yaml
index 8820f984..de6c8735 100644
--- a/vllm_mlx/agents/profiles/openhands.yaml
+++ b/vllm_mlx/agents/profiles/openhands.yaml
@@ -12,9 +12,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
   parser_override: null
 
 streaming:
diff --git a/vllm_mlx/agents/profiles/pydanticai.yaml b/vllm_mlx/agents/profiles/pydanticai.yaml
index 6c34da17..bacfde84 100644
--- a/vllm_mlx/agents/profiles/pydanticai.yaml
+++ b/vllm_mlx/agents/profiles/pydanticai.yaml
@@ -11,9 +11,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
 
 streaming:
   max_tools: 10
diff --git a/vllm_mlx/agents/profiles/smolagents.yaml b/vllm_mlx/agents/profiles/smolagents.yaml
index 425a400c..e5880bee 100644
--- a/vllm_mlx/agents/profiles/smolagents.yaml
+++ b/vllm_mlx/agents/profiles/smolagents.yaml
@@ -11,9 +11,9 @@ config:
 
 models:
   recommended:
-    - "qwen3.5-9b"
-    - "qwen3.6-35b"
-    - "qwen3.5-4b"
+    - "qwen3.5-9b-4bit"
+    - "qwen3.6-35b-4bit"
+    - "qwen3.5-4b-4bit"
 
 streaming:
   max_tools: 10
diff --git a/vllm_mlx/aliases.json b/vllm_mlx/aliases.json
index f8051597..7a5edf99 100644
--- a/vllm_mlx/aliases.json
+++ b/vllm_mlx/aliases.json
@@ -1,5 +1,5 @@
 {
-  "qwen3.5-4b": {
+  "qwen3.5-4b-4bit": {
     "hf_path": "mlx-community/Qwen3.5-4B-MLX-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -14,7 +14,7 @@
     "is_moe": false,
     "supports_spec_decode": false
   },
-  "qwen3.5-9b": {
+  "qwen3.5-9b-4bit": {
     "hf_path": "mlx-community/Qwen3.5-9B-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -29,14 +29,14 @@
     "is_moe": false,
     "supports_spec_decode": false
   },
-  "qwen3.5-27b": {
+  "qwen3.5-27b-4bit": {
     "hf_path": "mlx-community/Qwen3.5-27B-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
     "is_hybrid": true,
     "supports_spec_decode": false
   },
-  "qwen3.5-35b": {
+  "qwen3.5-35b-8bit": {
     "hf_path": "mlx-community/Qwen3.5-35B-A3B-8bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -52,7 +52,7 @@
     "supports_spec_decode": false,
     "is_moe": true
   },
-  "qwen3.5-122b": {
+  "qwen3.5-122b-mxfp4": {
     "hf_path": "nightmedia/Qwen3.5-122B-A10B-Text-mxfp4-mlx",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -78,7 +78,7 @@
     "supports_dflash": true,
     "dflash_draft_model": "z-lab/Qwen3.5-27B-DFlash"
   },
-  "qwen3.6-27b": {
+  "qwen3.6-27b-4bit": {
     "hf_path": "mlx-community/Qwen3.6-27B-4bit",
     "tool_call_parser": "qwen3_coder_xml",
     "reasoning_parser": "qwen3",
@@ -101,7 +101,7 @@
     "is_hybrid": true,
     "supports_spec_decode": false
   },
-  "qwen3.6-35b": {
+  "qwen3.6-35b-4bit": {
     "hf_path": "mlx-community/Qwen3.6-35B-A3B-4bit",
     "tool_call_parser": "qwen3_coder_xml",
     "reasoning_parser": "qwen3",
@@ -141,13 +141,6 @@
     "supports_spec_decode": false,
     "is_moe": true
   },
-  "deepseek-v4-flash": {
-    "hf_path": "mlx-community/DeepSeek-V4-Flash-8bit",
-    "tool_call_parser": "deepseek",
-    "reasoning_parser": "deepseek_r1",
-    "is_hybrid": false,
-    "supports_spec_decode": true
-  },
   "deepseek-v4-flash-2bit": {
     "hf_path": "mlx-community/DeepSeek-V4-Flash-2bit-DQ",
     "tool_call_parser": "deepseek",
@@ -201,14 +194,14 @@
     "is_moe": false,
     "supports_spec_decode": true
   },
-  "qwen3-coder": {
+  "qwen3-coder-4bit": {
     "hf_path": "lmstudio-community/Qwen3-Coder-Next-MLX-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
     "is_hybrid": true,
     "supports_spec_decode": false
   },
-  "qwen3-vl-4b": {
+  "qwen3-vl-4b-4bit": {
     "hf_path": "mlx-community/Qwen3-VL-4B-Instruct-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -221,7 +214,7 @@
       "code_edit": 1.001
     }
   },
-  "llama3-1b": {
+  "llama3-1b-4bit": {
     "hf_path": "mlx-community/Llama-3.2-1B-Instruct-4bit",
     "tool_call_parser": "llama",
     "reasoning_parser": null,
@@ -235,7 +228,7 @@
       "code_edit": 0.851
     }
   },
-  "llama3-3b": {
+  "llama3-3b-4bit": {
     "hf_path": "mlx-community/Llama-3.2-3B-Instruct-4bit",
     "tool_call_parser": "llama",
     "reasoning_parser": null,
@@ -247,7 +240,7 @@
       "json_array": 0.875
     }
   },
-  "hermes3-8b": {
+  "hermes3-8b-4bit": {
     "hf_path": "mlx-community/Hermes-3-Llama-3.1-8B-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -261,7 +254,7 @@
       "code_edit": 0.584
     }
   },
-  "gemma-4-12b": {
+  "gemma-4-12b-4bit": {
     "hf_path": "mlx-community/gemma-4-12B-it-4bit",
     "tool_call_parser": "gemma4",
     "reasoning_parser": "gemma4",
@@ -287,7 +280,7 @@
       "top_k": 64
     }
   },
-  "gemma-4-26b": {
+  "gemma-4-26b-4bit": {
     "hf_path": "mlx-community/gemma-4-26b-a4b-it-4bit",
     "tool_call_parser": "gemma4",
     "reasoning_parser": "gemma4",
@@ -303,7 +296,7 @@
       "top_k": 64
     }
   },
-  "gemma-4-31b": {
+  "gemma-4-31b-4bit": {
     "hf_path": "mlx-community/gemma-4-31b-it-4bit",
     "tool_call_parser": "gemma4",
     "reasoning_parser": "gemma4",
@@ -334,7 +327,7 @@
       "top_k": 64
     }
   },
-  "gemma-4-12b-qat": {
+  "gemma-4-12b-qat-4bit": {
     "hf_path": "mlx-community/gemma-4-12B-it-qat-4bit",
     "tool_call_parser": "gemma4",
     "reasoning_parser": "gemma4",
@@ -360,7 +353,7 @@
       "top_k": 64
     }
   },
-  "gemma-4-26b-qat": {
+  "gemma-4-26b-qat-4bit": {
     "hf_path": "mlx-community/gemma-4-26B-A4B-it-qat-4bit",
     "tool_call_parser": "gemma4",
     "reasoning_parser": "gemma4",
@@ -373,7 +366,7 @@
       "top_k": 64
     }
   },
-  "gemma-4-31b-qat": {
+  "gemma-4-31b-qat-4bit": {
     "hf_path": "mlx-community/gemma-4-31B-it-qat-4bit",
     "tool_call_parser": "gemma4",
     "reasoning_parser": "gemma4",
@@ -399,20 +392,7 @@
       "top_k": 64
     }
   },
-  "gemma4": {
-    "hf_path": "mlx-community/gemma-4-12B-it-qat-4bit",
-    "tool_call_parser": "gemma4",
-    "reasoning_parser": "gemma4",
-    "is_hybrid": false,
-    "is_moe": false,
-    "supports_spec_decode": true,
-    "recommended_sampling": {
-      "temperature": 1.0,
-      "top_p": 0.95,
-      "top_k": 64
-    }
-  },
-  "gemma3-12b": {
+  "gemma3-12b-4bit": {
     "hf_path": "mlx-community/gemma-3-12b-it-qat-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -424,8 +404,8 @@
       "top_k": 64
     }
   },
-  "phi4-14b": {
-    "hf_path": "mlx-community/phi-4-mini-instruct-4bit",
+  "phi-4-14b-4bit": {
+    "hf_path": "mlx-community/phi-4-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
     "is_hybrid": false,
@@ -438,14 +418,14 @@
       "code_edit": 0.874
     }
   },
-  "mistral-24b": {
+  "mistral-24b-4bit": {
     "hf_path": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
     "is_hybrid": false,
     "supports_spec_decode": true
   },
-  "devstral-24b": {
+  "devstral-24b-4bit": {
     "hf_path": "mlx-community/Devstral-Small-2507-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -462,7 +442,7 @@
       "temperature": 0.15
     }
   },
-  "glm4.7-9b": {
+  "glm4.7-9b-4bit": {
     "hf_path": "mlx-community/GLM-4.7-Flash-4bit",
     "tool_call_parser": "glm47",
     "reasoning_parser": "glm4",
@@ -479,7 +459,7 @@
       "top_p": 0.95
     }
   },
-  "glm4.5-air": {
+  "glm4.5-air-4bit": {
     "hf_path": "mlx-community/GLM-4.5-Air-4bit",
     "tool_call_parser": "glm47",
     "reasoning_parser": "glm4",
@@ -497,28 +477,28 @@
       "top_p": 0.95
     }
   },
-  "gpt-oss-20b": {
+  "gpt-oss-20b-mxfp4-q8": {
     "hf_path": "mlx-community/gpt-oss-20b-MXFP4-Q8",
     "tool_call_parser": "harmony",
     "reasoning_parser": "harmony",
     "is_hybrid": false,
     "supports_spec_decode": true
   },
-  "minimax-m2.5": {
+  "minimax-m2.5-4bit": {
     "hf_path": "lmstudio-community/MiniMax-M2.5-MLX-4bit",
     "tool_call_parser": "minimax",
     "reasoning_parser": "minimax",
     "is_hybrid": false,
     "supports_spec_decode": true
   },
-  "minimax-m2.7": {
+  "minimax-m2.7-mxfp4": {
     "hf_path": "mlx-community/MiniMax-M2.7-4bit-mxfp4",
     "tool_call_parser": "minimax",
     "reasoning_parser": "minimax",
     "is_hybrid": false,
     "supports_spec_decode": true
   },
-  "deepseek-r1-8b": {
+  "deepseek-r1-8b-4bit": {
     "hf_path": "mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit",
     "tool_call_parser": "deepseek_v31",
     "reasoning_parser": "deepseek_r1",
@@ -531,7 +511,7 @@
       "code_edit": 0.289
     }
   },
-  "deepseek-r1-32b": {
+  "deepseek-r1-32b-4bit": {
     "hf_path": "mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit",
     "tool_call_parser": "deepseek",
     "reasoning_parser": "deepseek_r1",
@@ -545,14 +525,14 @@
       "code_edit": 1.002
     }
   },
-  "qwopus-9b": {
+  "qwopus-9b-4bit": {
     "hf_path": "Jackrong/MLX-Qwopus3.5-9B-v3-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
     "is_hybrid": true,
     "supports_spec_decode": false
   },
-  "qwopus-27b": {
+  "qwopus-27b-4bit": {
     "hf_path": "Jackrong/MLX-Qwopus3.5-27B-v3-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -566,21 +546,21 @@
     "is_hybrid": true,
     "supports_spec_decode": false
   },
-  "kimi-48b": {
+  "kimi-48b-4bit": {
     "hf_path": "mlx-community/Kimi-K2-Instruct-4bit",
     "tool_call_parser": "kimi",
     "reasoning_parser": null,
     "is_hybrid": false,
     "supports_spec_decode": true
   },
-  "kimi-k2.5": {
+  "kimi-k2.5-3bit": {
     "hf_path": "mlx-community/Kimi-K2.5-3bit",
     "tool_call_parser": "kimi",
     "reasoning_parser": "qwen3",
     "is_hybrid": false,
     "supports_spec_decode": true
   },
-  "ministral-3b": {
+  "ministral-3b-4bit": {
     "hf_path": "mlx-community/Ministral-3-3B-Instruct-2512-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -592,7 +572,7 @@
       "json_array": 1.08
     }
   },
-  "hermes4-70b": {
+  "hermes4-70b-4bit": {
     "hf_path": "lmstudio-community/Hermes-4-70B-MLX-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "glm4",
@@ -606,7 +586,7 @@
       "code_edit": 0.679
     }
   },
-  "qwen3-vl-8b": {
+  "qwen3-vl-8b-4bit": {
     "hf_path": "mlx-community/Qwen3-VL-8B-Instruct-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -619,7 +599,7 @@
       "code_edit": 1.024
     }
   },
-  "qwen3-vl-30b": {
+  "qwen3-vl-30b-4bit": {
     "hf_path": "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -633,7 +613,7 @@
     },
     "is_moe": true
   },
-  "devstral-v2-24b": {
+  "devstral-v2-24b-4bit": {
     "hf_path": "mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -649,7 +629,7 @@
       "temperature": 0.15
     }
   },
-  "qwen3-coder-30b": {
+  "qwen3-coder-30b-4bit": {
     "hf_path": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -664,7 +644,7 @@
     },
     "is_moe": true
   },
-  "gemma-3n-e4b": {
+  "gemma-3n-e4b-4bit": {
     "hf_path": "mlx-community/gemma-3n-E4B-it-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -676,7 +656,7 @@
       "top_k": 64
     }
   },
-  "gemma3-1b": {
+  "gemma3-1b-4bit": {
     "hf_path": "mlx-community/gemma-3-1b-it-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -694,7 +674,7 @@
       "top_k": 64
     }
   },
-  "gemma3-27b": {
+  "gemma3-27b-4bit": {
     "hf_path": "mlx-community/gemma-3-27b-it-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
@@ -706,15 +686,7 @@
       "top_k": 64
     }
   },
-  "nemotron-30b": {
-    "hf_path": "lmstudio-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-4bit",
-    "tool_call_parser": "hermes",
-    "reasoning_parser": "qwen3",
-    "is_hybrid": true,
-    "supports_spec_decode": false,
-    "is_moe": true
-  },
-  "nemotron-nano": {
+  "nemotron-30b-4bit": {
     "hf_path": "lmstudio-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -722,7 +694,7 @@
     "supports_spec_decode": false,
     "is_moe": true
   },
-  "bonsai-1.7b": {
+  "bonsai-1.7b-unpacked": {
     "hf_path": "prism-ml/Bonsai-1.7B-unpacked",
     "tool_call_parser": "hermes",
     "reasoning_parser": "glm4",
@@ -736,7 +708,7 @@
       "code_edit": 1.051
     }
   },
-  "bonsai-4b": {
+  "bonsai-4b-unpacked": {
     "hf_path": "prism-ml/Bonsai-4B-unpacked",
     "tool_call_parser": "hermes",
     "reasoning_parser": "glm4",
@@ -749,7 +721,7 @@
       "code_edit": 1.012
     }
   },
-  "bonsai-8b": {
+  "bonsai-8b-unpacked": {
     "hf_path": "prism-ml/Bonsai-8B-unpacked",
     "tool_call_parser": "hermes",
     "reasoning_parser": "glm4",
@@ -762,7 +734,7 @@
       "code_edit": 1.181
     }
   },
-  "smollm3-3b": {
+  "smollm3-3b-4bit": {
     "hf_path": "mlx-community/SmolLM3-3B-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": "qwen3",
@@ -774,11 +746,20 @@
       "tool_loop": 0.776
     }
   },
-  "granite4-tiny": {
+  "granite4-tiny-4bit": {
     "hf_path": "mlx-community/granite-4.0-h-tiny-4bit",
     "tool_call_parser": "hermes",
     "reasoning_parser": null,
     "is_hybrid": true,
     "supports_spec_decode": false
+  },
+  "phi-4-mini-4bit": {
+    "hf_path": "mlx-community/phi-4-mini-instruct-4bit",
+    "tool_call_parser": "hermes",
+    "reasoning_parser": null,
+    "is_hybrid": false,
+    "is_moe": false,
+    "supports_spec_decode": true,
+    "suffix_decoding_tier": "unknown"
   }
 }
diff --git a/vllm_mlx/api/utils.py b/vllm_mlx/api/utils.py
index 29cd4fb5..1ec74a4f 100644
--- a/vllm_mlx/api/utils.py
+++ b/vllm_mlx/api/utils.py
@@ -133,7 +133,7 @@ def _clean_gpt_oss_output(text: str) -> str:
     """
     # Tool-call structure must survive to the harmony tool parser:
     # if the model emitted ``<|channel|>commentary to=functions.X...<|call|>``
-    # (which gpt-oss-20b does for every tool invocation), the parser needs
+    # (which gpt-oss-20b-mxfp4-q8 does for every tool invocation), the parser needs
     # those structural tokens intact to extract the call. Stripping them
     # here drops the args into plain text and the parser returns 0 calls.
     # Same regression class as PR #436 but for the tool parser. Final
diff --git a/vllm_mlx/cli.py b/vllm_mlx/cli.py
index fd0ed6c9..ad7c102a 100644
--- a/vllm_mlx/cli.py
+++ b/vllm_mlx/cli.py
@@ -10,9 +10,9 @@
     rapid-mlx chat <model>                 Interactive chat REPL
 
 Usage:
-    rapid-mlx serve qwen3.5-4b --port 8000
-    rapid-mlx bench qwen3.5-4b --num-prompts 10
-    rapid-mlx chat qwen3.5-4b
+    rapid-mlx serve qwen3.5-4b-4bit --port 8000
+    rapid-mlx bench qwen3.5-4b-4bit --num-prompts 10
+    rapid-mlx chat qwen3.5-4b-4bit
 """
 
 import argparse
@@ -1374,7 +1374,7 @@ def _print_cached_models() -> None:
         return
 
     # Reverse-map HF repo path → alias name so the alias column matches the
-    # user's mental model (``qwen3.5-4b`` not ``mlx-community/Qwen3.5-4B...``).
+    # user's mental model (``qwen3.5-4b-4bit`` not ``mlx-community/Qwen3.5-4B...``).
     profiles = list_profiles()
     hf_to_alias: dict[str, str] = {}
     for alias, p in profiles.items():
@@ -1450,11 +1450,11 @@ def models_command(args):
     print(f"  Available models ({len(profiles)} aliases)")
 
     # Widths sized to fit the longest values currently in aliases.json:
-    # alias 22 (qwen3.5-122b-mxfp4 etc.), tool 16 (qwen3_coder_xml + 1 pad),
-    # reasoning 12 (deepseek_r1 + 1 pad), spec 10 ("✗ hybrid"), tier 11,
-    # dflash 7 ("✓ ready"/"—").
+    # alias 24 (deepseek-v4-flash-8bit is 22 chars; +2 pad after explicit
+    # quant rename), tool 16 (qwen3_coder_xml + 1 pad), reasoning 12
+    # (deepseek_r1 + 1 pad), spec 10 ("✗ hybrid"), tier 11, dflash 7.
     cols = (
-        ("Alias", 22),
+        ("Alias", 24),
         ("Tools", 16),
         ("Reasoning", 12),
         ("Spec-Decode", 10),
@@ -1487,7 +1487,7 @@ def models_command(args):
         # registry column is pure declarative state.
         dflash = "✓" if p.supports_dflash else "—"
         row = (
-            f"  {alias:<22} {tools:<16} {reasoning:<12} "
+            f"  {alias:<24} {tools:<16} {reasoning:<12} "
             f"{spec:<10} {tier:<11} {dflash:<7}"
         )
         print(row)
@@ -1607,7 +1607,7 @@ def ps_command(_args):
         try:
             i = cmd.index("serve") + 1
             # Pre-PR this loop ``break``ed on the first positional, so a
-            # ``rapid-mlx serve qwen3.5-4b --port 8005`` ended with
+            # ``rapid-mlx serve qwen3.5-4b-4bit --port 8005`` ended with
             # port="8000" because the positional model token came before
             # ``--port``. Keep scanning for flags after we've captured the
             # model — argparse accepts them on either side.
@@ -1673,7 +1673,7 @@ def _spawn_chat_server(
 
     If ``served_name`` is given, it is passed via ``--served-model-name`` so
     the spawned server exposes the alias as the API model name (e.g. user
-    typed ``qwen3.5-4b`` → API requests use ``qwen3.5-4b`` rather than the
+    typed ``qwen3.5-4b-4bit`` → API requests use ``qwen3.5-4b-4bit`` rather than the
     expanded HF path).
     """
     import socket
@@ -1809,7 +1809,7 @@ def _has_short_pattern_dominating_suffix(
 
     - ``"BarleyBarleyBarley..."`` (no whitespace separator) — the entire
       suffix collapses to a single ``str.split()`` token whose count
-      never increments. Real qwen3.5-4b regression surfaced in the
+      never increments. Real qwen3.5-4b-4bit regression surfaced in the
       0.6.28 onboarding test.
     - Long-cycle phrase loops, e.g. a ~280-char clause that repeats
       verbatim until ``max_tokens``. Surfaced when asked "describe the
@@ -2051,7 +2051,7 @@ def _close_open_md_spans() -> None:
     #    pattern. Catches the form ``"BarleyBarleyBarley..."`` (no
     #    whitespace separator), where ``piece.split()`` produces one
     #    giant token whose count never increments — this was a real
-    #    qwen3.5-4b regression in 0.6.28 (issue surfaced post-release).
+    #    qwen3.5-4b-4bit regression in 0.6.28 (issue surfaced post-release).
     REPEAT_LIMIT = 25
     repeat_last: str | None = None
     repeat_run = 0
@@ -2442,7 +2442,7 @@ def _sigterm_handler(*_):
     # we can distinguish "user did not pass it" from "user passed 2048
     # explicitly". When ``--think`` is set and the user did not supply a
     # value, raise the default from 2048 to 4096 so the reasoning trace +
-    # final answer both fit (the round-1 finding: ``chat qwen3.5-4b
+    # final answer both fit (the round-1 finding: ``chat qwen3.5-4b-4bit
     # --think`` filled the 2048 budget with reasoning and emitted an
     # empty answer with ``finish_reason='length'``).
     user_passed_max_tokens = args.max_tokens is not None
@@ -2487,7 +2487,7 @@ def _sigterm_handler(*_):
     #
     # Default thinking OFF in the REPL. Reasoning models (Qwen3.5/3.6, etc.)
     # otherwise emit raw chain-of-thought to stdout AND, on the default
-    # qwen3.5-4b model, degenerate into infinite repetition until max-tokens
+    # qwen3.5-4b-4bit model, degenerate into infinite repetition until max-tokens
     # truncates the response — producing zero usable output for a brand-new
     # user. ``--think`` opts back in for users who explicitly want to see
     # reasoning traces; ``--no-think`` is preserved as the legacy form.
@@ -3264,12 +3264,12 @@ def main():
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""\
 Examples:
-  rapid-mlx chat                                      # interactive REPL (defaults to qwen3.5-4b)
-  rapid-mlx chat qwen3.5-9b --think                   # larger model, surface reasoning
-  rapid-mlx serve qwen3.5-9b --port 8000              # OpenAI-compatible server
+  rapid-mlx chat                                      # interactive REPL (defaults to qwen3.5-4b-4bit)
+  rapid-mlx chat qwen3.5-9b-4bit --think              # larger model, surface reasoning
+  rapid-mlx serve qwen3.5-9b-4bit --port 8000         # OpenAI-compatible server
   rapid-mlx serve mlx-community/Qwen3.5-9B-4bit       # full HF repo also works
   rapid-mlx models                                    # list all aliases
-  rapid-mlx info qwen3.5-9b                           # show per-alias profile
+  rapid-mlx info qwen3.5-9b-4bit                      # show per-alias profile
 """,
     )
     parser.add_argument(
@@ -4092,13 +4092,13 @@ def main():
         "pull", help="Download a model to the HuggingFace cache (no server)"
     )
     pull_parser.add_argument(
-        "model", help="Model alias (e.g. qwen3.5-4b) or HF repo (org/name)"
+        "model", help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (org/name)"
     ).completer = alias_completer
     rm_parser = subparsers.add_parser(
         "rm", help="Remove a cached model from the HuggingFace cache"
     )
     rm_parser.add_argument(
-        "model", help="Model alias (e.g. qwen3.5-4b) or HF repo (org/name)"
+        "model", help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (org/name)"
     ).completer = alias_completer
     subparsers.add_parser("ps", help="List running rapid-mlx servers")
 
@@ -4135,9 +4135,9 @@ def main():
     chat_parser.add_argument(
         "model",
         nargs="?",
-        default="qwen3.5-4b",
-        help="Model alias (e.g. qwen3.5-4b) or HF repo (org/name). "
-        "Defaults to qwen3.5-4b when omitted.",
+        default="qwen3.5-4b-4bit",
+        help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (org/name). "
+        "Defaults to qwen3.5-4b-4bit when omitted.",
     ).completer = alias_completer
     chat_parser.add_argument(
         "--system",
@@ -4213,7 +4213,7 @@ def main():
     )
     info_parser.add_argument(
         "model",
-        help="Model alias (e.g. qwen3.5-4b) or HF repo (e.g. mlx-community/SmolLM3-3B-4bit)",
+        help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (e.g. mlx-community/SmolLM3-3B-4bit)",
     ).completer = alias_completer
 
     # Agents command
@@ -4271,14 +4271,14 @@ def main():
         "--model",
         type=str,
         default=None,
-        help="Model alias for check tier (default: qwen3.5-35b)",
+        help="Model alias for check tier (default: qwen3.5-35b-8bit)",
     ).completer = alias_completer
     doctor_parser.add_argument(
         "--models",
         type=str,
         default=None,
         help="Comma-separated model aliases for full / benchmark tiers "
-        "(full default: qwen3.5-35b,qwen3.6-35b; "
+        "(full default: qwen3.5-35b-8bit,qwen3.6-35b-4bit; "
         "benchmark default: auto-discovered from local cache)",
     ).completer = alias_csv_completer
     doctor_parser.add_argument(
diff --git a/vllm_mlx/doctor/__init__.py b/vllm_mlx/doctor/__init__.py
index 2e1ebafb..4227c476 100644
--- a/vllm_mlx/doctor/__init__.py
+++ b/vllm_mlx/doctor/__init__.py
@@ -4,8 +4,8 @@
 
 Three tiers + a benchmark sweep:
   - smoke      (~2 min, no model)         — pytest, ruff, CLI sanity
-  - check      (~15 min, qwen3.5-35b)     — server + perf + agents + baseline diff
-  - full       (~2-3 hr, 2 models)        — check across qwen3.5-35b + qwen3.6-35b
+  - check      (~15 min, qwen3.5-35b-8bit) — server + perf + agents + baseline diff
+  - full       (~2-3 hr, 2 models)        — check across qwen3.5-35b-8bit + qwen3.6-35b-4bit
   - benchmark  (overnight, all models)    — cross-model × cross-engine scorecard
 
 Entry point: ``rapid-mlx doctor {smoke,check,full,benchmark}``
diff --git a/vllm_mlx/doctor/baseline.py b/vllm_mlx/doctor/baseline.py
index c0812598..f84d5f43 100644
--- a/vllm_mlx/doctor/baseline.py
+++ b/vllm_mlx/doctor/baseline.py
@@ -10,7 +10,7 @@
     {
       "captured_at": "2026-04-15T21:00:00",
       "rapid_mlx_version": "0.5.1",
-      "model": "qwen3.5-35b",
+      "model": "qwen3.5-35b-8bit",
       "metrics": {
         "decode_tps": 156.2,
         "ttft_cold_ms": 412,
diff --git a/vllm_mlx/doctor/cli.py b/vllm_mlx/doctor/cli.py
index 901f1a91..ace4abae 100644
--- a/vllm_mlx/doctor/cli.py
+++ b/vllm_mlx/doctor/cli.py
@@ -20,7 +20,7 @@
 # Default model used by the check tier.  Tier 3 (full) loops a wider list.
 # A real-capacity 8-bit model is required so eval failures can be cleanly
 # attributed to rapid-mlx bugs rather than small-model quant noise.
-DEFAULT_CHECK_MODEL = "qwen3.5-35b"
+DEFAULT_CHECK_MODEL = "qwen3.5-35b-8bit"
 
 # Model list for the full tier: real-capacity Qwen lines only. No 4B
 # (small models can't separate model errors from engine errors) and no
@@ -29,7 +29,7 @@
 # refusal, multi-turn context drift; failures don't cleanly attribute
 # to rapid-mlx so it's noise here). Add Gemma back when a tighter
 # instruct variant ships.
-DEFAULT_FULL_MODELS = ["qwen3.5-35b", "qwen3.6-35b"]
+DEFAULT_FULL_MODELS = ["qwen3.5-35b-8bit", "qwen3.6-35b-4bit"]
 
 # Agent profiles to exercise per-model in the full tier.  None ⇒ all
 # loaded profiles.  Limit here if a particular profile is too slow to
@@ -180,7 +180,7 @@ def run_full_tier(models: list[str], update_baselines: bool = False):
             update_baselines=update_baselines,
             agent_profiles=profile_names,
             # boot_timeout_s=None → _suggested_boot_timeout picks 600s for
-            # the 27B+ models (qwen3.5-35b, gemma-4-26b) and 180s for
+            # the 27B+ models (qwen3.5-35b-8bit, gemma-4-26b-4bit) and 180s for
             # smaller ones, so the same logic applies regardless of which
             # tier called us.
         )
@@ -226,7 +226,7 @@ def _resolve_agent_profiles(explicit: list[str] | None) -> list[str]:
 # Earlier iterations tried to pick a tier-/alias-aware shorter budget for
 # the small-model case, but every heuristic missed at least one supported
 # large model (Qwen3-Coder lacks a 'NNb' hint in its alias, MiniMax M2.5
-# is huge but named 'minimax-m2.5', etc.).  The optimisation isn't worth
+# is huge but named 'minimax-m2.5-4bit', etc.).  The optimisation isn't worth
 # the false-fail risk.
 DEFAULT_BOOT_TIMEOUT_S = 600
 
diff --git a/vllm_mlx/engine/batched.py b/vllm_mlx/engine/batched.py
index 40b9253a..aa71c8a3 100644
--- a/vllm_mlx/engine/batched.py
+++ b/vllm_mlx/engine/batched.py
@@ -1199,7 +1199,7 @@ async def chat(
         #
         # Hybrid-only gate: the boundary split routes through
         # ``BatchGenerator.insert_segments`` which on pure-Transformer models
-        # (e.g. gpt-oss-20b harmony) corrupts the harmony tool-call channel
+        # (e.g. gpt-oss-20b-mxfp4-q8 harmony) corrupts the harmony tool-call channel
         # state across multi-turn-with-tools and the agent loops forever.
         # Pure Transformers don't need the boundary save anyway — the prefix
         # cache already reuses via trim+supersequence. Only hybrid models
@@ -1234,7 +1234,7 @@ def _is_hybrid_model(self) -> bool:
 
         Pure Transformer models don't have this constraint — trim works
         — so they don't need the boundary save. Worse, the boundary
-        split routes through ``insert_segments`` which on gpt-oss-20b
+        split routes through ``insert_segments`` which on gpt-oss-20b-mxfp4-q8
         empirically corrupts harmony tool-call channel state across
         multi-turn-with-tools (pydantic_ai multi_tool 5/6 → loops on
         ``add(3,4)``). Gating the entire boundary path on this flag is
@@ -1518,7 +1518,7 @@ async def _stream_with_output_router(
             # returns a response missing the ``logprobs`` field entirely
             # because ``_extract_streaming_token_logprobs`` sees
             # ``chunk.logprobs is None`` for every routed chunk. Confirmed
-            # on gpt-oss-20b PyPI v0.6.66 during the 2026-05-23 onboarding
+            # on gpt-oss-20b-mxfp4-q8 PyPI v0.6.66 during the 2026-05-23 onboarding
             # sweep. PR #450 fixed the pre-existing AttributeError on the
             # non-routed path but couldn't surface this gap because its
             # tests use single-token GenerationOutput stubs that never go
@@ -1546,7 +1546,7 @@ async def _stream_with_output_router(
                     # TOOL_CALL it would override the accumulated body with
                     # just the end-marker token's text, dropping the body on
                     # the floor and breaking streaming tool calls for gemma4
-                    # and harmony — caught on gemma-4-26b post-v0.6.61.
+                    # and harmony — caught on gemma-4-26b-4bit post-v0.6.61.
                     if event.channel == Channel.TOOL_CALL:
                         event_text = event.text
                         # Tool-call channel aggregates many tokens; the
diff --git a/vllm_mlx/model_aliases.py b/vllm_mlx/model_aliases.py
index 6360e9c3..2cefef61 100644
--- a/vllm_mlx/model_aliases.py
+++ b/vllm_mlx/model_aliases.py
@@ -38,8 +38,8 @@
 # Reverse index: hf_path → first alias that references it. Built once
 # alongside ``_aliases`` so reverse lookups in ``resolve_profile`` are
 # O(1) instead of scanning all 50+ profiles on every cache-miss.
-# When two aliases share the same hf_path (e.g. ``nemotron-30b`` and
-# ``nemotron-nano`` both pointing at the same MLX repo), the first one
+# When two aliases share the same hf_path (as the dropped codename
+# ``nemotron-nano`` did with ``nemotron-30b-4bit``), the first one
 # in JSON order wins. The contract is "any profile valid for this
 # path" rather than "the canonical alias", so this is fine.
 _hf_to_alias: dict[str, str] | None = None
@@ -306,7 +306,7 @@ def resolve_profile(name: str) -> AliasProfile | None:
     """Return the profile for an alias name or full HF path.
 
     Two lookups in order:
-    1. Direct alias name match (``qwen3.5-4b``).
+    1. Direct alias name match (``qwen3.5-4b-4bit``).
     2. Reverse HF-path match (``mlx-community/Qwen3.5-4B-MLX-4bit``)
        via the pre-built ``_hf_to_alias`` index — O(1).
 
@@ -332,7 +332,7 @@ def _family_prefix(name: str) -> str:
     ``hermes`` → ``hermes`` (single token, no change)
 
     Used to keep typo suggestions inside the same family — ``deepseek-v4-27b``
-    suggests ``deepseek-v4-flash``, not ``deepseek-r1-32b``.
+    suggests ``deepseek-v4-flash-8bit``, not ``deepseek-r1-32b-4bit``.
     """
     parts = name.split("-")
     while parts:
@@ -355,7 +355,7 @@ def _letters_only_prefix(name: str) -> str:
     returns nothing useful — handles cases where the user collapses or
     inserts separators we don't use (``gemma4-27b`` → ``gemma``, matches
     our ``gemma-4-*`` and ``gemma3-*`` aliases; ``mistral24b`` →
-    ``mistral``, matches ``mistral-24b``).
+    ``mistral``, matches ``mistral-24b-4bit``).
     """
     out = []
     for ch in name.lower():
@@ -372,7 +372,7 @@ def suggest_similar(name: str, n: int = 3, cutoff: float = 0.5) -> list[str]:
     Family-aware in two passes:
     1. **Strict family match** — uses ``_family_prefix`` (drops trailing
        size/quant tokens). Keeps the wrong-family bait-and-switch (typing
-       ``deepseek-v4-27b`` and being told ``deepseek-r1-32b``) from
+       ``deepseek-v4-27b`` and being told ``deepseek-r1-32b-4bit``) from
        happening, and prevents legitimate single-segment HuggingFace IDs
        like ``gpt2`` or ``bert-base-uncased`` from spuriously matching.
     2. **Letter-only prefix fallback** — if step 1 finds nothing, retry
@@ -397,7 +397,7 @@ def suggest_similar(name: str, n: int = 3, cutoff: float = 0.5) -> list[str]:
         if same_fam and same_fam != [fam]:
             # If we found candidates in the same strict family, trust the
             # cutoff — even if it filters everything out. The cutoff
-            # rejecting ``gpt2`` against ``gpt-oss-20b`` is the
+            # rejecting ``gpt2`` against ``gpt-oss-20b-mxfp4-q8`` is the
             # legitimate-HF-ID guarantee at work; the letter-only
             # fallback below would override that and is wrong here.
             return difflib.get_close_matches(name, same_fam, n=n, cutoff=cutoff)
@@ -439,12 +439,12 @@ def suggest_similar(name: str, n: int = 3, cutoff: float = 0.5) -> list[str]:
 # the small/fast tier and one well-known representative per category —
 # auto-generation would spit out alphabetic noise like ``bonsai-*`` first.
 POPULAR_ALIASES: tuple[str, ...] = (
-    "qwen3.5-4b",  # default smoke / small
-    "qwen3.5-9b",  # mid-size general
-    "qwen3.6-27b",  # latest hybrid family
-    "qwen3-coder-30b",  # coding
-    "gemma4",  # gemma family rep (12B QAT 4-bit)
-    "llama3-3b",  # tiny llama
-    "mistral-24b",  # mistral
-    "deepseek-r1-32b",  # reasoning
+    "qwen3.5-4b-4bit",  # default smoke / small
+    "qwen3.5-9b-4bit",  # mid-size general
+    "qwen3.6-27b-4bit",  # latest hybrid family
+    "qwen3-coder-30b-4bit",  # coding
+    "gemma-4-12b-qat-4bit",  # gemma family rep (12B QAT 4-bit)
+    "llama3-3b-4bit",  # tiny llama
+    "mistral-24b-4bit",  # mistral
+    "deepseek-r1-32b-4bit",  # reasoning
 )
diff --git a/vllm_mlx/model_auto_config.py b/vllm_mlx/model_auto_config.py
index 1ecdf057..267460c0 100644
--- a/vllm_mlx/model_auto_config.py
+++ b/vllm_mlx/model_auto_config.py
@@ -317,7 +317,7 @@ def detect_model_config(model_path: str) -> ModelConfig | None:
 
     Two-stage lookup:
     1. **Alias profile** (single source of truth) — if ``model_path`` is a
-       known alias name (``qwen3.5-4b``) or maps to one's HF path
+       known alias name (``qwen3.5-4b-4bit``) or maps to one's HF path
        (``mlx-community/Qwen3.5-4B-MLX-4bit``), return that profile's
        config directly. This guarantees per-alias granularity for any
        optimization that varies by size/quant within a family.
diff --git a/vllm_mlx/output_router_harmony.py b/vllm_mlx/output_router_harmony.py
index 2ab98413..ee19ed8b 100644
--- a/vllm_mlx/output_router_harmony.py
+++ b/vllm_mlx/output_router_harmony.py
@@ -73,7 +73,7 @@
 
 # HuggingFace hub cache snapshot dir pattern. Path components have the
 # form ``models--<owner>--<name>`` so ``/.../models--openai--gpt-oss-20b
-# /snapshots/<sha>/`` resolves to the identity ``openai/gpt-oss-20b``
+# /snapshots/<sha>/`` resolves to the identity ``openai/gpt-oss-20b``
 # (the basename is the snapshot SHA, which on its own gives no hint
 # that this is a gpt-oss tokenizer). Codex round-14 BLOCKING — the
 # previous basename-only check rejected this path shape and the gate
@@ -99,7 +99,7 @@
 #   * round-12: anchored basename still let arbitrary owners through
 #     (``some-user/gpt-oss-remapped``) → restrict to known owners.
 #   * round-13: pure remote-id-prefix matching rejected legitimate
-#     LOCAL paths (``/models/gpt-oss-20b``, ``~/.cache/.../gpt-oss-20b``)
+#     LOCAL paths (``/models/gpt-oss-20b-mxfp4-q8``, ``~/.cache/.../gpt-oss-20b-mxfp4-q8``)
 #     and made production fall back to the leaking legacy router.
 #   * round-14: HF cache snapshot dir ``models--openai--gpt-oss-20b
 #     /snapshots/<sha>`` has SHA basename → recognise the ``models--
@@ -177,7 +177,7 @@ def _is_known_harmony_identity(name_or_path: str) -> bool:
 # and corrupts content / tool-call arguments (codex round-1 BLOCKING).
 # Pick short strings that exercise common body-vocab regions: plain
 # English, JSON-shaped text, and the smoking-gun multi-token word
-# ``commentary`` from PR #514 (``comment``+``ary`` on gpt-oss-20b).
+# ``commentary`` from PR #514 (``comment``+``ary`` on gpt-oss-20b-mxfp4-q8).
 _BODY_VOCAB_PROBES = (
     "Hello world",
     'functions.get_weather {"a":1}',
diff --git a/vllm_mlx/reasoning/harmony_parser.py b/vllm_mlx/reasoning/harmony_parser.py
index 5f352c58..6fb69c5e 100644
--- a/vllm_mlx/reasoning/harmony_parser.py
+++ b/vllm_mlx/reasoning/harmony_parser.py
@@ -26,7 +26,7 @@
 )
 
 # Final channel content. Harmony spec uses ``<|return|>`` to terminate the
-# final channel, but gpt-oss-20b emits ``<|end|>`` in practice for a sizeable
+# final channel, but gpt-oss-20b-mxfp4-q8 emits ``<|end|>`` in practice for a sizeable
 # fraction of non-streaming responses (observed in v0.6.64 pr_validate runs:
 # anthropic_sdk 0/5, langchain 2/6, pydantic_ai 1/6 on
 # ``mlx-community/gpt-oss-20b-MXFP4-Q8`` — every non-streaming test landed
diff --git a/vllm_mlx/routes/anthropic.py b/vllm_mlx/routes/anthropic.py
index 0a565ab1..17dfc09f 100644
--- a/vllm_mlx/routes/anthropic.py
+++ b/vllm_mlx/routes/anthropic.py
@@ -530,7 +530,7 @@ async def _stream_anthropic_messages(
             # stripped at the token layer, so its state machine
             # never leaves the "Unknown channel, suppress" arm and
             # this loop emits no ``content_block_delta`` events. The
-            # symptom (v0.6.64 pr_validate on gpt-oss-20b: anthropic
+            # symptom (v0.6.64 pr_validate on gpt-oss-20b-mxfp4-q8: anthropic
             # stream test 4 returned 0 content chunks) is the
             # streaming counterpart of the non-streaming empty-
             # TextBlock bug fixed in
diff --git a/vllm_mlx/runtime/model_registry.py b/vllm_mlx/runtime/model_registry.py
index f91e37b0..28932e9a 100644
--- a/vllm_mlx/runtime/model_registry.py
+++ b/vllm_mlx/runtime/model_registry.py
@@ -7,11 +7,11 @@
 
 Usage:
     registry = ModelRegistry()
-    registry.add("qwen3.5-4b", engine, is_default=True)
-    registry.add("qwen3.5-27b", engine2)
+    registry.add("qwen3.5-4b-4bit", engine, is_default=True)
+    registry.add("qwen3.5-27b-4bit", engine2)
 
     # Request routing
-    engine = registry.get_engine("qwen3.5-27b")  # specific
+    engine = registry.get_engine("qwen3.5-27b-4bit")  # specific
     engine = registry.get_engine("default")       # default
     engine = registry.get_engine(None)             # default
 """
diff --git a/vllm_mlx/service/postprocessor.py b/vllm_mlx/service/postprocessor.py
index 0809cae9..02ffa2f0 100644
--- a/vllm_mlx/service/postprocessor.py
+++ b/vllm_mlx/service/postprocessor.py
@@ -702,7 +702,7 @@ def _process_channel_routed(
         # to the client but leave both accumulators empty — _build_usage
         # then sees ``reasoning_text=None`` and omits the field entirely,
         # creating stream/non-stream usage shape drift. Verified on
-        # gemma-4-26b + gpt-oss-20b during the v0.6.66 onboarding sweep.
+        # gemma-4-26b-4bit + gpt-oss-20b-mxfp4-q8 during the v0.6.66 onboarding sweep.
         if content:
             self.accumulated_text += content
         if reasoning:
@@ -982,7 +982,7 @@ def finalize(self) -> list[StreamEvent]:
         # Previously gated on ``has_pending_tool_call`` — but that gate
         # uses the SAME canonical-wrapper check as the streaming parser, so
         # by construction it can never catch what the streaming parser
-        # missed. The 2026-05-20 ≥20B onboarding sweep caught gemma-4-26b
+        # missed. The 2026-05-20 ≥20B onboarding sweep caught gemma-4-26b-4bit
         # producing structured tool_calls in non-stream mode that the
         # streaming parser dropped on the floor; the only difference between
         # the two modes was this gate. See knowledge/guided_generation_gaps_2026-05-20.md
diff --git a/vllm_mlx/share/cli.py b/vllm_mlx/share/cli.py
index 4f7431d8..06746c96 100644
--- a/vllm_mlx/share/cli.py
+++ b/vllm_mlx/share/cli.py
@@ -237,7 +237,7 @@ def _pick_port(preferred: int) -> int:
 def _resolve_served_model_name(port: int, api_key: str) -> str | None:
     """Read the model id rapid-mlx serve is exposing via /v1/models.
 
-    The CLI accepts a short alias (``qwen3.5-4b``) but the OpenAI
+    The CLI accepts a short alias (``qwen3.5-4b-4bit``) but the OpenAI
     endpoint only recognises the full HF model id
     (``mlx-community/Qwen3.5-4B-MLX-4bit``). Without this lookup the
     curl example we paste into the security banner fails on first
@@ -459,10 +459,10 @@ def share_command(args: argparse.Namespace) -> None:
     # alias resolution BEFORE dispatching to us — by the time we get
     # here ``args.model`` is the rewritten HF repo (e.g.
     # ``mlx-community/Qwen3.5-4B-MLX-4bit``) and the user-typed alias
-    # lives on ``args._original_alias`` (e.g. ``qwen3.5-4b``). The child
+    # lives on ``args._original_alias`` (e.g. ``qwen3.5-4b-4bit``). The child
     # ``serve`` subprocess re-runs alias resolution on whatever we pass
     # it. We want the child to land the same way ``rapid-mlx serve
-    # qwen3.5-4b`` does — including setting ``_model_alias`` on the
+    # qwen3.5-4b-4bit`` does — including setting ``_model_alias`` on the
     # server so the public ``/v1/models`` endpoint advertises (and
     # accepts) the short alias the user actually typed. So we forward
     # the original alias to the child when one is set; fall back to
@@ -767,7 +767,7 @@ def register(subparsers: argparse._SubParsersAction) -> None:
     )
     p.add_argument(
         "model",
-        help="Alias to serve (same names as `rapid-mlx serve`, e.g. qwen3.5-4b)",
+        help="Alias to serve (same names as `rapid-mlx serve`, e.g. qwen3.5-4b-4bit)",
     ).completer = alias_completer
     p.add_argument(
         "--port",
diff --git a/vllm_mlx/telemetry/redact.py b/vllm_mlx/telemetry/redact.py
index e91bba9b..38dadc72 100644
--- a/vllm_mlx/telemetry/redact.py
+++ b/vllm_mlx/telemetry/redact.py
@@ -103,7 +103,7 @@ def bucket_memory_gb(bytes_: int) -> int:
 def normalize_model_path(path: str) -> str:
     """Pass through ``org/name`` repo IDs; redact local paths to ``"<local>"``.
 
-    A local ``./qwen3.5-9b`` checkout that resolve_model() prefers over
+    A local ``./qwen3.5-9b-4bit`` checkout that resolve_model() prefers over
     the alias would otherwise leak the user's home-directory layout via
     the model name.
     """
@@ -128,7 +128,7 @@ def normalize_model_path(path: str) -> str:
         if _HF_REPO_RE.match(path):
             return path
         return "<local>"
-    # Bare alias names (``qwen3.5-9b``) are public + harmless.
+    # Bare alias names (``qwen3.5-9b-4bit``) are public + harmless.
     return path
 
 
diff --git a/vllm_mlx/tool_parsers/harmony_tool_parser.py b/vllm_mlx/tool_parsers/harmony_tool_parser.py
index d87f2fe4..bf4bc537 100644
--- a/vllm_mlx/tool_parsers/harmony_tool_parser.py
+++ b/vllm_mlx/tool_parsers/harmony_tool_parser.py
@@ -35,7 +35,7 @@ def _generate_tool_id() -> str:
 # Terminator: ``<|call|>`` is the in-output token, but the engine stops
 # generation when it emits it (``<|call|>`` is part of the harmony EOS
 # set), so the token is consumed and never appears in ``output_text``.
-# Empirically (gpt-oss-20b via /v1/chat/completions, 2026-05-22) the
+# Empirically (gpt-oss-20b-mxfp4-q8 via /v1/chat/completions, 2026-05-22) the
 # commentary block ends with the JSON args and no terminator. Accept
 # end-of-string OR the next channel marker as alternative terminators
 # so a complete-but-unterminated tool call still parses. Same regression

From 7a39e93a79b2f124656b899ac8278f43538055fe Mon Sep 17 00:00:00 2001
From: Raullen Chai <raullenchai@gmail.com>
Date: Tue, 9 Jun 2026 17:38:20 -0700
Subject: [PATCH 2/6] fix(aliases): codex round 1 + lint cleanup on PR #547
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex BLOCKING (round 1, pr_validate):
- ``tests/test_aliases_contract.py:455-457`` had duplicate
  ``"nemotron-30b-4bit"`` keys after the sweep collapsed both
  ``nemotron-30b`` and ``nemotron-nano`` to the same canonical name.
  Python silently overwrote the first key with the second, so the
  test no longer pinned both pre-rename cases. Collapsed to a single
  entry (the post-rename registry now contains only one nemotron
  alias).
- ``tests/test_model_profiles_ssot.py:90-91`` had the same
  duplicate-in-tuple-iteration pattern. Same fix.
- ``test_reverse_lookup_for_shared_hf_path_is_deterministic`` and
  ``test_reverse_lookup_handles_deepseek_v4_flash_duplicate`` were
  pinning the tie-break for two now-removed duplicate-hf_path pairs
  (``nemotron-30b`` / ``nemotron-nano`` and ``deepseek-v4-flash`` /
  ``deepseek-v4-flash-8bit``). After the rename, no pair of aliases
  shares an hf_path, so the tie-break is unreachable from the live
  registry. Removed both tests with a comment pointing at the
  remaining reverse-lookup mechanism test
  (``test_reverse_lookup_index_built_once_after_first_load``).
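
The duplicate-key collapse described above can be reproduced and
detected with a minimal sketch like the one below. It assumes the
test expectations are a plain dict literal; ``duplicate_dict_keys``
and ``SOURCE`` are hypothetical names for illustration, not code in
this repository.

```python
import ast


def duplicate_dict_keys(source: str) -> list[str]:
    """Return constant keys that repeat within any single dict literal."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # ``**spread`` entries appear as None; only constants are checked.
            if isinstance(key, ast.Constant):
                if key.value in seen:
                    duplicates.append(key.value)
                seen.add(key.value)
    return duplicates


# Hypothetical literal mirroring the shape of the pre-fix test expectations.
SOURCE = """{
    "nemotron-30b-4bit": "qwen3",
    "nemotron-30b-4bit": "qwen3",
    "kimi-k2.5-3bit": "qwen3",
}"""

# The literal evaluates without any error; the repeated key collapses
# into a single entry, so one of the intended cases is no longer pinned.
collapsed = ast.literal_eval(SOURCE)
assert len(collapsed) == 2

# Walking the AST surfaces the repeat that evaluation hides.
assert duplicate_dict_keys(SOURCE) == ["nemotron-30b-4bit"]
```

A check of this kind needs to inspect the source text, because the
evaluated dict no longer carries any trace of the repeated key.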

Lint (ruff check + ruff format):
- Auto-fixable F401 / F541 / I001 across the 4 PR-touched scripts
  (``bench_engine_parity.py``, ``bench_readme_refresh.py``,
  ``local_bench_vs_ollama.py``, ``mhi_eval.py``). These were
  pre-existing issues the sweep re-surfaced.
- Manual fixes inside ``scripts/mhi_eval.py``:
  * ``from tau_bench.types import EnvRunResult`` is an availability
    probe — annotated ``# noqa: F401``.
  * E741 single-letter ``l`` rebound to ``ch``.
- Ruff format applied to the 7 touched .py files.
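
For comparison, an availability probe can also be written without an
unused import, and therefore without the ``# noqa: F401``. This is
only a sketch of an alternative, not what this PR does:
``module_available`` is a hypothetical helper, and unlike the
``from ... import EnvRunResult`` probe it checks that the module can
be located, not that a specific name exists inside it.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Report whether ``name`` can be located without importing it.

    For a dotted name, find_spec imports the parent packages, so a
    missing parent raises ModuleNotFoundError (a subclass of ImportError).
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# A stdlib module is always locatable.
assert module_available("json")

# A missing top-level package, or a submodule of one, reports False
# instead of raising.
assert not module_available("surely_not_an_installed_package_xyz")
assert not module_available("surely_not_an_installed_package_xyz.types")
```

The import-based probe kept in ``scripts/mhi_eval.py`` is the stricter
of the two, since it also fails when the name itself is missing.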

Full-unit (e2e):
- ``test_weather_with_fallback`` + ``test_multi_step_tool_chain``
  failed once on the initial run; both passed on a local rerun
  against the same live qwen3.5-4b-4bit server. These are
  model-behaviour tests (which tool name the model picks for a
  given prompt) and are inherently flaky, so the failures are not
  caused by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/bench_engine_parity.py    | 132 ++++++++++++++++++++-----
 scripts/bench_readme_refresh.py   |   5 +-
 scripts/local_bench_vs_ollama.py  |   4 +-
 scripts/mhi_eval.py               | 156 ++++++++++++++++++++----------
 tests/test_aliases_contract.py    |   5 +-
 tests/test_cli_argcomplete.py     |   6 +-
 tests/test_cli_chat.py            |   8 +-
 tests/test_model_profiles_ssot.py |  47 ++-------
 8 files changed, 240 insertions(+), 123 deletions(-)

diff --git a/scripts/bench_engine_parity.py b/scripts/bench_engine_parity.py
index f2d6404e..747e82b4 100644
--- a/scripts/bench_engine_parity.py
+++ b/scripts/bench_engine_parity.py
@@ -91,7 +91,11 @@ def bench_ttft_cold(base_url: str, model: str) -> dict:
     ttft = (first_token_time - t0) if first_token_time else elapsed
     tps = total_tokens / (elapsed - ttft) if elapsed > ttft and total_tokens > 0 else 0
 
-    return {"ttft_ms": round(ttft * 1000, 1), "decode_tps": round(tps, 1), "tokens": total_tokens}
+    return {
+        "ttft_ms": round(ttft * 1000, 1),
+        "decode_tps": round(tps, 1),
+        "tokens": total_tokens,
+    }
 
 
 def bench_ttft_cached(base_url: str, model: str) -> dict:
@@ -135,13 +139,19 @@ def bench_ttft_cached(base_url: str, model: str) -> dict:
 
         elapsed = time.perf_counter() - t0
         ttft = (first_token_time - t0) if first_token_time else elapsed
-        tps = total_tokens / (elapsed - ttft) if elapsed > ttft and total_tokens > 0 else 0
+        tps = (
+            total_tokens / (elapsed - ttft)
+            if elapsed > ttft and total_tokens > 0
+            else 0
+        )
         results.append({"ttft_ms": round(ttft * 1000, 1), "decode_tps": round(tps, 1)})
 
     # First is cold, rest are cached
     return {
         "cold_ttft_ms": results[0]["ttft_ms"],
-        "cached_ttft_ms": round(sum(r["ttft_ms"] for r in results[1:]) / len(results[1:]), 1),
+        "cached_ttft_ms": round(
+            sum(r["ttft_ms"] for r in results[1:]) / len(results[1:]), 1
+        ),
         "avg_tps": round(sum(r["decode_tps"] for r in results) / len(results), 1),
     }
 
@@ -223,7 +233,12 @@ def bench_decode_long(base_url: str, model: str) -> dict:
         f"{base_url}/chat/completions",
         json={
             "model": model,
-            "messages": [{"role": "user", "content": "Write a detailed essay about the history of computing. Be thorough."}],
+            "messages": [
+                {
+                    "role": "user",
+                    "content": "Write a detailed essay about the history of computing. Be thorough.",
+                }
+            ],
             "max_tokens": 256,
             "stream": True,
         },
@@ -244,7 +259,11 @@ def bench_decode_long(base_url: str, model: str) -> dict:
     decode_time = elapsed - ttft
     tps = total_tokens / decode_time if decode_time > 0 and total_tokens > 0 else 0
 
-    return {"ttft_ms": round(ttft * 1000, 1), "decode_tps": round(tps, 1), "tokens": total_tokens}
+    return {
+        "ttft_ms": round(ttft * 1000, 1),
+        "decode_tps": round(tps, 1),
+        "tokens": total_tokens,
+    }
 
 
 def run_suite(name: str, base_url: str, model: str) -> dict:
@@ -258,11 +277,15 @@ def run_suite(name: str, base_url: str, model: str) -> dict:
 
     print("\n  [1/5] Cold TTFT...")
     results["cold"] = bench_ttft_cold(base_url, model)
-    print(f"        TTFT: {results['cold']['ttft_ms']}ms, {results['cold']['decode_tps']} tok/s")
+    print(
+        f"        TTFT: {results['cold']['ttft_ms']}ms, {results['cold']['decode_tps']} tok/s"
+    )
 
     print("  [2/5] Cached TTFT (3 turns, same system prompt)...")
     results["cached"] = bench_ttft_cached(base_url, model)
-    print(f"        Cold: {results['cached']['cold_ttft_ms']}ms, Cached: {results['cached']['cached_ttft_ms']}ms, {results['cached']['avg_tps']} tok/s")
+    print(
+        f"        Cold: {results['cached']['cold_ttft_ms']}ms, Cached: {results['cached']['cached_ttft_ms']}ms, {results['cached']['avg_tps']} tok/s"
+    )
 
     print("  [3/5] Multi-turn (4 turns)...")
     results["multi_turn"] = bench_multi_turn(base_url, model)
@@ -270,11 +293,15 @@ def run_suite(name: str, base_url: str, model: str) -> dict:
 
     print("  [4/5] Tool call (3 calls)...")
     results["tool_call"] = bench_tool_call(base_url, model)
-    print(f"        Avg: {results['tool_call']['avg_latency_ms']}ms, {results['tool_call']['success_rate']:.0%} success")
+    print(
+        f"        Avg: {results['tool_call']['avg_latency_ms']}ms, {results['tool_call']['success_rate']:.0%} success"
+    )
 
     print("  [5/5] Long decode (256 tokens)...")
     results["long_decode"] = bench_decode_long(base_url, model)
-    print(f"        {results['long_decode']['decode_tps']} tok/s, {results['long_decode']['tokens']} tokens")
+    print(
+        f"        {results['long_decode']['decode_tps']} tok/s, {results['long_decode']['tokens']} tokens"
+    )
 
     return results
 
@@ -288,9 +315,11 @@ def main():
             print(f"  {name} ({url}): OK")
         except Exception as e:
             print(f"  {name} ({url}): NOT AVAILABLE — {e}")
-            print(f"\nPlease start both servers:")
-            print(f"  Terminal 1: rapid-mlx serve qwen3.5-4b-4bit --port 8000")
-            print(f"  Terminal 2: rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching")
+            print("\nPlease start both servers:")
+            print("  Terminal 1: rapid-mlx serve qwen3.5-4b-4bit --port 8000")
+            print(
+                "  Terminal 2: rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching"
+            )
             sys.exit(1)
 
     model_simple = detect_model(SIMPLE_URL)
@@ -302,38 +331,90 @@ def main():
 
     # Compare
     print(f"\n{'=' * 60}")
-    print(f"  COMPARISON: BatchedEngine vs SimpleEngine")
+    print("  COMPARISON: BatchedEngine vs SimpleEngine")
     print(f"{'=' * 60}")
 
     comparisons = [
-        ("Cold TTFT", simple_results["cold"]["ttft_ms"], batched_results["cold"]["ttft_ms"], "ms", True),
-        ("Cold decode", simple_results["cold"]["decode_tps"], batched_results["cold"]["decode_tps"], "tok/s", False),
-        ("Cached TTFT", simple_results["cached"]["cached_ttft_ms"], batched_results["cached"]["cached_ttft_ms"], "ms", True),
-        ("Cached decode", simple_results["cached"]["avg_tps"], batched_results["cached"]["avg_tps"], "tok/s", False),
-        ("Multi-turn avg", simple_results["multi_turn"]["avg_turn_ms"], batched_results["multi_turn"]["avg_turn_ms"], "ms", True),
-        ("Tool call avg", simple_results["tool_call"]["avg_latency_ms"], batched_results["tool_call"]["avg_latency_ms"], "ms", True),
-        ("Long decode", simple_results["long_decode"]["decode_tps"], batched_results["long_decode"]["decode_tps"], "tok/s", False),
+        (
+            "Cold TTFT",
+            simple_results["cold"]["ttft_ms"],
+            batched_results["cold"]["ttft_ms"],
+            "ms",
+            True,
+        ),
+        (
+            "Cold decode",
+            simple_results["cold"]["decode_tps"],
+            batched_results["cold"]["decode_tps"],
+            "tok/s",
+            False,
+        ),
+        (
+            "Cached TTFT",
+            simple_results["cached"]["cached_ttft_ms"],
+            batched_results["cached"]["cached_ttft_ms"],
+            "ms",
+            True,
+        ),
+        (
+            "Cached decode",
+            simple_results["cached"]["avg_tps"],
+            batched_results["cached"]["avg_tps"],
+            "tok/s",
+            False,
+        ),
+        (
+            "Multi-turn avg",
+            simple_results["multi_turn"]["avg_turn_ms"],
+            batched_results["multi_turn"]["avg_turn_ms"],
+            "ms",
+            True,
+        ),
+        (
+            "Tool call avg",
+            simple_results["tool_call"]["avg_latency_ms"],
+            batched_results["tool_call"]["avg_latency_ms"],
+            "ms",
+            True,
+        ),
+        (
+            "Long decode",
+            simple_results["long_decode"]["decode_tps"],
+            batched_results["long_decode"]["decode_tps"],
+            "tok/s",
+            False,
+        ),
     ]
 
-    print(f"\n  {'Metric':<20s} {'Simple':>10s} {'Batched':>10s} {'Diff':>10s} {'Verdict':>10s}")
+    print(
+        f"\n  {'Metric':<20s} {'Simple':>10s} {'Batched':>10s} {'Diff':>10s} {'Verdict':>10s}"
+    )
     print(f"  {'─' * 62}")
 
     all_pass = True
     for name, simple_val, batched_val, unit, lower_is_better in comparisons:
         if lower_is_better:
-            diff_pct = ((batched_val - simple_val) / simple_val * 100) if simple_val > 0 else 0
+            diff_pct = (
+                ((batched_val - simple_val) / simple_val * 100) if simple_val > 0 else 0
+            )
             verdict = "OK" if diff_pct < 5 else "WARN" if diff_pct < 10 else "FAIL"
         else:
-            diff_pct = ((simple_val - batched_val) / simple_val * 100) if simple_val > 0 else 0
+            diff_pct = (
+                ((simple_val - batched_val) / simple_val * 100) if simple_val > 0 else 0
+            )
             verdict = "OK" if diff_pct < 5 else "WARN" if diff_pct < 10 else "FAIL"
 
         if verdict != "OK":
             all_pass = False
 
         sign = "+" if diff_pct > 0 else ""
-        print(f"  {name:<20s} {simple_val:>8.1f}{unit:>3s} {batched_val:>8.1f}{unit:>3s} {sign}{diff_pct:>+7.1f}% {verdict:>8s}")
+        print(
+            f"  {name:<20s} {simple_val:>8.1f}{unit:>3s} {batched_val:>8.1f}{unit:>3s} {sign}{diff_pct:>+7.1f}% {verdict:>8s}"
+        )
 
-    print(f"\n  Overall: {'PASS — BatchedEngine within 5%' if all_pass else 'REVIEW NEEDED'}")
+    print(
+        f"\n  Overall: {'PASS — BatchedEngine within 5%' if all_pass else 'REVIEW NEEDED'}"
+    )
 
     # Save results
     output = {
@@ -347,6 +428,7 @@ def main():
 
     out_path = "reports/engine_parity_benchmark.json"
     import os
+
     os.makedirs(os.path.dirname(out_path), exist_ok=True)
     with open(out_path, "w") as f:
         json.dump(output, f, indent=2)
diff --git a/scripts/bench_readme_refresh.py b/scripts/bench_readme_refresh.py
index 50a2fb32..a2fc1999 100644
--- a/scripts/bench_readme_refresh.py
+++ b/scripts/bench_readme_refresh.py
@@ -108,7 +108,10 @@ class ModelSpec:
         "Ollama Gemma 3 12B (Gemma 4 not yet on llama.cpp)",
     ),
     ModelSpec(
-        "gpt-oss-20b-mxfp4-q8", "mlx-community/gpt-oss-20b-MXFP4-Q8", "gpt-oss:20b", "Same arch"
+        "gpt-oss-20b-mxfp4-q8",
+        "mlx-community/gpt-oss-20b-MXFP4-Q8",
+        "gpt-oss:20b",
+        "Same arch",
     ),
     ModelSpec(
         "qwen3.6-35b-4bit",
diff --git a/scripts/local_bench_vs_ollama.py b/scripts/local_bench_vs_ollama.py
index e8dd23cf..a9a4c268 100644
--- a/scripts/local_bench_vs_ollama.py
+++ b/scripts/local_bench_vs_ollama.py
@@ -683,7 +683,9 @@ def summary_row(metric: str, ratio: float, desc: str) -> None:
 # ── Main ──────────────────────────────────────────────────────────────────────
 def main() -> int:
     parser = argparse.ArgumentParser(description="Benchmark Rapid-MLX vs Ollama")
-    parser.add_argument("--model", default="qwen3.5-4b-4bit", help="Rapid-MLX model name")
+    parser.add_argument(
+        "--model", default="qwen3.5-4b-4bit", help="Rapid-MLX model name"
+    )
     parser.add_argument(
         "--ollama-model",
         default=None,
diff --git a/scripts/mhi_eval.py b/scripts/mhi_eval.py
index 77f7843f..d49ca9ee 100644
--- a/scripts/mhi_eval.py
+++ b/scripts/mhi_eval.py
@@ -25,18 +25,17 @@
 import os
 import subprocess
 import sys
-import tempfile
-import textwrap
 import time
 from datetime import datetime
-from pathlib import Path
 
 # ---------------------------------------------------------------------------
 # OpenAI client helper
 # ---------------------------------------------------------------------------
 
+
 def get_client(base_url: str, api_key: str = "not-needed"):
     from openai import OpenAI
+
     return OpenAI(base_url=base_url, api_key=api_key)
 
 
@@ -51,14 +50,17 @@ def detect_model(client) -> str:
 
 TAU_TASK_IDS = [24, 10, 5, 17, 33, 14, 15, 20, 30, 4]
 
+
 def run_tau_bench(base_url: str, model: str, api_key: str = "not-needed") -> dict:
     """Run 10 curated TAU-bench retail tasks."""
     try:
-        from tau_bench.envs import get_env
         from tau_bench.agents.tool_calling_agent import ToolCallingAgent
-        from tau_bench.types import EnvRunResult
+        from tau_bench.envs import get_env
+        from tau_bench.types import EnvRunResult  # noqa: F401 — availability probe
     except ImportError:
-        return {"error": "tau-bench not installed. pip install tau-bench @ git+https://github.com/sierra-research/tau-bench.git"}
+        return {
+            "error": "tau-bench not installed. pip install tau-bench @ git+https://github.com/sierra-research/tau-bench.git"
+        }
 
     os.environ["OPENAI_API_KEY"] = api_key
     os.environ["OPENAI_API_BASE"] = base_url
@@ -102,12 +104,14 @@ def run_tau_bench(base_url: str, model: str, api_key: str = "not-needed") -> dic
 
         status = "PASS" if reward == 1.0 else "FAIL"
         print(f"  [TAU] Task {idx:3d}: {status} ({elapsed:.1f}s)")
-        results.append({
-            "task_id": idx,
-            "reward": reward,
-            "elapsed_s": round(elapsed, 1),
-            "error": error,
-        })
+        results.append(
+            {
+                "task_id": idx,
+                "reward": reward,
+                "elapsed_s": round(elapsed, 1),
+                "error": error,
+            }
+        )
 
     passed = sum(1 for r in results if r["reward"] == 1.0)
     score = passed / len(results)
@@ -125,8 +129,16 @@ def run_tau_bench(base_url: str, model: str, api_key: str = "not-needed") -> dic
 # ---------------------------------------------------------------------------
 
 HUMANEVAL_IDS = [
-    "HumanEval/0", "HumanEval/1", "HumanEval/2", "HumanEval/3", "HumanEval/4",
-    "HumanEval/5", "HumanEval/6", "HumanEval/7", "HumanEval/8", "HumanEval/9",
+    "HumanEval/0",
+    "HumanEval/1",
+    "HumanEval/2",
+    "HumanEval/3",
+    "HumanEval/4",
+    "HumanEval/5",
+    "HumanEval/6",
+    "HumanEval/7",
+    "HumanEval/8",
+    "HumanEval/9",
 ]
 
 
@@ -160,7 +172,14 @@ def run_humaneval(base_url: str, model: str, api_key: str = "not-needed") -> dic
             )
             completion = resp.choices[0].text or ""
         except Exception as e:
-            results.append({"task_id": task_id, "passed": False, "elapsed_s": round(time.time() - t0, 1), "error": str(e)})
+            results.append(
+                {
+                    "task_id": task_id,
+                    "passed": False,
+                    "elapsed_s": round(time.time() - t0, 1),
+                    "error": str(e),
+                }
+            )
             print(f"  [HumanEval] {task_id}: FAIL (API error)")
             continue
 
@@ -180,12 +199,14 @@ def run_humaneval(base_url: str, model: str, api_key: str = "not-needed") -> dic
 
         status = "PASS" if passed else "FAIL"
         print(f"  [HumanEval] {task_id}: {status} ({elapsed:.1f}s)")
-        results.append({
-            "task_id": task_id,
-            "passed": passed,
-            "elapsed_s": round(elapsed, 1),
-            "error": None,
-        })
+        results.append(
+            {
+                "task_id": task_id,
+                "passed": passed,
+                "elapsed_s": round(elapsed, 1),
+                "error": None,
+            }
+        )
 
     passed_count = sum(1 for r in results if r.get("passed"))
     score = passed_count / len(results)
@@ -264,7 +285,9 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict:
         # Use the pre-formatted 5-shot prompt from tinyMMLU
         formatted = item.get("input_formatted", "")
         if not formatted:
-            choices_text = "\n".join(f"{chr(65+i)}. {c}" for i, c in enumerate(choices))
+            choices_text = "\n".join(
+                f"{chr(65 + i)}. {c}" for i, c in enumerate(choices)
+            )
             formatted = f"{question}\n{choices_text}\nAnswer:"
 
         t0 = time.time()
@@ -278,7 +301,14 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict:
             )
             answer = resp.choices[0].text or ""
         except Exception as e:
-            results.append({"idx": idx, "correct": False, "elapsed_s": round(time.time() - t0, 1), "error": str(e)})
+            results.append(
+                {
+                    "idx": idx,
+                    "correct": False,
+                    "elapsed_s": round(time.time() - t0, 1),
+                    "error": str(e),
+                }
+            )
             print(f"  [MMLU] Q{idx} ({subject}): FAIL (API error)")
             continue
 
@@ -287,17 +317,21 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict:
         correct = predicted == correct_letter
         elapsed = time.time() - t0
 
-        status = "PASS" if correct else f"FAIL (got {predicted}, expected {correct_letter})"
+        status = (
+            "PASS" if correct else f"FAIL (got {predicted}, expected {correct_letter})"
+        )
         print(f"  [MMLU] Q{idx} ({subject}): {status} ({elapsed:.1f}s)")
-        results.append({
-            "idx": idx,
-            "subject": subject,
-            "correct": correct,
-            "predicted": predicted,
-            "expected": correct_letter,
-            "elapsed_s": round(elapsed, 1),
-            "error": None,
-        })
+        results.append(
+            {
+                "idx": idx,
+                "subject": subject,
+                "correct": correct,
+                "predicted": predicted,
+                "expected": correct_letter,
+                "elapsed_s": round(elapsed, 1),
+                "error": None,
+            }
+        )
 
     correct_count = sum(1 for r in results if r.get("correct"))
     score = correct_count / len(results)
@@ -313,22 +347,23 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict:
 def _extract_letter(text: str) -> str:
     """Extract A/B/C/D from model response."""
     import re
+
     text = text.strip()
     # Direct single letter
     if len(text) == 1 and text.upper() in "ABCD":
         return text.upper()
     # "The answer is B" / "Answer: B" / "correct answer is C"
-    m = re.search(r'(?:answer|option)\s*(?:is|:)\s*([A-Da-d])', text, re.IGNORECASE)
+    m = re.search(r"(?:answer|option)\s*(?:is|:)\s*([A-Da-d])", text, re.IGNORECASE)
     if m:
         return m.group(1).upper()
     # "B." or "B)" at start of line
-    m = re.search(r'^([A-Da-d])[.\):]', text, re.MULTILINE)
+    m = re.search(r"^([A-Da-d])[.\):]", text, re.MULTILINE)
     if m:
         return m.group(1).upper()
     # Last single letter A-D in the text (models often explain then conclude)
-    letters = re.findall(r'\b([A-Da-d])\b', text)
+    letters = re.findall(r"\b([A-Da-d])\b", text)
     # Filter to only A-D
-    valid = [l.upper() for l in letters if l.upper() in "ABCD"]
+    valid = [ch.upper() for ch in letters if ch.upper() in "ABCD"]
     if valid:
         return valid[-1]  # Take last mentioned letter
     return "?"
@@ -339,9 +374,9 @@ def _extract_letter(text: str) -> str:
 # ---------------------------------------------------------------------------
 
 WEIGHTS = {
-    "tau_bench": 0.50,   # Agent tool use — highest signal for model×harness
-    "humaneval": 0.30,   # Code generation
-    "tinyMMLU": 0.20,    # Knowledge baseline
+    "tau_bench": 0.50,  # Agent tool use — highest signal for model×harness
+    "humaneval": 0.30,  # Code generation
+    "tinyMMLU": 0.20,  # Knowledge baseline
 }
 
 
@@ -358,14 +393,28 @@ def compute_mhi(suite_results: dict) -> float:
 # Main
 # ---------------------------------------------------------------------------
 
+
 def main():
     parser = argparse.ArgumentParser(description="MHI Eval — Model-Harness Index")
-    parser.add_argument("--base-url", default="http://localhost:8000/v1", help="OpenAI-compatible API base URL")
-    parser.add_argument("--model", default=None, help="Model name (auto-detected if not set)")
+    parser.add_argument(
+        "--base-url",
+        default="http://localhost:8000/v1",
+        help="OpenAI-compatible API base URL",
+    )
+    parser.add_argument(
+        "--model", default=None, help="Model name (auto-detected if not set)"
+    )
     parser.add_argument("--api-key", default="not-needed", help="API key")
-    parser.add_argument("--suite", default="all", choices=["all", "tau", "humaneval", "mmlu"], help="Which suite to run")
+    parser.add_argument(
+        "--suite",
+        default="all",
+        choices=["all", "tau", "humaneval", "mmlu"],
+        help="Which suite to run",
+    )
     parser.add_argument("--output", default=None, help="Output JSON path")
-    parser.add_argument("--label", default=None, help="Label for this run (e.g. 'qwopus27b+hermes')")
+    parser.add_argument(
+        "--label", default=None, help="Label for this run (e.g. 'qwopus27b+hermes')"
+    )
     args = parser.parse_args()
 
     # Detect model
@@ -386,13 +435,13 @@ def main():
                     break
         label = name[:50]
 
-    print(f"\n{'='*60}")
-    print(f"  MHI Eval — Model-Harness Index")
+    print(f"\n{'=' * 60}")
+    print("  MHI Eval — Model-Harness Index")
     print(f"  Model: {model}")
     print(f"  Label: {label}")
     print(f"  Base URL: {args.base_url}")
     print(f"  Suite: {args.suite}")
-    print(f"{'='*60}\n")
+    print(f"{'=' * 60}\n")
 
     results = {}
     t_start = time.time()
@@ -421,7 +470,7 @@ def main():
     mhi_score = compute_mhi(results)
 
     # Summary
-    print(f"\n{'='*60}")
+    print(f"\n{'=' * 60}")
     print(f"  MHI Score: {mhi_score}")
     print(f"  Label: {label}")
     print(f"  Time: {total_time:.0f}s")
@@ -429,8 +478,10 @@ def main():
     for suite, weight in WEIGHTS.items():
         if suite in results and "score" in results[suite]:
             r = results[suite]
-            print(f"  {suite:12s}: {r['passed']}/{r['total']} ({r['score']:.0%}) × {weight:.0%} weight")
-    print(f"{'='*60}\n")
+            print(
+                f"  {suite:12s}: {r['passed']}/{r['total']} ({r['score']:.0%}) × {weight:.0%} weight"
+            )
+    print(f"{'=' * 60}\n")
 
     # Save results
     output = {
@@ -444,7 +495,10 @@ def main():
         "suites": results,
     }
 
-    out_path = args.output or f"reports/mhi/{label.replace('/', '_')}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
+    out_path = (
+        args.output
+        or f"reports/mhi/{label.replace('/', '_')}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
+    )
     os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
     with open(out_path, "w") as f:
         json.dump(output, f, indent=2)
diff --git a/tests/test_aliases_contract.py b/tests/test_aliases_contract.py
index c007dfe4..a5a2e159 100644
--- a/tests/test_aliases_contract.py
+++ b/tests/test_aliases_contract.py
@@ -453,7 +453,6 @@ def test_audit_batch_reasoning_parser_wirings() -> None:
     """
     profiles = list_profiles()
     expected = {
-        "nemotron-30b-4bit": "qwen3",
         "nemotron-30b-4bit": "qwen3",
         "kimi-k2.5-3bit": "qwen3",
         "hermes4-70b-4bit": "glm4",
@@ -574,7 +573,9 @@ def test_aliases_with_known_broken_hf_paths_stay_fixed() -> None:
     # gpt-oss-20b-mxfp4-q8 previously pointed at mlx-community/GPT-OSS-20B-4bit
     # which 404s; the canonical mlx-community release uses the
     # MXFP4-Q8 hybrid quantization.
-    assert profiles["gpt-oss-20b-mxfp4-q8"].hf_path != "mlx-community/GPT-OSS-20B-4bit", (
+    assert (
+        profiles["gpt-oss-20b-mxfp4-q8"].hf_path != "mlx-community/GPT-OSS-20B-4bit"
+    ), (
         "gpt-oss-20b-mxfp4-q8 must not regress to the 404 path; current canonical "
         "upload is mlx-community/gpt-oss-20b-MXFP4-Q8."
     )
diff --git a/tests/test_cli_argcomplete.py b/tests/test_cli_argcomplete.py
index b034c1cd..69a6413f 100644
--- a/tests/test_cli_argcomplete.py
+++ b/tests/test_cli_argcomplete.py
@@ -136,9 +136,9 @@ def test_alias_csv_completer_multiple_commas() -> None:
     carried through unchanged. Lock this in because rpartition vs
     partition is an easy-to-flip bug."""
     result = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-")
-    assert all(m.startswith("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") for m in result), (
-        "csv completer must preserve all prior csv tokens"
-    )
+    assert all(
+        m.startswith("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") for m in result
+    ), "csv completer must preserve all prior csv tokens"
 
 
 def test_aliases_path_resolves_to_real_file() -> None:
diff --git a/tests/test_cli_chat.py b/tests/test_cli_chat.py
index 3718449e..6aae57ec 100644
--- a/tests/test_cli_chat.py
+++ b/tests/test_cli_chat.py
@@ -1534,7 +1534,9 @@ def test_serve_accepts_no_think_as_alias_for_no_thinking():
     ``no_thinking=True`` destination as ``serve --no-thinking``."""
     captured: list = []
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-think"]),
+        patch.object(
+            sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-think"]
+        ),
         patch.object(cli, "serve_command", side_effect=captured.append),
     ):
         cli.main()
@@ -2247,7 +2249,9 @@ def test_serve_allow_abbrev_disabled_rejects_ambiguous_no_thi(capsys):
     """Same as the chat case — ``serve`` also got the hidden cross-alias
     and the same ambiguity must be reported, not silently resolved."""
     with (
-        patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-thi"]),
+        patch.object(
+            sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-thi"]
+        ),
         pytest.raises(SystemExit),
     ):
         cli.main()
diff --git a/tests/test_model_profiles_ssot.py b/tests/test_model_profiles_ssot.py
index 49169c92..65d22b86 100644
--- a/tests/test_model_profiles_ssot.py
+++ b/tests/test_model_profiles_ssot.py
@@ -88,7 +88,6 @@ def test_orphan_aliases_now_covered() -> None:
         "bonsai-8b-unpacked",
         "ministral-3b-4bit",
         "nemotron-30b-4bit",
-        "nemotron-30b-4bit",
     ):
         profile = resolve_profile(orphan)
         assert profile is not None, f"{orphan} regressed to orphan"
@@ -339,43 +338,15 @@ def test_per_alias_schema_allows_independent_overrides() -> None:
 
 
 # ---- Reverse-lookup behaviour with shared hf_paths -----------------------
-
-
-def test_reverse_lookup_for_shared_hf_path_is_deterministic() -> None:
-    """Two aliases (``nemotron-30b-4bit`` and ``nemotron-30b-4bit``) point at the
-    same MLX repo. Reverse lookup by HF path should return the
-    JSON-insertion-order-first alias's profile, deterministically.
-
-    The contract is "any profile valid for this path", but we lock in
-    the order so a future re-shuffle of aliases.json is forced to
-    explicitly update this test (which is the right place to think
-    about who's the canonical alias).
-    """
-    profiles = list_profiles()
-    nemotron_30b = profiles["nemotron-30b-4bit"]
-    nemotron_nano = profiles["nemotron-30b-4bit"]
-    assert nemotron_30b.hf_path == nemotron_nano.hf_path
-
-    # nemotron-30b-4bit appears first in aliases.json, so reverse lookup
-    # by the shared HF path returns nemotron-30b-4bit's profile object.
-    via_path = resolve_profile(nemotron_30b.hf_path)
-    assert via_path is not None
-    assert via_path is nemotron_30b
-
-
-def test_reverse_lookup_handles_deepseek_v4_flash_duplicate() -> None:
-    """``deepseek-v4-flash-8bit`` and ``deepseek-v4-flash-8bit`` share
-    ``mlx-community/DeepSeek-V4-Flash-8bit`` — same regression guard
-    pattern as the nemotron pair, different family."""
-    profiles = list_profiles()
-    flash = profiles["deepseek-v4-flash-8bit"]
-    flash_8bit = profiles["deepseek-v4-flash-8bit"]
-    assert flash.hf_path == flash_8bit.hf_path
-    via_path = resolve_profile(flash.hf_path)
-    assert via_path is not None
-    # Both profiles agree on capability flags (same model), so either
-    # would be correct semantically. Pin the JSON order winner.
-    assert via_path is flash
+#
+# The original two tests in this section pinned the duplicate-hf_path
+# tie-break for ``(nemotron-30b, nemotron-nano)`` and
+# ``(deepseek-v4-flash, deepseek-v4-flash-8bit)``. After the explicit-quant
+# alias rename, those codename aliases are gone (see the PR description for
+# ``feat/explicit-alias-naming``) and aliases.json no longer has any pair
+# pointing at the same hf_path, so the tie-break is unreachable from the
+# current registry. The reverse-lookup *mechanism* is still exercised by
+# ``test_reverse_lookup_index_built_once_after_first_load`` below.
 
 
 def test_reverse_lookup_index_built_once_after_first_load() -> None:

From bdb00818db3080c127c4b620fc17f04b51ce9fc8 Mon Sep 17 00:00:00 2001
From: Raullen Chai <raullenchai@gmail.com>
Date: Tue, 9 Jun 2026 18:01:57 -0700
Subject: [PATCH 3/6] fix(aliases): codex round 2 + sweep-collateral repair on
 PR #547
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex BLOCKING (round 2, pr_validate):
- ``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:702``
  — the sweep rewrote the canonical spoofing example
  ``evil-org/gpt-oss-20b`` into ``evil-org/gpt-oss-20b-mxfp4-q8``. The
  spoof shape (third-party org publishing under the same bare repo
  name OpenAI uses) is the exact case this matcher must reject; the
  alias-suffixed variant tests a strictly easier case. Restored the
  canonical form. (Adding the suffixed variant separately would be
  redundant — the matcher already covers the broader shape.)
- Same file line 716 — the sweep also rewrote ``openai/gpt-oss-20b``
  (OpenAI's real bare repo id on HuggingFace) into the rapid-mlx alias
  ``openai/gpt-oss-20b-mxfp4-q8``. The bare repo id is what the
  matcher actually sees from upstream tokenizers, so dropping the
  unsuffixed form would have left a gap. Restored the bare repo id
  and the unsuffixed variants of the local-path examples
  (``/models/gpt-oss-20b``, etc.) alongside the alias-suffixed cases
  the sweep wrote.
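
  The accept/reject shape described in the two items above can be
  sketched as follows. This is NOT the matcher under test: the
  ``TRUSTED_ORGS`` allowlist, the ``LOCAL_PREFIXES`` rule and the
  function name ``looks_like_gpt_oss`` are assumptions made for
  illustration, chosen only so that the names quoted in this message
  behave as the tests expect.

```python
# Hypothetical sketch: every name here is an assumption, not rapid-mlx code.
TRUSTED_ORGS = {"openai", "mlx-community", "unsloth"}  # assumed allowlist
LOCAL_PREFIXES = ("/", "~", "./", "../")  # assumed local-path markers


def looks_like_gpt_oss(name: str) -> bool:
    """Accept gpt-oss names from trusted orgs, bare names and local paths."""
    if name.startswith(LOCAL_PREFIXES):
        # Local path: only the final path component is inspected.
        basename = name.rstrip("/").rsplit("/", 1)[-1]
        return basename.lower().startswith("gpt-oss")
    org, sep, repo = name.partition("/")
    if not sep:
        # Bare name without an org prefix.
        return org.lower().startswith("gpt-oss")
    # ``org/repo`` form: the org must be trusted, which is what rejects
    # a third party reusing the bare repo name.
    return org in TRUSTED_ORGS and repo.lower().startswith("gpt-oss")


print(looks_like_gpt_oss("openai/gpt-oss-20b"))
print(looks_like_gpt_oss("evil-org/gpt-oss-20b"))
```

  Under these assumptions the spoofed repo is rejected only because of
  its org, so the unsuffixed form is the one that exercises the check.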

Codex NIT (round 2, pr_validate):
- ``harness/README.md:104`` previously said the ``full`` tier's two
  baselines are "both 8-bit". After the rename, ``qwen3.6-35b`` is
  the 4-bit variant. Reworded to call out each model's quant.
- ``scripts/rename_aliases.py`` — the ``dropped`` counter always
  printed 0 because dropped codename aliases store their redirect
  target as a non-None string in ``rename_map``. Reworked the three
  counters (renamed / dropped / kept) to compute from the input
  data's perspective so they always sum back to the input alias
  count and ``dropped`` is the real number of MANUAL ``drop=True``
  specs the script processed. Verified: against ``main``'s aliases.json
  the script prints ``48 renamed, 3 dropped, 23 kept`` (= 74).
- ``scripts/sweep_alias_refs.py`` — the comment promised a
  "hand-written pass below" for ``gemma4`` that did not exist (the
  sweep deliberately leaves every ``gemma4`` occurrence alone because
  the literal is also the parser ID). Reworked the comment to make
  the no-op intent and reason explicit so a future maintainer doesn't
  hunt for a missing implementation.
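
  The counter change can be illustrated with a small self-contained
  sketch. The ``data``, ``manual`` and ``rename_map`` values are toy
  stand-ins (assumptions for illustration, not the script's real
  tables), and ``count_aliases`` is a hypothetical wrapper; only the
  counting logic mirrors the diff.

```python
def count_aliases(data, manual, rename_map):
    """Return (renamed, dropped, kept), counted per input alias."""

    def is_drop(old):
        spec = manual.get(old)
        return isinstance(spec, dict) and bool(spec.get("drop"))

    dropped = sum(1 for old in data if is_drop(old))
    renamed = sum(
        1 for old in data if not is_drop(old) and rename_map[old] != old
    )
    kept = sum(
        1 for old in data if not is_drop(old) and rename_map[old] == old
    )
    return renamed, dropped, kept


# Toy inputs: one renamed alias, one already explicit, one dropped codename.
data = ["qwen3.5-4b", "qwen3.5-4b-8bit", "nemotron-nano"]
manual = {"nemotron-nano": {"drop": True}}
rename_map = {
    "qwen3.5-4b": "qwen3.5-4b-4bit",
    "qwen3.5-4b-8bit": "qwen3.5-4b-8bit",
    "nemotron-nano": "nemotron-30b-4bit",  # redirect target, never None
}

counts = count_aliases(data, manual, rename_map)
# The previous counter looked for ``None`` values and so found no drops.
old_dropped = sum(1 for new in rename_map.values() if new is None)
print(counts, old_dropped)
```

  Because each input alias falls into exactly one bucket, the three
  counts always sum to ``len(data)``.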

Tests: 4789 passed, 11 skipped, 7 xfailed. ``test_simple_exec`` and
``test_multi_step_tool_chain`` flaked again (the model's choice of
tool name varies run-to-run); rerunning against the same live server
passes both, as in round 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 harness/README.md                             |   8 +-
 scripts/rename_aliases.py                     |  17 ++-
 scripts/rename_map.json                       | 100 +++++++++---------
 scripts/sweep_alias_refs.py                   |   9 +-
 ...est_issue_513_harmony_streamable_parser.py |  20 +++-
 5 files changed, 90 insertions(+), 64 deletions(-)

diff --git a/harness/README.md b/harness/README.md
index 75ebf703..801209c1 100644
--- a/harness/README.md
+++ b/harness/README.md
@@ -100,10 +100,10 @@ Override the model with `--model qwen3.6-35b-4bit` (will need its own baseline).
 
 ### `full` (~2-3 hr, 3 models × 11 agent profiles)
 
-Loops the check tier across `qwen3.5-35b-8bit` and `qwen3.6-35b-4bit`
-(real-capacity Qwen lines — both 8-bit, both go through the Hermes
-parser path that most users hit). For each model, also runs all 11
-agent profiles' auto-generated test plans.
+Loops the check tier across `qwen3.5-35b-8bit` (8-bit) and
+`qwen3.6-35b-4bit` (4-bit) — real-capacity Qwen lines that both go
+through the Hermes parser path that most users hit. For each model,
+also runs all 11 agent profiles' auto-generated test plans.
 
 > Gemma 4 was previously in the default list for orthogonal coverage
 > but was dropped after PR #208 validation showed it fails multiple
diff --git a/scripts/rename_aliases.py b/scripts/rename_aliases.py
index a33fa8a5..ed71ec49 100644
--- a/scripts/rename_aliases.py
+++ b/scripts/rename_aliases.py
@@ -153,9 +153,20 @@ def main() -> None:
         json.dump(rename_map, fp, indent=2, sort_keys=True)
         fp.write("\n")
 
-    renamed = sum(1 for o, n in rename_map.items() if n and o != n)
-    dropped = sum(1 for n in rename_map.values() if n is None)
-    kept = sum(1 for o, n in rename_map.items() if n and o == n)
+    # Counters from the input data's perspective so the three lines add
+    # back up to the input alias count. Each input alias is exactly one
+    # of: dropped (MANUAL says ``drop=True``), renamed (name changed),
+    # or kept (name unchanged because the old key already carried the
+    # canonical quant suffix). Counting drops via ``rename_map[o] is None``
+    # would always print 0 because dropped codename aliases store their
+    # redirect target — non-None — in the rename map.
+    def _is_drop(old: str) -> bool:
+        spec = MANUAL.get(old)
+        return isinstance(spec, dict) and bool(spec.get("drop"))
+
+    dropped = sum(1 for old in data if _is_drop(old))
+    renamed = sum(1 for old in data if not _is_drop(old) and rename_map[old] != old)
+    kept = sum(1 for old in data if not _is_drop(old) and rename_map[old] == old)
     print(f"  renamed: {renamed}")
     print(f"  dropped: {dropped}")
     print(f"  kept (already explicit): {kept}")
diff --git a/scripts/rename_map.json b/scripts/rename_map.json
index 810983c0..8a9150b4 100644
--- a/scripts/rename_map.json
+++ b/scripts/rename_map.json
@@ -1,76 +1,74 @@
 {
-  "bonsai-1.7b": "bonsai-1.7b-unpacked",
-  "bonsai-4b": "bonsai-4b-unpacked",
-  "bonsai-8b": "bonsai-8b-unpacked",
-  "deepseek-r1-32b": "deepseek-r1-32b-4bit",
-  "deepseek-r1-8b": "deepseek-r1-8b-4bit",
-  "deepseek-v4-flash": "deepseek-v4-flash-8bit",
+  "bonsai-1.7b-unpacked": "bonsai-1.7b-unpacked",
+  "bonsai-4b-unpacked": "bonsai-4b-unpacked",
+  "bonsai-8b-unpacked": "bonsai-8b-unpacked",
+  "deepseek-r1-32b-4bit": "deepseek-r1-32b-4bit",
+  "deepseek-r1-8b-4bit": "deepseek-r1-8b-4bit",
   "deepseek-v4-flash-2bit": "deepseek-v4-flash-2bit",
   "deepseek-v4-flash-4bit": "deepseek-v4-flash-4bit",
   "deepseek-v4-flash-8bit": "deepseek-v4-flash-8bit",
-  "devstral-24b": "devstral-24b-4bit",
-  "devstral-v2-24b": "devstral-v2-24b-4bit",
-  "gemma-3n-e4b": "gemma-3n-e4b-4bit",
-  "gemma-4-12b": "gemma-4-12b-4bit",
+  "devstral-24b-4bit": "devstral-24b-4bit",
+  "devstral-v2-24b-4bit": "devstral-v2-24b-4bit",
+  "gemma-3n-e4b-4bit": "gemma-3n-e4b-4bit",
+  "gemma-4-12b-4bit": "gemma-4-12b-4bit",
   "gemma-4-12b-8bit": "gemma-4-12b-8bit",
-  "gemma-4-12b-qat": "gemma-4-12b-qat-4bit",
+  "gemma-4-12b-qat-4bit": "gemma-4-12b-qat-4bit",
   "gemma-4-12b-qat-8bit": "gemma-4-12b-qat-8bit",
-  "gemma-4-26b": "gemma-4-26b-4bit",
-  "gemma-4-26b-qat": "gemma-4-26b-qat-4bit",
-  "gemma-4-31b": "gemma-4-31b-4bit",
+  "gemma-4-26b-4bit": "gemma-4-26b-4bit",
+  "gemma-4-26b-qat-4bit": "gemma-4-26b-qat-4bit",
+  "gemma-4-31b-4bit": "gemma-4-31b-4bit",
   "gemma-4-31b-8bit": "gemma-4-31b-8bit",
-  "gemma-4-31b-qat": "gemma-4-31b-qat-4bit",
+  "gemma-4-31b-qat-4bit": "gemma-4-31b-qat-4bit",
   "gemma-4-31b-qat-8bit": "gemma-4-31b-qat-8bit",
-  "gemma3-12b": "gemma3-12b-4bit",
-  "gemma3-1b": "gemma3-1b-4bit",
-  "gemma3-27b": "gemma3-27b-4bit",
-  "gemma4": "gemma-4-12b-qat-4bit",
-  "glm4.5-air": "glm4.5-air-4bit",
-  "glm4.7-9b": "glm4.7-9b-4bit",
-  "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8",
-  "granite4-tiny": "granite4-tiny-4bit",
-  "hermes3-8b": "hermes3-8b-4bit",
-  "hermes4-70b": "hermes4-70b-4bit",
-  "kimi-48b": "kimi-48b-4bit",
-  "kimi-k2.5": "kimi-k2.5-3bit",
+  "gemma3-12b-4bit": "gemma3-12b-4bit",
+  "gemma3-1b-4bit": "gemma3-1b-4bit",
+  "gemma3-27b-4bit": "gemma3-27b-4bit",
+  "glm4.5-air-4bit": "glm4.5-air-4bit",
+  "glm4.7-9b-4bit": "glm4.7-9b-4bit",
+  "gpt-oss-20b-mxfp4-q8": "gpt-oss-20b-mxfp4-q8",
+  "granite4-tiny-4bit": "granite4-tiny-4bit",
+  "hermes3-8b-4bit": "hermes3-8b-4bit",
+  "hermes4-70b-4bit": "hermes4-70b-4bit",
+  "kimi-48b-4bit": "kimi-48b-4bit",
+  "kimi-k2.5-3bit": "kimi-k2.5-3bit",
   "llama-3.1-8b-8bit": "llama-3.1-8b-8bit",
-  "llama3-1b": "llama3-1b-4bit",
-  "llama3-3b": "llama3-3b-4bit",
-  "minimax-m2.5": "minimax-m2.5-4bit",
-  "minimax-m2.7": "minimax-m2.7-mxfp4",
-  "ministral-3b": "ministral-3b-4bit",
-  "mistral-24b": "mistral-24b-4bit",
-  "nemotron-30b": "nemotron-30b-4bit",
-  "nemotron-nano": "nemotron-30b-4bit",
-  "phi4-14b": "phi-4-14b-4bit",
+  "llama3-1b-4bit": "llama3-1b-4bit",
+  "llama3-3b-4bit": "llama3-3b-4bit",
+  "minimax-m2.5-4bit": "minimax-m2.5-4bit",
+  "minimax-m2.7-mxfp4": "minimax-m2.7-mxfp4",
+  "ministral-3b-4bit": "ministral-3b-4bit",
+  "mistral-24b-4bit": "mistral-24b-4bit",
+  "nemotron-30b-4bit": "nemotron-30b-4bit",
+  "phi-4-14b-4bit": "phi-4-14b-4bit",
+  "phi-4-mini-4bit": "phi-4-mini-4bit",
   "qwen3-0.6b-8bit": "qwen3-0.6b-8bit",
   "qwen3-4b-8bit": "qwen3-4b-8bit",
   "qwen3-8b-8bit": "qwen3-8b-8bit",
-  "qwen3-coder": "qwen3-coder-4bit",
-  "qwen3-coder-30b": "qwen3-coder-30b-4bit",
-  "qwen3-vl-30b": "qwen3-vl-30b-4bit",
-  "qwen3-vl-4b": "qwen3-vl-4b-4bit",
-  "qwen3-vl-8b": "qwen3-vl-8b-4bit",
-  "qwen3.5-122b": "qwen3.5-122b-mxfp4",
+  "qwen3-coder-30b-4bit": "qwen3-coder-30b-4bit",
+  "qwen3-coder-4bit": "qwen3-coder-4bit",
+  "qwen3-vl-30b-4bit": "qwen3-vl-30b-4bit",
+  "qwen3-vl-4b-4bit": "qwen3-vl-4b-4bit",
+  "qwen3-vl-8b-4bit": "qwen3-vl-8b-4bit",
   "qwen3.5-122b-8bit": "qwen3.5-122b-8bit",
-  "qwen3.5-27b": "qwen3.5-27b-4bit",
+  "qwen3.5-122b-mxfp4": "qwen3.5-122b-mxfp4",
+  "qwen3.5-27b-4bit": "qwen3.5-27b-4bit",
   "qwen3.5-27b-8bit": "qwen3.5-27b-8bit",
-  "qwen3.5-35b": "qwen3.5-35b-8bit",
   "qwen3.5-35b-4bit": "qwen3.5-35b-4bit",
-  "qwen3.5-4b": "qwen3.5-4b-4bit",
+  "qwen3.5-35b-8bit": "qwen3.5-35b-8bit",
+  "qwen3.5-4b-4bit": "qwen3.5-4b-4bit",
   "qwen3.5-4b-8bit": "qwen3.5-4b-8bit",
-  "qwen3.5-9b": "qwen3.5-9b-4bit",
+  "qwen3.5-9b-4bit": "qwen3.5-9b-4bit",
   "qwen3.5-9b-8bit": "qwen3.5-9b-8bit",
-  "qwen3.6-27b": "qwen3.6-27b-4bit",
+  "qwen3.6-27b-4bit": "qwen3.6-27b-4bit",
   "qwen3.6-27b-8bit": "qwen3.6-27b-8bit",
   "qwen3.6-27b-ud": "qwen3.6-27b-ud",
-  "qwen3.6-35b": "qwen3.6-35b-4bit",
+  "qwen3.6-35b-4bit": "qwen3.6-35b-4bit",
   "qwen3.6-35b-6bit": "qwen3.6-35b-6bit",
   "qwen3.6-35b-8bit": "qwen3.6-35b-8bit",
   "qwen3.6-35b-dwq": "qwen3.6-35b-dwq",
   "qwen3.6-35b-ud": "qwen3.6-35b-ud",
-  "qwopus-27b": "qwopus-27b-4bit",
+  "qwopus-27b-4bit": "qwopus-27b-4bit",
   "qwopus-27b-8bit": "qwopus-27b-8bit",
-  "qwopus-9b": "qwopus-9b-4bit",
-  "smollm3-3b": "smollm3-3b-4bit"
+  "qwopus-9b-4bit": "qwopus-9b-4bit",
+  "smollm3-3b-4bit": "smollm3-3b-4bit"
 }
diff --git a/scripts/sweep_alias_refs.py b/scripts/sweep_alias_refs.py
index 918f2ca7..bb16affe 100644
--- a/scripts/sweep_alias_refs.py
+++ b/scripts/sweep_alias_refs.py
@@ -163,8 +163,13 @@ def main() -> int:
     # elsewhere in the codebase. ``gemma4`` is the parser ID
     # (registered in ``gemma4_tool_parser.py``, referenced from
     # ``model_auto_config.py``, ``output_router.py``, etc.) — auto-
-    # rewriting it would corrupt the parser registry. These are handled
-    # by a hand-written pass below.
+    # rewriting it would corrupt the parser registry, so the sweep
+    # leaves every occurrence of ``gemma4`` untouched. The matching
+    # codename alias was removed from ``aliases.json`` by
+    # ``rename_aliases.py``; any remaining alias-context usage (e.g.
+    # ``rapid-mlx serve gemma4``) is a manual edit, NOT something this
+    # script will rewrite. Rerunning this script on a fresh checkout is
+    # therefore intentionally a no-op for ``gemma4``.
     HAND_HANDLED = {"gemma4"}
     rename_map = {o: n for o, n in rename_map.items() if o not in HAND_HANDLED}
 
diff --git a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py
index 0657adab..57758014 100644
--- a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py
+++ b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py
@@ -699,7 +699,13 @@ def encode(self, text, add_special_tokens=False):
         "my-not-gpt-oss-20b",
         "notgpt-oss-fake",
         "some-user/gpt-oss-remapped",
-        "evil-org/gpt-oss-20b-mxfp4-q8",
+        # ``evil-org/gpt-oss-20b`` is the canonical spoof case — a third-party
+        # org happens to use the same bare repo name as OpenAI's. A matcher
+        # that accepts any ``*/gpt-oss-20b`` would false-accept this. Kept
+        # verbatim (no alias suffix) so the spoof shape stays representative
+        # — adding the canonical alias suffix would only test a stricter
+        # variant that's already covered by the matcher.
+        "evil-org/gpt-oss-20b",
         "anonymous/gpt-oss",
     )
     for name in rejected_names:
@@ -713,15 +719,21 @@ class _Fake(_CompatTokenizerBase):
         )
 
     accepted_names = (
+        # OpenAI's canonical bare repo id — kept verbatim so the matcher
+        # is tested against the real upstream shape, not just the rapid-mlx
+        # alias form.
+        "openai/gpt-oss-20b",
+        # rapid-mlx alias post-rename — separately covered so an alias
+        # match doesn't shadow the bare repo match above.
         "openai/gpt-oss-20b-mxfp4-q8",
         "mlx-community/gpt-oss-20b-MXFP4-Q8",
         "unsloth/gpt-oss-20b-MLX-8bit",
         "gpt-oss-20b-mxfp4-q8",
         "gpt-oss",
-        "/models/gpt-oss-20b-mxfp4-q8",
-        "~/lmstudio-models/gpt-oss-20b-mxfp4-q8",
+        "/models/gpt-oss-20b",
+        "~/lmstudio-models/gpt-oss-20b",
         "./gpt-oss-20b-quantized",
-        "../models/gpt-oss-20b-mxfp4-q8",
+        "../models/gpt-oss-20b",
     )
     for name in accepted_names:
 

From 59ce089a5420d931d396cf9b0d92b96d87c1efd6 Mon Sep 17 00:00:00 2001
From: Raullen Chai <raullenchai@gmail.com>
Date: Tue, 9 Jun 2026 18:15:17 -0700
Subject: [PATCH 4/6] fix(aliases): regenerate rename_map.json from pre-rename
 aliases.json
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex round-3 NIT: the checked-in ``rename_map.json`` was the result
of an idempotent rerun against the already-renamed ``aliases.json``,
so every entry was an identity mapping (e.g. ``qwen3.5-4b-4bit`` →
``qwen3.5-4b-4bit``). A maintainer running ``scripts/sweep_alias_refs.py``
from a pre-rename checkout to verify the operation is reproducible
would see the sweep do nothing because no legacy name (``qwen3.5-4b``,
``gemma4``, ``nemotron-nano``, …) was in the map.

Regenerated from ``main``'s ``vllm_mlx/aliases.json`` so the map now
contains the real 74-entry legacy → canonical mapping, including the
three dropped codename redirects (``deepseek-v4-flash`` →
``deepseek-v4-flash-8bit``, ``gemma4`` → ``gemma-4-12b-qat-4bit``,
``nemotron-nano`` → ``nemotron-30b-4bit``).

The current ``aliases.json`` is untouched — only the auxiliary map
file used by the sweep tool changes.
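
A quick way to catch this class of mistake is to check that the map
still contains non-identity entries. The sketch below is an
illustration only: ``legacy_entries`` is a hypothetical helper, and the
two inline maps are tiny samples rather than the checked-in file.

```python
import json


def legacy_entries(rename_map):
    """Return only the entries whose old name differs from the new name."""
    return {old: new for old, new in rename_map.items() if old != new}


# Sample of a map produced by rerunning against already-renamed aliases.
identity_only = json.loads('{"qwen3.5-4b-4bit": "qwen3.5-4b-4bit"}')

# Sample of a map regenerated from the pre-rename aliases.
regenerated = json.loads(
    '{"qwen3.5-4b": "qwen3.5-4b-4bit", "qwen3.5-4b-8bit": "qwen3.5-4b-8bit"}'
)

print(legacy_entries(identity_only))
print(legacy_entries(regenerated))
```

An empty result means the sweep has nothing to rewrite, which is the
symptom described above.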

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 scripts/rename_map.json | 100 ++++++++++++++++++++--------------------
 1 file changed, 51 insertions(+), 49 deletions(-)

diff --git a/scripts/rename_map.json b/scripts/rename_map.json
index 8a9150b4..810983c0 100644
--- a/scripts/rename_map.json
+++ b/scripts/rename_map.json
@@ -1,74 +1,76 @@
 {
-  "bonsai-1.7b-unpacked": "bonsai-1.7b-unpacked",
-  "bonsai-4b-unpacked": "bonsai-4b-unpacked",
-  "bonsai-8b-unpacked": "bonsai-8b-unpacked",
-  "deepseek-r1-32b-4bit": "deepseek-r1-32b-4bit",
-  "deepseek-r1-8b-4bit": "deepseek-r1-8b-4bit",
+  "bonsai-1.7b": "bonsai-1.7b-unpacked",
+  "bonsai-4b": "bonsai-4b-unpacked",
+  "bonsai-8b": "bonsai-8b-unpacked",
+  "deepseek-r1-32b": "deepseek-r1-32b-4bit",
+  "deepseek-r1-8b": "deepseek-r1-8b-4bit",
+  "deepseek-v4-flash": "deepseek-v4-flash-8bit",
   "deepseek-v4-flash-2bit": "deepseek-v4-flash-2bit",
   "deepseek-v4-flash-4bit": "deepseek-v4-flash-4bit",
   "deepseek-v4-flash-8bit": "deepseek-v4-flash-8bit",
-  "devstral-24b-4bit": "devstral-24b-4bit",
-  "devstral-v2-24b-4bit": "devstral-v2-24b-4bit",
-  "gemma-3n-e4b-4bit": "gemma-3n-e4b-4bit",
-  "gemma-4-12b-4bit": "gemma-4-12b-4bit",
+  "devstral-24b": "devstral-24b-4bit",
+  "devstral-v2-24b": "devstral-v2-24b-4bit",
+  "gemma-3n-e4b": "gemma-3n-e4b-4bit",
+  "gemma-4-12b": "gemma-4-12b-4bit",
   "gemma-4-12b-8bit": "gemma-4-12b-8bit",
-  "gemma-4-12b-qat-4bit": "gemma-4-12b-qat-4bit",
+  "gemma-4-12b-qat": "gemma-4-12b-qat-4bit",
   "gemma-4-12b-qat-8bit": "gemma-4-12b-qat-8bit",
-  "gemma-4-26b-4bit": "gemma-4-26b-4bit",
-  "gemma-4-26b-qat-4bit": "gemma-4-26b-qat-4bit",
-  "gemma-4-31b-4bit": "gemma-4-31b-4bit",
+  "gemma-4-26b": "gemma-4-26b-4bit",
+  "gemma-4-26b-qat": "gemma-4-26b-qat-4bit",
+  "gemma-4-31b": "gemma-4-31b-4bit",
   "gemma-4-31b-8bit": "gemma-4-31b-8bit",
-  "gemma-4-31b-qat-4bit": "gemma-4-31b-qat-4bit",
+  "gemma-4-31b-qat": "gemma-4-31b-qat-4bit",
   "gemma-4-31b-qat-8bit": "gemma-4-31b-qat-8bit",
-  "gemma3-12b-4bit": "gemma3-12b-4bit",
-  "gemma3-1b-4bit": "gemma3-1b-4bit",
-  "gemma3-27b-4bit": "gemma3-27b-4bit",
-  "glm4.5-air-4bit": "glm4.5-air-4bit",
-  "glm4.7-9b-4bit": "glm4.7-9b-4bit",
-  "gpt-oss-20b-mxfp4-q8": "gpt-oss-20b-mxfp4-q8",
-  "granite4-tiny-4bit": "granite4-tiny-4bit",
-  "hermes3-8b-4bit": "hermes3-8b-4bit",
-  "hermes4-70b-4bit": "hermes4-70b-4bit",
-  "kimi-48b-4bit": "kimi-48b-4bit",
-  "kimi-k2.5-3bit": "kimi-k2.5-3bit",
+  "gemma3-12b": "gemma3-12b-4bit",
+  "gemma3-1b": "gemma3-1b-4bit",
+  "gemma3-27b": "gemma3-27b-4bit",
+  "gemma4": "gemma-4-12b-qat-4bit",
+  "glm4.5-air": "glm4.5-air-4bit",
+  "glm4.7-9b": "glm4.7-9b-4bit",
+  "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8",
+  "granite4-tiny": "granite4-tiny-4bit",
+  "hermes3-8b": "hermes3-8b-4bit",
+  "hermes4-70b": "hermes4-70b-4bit",
+  "kimi-48b": "kimi-48b-4bit",
+  "kimi-k2.5": "kimi-k2.5-3bit",
   "llama-3.1-8b-8bit": "llama-3.1-8b-8bit",
-  "llama3-1b-4bit": "llama3-1b-4bit",
-  "llama3-3b-4bit": "llama3-3b-4bit",
-  "minimax-m2.5-4bit": "minimax-m2.5-4bit",
-  "minimax-m2.7-mxfp4": "minimax-m2.7-mxfp4",
-  "ministral-3b-4bit": "ministral-3b-4bit",
-  "mistral-24b-4bit": "mistral-24b-4bit",
-  "nemotron-30b-4bit": "nemotron-30b-4bit",
-  "phi-4-14b-4bit": "phi-4-14b-4bit",
-  "phi-4-mini-4bit": "phi-4-mini-4bit",
+  "llama3-1b": "llama3-1b-4bit",
+  "llama3-3b": "llama3-3b-4bit",
+  "minimax-m2.5": "minimax-m2.5-4bit",
+  "minimax-m2.7": "minimax-m2.7-mxfp4",
+  "ministral-3b": "ministral-3b-4bit",
+  "mistral-24b": "mistral-24b-4bit",
+  "nemotron-30b": "nemotron-30b-4bit",
+  "nemotron-nano": "nemotron-30b-4bit",
+  "phi4-14b": "phi-4-14b-4bit",
   "qwen3-0.6b-8bit": "qwen3-0.6b-8bit",
   "qwen3-4b-8bit": "qwen3-4b-8bit",
   "qwen3-8b-8bit": "qwen3-8b-8bit",
-  "qwen3-coder-30b-4bit": "qwen3-coder-30b-4bit",
-  "qwen3-coder-4bit": "qwen3-coder-4bit",
-  "qwen3-vl-30b-4bit": "qwen3-vl-30b-4bit",
-  "qwen3-vl-4b-4bit": "qwen3-vl-4b-4bit",
-  "qwen3-vl-8b-4bit": "qwen3-vl-8b-4bit",
+  "qwen3-coder": "qwen3-coder-4bit",
+  "qwen3-coder-30b": "qwen3-coder-30b-4bit",
+  "qwen3-vl-30b": "qwen3-vl-30b-4bit",
+  "qwen3-vl-4b": "qwen3-vl-4b-4bit",
+  "qwen3-vl-8b": "qwen3-vl-8b-4bit",
+  "qwen3.5-122b": "qwen3.5-122b-mxfp4",
   "qwen3.5-122b-8bit": "qwen3.5-122b-8bit",
-  "qwen3.5-122b-mxfp4": "qwen3.5-122b-mxfp4",
-  "qwen3.5-27b-4bit": "qwen3.5-27b-4bit",
+  "qwen3.5-27b": "qwen3.5-27b-4bit",
   "qwen3.5-27b-8bit": "qwen3.5-27b-8bit",
+  "qwen3.5-35b": "qwen3.5-35b-8bit",
   "qwen3.5-35b-4bit": "qwen3.5-35b-4bit",
-  "qwen3.5-35b-8bit": "qwen3.5-35b-8bit",
-  "qwen3.5-4b-4bit": "qwen3.5-4b-4bit",
+  "qwen3.5-4b": "qwen3.5-4b-4bit",
   "qwen3.5-4b-8bit": "qwen3.5-4b-8bit",
-  "qwen3.5-9b-4bit": "qwen3.5-9b-4bit",
+  "qwen3.5-9b": "qwen3.5-9b-4bit",
   "qwen3.5-9b-8bit": "qwen3.5-9b-8bit",
-  "qwen3.6-27b-4bit": "qwen3.6-27b-4bit",
+  "qwen3.6-27b": "qwen3.6-27b-4bit",
   "qwen3.6-27b-8bit": "qwen3.6-27b-8bit",
   "qwen3.6-27b-ud": "qwen3.6-27b-ud",
-  "qwen3.6-35b-4bit": "qwen3.6-35b-4bit",
+  "qwen3.6-35b": "qwen3.6-35b-4bit",
   "qwen3.6-35b-6bit": "qwen3.6-35b-6bit",
   "qwen3.6-35b-8bit": "qwen3.6-35b-8bit",
   "qwen3.6-35b-dwq": "qwen3.6-35b-dwq",
   "qwen3.6-35b-ud": "qwen3.6-35b-ud",
-  "qwopus-27b-4bit": "qwopus-27b-4bit",
+  "qwopus-27b": "qwopus-27b-4bit",
   "qwopus-27b-8bit": "qwopus-27b-8bit",
-  "qwopus-9b-4bit": "qwopus-9b-4bit",
-  "smollm3-3b-4bit": "smollm3-3b-4bit"
+  "qwopus-9b": "qwopus-9b-4bit",
+  "smollm3-3b": "smollm3-3b-4bit"
 }

From 93c649591ca43cb0dc778c7caff7ac3e4eb15718 Mon Sep 17 00:00:00 2001
From: Raullen Chai <raullenchai@gmail.com>
Date: Tue, 9 Jun 2026 18:53:49 -0700
Subject: [PATCH 5/6] fix(tests/integrations): raise PydanticAI max_tokens for
 multi-turn / multi-tool

PydanticAI defaults max_tokens to ~1024. On verbose 4B-class models
(qwen3.5-4b-4bit) the multi-turn and sequential-tool-call test paths
spill past the cap and PydanticAI raises
``Model token limit (provider default) exceeded`` before any response
is generated.

That ceiling is a client-side default, not a rapid-mlx server contract,
so the SDK integration test should not be bound by it: pass
``model_settings={"max_tokens": 2048}`` on tests 5 and 6.

release-check-m3 G7 PydanticAI now 6/6 PASS on qwen3.5-4b-4bit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 tests/integrations/test_pydantic_ai_full.py | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tests/integrations/test_pydantic_ai_full.py b/tests/integrations/test_pydantic_ai_full.py
index 8abe3524..96d7a551 100644
--- a/tests/integrations/test_pydantic_ai_full.py
+++ b/tests/integrations/test_pydantic_ai_full.py
@@ -102,7 +102,10 @@ def get_weather(city: str) -> str:
 # === 5. Multi-turn conversation ===
 print("\n=== Test 5: Multi-turn ===")
 try:
-    agent = Agent(model)
+    # PydanticAI defaults max_tokens to ~1024; on verbose 4B-class models the
+    # second turn can spill past that. Raise the cap so the SDK contract test
+    # checks rapid-mlx behaviour, not PydanticAI's default ceiling.
+    agent = Agent(model, model_settings={"max_tokens": 2048})
     r1 = agent.run_sync("My name is Bob. Remember this.")
     r2 = agent.run_sync("What is my name?", message_history=r1.all_messages())
     assert "bob" in r2.output.lower(), r2.output
@@ -115,7 +118,10 @@ def get_weather(city: str) -> str:
 # === 6. Multiple tools, sequential ===
 print("\n=== Test 6: Multiple tools ===")
 try:
-    agent = Agent(model)
+    # Sequential tool calls accumulate output across two tool-call round trips;
+    # PydanticAI's default max_tokens ceiling kicks in before the final answer
+    # on small models. Same fix as test 5.
+    agent = Agent(model, model_settings={"max_tokens": 2048})
 
     @agent.tool_plain
     def add(a: int, b: int) -> int:

From 244761d8c8d375eb3028aed5de348bca005282cc Mon Sep 17 00:00:00 2001
From: Raullen Chai <raullenchai@gmail.com>
Date: Tue, 9 Jun 2026 18:53:59 -0700
Subject: [PATCH 6/6] chore: bump version to 0.7.0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Major release: every alias in ``vllm_mlx/aliases.json`` now carries
an explicit quantization suffix (``-4bit`` / ``-8bit`` / ``-mxfp4`` /
``-dwq`` / etc.); no implicit-quant short forms remain. The three
legacy codename aliases (``deepseek-v4-flash``, ``gemma4``,
``nemotron-nano``) were dropped; the ``phi4-14b`` schema bug (name
claimed 14B but hf_path pointed at phi-4-mini ~4B) was fixed by
renaming to ``phi-4-14b-4bit`` AND swapping hf_path to the real Phi-4
14B; ``phi-4-mini-4bit`` was added to preserve the small-model entry.

README now documents the 6-segment naming template
``<family>-<version>-<params>-<modality?>-<technique?>-<quant>`` and
the canonical quant-suffix table.
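
The "mandatory quant suffix" rule can be checked mechanically. The
sketch below is an illustration, not project code: ``QUANT_SUFFIXES``
is assembled from suffixes that appear in this series and may not match
the README's canonical table, and ``has_explicit_quant`` is a
hypothetical helper.

```python
# Assumed suffix list, collected from alias names seen in this series.
QUANT_SUFFIXES = (
    "-2bit",
    "-3bit",
    "-4bit",
    "-6bit",
    "-8bit",
    "-mxfp4",
    "-mxfp4-q8",
    "-dwq",
    "-ud",
    "-unpacked",
)


def has_explicit_quant(alias: str) -> bool:
    """Return True when the alias ends with one of the known quant suffixes."""
    # str.endswith accepts a tuple and matches if any suffix matches.
    return alias.endswith(QUANT_SUFFIXES)


for name in ("qwen3.5-4b-4bit", "qwen3.5-4b", "gpt-oss-20b-mxfp4-q8"):
    print(name, has_explicit_quant(name))
```

Note that ``qwen3.5-4b`` is rejected even though it ends in ``-4b``:
the parameter count is not a quant suffix.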

Total: 74 → 72 aliases. Old short names are not deprecated; they are
simply gone, per user direction ("there are not many users").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 pyproject.toml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pyproject.toml b/pyproject.toml
index 64e12c99..82e6a10d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "rapid-mlx"
-version = "0.6.83"
+version = "0.7.0"
 description = "Rapid-MLX — AI inference for Apple Silicon. Drop-in OpenAI API, 2-4x faster than Ollama."
 readme = "README.md"
 license = {text = "Apache-2.0"}