From 8c424a0cfed15cb1286d83845640aa8b76bdc5c7 Mon Sep 17 00:00:00 2001
From: Raullen Chai
Date: Tue, 9 Jun 2026 17:09:06 -0700
Subject: [PATCH 1/6] refactor(aliases): every alias now carries an explicit quantization suffix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

BREAKING: every legacy short alias has been renamed to its canonical
explicit form. ``rapid-mlx serve qwen3.5-4b`` no longer works; use
``rapid-mlx serve qwen3.5-4b-4bit``. ``rapid-mlx models`` lists the 72
new names.

Naming template (now documented in README "Naming convention"):

    <family>-<version>-<params>-<modality>-<technique>-<quant>

The quantization suffix is mandatory. It mirrors LM Studio's
``…-MLX-4bit`` / ``…-MLX-8bit`` HuggingFace convention so the bit width
is readable off the alias instead of hidden in ``hf_path``.

Aliases renamed:          51 (e.g. ``qwen3.5-4b`` → ``qwen3.5-4b-4bit``,
                          ``gemma-4-12b-qat`` → ``gemma-4-12b-qat-4bit``).
Aliases already explicit: 23 (e.g. ``qwen3.5-4b-8bit``,
                          ``deepseek-v4-flash-2bit``).
Aliases added:            1 (``phi-4-mini-4bit``, separating the 4B mini
                          from the real Phi-4 14B; see phi4-14b fix below).
Codename aliases dropped: 3

  * ``deepseek-v4-flash``: duplicate hf_path of
    ``deepseek-v4-flash-8bit``; references now resolve to that name.
  * ``gemma4``: duplicate hf_path of ``gemma-4-12b-qat-4bit``;
    references resolve to that name. (The ``gemma4`` *parser*
    identifier is untouched; that's the tool/reasoning parser family,
    not an alias.)
  * ``nemotron-nano``: duplicate hf_path of ``nemotron-30b-4bit``;
    references resolve to that name.

Schema bug fixed: ``phi4-14b`` was pointing at
``mlx-community/phi-4-mini-instruct-4bit`` (a ~4B model). It is now
``phi-4-14b-4bit`` pointing at ``mlx-community/phi-4-4bit`` (the real
14B). The old mini target moves to ``phi-4-mini-4bit``.

Non-bit-width quant suffixes formalised: ``-mxfp4``, ``-mxfp4-q8``,
``-dwq``, ``-ud``, ``-3bit``, ``-6bit``, ``-unpacked`` (Bonsai's
no-quantization tier). Picked from HF community / mlx-community /
LM Studio conventions.

Scope of the sweep (107 files):

  * vllm_mlx/aliases.json: the 51 renames + 3 drops + 1 new.
  * README.md: new "Naming convention" section + family lineup table
    rewritten with the explicit names.
  * 100 active source files: tests, docs, scripts, install.sh, Issue
    templates, harness/scorecard/latest.md.
  * tests/fixtures/generation_configs/*.json: file basenames renamed
    alongside.
  * harness/baselines/full-qwen3.5-35b.json → -8bit; .6-35b → -4bit.
  * vllm_mlx/cli.py: Alias column widened 22 → 24 to fit the longest
    new name (``deepseek-v4-flash-8bit``).

Deliberately NOT swept (historical snapshots whose ``model`` field
records the alias under which the benchmark was originally run;
rewriting them is rewriting history):

  * evals/results/*.json (83 files)
  * harness/runs/** (timestamped doctor harness runs)
  * reports/mhi/*.json (timestamped MHI reports)
  * reports/benchmarks/** (README-refresh bench snapshots)

Tools introduced for the rename (committed so the operation is
reproducible and future renames can reuse the machinery):

  * scripts/rename_aliases.py: generates the new aliases.json +
    scripts/rename_map.json from a small declarative rule set.
  * scripts/sweep_alias_refs.py: applies the rename map to every
    active source file, with the EXCLUDED prefixes above.

Tests:  4793 passed, 11 skipped, 7 xfailed (no regressions).
Lint:   ``ruff check`` clean.
Format: ``ruff format`` clean.

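Illustration only (not the code in scripts/rename_aliases.py or
scripts/sweep_alias_refs.py): a minimal sketch of the two ideas this
message describes, assuming a plain old-name → new-name mapping. The
helper names ``has_quant_suffix`` and ``apply_renames`` are hypothetical;
the suffix list and the two example renames are taken from the text above.

```python
import re

# Quant suffixes named in this commit message (assumed to be the full set).
QUANT_SUFFIXES = (
    "-2bit", "-3bit", "-4bit", "-6bit", "-8bit",
    "-mxfp4", "-mxfp4-q8", "-dwq", "-ud", "-unpacked",
)


def has_quant_suffix(alias: str) -> bool:
    """Return True if the alias ends with one of the known quant suffixes."""
    return alias.endswith(QUANT_SUFFIXES)


def apply_renames(text: str, rename_map: dict[str, str]) -> str:
    """Replace whole-alias occurrences of old names with their new names.

    The lookarounds stop a legacy name from matching inside a longer,
    already-explicit alias (``qwen3.5-4b`` inside ``qwen3.5-4b-8bit``).
    """
    if not rename_map:
        return text
    # Longest names first so the alternation prefers the most specific alias.
    names = sorted(rename_map, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(lambda m: rename_map[m.group(1)], text)


if __name__ == "__main__":
    renames = {
        "qwen3.5-4b": "qwen3.5-4b-4bit",
        "gemma-4-12b-qat": "gemma-4-12b-qat-4bit",
    }
    line = "rapid-mlx serve qwen3.5-4b  # or qwen3.5-4b-8bit"
    print(apply_renames(line, renames))
```

Because a renamed alias is always followed by ``-`` where the legacy name
used to end, running such a sweep twice should leave the text unchanged,
which is the property that makes a rename like this safe to re-run.
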
Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/ISSUE_TEMPLATE/benchmark_report.yml | 2 +- .github/ISSUE_TEMPLATE/bug_report.yml | 4 +- CONTRIBUTING.md | 4 +- README.md | 105 ++++++--- benchmark_all_prompt_lookup.py | 2 +- docs/benchmarks/README.md | 4 +- docs/benchmarks/image.md | 2 +- docs/benchmarks/llm.md | 8 +- docs/development/contributing.md | 4 +- docs/development/pr_merge_sop.md | 8 +- docs/development/releasing.md | 6 +- docs/getting-started/installation.md | 4 +- docs/getting-started/quickstart.md | 18 +- docs/guides/continuous-batching.md | 6 +- docs/guides/mcp-tools.md | 2 +- docs/guides/multimodal.md | 4 +- docs/guides/server.md | 4 +- docs/reference/cli.md | 34 +-- harness/README.md | 20 +- ....5-35b.json => full-qwen3.5-35b-8bit.json} | 2 +- ....6-35b.json => full-qwen3.6-35b-4bit.json} | 2 +- harness/scorecard/latest.md | 68 +++--- install.sh | 8 +- pyproject.toml | 2 +- scripts/bench_all_models.sh | 4 +- scripts/bench_engine_parity.py | 8 +- scripts/bench_readme_refresh.py | 16 +- scripts/bench_vs_ollama.py | 6 +- scripts/local_bench_vs_ollama.py | 6 +- scripts/mhi_batch.sh | 4 +- scripts/mhi_eval.py | 2 +- scripts/pr_validate/golden_models.yaml | 8 +- scripts/rename_aliases.py | 167 ++++++++++++++ scripts/rename_map.json | 76 +++++++ scripts/run_dogfood_mvp.sh | 4 +- scripts/sweep_alias_refs.py | 215 ++++++++++++++++++ ...vstral-24b.json => devstral-24b-4bit.json} | 0 ...mma-3n-e4b.json => gemma-3n-e4b-4bit.json} | 0 .../{gemma3-27b.json => gemma3-27b-4bit.json} | 0 .../{glm4.5-air.json => glm4.5-air-4bit.json} | 0 .../{glm4.7-9b.json => glm4.7-9b-4bit.json} | 0 .../test_issue_444_harmony_tool_call_leak.py | 6 +- ...t_issue_448_hermes_function_prefix_leak.py | 2 +- ...sue_455_harmony_commentary_tool_channel.py | 2 +- ...8_tool_choice_required_harmony_compound.py | 2 +- ...est_issue_513_harmony_streamable_parser.py | 16 +- tests/test_aliases_contract.py | 84 +++---- tests/test_anthropic_stop_sequences.py | 2 +- 
tests/test_api_validation_bundle.py | 18 +- tests/test_batched_engine_output_router.py | 2 +- tests/test_bench_vs_ollama.py | 34 +-- tests/test_chat_logprobs_channel_routing.py | 2 +- tests/test_chat_route_tool_tag_leak.py | 2 +- tests/test_chat_streaming_spec.py | 6 +- tests/test_cli_argcomplete.py | 26 +-- tests/test_cli_chat.py | 50 ++-- tests/test_cli_info.py | 8 +- tests/test_cli_models.py | 26 +-- tests/test_dflash_eligibility.py | 10 +- tests/test_dflash_integration.py | 8 +- tests/test_doctor_baseline.py | 2 +- tests/test_doctor_runner.py | 2 +- tests/test_engine_router_non_stream.py | 2 +- tests/test_finalize_harmony_raw_text.py | 2 +- tests/test_harmony_parsers.py | 12 +- tests/test_memory_capacity_check.py | 6 +- tests/test_model_aliases.py | 51 ++--- tests/test_model_profiles_ssot.py | 48 ++-- tests/test_postprocessor.py | 2 +- tests/test_prefix_boundary_path_parity.py | 4 +- tests/test_sampling_params_passthrough.py | 12 +- tests/test_share_cli.py | 50 ++-- tests/test_smoke_matrix.sh | 2 +- tests/test_stop_string_enforcement.py | 2 +- tests/test_suffix_bench_methodology.py | 2 +- tests/test_suffix_decoding_tier.py | 4 +- tests/test_telemetry_emit.py | 20 +- tests/test_telemetry_redact.py | 4 +- vllm_mlx/_completion.py | 2 +- vllm_mlx/_download_gate.py | 2 +- vllm_mlx/agents/__init__.py | 2 +- vllm_mlx/agents/profiles/aider.yaml | 8 +- vllm_mlx/agents/profiles/cline.yaml | 6 +- vllm_mlx/agents/profiles/codex.yaml | 6 +- vllm_mlx/agents/profiles/generic.yaml | 6 +- vllm_mlx/agents/profiles/goose.yaml | 6 +- vllm_mlx/agents/profiles/hermes.yaml | 8 +- vllm_mlx/agents/profiles/langchain.yaml | 6 +- vllm_mlx/agents/profiles/openclaude.yaml | 8 +- vllm_mlx/agents/profiles/opencode.yaml | 8 +- vllm_mlx/agents/profiles/openhands.yaml | 6 +- vllm_mlx/agents/profiles/pydanticai.yaml | 6 +- vllm_mlx/agents/profiles/smolagents.yaml | 6 +- vllm_mlx/aliases.json | 135 +++++------ vllm_mlx/api/utils.py | 2 +- vllm_mlx/cli.py | 54 ++--- vllm_mlx/doctor/__init__.py | 4 
+- vllm_mlx/doctor/baseline.py | 2 +- vllm_mlx/doctor/cli.py | 8 +- vllm_mlx/engine/batched.py | 8 +- vllm_mlx/model_aliases.py | 30 +-- vllm_mlx/model_auto_config.py | 2 +- vllm_mlx/output_router_harmony.py | 6 +- vllm_mlx/reasoning/harmony_parser.py | 2 +- vllm_mlx/routes/anthropic.py | 2 +- vllm_mlx/runtime/model_registry.py | 6 +- vllm_mlx/service/postprocessor.py | 4 +- vllm_mlx/share/cli.py | 8 +- vllm_mlx/telemetry/redact.py | 4 +- vllm_mlx/tool_parsers/harmony_tool_parser.py | 2 +- 110 files changed, 1098 insertions(+), 639 deletions(-) rename harness/baselines/{full-qwen3.5-35b.json => full-qwen3.5-35b-8bit.json} (95%) rename harness/baselines/{full-qwen3.6-35b.json => full-qwen3.6-35b-4bit.json} (95%) create mode 100644 scripts/rename_aliases.py create mode 100644 scripts/rename_map.json create mode 100644 scripts/sweep_alias_refs.py rename tests/fixtures/generation_configs/{devstral-24b.json => devstral-24b-4bit.json} (100%) rename tests/fixtures/generation_configs/{gemma-3n-e4b.json => gemma-3n-e4b-4bit.json} (100%) rename tests/fixtures/generation_configs/{gemma3-27b.json => gemma3-27b-4bit.json} (100%) rename tests/fixtures/generation_configs/{glm4.5-air.json => glm4.5-air-4bit.json} (100%) rename tests/fixtures/generation_configs/{glm4.7-9b.json => glm4.7-9b-4bit.json} (100%) diff --git a/.github/ISSUE_TEMPLATE/benchmark_report.yml b/.github/ISSUE_TEMPLATE/benchmark_report.yml index c10869e2..a4df4527 100644 --- a/.github/ISSUE_TEMPLATE/benchmark_report.yml +++ b/.github/ISSUE_TEMPLATE/benchmark_report.yml @@ -17,7 +17,7 @@ body: id: model attributes: label: Model - placeholder: "e.g., qwen3.5-9b or mlx-community/Qwen3.5-9B-4bit" + placeholder: "e.g., qwen3.5-9b-4bit or mlx-community/Qwen3.5-9B-4bit" validations: required: true - type: textarea diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml index e2963ccb..55f2377b 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.yml +++ 
b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -44,7 +44,7 @@ body: id: model attributes: label: Model - placeholder: "e.g. qwen3.5-4b or mlx-community/Qwen3.5-4B-MLX-4bit" + placeholder: "e.g. qwen3.5-4b-4bit or mlx-community/Qwen3.5-4B-MLX-4bit" validations: required: true - type: input @@ -52,7 +52,7 @@ body: attributes: label: Full serve command description: Every flag matters — `--tool-call-parser`, `--reasoning-parser`, `--enable-prefix-cache`, etc. - placeholder: "rapid-mlx serve qwen3.5-4b --enable-auto-tool-choice --tool-call-parser qwen3 ..." + placeholder: "rapid-mlx serve qwen3.5-4b-4bit --enable-auto-tool-choice --tool-call-parser qwen3 ..." validations: required: true - type: dropdown diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 84cb8b27..69994968 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -14,7 +14,7 @@ pip install -e . pip install pytest ruff # dev tools for testing and linting # Start a dev server -rapid-mlx serve qwen3.5-4b --port 8000 +rapid-mlx serve qwen3.5-4b-4bit --port 8000 ``` **Requirements:** Python 3.11+, macOS with Apple Silicon (M1/M2/M3/M4). @@ -132,7 +132,7 @@ The easiest contribution — no model download needed! } ``` -That's it. Find the MLX model on [HuggingFace mlx-community](https://huggingface.co/mlx-community) and add the mapping. Convention: `-` in lowercase (e.g., `qwen3.5-9b`, `gemma-4-26b`). +That's it. Find the MLX model on [HuggingFace mlx-community](https://huggingface.co/mlx-community) and add the mapping. Convention: `-` in lowercase (e.g., `qwen3.5-9b-4bit`, `gemma-4-26b-4bit`). ## How to Add Parser Auto-Detection diff --git a/README.md b/README.md index 633e50bc..4403ba4d 100644 --- a/README.md +++ b/README.md @@ -82,15 +82,15 @@ curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash ```bash rapid-mlx chat ``` -Defaults to `qwen3.5-4b`. First run downloads the model (~2.5 GB) — you'll see a progress bar. Drops you into a REPL when it's ready. 
Type `/help` for slash commands, `/exit` to quit. Pass `--think` to surface chain-of-thought. +Defaults to `qwen3.5-4b-4bit`. First run downloads the model (~2.5 GB) — you'll see a progress bar. Drops you into a REPL when it's ready. Type `/help` for slash commands, `/exit` to quit. Pass `--think` to surface chain-of-thought. **Step 2b — Or serve a model for use from other apps:** ```bash -rapid-mlx serve qwen3.5-4b +rapid-mlx serve qwen3.5-4b-4bit ``` Same model, same download — but this starts an OpenAI-compatible HTTP server instead of a REPL. Wait for `Ready: http://localhost:8000/v1`. -> Want vision? `pip install 'rapid-mlx[vision]'` then `rapid-mlx serve gemma-4-26b` (~14 GB). +> Want vision? `pip install 'rapid-mlx[vision]'` then `rapid-mlx serve gemma-4-26b-4bit` (~14 GB). **Step 3 — Hit the API** (from a second terminal tab): ```bash @@ -113,7 +113,7 @@ The default chat surface is our hosted Big-AGI fork (tool calling, personas, voi > **Want a Claude Code-like TUI?** Rapid-MLX is the *backend* — pair it with an open-source agent CLI like [OpenCode](https://github.com/sst/opencode) or [codex](https://github.com/openai/codex) for the full slash-commands / tool-use / multi-turn experience. Run `rapid-mlx agents opencode --setup` (or `codex --setup`) to wire it up automatically. -> **Tip:** Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b` (~5 GB). +> **Tip:** Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b-4bit` (~5 GB).
More install options @@ -221,7 +221,7 @@ Run `rapid-mlx agents` to see all supported agents and `python3 scripts/mhi_eval ``` OpenAI API Base: http://localhost:8000/v1 API Key: not-needed -Model name: default (or qwen3.5-9b — either works) +Model name: default (or qwen3.5-9b-4bit — either works) ``` Cursor's agent/composer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed. @@ -405,27 +405,64 @@ The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monit > **4bit vs 8bit:** 4bit models are compressed to use less memory (recommended for most users). 8bit models are higher quality but need more RAM. "mxfp4" is a high-quality 4bit format. +### Naming convention + +Every alias follows the same template so you can read off the model family, parameter count, training technique, and quantization at a glance: + +`-----` + +| Segment | Meaning | Examples | +|---|---|---| +| **family** | Model family | `gemma`, `qwen`, `llama`, `mistral`, `deepseek`, `phi` | +| **version** | Major version | `-4`, `3.5`, `3.6`, `-r1`, `-v4-flash` | +| **params** | Parameter count (MoE includes the active count) | `12b`, `27b`, `35b-a3b` (35B total / 3B active) | +| **modality** *(optional)* | Non-text variants | `-vl` (vision), `-coder` (code) | +| **technique** *(optional)* | Training-time modifier | `-qat` (Quantization-Aware Training), `-distill`, `-thinking` | +| **quant** *(mandatory)* | Quantization tier (see below) | `-4bit`, `-8bit`, `-mxfp4`, `-qat-8bit`, … | + +The **quantization suffix is mandatory on every alias** — `qwen3.5-4b-4bit` not `qwen3.5-4b`, `gemma-4-12b-qat-8bit` not `gemma-4-12b-qat`. This mirrors LM Studio's `…-MLX-4bit` / `…-MLX-8bit` HuggingFace convention so you never have to guess the bit width. 
+ +| Suffix | Meaning | +|---|---| +| `-4bit` | Standard MLX 4-bit (most common) | +| `-8bit` | Standard MLX 8-bit (higher quality, ~2× RAM) | +| `-2bit`, `-3bit`, `-6bit` | Other bit widths | +| `-mxfp4` | Microscaling FP4 (high-quality 4-bit) | +| `-mxfp4-q8` | MXFP4 weights + Q8 head (GPT-OSS style) | +| `-dwq` | Dynamic Weight Quantization (mlx-community) | +| `-ud` | Unsloth Dynamic (mixed-precision per-layer) | +| `-unpacked` | Original FP16 / BF16 weights, no quantization | + +`-qat` is a *technique* suffix, not a quant — it stacks before the quant. So a QAT-trained Gemma 4 12B in 4-bit is `gemma-4-12b-qat-4bit`, and the 8-bit variant is `gemma-4-12b-qat-8bit`. + +Decoded examples: + +- `gemma-4-12b-qat-4bit` = Gemma 4 · 12B params · QAT-trained · 4-bit quant +- `qwen3.5-35b-8bit` = Qwen 3.5 · 35B params (3B active MoE) · 8-bit quant +- `gpt-oss-20b-mxfp4-q8` = GPT-OSS · 20B params · MXFP4 weights + Q8 head +- `bonsai-1.7b-unpacked` = Bonsai · 1.7B params · no quantization + ### Full model lineup -66 short aliases across 13 families ship today. Run `rapid-mlx models` for the live list with quant tier, MoE / hybrid flags, and DFlash eligibility. +72 explicit aliases across 13 families ship today. Run `rapid-mlx models` for the live list with parser, hybrid / MoE flags, and DFlash eligibility.
-Show all 66 aliases by family +Show all 72 aliases by family | Family | Aliases | Notable | |---|---|---| -| **Qwen3.5** | `qwen3.5-4b`, `-4b-8bit`, `-9b`, `-9b-8bit`, `-27b`, `-27b-8bit` ✨, `-35b`, `-35b-4bit`, `-122b`, `-122b-8bit` | DeltaNet hybrid; **27b-8bit DFlash-eligible** | -| **Qwen3.6** | `qwen3.6-27b`, `-27b-8bit` ✨, `-27b-ud`, `-35b`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud` | 262K ctx, 256 MoE experts; **27b-8bit DFlash-eligible** | -| **Qwen3** | `qwen3-0.6b-8bit`, `-4b-8bit`, `-8b-8bit`, `qwen3-coder`, `qwen3-coder-30b`, `qwen3-vl-4b`, `-8b`, `-30b` | Coding + vision | -| **Qwopus** | `qwopus-9b`, `qwopus-27b`, `qwopus-27b-8bit` | 92 MHI on tool calling | -| **DeepSeek** | `deepseek-r1-8b`, `-32b`, `deepseek-v4-flash` (2/4/8-bit) | R1 reasoning + V4 Flash 158B-A13B day-0 | -| **Gemma** | `gemma-3n-e4b`, `gemma-4-26b`, `-31b`, `-31b-8bit`, `gemma3-1b`, `-12b`, `-27b` | Vision-capable (gemma-4) | -| **Llama / Hermes** | `llama3-1b`, `-3b`, `llama-3.1-8b-8bit`, `hermes3-8b`, `hermes4-70b` | | -| **GLM** | `glm4.5-air`, `glm4.7-9b` | | -| **GPT-OSS** | `gpt-oss-20b` | Harmony native | -| **MiniMax / Kimi** | `minimax-m2.5`, `minimax-m2.7`, `kimi-48b`, `kimi-k2.5` | | -| **Mistral / Devstral** | `mistral-24b`, `devstral-24b`, `devstral-v2-24b`, `ministral-3b` | | -| **Other** | `phi4-14b`, `smollm3-3b`, `nemotron-30b` / `-nano`, `bonsai-1.7b/4b/8b`, `granite4-tiny` | | +| **Qwen3.5** | `qwen3.5-4b-4bit`, `-4b-8bit`, `-9b-4bit`, `-9b-8bit`, `-27b-4bit`, `-27b-8bit` ✨, `-35b-4bit`, `-35b-8bit`, `-122b-mxfp4`, `-122b-8bit` | DeltaNet hybrid; **27b-8bit DFlash-eligible** | +| **Qwen3.6** | `qwen3.6-27b-4bit`, `-27b-8bit` ✨, `-27b-ud`, `-35b-4bit`, `-35b-6bit`, `-35b-8bit`, `-35b-dwq`, `-35b-ud` | 262K ctx, 256 MoE experts; **27b-8bit DFlash-eligible** | +| **Qwen3** | `qwen3-0.6b-8bit`, `-4b-8bit`, `-8b-8bit`, `qwen3-coder-4bit`, `qwen3-coder-30b-4bit`, `qwen3-vl-4b-4bit`, `-8b-4bit`, `-30b-4bit` | Coding + vision | +| **Qwopus** | `qwopus-9b-4bit`, 
`qwopus-27b-4bit`, `qwopus-27b-8bit` | 92 MHI on tool calling | +| **DeepSeek** | `deepseek-r1-8b-4bit`, `-32b-4bit`, `deepseek-v4-flash-2bit`, `-4bit`, `-8bit` | R1 reasoning + V4 Flash 158B-A13B day-0 | +| **Gemma** | `gemma-3n-e4b-4bit`, `gemma-4-12b-4bit`, `-12b-qat-4bit`, `-12b-qat-8bit`, `-26b-4bit`, `-26b-qat-4bit`, `-31b-4bit`, `-31b-8bit`, `-31b-qat-4bit`, `-31b-qat-8bit`, `gemma3-1b-4bit`, `-12b-4bit`, `-27b-4bit` | Vision-capable; QAT variants | +| **Llama / Hermes** | `llama3-1b-4bit`, `-3b-4bit`, `llama-3.1-8b-8bit`, `hermes3-8b-4bit`, `hermes4-70b-4bit` | | +| **GLM** | `glm4.5-air-4bit`, `glm4.7-9b-4bit` | | +| **GPT-OSS** | `gpt-oss-20b-mxfp4-q8` | Harmony native | +| **MiniMax / Kimi** | `minimax-m2.5-4bit`, `minimax-m2.7-mxfp4`, `kimi-48b-4bit`, `kimi-k2.5-3bit` | | +| **Mistral / Devstral** | `mistral-24b-4bit`, `devstral-24b-4bit`, `devstral-v2-24b-4bit`, `ministral-3b-4bit` | | +| **Other** | `phi-4-14b-4bit`, `phi-4-mini-4bit`, `smollm3-3b-4bit`, `nemotron-30b-4bit`, `bonsai-1.7b-unpacked`, `-4b-unpacked`, `-8b-unpacked`, `granite4-tiny-4bit` | | ✨ = DFlash speculative decoding supported (opt in with `--enable-dflash`). `rapid-mlx info ` shows per-alias capabilities. @@ -433,38 +470,38 @@ The model has to fit in your Mac's RAM. If your Mac slows down or Activity Monit ### Copy-paste commands -Pick the one that matches your Mac. Short aliases work — run `rapid-mlx models` to see all available models. +Pick the one that matches your Mac. Run `rapid-mlx models` to see all available aliases. 
```bash # 16 GB — lightweight, fast -rapid-mlx serve qwen3.5-4b --port 8000 +rapid-mlx serve qwen3.5-4b-4bit --port 8000 # 24 GB — best small model -rapid-mlx serve qwen3.5-9b --port 8000 +rapid-mlx serve qwen3.5-9b-4bit --port 8000 # 32 GB — solid coding model -rapid-mlx serve qwen3.5-27b --port 8000 +rapid-mlx serve qwen3.5-27b-4bit --port 8000 # 32 GB — Gemma 4 12B (vision-capable, 64 tok/s) -rapid-mlx serve gemma-4-12b --port 8000 +rapid-mlx serve gemma-4-12b-4bit --port 8000 # 32 GB — GPT-OSS 20B (harmony-native, 100% tool calling, 119 tok/s) -rapid-mlx serve gpt-oss-20b --port 8000 +rapid-mlx serve gpt-oss-20b-mxfp4-q8 --port 8000 # 32+ GB — Qwen 3.6 35B-A3B (256 experts, 262K context, 93 tok/s) -rapid-mlx serve qwen3.6-35b --port 8000 +rapid-mlx serve qwen3.6-35b-4bit --port 8000 # 48+ GB — sweet spot (Qwen3.5-35B-A3B 8bit, 80 tok/s) -rapid-mlx serve qwen3.5-35b --prefill-step-size 8192 --port 8000 # faster first response +rapid-mlx serve qwen3.5-35b-8bit --prefill-step-size 8192 --port 8000 # faster first response # 96+ GB — frontier (Qwen3.5-122B mxfp4) -rapid-mlx serve qwen3.5-122b --prefill-step-size 8192 --port 8000 +rapid-mlx serve qwen3.5-122b-mxfp4 --prefill-step-size 8192 --port 8000 # Coding agent — fast MoE, great for Claude Code / Cursor -rapid-mlx serve qwen3-coder --prefill-step-size 8192 --port 8000 # MoE = only uses part of the model, so it's fast +rapid-mlx serve qwen3-coder-4bit --prefill-step-size 8192 --port 8000 # MoE = only uses part of the model, so it's fast # Vision — image understanding (see note below) -rapid-mlx serve qwen3-vl-4b --mllm --port 8000 +rapid-mlx serve qwen3-vl-4b-4bit --mllm --port 8000 ``` > **Vision deps:** Install into the same environment where rapid-mlx lives: @@ -530,7 +567,7 @@ Reproduce the throughput table: ```bash python3.12 scripts/bench_readme_refresh.py \ - --models qwen3.5-4b,qwen3.5-9b,qwen3.5-27b,gemma-4-12b,gpt-oss-20b,qwen3.6-35b,qwen3.5-35b \ + --models 
qwen3.5-4b-4bit,qwen3.5-9b-4bit,qwen3.5-27b-4bit,gemma-4-12b-4bit,gpt-oss-20b-mxfp4-q8,qwen3.6-35b-4bit,qwen3.5-35b-8bit \ --engines rapid-mlx,mlx-lm,ollama ``` @@ -800,7 +837,7 @@ Rapid-MLX **can** send anonymous usage data to help us prioritise the right mode ### What we collect (only if you opt in) - Subcommand names (`serve` / `chat` / `agents` / `bench` / `doctor`) -- Model alias names (`qwen3.5-9b`) or canonical HF repo IDs (`mlx-community/...`) — local paths are redacted to `` +- Model alias names (`qwen3.5-9b-4bit`) or canonical HF repo IDs (`mlx-community/...`) — local paths are redacted to `` - Bucketed counts: prompt/completion tokens, TTFT, tokens/sec — never exact values - Error categories + a hash fingerprint of the failure site (exception class name + per-frame `file:function:lineno` only — never the message text or absolute paths) - OS, arch, Apple chip name, RAM (rounded to GB), Python major.minor @@ -828,8 +865,8 @@ rapid-mlx telemetry reset # delete consent + client-id files (re-prompts on Either of these always wins, regardless of stored consent: ```bash -RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b -rapid-mlx --no-telemetry serve qwen3.5-9b +RAPID_MLX_TELEMETRY=0 rapid-mlx serve qwen3.5-9b-4bit +rapid-mlx --no-telemetry serve qwen3.5-9b-4bit ``` There is intentionally **no env-var equivalent for force-on** — opting in must be an explicit one-time `rapid-mlx telemetry enable`. CI agents will never silently contribute. 
diff --git a/benchmark_all_prompt_lookup.py b/benchmark_all_prompt_lookup.py index 597a8120..ea0ba06a 100644 --- a/benchmark_all_prompt_lookup.py +++ b/benchmark_all_prompt_lookup.py @@ -48,7 +48,7 @@ False, ), ( - "gpt-oss-20b", + "gpt-oss-20b-mxfp4-q8", "/Users/raullenstudio/.lmstudio/models/mlx-community/gpt-oss-20b-MXFP4-Q8", False, ), diff --git a/docs/benchmarks/README.md b/docs/benchmarks/README.md index 089bf163..13f88756 100644 --- a/docs/benchmarks/README.md +++ b/docs/benchmarks/README.md @@ -12,7 +12,7 @@ Performance benchmarks for rapid-mlx on Apple Silicon. ```bash # LLM benchmark — short aliases work -rapid-mlx bench qwen3.5-4b +rapid-mlx bench qwen3.5-4b-4bit # Or by full HF repo (vision/multimodal benches live in scripts/ — they are # dev-only and not shipped with `pip install rapid-mlx`) @@ -55,7 +55,7 @@ Results will vary on different Apple Silicon chips. If you have a different Apple Silicon chip, please share your results: ```bash -rapid-mlx bench qwen3.5-4b | tee results.txt +rapid-mlx bench qwen3.5-4b-4bit | tee results.txt ``` Open an issue with your results at [GitHub Issues](https://github.com/raullenchai/Rapid-MLX/issues). diff --git a/docs/benchmarks/image.md b/docs/benchmarks/image.md index 2ace9771..4c027f78 100644 --- a/docs/benchmarks/image.md +++ b/docs/benchmarks/image.md @@ -8,7 +8,7 @@ scripts under `scripts/` (not packaged with `pip install rapid-mlx`) — clone the repo if you want to reproduce them. 
```bash -rapid-mlx serve gemma-4-26b --mllm --port 8000 # then exercise the VLM via /v1/chat/completions +rapid-mlx serve gemma-4-26b-4bit --mllm --port 8000 # then exercise the VLM via /v1/chat/completions ``` ## Results - Qwen3-VL-8B-Instruct-4bit (M4 Max, 128GB) diff --git a/docs/benchmarks/llm.md b/docs/benchmarks/llm.md index b9d105a0..9c337a27 100644 --- a/docs/benchmarks/llm.md +++ b/docs/benchmarks/llm.md @@ -3,7 +3,7 @@ ## Running LLM Benchmarks ```bash -rapid-mlx bench qwen3.5-4b --num-prompts 5 --max-tokens 256 +rapid-mlx bench qwen3.5-4b-4bit --num-prompts 5 --max-tokens 256 ``` ## Results (M4 Max, 128GB) @@ -271,13 +271,13 @@ The streaming detokenizer is **not currently viable** for per-request usage due ```bash # Basic benchmark — short alias works -rapid-mlx bench qwen3.5-4b +rapid-mlx bench qwen3.5-4b-4bit # With more prompts -rapid-mlx bench qwen3.5-4b --num-prompts 10 +rapid-mlx bench qwen3.5-4b-4bit --num-prompts 10 # Save results -rapid-mlx bench qwen3.5-4b | tee results.txt +rapid-mlx bench qwen3.5-4b-4bit | tee results.txt # Continuous batching test python tests/test_continuous_batching.py diff --git a/docs/development/contributing.md b/docs/development/contributing.md index 1c096706..b3a91e10 100644 --- a/docs/development/contributing.md +++ b/docs/development/contributing.md @@ -73,7 +73,7 @@ ruff format --check . ```bash # LLM benchmark — short alias works -rapid-mlx bench qwen3.5-4b +rapid-mlx bench qwen3.5-4b-4bit # Or by full HF repo rapid-mlx bench mlx-community/Qwen3.5-9B-4bit @@ -109,7 +109,7 @@ See [Architecture](architecture.md) for details on the codebase structure. If you have access to different Apple Silicon chips (M1, M2, M3, M4), benchmark results are valuable: ```bash -rapid-mlx bench qwen3.5-4b | tee results_m4.txt +rapid-mlx bench qwen3.5-4b-4bit | tee results_m4.txt ``` ## Questions? 
diff --git a/docs/development/pr_merge_sop.md b/docs/development/pr_merge_sop.md index 516e76f7..65bdd803 100644 --- a/docs/development/pr_merge_sop.md +++ b/docs/development/pr_merge_sop.md @@ -158,7 +158,7 @@ Skip rule: - **Touch inference code** → run, even if it takes ~10 min: ```bash - # make check runs against the default model (qwen3.5-4b) — ~10 min + # make check runs against the default model (qwen3.5-4b-4bit) — ~10 min make check # make full runs across multiple models (~1-2 hr) — only when changes affect generation correctness make full @@ -166,7 +166,7 @@ Skip rule: python3 -m vllm_mlx.cli doctor check --model ``` -The bar is **0 regressions vs the per-model baseline in `harness/baselines/`** *for models that have committed baselines* (currently `qwen3.5-35b` and `qwen3.6-35b`). For models without baselines, document the chosen ad-hoc reference (e.g., "compared against output on commit X", "manual eyeball vs main"). Pre-existing fails (Test 10 streaming usage, `<|im_end|>` leak, thinking-toggle on qwen3.5-4b) are documented; new fails block merge. +The bar is **0 regressions vs the per-model baseline in `harness/baselines/`** *for models that have committed baselines* (currently `qwen3.5-35b-8bit` and `qwen3.6-35b-4bit`). For models without baselines, document the chosen ad-hoc reference (e.g., "compared against output on commit X", "manual eyeball vs main"). Pre-existing fails (Test 10 streaming usage, `<|im_end|>` leak, thinking-toggle on qwen3.5-4b-4bit) are documented; new fails block merge. 
## Step 9 — Anthropic-compat round-trip (gated on parser/router PRs) @@ -174,11 +174,11 @@ If the diff touches `vllm_mlx/parsers/`, `vllm_mlx/reasoning/`, `vllm_mlx/routes ```bash # in one shell: -rapid-mlx serve qwen3.5-4b +rapid-mlx serve qwen3.5-4b-4bit # in another: curl -s http://localhost:8000/anthropic/v1/messages \ -H 'content-type: application/json' \ - -d '{"model":"qwen3.5-4b","max_tokens":64,"messages":[{"role":"user","content":"say hi"}]}' + -d '{"model":"qwen3.5-4b-4bit","max_tokens":64,"messages":[{"role":"user","content":"say hi"}]}' ``` Output must be a non-empty Anthropic-shaped response, no `!!!!!!` token-id-0 corruption, no streaming-think misroute. The `/anthropic` surface shares router-level code with `/v1/chat/completions` but diverges at the streaming-think router; multiple historical regressions (#288, #289) shipped with green OpenAI-compat smoke and broken `/anthropic`. diff --git a/docs/development/releasing.md b/docs/development/releasing.md index a2eb9fd5..9a1ae5d4 100644 --- a/docs/development/releasing.md +++ b/docs/development/releasing.md @@ -2,7 +2,7 @@ This page documents the **end-to-end release flow** and the **safety nets** that catch the common failure modes. -The historical pain point: between v0.6.14 (2026-05-05) and v0.6.16, several PRs added 30+ new model aliases (`granite4-tiny`, `smollm3-3b`, `deepseek-v4-flash`, `qwen3.6-*`, etc), but no version was bumped — leaving brew/PyPI users with a stale `rapid-mlx models` list. The safety nets below are designed to make that exact failure impossible to repeat without explicit human override. +The historical pain point: between v0.6.14 (2026-05-05) and v0.6.16, several PRs added 30+ new model aliases (`granite4-tiny-4bit`, `smollm3-3b-4bit`, `deepseek-v4-flash-8bit`, `qwen3.6-*`, etc), but no version was bumped — leaving brew/PyPI users with a stale `rapid-mlx models` list. 
The safety nets below are designed to make that exact failure impossible to repeat without explicit human override. ## Quick reference @@ -170,8 +170,8 @@ This is the rule. No exceptions. CI doesn't fake-inference with a tiny model on ### M3 local — one command before pushing the bump commit ```bash -make release-check-m3 # uses MODEL=qwen3.5-4b (default) -MODEL=qwen3.6-27b make release-check-m3 # override +make release-check-m3 # uses MODEL=qwen3.5-4b-4bit (default) +MODEL=qwen3.6-27b-4bit make release-check-m3 # override ``` Wrapped by [`scripts/release_check_m3.sh`](../../scripts/release_check_m3.sh). It boots `rapid-mlx serve` once on port 8000, then runs G5 (stress) + G7 (anthropic + pydantic_ai + smolagents) + G6 (parallel-tool-call cap repro) + G9 (10-seq latency) + G8b (parser microbench, M3 perf baseline) sequentially. The server is killed on exit. diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md index 56ca5e39..2512f28d 100644 --- a/docs/getting-started/installation.md +++ b/docs/getting-started/installation.md @@ -58,7 +58,7 @@ rapid-mlx version rapid-mlx doctor # Smallest interactive smoke test (downloads ~2.5 GB on first run) -rapid-mlx chat qwen3.5-4b +rapid-mlx chat qwen3.5-4b-4bit ``` ## Troubleshooting @@ -81,7 +81,7 @@ huggingface-cli login Use a smaller quantized model: ```bash -rapid-mlx serve qwen3.5-4b +rapid-mlx serve qwen3.5-4b-4bit ``` ### `brew install` fails with `Operation not permitted` diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md index dae94199..d50b19a1 100644 --- a/docs/getting-started/quickstart.md +++ b/docs/getting-started/quickstart.md @@ -3,12 +3,12 @@ ## Option 1: Interactive Chat (fastest first taste) The shortest path to talking to a model — `chat` spawns its own server, -downloads the model on first run (~2.5 GB for the default `qwen3.5-4b`), and +downloads the model on first run (~2.5 GB for the default `qwen3.5-4b-4bit`), and drops you into a 
REPL. ```bash -rapid-mlx chat # defaults to qwen3.5-4b -rapid-mlx chat qwen3.5-9b # a larger model (5 GB) +rapid-mlx chat # defaults to qwen3.5-4b-4bit +rapid-mlx chat qwen3.5-9b-4bit # a larger model (5 GB) rapid-mlx chat --think # surface chain-of-thought reasoning ``` @@ -21,7 +21,7 @@ In-REPL: `/help`, `/reset`, `/save `, `/model `, `/exit`. Type Start the server: ```bash -rapid-mlx serve qwen3.5-4b --port 8000 +rapid-mlx serve qwen3.5-4b-4bit --port 8000 ``` Use with the OpenAI Python SDK: @@ -61,7 +61,7 @@ For image / video understanding, use a VLM (requires the `[vision]` extra — `pip install 'rapid-mlx[vision]'`): ```bash -rapid-mlx serve gemma-4-26b --mllm --port 8000 +rapid-mlx serve gemma-4-26b-4bit --mllm --port 8000 ``` ```python @@ -85,7 +85,7 @@ chain-of-thought into a separate `reasoning_content` field, leaving `content` clean. ```bash -rapid-mlx serve qwen3.5-9b --port 8000 # qwen3 reasoning parser auto-detected +rapid-mlx serve qwen3.5-9b-4bit --port 8000 # qwen3 reasoning parser auto-detected ``` ```python @@ -103,7 +103,7 @@ Generate text embeddings for semantic search and RAG (install the `[embeddings]` extra first): ```bash -rapid-mlx serve qwen3.5-4b --embedding-model mlx-community/multilingual-e5-small-mlx +rapid-mlx serve qwen3.5-4b-4bit --embedding-model mlx-community/multilingual-e5-small-mlx ``` ```python @@ -119,13 +119,13 @@ Tool/function calling is on by default for supported model families (Qwen3.x, GLM-4.7, GPT-OSS, Llama, Mistral, etc.) 
— the right parser is auto-detected: ```bash -rapid-mlx serve qwen3.5-9b --port 8000 +rapid-mlx serve qwen3.5-9b-4bit --port 8000 ``` If you need to pin the parser manually: ```bash -rapid-mlx serve devstral-24b \ +rapid-mlx serve devstral-24b-4bit \ --enable-auto-tool-choice --tool-call-parser hermes ``` diff --git a/docs/guides/continuous-batching.md b/docs/guides/continuous-batching.md index 3f517432..5ae6b87e 100644 --- a/docs/guides/continuous-batching.md +++ b/docs/guides/continuous-batching.md @@ -7,7 +7,7 @@ for back-compat but is a no-op. ## Default Behaviour ```bash -rapid-mlx serve qwen3.5-4b +rapid-mlx serve qwen3.5-4b-4bit ``` ## With Paged Cache @@ -15,7 +15,7 @@ rapid-mlx serve qwen3.5-4b For memory-efficient prefix sharing: ```bash -rapid-mlx serve qwen3.5-4b --use-paged-cache +rapid-mlx serve qwen3.5-4b-4bit --use-paged-cache ``` ## How It Works @@ -164,7 +164,7 @@ python tests/test_prefix_cache.py ## Production Setup ```bash -rapid-mlx serve qwen3.5-9b \ +rapid-mlx serve qwen3.5-9b-4bit \ --use-paged-cache \ --port 8000 ``` diff --git a/docs/guides/mcp-tools.md b/docs/guides/mcp-tools.md index 68f0f6c9..b906972e 100644 --- a/docs/guides/mcp-tools.md +++ b/docs/guides/mcp-tools.md @@ -51,7 +51,7 @@ Create `mcp.json`: ### 2. Start Server with MCP ```bash -rapid-mlx serve qwen3.5-4b --mcp-config mcp.json +rapid-mlx serve qwen3.5-4b-4bit --mcp-config mcp.json ``` ### 3. Verify MCP Status diff --git a/docs/guides/multimodal.md b/docs/guides/multimodal.md index abeb1cfc..a861db04 100644 --- a/docs/guides/multimodal.md +++ b/docs/guides/multimodal.md @@ -232,7 +232,7 @@ benches live in the dev-only `scripts/` directory (source checkout only). For a quick text-only sanity bench against a VLM, you can still run: ```bash -rapid-mlx bench qwen3-vl-4b +rapid-mlx bench qwen3-vl-4b-4bit ``` ## MLLM Cache @@ -307,7 +307,7 @@ For an interactive multimodal session, start a server and use any OpenAI- compatible web UI (Open WebUI, LibreChat, etc.) 
pointed at it: ```bash -rapid-mlx serve qwen3-vl-4b --mllm --port 8000 +rapid-mlx serve qwen3-vl-4b-4bit --mllm --port 8000 ``` The shipped `rapid-mlx chat` REPL is text-only. The optional Gradio web UI diff --git a/docs/guides/server.md b/docs/guides/server.md index 77765a12..12bd1e45 100644 --- a/docs/guides/server.md +++ b/docs/guides/server.md @@ -9,7 +9,7 @@ a no-op for back-compat. ### Default ```bash -rapid-mlx serve qwen3.5-4b --port 8000 +rapid-mlx serve qwen3.5-4b-4bit --port 8000 ``` Short aliases (see `rapid-mlx models`) work everywhere a model name is @@ -20,7 +20,7 @@ accepted. Full HuggingFace repo IDs (`mlx-community/...`) work too. Memory-efficient caching for production / shared system prompts: ```bash -rapid-mlx serve qwen3.5-9b --port 8000 --use-paged-cache +rapid-mlx serve qwen3.5-9b-4bit --port 8000 --use-paged-cache ``` ## Server Options diff --git a/docs/reference/cli.md b/docs/reference/cli.md index 47e07540..ca962b83 100644 --- a/docs/reference/cli.md +++ b/docs/reference/cli.md @@ -63,37 +63,37 @@ rapid-mlx serve [options] ```bash # Default — continuous batching is on by default; short aliases work -rapid-mlx serve qwen3.5-4b +rapid-mlx serve qwen3.5-4b-4bit # A larger general-purpose model (5 GB) -rapid-mlx serve qwen3.5-9b --port 8000 +rapid-mlx serve qwen3.5-9b-4bit --port 8000 # Paged KV cache (memory-efficient prefix sharing) -rapid-mlx serve qwen3.5-9b --use-paged-cache --port 8000 +rapid-mlx serve qwen3.5-9b-4bit --use-paged-cache --port 8000 # With MCP tools -rapid-mlx serve qwen3.5-9b --mcp-config mcp.json +rapid-mlx serve qwen3.5-9b-4bit --mcp-config mcp.json # Multimodal (vision) model — requires the [vision] extra -rapid-mlx serve gemma-4-26b --mllm +rapid-mlx serve gemma-4-26b-4bit --mllm # Reasoning model — parser is auto-detected, but you can pin it -rapid-mlx serve qwen3.5-9b --reasoning-parser qwen3 +rapid-mlx serve qwen3.5-9b-4bit --reasoning-parser qwen3 # DeepSeek reasoning model -rapid-mlx serve deepseek-r1-8b 
--reasoning-parser deepseek_r1 +rapid-mlx serve deepseek-r1-8b-4bit --reasoning-parser deepseek_r1 # Tool calling with Mistral/Devstral -rapid-mlx serve devstral-24b --enable-auto-tool-choice --tool-call-parser hermes +rapid-mlx serve devstral-24b-4bit --enable-auto-tool-choice --tool-call-parser hermes # DFlash speculative decoding (single-user, single supported alias) rapid-mlx serve qwen3.5-27b-8bit --enable-dflash --port 8000 # API key authentication -rapid-mlx serve qwen3.5-9b --api-key your-secret-key +rapid-mlx serve qwen3.5-9b-4bit --api-key your-secret-key # Production setup with security options -rapid-mlx serve qwen3.5-9b \ +rapid-mlx serve qwen3.5-9b-4bit \ --api-key your-secret-key \ --rate-limit 60 \ --timeout 120 @@ -137,7 +137,7 @@ rapid-mlx bench [options] | Option | Description | Default | |--------|-------------|---------| -| `` | Model alias (e.g. `qwen3.5-4b`) or HF repo (positional) | *(required)* | +| `` | Model alias (e.g. `qwen3.5-4b-4bit`) or HF repo (positional) | *(required)* | | `--num-prompts` | Number of prompts | 5 | | `--max-tokens` | Max tokens per prompt | 256 | | `--enable-prefix-cache` / `--disable-prefix-cache` | Toggle prefix caching | enabled | @@ -150,7 +150,7 @@ Run `rapid-mlx bench --help` for the full list (memory limits, batch sizes, etc. 
```bash # Quick LLM benchmark using a short alias -rapid-mlx bench qwen3.5-4b +rapid-mlx bench qwen3.5-4b-4bit # Bench a vision-language model by full HF repo rapid-mlx bench mlx-community/Qwen3-VL-8B-Instruct-4bit @@ -172,7 +172,7 @@ rapid-mlx chat [model] [options] | Option | Description | Default | |--------|-------------|---------| -| `model` | Model alias or HF repo (positional, optional) | `qwen3.5-4b` | +| `model` | Model alias or HF repo (positional, optional) | `qwen3.5-4b-4bit` | | `--system` | System prompt prepended to the conversation | *(none)* | | `--think` / `--no-think` | Enable / disable reasoning output in the REPL | off | | `--max-tokens` | Max tokens per assistant response | 2048 | @@ -189,18 +189,18 @@ rapid-mlx chat [model] [options] ### Examples ```bash -# Fastest path — defaults to qwen3.5-4b, spawns its own server +# Fastest path — defaults to qwen3.5-4b-4bit, spawns its own server rapid-mlx chat # A reasoning model with thinking surfaced -rapid-mlx chat qwen3.5-9b --think +rapid-mlx chat qwen3.5-9b-4bit --think # Attach to a server you're already running on :8000 -rapid-mlx serve qwen3.5-27b --port 8000 & +rapid-mlx serve qwen3.5-27b-4bit --port 8000 & rapid-mlx chat --port 8000 # Pin a system prompt -rapid-mlx chat qwen3.5-4b --system "You are a terse, friendly Mac shell tutor." +rapid-mlx chat qwen3.5-4b-4bit --system "You are a terse, friendly Mac shell tutor." 
``` In-REPL slash commands: `/help`, `/reset` (alias `/clear`), `/model `, diff --git a/harness/README.md b/harness/README.md index acbf04fb..75ebf703 100644 --- a/harness/README.md +++ b/harness/README.md @@ -4,7 +4,7 @@ A four-tier "code health checkup" for Rapid-MLX: ``` rapid-mlx doctor smoke # ~2 min, no model — pre-commit -rapid-mlx doctor check # ~15 min, qwen3.5-35b — pre-PR / big change +rapid-mlx doctor check # ~15 min, qwen3.5-35b-8bit — pre-PR / big change rapid-mlx doctor full # ~2-3 hr, 3 models — pre-release / refactor rapid-mlx doctor benchmark # overnight, all models — periodic / promo material ``` @@ -27,7 +27,7 @@ it needs `tests/`, `harness/`, and `pyproject.toml`): # Pre-commit — no model required make smoke # or: rapid-mlx doctor smoke -# Pre-PR — boots qwen3.5-35b, runs API + perf checks, diffs vs baseline. +# Pre-PR — boots qwen3.5-35b-8bit, runs API + perf checks, diffs vs baseline. # 35B 8-bit is the smallest model we trust to ~never err on the eval # suite, so failures cleanly attribute to rapid-mlx bugs. HF_HUB_CACHE=... make check # or: rapid-mlx doctor check @@ -77,9 +77,9 @@ Designed to be invoked from a pre-commit hook or `make` target. | `cli_sanity` | `rapid-mlx --help / models / agents` actually run | | `pytest` | Full unit suite (~45s, ~2070 tests) excluding `tests/integrations/` and `test_event_loop.py` | -### `check` (~15 min, qwen3.5-35b) +### `check` (~15 min, qwen3.5-35b-8bit) -Spins up a real server with `qwen3.5-35b` (Qwen3.5-35B-A3B-8bit — A3B +Spins up a real server with `qwen3.5-35b-8bit` (Qwen3.5-35B-A3B-8bit — A3B MoE so decode is fast despite the 35B param count), runs API + perf checks, diffs against `harness/baselines/check-qwen3.5-35b.json`. @@ -96,11 +96,11 @@ the old default and made bug triage ambiguous. 
| `autoresearch` | `scripts/autoresearch_bench.py --json` (13 perf metrics) | | `baseline_diff` | Compare metrics, flag regressions per `harness/thresholds.yaml` | -Override the model with `--model qwen3.6-35b` (will need its own baseline). +Override the model with `--model qwen3.6-35b-4bit` (will need its own baseline). ### `full` (~2-3 hr, 3 models × 11 agent profiles) -Loops the check tier across `qwen3.5-35b` and `qwen3.6-35b` +Loops the check tier across `qwen3.5-35b-8bit` and `qwen3.6-35b-4bit` (real-capacity Qwen lines — both 8-bit, both go through the Hermes parser path that most users hit). For each model, also runs all 11 agent profiles' auto-generated test plans. @@ -115,7 +115,7 @@ agent profiles' auto-generated test plans. Override the model list: ```bash -rapid-mlx doctor full --models qwen3.5-35b,qwen3.6-35b +rapid-mlx doctor full --models qwen3.5-35b-8bit,qwen3.6-35b-4bit ``` ### `benchmark` (overnight, all local models) @@ -128,7 +128,7 @@ scorecard markdown: HF_HUB_CACHE=... rapid-mlx doctor benchmark # Or be explicit (forces inclusion even if cache probe misses): -rapid-mlx doctor benchmark --models qwen3.5-35b,qwen3.6-35b +rapid-mlx doctor benchmark --models qwen3.5-35b-8bit,qwen3.6-35b-4bit ``` Output: @@ -165,7 +165,7 @@ Baseline file shape: { "captured_at": "2026-04-15T21:36:32", "rapid_mlx_version": "0.5.1", - "model": "qwen3.5-35b", + "model": "qwen3.5-35b-8bit", "metrics": { "decode_tps": 49.67, "cold_ttft_ms": 313.63, @@ -198,7 +198,7 @@ git diff harness/baselines/ # 3. 
If the change is justified, commit; otherwise revert + investigate git commit harness/baselines/check-qwen3.5-35b.json -m \ - "chore(doctor): bump qwen3.5-35b decode_tps baseline (mlx 0.31 SDPA gains)" + "chore(doctor): bump qwen3.5-35b-8bit decode_tps baseline (mlx 0.31 SDPA gains)" ``` ## Thresholds diff --git a/harness/baselines/full-qwen3.5-35b.json b/harness/baselines/full-qwen3.5-35b-8bit.json similarity index 95% rename from harness/baselines/full-qwen3.5-35b.json rename to harness/baselines/full-qwen3.5-35b-8bit.json index 124faf60..0bbab958 100644 --- a/harness/baselines/full-qwen3.5-35b.json +++ b/harness/baselines/full-qwen3.5-35b-8bit.json @@ -1,7 +1,7 @@ { "captured_at": "2026-05-05T06:16:05", "rapid_mlx_version": "0.6.10", - "model": "qwen3.5-35b", + "model": "qwen3.5-35b-8bit", "metrics": { "cold_ttft_ms": 123.91637498512864, "cold_tps": 69.78493344229089, diff --git a/harness/baselines/full-qwen3.6-35b.json b/harness/baselines/full-qwen3.6-35b-4bit.json similarity index 95% rename from harness/baselines/full-qwen3.6-35b.json rename to harness/baselines/full-qwen3.6-35b-4bit.json index 0c4f3e5b..743c3178 100644 --- a/harness/baselines/full-qwen3.6-35b.json +++ b/harness/baselines/full-qwen3.6-35b-4bit.json @@ -1,7 +1,7 @@ { "captured_at": "2026-05-05T06:21:25", "rapid_mlx_version": "0.6.10", - "model": "qwen3.6-35b", + "model": "qwen3.6-35b-4bit", "metrics": { "cold_ttft_ms": 167.38729202188551, "cold_tps": 89.2618750454324, diff --git a/harness/scorecard/latest.md b/harness/scorecard/latest.md index ef51f998..54793d67 100644 --- a/harness/scorecard/latest.md +++ b/harness/scorecard/latest.md @@ -4,42 +4,42 @@ _Generated: 2026-04-16T07:10:38_ | Model | Decode TPS | Cold TTFT | Cached TTFT | Tool % | Score | Status | | --- | ---: | ---: | ---: | ---: | ---: | --- | -| deepseek-r1-32b | 8.6 | 1111ms | 418ms | 0% | 51.8 | OK | -| llama3-3b | 34.9 | 258ms | 189ms | 0% | 130.1 | OK | -| qwen3-vl-8b | 12.2 | 456ms | 505ms | 100% | 59.3 | OK | -| 
qwen3.5-27b | — | — | — | — | — | FAIL — server boot failed: server exited with code 1 before becoming healthy | -| qwen3.5-35b | 10.9 | 1091ms | 1063ms | 0% | 26.5 | OK | -| qwen3.5-4b | 25.1 | 448ms | 460ms | 100% | 78.5 | OK | -| qwen3.5-9b | 20.7 | 539ms | 563ms | 100% | 68.1 | OK | -| qwopus-27b | 8.8 | 1165ms | 1145ms | 100% | 41.9 | OK | +| deepseek-r1-32b-4bit | 8.6 | 1111ms | 418ms | 0% | 51.8 | OK | +| llama3-3b-4bit | 34.9 | 258ms | 189ms | 0% | 130.1 | OK | +| qwen3-vl-8b-4bit | 12.2 | 456ms | 505ms | 100% | 59.3 | OK | +| qwen3.5-27b-4bit | — | — | — | — | — | FAIL — server boot failed: server exited with code 1 before becoming healthy | +| qwen3.5-35b-8bit | 10.9 | 1091ms | 1063ms | 0% | 26.5 | OK | +| qwen3.5-4b-4bit | 25.1 | 448ms | 460ms | 100% | 78.5 | OK | +| qwen3.5-9b-4bit | 20.7 | 539ms | 563ms | 100% | 68.1 | OK | +| qwopus-27b-4bit | 8.8 | 1165ms | 1145ms | 100% | 41.9 | OK | | qwopus-27b-8bit | — | — | — | — | — | FAIL — server boot failed: server exited with code 1 before becoming healthy | ## Skipped -- **deepseek-r1-8b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **devstral-24b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **devstral-v2-24b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gemma-3n-e4b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gemma-4-26b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gemma-4-31b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gemma3-12b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gemma3-1b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gemma3-27b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **glm4.5-air** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **glm4.7-9b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **gpt-oss-20b** — not found 
in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **hermes3-8b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **hermes4-70b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **kimi-48b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **kimi-k2.5** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **minimax-m2.5** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **ministral-3b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **mistral-24b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **phi4-14b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **qwen3-coder** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **qwen3-coder-30b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **qwen3-vl-30b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **qwen3-vl-4b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **qwen3.5-122b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **deepseek-r1-8b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **devstral-24b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **devstral-v2-24b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gemma-3n-e4b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gemma-4-26b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gemma-4-31b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gemma3-12b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gemma3-1b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gemma3-27b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **glm4.5-air-4bit** — not found in 
HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **glm4.7-9b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **gpt-oss-20b-mxfp4-q8** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **hermes3-8b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **hermes4-70b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **kimi-48b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **kimi-k2.5-3bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **minimax-m2.5-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **ministral-3b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **mistral-24b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **phi-4-14b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **qwen3-coder-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **qwen3-coder-30b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **qwen3-vl-30b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **qwen3-vl-4b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **qwen3.5-122b-mxfp4** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio - **qwen3.5-122b-8bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio -- **qwopus-9b** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio +- **qwopus-9b-4bit** — not found in HF_HUB_CACHE / ~/.cache/huggingface / ~/.lmstudio diff --git a/install.sh b/install.sh index 9a0a6f7c..dd43768c 100755 --- a/install.sh +++ b/install.sh @@ -85,10 +85,10 @@ fi # ── 2. 
Detect RAM → recommend model ────────────────────────────────────────── RAM_GB=$(sysctl -n hw.memsize 2>/dev/null | awk '{printf "%d", $1/1073741824}') -if [ "$RAM_GB" -ge 96 ]; then RECOMMENDED_MODEL="qwen3.5-122b"; RAM_TIER="96+ GB" -elif [ "$RAM_GB" -ge 48 ]; then RECOMMENDED_MODEL="qwen3.5-35b"; RAM_TIER="48-95 GB" -elif [ "$RAM_GB" -ge 24 ]; then RECOMMENDED_MODEL="qwen3.5-9b"; RAM_TIER="24-47 GB" -else RECOMMENDED_MODEL="qwen3.5-4b"; RAM_TIER="8-23 GB" +if [ "$RAM_GB" -ge 96 ]; then RECOMMENDED_MODEL="qwen3.5-122b-mxfp4"; RAM_TIER="96+ GB" +elif [ "$RAM_GB" -ge 48 ]; then RECOMMENDED_MODEL="qwen3.5-35b-8bit"; RAM_TIER="48-95 GB" +elif [ "$RAM_GB" -ge 24 ]; then RECOMMENDED_MODEL="qwen3.5-9b-4bit"; RAM_TIER="24-47 GB" +else RECOMMENDED_MODEL="qwen3.5-4b-4bit"; RAM_TIER="8-23 GB" fi dim "macOS $(sw_vers -productVersion) · Apple Silicon · ${RAM_GB} GB RAM" diff --git a/pyproject.toml b/pyproject.toml index 85c75ce7..64e12c99 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -80,7 +80,7 @@ dependencies = [ # Required for any model with vision input. Text-only models work without this. # 0.5.0+ also unlocks DFlash speculative decoding (see [dflash] extras). vision = [ - "mlx-vlm>=0.6.1", # 0.6.1: gemma4_unified architecture (gemma-4-12b/12b-8bit aliases require it). 0.5.0 baseline: gemma4 multi-image + tool-parser fixes, TurboQuant race-condition fix, continuous-batching guard. Also unlocks DFlash spec-decode hooks (see [dflash] extras). + "mlx-vlm>=0.6.1", # 0.6.1: gemma4_unified architecture (gemma-4-12b-4bit/12b-8bit aliases require it). 0.5.0 baseline: gemma4 multi-image + tool-parser fixes, TurboQuant race-condition fix, continuous-batching guard. Also unlocks DFlash spec-decode hooks (see [dflash] extras). 
"opencv-python>=4.8.0", "torch>=2.3.0", "torchvision>=0.18.0", diff --git a/scripts/bench_all_models.sh b/scripts/bench_all_models.sh index b357ca45..2dadfa27 100755 --- a/scripts/bench_all_models.sh +++ b/scripts/bench_all_models.sh @@ -16,8 +16,8 @@ mkdir -p "$RESULTS_DIR" MODELS=( "phi4-mini-14b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Phi-4-mini-reasoning-MLX-4bit|hermes|" "mistral-small-24b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX-4bit|hermes|" - "gemma3-12b|/Users/raullenstudio/.lmstudio/models/mlx-community/gemma-3-12b-it-qat-4bit|hermes|" - "gpt-oss-20b|/Users/raullenstudio/.lmstudio/models/mlx-community/gpt-oss-20b-MXFP4-Q8|seed_oss|" + "gemma3-12b-4bit|/Users/raullenstudio/.lmstudio/models/mlx-community/gemma-3-12b-it-qat-4bit|hermes|" + "gpt-oss-20b-mxfp4-q8|/Users/raullenstudio/.lmstudio/models/mlx-community/gpt-oss-20b-MXFP4-Q8|seed_oss|" "glm47-9b|/Users/raullenstudio/.lmstudio/models/mlx-community/GLM-4.7-4bit|glm47|" "qwen35-122b-a10b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Qwen3.5-122B-A10B-Text-mxfp4-mlx|hermes|qwen3" "qwen3-coder-next-80b|/Users/raullenstudio/.lmstudio/models/lmstudio-community/Qwen3-Coder-Next-MLX-6bit|hermes|" diff --git a/scripts/bench_engine_parity.py b/scripts/bench_engine_parity.py index 4258bf21..f2d6404e 100644 --- a/scripts/bench_engine_parity.py +++ b/scripts/bench_engine_parity.py @@ -8,10 +8,10 @@ Usage: # Start server with SimpleEngine (default) - rapid-mlx serve qwen3.5-4b --port 8000 + rapid-mlx serve qwen3.5-4b-4bit --port 8000 # Start server with BatchedEngine - rapid-mlx serve qwen3.5-4b --port 8001 --continuous-batching + rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching # Run benchmark python3 scripts/bench_engine_parity.py @@ -289,8 +289,8 @@ def main(): except Exception as e: print(f" {name} ({url}): NOT AVAILABLE — {e}") print(f"\nPlease start both servers:") - print(f" Terminal 1: rapid-mlx serve 
qwen3.5-4b --port 8000") - print(f" Terminal 2: rapid-mlx serve qwen3.5-4b --port 8001 --continuous-batching") + print(f" Terminal 1: rapid-mlx serve qwen3.5-4b-4bit --port 8000") + print(f" Terminal 2: rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching") sys.exit(1) model_simple = detect_model(SIMPLE_URL) diff --git a/scripts/bench_readme_refresh.py b/scripts/bench_readme_refresh.py index dcbf942d..50a2fb32 100644 --- a/scripts/bench_readme_refresh.py +++ b/scripts/bench_readme_refresh.py @@ -22,7 +22,7 @@ Usage: python3.12 scripts/bench_readme_refresh.py # full sweep - python3.12 scripts/bench_readme_refresh.py --models qwen3.5-4b # one model + python3.12 scripts/bench_readme_refresh.py --models qwen3.5-4b-4bit # one model python3.12 scripts/bench_readme_refresh.py --engines rapid-mlx,mlx-lm """ @@ -84,40 +84,40 @@ class ModelSpec: MODELS: list[ModelSpec] = [ ModelSpec( - "qwen3.5-4b", + "qwen3.5-4b-4bit", "mlx-community/Qwen3.5-4B-MLX-4bit", "qwen3:4b", "Ollama Qwen3 (not Qwen3.5; DeltaNet arch unavailable on llama.cpp)", ), ModelSpec( - "qwen3.5-9b", + "qwen3.5-9b-4bit", "mlx-community/Qwen3.5-9B-4bit", "qwen3:8b", "Ollama Qwen3 8B (not Qwen3.5 9B; closest available)", ), ModelSpec( - "qwen3.5-27b", + "qwen3.5-27b-4bit", "mlx-community/Qwen3.5-27B-4bit", "qwen3:32b", "Ollama Qwen3 32B Q4_K_M (closest dense 27-32B; Qwen3.5 DeltaNet not on llama.cpp; Unsloth Qwen3.6-27B GGUF fails to load in Ollama 0.24)", ), ModelSpec( - "gemma-4-12b", + "gemma-4-12b-4bit", "mlx-community/gemma-4-12B-it-4bit", "gemma3:12b", "Ollama Gemma 3 12B (Gemma 4 not yet on llama.cpp)", ), ModelSpec( - "gpt-oss-20b", "mlx-community/gpt-oss-20b-MXFP4-Q8", "gpt-oss:20b", "Same arch" + "gpt-oss-20b-mxfp4-q8", "mlx-community/gpt-oss-20b-MXFP4-Q8", "gpt-oss:20b", "Same arch" ), ModelSpec( - "qwen3.6-35b", + "qwen3.6-35b-4bit", "mlx-community/Qwen3.6-35B-A3B-4bit", "qwen3:30b-a3b", "Ollama Qwen3 30B-A3B (not Qwen3.6; closest MoE A3B)", ), ModelSpec( - "qwen3.5-35b", + 
"qwen3.5-35b-8bit", "mlx-community/Qwen3.5-35B-A3B-8bit", "qwen3:30b-a3b", "Ollama Qwen3 30B-A3B 4bit (not Qwen3.5-35B 8bit; closest MoE)", diff --git a/scripts/bench_vs_ollama.py b/scripts/bench_vs_ollama.py index 78696146..a717fa5b 100644 --- a/scripts/bench_vs_ollama.py +++ b/scripts/bench_vs_ollama.py @@ -8,7 +8,7 @@ Manual usage: python scripts/bench_vs_ollama.py - python scripts/bench_vs_ollama.py --model-pair qwen3.5-4b=qwen3.5:4b --runs 1 + python scripts/bench_vs_ollama.py --model-pair qwen3.5-4b-4bit=qwen3.5:4b --runs 1 python scripts/bench_vs_ollama.py --no-pull --no-download --runs 1 """ @@ -93,8 +93,8 @@ class ParsedStream: def default_model_pairs() -> list[ModelPair]: return [ - ModelPair("qwen3.5-4b", "qwen3.5:4b"), - ModelPair("qwen3.5-9b", "qwen3.5:9b"), + ModelPair("qwen3.5-4b-4bit", "qwen3.5:4b"), + ModelPair("qwen3.5-9b-4bit", "qwen3.5:9b"), ] diff --git a/scripts/local_bench_vs_ollama.py b/scripts/local_bench_vs_ollama.py index c1483125..e8dd23cf 100644 --- a/scripts/local_bench_vs_ollama.py +++ b/scripts/local_bench_vs_ollama.py @@ -96,7 +96,7 @@ def ollama_model_name(model: str) -> str: if ":" in model: return model known = { - "qwen3.5-4b": "qwen3:4b", + "qwen3.5-4b-4bit": "qwen3:4b", "qwen3.5-8b": "qwen3:8b", "qwen3.5-14b": "qwen3:14b", "qwen3.5-32b": "qwen3:32b", @@ -106,7 +106,7 @@ def ollama_model_name(model: str) -> str: "phi4-4b": "phi4:4b", "phi4-mini": "phi4-mini:latest", "gemma3-4b": "gemma3:4b", - "gemma3-12b": "gemma3:12b", + "gemma3-12b-4bit": "gemma3:12b", } if model in known: return known[model] @@ -683,7 +683,7 @@ def summary_row(metric: str, ratio: float, desc: str) -> None: # ── Main ────────────────────────────────────────────────────────────────────── def main() -> int: parser = argparse.ArgumentParser(description="Benchmark Rapid-MLX vs Ollama") - parser.add_argument("--model", default="qwen3.5-4b", help="Rapid-MLX model name") + parser.add_argument("--model", default="qwen3.5-4b-4bit", help="Rapid-MLX model name") 
parser.add_argument( "--ollama-model", default=None, diff --git a/scripts/mhi_batch.sh b/scripts/mhi_batch.sh index c3c9d6cc..cb4219dd 100644 --- a/scripts/mhi_batch.sh +++ b/scripts/mhi_batch.sh @@ -9,8 +9,8 @@ cd "$PROJECT_DIR" # Model definitions: name|path|tool_parser MODELS=( - "qwopus-27b|/Users/raullenstudio/.cache/huggingface/hub/models--Jackrong--MLX-Qwopus3.5-27B-v3-4bit/snapshots/d399209470abffa6b45678c53a910f869b18b2f2|hermes" - "deepseek-r1-32b|/Users/raullenstudio/.cache/huggingface/hub/models--mlx-community--DeepSeek-R1-Distill-Qwen-32B-4bit/snapshots/4e0d3848a0ad8f9fb54638891e4928f04fcca978|hermes" + "qwopus-27b-4bit|/Users/raullenstudio/.cache/huggingface/hub/models--Jackrong--MLX-Qwopus3.5-27B-v3-4bit/snapshots/d399209470abffa6b45678c53a910f869b18b2f2|hermes" + "deepseek-r1-32b-4bit|/Users/raullenstudio/.cache/huggingface/hub/models--mlx-community--DeepSeek-R1-Distill-Qwen-32B-4bit/snapshots/4e0d3848a0ad8f9fb54638891e4928f04fcca978|hermes" "llama-70b|/Volumes/Extreme SSD/Models/Llama-3.3-70B-Instruct-4bit|llama" ) diff --git a/scripts/mhi_eval.py b/scripts/mhi_eval.py index 7e55460b..77f7843f 100644 --- a/scripts/mhi_eval.py +++ b/scripts/mhi_eval.py @@ -17,7 +17,7 @@ python3 scripts/mhi_eval.py --base-url http://localhost:8000/v1 --suite tau # Custom model name - python3 scripts/mhi_eval.py --base-url http://localhost:8000/v1 --model qwopus-27b + python3 scripts/mhi_eval.py --base-url http://localhost:8000/v1 --model qwopus-27b-4bit """ import argparse diff --git a/scripts/pr_validate/golden_models.yaml b/scripts/pr_validate/golden_models.yaml index 033e3aaf..cd5d1afa 100644 --- a/scripts/pr_validate/golden_models.yaml +++ b/scripts/pr_validate/golden_models.yaml @@ -127,7 +127,7 @@ families: # absent from this matrix when Bug A shipped, so no PR could have # caught the regression at the integration layer. 
Gemma 4 remains # excluded (see note below — model-side instruction-following - # weaknesses produce agent-test flake), but gpt-oss-20b + # weaknesses produce agent-test flake), but gpt-oss-20b-mxfp4-q8 # (Harmony format) does NOT have those issues and gives us live # OutputRouter coverage for the second allowlist family. # @@ -182,8 +182,8 @@ families: # single fail (pydantic_ai single tool call) is a model-quality # edge case (off-topic Java/Spring response). Not flaky enough # to block merges, but adding here would gate every PR on it. -# Both families remain available via ``rapid-mlx serve smollm3-3b`` -# and ``rapid-mlx serve granite4-tiny`` (auto-detected parsers). +# Both families remain available via ``rapid-mlx serve smollm3-3b-4bit`` +# and ``rapid-mlx serve granite4-tiny-4bit`` (auto-detected parsers). # Per-model overrides: extra CLI args (e.g. parser overrides for # tool-calling tests). Keep small — most models don't need this. @@ -211,7 +211,7 @@ overrides: args: ["--enable-auto-tool-choice", "--tool-call-parser", "hermes"] # Harmony parser handles both <|channel|>commentary tool-call blocks and # the final-channel content for gpt-oss-20b. The auto-detect registry - # picks it via gpt-oss-20b alias profile, but the explicit override + # picks it via gpt-oss-20b-mxfp4-q8 alias profile, but the explicit override # documents intent at the validation surface (mirrors the qwen3.5/3.6 # convention above). "mlx-community/gpt-oss-20b-MXFP4-Q8": diff --git a/scripts/rename_aliases.py b/scripts/rename_aliases.py new file mode 100644 index 00000000..a33fa8a5 --- /dev/null +++ b/scripts/rename_aliases.py @@ -0,0 +1,167 @@ +#!/usr/bin/env python3.12 +"""One-shot rename of every alias in vllm_mlx/aliases.json to the canonical +explicit form ``-----``. 
+ +Drops three legacy short-form codename aliases that violate the spec +(``deepseek-v4-flash``, ``gemma4``, ``nemotron-nano``) and fixes the +``phi4-14b`` schema bug where the alias name claimed 14B but the hf_path +pointed at phi-4-mini (~4B). + +Also dumps ``rename_map.json`` so the repo-wide reference sweep can +mechanically rewrite occurrences in tests, docs, scripts. + +Run from repo root: + + python3.12 scripts/rename_aliases.py +""" + +import json +import re +from collections import OrderedDict +from pathlib import Path + +ROOT = Path(__file__).resolve().parents[1] +ALIASES_PATH = ROOT / "vllm_mlx" / "aliases.json" +RENAME_MAP_PATH = ROOT / "scripts" / "rename_map.json" + + +def detect_quant(hf: str) -> str | None: + """Inspect an hf_path and return the canonical quant suffix. + + Order matters: longer / more specific markers first so we don't + misclassify ``mxfp4-q8`` as ``mxfp4`` or ``-4bit-DWQ`` as ``-4bit``. + """ + h = hf.lower() + if "mxfp4-q8" in h or "mxfp4_q8" in h: + return "mxfp4-q8" + if "mxfp4" in h: + return "mxfp4" + if "dwq" in h: + return "dwq" + if "ud-mlx" in h or "-ud-" in h: + return "ud" + if m := re.search(r"-(\d+)bit", h): + return f"{m.group(1)}bit" + if "unpacked" in h: + return "unpacked" + if "bf16" in h: + return "bf16" + if "fp16" in h: + return "fp16" + return None + + +# 1. Aliases whose hf_path is unambiguous and only need the quant suffix added. +# These are determined automatically by detect_quant. + +# 2. Manual overrides — aliases that need name changes beyond just adding a suffix, +# or that need their hf_path corrected. ``redirect_to`` is the equivalent +# explicit alias the repo-wide sweep should rewrite references to. +MANUAL: dict[str, dict[str, str | bool] | None] = { + # Codename aliases — duplicate hf_path of an explicit entry. Drop entirely, + # but tell the sweep where to redirect references.
+ "deepseek-v4-flash": {"drop": True, "redirect_to": "deepseek-v4-flash-8bit"}, + "gemma4": {"drop": True, "redirect_to": "gemma-4-12b-qat-4bit"}, + "nemotron-nano": {"drop": True, "redirect_to": "nemotron-30b-4bit"}, + # phi4-14b: schema bug. hf_path was pointing at phi-4-mini (~4B). + # Fix: rename to phi-4-14b-4bit AND swap hf_path to the real Phi-4 14B. + # The 4B mini variant moves to its own new alias (added below). + "phi4-14b": { + "new_name": "phi-4-14b-4bit", + "new_hf_path": "mlx-community/phi-4-4bit", + }, +} + +# 3. Brand-new aliases to add at the end (preserves old phi-4-mini coverage). +NEW_ALIASES: list[tuple[str, dict]] = [ + ( + "phi-4-mini-4bit", + { + "hf_path": "mlx-community/phi-4-mini-instruct-4bit", + "tool_call_parser": "hermes", + "reasoning_parser": None, + "is_hybrid": False, + "is_moe": False, + "supports_spec_decode": True, + "suffix_decoding_tier": "unknown", + }, + ), +] + + +def compute_new_name(old: str, hf: str) -> str: + quant = detect_quant(hf) + if quant is None: + raise ValueError(f"alias {old!r}: cannot detect quant from {hf!r}") + # If the old name already ends in a known quant suffix, replace it. + stem = re.sub( + r"-(2bit|3bit|4bit|6bit|8bit|mxfp4-q8|mxfp4|dwq|ud|unpacked|bf16|fp16)$", + "", + old, + ) + return f"{stem}-{quant}" + + +def main() -> None: + with open(ALIASES_PATH) as fp: + data = json.load(fp, object_pairs_hook=OrderedDict) + + new_data: OrderedDict[str, object] = OrderedDict() + rename_map: dict[str, str | None] = {} + + for old, profile in data.items(): + # Handle manual overrides first. + if old in MANUAL: + spec = MANUAL[old] + if spec is None: + # Drop entirely (legacy form — kept for the type checker). + rename_map[old] = None + continue + if spec.get("drop"): + # Codename alias — drop from aliases.json but tell the sweep + # where to point references. 
+ rename_map[old] = spec["redirect_to"] + continue + new_name = spec["new_name"] + if "new_hf_path" in spec: + profile = OrderedDict(profile) + profile["hf_path"] = spec["new_hf_path"] + new_data[new_name] = profile + rename_map[old] = new_name + continue + + # Default path: keep the entry, rewrite the key. + hf = profile["hf_path"] if isinstance(profile, dict) else profile + new_name = compute_new_name(old, hf) + if new_name in new_data: + raise ValueError(f"rename collision: {old!r} -> {new_name!r} already used") + new_data[new_name] = profile + rename_map[old] = new_name + + # Append brand-new aliases (skip if already present so the script is + # idempotent — useful when iterating on the rename rules). + for name, profile in NEW_ALIASES: + if name not in new_data: + new_data[name] = profile + + # Write back. + with open(ALIASES_PATH, "w") as fp: + json.dump(new_data, fp, indent=2) + fp.write("\n") + + with open(RENAME_MAP_PATH, "w") as fp: + json.dump(rename_map, fp, indent=2, sort_keys=True) + fp.write("\n") + + renamed = sum(1 for o, n in rename_map.items() if n and o != n) + dropped = sum(1 for n in rename_map.values() if n is None) + kept = sum(1 for o, n in rename_map.items() if n and o == n) + print(f" renamed: {renamed}") + print(f" dropped: {dropped}") + print(f" kept (already explicit): {kept}") + print(f" new aliases added: {len(NEW_ALIASES)}") + print(f" total: {len(new_data)}") + + +if __name__ == "__main__": + main() diff --git a/scripts/rename_map.json b/scripts/rename_map.json new file mode 100644 index 00000000..810983c0 --- /dev/null +++ b/scripts/rename_map.json @@ -0,0 +1,76 @@ +{ + "bonsai-1.7b": "bonsai-1.7b-unpacked", + "bonsai-4b": "bonsai-4b-unpacked", + "bonsai-8b": "bonsai-8b-unpacked", + "deepseek-r1-32b": "deepseek-r1-32b-4bit", + "deepseek-r1-8b": "deepseek-r1-8b-4bit", + "deepseek-v4-flash": "deepseek-v4-flash-8bit", + "deepseek-v4-flash-2bit": "deepseek-v4-flash-2bit", + "deepseek-v4-flash-4bit": "deepseek-v4-flash-4bit", + 
"deepseek-v4-flash-8bit": "deepseek-v4-flash-8bit", + "devstral-24b": "devstral-24b-4bit", + "devstral-v2-24b": "devstral-v2-24b-4bit", + "gemma-3n-e4b": "gemma-3n-e4b-4bit", + "gemma-4-12b": "gemma-4-12b-4bit", + "gemma-4-12b-8bit": "gemma-4-12b-8bit", + "gemma-4-12b-qat": "gemma-4-12b-qat-4bit", + "gemma-4-12b-qat-8bit": "gemma-4-12b-qat-8bit", + "gemma-4-26b": "gemma-4-26b-4bit", + "gemma-4-26b-qat": "gemma-4-26b-qat-4bit", + "gemma-4-31b": "gemma-4-31b-4bit", + "gemma-4-31b-8bit": "gemma-4-31b-8bit", + "gemma-4-31b-qat": "gemma-4-31b-qat-4bit", + "gemma-4-31b-qat-8bit": "gemma-4-31b-qat-8bit", + "gemma3-12b": "gemma3-12b-4bit", + "gemma3-1b": "gemma3-1b-4bit", + "gemma3-27b": "gemma3-27b-4bit", + "gemma4": "gemma-4-12b-qat-4bit", + "glm4.5-air": "glm4.5-air-4bit", + "glm4.7-9b": "glm4.7-9b-4bit", + "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8", + "granite4-tiny": "granite4-tiny-4bit", + "hermes3-8b": "hermes3-8b-4bit", + "hermes4-70b": "hermes4-70b-4bit", + "kimi-48b": "kimi-48b-4bit", + "kimi-k2.5": "kimi-k2.5-3bit", + "llama-3.1-8b-8bit": "llama-3.1-8b-8bit", + "llama3-1b": "llama3-1b-4bit", + "llama3-3b": "llama3-3b-4bit", + "minimax-m2.5": "minimax-m2.5-4bit", + "minimax-m2.7": "minimax-m2.7-mxfp4", + "ministral-3b": "ministral-3b-4bit", + "mistral-24b": "mistral-24b-4bit", + "nemotron-30b": "nemotron-30b-4bit", + "nemotron-nano": "nemotron-30b-4bit", + "phi4-14b": "phi-4-14b-4bit", + "qwen3-0.6b-8bit": "qwen3-0.6b-8bit", + "qwen3-4b-8bit": "qwen3-4b-8bit", + "qwen3-8b-8bit": "qwen3-8b-8bit", + "qwen3-coder": "qwen3-coder-4bit", + "qwen3-coder-30b": "qwen3-coder-30b-4bit", + "qwen3-vl-30b": "qwen3-vl-30b-4bit", + "qwen3-vl-4b": "qwen3-vl-4b-4bit", + "qwen3-vl-8b": "qwen3-vl-8b-4bit", + "qwen3.5-122b": "qwen3.5-122b-mxfp4", + "qwen3.5-122b-8bit": "qwen3.5-122b-8bit", + "qwen3.5-27b": "qwen3.5-27b-4bit", + "qwen3.5-27b-8bit": "qwen3.5-27b-8bit", + "qwen3.5-35b": "qwen3.5-35b-8bit", + "qwen3.5-35b-4bit": "qwen3.5-35b-4bit", + "qwen3.5-4b": "qwen3.5-4b-4bit", + 
"qwen3.5-4b-8bit": "qwen3.5-4b-8bit", + "qwen3.5-9b": "qwen3.5-9b-4bit", + "qwen3.5-9b-8bit": "qwen3.5-9b-8bit", + "qwen3.6-27b": "qwen3.6-27b-4bit", + "qwen3.6-27b-8bit": "qwen3.6-27b-8bit", + "qwen3.6-27b-ud": "qwen3.6-27b-ud", + "qwen3.6-35b": "qwen3.6-35b-4bit", + "qwen3.6-35b-6bit": "qwen3.6-35b-6bit", + "qwen3.6-35b-8bit": "qwen3.6-35b-8bit", + "qwen3.6-35b-dwq": "qwen3.6-35b-dwq", + "qwen3.6-35b-ud": "qwen3.6-35b-ud", + "qwopus-27b": "qwopus-27b-4bit", + "qwopus-27b-8bit": "qwopus-27b-8bit", + "qwopus-9b": "qwopus-9b-4bit", + "smollm3-3b": "smollm3-3b-4bit" +} diff --git a/scripts/run_dogfood_mvp.sh b/scripts/run_dogfood_mvp.sh index da638b58..24ab50d4 100755 --- a/scripts/run_dogfood_mvp.sh +++ b/scripts/run_dogfood_mvp.sh @@ -11,7 +11,7 @@ # scripts/run_dogfood_mvp.sh status # # Env vars: -# MODEL alias to serve (default: qwen3.5-35b) +# MODEL alias to serve (default: qwen3.5-35b-8bit) # PORT local port (default: 8765) # API_KEY bearer token (default: random 24 hex bytes) # RAPID_MLX_CMD serve command (default: auto — editable `python3.12 -m @@ -114,7 +114,7 @@ fi # Pick the serve command. Prefer the editable repo CLI when we're inside # vllm-mlx — the brew-installed `rapid-mlx` ships an older aliases.json -# and won't see recent additions like `minimax-m2.7`. +# and won't see recent additions like `minimax-m2.7-mxfp4`. if [ -z "${RAPID_MLX_CMD:-}" ]; then if [ -f "$(git rev-parse --show-toplevel 2>/dev/null)/vllm_mlx/cli.py" ] \ && python3.12 -c "import vllm_mlx" >/dev/null 2>&1; then diff --git a/scripts/sweep_alias_refs.py b/scripts/sweep_alias_refs.py new file mode 100644 index 00000000..918f2ca7 --- /dev/null +++ b/scripts/sweep_alias_refs.py @@ -0,0 +1,215 @@ +#!/usr/bin/env python3.12 +"""Repo-wide rewrite of legacy alias names to canonical explicit names. + +Loads ``scripts/rename_map.json`` (generated by ``rename_aliases.py``) and +applies the mapping inside every active source file. 
+ +EXCLUDED on purpose: + * ``evals/results/*.json`` — historical benchmark snapshots; their + ``model`` field records *which alias the bench was run under at the + time*. Rewriting these is rewriting history. + * ``CHANGELOG.md`` (if present) — historical release notes; same logic. + * ``.git/``, ``.build/``, ``.venv/``, ``__pycache__/``, ``node_modules/`` + * ``vllm_mlx/aliases.json`` — already correctly written by + ``rename_aliases.py``. + * ``scripts/rename_aliases.py``, ``scripts/rename_map.json``, + ``scripts/sweep_alias_refs.py`` — these files document the rename + itself; legacy names must appear in them by construction. + +Run from repo root: + + python3.12 scripts/sweep_alias_refs.py [--dry-run] +""" + +import argparse +import json +import re +import sys +from pathlib import Path + +ROOT = Path(__file__).resolve().parents[1] +RENAME_MAP_PATH = ROOT / "scripts" / "rename_map.json" + +# File extensions we sweep. +EXTENSIONS = { + ".py", + ".md", + ".sh", + ".yaml", + ".yml", + ".toml", + ".cfg", + ".rst", + ".mdx", + ".json", + ".txt", +} + +# Path prefixes to ignore entirely. +SKIP_DIR_PARTS = { + ".git", + ".build", + ".venv", + "venv", + "__pycache__", + "node_modules", + ".pytest_cache", + ".mypy_cache", + ".ruff_cache", + "dist", + "build", + ".tox", + ".claude", +} + +# Specific files to leave alone — see module docstring. +SKIP_FILES = { + ROOT / "vllm_mlx" / "aliases.json", + ROOT / "scripts" / "rename_aliases.py", + ROOT / "scripts" / "rename_map.json", + ROOT / "scripts" / "sweep_alias_refs.py", + ROOT / "CHANGELOG.md", +} + +# Specific subtrees to leave alone. +SKIP_TREES = ( + ROOT / "evals" / "results", + # Historical doctor harness runs — directory names are + # ``YYYY-MM-DD--`` and the contents are point-in-time + # snapshots of what each alias produced. Rewriting them is + # rewriting history. 
The ``harness/scorecard/latest.md`` pointer + # OUTSIDE this tree is intentionally not skipped — it's the live + # dashboard and will be regenerated on the next ``make check``. + ROOT / "harness" / "runs", + # Historical README-refresh benchmark snapshots — captured to + # justify a release-time README edit; rewriting changes what we + # claimed at the time. + ROOT / "reports" / "benchmarks", + # Historical Model Harness Index reports — filenames have a + # timestamp suffix (``alias_YYYYMMDD_HHMMSS.json``); rewriting + # the ``model`` field inside changes which alias the historical + # MHI score was attributed to. + ROOT / "reports" / "mhi", +) + + +def should_skip(path: Path) -> bool: + if path.is_dir(): + return any(part in SKIP_DIR_PARTS for part in path.parts) + if path in SKIP_FILES: + return True + if any(part in SKIP_DIR_PARTS for part in path.parts): + return True + for tree in SKIP_TREES: + try: + path.relative_to(tree) + return True + except ValueError: + pass + if path.suffix not in EXTENSIONS: + return True + return False + + +def build_pattern(rename_map: dict[str, str]) -> re.Pattern: + """Build one big regex matching any legacy alias on a word boundary. + + Sort by length descending so longer names match before any prefix + of theirs (e.g. ``qwen3.5-122b`` must match before ``qwen3.5-1``). + """ + keys = sorted(rename_map.keys(), key=len, reverse=True) + # Custom boundary: the names contain ``-`` and ``.`` so the stdlib + # ``\b`` doesn't apply cleanly. Match negative-lookahead/lookbehind + # against ``[A-Za-z0-9._-]`` — i.e. don't extend on either side. + parts = [re.escape(k) for k in keys] + return re.compile( + r"(?<![A-Za-z0-9._-])(" + "|".join(parts) + r")(?![A-Za-z0-9._-])" + )
+ + +def rewrite_file(path: Path, rename_map: dict[str, str], pattern: re.Pattern) -> tuple[bool, int, str]: + try: + original = path.read_text(encoding="utf-8") + except (UnicodeDecodeError, PermissionError): + return False, 0, "" + count = 0 + + def sub(m: re.Match) -> str: + nonlocal count + count += 1 + return rename_map[m.group(1)] + + new = pattern.sub(sub, original) + if new != original: + return True, count, new + return False, 0, original + + +def main() -> int: + parser = argparse.ArgumentParser() + parser.add_argument( + "--dry-run", + action="store_true", + help="report files that would change, don't write", + ) + args = parser.parse_args() + + rename_map = json.loads(RENAME_MAP_PATH.read_text()) + # Filter to entries with a real target (drops + renames). + rename_map = {o: n for o, n in rename_map.items() if n} + # High-ambiguity legacy aliases that collide with non-alias usage + # elsewhere in the codebase. ``gemma4`` is the parser ID + # (registered in ``gemma4_tool_parser.py``, referenced from + # ``model_auto_config.py``, ``output_router.py``, etc.) — auto- + # rewriting it would corrupt the parser registry. These are handled + # by hand rather than by this script. 
+ HAND_HANDLED = {"gemma4"} + rename_map = {o: n for o, n in rename_map.items() if o not in HAND_HANDLED} + + pattern = build_pattern(rename_map) + + changed_files = [] + total_replacements = 0 + + for path in ROOT.rglob("*"): + if should_skip(path): + continue + if not path.is_file(): + continue + try: + original = path.read_text(encoding="utf-8") + except (UnicodeDecodeError, PermissionError): + continue + count = 0 + + def sub(m: re.Match) -> str: + nonlocal count + count += 1 + return rename_map[m.group(1)] + + new = pattern.sub(sub, original) + if new != original: + changed_files.append((path, count)) + total_replacements += count + if not args.dry_run: + path.write_text(new, encoding="utf-8") + + rel_root = ROOT + print(f" files changed: {len(changed_files)}") + print(f" total replacements: {total_replacements}") + if args.dry_run: + print(" (dry-run — no files written)") + print() + for path, count in sorted(changed_files, key=lambda x: -x[1])[:20]: + rel = path.relative_to(rel_root) + print(f" {count:4d} {rel}") + if len(changed_files) > 20: + print(f" ... 
+{len(changed_files) - 20} more") + + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tests/fixtures/generation_configs/devstral-24b.json b/tests/fixtures/generation_configs/devstral-24b-4bit.json similarity index 100% rename from tests/fixtures/generation_configs/devstral-24b.json rename to tests/fixtures/generation_configs/devstral-24b-4bit.json diff --git a/tests/fixtures/generation_configs/gemma-3n-e4b.json b/tests/fixtures/generation_configs/gemma-3n-e4b-4bit.json similarity index 100% rename from tests/fixtures/generation_configs/gemma-3n-e4b.json rename to tests/fixtures/generation_configs/gemma-3n-e4b-4bit.json diff --git a/tests/fixtures/generation_configs/gemma3-27b.json b/tests/fixtures/generation_configs/gemma3-27b-4bit.json similarity index 100% rename from tests/fixtures/generation_configs/gemma3-27b.json rename to tests/fixtures/generation_configs/gemma3-27b-4bit.json diff --git a/tests/fixtures/generation_configs/glm4.5-air.json b/tests/fixtures/generation_configs/glm4.5-air-4bit.json similarity index 100% rename from tests/fixtures/generation_configs/glm4.5-air.json rename to tests/fixtures/generation_configs/glm4.5-air-4bit.json diff --git a/tests/fixtures/generation_configs/glm4.7-9b.json b/tests/fixtures/generation_configs/glm4.7-9b-4bit.json similarity index 100% rename from tests/fixtures/generation_configs/glm4.7-9b.json rename to tests/fixtures/generation_configs/glm4.7-9b-4bit.json diff --git a/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py b/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py index 6f4d902e..9b18c71e 100644 --- a/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py +++ b/tests/parsers/regressions/test_issue_444_harmony_tool_call_leak.py @@ -9,7 +9,7 @@ raw body ``...to=functions.get_weather...{"city": "Tokyo"}...`` leaks into ``delta.content``. 
- * #480 (2026-05-28, gpt-oss-20b, ``tool_choice="auto"``): raw body + * #480 (2026-05-28, gpt-oss-20b-mxfp4-q8, ``tool_choice="auto"``): raw body ``commentary to=functions.get_weather json{"city":"Paris"}`` leaks into ``delta.content``. User-facing duplicate of #444 with a different prompt — covered here to ensure the fix applies @@ -31,7 +31,7 @@ Test cases sourced verbatim from the issue body's repro section. -Scope caveat (post-live-verification on gpt-oss-20b 2026-06-04): +Scope caveat (post-live-verification on gpt-oss-20b-mxfp4-q8 2026-06-04): this file exercises the ``HarmonyToolParser`` streaming entry point in isolation against the FULL markered text (`<|channel|>commentary to=functions.X<|message|>{body}<|call|>`). The parser-level fix @@ -85,7 +85,7 @@ class _Case: # Verbatim from the repro in issue #444 (and adjacent harmony commentary -# formats observed on gpt-oss-20b). Each case is the FULL model output +# formats observed on gpt-oss-20b-mxfp4-q8). Each case is the FULL model output # from the ``<|channel|>commentary`` token through the closing ``<|call|>``, # i.e. what the model emits when invoking exactly one tool. TEST_CASES: list[_Case] = [ diff --git a/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py b/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py index b8734751..212bdfde 100644 --- a/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py +++ b/tests/parsers/regressions/test_issue_448_hermes_function_prefix_leak.py @@ -2,7 +2,7 @@ """Regression guard for #448 — hermes streaming leaks `{...}`` wire format. The hermes non-stream regex handles both ``...`` and bare `` str: "Issue #468 (router-level portion) — compound analysis + " "commentary sequence leaks the commentary block as CONTENT " "text. Same family-wide gap as #455. 
Live verification on " - "gpt-oss-20b (2026-06-04) confirmed the symptom AND surfaced " + "gpt-oss-20b-mxfp4-q8 (2026-06-04) confirmed the symptom AND surfaced " "the deeper constraint that breaks naive single-token fixes: " "production ``commentary`` is two tokens (``comment``+``ary``). " "Eventual fix must lookahead-decode the channel-type word or " diff --git a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py index b322bcb5..0657adab 100644 --- a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py +++ b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py @@ -8,7 +8,7 @@ ``commentary`` + ``to=functions.`` + optional ``<|constrain|>json`` + body + ``<|call|>``. PR #514 confirmed ``commentary`` is multi-token (``comment``+``ary``) on production -gpt-oss-20b, which the custom token-ID-match state machine could +gpt-oss-20b-mxfp4-q8, which the custom token-ID-match state machine could never identify. PR #515 lands the SOTA fix: delegate harmony state tracking to @@ -80,7 +80,7 @@ def router(encoding): def _encode(encoding, text: str) -> list[int]: """Wrap encode with allowed_special=all so structural markers (``<|channel|>`` etc.) round-trip as single token IDs the way - a real gpt-oss-20b would emit them. + a real gpt-oss-20b-mxfp4-q8 would emit them. 
""" return encoding.encode(text, allowed_special="all") @@ -699,7 +699,7 @@ def encode(self, text, add_special_tokens=False): "my-not-gpt-oss-20b", "notgpt-oss-fake", "some-user/gpt-oss-remapped", - "evil-org/gpt-oss-20b", + "evil-org/gpt-oss-20b-mxfp4-q8", "anonymous/gpt-oss", ) for name in rejected_names: @@ -713,15 +713,15 @@ class _Fake(_CompatTokenizerBase): ) accepted_names = ( - "openai/gpt-oss-20b", + "openai/gpt-oss-20b-mxfp4-q8", "mlx-community/gpt-oss-20b-MXFP4-Q8", "unsloth/gpt-oss-20b-MLX-8bit", - "gpt-oss-20b", + "gpt-oss-20b-mxfp4-q8", "gpt-oss", - "/models/gpt-oss-20b", - "~/lmstudio-models/gpt-oss-20b", + "/models/gpt-oss-20b-mxfp4-q8", + "~/lmstudio-models/gpt-oss-20b-mxfp4-q8", "./gpt-oss-20b-quantized", - "../models/gpt-oss-20b", + "../models/gpt-oss-20b-mxfp4-q8", ) for name in accepted_names: diff --git a/tests/test_aliases_contract.py b/tests/test_aliases_contract.py index e4784d4b..c007dfe4 100644 --- a/tests/test_aliases_contract.py +++ b/tests/test_aliases_contract.py @@ -436,27 +436,27 @@ def test_negative_control_dflash_missing_drafter_is_caught() -> None: def test_audit_batch_reasoning_parser_wirings() -> None: """Pin the Model Onboarding SOP audit fixes for reasoning_parser - on nemotron / kimi-k2.5 / hermes4 aliases. Each was previously + on nemotron / kimi-k2.5-3bit / hermes4 aliases. Each was previously ``null`` despite the model emitting ````/```` blocks — without the parser, those blocks leak into ``message.content``. Parser choice rationale: - - nemotron-30b/nano + kimi-k2.5 use a Qwen3-style template that + - nemotron-30b-4bit/nano + kimi-k2.5-3bit use a Qwen3-style template that INJECTS ```` into the prompt (gated by ``enable_thinking`` / ``thinking`` flag). ``qwen3`` parser's ``finalize_streaming`` correction handles the "no ever appeared → emit as content" case correctly. - - hermes4-70b: the chat template does NOT inject ````; + - hermes4-70b-4bit: the chat template does NOT inject ````; the model decides autonomously. 
Same contract as GLM-4 → reuse ``glm4`` parser (no-tags-yet → content semantics). """ profiles = list_profiles() expected = { - "nemotron-30b": "qwen3", - "nemotron-nano": "qwen3", - "kimi-k2.5": "qwen3", - "hermes4-70b": "glm4", + "nemotron-30b-4bit": "qwen3", + "nemotron-30b-4bit": "qwen3", + "kimi-k2.5-3bit": "qwen3", + "hermes4-70b-4bit": "glm4", } for alias, parser in expected.items(): assert alias in profiles, f"{alias} missing from aliases.json" @@ -482,7 +482,7 @@ def test_bonsai_family_wires_glm4_reasoning_parser() -> None: zero behavioural downside for non-thinking turns. """ profiles = list_profiles() - for alias in ("bonsai-1.7b", "bonsai-4b", "bonsai-8b"): + for alias in ("bonsai-1.7b-unpacked", "bonsai-4b-unpacked", "bonsai-8b-unpacked"): assert alias in profiles, f"{alias} missing from aliases.json" assert profiles[alias].reasoning_parser == "glm4", ( f"{alias}: reasoning_parser must be 'glm4' per audit. " @@ -499,7 +499,7 @@ def test_audit_batch_bonsai_tool_call_parser_wired() -> None: https://huggingface.co/prism-ml/Bonsai-1.7B-unpacked. """ profiles = list_profiles() - for alias in ("bonsai-1.7b", "bonsai-4b", "bonsai-8b"): + for alias in ("bonsai-1.7b-unpacked", "bonsai-4b-unpacked", "bonsai-8b-unpacked"): assert alias in profiles, f"{alias} missing from aliases.json" assert profiles[alias].tool_call_parser == "hermes", ( f"{alias}: tool_call_parser must be 'hermes' per audit. " @@ -519,7 +519,7 @@ def test_deepseek_v4_flash_family_wires_deepseek_r1_reasoning_parser() -> None: """ profiles = list_profiles() family = [ - "deepseek-v4-flash", + "deepseek-v4-flash-8bit", "deepseek-v4-flash-2bit", "deepseek-v4-flash-4bit", "deepseek-v4-flash-8bit", @@ -545,44 +545,44 @@ def test_aliases_with_known_broken_hf_paths_stay_fixed() -> None: change" commit doesn't quietly restore the broken path. 
""" profiles = list_profiles() - # qwen3-vl-4b: stale ``-MLX-`` suffix not used by upstream uploads - assert "MLX-4bit" not in profiles["qwen3-vl-4b"].hf_path, ( - "qwen3-vl-4b previously pointed at " + # qwen3-vl-4b-4bit: stale ``-MLX-`` suffix not used by upstream uploads + assert "MLX-4bit" not in profiles["qwen3-vl-4b-4bit"].hf_path, ( + "qwen3-vl-4b-4bit previously pointed at " "mlx-community/Qwen3-VL-4B-Instruct-MLX-4bit which 404s; the " "current upload is Qwen3-VL-4B-Instruct-4bit (no '-MLX-' suffix)." ) - # devstral-24b: ``2503`` snapshot was never re-uploaded as MLX-4bit; + # devstral-24b-4bit: ``2503`` snapshot was never re-uploaded as MLX-4bit; # 2505/2507 are the canonical Devstral-Small v1 releases. - assert "2503" not in profiles["devstral-24b"].hf_path, ( - "devstral-24b previously pointed at Devstral-Small-2503-MLX-4bit " + assert "2503" not in profiles["devstral-24b-4bit"].hf_path, ( + "devstral-24b-4bit previously pointed at Devstral-Small-2503-MLX-4bit " "which 404s. Use the 2507 (or 2505) MLX 4-bit upload." ) - # glm4.5-air: ``-0111-`` date suffix was a community-only tag that + # glm4.5-air-4bit: ``-0111-`` date suffix was a community-only tag that # got rolled into the default release. - assert "0111" not in profiles["glm4.5-air"].hf_path, ( - "glm4.5-air previously pointed at GLM-4.5-Air-0111-4bit which " + assert "0111" not in profiles["glm4.5-air-4bit"].hf_path, ( + "glm4.5-air-4bit previously pointed at GLM-4.5-Air-0111-4bit which " "404s. The current canonical upload is GLM-4.5-Air-4bit." ) - # glm4.7-9b previously pointed at the full GLM-4.7 (355B MoE, + # glm4.7-9b-4bit previously pointed at the full GLM-4.7 (355B MoE, # ~185 GB at 4-bit) — the alias name implies a 9B model. The # correct upload is the Flash variant (~16 GB). 
- assert "Flash" in profiles["glm4.7-9b"].hf_path, ( - "glm4.7-9b must point at the GLM-4.7-Flash upload, not the full " + assert "Flash" in profiles["glm4.7-9b-4bit"].hf_path, ( + "glm4.7-9b-4bit must point at the GLM-4.7-Flash upload, not the full " "GLM-4.7 (355B MoE) which is ~12x larger and won't fit on most " "user disks." ) - # gpt-oss-20b previously pointed at mlx-community/GPT-OSS-20B-4bit + # gpt-oss-20b-mxfp4-q8 previously pointed at mlx-community/GPT-OSS-20B-4bit # which 404s; the canonical mlx-community release uses the # MXFP4-Q8 hybrid quantization. - assert profiles["gpt-oss-20b"].hf_path != "mlx-community/GPT-OSS-20B-4bit", ( - "gpt-oss-20b must not regress to the 404 path; current canonical " + assert profiles["gpt-oss-20b-mxfp4-q8"].hf_path != "mlx-community/GPT-OSS-20B-4bit", ( + "gpt-oss-20b-mxfp4-q8 must not regress to the 404 path; current canonical " "upload is mlx-community/gpt-oss-20b-MXFP4-Q8." ) - # kimi-48b previously pointed at mlx-community/Kimi-K2-Instruct-Q4_0-MLX + # kimi-48b-4bit previously pointed at mlx-community/Kimi-K2-Instruct-Q4_0-MLX # (404). The replacement Kimi-K2-Instruct-4bit is large # (~540 GB) but is the actual mlx-community Kimi K2 Instruct release. - assert "Q4_0" not in profiles["kimi-48b"].hf_path, ( - "kimi-48b must not regress to the Q4_0 path which 404s." + assert "Q4_0" not in profiles["kimi-48b-4bit"].hf_path, ( + "kimi-48b-4bit must not regress to the Q4_0 path which 404s." ) @@ -603,46 +603,46 @@ def test_aliases_with_known_broken_hf_paths_stay_fixed() -> None: # Devstral 1.x — Mistral code-tuned model card example uses 0.15 # for interactive coding (see model card on huggingface.co/mistralai). # Devstral 2.x ships the same empty stub; same pattern applies. 
- "devstral-24b": {"temperature": 0.15}, - "devstral-v2-24b": {"temperature": 0.15}, + "devstral-24b-4bit": {"temperature": 0.15}, + "devstral-v2-24b-4bit": {"temperature": 0.15}, # Gemma 3 family — Google's Gemma docs recommend # (temperature=1.0, top_p=0.95, top_k=64) for the chat-tuned models. # All of gemma-3-1b / gemma-3-12b / gemma-3-27b ship an empty stub # locally (`_from_model_config: true` plus eos/pad tokens only). - "gemma3-1b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, - "gemma3-12b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, - "gemma3-27b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma3-1b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma3-12b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma3-27b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, # gemma-3n-E4B ships top_p=0.95 and top_k=64 upstream but no # temperature. We bake in the full triple anyway (matches the # rest of the Gemma family) so a future mlx-community re-quant # that drops generation_config.json doesn't silently regress to # the framework fallback (0.7 / 0.9). - "gemma-3n-e4b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-3n-e4b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, # Gemma 4 — official Google sampling guidance hasn't been # published yet at the time of writing; we extrapolate from the # Gemma 3 family card. Revisit when an official Gemma 4 doc lands. 
- "gemma-4-12b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-4-12b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, "gemma-4-12b-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, - "gemma-4-26b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, - "gemma-4-31b": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-4-26b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-4-31b-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, "gemma-4-31b-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, # Gemma 4 QAT variants — same sampling as PTQ siblings. QAT changes # weight distribution (training with simulated quantization) not the # decoding distribution, so Google's chat sampling guidance applies # unchanged. - "gemma-4-12b-qat": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-4-12b-qat-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, "gemma-4-12b-qat-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, - "gemma-4-26b-qat": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, - "gemma-4-31b-qat": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-4-26b-qat-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, + "gemma-4-31b-qat-4bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, "gemma-4-31b-qat-8bit": {"temperature": 1.0, "top_p": 0.95, "top_k": 64.0}, # GLM-4.5-Air — THUDM publishes two recommendations: temperature=0.6 # for *thinking* mode, ~1.0 for non-thinking. The alias has # reasoning_parser=glm4 → thinking IS the default response path, # so 0.6 is the right pick. (Users who want non-thinking can pass # temperature explicitly per-request.) - "glm4.5-air": {"temperature": 0.6, "top_p": 0.95}, + "glm4.5-air-4bit": {"temperature": 0.6, "top_p": 0.95}, # GLM-4.7-Flash ships temperature=1.0 upstream; we add only top_p. 
- "glm4.7-9b": {"top_p": 0.95}, + "glm4.7-9b-4bit": {"top_p": 0.95}, } diff --git a/tests/test_anthropic_stop_sequences.py b/tests/test_anthropic_stop_sequences.py index ece2579b..e8ea6223 100644 --- a/tests/test_anthropic_stop_sequences.py +++ b/tests/test_anthropic_stop_sequences.py @@ -6,7 +6,7 @@ NOT ``stop`` — so ``stop_sequences`` from the request flowed through ``anthropic_to_openai`` into ``openai_request.stop`` and then died at the route boundary. Engine ran uncapped, model emitted past the user's stop -tokens. Surfaced by the iter8 onboarding sweep on gpt-oss-20b: same +tokens. Surfaced by the iter8 onboarding sweep on gpt-oss-20b-mxfp4-q8: same prompt + ``stop_sequences:["STOPHERE"]`` returned full text including "STOPHERE", finish_reason=end_turn. Identical prompt via /v1/chat/ completions with ``stop:["STOPHERE"]`` stopped correctly. Fix: include diff --git a/tests/test_api_validation_bundle.py b/tests/test_api_validation_bundle.py index 440b76dd..595f7aa0 100644 --- a/tests/test_api_validation_bundle.py +++ b/tests/test_api_validation_bundle.py @@ -541,7 +541,7 @@ def test_array_of_strings_still_works(self): class TestPsCommandPortParsing: """``rapid-mlx ps`` used to break on the first positional argument, - so ``serve qwen3.5-4b --port 8005`` showed port=8000 (the default). + so ``serve qwen3.5-4b-4bit --port 8005`` showed port=8000 (the default). 
Verify the parser keeps scanning for flags after capturing the positional model.""" @@ -575,28 +575,28 @@ def _parse_serve(self, cmd_words): def test_port_after_positional_model(self): model, port = self._parse_serve( - ["rapid-mlx", "serve", "qwen3.5-4b", "--port", "8005"] + ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--port", "8005"] ) - assert model == "qwen3.5-4b" + assert model == "qwen3.5-4b-4bit" assert port == "8005" def test_port_before_positional_model(self): model, port = self._parse_serve( - ["rapid-mlx", "serve", "--port", "8005", "qwen3.5-4b"] + ["rapid-mlx", "serve", "--port", "8005", "qwen3.5-4b-4bit"] ) - assert model == "qwen3.5-4b" + assert model == "qwen3.5-4b-4bit" assert port == "8005" def test_port_equals_form(self): model, port = self._parse_serve( - ["rapid-mlx", "serve", "qwen3.5-4b", "--port=9000"] + ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--port=9000"] ) - assert model == "qwen3.5-4b" + assert model == "qwen3.5-4b-4bit" assert port == "9000" def test_no_port_uses_default(self): - model, port = self._parse_serve(["rapid-mlx", "serve", "qwen3.5-4b"]) - assert model == "qwen3.5-4b" + model, port = self._parse_serve(["rapid-mlx", "serve", "qwen3.5-4b-4bit"]) + assert model == "qwen3.5-4b-4bit" assert port == "8000" diff --git a/tests/test_batched_engine_output_router.py b/tests/test_batched_engine_output_router.py index 41a4c5ab..348d7bba 100644 --- a/tests/test_batched_engine_output_router.py +++ b/tests/test_batched_engine_output_router.py @@ -426,7 +426,7 @@ async def test_router_tool_call_body_preserved_single_token_flush(family): chunks reuse the scheduler's detokenized ``output.new_text`` would, if applied to TOOL_CALL events, override the accumulated body with just the end-marker token's text, silently dropping the call body. 
Caught - post-v0.6.61 on gemma-4-26b — non-stream extracted a valid tool call + post-v0.6.61 on gemma-4-26b-4bit — non-stream extracted a valid tool call from the same generation that streaming returned as bare content. Parametrized over every router-allowlist family that emits TOOL_CALL diff --git a/tests/test_bench_vs_ollama.py b/tests/test_bench_vs_ollama.py index d37c9fcb..9a0d8b29 100644 --- a/tests/test_bench_vs_ollama.py +++ b/tests/test_bench_vs_ollama.py @@ -29,8 +29,8 @@ def test_default_model_pairs(): pairs = bench.default_model_pairs() assert pairs == [ - bench.ModelPair("qwen3.5-4b", "qwen3.5:4b"), - bench.ModelPair("qwen3.5-9b", "qwen3.5:9b"), + bench.ModelPair("qwen3.5-4b-4bit", "qwen3.5:4b"), + bench.ModelPair("qwen3.5-9b-4bit", "qwen3.5:9b"), ] @@ -46,7 +46,7 @@ def test_parse_model_pair_rejects_missing_separator(): bench = load_bench_module() with pytest.raises(ValueError, match="RAPID=OLLAMA"): - bench.parse_model_pair("qwen3.5-9b") + bench.parse_model_pair("qwen3.5-9b-4bit") def test_parse_args_replaces_default_model_pairs(): @@ -285,13 +285,13 @@ def test_build_rapid_mlx_payload_is_deterministic_no_thinking(): bench = load_bench_module() payload = bench.build_rapid_mlx_payload( - model="qwen3.5-9b", + model="qwen3.5-9b-4bit", messages=[{"role": "user", "content": "hi"}], max_tokens=32, stream=True, ) - assert payload["model"] == "qwen3.5-9b" + assert payload["model"] == "qwen3.5-9b-4bit" assert payload["temperature"] == 0 assert payload["enable_thinking"] is False assert payload["stream"] is True @@ -414,7 +414,7 @@ def test_render_markdown_includes_model_table_and_speedups(): }, "model_pairs": [ { - "rapid_mlx_model": "qwen3.5-9b", + "rapid_mlx_model": "qwen3.5-9b-4bit", "ollama_model": "qwen3.5:9b", "rapid-mlx": { "summary": { @@ -437,7 +437,7 @@ def test_render_markdown_includes_model_table_and_speedups(): markdown = bench.render_markdown(result) assert "# Rapid-MLX vs Ollama Benchmark" in markdown - assert "qwen3.5-9b vs qwen3.5:9b" in 
markdown + assert "qwen3.5-9b-4bit vs qwen3.5:9b" in markdown assert "| Decode tok/s | 120.0 | 40.0 | 3.00x |" in markdown assert "| TTFT | 100.0 ms | 250.0 ms | 2.50x |" in markdown assert "- Rapid-MLX: `rapid-mlx 0.2.0`" in markdown @@ -454,7 +454,7 @@ def test_render_markdown_surfaces_engine_errors(): "config": {"runs": 1, "concurrency": [1]}, "model_pairs": [ { - "rapid_mlx_model": "qwen3.5-9b", + "rapid_mlx_model": "qwen3.5-9b-4bit", "ollama_model": "qwen3.5:9b", "rapid-mlx": {"error": "boom"}, "ollama": {"summary": {"stream": {"decode_tok_s": 40.0}}}, @@ -474,7 +474,7 @@ def test_render_markdown_surfaces_workload_errors(): "config": {"runs": 1, "concurrency": [1]}, "model_pairs": [ { - "rapid_mlx_model": "qwen3.5-9b", + "rapid_mlx_model": "qwen3.5-9b-4bit", "ollama_model": "qwen3.5:9b", "rapid-mlx": { "errors": [ @@ -515,7 +515,7 @@ def test_render_markdown_tolerates_none_engine_payloads(): "config": {"runs": 1, "concurrency": [1]}, "model_pairs": [ { - "rapid_mlx_model": "qwen3.5-9b", + "rapid_mlx_model": "qwen3.5-9b-4bit", "ollama_model": "qwen3.5:9b", "rapid-mlx": None, "ollama": {"error": "ollama down"}, @@ -560,13 +560,13 @@ def test_build_rapid_mlx_command_includes_explicit_benchmark_settings(): bench = load_bench_module() cmd = bench.build_rapid_mlx_command( - "qwen3.5-9b", 9123, ["--prefill-step-size", "4096"] + "qwen3.5-9b-4bit", 9123, ["--prefill-step-size", "4096"] ) assert cmd == [ "rapid-mlx", "serve", - "qwen3.5-9b", + "qwen3.5-9b-4bit", "--host", "127.0.0.1", "--port", @@ -642,9 +642,9 @@ def test_build_engine_success_result_shape(): result = bench.build_engine_success_result( engine="rapid-mlx", - model="qwen3.5-9b", + model="qwen3.5-9b-4bit", port=9123, - command=["rapid-mlx", "serve", "qwen3.5-9b"], + command=["rapid-mlx", "serve", "qwen3.5-9b-4bit"], raw_runs={"stream": [{"ttft_ms": 100.0}]}, summary={"stream": {"ttft_ms": 100.0}}, errors=[], @@ -653,13 +653,13 @@ def test_build_engine_success_result_shape(): ) assert result["engine"] == 
"rapid-mlx" - assert result["model"] == "qwen3.5-9b" + assert result["model"] == "qwen3.5-9b-4bit" assert result["port"] == 9123 - assert result["command"] == ["rapid-mlx", "serve", "qwen3.5-9b"] + assert result["command"] == ["rapid-mlx", "serve", "qwen3.5-9b-4bit"] assert result["server"]["url"] == "http://127.0.0.1:9123" assert result["runtime"]["prepared"] is True assert result["metadata"]["engine"] == "rapid-mlx" - assert result["metadata"]["model"] == "qwen3.5-9b" + assert result["metadata"]["model"] == "qwen3.5-9b-4bit" assert result["raw_runs"]["stream"] == [{"ttft_ms": 100.0}] assert result["summary"]["stream"] == {"ttft_ms": 100.0} assert result["errors"] == [] diff --git a/tests/test_chat_logprobs_channel_routing.py b/tests/test_chat_logprobs_channel_routing.py index de24d761..a3b68f9d 100644 --- a/tests/test_chat_logprobs_channel_routing.py +++ b/tests/test_chat_logprobs_channel_routing.py @@ -11,7 +11,7 @@ back to the text-regex parser, which leaked analysis-channel content into ``message.content`` and dropped ``reasoning_content`` entirely. -Surfaced by the iter7 onboarding sweep on gpt-oss-20b: identical request +Surfaced by the iter7 onboarding sweep on gpt-oss-20b-mxfp4-q8: identical request with vs without ``logprobs:true`` produced different channel routing — same shape as #442 (PR #443) but on the logprobs codepath instead of truncated output. 
Fix: accumulate ``new_text`` by ``channel`` while iterating the diff --git a/tests/test_chat_route_tool_tag_leak.py b/tests/test_chat_route_tool_tag_leak.py index 14049964..e9f9e9e3 100644 --- a/tests/test_chat_route_tool_tag_leak.py +++ b/tests/test_chat_route_tool_tag_leak.py @@ -133,7 +133,7 @@ def test_no_tool_call_path_preserves_content(self): _assert_no_leak(content) def test_parser_finds_nothing_preserves_existing_cleaned_text(self): - # Regression for the v0.6.64 gpt-oss-20b empty-TextBlock bug: + # Regression for the v0.6.64 gpt-oss-20b-mxfp4-q8 empty-TextBlock bug: # ``engine.generate()`` runs ``clean_output_text`` on harmony # output, which strips channel markup and returns just the # final-channel content ("4"). The non-streaming route then diff --git a/tests/test_chat_streaming_spec.py b/tests/test_chat_streaming_spec.py index b600e3c1..37182f9f 100644 --- a/tests/test_chat_streaming_spec.py +++ b/tests/test_chat_streaming_spec.py @@ -4,7 +4,7 @@ Companion to ``tests/test_chat_streaming_guided.py``. PR #422 pinned these invariants on the guided helper (``stream_chat_completion_guided``) but the regular ``stream_chat_completion`` path retained two spec violations -that the 2026-05-20 ≥20B onboarding sweep caught on qwen3.5-35b +that the 2026-05-20 ≥20B onboarding sweep caught on qwen3.5-35b-8bit (see knowledge/guided_generation_gaps_2026-05-20.md, "Bug B"): 1. **``created`` drift** — content chunks share one timestamp but the @@ -130,7 +130,7 @@ def test_non_guided_streaming_pins_single_created_timestamp(monkeypatch): without ``created=...`` and inherited the default factory's fresh ``time.time()`` per construction. On slow MoE models the gap between first content chunk and finish chunk was 5-7s (Agent A7, - qwen3.5-35b, 2026-05-20 sweep). + qwen3.5-35b-8bit, 2026-05-20 sweep). 
Patches ``time.time`` to advance one second per call so the bug is deterministically observable in a unit test (real wall-clock under @@ -264,7 +264,7 @@ class _GapStreamParser(ToolParser): call (returns plain content), but the non-stream ``extract_tool_calls`` catches it via a fallback pattern. - Mirrors the gemma-4-26b case from the 2026-05-20 sweep where + Mirrors the gemma-4-26b-4bit case from the 2026-05-20 sweep where streaming dropped a tool call that the non-stream parser handled. """ diff --git a/tests/test_cli_argcomplete.py b/tests/test_cli_argcomplete.py index ac4f6be9..b034c1cd 100644 --- a/tests/test_cli_argcomplete.py +++ b/tests/test_cli_argcomplete.py @@ -10,7 +10,7 @@ 2. ``alias_completer`` returns aliases filtered by prefix (the actual contract argcomplete invokes per keystroke). 3. ``alias_csv_completer`` correctly carries the comma-separated - prefix forward so ``--models qwen3.5-4b,gem`` expands the + prefix forward so ``--models qwen3.5-4b-4bit,gem`` expands the trailing token only. """ @@ -69,10 +69,10 @@ def test_alias_completer_filters_by_prefix() -> None: # membership catches a silent regression where the QAT entries get # dropped from aliases.json without the test failing. 
qat_aliases = { - "gemma-4-12b-qat", + "gemma-4-12b-qat-4bit", "gemma-4-12b-qat-8bit", - "gemma-4-26b-qat", - "gemma-4-31b-qat", + "gemma-4-26b-qat-4bit", + "gemma-4-31b-qat-4bit", "gemma-4-31b-qat-8bit", } missing = qat_aliases - set(result) @@ -120,13 +120,13 @@ def test_alias_csv_completer_first_token() -> None: def test_alias_csv_completer_appends_to_existing_csv() -> None: - """``--models qwen3.5-4b,gem`` should expand only the + """``--models qwen3.5-4b-4bit,gem`` should expand only the trailing token but emit the full re-assembled value so the shell - inserts ``qwen3.5-4b,gemma-4-12b`` rather than dropping the head.""" - result = alias_csv_completer("qwen3.5-4b,gemma-4-") - assert all(m.startswith("qwen3.5-4b,gemma-4-") for m in result), ( + inserts ``qwen3.5-4b-4bit,gemma-4-12b-4bit`` rather than dropping the head.""" + result = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-") + assert all(m.startswith("qwen3.5-4b-4bit,gemma-4-") for m in result), ( f"csv completer dropped the head before the comma: " - f"{[m for m in result if not m.startswith('qwen3.5-4b,')]}" + f"{[m for m in result if not m.startswith('qwen3.5-4b-4bit,')]}" ) assert len(result) >= 5, "should match at least 5 gemma-4-* tokens" @@ -135,8 +135,8 @@ def test_alias_csv_completer_multiple_commas() -> None: """``--models a,b,c`` only completes ``c``; ``a,b,`` is carried through unchanged. 
Lock this in because rpartition vs partition is an easy-to-flip bug.""" - result = alias_csv_completer("qwen3.5-4b,gemma-4-12b,qwen3.6-") - assert all(m.startswith("qwen3.5-4b,gemma-4-12b,qwen3.6-") for m in result), ( + result = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") + assert all(m.startswith("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") for m in result), ( "csv completer must preserve all prior csv tokens" ) @@ -156,8 +156,8 @@ def test_alias_csv_completer_handles_whitespace_after_comma() -> None: must match that contract so users who naturally type the human-friendly ``a, b, c`` shape get suggestions instead of an empty list.""" - spaced = alias_csv_completer("qwen3.5-4b, gemma-4-") - tight = alias_csv_completer("qwen3.5-4b,gemma-4-") + spaced = alias_csv_completer("qwen3.5-4b-4bit, gemma-4-") + tight = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-") assert len(spaced) == len(tight), ( "csv completer must produce the same number of matches whether " "the user typed `a,b` or `a, b`" diff --git a/tests/test_cli_chat.py b/tests/test_cli_chat.py index 3aa4e781..3718449e 100644 --- a/tests/test_cli_chat.py +++ b/tests/test_cli_chat.py @@ -107,8 +107,8 @@ def test_chat_no_model_defaults_to_qwen35_4b(): # signals the default plumbed through. The canonical alias is the one # we documented as the default; confirm via the round-trip name. 
assert ( - args.model == "qwen3.5-4b" - or getattr(args, "_original_alias", None) == "qwen3.5-4b" + args.model == "qwen3.5-4b-4bit" + or getattr(args, "_original_alias", None) == "qwen3.5-4b-4bit" ) @@ -116,15 +116,15 @@ def test_chat_with_alias_overrides_default(): """`rapid-mlx chat ` uses the user-supplied alias, not the default.""" captured: list = [] with ( - patch.object(sys, "argv", ["rapid-mlx", "chat", "smollm3-3b"]), + patch.object(sys, "argv", ["rapid-mlx", "chat", "smollm3-3b-4bit"]), patch.object(cli, "chat_command", side_effect=captured.append), ): cli.main() assert len(captured) == 1 args = captured[0] assert ( - args.model == "smollm3-3b" - or getattr(args, "_original_alias", None) == "smollm3-3b" + args.model == "smollm3-3b-4bit" + or getattr(args, "_original_alias", None) == "smollm3-3b-4bit" ) @@ -209,7 +209,7 @@ def test_chat_command_repl_multi_turn(monkeypatch, capsys): ns.temperature = 0.0 ns.ready_timeout = 5 ns.response_timeout = 5 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" cli.chat_command(ns) @@ -236,7 +236,7 @@ def test_chat_command_system_prompt_prepended(monkeypatch): ns.temperature = 0.0 ns.ready_timeout = 5 ns.response_timeout = 5 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" cli.chat_command(ns) assert payloads[0]["messages"][0] == {"role": "system", "content": "be terse"} assert payloads[0]["messages"][1] == {"role": "user", "content": "q1"} @@ -246,7 +246,7 @@ def test_chat_command_default_thinking_off_sends_enable_thinking_false(monkeypat """Chat REPL defaults to thinking OFF. Reasoning models like Qwen3.5 otherwise leak raw chain-of-thought into - the user-visible REPL output, and on the default qwen3.5-4b model + the user-visible REPL output, and on the default qwen3.5-4b-4bit model degenerate into infinite repetition until max-tokens — producing zero usable output for a brand-new user. Pinning the default here so a refactor doesn't silently restore the broken behavior shipped in 0.6.26. 
@@ -264,7 +264,7 @@ def test_chat_command_default_thinking_off_sends_enable_thinking_false(monkeypat ns.temperature = 0.0 ns.ready_timeout = 5 ns.response_timeout = 5 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" cli.chat_command(ns) assert payloads[0].get("enable_thinking") is False # The unsupported nested form must NOT be present. @@ -288,7 +288,7 @@ def test_chat_command_explicit_think_omits_enable_thinking_field(monkeypatch): ns.temperature = 0.0 ns.ready_timeout = 5 ns.response_timeout = 5 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" cli.chat_command(ns) assert "enable_thinking" not in payloads[0] @@ -330,7 +330,7 @@ def test_chat_command_survives_connection_failure(monkeypatch, capsys): ns.temperature = 0.0 ns.ready_timeout = 1 ns.response_timeout = 2 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" # Should not raise — REPL prints "Request failed" and continues to "exit". cli.chat_command(ns) captured = capsys.readouterr() @@ -387,7 +387,7 @@ def _capture(self): ns.temperature = 0.0 ns.ready_timeout = 5 ns.response_timeout = 5 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" cli.chat_command(ns) _ErrHandler.do_POST = orig # type: ignore[assignment] @@ -412,7 +412,7 @@ def _ns_for_chat(port: int, **overrides) -> object: ns.temperature = 0.0 ns.ready_timeout = 5 ns.response_timeout = 5 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" for k, v in overrides.items(): setattr(ns, k, v) return ns @@ -756,7 +756,7 @@ def test_chat_command_save_refuses_on_empty_conversation(monkeypatch, tmp_path, def test_stream_chat_response_aborts_on_no_whitespace_repetition(monkeypatch): - """The new char-level guard must fire on the qwen3.5-4b regression + """The new char-level guard must fire on the qwen3.5-4b-4bit regression where the model emits ``BarleyBarleyBarley...`` with NO whitespace separator. The whitespace-token guard cannot catch this — a single chunk of 6000 chars splits to one token whose count is 1. 
@@ -1105,7 +1105,7 @@ def _fake_wait(base_url, proc, timeout_s): port = fake_port inputs = iter(["first turn", "/model bogus", "second turn", "exit"]) monkeypatch.setattr("builtins.input", lambda _p="": next(inputs)) - ns = _ns_for_chat(fake_port, model="qwen3.5-4b") + ns = _ns_for_chat(fake_port, model="qwen3.5-4b-4bit") ns.base_url = None ns.port = None cli.chat_command(ns) @@ -1142,7 +1142,7 @@ def test_chat_command_slash_command_dispatch_uses_exact_match( inputs = iter( [ f"/savefoo {target}", - "/modelfoo qwen3.5-4b", + "/modelfoo qwen3.5-4b-4bit", "exit", ] ) @@ -1221,12 +1221,12 @@ def test_stream_chat_response_repetition_truncates_at_cutoff_in_one_chunk( def test_chat_think_bumps_max_tokens_default_to_4096(): """``--think`` with no explicit ``--max-tokens`` raises the default from 2048 to 4096 so reasoning + final answer fit a small-model - budget. Round-1 finding: ``chat qwen3.5-4b --think`` consumed the + budget. Round-1 finding: ``chat qwen3.5-4b-4bit --think`` consumed the full 2048 budget with reasoning alone and emitted an empty answer with ``finish_reason='length'``.""" captured: list = [] with ( - patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b", "--think"]), + patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b-4bit", "--think"]), patch.object(cli, "chat_command", side_effect=captured.append), ): cli.main() @@ -1392,7 +1392,7 @@ def test_chat_port_unbound_exits_with_friendly_error(capsys, monkeypatch): ns.temperature = 0.0 ns.ready_timeout = 1 ns.response_timeout = 1 - ns.model = "qwen3.5-4b" + ns.model = "qwen3.5-4b-4bit" with pytest.raises(SystemExit) as exc: cli.chat_command(ns) assert exc.value.code == 1 @@ -1411,7 +1411,7 @@ def test_run_is_alias_for_chat(monkeypatch): same args as ``rapid-mlx chat ``.""" captured: list = [] with ( - patch.object(sys, "argv", ["rapid-mlx", "run", "qwen3.5-4b"]), + patch.object(sys, "argv", ["rapid-mlx", "run", "qwen3.5-4b-4bit"]), patch.object(cli, "chat_command", 
side_effect=captured.append), ): cli.main() @@ -1432,7 +1432,7 @@ def test_run_alias_accepts_chat_flags(): patch.object( sys, "argv", - ["rapid-mlx", "run", "qwen3.5-4b", "--think", "--max-tokens", "1024"], + ["rapid-mlx", "run", "qwen3.5-4b-4bit", "--think", "--max-tokens", "1024"], ), patch.object(cli, "chat_command", side_effect=captured.append), ): @@ -1534,7 +1534,7 @@ def test_serve_accepts_no_think_as_alias_for_no_thinking(): ``no_thinking=True`` destination as ``serve --no-thinking``.""" captured: list = [] with ( - patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b", "--no-think"]), + patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-think"]), patch.object(cli, "serve_command", side_effect=captured.append), ): cli.main() @@ -1944,7 +1944,7 @@ def getsockname(self): monkeypatch.setattr("subprocess.Popen", _FakePopen) log_path = tmp_path / "fake.log" - proc, base_url = cli._spawn_chat_server("qwen3.5-4b", str(log_path)) + proc, base_url = cli._spawn_chat_server("qwen3.5-4b-4bit", str(log_path)) assert captured["env"] is not None assert captured["env"].get("RAPID_MLX_CHAT_SPAWN") == "1" @@ -2235,7 +2235,7 @@ def test_chat_allow_abbrev_disabled_rejects_ambiguous_no_thi(capsys): With ``allow_abbrev=False`` argparse must reject the ambiguous form instead of silently resolving it to whichever flag was added first.""" with ( - patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b", "--no-thi"]), + patch.object(sys, "argv", ["rapid-mlx", "chat", "qwen3.5-4b-4bit", "--no-thi"]), pytest.raises(SystemExit), ): cli.main() @@ -2247,7 +2247,7 @@ def test_serve_allow_abbrev_disabled_rejects_ambiguous_no_thi(capsys): """Same as the chat case — ``serve`` also got the hidden cross-alias and the same ambiguity must be reported, not silently resolved.""" with ( - patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b", "--no-thi"]), + patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-thi"]), 
pytest.raises(SystemExit), ): cli.main() diff --git a/tests/test_cli_info.py b/tests/test_cli_info.py index 13e0e359..aff94c38 100644 --- a/tests/test_cli_info.py +++ b/tests/test_cli_info.py @@ -28,11 +28,11 @@ def _run_info(model_name: str) -> str: def test_info_resolves_alias_to_hf_path() -> None: - """Typing the alias ``qwen3.5-4b`` must show the resolution arrow - ``qwen3.5-4b → mlx-community/Qwen3.5-4B-MLX-4bit``. A refactor that + """Typing the alias ``qwen3.5-4b-4bit`` must show the resolution arrow + ``qwen3.5-4b-4bit → mlx-community/Qwen3.5-4B-MLX-4bit``. A refactor that bypasses ``resolve_model`` would drop the alias signal.""" - out = _run_info("qwen3.5-4b") - assert "qwen3.5-4b" in out + out = _run_info("qwen3.5-4b-4bit") + assert "qwen3.5-4b-4bit" in out assert "→" in out, f"expected alias-resolution arrow in output, got:\n{out}" assert "mlx-community/Qwen3.5-4B-MLX-4bit" in out, ( f"expected resolved HF path in output, got:\n{out}" diff --git a/tests/test_cli_models.py b/tests/test_cli_models.py index b9e2c1c9..6bd4c2c2 100644 --- a/tests/test_cli_models.py +++ b/tests/test_cli_models.py @@ -41,20 +41,20 @@ def test_models_command_shows_capability_columns(): def test_models_command_renders_hybrid_marker_for_qwen35(): - """Hybrid models (e.g. qwen3.5-4b) must show '✗ hybrid' + tier 'n/a'. + """Hybrid models (e.g. qwen3.5-4b-4bit) must show '✗ hybrid' + tier 'n/a'. The point of the column is to spare users an `info` round-trip when deciding whether spec-decode/suffix-decode will help. Trust the gate. 
""" out = _capture_models_output() profiles = list_profiles() - qwen35_4b = profiles.get("qwen3.5-4b") - assert qwen35_4b is not None, "qwen3.5-4b alias missing — fixture drift" - assert qwen35_4b.is_hybrid, "qwen3.5-4b should still be is_hybrid=True" + qwen35_4b = profiles.get("qwen3.5-4b-4bit") + assert qwen35_4b is not None, "qwen3.5-4b-4bit alias missing — fixture drift" + assert qwen35_4b.is_hybrid, "qwen3.5-4b-4bit should still be is_hybrid=True" - # Find the qwen3.5-4b row and confirm the hybrid markers. - matches = [line for line in out.splitlines() if "qwen3.5-4b " in line] - assert matches, "no row found for qwen3.5-4b" + # Find the qwen3.5-4b-4bit row and confirm the hybrid markers. + matches = [line for line in out.splitlines() if "qwen3.5-4b-4bit " in line] + assert matches, "no row found for qwen3.5-4b-4bit" row = matches[0] assert "✗ hybrid" in row, f"expected '✗ hybrid' marker in row: {row!r}" assert "n/a" in row, f"expected suffix tier 'n/a' in row: {row!r}" @@ -67,15 +67,15 @@ def test_models_command_renders_parser_for_hermes3_8b(): alias registry, (2) the suffix-tier cell shows the tier currently recorded in ``aliases.json``. Reading the expected tier from the registry (not hardcoding it) means a future bench re-sweep that - reclassifies hermes3-8b doesn't break this test, while a *display* + reclassifies hermes3-8b-4bit doesn't break this test, while a *display* regression (tier dropped from the row entirely) still does. 
""" out = _capture_models_output() - matches = [line for line in out.splitlines() if "hermes3-8b " in line] - assert matches, "no row found for hermes3-8b" + matches = [line for line in out.splitlines() if "hermes3-8b-4bit " in line] + assert matches, "no row found for hermes3-8b-4bit" row = matches[0] - profile = list_profiles().get("hermes3-8b") - assert profile is not None, "hermes3-8b alias missing — fixture drift" + profile = list_profiles().get("hermes3-8b-4bit") + assert profile is not None, "hermes3-8b-4bit alias missing — fixture drift" assert (profile.tool_call_parser or "") in row, ( f"expected tool parser {profile.tool_call_parser!r} in row: {row!r}" ) @@ -194,7 +194,7 @@ def test_models_default_view_unchanged(monkeypatch, capsys): def test_cached_view_renders_alias_for_known_repo(tmp_path, monkeypatch, capsys): """A cached HF repo whose path matches an alias should render under - the alias name (e.g. ``qwen3.5-4b``), not the raw HF path.""" + the alias name (e.g. ``qwen3.5-4b-4bit``), not the raw HF path.""" from vllm_mlx.model_aliases import list_profiles profiles = list_profiles() diff --git a/tests/test_dflash_eligibility.py b/tests/test_dflash_eligibility.py index 0d1458bc..7a4abf29 100644 --- a/tests/test_dflash_eligibility.py +++ b/tests/test_dflash_eligibility.py @@ -103,7 +103,7 @@ def test_check_rejects_4bit_main_model() -> None: dflash_draft_model="z-lab/Qwen3.5-27B-DFlash", ) with pytest.raises(DFlashUnavailable, match="4-bit"): - check(p, alias="qwen3.5-27b") + check(p, alias="qwen3.5-27b-4bit") def test_check_message_lists_eligible_aliases() -> None: @@ -133,7 +133,7 @@ def test_report_collects_all_failures() -> None: is_moe=True, # supports_dflash=False (default) → 3 reasons total ) - r = report(bad, alias="qwen3.6-35b") + r = report(bad, alias="qwen3.6-35b-4bit") assert len(r.reasons) == 3, f"expected 3 reasons, got: {r.reasons}" joined = " ".join(r.reasons) assert "MoE" in joined @@ -168,20 +168,20 @@ def 
test_qwen3_5_27b_8bit_alias_passes_check() -> None: def test_default_qwen3_5_27b_alias_fails_check_with_4bit_reason() -> None: - """The default ``qwen3.5-27b`` alias points at the 4-bit variant — + """The default ``qwen3.5-27b-4bit`` alias points at the 4-bit variant — eligibility must reject it with a clear 4-bit hint, not the generic 'not enabled' message (since supports_dflash=False). Confirms users get the right pointer when they pick the wrong quantization.""" from vllm_mlx.model_aliases import resolve_profile - profile = resolve_profile("qwen3.5-27b") + profile = resolve_profile("qwen3.5-27b-4bit") assert profile is not None # Match-string: capture both reasons (4-bit + not-opted-in). The # bare ``raises`` would pass even if the gate silently degraded to # the generic message, defeating the point of this regression test. with pytest.raises(DFlashUnavailable) as excinfo: - check(profile, alias="qwen3.5-27b") + check(profile, alias="qwen3.5-27b-4bit") msg = str(excinfo.value) assert "4-bit" in msg, ( f"4-bit hint missing from DFlashUnavailable message; got:\n{msg}" diff --git a/tests/test_dflash_integration.py b/tests/test_dflash_integration.py index c9cfacd1..f5b5ae06 100644 --- a/tests/test_dflash_integration.py +++ b/tests/test_dflash_integration.py @@ -95,11 +95,11 @@ def test_info_dflash_block_skipped_for_unknown_alias(capsys) -> None: def test_info_dflash_marks_4bit_alias_ineligible(capsys) -> None: - """The default ``qwen3.5-27b`` alias points at the 4-bit variant and + """The default ``qwen3.5-27b-4bit`` alias points at the 4-bit variant and must surface as ineligible with the right gate failing.""" from vllm_mlx.cli import info_command - args = type("Args", (), {"model": "qwen3.5-27b"})() + args = type("Args", (), {"model": "qwen3.5-27b-4bit"})() info_command(args) captured = capsys.readouterr() assert "DFlash eligibility" in captured.out @@ -153,10 +153,10 @@ def test_models_listing_renders_dflash_column(capsys) -> None: # A non-DFlash alias renders — 
in the DFlash column. ineligible_row = next( - (line for line in lines if "qwen3.5-4b " in line), + (line for line in lines if "qwen3.5-4b-4bit " in line), None, ) - assert ineligible_row is not None, "qwen3.5-4b row missing" + assert ineligible_row is not None, "qwen3.5-4b-4bit row missing" assert "—" in ineligible_row, f"DFlash column should be —: {ineligible_row!r}" diff --git a/tests/test_doctor_baseline.py b/tests/test_doctor_baseline.py index 777e23f1..4fa35e13 100644 --- a/tests/test_doctor_baseline.py +++ b/tests/test_doctor_baseline.py @@ -195,7 +195,7 @@ def test_distinct_inputs_produce_distinct_slugs(self): def test_simple_aliases_unchanged(self): # Common case: an alias with no special chars stays human-readable. - assert safe_model_slug("qwen3.5-4b") == "qwen3.5-4b" + assert safe_model_slug("qwen3.5-4b-4bit") == "qwen3.5-4b-4bit" def test_hf_path_round_trips_via_unquote(self): import urllib.parse diff --git a/tests/test_doctor_runner.py b/tests/test_doctor_runner.py index 83f747fc..69d92edc 100644 --- a/tests/test_doctor_runner.py +++ b/tests/test_doctor_runner.py @@ -221,7 +221,7 @@ def crashing(): class TestDefaultBootTimeout: """Single generous default beats heuristics that miss models like - 'qwen3-coder' (80B, no param-count hint in alias).""" + 'qwen3-coder-4bit' (80B, no param-count hint in alias).""" def test_default_is_generous(self): from vllm_mlx.doctor.cli import DEFAULT_BOOT_TIMEOUT_S diff --git a/tests/test_engine_router_non_stream.py b/tests/test_engine_router_non_stream.py index b6af4af0..b7566141 100644 --- a/tests/test_engine_router_non_stream.py +++ b/tests/test_engine_router_non_stream.py @@ -55,7 +55,7 @@ def decode(self, ids): return "".join(self._id_to_text.get(i, f"") for i in ids) -# Harmony token IDs from openai/gpt-oss-20b (same constants the real +# Harmony token IDs from openai/gpt-oss-20b (same constants the real router reads). Keep in sync with tests/test_output_router.py.
_HARMONY_VOCAB = { "<|return|>": 200002, diff --git a/tests/test_finalize_harmony_raw_text.py b/tests/test_finalize_harmony_raw_text.py index f719c50d..5d589828 100644 --- a/tests/test_finalize_harmony_raw_text.py +++ b/tests/test_finalize_harmony_raw_text.py @@ -32,7 +32,7 @@ from vllm_mlx.reasoning.qwen3_parser import Qwen3ReasoningParser from vllm_mlx.service.helpers import _finalize_content_and_reasoning -# A realistic gpt-oss-20b harmony non-stream response: analysis channel +# A realistic gpt-oss-20b-mxfp4-q8 harmony non-stream response: analysis channel # (CoT) followed by final channel (answer), terminated with <|return|>. _HARMONY_RAW = ( "<|channel|>analysis<|message|>" diff --git a/tests/test_harmony_parsers.py b/tests/test_harmony_parsers.py index 77b6cda4..3d06ba58 100644 --- a/tests/test_harmony_parsers.py +++ b/tests/test_harmony_parsers.py @@ -263,8 +263,8 @@ def test_extract_analysis_and_final(self, parser): def test_extract_final_terminated_by_end_token(self, parser): """Final channel terminated by ``<|end|>`` (not ``<|return|>``). - Regression for the v0.6.64 gpt-oss-20b empty-TextBlock flake: - gpt-oss-20b emits ``<|end|>`` after the final channel for a + Regression for the v0.6.64 gpt-oss-20b-mxfp4-q8 empty-TextBlock flake: + gpt-oss-20b-mxfp4-q8 emits ``<|end|>`` after the final channel for a sizeable fraction of non-streaming responses, and the prior ``<|return|>``-only regex silently dropped that content. The streaming path already accepts both terminators; the @@ -306,7 +306,7 @@ def test_literal_end_token_in_content_end_only_terminator(self, parser): DeepSeek round-2 follow-up: the ``<|return|>``-first preference only covers the case where the real terminator is ``<|return|>``. When the model emits ``<|end|>`` as the - message terminator (gpt-oss-20b's common case) AND the + message terminator (gpt-oss-20b-mxfp4-q8's common case) AND the answer contains a literal ``<|end|>``, the non-greedy fallback would still truncate. 
Greedy ``(.*)`` in ``_FINAL_PATTERN_END`` now consumes up to the LAST @@ -696,7 +696,7 @@ def test_skips_non_function_types(self): class TestHarmonyEnginePipeline: """End-to-end through the engine layer: clean_output_text → tool parser. - Reproduces the v0.6.64 bug where gpt-oss-20b's commentary-only tool + Reproduces the v0.6.64 bug where gpt-oss-20b-mxfp4-q8's commentary-only tool calls came back as plain text instead of structured ``tool_calls``. Root cause: ``_clean_gpt_oss_output`` in ``api/utils.py`` only matched a ``<|channel|>final<|message|>`` block; commentary-only output fell @@ -712,7 +712,7 @@ class TestHarmonyEnginePipeline: """ def test_commentary_only_output_extracts_tool_call(self): - """Real gpt-oss-20b output for a single tool call. + """Real gpt-oss-20b-mxfp4-q8 output for a single tool call. Captured verbatim from ``mlx-community/gpt-oss-20b-MXFP4-Q8`` via ``/v1/chat/completions`` with ``tools=[get_weather]`` (2026-05-22). @@ -826,7 +826,7 @@ def test_tool_parser_unterminated_call_is_now_parsed(self): """Commentary block without trailing ``<|call|>`` IS parsed now. Earlier behavior treated a missing ``<|call|>`` terminator as - "incomplete". Empirically (gpt-oss-20b via /v1/chat/completions, + "incomplete". Empirically (gpt-oss-20b-mxfp4-q8 via /v1/chat/completions, 2026-05-22) ``<|call|>`` is part of the harmony stop-token set, so the engine consumes it and ``output_text`` ends with the JSON args alone. 
Refusing to parse meant zero tool calls diff --git a/tests/test_memory_capacity_check.py b/tests/test_memory_capacity_check.py index ecbcdc40..54b67a7a 100644 --- a/tests/test_memory_capacity_check.py +++ b/tests/test_memory_capacity_check.py @@ -62,7 +62,7 @@ def test_hard_warning_fires_on_24gb_mac_with_14gb_model_realistic_load( actually hit this gets the strongest message, not a soft hint.""" _patch_size_bytes(monkeypatch, size_gb=14.0) with patch.dict("sys.modules", {"psutil": _fake_psutil(24.0, used_gb=6.0)}): - _check_memory_capacity("/local/path/to/gemma-4-26b") + _check_memory_capacity("/local/path/to/gemma-4-26b-4bit") out = capsys.readouterr().out assert "kernel panic" in out, ( f"the very case that filed the issue must hit the hard tier: {out!r}" @@ -77,7 +77,7 @@ def test_hard_warning_still_fires_at_fresh_boot_on_24gb_mac(monkeypatch, capsys) """ _patch_size_bytes(monkeypatch, size_gb=14.0) with patch.dict("sys.modules", {"psutil": _fake_psutil(24.0, used_gb=0.0)}): - _check_memory_capacity("/local/path/to/gemma-4-26b") + _check_memory_capacity("/local/path/to/gemma-4-26b-4bit") out = capsys.readouterr().out assert "Memory pressure" in out, f"expected warning, got: {out!r}" # 0 + 21 = 21 / 24 = 87.5% → HARD tier @@ -208,7 +208,7 @@ def test_warning_includes_actionable_recommendations(monkeypatch, capsys): Pins the actionability of the message.""" _patch_size_bytes(monkeypatch, size_gb=14.0) with patch.dict("sys.modules", {"psutil": _fake_psutil(24.0, used_gb=0.0)}): - _check_memory_capacity("/local/path/to/gemma-4-26b") + _check_memory_capacity("/local/path/to/gemma-4-26b-4bit") out = capsys.readouterr().out assert "--gpu-memory-utilization" in out diff --git a/tests/test_model_aliases.py b/tests/test_model_aliases.py index 75ded31d..1b32e6c6 100644 --- a/tests/test_model_aliases.py +++ b/tests/test_model_aliases.py @@ -9,8 +9,8 @@ def test_known_alias_resolves(): - assert resolve_model("qwen3.5-9b") == "mlx-community/Qwen3.5-9B-4bit" - assert 
resolve_model("llama3-3b") == "mlx-community/Llama-3.2-3B-Instruct-4bit" + assert resolve_model("qwen3.5-9b-4bit") == "mlx-community/Qwen3.5-9B-4bit" + assert resolve_model("llama3-3b-4bit") == "mlx-community/Llama-3.2-3B-Instruct-4bit" def test_full_path_passes_through(): @@ -24,12 +24,12 @@ def test_unknown_name_passes_through(): def test_local_path_takes_priority_over_alias(tmp_path): """A local directory matching an alias name should win.""" - local_dir = tmp_path / "qwen3.5-9b" + local_dir = tmp_path / "qwen3.5-9b-4bit" local_dir.mkdir() old_cwd = os.getcwd() try: os.chdir(tmp_path) - assert resolve_model("qwen3.5-9b") == "qwen3.5-9b" + assert resolve_model("qwen3.5-9b-4bit") == "qwen3.5-9b-4bit" finally: os.chdir(old_cwd) @@ -37,14 +37,14 @@ def test_local_path_takes_priority_over_alias(tmp_path): def test_list_aliases_nonempty(): aliases = list_aliases() assert len(aliases) >= 15 - assert "qwen3.5-9b" in aliases + assert "qwen3.5-9b-4bit" in aliases def test_hermes_alias_not_llama(): """Hermes-3 should be under its own name, not llama3-8b.""" aliases = list_aliases() assert "llama3-8b" not in aliases - assert "hermes3-8b" in aliases + assert "hermes3-8b-4bit" in aliases def test_suggest_similar_stays_within_family(): @@ -61,11 +61,11 @@ def test_suggest_similar_stays_within_family(): def test_suggest_similar_correctly_typo_for_close_size(): - """Typing ``qwen3.5-30b`` (typo for ``qwen3.5-35b``) should rank the + """Typing ``qwen3.5-30b`` (typo for ``qwen3.5-35b-8bit``) should rank the correct alias first.""" suggestions = suggest_similar("qwen3.5-30b") assert suggestions, "expected at least one suggestion" - assert suggestions[0] == "qwen3.5-35b", suggestions + assert suggestions[0] == "qwen3.5-35b-8bit", suggestions def test_suggest_similar_empty_for_nonsense(): @@ -89,10 +89,10 @@ def test_suggest_similar_one_letter_no_match(): def test_suggest_similar_matches_partial_family_token(): """A bare family name like ``hermes`` should suggest aliases that - share 
that prefix (``hermes3-8b``), not return [] just because there's + share that prefix (``hermes3-8b-4bit``), not return [] just because there's no exact ``hermes-foo`` separator pattern.""" suggestions = suggest_similar("hermes") - assert "hermes3-8b" in suggestions, suggestions + assert "hermes3-8b-4bit" in suggestions, suggestions # --- Letter-only fallback (separator-mismatched names) ---------------- @@ -101,8 +101,8 @@ def test_suggest_similar_matches_partial_family_token(): def test_suggest_similar_letter_fallback_handles_separator_mismatch(): """Real bug from the field: ``rapid-mlx chat gemma4-27b`` returned zero suggestions because the strict family parser sees ``gemma4`` and - no alias starts with ``gemma4`` (we have ``gemma-4-26b`` and - ``gemma3-27b``). The letter-only fallback must catch this — extract + no alias starts with ``gemma4`` (we have ``gemma-4-26b-4bit`` and + ``gemma3-27b-4bit``). The letter-only fallback must catch this — extract ``gemma`` and match the whole gemma family.""" suggestions = suggest_similar("gemma4-27b") assert suggestions, "letter-only fallback must produce gemma family suggestions" @@ -111,34 +111,17 @@ def test_suggest_similar_letter_fallback_handles_separator_mismatch(): assert s.startswith("gemma"), s -def test_suggest_similar_short_alias_does_not_shadow_sized_variants(): - """When a short alias (e.g. ``gemma4``) exists AND the user types - a size-qualified name (``gemma4-26b``), the strict-family pass must - NOT short-circuit to just ``[gemma4]``. Otherwise users hunting - for the 26B variant get bait-and-switched onto the 12B default. - Fall through to the letter-only pass so size-specific aliases - surface.""" - suggestions = suggest_similar("gemma4-26b") - assert suggestions, "size-qualified typo must produce suggestions" - # The 26B variant must be among the suggestions — that's what the - # user actually wanted. 
- assert "gemma-4-26b" in suggestions, suggestions - # Sanity: the bare ``gemma4`` short alias should NOT be the only - # suggestion (the whole point of this regression test). - assert suggestions != ["gemma4"], suggestions - - def test_suggest_similar_letter_fallback_collapsed_separator(): """User collapses our hyphen — ``mistral24b`` should still suggest - ``mistral-24b``, not return [].""" - assert "mistral-24b" in suggest_similar("mistral24b") + ``mistral-24b-4bit``, not return [].""" + assert "mistral-24b-4bit" in suggest_similar("mistral24b") def test_suggest_similar_letter_fallback_skips_legit_looking_names(): """When the input has no size/quant suffix tokens (i.e., looks structurally like a legit single-segment HF repo ID), suggest_similar - must return [] — not bait-and-switch ``gpt2`` to ``gpt-oss-20b`` or - ``qwen-coder`` to ``qwen3-coder``. The CLI layer's POPULAR_ALIASES + must return [] — not bait-and-switch ``gpt2`` to ``gpt-oss-20b-mxfp4-q8`` or + ``qwen-coder`` to ``qwen3-coder-4bit``. The CLI layer's POPULAR_ALIASES fallback handles those cases at presentation time.""" # ``gpt2`` has been pinned by test_suggest_similar_lets_legitimate_hf_ids_through; # this case adds the partial-family equivalent. 
@@ -152,7 +135,7 @@ def test_suggest_similar_letter_fallback_skips_legit_looking_names(): ("Gemma4-27b", "gemma"), # lowercased ("gemma_4-27b", "gemma"), # stops at non-letter ("mistral24b", "mistral"), - ("qwen3.5-4b", "qwen"), # stops at first digit + ("qwen3.5-4b-4bit", "qwen"), # stops at first digit ("123abc", ""), # leading non-letter → empty ("", ""), # empty input ("ab", "ab"), # short prefix (caller enforces ≥3 minimum) diff --git a/tests/test_model_profiles_ssot.py b/tests/test_model_profiles_ssot.py index 8866907f..49169c92 100644 --- a/tests/test_model_profiles_ssot.py +++ b/tests/test_model_profiles_ssot.py @@ -83,12 +83,12 @@ def test_orphan_aliases_now_covered() -> None: """Pin the 6 specific aliases that were orphans before this PR to catch a regression where someone deletes their profile.""" for orphan in ( - "bonsai-1.7b", - "bonsai-4b", - "bonsai-8b", - "ministral-3b", - "nemotron-30b", - "nemotron-nano", + "bonsai-1.7b-unpacked", + "bonsai-4b-unpacked", + "bonsai-8b-unpacked", + "ministral-3b-4bit", + "nemotron-30b-4bit", + "nemotron-30b-4bit", ): profile = resolve_profile(orphan) assert profile is not None, f"{orphan} regressed to orphan" @@ -108,14 +108,14 @@ def test_list_aliases_returns_legacy_string_view() -> None: aliases = list_aliases() assert len(aliases) >= 65 assert all(isinstance(p, str) for p in aliases.values()) - assert aliases["qwen3.5-4b"] == "mlx-community/Qwen3.5-4B-MLX-4bit" + assert aliases["qwen3.5-4b-4bit"] == "mlx-community/Qwen3.5-4B-MLX-4bit" assert aliases["qwen3-0.6b-8bit"] == "mlx-community/Qwen3-0.6B-8bit" def test_list_profiles_returns_rich_dataclass_view() -> None: profiles = list_profiles() assert len(profiles) >= 65 - p = profiles["qwen3.5-4b"] + p = profiles["qwen3.5-4b-4bit"] assert isinstance(p, AliasProfile) assert p.hf_path == "mlx-community/Qwen3.5-4B-MLX-4bit" assert p.tool_call_parser == "hermes" @@ -138,7 +138,7 @@ def test_list_profiles_returns_rich_dataclass_view() -> None: def 
test_resolve_model_unchanged_for_callers() -> None: """Existing callers of ``resolve_model`` must keep getting a string.""" - assert resolve_model("qwen3.5-4b") == "mlx-community/Qwen3.5-4B-MLX-4bit" + assert resolve_model("qwen3.5-4b-4bit") == "mlx-community/Qwen3.5-4B-MLX-4bit" assert ( resolve_model("mlx-community/Qwen3.5-4B-MLX-4bit") == "mlx-community/Qwen3.5-4B-MLX-4bit" @@ -150,7 +150,7 @@ def test_resolve_model_unchanged_for_callers() -> None: def test_resolve_profile_by_alias_name() -> None: - p = resolve_profile("qwen3.5-4b") + p = resolve_profile("qwen3.5-4b-4bit") assert p is not None assert p.tool_call_parser == "hermes" @@ -173,11 +173,11 @@ def test_resolve_profile_returns_none_for_unknown() -> None: def test_detect_model_config_prefers_alias_profile_over_regex() -> None: - """``qwen3.5-4b`` (alias) and the matching qwen3.5 regex pattern + """``qwen3.5-4b-4bit`` (alias) and the matching qwen3.5 regex pattern happen to agree today, but the alias path is the one we contract on — pin a known field that exists on the alias profile so a future regex change can't silently take over.""" - cfg = detect_model_config("qwen3.5-4b") + cfg = detect_model_config("qwen3.5-4b-4bit") assert cfg is not None assert cfg.tool_call_parser == "hermes" assert cfg.is_hybrid is True @@ -212,7 +212,7 @@ def test_detect_model_config_alias_wins_over_regex_when_they_disagree() -> None: import vllm_mlx.model_aliases as ma from vllm_mlx.model_aliases import AliasProfile - real = ma._aliases["qwen3.5-4b"] + real = ma._aliases["qwen3.5-4b-4bit"] forged = AliasProfile( hf_path=real.hf_path, tool_call_parser="ALIAS_WINS", # the regex would say "hermes" @@ -220,8 +220,8 @@ def test_detect_model_config_alias_wins_over_regex_when_they_disagree() -> None: is_hybrid=real.is_hybrid, supports_spec_decode=real.supports_spec_decode, ) - with patch.dict(ma._aliases, {"qwen3.5-4b": forged}): - cfg = detect_model_config("qwen3.5-4b") + with patch.dict(ma._aliases, {"qwen3.5-4b-4bit": forged}): + 
cfg = detect_model_config("qwen3.5-4b-4bit") assert cfg is not None assert cfg.tool_call_parser == "ALIAS_WINS", ( "regex shadowed the alias profile — alias-first lookup is broken" @@ -329,20 +329,20 @@ def test_per_alias_schema_allows_independent_overrides() -> None: even if they map to the same family. This is what we couldn't do before, and it's the architectural reason for the refactor.""" profiles = list_profiles() - p1 = profiles["qwen3.5-4b"] + p1 = profiles["qwen3.5-4b-4bit"] # Object identity check would be wrong; equality on a value-typed # dataclass is what we actually want — separate AliasProfile # instances per alias means we can mutate one without touching the # other. (Mutation isn't supported because the dataclass is frozen, # but a re-load with edited JSON would work.) - assert p1 is not profiles["qwen3.5-9b"] + assert p1 is not profiles["qwen3.5-9b-4bit"] # ---- Reverse-lookup behaviour with shared hf_paths ----------------------- def test_reverse_lookup_for_shared_hf_path_is_deterministic() -> None: - """Two aliases (``nemotron-30b`` and ``nemotron-nano``) point at the + """Two aliases (``nemotron-30b-4bit`` and ``nemotron-30b-4bit``) point at the same MLX repo. Reverse lookup by HF path should return the JSON-insertion-order-first alias's profile, deterministically. @@ -352,23 +352,23 @@ def test_reverse_lookup_for_shared_hf_path_is_deterministic() -> None: about who's the canonical alias). """ profiles = list_profiles() - nemotron_30b = profiles["nemotron-30b"] - nemotron_nano = profiles["nemotron-nano"] + nemotron_30b = profiles["nemotron-30b-4bit"] + nemotron_nano = profiles["nemotron-30b-4bit"] assert nemotron_30b.hf_path == nemotron_nano.hf_path - # nemotron-30b appears first in aliases.json, so reverse lookup - # by the shared HF path returns nemotron-30b's profile object. + # nemotron-30b-4bit appears first in aliases.json, so reverse lookup + # by the shared HF path returns nemotron-30b-4bit's profile object. 
via_path = resolve_profile(nemotron_30b.hf_path) assert via_path is not None assert via_path is nemotron_30b def test_reverse_lookup_handles_deepseek_v4_flash_duplicate() -> None: - """``deepseek-v4-flash`` and ``deepseek-v4-flash-8bit`` share + """``deepseek-v4-flash-8bit`` and ``deepseek-v4-flash-8bit`` share ``mlx-community/DeepSeek-V4-Flash-8bit`` — same regression guard pattern as the nemotron pair, different family.""" profiles = list_profiles() - flash = profiles["deepseek-v4-flash"] + flash = profiles["deepseek-v4-flash-8bit"] flash_8bit = profiles["deepseek-v4-flash-8bit"] assert flash.hf_path == flash_8bit.hf_path via_path = resolve_profile(flash.hf_path) diff --git a/tests/test_postprocessor.py b/tests/test_postprocessor.py index 00458485..8c3b5d4e 100644 --- a/tests/test_postprocessor.py +++ b/tests/test_postprocessor.py @@ -133,7 +133,7 @@ def test_channel_routed_accumulators_populated(self): emitted events to the client but never updated the per-processor accumulators that ``_build_usage`` reads to compute the reasoning/content split. Confirmed by parallel onboarding agents - on gemma-4-26b and gpt-oss-20b. + on gemma-4-26b-4bit and gpt-oss-20b. """ cfg = _make_cfg() pp = StreamingPostProcessor(cfg) diff --git a/tests/test_prefix_boundary_path_parity.py b/tests/test_prefix_boundary_path_parity.py index 4236e95d..cefffb2b 100644 --- a/tests/test_prefix_boundary_path_parity.py +++ b/tests/test_prefix_boundary_path_parity.py @@ -173,12 +173,12 @@ async def _drain(): def test_non_hybrid_model_skips_boundary_both_paths(monkeypatch): - """Pure Transformer models (gpt-oss-20b, qwen3-coder, etc.) must NOT + """Pure Transformer models (gpt-oss-20b-mxfp4-q8, qwen3-coder-4bit, etc.) must NOT take the boundary-split path even if a multi-message conversation would otherwise produce ``prefix_boundary > 0``. 
Why: ``BatchGenerator.insert_segments`` empirically corrupts harmony - tool-call channel state across multi-turn-with-tools on gpt-oss-20b + tool-call channel state across multi-turn-with-tools on gpt-oss-20b-mxfp4-q8 (pydantic_ai 6_multi_tool drops from 6/6 to 5/6 — agent loops on ``add(3,4)`` until ``request_limit`` exhausts). Pure Transformers don't need the boundary save anyway — trim+supersequence reuse diff --git a/tests/test_sampling_params_passthrough.py b/tests/test_sampling_params_passthrough.py index 1432ae98..7779aa73 100644 --- a/tests/test_sampling_params_passthrough.py +++ b/tests/test_sampling_params_passthrough.py @@ -27,7 +27,7 @@ # A realistic payload — Qwen3.6 published coding-tuned sampling. QWEN36_CODING_PAYLOAD = { - "model": "qwen3.6-35b", + "model": "qwen3.6-35b-4bit", "messages": [{"role": "user", "content": "hi"}], "temperature": 0.6, "top_p": 0.95, @@ -69,7 +69,7 @@ def test_chat_completion_request_defaults_to_none_when_unset(): 'client explicitly chose a value'. 
Mixing them would make us override SamplingParams defaults even when the client wanted defaults.""" req = ChatCompletionRequest( - model="qwen3.5-4b", + model="qwen3.5-4b-4bit", messages=[{"role": "user", "content": "hi"}], ) @@ -83,7 +83,7 @@ def test_chat_completion_request_defaults_to_none_when_unset(): def test_completion_request_preserves_extended_sampling_params(): """Mirror of the chat-request test for /v1/completions.""" payload = { - "model": "qwen3.6-35b", + "model": "qwen3.6-35b-4bit", "prompt": "hi", "temperature": 0.6, "top_p": 0.95, @@ -155,7 +155,7 @@ def test_chat_kwargs_omits_extended_params_when_client_silent(): NOT contain them — otherwise we'd override the engine's defaults with None and break the SamplingParams contract.""" req = ChatCompletionRequest( - model="qwen3.5-4b", + model="qwen3.5-4b-4bit", messages=[{"role": "user", "content": "hi"}], ) chat_kwargs = _build_chat_kwargs(req) @@ -198,7 +198,7 @@ async def stream_generate(self, **kw): yield _FakeOutput() req = CompletionRequest( - model="qwen3.6-35b", + model="qwen3.6-35b-4bit", prompt="hi", temperature=0.6, top_p=0.95, @@ -246,7 +246,7 @@ def test_completion_route_omits_extended_params_when_client_silent(): """Mirror of the chat-route variant: legacy /v1/completions clients that don't set these fields must not see them leaked as None into engine kwargs (which would override SamplingParams defaults).""" - req = CompletionRequest(model="qwen3.5-4b", prompt="hi") + req = CompletionRequest(model="qwen3.5-4b-4bit", prompt="hi") extended_kwargs: dict = {} for name in ( "top_k", diff --git a/tests/test_share_cli.py b/tests/test_share_cli.py index 4a58ea0e..6fb0227e 100644 --- a/tests/test_share_cli.py +++ b/tests/test_share_cli.py @@ -40,7 +40,7 @@ def _isolated_state_dir(tmp_path, monkeypatch): def _make_args(**overrides): defaults = dict( - model="qwen3.5-4b", + model="qwen3.5-4b-4bit", port=18765, # explicit so the env-var fallback path isn't exercised thinking=False, # default: forward 
--no-thinking to serve cors_origins=None, # None → CLI default allowlist @@ -124,7 +124,7 @@ def test_share_command_happy_path(capsys): patch.object(share_cli.ws_tunnel, "wait_for_public_url", return_value=True), patch.object(share_cli, "_pick_port", return_value=18765), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): @@ -227,9 +227,9 @@ def test_register_adds_share_to_subparsers(): parser = argparse.ArgumentParser() subparsers = parser.add_subparsers(dest="command") share_cli.register(subparsers) - args = parser.parse_args(["share", "qwen3.5-4b"]) + args = parser.parse_args(["share", "qwen3.5-4b-4bit"]) assert args.command == "share" - assert args.model == "qwen3.5-4b" + assert args.model == "qwen3.5-4b-4bit" def test_share_command_rejects_garbage_port_env(monkeypatch): @@ -277,7 +277,7 @@ def test_spawn_serve_passes_loopback_host(): access at ``http://:``.""" with patch("subprocess.Popen") as mock_popen: share_cli._spawn_serve( - alias="qwen3.5-4b", + alias="qwen3.5-4b-4bit", port=18765, api_key="K", log_path=MagicMock(), @@ -293,7 +293,7 @@ def test_spawn_serve_passes_api_key_via_env_not_argv(): appear in argv where ``ps`` / shell history would leak it.""" with patch("subprocess.Popen") as mock_popen: share_cli._spawn_serve( - alias="qwen3.5-4b", + alias="qwen3.5-4b-4bit", port=18765, api_key="SECRET_KEY_HERE", log_path=MagicMock(), @@ -466,7 +466,7 @@ def test_share_command_exits_nonzero_when_serve_crashes(capsys): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", return_value=None), pytest.raises(SystemExit) as exc_info, @@ -496,7 +496,7 @@ def 
test_share_command_exits_nonzero_when_serve_exits_cleanly(capsys): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", return_value=None), pytest.raises(SystemExit) as exc_info, @@ -533,7 +533,7 @@ def fake_sleep(*_a, **_k): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=fake_sleep), pytest.raises(SystemExit) as exc_info, @@ -558,7 +558,7 @@ def test_share_command_ctrl_c_keeps_exit_zero(capsys): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): @@ -591,7 +591,7 @@ def raise_sigterm_handler_once(*_a, **_k): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=raise_sigterm_handler_once), ): @@ -617,7 +617,7 @@ def test_register_share_cors_origins_accepts_multiple_values(): args = parser.parse_args( [ "share", - "qwen3.5-4b", + "qwen3.5-4b-4bit", "--cors-origins", "https://a.com", "https://b.com", @@ -647,7 +647,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args): # noqa: ARG001 patch.object(share_cli, "_pick_port", return_value=18765), 
patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): @@ -694,7 +694,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args): # noqa: ARG001 patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ] @@ -804,7 +804,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args): # noqa: ARG001 spawn_argv.append(alias) return serve_proc - args = _make_args(model="qwen3.5-4b") + args = _make_args(model="qwen3.5-4b-4bit") # Simulate argparse having rewritten the alias. args._original_alias = "Qwen3.5-4B" @@ -817,7 +817,7 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args): # noqa: ARG001 patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): @@ -845,12 +845,12 @@ def fake_spawn(*, alias, port, api_key, log_path, extra_args): # noqa: ARG001 patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): - share_cli.share_command(_make_args(model="qwen3.5-4b")) - assert spawn_argv == ["qwen3.5-4b"] + 
share_cli.share_command(_make_args(model="qwen3.5-4b-4bit")) + assert spawn_argv == ["qwen3.5-4b-4bit"] # ─────────────────────────── download-gate behavior ───────────────────────── @@ -873,7 +873,7 @@ def test_share_command_runs_download_gate_for_uncached_hf_repo(): def test_share_command_skips_download_gate_for_local_alias(): - """Aliases without ``/`` (e.g. ``qwen3.5-4b``) are NOT HF repo ids — + """Aliases without ``/`` (e.g. ``qwen3.5-4b-4bit``) are NOT HF repo ids — the gate short-circuits before any HF API call. Verified by asserting ``is_repo_cached`` was never called.""" with ( @@ -881,7 +881,7 @@ def test_share_command_skips_download_gate_for_local_alias(): patch("vllm_mlx._download_gate.is_repo_cached") as cached, pytest.raises(SystemExit), ): - share_cli.share_command(_make_args(model="qwen3.5-4b")) + share_cli.share_command(_make_args(model="qwen3.5-4b-4bit")) cached.assert_not_called() @@ -1104,7 +1104,7 @@ def test_register_share_chat_frontend_default_is_none(): parser = argparse.ArgumentParser() subparsers = parser.add_subparsers(dest="command") share_cli.register(subparsers) - args = parser.parse_args(["share", "qwen3.5-4b"]) + args = parser.parse_args(["share", "qwen3.5-4b-4bit"]) assert args.chat_frontend is None @@ -1113,7 +1113,7 @@ def test_register_share_chat_frontend_accepts_value(): subparsers = parser.add_subparsers(dest="command") share_cli.register(subparsers) args = parser.parse_args( - ["share", "qwen3.5-4b", "--chat-frontend", "https://my-fork.example"] + ["share", "qwen3.5-4b-4bit", "--chat-frontend", "https://my-fork.example"] ) assert args.chat_frontend == "https://my-fork.example" @@ -1150,7 +1150,7 @@ def test_share_command_forwards_chat_frontend_to_banner(capsys): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" 
), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): @@ -1179,7 +1179,7 @@ def test_share_command_omits_chat_line_when_frontend_disabled(capsys): patch.object(share_cli, "_pick_port", return_value=18765), patch.object(share_cli, "_maybe_confirm_download"), patch.object( - share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b" + share_cli, "_resolve_served_model_name", return_value="qwen3.5-4b-4bit" ), patch("time.sleep", side_effect=_ctrl_c_in_monitor_loop()), ): diff --git a/tests/test_smoke_matrix.sh b/tests/test_smoke_matrix.sh index e5c45902..00befac9 100755 --- a/tests/test_smoke_matrix.sh +++ b/tests/test_smoke_matrix.sh @@ -72,7 +72,7 @@ print(text) # test 4 to verify the reasoning parser is actually splitting thinking # tokens into ``reasoning_content`` — combined length is unreliable because # some models compensate for disabled thinking by writing longer answers -# in ``content`` (e.g. qwen3.5-4b on simple math drops thinking ratio +# in ``content`` (e.g. qwen3.5-4b-4bit on simple math drops thinking ratio # below the previous 1.5x heuristic). stream_chat_split() { local body="$1" diff --git a/tests/test_stop_string_enforcement.py b/tests/test_stop_string_enforcement.py index 699121f9..5a923211 100644 --- a/tests/test_stop_string_enforcement.py +++ b/tests/test_stop_string_enforcement.py @@ -8,7 +8,7 @@ scheduler has to scan the decoded output itself. This was missing on the text path (MLLMScheduler had it, Scheduler did -not), which surfaced as 4 failing regression-suite tests on qwen3.5-4b: +not), which surfaced as 4 failing regression-suite tests on qwen3.5-4b-4bit: tests 1 (newline), 2 (literal word), 4 (Unicode), 5 (streaming). 
These unit tests exercise ``Scheduler._process_batch_responses`` directly with diff --git a/tests/test_suffix_bench_methodology.py b/tests/test_suffix_bench_methodology.py index e9c0fed9..5ad7d16c 100644 --- a/tests/test_suffix_bench_methodology.py +++ b/tests/test_suffix_bench_methodology.py @@ -53,7 +53,7 @@ def test_negative_tokens_rejected(self): def test_decode_time_floor_rejects_short_window(self): # 80 tokens generated in 0.04s → 2000 tok/s. This is the exact - # failure mode that burned smollm3-3b's code_edit run in v2 — + # failure mode that burned smollm3-3b-4bit's code_edit run in v2 — # technically >32 tokens (the old guard) but still meaningless. wr = bench._classify_run(completion_tokens=80, decode_time=0.04, total_time=0.6) assert wr.tps is None diff --git a/tests/test_suffix_decoding_tier.py b/tests/test_suffix_decoding_tier.py index da1309d0..14bb64d3 100644 --- a/tests/test_suffix_decoding_tier.py +++ b/tests/test_suffix_decoding_tier.py @@ -191,7 +191,7 @@ def test_avoid_shows_worst_workload(self): assert "0.78" in table def test_long_avoid_note_fits_box_when_max_width_set(self): - """Regression: long ``avoid`` notes (e.g. ``gemma-4-26b``) used + """Regression: long ``avoid`` notes (e.g. ``gemma-4-26b-4bit``) used to overflow the right ``│`` border. Truncation must keep the tier word + numeric speedup whole while shortening the trailing rationale to ``…)``.""" @@ -217,7 +217,7 @@ def test_short_tier_note_is_not_wrongly_truncated(self): def test_table_rows_all_same_width_for_long_avoid(self): """Box-frame alignment invariant: every bordered row must end at the same column. 
Pre-fix the ``Suffix tier`` row for an alias - like ``gemma-4-26b`` would render past the right ``│``.""" + like ``gemma-4-26b-4bit`` would render past the right ``│``.""" cfg = ModelConfig( suffix_decoding_tier="avoid", suffix_bench_speedup={"json_array": 0.20}, diff --git a/tests/test_telemetry_emit.py b/tests/test_telemetry_emit.py index 86ba1d6d..92b86ed7 100644 --- a/tests/test_telemetry_emit.py +++ b/tests/test_telemetry_emit.py @@ -80,7 +80,7 @@ def test_request_no_op_when_disabled(fake_home, stub_queue): emit.request( endpoint="/v1/chat/completions", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=100, @@ -147,7 +147,7 @@ def test_cli_kill_switch_overrides_opt_in(opted_in, stub_queue): emit.session_end(subcommand="serve", duration_seconds=42) emit.request( endpoint="/v1/chat/completions", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=100, @@ -289,7 +289,7 @@ def test_request_buckets_not_raw_numbers(opted_in, stub_queue): emit.request( endpoint="/v1/chat/completions", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=137, @@ -469,7 +469,7 @@ def test_request_endpoint_constrained_to_allowlist(opted_in, stub_queue): # Allowed endpoint round-trips verbatim (after strip). emit.request( endpoint="/v1/chat/completions", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=10, @@ -483,7 +483,7 @@ def test_request_endpoint_constrained_to_allowlist(opted_in, stub_queue): # Query string + fragment stripped before allowlist match. 
emit.request( endpoint="/v1/chat/completions?api_key=sk-PROD-SECRET#anchor", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=10, @@ -502,7 +502,7 @@ def test_request_endpoint_constrained_to_allowlist(opted_in, stub_queue): # string leaks into the payload. emit.request( endpoint="/internal/dump?path=/Users/alice/secrets.txt", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=10, @@ -530,7 +530,7 @@ def test_request_endpoint_normalizes_full_url_to_path(opted_in, stub_queue): emit.request( endpoint="https://api.example.com/v1/chat/completions", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=10, @@ -549,7 +549,7 @@ def test_request_endpoint_normalizes_full_url_to_path(opted_in, stub_queue): # Combined: full URL + query + fragment still resolves correctly. emit.request( endpoint="https://host/v1/chat/completions?key=sk-PROD-LEAK#frag", - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=10, @@ -661,7 +661,7 @@ def test_safe_does_not_swallow_signature_mismatch(opted_in, stub_queue): with pytest.raises(TypeError): emit.request( # endpoint missing - model_alias="qwen3.5-9b", + model_alias="qwen3.5-9b-4bit", stream=True, tool_call_used=False, prompt_tokens=10, @@ -715,7 +715,7 @@ def test_flag_values_never_cross_telemetry_boundary(opted_in, stub_queue): prompt = "summarize this confidential email about Q3 numbers" argv = [ "serve", - "qwen3.5-9b", + "qwen3.5-9b-4bit", "--api-key", secret, "--auth-header", diff --git a/tests/test_telemetry_redact.py b/tests/test_telemetry_redact.py index f7c9eaa2..86549eff 100644 --- a/tests/test_telemetry_redact.py +++ b/tests/test_telemetry_redact.py @@ -126,8 +126,8 @@ def test_normalize_model_path_redacts_local(raw): def test_normalize_model_path_bare_alias_passes(): """Bare alias names (no slash) 
are public + harmless.""" - assert normalize_model_path("qwen3.5-9b") == "qwen3.5-9b" - assert normalize_model_path("hermes3-8b") == "hermes3-8b" + assert normalize_model_path("qwen3.5-9b-4bit") == "qwen3.5-9b-4bit" + assert normalize_model_path("hermes3-8b-4bit") == "hermes3-8b-4bit" def test_normalize_model_path_empty(): diff --git a/vllm_mlx/_completion.py b/vllm_mlx/_completion.py index 3a9a0ff7..2ba65c6c 100644 --- a/vllm_mlx/_completion.py +++ b/vllm_mlx/_completion.py @@ -116,7 +116,7 @@ def alias_csv_completer(prefix: str = "", **_: Any) -> list[str]: """Comma-separated-list variant for ``doctor --models a,b,c``. The user-visible prefix at completion time contains everything - typed for this flag — e.g. ``qwen3.5-4b,gem`` when partway through + typed for this flag — e.g. ``qwen3.5-4b-4bit,gem`` when partway through the second entry. We split on the last comma, strip whitespace around the tail so ``--models a, gem`` works the same as ``--models a,gem`` (the runtime ``split + strip`` accepts diff --git a/vllm_mlx/_download_gate.py b/vllm_mlx/_download_gate.py index 4e14cbd4..ec997a73 100644 --- a/vllm_mlx/_download_gate.py +++ b/vllm_mlx/_download_gate.py @@ -3,7 +3,7 @@ Persona-3 ("Ollama switcher") feedback (2026-05): running - rapid-mlx chat qwen3-coder + rapid-mlx chat qwen3-coder-4bit against an alias that wasn't yet cached silently kicked off a 41.8 GB download with no ``[Y/n]`` prompt. 
The download itself ran fine, but diff --git a/vllm_mlx/agents/__init__.py b/vllm_mlx/agents/__init__.py index 10835740..92ff5718 100644 --- a/vllm_mlx/agents/__init__.py +++ b/vllm_mlx/agents/__init__.py @@ -4,7 +4,7 @@ from vllm_mlx.agents import get_profile, list_profiles profile = get_profile("hermes") - config = profile.render_config("http://localhost:8000/v1", "qwen3.5-9b") + config = profile.render_config("http://localhost:8000/v1", "qwen3.5-9b-4bit") """ from __future__ import annotations diff --git a/vllm_mlx/agents/profiles/aider.yaml b/vllm_mlx/agents/profiles/aider.yaml index 3a8cfb6e..261b5be4 100644 --- a/vllm_mlx/agents/profiles/aider.yaml +++ b/vllm_mlx/agents/profiles/aider.yaml @@ -11,10 +11,10 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" - - "deepseek-r1-8b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" + - "deepseek-r1-8b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/cline.yaml b/vllm_mlx/agents/profiles/cline.yaml index 641e32e2..08b5e011 100644 --- a/vllm_mlx/agents/profiles/cline.yaml +++ b/vllm_mlx/agents/profiles/cline.yaml @@ -13,9 +13,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "gemma-4-26b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "gemma-4-26b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/codex.yaml b/vllm_mlx/agents/profiles/codex.yaml index b3c47b1c..a2bdf438 100644 --- a/vllm_mlx/agents/profiles/codex.yaml +++ b/vllm_mlx/agents/profiles/codex.yaml @@ -17,9 +17,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/generic.yaml b/vllm_mlx/agents/profiles/generic.yaml index 39ef387d..4ee271f2 100644 --- a/vllm_mlx/agents/profiles/generic.yaml +++ b/vllm_mlx/agents/profiles/generic.yaml @@ -10,9 +10,9 @@ config: 
models: recommended: - - "qwen3.5-4b" - - "qwen3.5-9b" - - "qwen3.6-35b" + - "qwen3.5-4b-4bit" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" streaming: extra_tool_tags: [] diff --git a/vllm_mlx/agents/profiles/goose.yaml b/vllm_mlx/agents/profiles/goose.yaml index 4287f07f..2e49a6b3 100644 --- a/vllm_mlx/agents/profiles/goose.yaml +++ b/vllm_mlx/agents/profiles/goose.yaml @@ -11,9 +11,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/hermes.yaml b/vllm_mlx/agents/profiles/hermes.yaml index 849f0506..afbd05e7 100644 --- a/vllm_mlx/agents/profiles/hermes.yaml +++ b/vllm_mlx/agents/profiles/hermes.yaml @@ -18,11 +18,11 @@ config: models: recommended: - - "qwen3.5-4b" - - "qwen3.5-9b" - - "gemma-4-26b" + - "qwen3.5-4b-4bit" + - "qwen3.5-9b-4bit" + - "gemma-4-26b-4bit" - "qwen3.5-35b-a3b" - - "qwen3.6-35b" + - "qwen3.6-35b-4bit" parser_override: null # use auto-detect per model streaming: diff --git a/vllm_mlx/agents/profiles/langchain.yaml b/vllm_mlx/agents/profiles/langchain.yaml index 55e10ded..8f8e9062 100644 --- a/vllm_mlx/agents/profiles/langchain.yaml +++ b/vllm_mlx/agents/profiles/langchain.yaml @@ -11,9 +11,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" streaming: max_tools: 10 diff --git a/vllm_mlx/agents/profiles/openclaude.yaml b/vllm_mlx/agents/profiles/openclaude.yaml index be8a890b..7d10bcca 100644 --- a/vllm_mlx/agents/profiles/openclaude.yaml +++ b/vllm_mlx/agents/profiles/openclaude.yaml @@ -13,10 +13,10 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" - - "gemma-4-26b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" + - "gemma-4-26b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/opencode.yaml 
b/vllm_mlx/agents/profiles/opencode.yaml index 28db5c2b..0143eeab 100644 --- a/vllm_mlx/agents/profiles/opencode.yaml +++ b/vllm_mlx/agents/profiles/opencode.yaml @@ -33,10 +33,10 @@ config: models: recommended: - - "qwen3-coder-30b" - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3-coder-30b-4bit" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/openhands.yaml b/vllm_mlx/agents/profiles/openhands.yaml index 8820f984..de6c8735 100644 --- a/vllm_mlx/agents/profiles/openhands.yaml +++ b/vllm_mlx/agents/profiles/openhands.yaml @@ -12,9 +12,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" parser_override: null streaming: diff --git a/vllm_mlx/agents/profiles/pydanticai.yaml b/vllm_mlx/agents/profiles/pydanticai.yaml index 6c34da17..bacfde84 100644 --- a/vllm_mlx/agents/profiles/pydanticai.yaml +++ b/vllm_mlx/agents/profiles/pydanticai.yaml @@ -11,9 +11,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" streaming: max_tools: 10 diff --git a/vllm_mlx/agents/profiles/smolagents.yaml b/vllm_mlx/agents/profiles/smolagents.yaml index 425a400c..e5880bee 100644 --- a/vllm_mlx/agents/profiles/smolagents.yaml +++ b/vllm_mlx/agents/profiles/smolagents.yaml @@ -11,9 +11,9 @@ config: models: recommended: - - "qwen3.5-9b" - - "qwen3.6-35b" - - "qwen3.5-4b" + - "qwen3.5-9b-4bit" + - "qwen3.6-35b-4bit" + - "qwen3.5-4b-4bit" streaming: max_tools: 10 diff --git a/vllm_mlx/aliases.json b/vllm_mlx/aliases.json index f8051597..7a5edf99 100644 --- a/vllm_mlx/aliases.json +++ b/vllm_mlx/aliases.json @@ -1,5 +1,5 @@ { - "qwen3.5-4b": { + "qwen3.5-4b-4bit": { "hf_path": "mlx-community/Qwen3.5-4B-MLX-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -14,7 +14,7 @@ "is_moe": 
false, "supports_spec_decode": false }, - "qwen3.5-9b": { + "qwen3.5-9b-4bit": { "hf_path": "mlx-community/Qwen3.5-9B-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -29,14 +29,14 @@ "is_moe": false, "supports_spec_decode": false }, - "qwen3.5-27b": { + "qwen3.5-27b-4bit": { "hf_path": "mlx-community/Qwen3.5-27B-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", "is_hybrid": true, "supports_spec_decode": false }, - "qwen3.5-35b": { + "qwen3.5-35b-8bit": { "hf_path": "mlx-community/Qwen3.5-35B-A3B-8bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -52,7 +52,7 @@ "supports_spec_decode": false, "is_moe": true }, - "qwen3.5-122b": { + "qwen3.5-122b-mxfp4": { "hf_path": "nightmedia/Qwen3.5-122B-A10B-Text-mxfp4-mlx", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -78,7 +78,7 @@ "supports_dflash": true, "dflash_draft_model": "z-lab/Qwen3.5-27B-DFlash" }, - "qwen3.6-27b": { + "qwen3.6-27b-4bit": { "hf_path": "mlx-community/Qwen3.6-27B-4bit", "tool_call_parser": "qwen3_coder_xml", "reasoning_parser": "qwen3", @@ -101,7 +101,7 @@ "is_hybrid": true, "supports_spec_decode": false }, - "qwen3.6-35b": { + "qwen3.6-35b-4bit": { "hf_path": "mlx-community/Qwen3.6-35B-A3B-4bit", "tool_call_parser": "qwen3_coder_xml", "reasoning_parser": "qwen3", @@ -141,13 +141,6 @@ "supports_spec_decode": false, "is_moe": true }, - "deepseek-v4-flash": { - "hf_path": "mlx-community/DeepSeek-V4-Flash-8bit", - "tool_call_parser": "deepseek", - "reasoning_parser": "deepseek_r1", - "is_hybrid": false, - "supports_spec_decode": true - }, "deepseek-v4-flash-2bit": { "hf_path": "mlx-community/DeepSeek-V4-Flash-2bit-DQ", "tool_call_parser": "deepseek", @@ -201,14 +194,14 @@ "is_moe": false, "supports_spec_decode": true }, - "qwen3-coder": { + "qwen3-coder-4bit": { "hf_path": "lmstudio-community/Qwen3-Coder-Next-MLX-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, "is_hybrid": true, "supports_spec_decode": false }, - 
"qwen3-vl-4b": { + "qwen3-vl-4b-4bit": { "hf_path": "mlx-community/Qwen3-VL-4B-Instruct-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -221,7 +214,7 @@ "code_edit": 1.001 } }, - "llama3-1b": { + "llama3-1b-4bit": { "hf_path": "mlx-community/Llama-3.2-1B-Instruct-4bit", "tool_call_parser": "llama", "reasoning_parser": null, @@ -235,7 +228,7 @@ "code_edit": 0.851 } }, - "llama3-3b": { + "llama3-3b-4bit": { "hf_path": "mlx-community/Llama-3.2-3B-Instruct-4bit", "tool_call_parser": "llama", "reasoning_parser": null, @@ -247,7 +240,7 @@ "json_array": 0.875 } }, - "hermes3-8b": { + "hermes3-8b-4bit": { "hf_path": "mlx-community/Hermes-3-Llama-3.1-8B-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -261,7 +254,7 @@ "code_edit": 0.584 } }, - "gemma-4-12b": { + "gemma-4-12b-4bit": { "hf_path": "mlx-community/gemma-4-12B-it-4bit", "tool_call_parser": "gemma4", "reasoning_parser": "gemma4", @@ -287,7 +280,7 @@ "top_k": 64 } }, - "gemma-4-26b": { + "gemma-4-26b-4bit": { "hf_path": "mlx-community/gemma-4-26b-a4b-it-4bit", "tool_call_parser": "gemma4", "reasoning_parser": "gemma4", @@ -303,7 +296,7 @@ "top_k": 64 } }, - "gemma-4-31b": { + "gemma-4-31b-4bit": { "hf_path": "mlx-community/gemma-4-31b-it-4bit", "tool_call_parser": "gemma4", "reasoning_parser": "gemma4", @@ -334,7 +327,7 @@ "top_k": 64 } }, - "gemma-4-12b-qat": { + "gemma-4-12b-qat-4bit": { "hf_path": "mlx-community/gemma-4-12B-it-qat-4bit", "tool_call_parser": "gemma4", "reasoning_parser": "gemma4", @@ -360,7 +353,7 @@ "top_k": 64 } }, - "gemma-4-26b-qat": { + "gemma-4-26b-qat-4bit": { "hf_path": "mlx-community/gemma-4-26B-A4B-it-qat-4bit", "tool_call_parser": "gemma4", "reasoning_parser": "gemma4", @@ -373,7 +366,7 @@ "top_k": 64 } }, - "gemma-4-31b-qat": { + "gemma-4-31b-qat-4bit": { "hf_path": "mlx-community/gemma-4-31B-it-qat-4bit", "tool_call_parser": "gemma4", "reasoning_parser": "gemma4", @@ -399,20 +392,7 @@ "top_k": 64 } }, - "gemma4": { - "hf_path": 
"mlx-community/gemma-4-12B-it-qat-4bit", - "tool_call_parser": "gemma4", - "reasoning_parser": "gemma4", - "is_hybrid": false, - "is_moe": false, - "supports_spec_decode": true, - "recommended_sampling": { - "temperature": 1.0, - "top_p": 0.95, - "top_k": 64 - } - }, - "gemma3-12b": { + "gemma3-12b-4bit": { "hf_path": "mlx-community/gemma-3-12b-it-qat-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -424,8 +404,8 @@ "top_k": 64 } }, - "phi4-14b": { - "hf_path": "mlx-community/phi-4-mini-instruct-4bit", + "phi-4-14b-4bit": { + "hf_path": "mlx-community/phi-4-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, "is_hybrid": false, @@ -438,14 +418,14 @@ "code_edit": 0.874 } }, - "mistral-24b": { + "mistral-24b-4bit": { "hf_path": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, "is_hybrid": false, "supports_spec_decode": true }, - "devstral-24b": { + "devstral-24b-4bit": { "hf_path": "mlx-community/Devstral-Small-2507-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -462,7 +442,7 @@ "temperature": 0.15 } }, - "glm4.7-9b": { + "glm4.7-9b-4bit": { "hf_path": "mlx-community/GLM-4.7-Flash-4bit", "tool_call_parser": "glm47", "reasoning_parser": "glm4", @@ -479,7 +459,7 @@ "top_p": 0.95 } }, - "glm4.5-air": { + "glm4.5-air-4bit": { "hf_path": "mlx-community/GLM-4.5-Air-4bit", "tool_call_parser": "glm47", "reasoning_parser": "glm4", @@ -497,28 +477,28 @@ "top_p": 0.95 } }, - "gpt-oss-20b": { + "gpt-oss-20b-mxfp4-q8": { "hf_path": "mlx-community/gpt-oss-20b-MXFP4-Q8", "tool_call_parser": "harmony", "reasoning_parser": "harmony", "is_hybrid": false, "supports_spec_decode": true }, - "minimax-m2.5": { + "minimax-m2.5-4bit": { "hf_path": "lmstudio-community/MiniMax-M2.5-MLX-4bit", "tool_call_parser": "minimax", "reasoning_parser": "minimax", "is_hybrid": false, "supports_spec_decode": true }, - "minimax-m2.7": { + "minimax-m2.7-mxfp4": { "hf_path": 
"mlx-community/MiniMax-M2.7-4bit-mxfp4", "tool_call_parser": "minimax", "reasoning_parser": "minimax", "is_hybrid": false, "supports_spec_decode": true }, - "deepseek-r1-8b": { + "deepseek-r1-8b-4bit": { "hf_path": "mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit", "tool_call_parser": "deepseek_v31", "reasoning_parser": "deepseek_r1", @@ -531,7 +511,7 @@ "code_edit": 0.289 } }, - "deepseek-r1-32b": { + "deepseek-r1-32b-4bit": { "hf_path": "mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit", "tool_call_parser": "deepseek", "reasoning_parser": "deepseek_r1", @@ -545,14 +525,14 @@ "code_edit": 1.002 } }, - "qwopus-9b": { + "qwopus-9b-4bit": { "hf_path": "Jackrong/MLX-Qwopus3.5-9B-v3-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", "is_hybrid": true, "supports_spec_decode": false }, - "qwopus-27b": { + "qwopus-27b-4bit": { "hf_path": "Jackrong/MLX-Qwopus3.5-27B-v3-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -566,21 +546,21 @@ "is_hybrid": true, "supports_spec_decode": false }, - "kimi-48b": { + "kimi-48b-4bit": { "hf_path": "mlx-community/Kimi-K2-Instruct-4bit", "tool_call_parser": "kimi", "reasoning_parser": null, "is_hybrid": false, "supports_spec_decode": true }, - "kimi-k2.5": { + "kimi-k2.5-3bit": { "hf_path": "mlx-community/Kimi-K2.5-3bit", "tool_call_parser": "kimi", "reasoning_parser": "qwen3", "is_hybrid": false, "supports_spec_decode": true }, - "ministral-3b": { + "ministral-3b-4bit": { "hf_path": "mlx-community/Ministral-3-3B-Instruct-2512-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -592,7 +572,7 @@ "json_array": 1.08 } }, - "hermes4-70b": { + "hermes4-70b-4bit": { "hf_path": "lmstudio-community/Hermes-4-70B-MLX-4bit", "tool_call_parser": "hermes", "reasoning_parser": "glm4", @@ -606,7 +586,7 @@ "code_edit": 0.679 } }, - "qwen3-vl-8b": { + "qwen3-vl-8b-4bit": { "hf_path": "mlx-community/Qwen3-VL-8B-Instruct-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -619,7 +599,7 @@ 
"code_edit": 1.024 } }, - "qwen3-vl-30b": { + "qwen3-vl-30b-4bit": { "hf_path": "mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -633,7 +613,7 @@ }, "is_moe": true }, - "devstral-v2-24b": { + "devstral-v2-24b-4bit": { "hf_path": "mlx-community/Devstral-Small-2-24B-Instruct-2512-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -649,7 +629,7 @@ "temperature": 0.15 } }, - "qwen3-coder-30b": { + "qwen3-coder-30b-4bit": { "hf_path": "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -664,7 +644,7 @@ }, "is_moe": true }, - "gemma-3n-e4b": { + "gemma-3n-e4b-4bit": { "hf_path": "mlx-community/gemma-3n-E4B-it-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -676,7 +656,7 @@ "top_k": 64 } }, - "gemma3-1b": { + "gemma3-1b-4bit": { "hf_path": "mlx-community/gemma-3-1b-it-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -694,7 +674,7 @@ "top_k": 64 } }, - "gemma3-27b": { + "gemma3-27b-4bit": { "hf_path": "mlx-community/gemma-3-27b-it-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, @@ -706,15 +686,7 @@ "top_k": 64 } }, - "nemotron-30b": { - "hf_path": "lmstudio-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-4bit", - "tool_call_parser": "hermes", - "reasoning_parser": "qwen3", - "is_hybrid": true, - "supports_spec_decode": false, - "is_moe": true - }, - "nemotron-nano": { + "nemotron-30b-4bit": { "hf_path": "lmstudio-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -722,7 +694,7 @@ "supports_spec_decode": false, "is_moe": true }, - "bonsai-1.7b": { + "bonsai-1.7b-unpacked": { "hf_path": "prism-ml/Bonsai-1.7B-unpacked", "tool_call_parser": "hermes", "reasoning_parser": "glm4", @@ -736,7 +708,7 @@ "code_edit": 1.051 } }, - "bonsai-4b": { + "bonsai-4b-unpacked": { "hf_path": "prism-ml/Bonsai-4B-unpacked", "tool_call_parser": "hermes", 
"reasoning_parser": "glm4", @@ -749,7 +721,7 @@ "code_edit": 1.012 } }, - "bonsai-8b": { + "bonsai-8b-unpacked": { "hf_path": "prism-ml/Bonsai-8B-unpacked", "tool_call_parser": "hermes", "reasoning_parser": "glm4", @@ -762,7 +734,7 @@ "code_edit": 1.181 } }, - "smollm3-3b": { + "smollm3-3b-4bit": { "hf_path": "mlx-community/SmolLM3-3B-4bit", "tool_call_parser": "hermes", "reasoning_parser": "qwen3", @@ -774,11 +746,20 @@ "tool_loop": 0.776 } }, - "granite4-tiny": { + "granite4-tiny-4bit": { "hf_path": "mlx-community/granite-4.0-h-tiny-4bit", "tool_call_parser": "hermes", "reasoning_parser": null, "is_hybrid": true, "supports_spec_decode": false + }, + "phi-4-mini-4bit": { + "hf_path": "mlx-community/phi-4-mini-instruct-4bit", + "tool_call_parser": "hermes", + "reasoning_parser": null, + "is_hybrid": false, + "is_moe": false, + "supports_spec_decode": true, + "suffix_decoding_tier": "unknown" } } diff --git a/vllm_mlx/api/utils.py b/vllm_mlx/api/utils.py index 29cd4fb5..1ec74a4f 100644 --- a/vllm_mlx/api/utils.py +++ b/vllm_mlx/api/utils.py @@ -133,7 +133,7 @@ def _clean_gpt_oss_output(text: str) -> str: """ # Tool-call structure must survive to the harmony tool parser: # if the model emitted ``<|channel|>commentary to=functions.X...<|call|>`` - # (which gpt-oss-20b does for every tool invocation), the parser needs + # (which gpt-oss-20b-mxfp4-q8 does for every tool invocation), the parser needs # those structural tokens intact to extract the call. Stripping them # here drops the args into plain text and the parser returns 0 calls. # Same regression class as PR #436 but for the tool parser. 
Final diff --git a/vllm_mlx/cli.py b/vllm_mlx/cli.py index fd0ed6c9..ad7c102a 100644 --- a/vllm_mlx/cli.py +++ b/vllm_mlx/cli.py @@ -10,9 +10,9 @@ rapid-mlx chat Interactive chat REPL Usage: - rapid-mlx serve qwen3.5-4b --port 8000 - rapid-mlx bench qwen3.5-4b --num-prompts 10 - rapid-mlx chat qwen3.5-4b + rapid-mlx serve qwen3.5-4b-4bit --port 8000 + rapid-mlx bench qwen3.5-4b-4bit --num-prompts 10 + rapid-mlx chat qwen3.5-4b-4bit """ import argparse @@ -1374,7 +1374,7 @@ def _print_cached_models() -> None: return # Reverse-map HF repo path → alias name so the alias column matches the - # user's mental model (``qwen3.5-4b`` not ``mlx-community/Qwen3.5-4B...``). + # user's mental model (``qwen3.5-4b-4bit`` not ``mlx-community/Qwen3.5-4B...``). profiles = list_profiles() hf_to_alias: dict[str, str] = {} for alias, p in profiles.items(): @@ -1450,11 +1450,11 @@ def models_command(args): print(f" Available models ({len(profiles)} aliases)") # Widths sized to fit the longest values currently in aliases.json: - # alias 22 (qwen3.5-122b-mxfp4 etc.), tool 16 (qwen3_coder_xml + 1 pad), - # reasoning 12 (deepseek_r1 + 1 pad), spec 10 ("✗ hybrid"), tier 11, - # dflash 7 ("✓ ready"/"—"). + # alias 24 (deepseek-v4-flash-8bit is 22 chars; +2 pad after explicit + # quant rename), tool 16 (qwen3_coder_xml + 1 pad), reasoning 12 + # (deepseek_r1 + 1 pad), spec 10 ("✗ hybrid"), tier 11, dflash 7. cols = ( - ("Alias", 22), + ("Alias", 24), ("Tools", 16), ("Reasoning", 12), ("Spec-Decode", 10), @@ -1487,7 +1487,7 @@ def models_command(args): # registry column is pure declarative state. 
dflash = "✓" if p.supports_dflash else "—" row = ( - f" {alias:<22} {tools:<16} {reasoning:<12} " + f" {alias:<24} {tools:<16} {reasoning:<12} " f"{spec:<10} {tier:<11} {dflash:<7}" ) print(row) @@ -1607,7 +1607,7 @@ def ps_command(_args): try: i = cmd.index("serve") + 1 # Pre-PR this loop ``break``ed on the first positional, so a - # ``rapid-mlx serve qwen3.5-4b --port 8005`` ended with + # ``rapid-mlx serve qwen3.5-4b-4bit --port 8005`` ended with # port="8000" because the positional model token came before # ``--port``. Keep scanning for flags after we've captured the # model — argparse accepts them on either side. @@ -1673,7 +1673,7 @@ def _spawn_chat_server( If ``served_name`` is given, it is passed via ``--served-model-name`` so the spawned server exposes the alias as the API model name (e.g. user - typed ``qwen3.5-4b`` → API requests use ``qwen3.5-4b`` rather than the + typed ``qwen3.5-4b-4bit`` → API requests use ``qwen3.5-4b-4bit`` rather than the expanded HF path). """ import socket @@ -1809,7 +1809,7 @@ def _has_short_pattern_dominating_suffix( - ``"BarleyBarleyBarley..."`` (no whitespace separator) — the entire suffix collapses to a single ``str.split()`` token whose count - never increments. Real qwen3.5-4b regression surfaced in the + never increments. Real qwen3.5-4b-4bit regression surfaced in the 0.6.28 onboarding test. - Long-cycle phrase loops, e.g. a ~280-char clause that repeats verbatim until ``max_tokens``. Surfaced when asked "describe the @@ -2051,7 +2051,7 @@ def _close_open_md_spans() -> None: # pattern. Catches the form ``"BarleyBarleyBarley..."`` (no # whitespace separator), where ``piece.split()`` produces one # giant token whose count never increments — this was a real - # qwen3.5-4b regression in 0.6.28 (issue surfaced post-release). + # qwen3.5-4b-4bit regression in 0.6.28 (issue surfaced post-release). 
REPEAT_LIMIT = 25 repeat_last: str | None = None repeat_run = 0 @@ -2442,7 +2442,7 @@ def _sigterm_handler(*_): # we can distinguish "user did not pass it" from "user passed 2048 # explicitly". When ``--think`` is set and the user did not supply a # value, raise the default from 2048 to 4096 so the reasoning trace + - # final answer both fit (the round-1 finding: ``chat qwen3.5-4b + # final answer both fit (the round-1 finding: ``chat qwen3.5-4b-4bit # --think`` filled the 2048 budget with reasoning and emitted an # empty answer with ``finish_reason='length'``). user_passed_max_tokens = args.max_tokens is not None @@ -2487,7 +2487,7 @@ def _sigterm_handler(*_): # # Default thinking OFF in the REPL. Reasoning models (Qwen3.5/3.6, etc.) # otherwise emit raw chain-of-thought to stdout AND, on the default - # qwen3.5-4b model, degenerate into infinite repetition until max-tokens + # qwen3.5-4b-4bit model, degenerate into infinite repetition until max-tokens # truncates the response — producing zero usable output for a brand-new # user. ``--think`` opts back in for users who explicitly want to see # reasoning traces; ``--no-think`` is preserved as the legacy form. 
@@ -3264,12 +3264,12 @@ def main(): formatter_class=argparse.RawDescriptionHelpFormatter, epilog="""\ Examples: - rapid-mlx chat # interactive REPL (defaults to qwen3.5-4b) - rapid-mlx chat qwen3.5-9b --think # larger model, surface reasoning - rapid-mlx serve qwen3.5-9b --port 8000 # OpenAI-compatible server + rapid-mlx chat # interactive REPL (defaults to qwen3.5-4b-4bit) + rapid-mlx chat qwen3.5-9b-4bit --think # larger model, surface reasoning + rapid-mlx serve qwen3.5-9b-4bit --port 8000 # OpenAI-compatible server rapid-mlx serve mlx-community/Qwen3.5-9B-4bit # full HF repo also works rapid-mlx models # list all aliases - rapid-mlx info qwen3.5-9b # show per-alias profile + rapid-mlx info qwen3.5-9b-4bit # show per-alias profile """, ) parser.add_argument( @@ -4092,13 +4092,13 @@ def main(): "pull", help="Download a model to the HuggingFace cache (no server)" ) pull_parser.add_argument( - "model", help="Model alias (e.g. qwen3.5-4b) or HF repo (org/name)" + "model", help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (org/name)" ).completer = alias_completer rm_parser = subparsers.add_parser( "rm", help="Remove a cached model from the HuggingFace cache" ) rm_parser.add_argument( - "model", help="Model alias (e.g. qwen3.5-4b) or HF repo (org/name)" + "model", help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (org/name)" ).completer = alias_completer subparsers.add_parser("ps", help="List running rapid-mlx servers") @@ -4135,9 +4135,9 @@ def main(): chat_parser.add_argument( "model", nargs="?", - default="qwen3.5-4b", - help="Model alias (e.g. qwen3.5-4b) or HF repo (org/name). " - "Defaults to qwen3.5-4b when omitted.", + default="qwen3.5-4b-4bit", + help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (org/name). " + "Defaults to qwen3.5-4b-4bit when omitted.", ).completer = alias_completer chat_parser.add_argument( "--system", @@ -4213,7 +4213,7 @@ def main(): ) info_parser.add_argument( "model", - help="Model alias (e.g. qwen3.5-4b) or HF repo (e.g. 
mlx-community/SmolLM3-3B-4bit)", + help="Model alias (e.g. qwen3.5-4b-4bit) or HF repo (e.g. mlx-community/SmolLM3-3B-4bit)", ).completer = alias_completer # Agents command @@ -4271,14 +4271,14 @@ def main(): "--model", type=str, default=None, - help="Model alias for check tier (default: qwen3.5-35b)", + help="Model alias for check tier (default: qwen3.5-35b-8bit)", ).completer = alias_completer doctor_parser.add_argument( "--models", type=str, default=None, help="Comma-separated model aliases for full / benchmark tiers " - "(full default: qwen3.5-35b,qwen3.6-35b; " + "(full default: qwen3.5-35b-8bit,qwen3.6-35b-4bit; " "benchmark default: auto-discovered from local cache)", ).completer = alias_csv_completer doctor_parser.add_argument( diff --git a/vllm_mlx/doctor/__init__.py b/vllm_mlx/doctor/__init__.py index 2e1ebafb..4227c476 100644 --- a/vllm_mlx/doctor/__init__.py +++ b/vllm_mlx/doctor/__init__.py @@ -4,8 +4,8 @@ Three tiers + a benchmark sweep: - smoke (~2 min, no model) — pytest, ruff, CLI sanity - - check (~15 min, qwen3.5-35b) — server + perf + agents + baseline diff - - full (~2-3 hr, 2 models) — check across qwen3.5-35b + qwen3.6-35b + - check (~15 min, qwen3.5-35b-8bit) — server + perf + agents + baseline diff + - full (~2-3 hr, 2 models) — check across qwen3.5-35b-8bit + qwen3.6-35b-4bit - benchmark (overnight, all models) — cross-model × cross-engine scorecard Entry point: ``rapid-mlx doctor {smoke,check,full,benchmark}`` diff --git a/vllm_mlx/doctor/baseline.py b/vllm_mlx/doctor/baseline.py index c0812598..f84d5f43 100644 --- a/vllm_mlx/doctor/baseline.py +++ b/vllm_mlx/doctor/baseline.py @@ -10,7 +10,7 @@ { "captured_at": "2026-04-15T21:00:00", "rapid_mlx_version": "0.5.1", - "model": "qwen3.5-35b", + "model": "qwen3.5-35b-8bit", "metrics": { "decode_tps": 156.2, "ttft_cold_ms": 412, diff --git a/vllm_mlx/doctor/cli.py b/vllm_mlx/doctor/cli.py index 901f1a91..ace4abae 100644 --- a/vllm_mlx/doctor/cli.py +++ b/vllm_mlx/doctor/cli.py @@ -20,7 +20,7 
@@ # Default model used by the check tier. Tier 3 (full) loops a wider list. # A real-capacity 8-bit model is required so eval failures can be cleanly # attributed to rapid-mlx bugs rather than small-model quant noise. -DEFAULT_CHECK_MODEL = "qwen3.5-35b" +DEFAULT_CHECK_MODEL = "qwen3.5-35b-8bit" # Model list for the full tier: real-capacity Qwen lines only. No 4B # (small models can't separate model errors from engine errors) and no @@ -29,7 +29,7 @@ # refusal, multi-turn context drift; failures don't cleanly attribute # to rapid-mlx so it's noise here). Add Gemma back when a tighter # instruct variant ships. -DEFAULT_FULL_MODELS = ["qwen3.5-35b", "qwen3.6-35b"] +DEFAULT_FULL_MODELS = ["qwen3.5-35b-8bit", "qwen3.6-35b-4bit"] # Agent profiles to exercise per-model in the full tier. None ⇒ all # loaded profiles. Limit here if a particular profile is too slow to @@ -180,7 +180,7 @@ def run_full_tier(models: list[str], update_baselines: bool = False): update_baselines=update_baselines, agent_profiles=profile_names, # boot_timeout_s=None → _suggested_boot_timeout picks 600s for - # the 27B+ models (qwen3.5-35b, gemma-4-26b) and 180s for + # the 27B+ models (qwen3.5-35b-8bit, gemma-4-26b-4bit) and 180s for # smaller ones, so the same logic applies regardless of which # tier called us. ) @@ -226,7 +226,7 @@ def _resolve_agent_profiles(explicit: list[str] | None) -> list[str]: # Earlier iterations tried to pick a tier-/alias-aware shorter budget for # the small-model case, but every heuristic missed at least one supported # large model (Qwen3-Coder lacks a 'NNb' hint in its alias, MiniMax M2.5 -# is huge but named 'minimax-m2.5', etc.). The optimisation isn't worth +# is huge but named 'minimax-m2.5-4bit', etc.). The optimisation isn't worth # the false-fail risk. 
DEFAULT_BOOT_TIMEOUT_S = 600 diff --git a/vllm_mlx/engine/batched.py b/vllm_mlx/engine/batched.py index 40b9253a..aa71c8a3 100644 --- a/vllm_mlx/engine/batched.py +++ b/vllm_mlx/engine/batched.py @@ -1199,7 +1199,7 @@ async def chat( # # Hybrid-only gate: the boundary split routes through # ``BatchGenerator.insert_segments`` which on pure-Transformer models - # (e.g. gpt-oss-20b harmony) corrupts the harmony tool-call channel + # (e.g. gpt-oss-20b-mxfp4-q8 harmony) corrupts the harmony tool-call channel # state across multi-turn-with-tools and the agent loops forever. # Pure Transformers don't need the boundary save anyway — the prefix # cache already reuses via trim+supersequence. Only hybrid models @@ -1234,7 +1234,7 @@ def _is_hybrid_model(self) -> bool: Pure Transformer models don't have this constraint — trim works — so they don't need the boundary save. Worse, the boundary - split routes through ``insert_segments`` which on gpt-oss-20b + split routes through ``insert_segments`` which on gpt-oss-20b-mxfp4-q8 empirically corrupts harmony tool-call channel state across multi-turn-with-tools (pydantic_ai multi_tool 5/6 → loops on ``add(3,4)``). Gating the entire boundary path on this flag is @@ -1518,7 +1518,7 @@ async def _stream_with_output_router( # returns a response missing the ``logprobs`` field entirely # because ``_extract_streaming_token_logprobs`` sees # ``chunk.logprobs is None`` for every routed chunk. Confirmed - # on gpt-oss-20b PyPI v0.6.66 during the 2026-05-23 onboarding + # on gpt-oss-20b-mxfp4-q8 PyPI v0.6.66 during the 2026-05-23 onboarding # sweep. 
PR #450 fixed the pre-existing AttributeError on the # non-routed path but couldn't surface this gap because its # tests use single-token GenerationOutput stubs that never go @@ -1546,7 +1546,7 @@ async def _stream_with_output_router( # TOOL_CALL it would override the accumulated body with # just the end-marker token's text, dropping the body on # the floor and breaking streaming tool calls for gemma4 - # and harmony — caught on gemma-4-26b post-v0.6.61. + # and harmony — caught on gemma-4-26b-4bit post-v0.6.61. if event.channel == Channel.TOOL_CALL: event_text = event.text # Tool-call channel aggregates many tokens; the diff --git a/vllm_mlx/model_aliases.py b/vllm_mlx/model_aliases.py index 6360e9c3..2cefef61 100644 --- a/vllm_mlx/model_aliases.py +++ b/vllm_mlx/model_aliases.py @@ -38,8 +38,8 @@ # Reverse index: hf_path → first alias that references it. Built once # alongside ``_aliases`` so reverse lookups in ``resolve_profile`` are # O(1) instead of scanning all 50+ profiles on every cache-miss. -# When two aliases share the same hf_path (e.g. ``nemotron-30b`` and -# ``nemotron-nano`` both pointing at the same MLX repo), the first one +# When two aliases share the same hf_path (as the now-dropped ``nemotron-nano`` +# did with ``nemotron-30b-4bit``, both pointing at the same MLX repo), the first one # in JSON order wins. The contract is "any profile valid for this # path" rather than "the canonical alias", so this is fine. _hf_to_alias: dict[str, str] | None = None @@ -306,7 +306,7 @@ def resolve_profile(name: str) -> AliasProfile | None: """Return the profile for an alias name or full HF path. Two lookups in order: - 1. Direct alias name match (``qwen3.5-4b``). + 1. Direct alias name match (``qwen3.5-4b-4bit``). 2. Reverse HF-path match (``mlx-community/Qwen3.5-4B-MLX-4bit``) via the pre-built ``_hf_to_alias`` index — O(1).
@@ -332,7 +332,7 @@ def _family_prefix(name: str) -> str: ``hermes`` → ``hermes`` (single token, no change) Used to keep typo suggestions inside the same family — ``deepseek-v4-27b`` - suggests ``deepseek-v4-flash``, not ``deepseek-r1-32b``. + suggests ``deepseek-v4-flash-8bit``, not ``deepseek-r1-32b-4bit``. """ parts = name.split("-") while parts: @@ -355,7 +355,7 @@ def _letters_only_prefix(name: str) -> str: returns nothing useful — handles cases where the user collapses or inserts separators we don't use (``gemma4-27b`` → ``gemma``, matches our ``gemma-4-*`` and ``gemma3-*`` aliases; ``mistral24b`` → - ``mistral``, matches ``mistral-24b``). + ``mistral``, matches ``mistral-24b-4bit``). """ out = [] for ch in name.lower(): @@ -372,7 +372,7 @@ def suggest_similar(name: str, n: int = 3, cutoff: float = 0.5) -> list[str]: Family-aware in two passes: 1. **Strict family match** — uses ``_family_prefix`` (drops trailing size/quant tokens). Keeps the wrong-family bait-and-switch (typing - ``deepseek-v4-27b`` and being told ``deepseek-r1-32b``) from + ``deepseek-v4-27b`` and being told ``deepseek-r1-32b-4bit``) from happening, and prevents legitimate single-segment HuggingFace IDs like ``gpt2`` or ``bert-base-uncased`` from spuriously matching. 2. **Letter-only prefix fallback** — if step 1 finds nothing, retry @@ -397,7 +397,7 @@ def suggest_similar(name: str, n: int = 3, cutoff: float = 0.5) -> list[str]: if same_fam and same_fam != [fam]: # If we found candidates in the same strict family, trust the # cutoff — even if it filters everything out. The cutoff - # rejecting ``gpt2`` against ``gpt-oss-20b`` is the + # rejecting ``gpt2`` against ``gpt-oss-20b-mxfp4-q8`` is the # legitimate-HF-ID guarantee at work; the letter-only # fallback below would override that and is wrong here. 
return difflib.get_close_matches(name, same_fam, n=n, cutoff=cutoff) @@ -439,12 +439,12 @@ def suggest_similar(name: str, n: int = 3, cutoff: float = 0.5) -> list[str]: # the small/fast tier and one well-known representative per category — # auto-generation would spit out alphabetic noise like ``bonsai-*`` first. POPULAR_ALIASES: tuple[str, ...] = ( - "qwen3.5-4b", # default smoke / small - "qwen3.5-9b", # mid-size general - "qwen3.6-27b", # latest hybrid family - "qwen3-coder-30b", # coding - "gemma4", # gemma family rep (12B QAT 4-bit) - "llama3-3b", # tiny llama - "mistral-24b", # mistral - "deepseek-r1-32b", # reasoning + "qwen3.5-4b-4bit", # default smoke / small + "qwen3.5-9b-4bit", # mid-size general + "qwen3.6-27b-4bit", # latest hybrid family + "qwen3-coder-30b-4bit", # coding + "gemma-4-12b-qat-4bit", # gemma family rep (12B QAT 4-bit) + "llama3-3b-4bit", # tiny llama + "mistral-24b-4bit", # mistral + "deepseek-r1-32b-4bit", # reasoning ) diff --git a/vllm_mlx/model_auto_config.py b/vllm_mlx/model_auto_config.py index 1ecdf057..267460c0 100644 --- a/vllm_mlx/model_auto_config.py +++ b/vllm_mlx/model_auto_config.py @@ -317,7 +317,7 @@ def detect_model_config(model_path: str) -> ModelConfig | None: Two-stage lookup: 1. **Alias profile** (single source of truth) — if ``model_path`` is a - known alias name (``qwen3.5-4b``) or maps to one's HF path + known alias name (``qwen3.5-4b-4bit``) or maps to one's HF path (``mlx-community/Qwen3.5-4B-MLX-4bit``), return that profile's config directly. This guarantees per-alias granularity for any optimization that varies by size/quant within a family. diff --git a/vllm_mlx/output_router_harmony.py b/vllm_mlx/output_router_harmony.py index 2ab98413..ee19ed8b 100644 --- a/vllm_mlx/output_router_harmony.py +++ b/vllm_mlx/output_router_harmony.py @@ -73,7 +73,7 @@ # HuggingFace hub cache snapshot dir pattern. 
Path components have the # form ``models----`` so ``/.../models--openai--gpt-oss-20b -# /snapshots//`` resolves to the identity ``openai/gpt-oss-20b`` +# /snapshots//`` resolves to the identity ``openai/gpt-oss-20b-mxfp4-q8`` # (the basename is the snapshot SHA, which on its own gives no hint # that this is a gpt-oss tokenizer). Codex round-14 BLOCKING — the # previous basename-only check rejected this path shape and the gate @@ -99,7 +99,7 @@ # * round-12: anchored basename still let arbitrary owners through # (``some-user/gpt-oss-remapped``) → restrict to known owners. # * round-13: pure remote-id-prefix matching rejected legitimate -# LOCAL paths (``/models/gpt-oss-20b``, ``~/.cache/.../gpt-oss-20b``) +# LOCAL paths (``/models/gpt-oss-20b-mxfp4-q8``, ``~/.cache/.../gpt-oss-20b-mxfp4-q8``) # and made production fall back to the leaking legacy router. # * round-14: HF cache snapshot dir ``models--openai--gpt-oss-20b # /snapshots/`` has SHA basename → recognise the ``models-- @@ -177,7 +177,7 @@ def _is_known_harmony_identity(name_or_path: str) -> bool: # and corrupts content / tool-call arguments (codex round-1 BLOCKING). # Pick short strings that exercise common body-vocab regions: plain # English, JSON-shaped text, and the smoking-gun multi-token word -# ``commentary`` from PR #514 (``comment``+``ary`` on gpt-oss-20b). +# ``commentary`` from PR #514 (``comment``+``ary`` on gpt-oss-20b-mxfp4-q8). _BODY_VOCAB_PROBES = ( "Hello world", 'functions.get_weather {"a":1}', diff --git a/vllm_mlx/reasoning/harmony_parser.py b/vllm_mlx/reasoning/harmony_parser.py index 5f352c58..6fb69c5e 100644 --- a/vllm_mlx/reasoning/harmony_parser.py +++ b/vllm_mlx/reasoning/harmony_parser.py @@ -26,7 +26,7 @@ ) # Final channel content. 
Harmony spec uses ``<|return|>`` to terminate the -# final channel, but gpt-oss-20b emits ``<|end|>`` in practice for a sizeable +# final channel, but gpt-oss-20b-mxfp4-q8 emits ``<|end|>`` in practice for a sizeable # fraction of non-streaming responses (observed in v0.6.64 pr_validate runs: # anthropic_sdk 0/5, langchain 2/6, pydantic_ai 1/6 on # ``mlx-community/gpt-oss-20b-MXFP4-Q8`` — every non-streaming test landed diff --git a/vllm_mlx/routes/anthropic.py b/vllm_mlx/routes/anthropic.py index 0a565ab1..17dfc09f 100644 --- a/vllm_mlx/routes/anthropic.py +++ b/vllm_mlx/routes/anthropic.py @@ -530,7 +530,7 @@ async def _stream_anthropic_messages( # stripped at the token layer, so its state machine # never leaves the "Unknown channel, suppress" arm and # this loop emits no ``content_block_delta`` events. The - # symptom (v0.6.64 pr_validate on gpt-oss-20b: anthropic + # symptom (v0.6.64 pr_validate on gpt-oss-20b-mxfp4-q8: anthropic # stream test 4 returned 0 content chunks) is the # streaming counterpart of the non-streaming empty- # TextBlock bug fixed in diff --git a/vllm_mlx/runtime/model_registry.py b/vllm_mlx/runtime/model_registry.py index f91e37b0..28932e9a 100644 --- a/vllm_mlx/runtime/model_registry.py +++ b/vllm_mlx/runtime/model_registry.py @@ -7,11 +7,11 @@ Usage: registry = ModelRegistry() - registry.add("qwen3.5-4b", engine, is_default=True) - registry.add("qwen3.5-27b", engine2) + registry.add("qwen3.5-4b-4bit", engine, is_default=True) + registry.add("qwen3.5-27b-4bit", engine2) # Request routing - engine = registry.get_engine("qwen3.5-27b") # specific + engine = registry.get_engine("qwen3.5-27b-4bit") # specific engine = registry.get_engine("default") # default engine = registry.get_engine(None) # default """ diff --git a/vllm_mlx/service/postprocessor.py b/vllm_mlx/service/postprocessor.py index 0809cae9..02ffa2f0 100644 --- a/vllm_mlx/service/postprocessor.py +++ b/vllm_mlx/service/postprocessor.py @@ -702,7 +702,7 @@ def 
_process_channel_routed( # to the client but leave both accumulators empty — _build_usage # then sees ``reasoning_text=None`` and omits the field entirely, # creating stream/non-stream usage shape drift. Verified on - # gemma-4-26b + gpt-oss-20b during the v0.6.66 onboarding sweep. + # gemma-4-26b-4bit + gpt-oss-20b-mxfp4-q8 during the v0.6.66 onboarding sweep. if content: self.accumulated_text += content if reasoning: @@ -982,7 +982,7 @@ def finalize(self) -> list[StreamEvent]: # Previously gated on ``has_pending_tool_call`` — but that gate # uses the SAME canonical-wrapper check as the streaming parser, so # by construction it can never catch what the streaming parser - # missed. The 2026-05-20 ≥20B onboarding sweep caught gemma-4-26b + # missed. The 2026-05-20 ≥20B onboarding sweep caught gemma-4-26b-4bit # producing structured tool_calls in non-stream mode that the # streaming parser dropped on the floor; the only difference between # the two modes was this gate. See knowledge/guided_generation_gaps_2026-05-20.md diff --git a/vllm_mlx/share/cli.py b/vllm_mlx/share/cli.py index 4f7431d8..06746c96 100644 --- a/vllm_mlx/share/cli.py +++ b/vllm_mlx/share/cli.py @@ -237,7 +237,7 @@ def _pick_port(preferred: int) -> int: def _resolve_served_model_name(port: int, api_key: str) -> str | None: """Read the model id rapid-mlx serve is exposing via /v1/models. - The CLI accepts a short alias (``qwen3.5-4b``) but the OpenAI + The CLI accepts a short alias (``qwen3.5-4b-4bit``) but the OpenAI endpoint only recognises the full HF model id (``mlx-community/Qwen3.5-4B-MLX-4bit``). Without this lookup the curl example we paste into the security banner fails on first @@ -459,10 +459,10 @@ def share_command(args: argparse.Namespace) -> None: # alias resolution BEFORE dispatching to us — by the time we get # here ``args.model`` is the rewritten HF repo (e.g. # ``mlx-community/Qwen3.5-4B-MLX-4bit``) and the user-typed alias - # lives on ``args._original_alias`` (e.g. 
``qwen3.5-4b``). The child + # lives on ``args._original_alias`` (e.g. ``qwen3.5-4b-4bit``). The child # ``serve`` subprocess re-runs alias resolution on whatever we pass # it. We want the child to land the same way ``rapid-mlx serve - # qwen3.5-4b`` does — including setting ``_model_alias`` on the + # qwen3.5-4b-4bit`` does — including setting ``_model_alias`` on the # server so the public ``/v1/models`` endpoint advertises (and # accepts) the short alias the user actually typed. So we forward # the original alias to the child when one is set; fall back to @@ -767,7 +767,7 @@ def register(subparsers: argparse._SubParsersAction) -> None: ) p.add_argument( "model", - help="Alias to serve (same names as `rapid-mlx serve`, e.g. qwen3.5-4b)", + help="Alias to serve (same names as `rapid-mlx serve`, e.g. qwen3.5-4b-4bit)", ).completer = alias_completer p.add_argument( "--port", diff --git a/vllm_mlx/telemetry/redact.py b/vllm_mlx/telemetry/redact.py index e91bba9b..38dadc72 100644 --- a/vllm_mlx/telemetry/redact.py +++ b/vllm_mlx/telemetry/redact.py @@ -103,7 +103,7 @@ def bucket_memory_gb(bytes_: int) -> int: def normalize_model_path(path: str) -> str: """Pass through ``org/name`` repo IDs; redact local paths to ``""``. - A local ``./qwen3.5-9b`` checkout that resolve_model() prefers over + A local ``./qwen3.5-9b-4bit`` checkout that resolve_model() prefers over the alias would otherwise leak the user's home-directory layout via the model name. """ @@ -128,7 +128,7 @@ def normalize_model_path(path: str) -> str: if _HF_REPO_RE.match(path): return path return "" - # Bare alias names (``qwen3.5-9b``) are public + harmless. + # Bare alias names (``qwen3.5-9b-4bit``) are public + harmless. 
return path diff --git a/vllm_mlx/tool_parsers/harmony_tool_parser.py b/vllm_mlx/tool_parsers/harmony_tool_parser.py index d87f2fe4..bf4bc537 100644 --- a/vllm_mlx/tool_parsers/harmony_tool_parser.py +++ b/vllm_mlx/tool_parsers/harmony_tool_parser.py @@ -35,7 +35,7 @@ def _generate_tool_id() -> str: # Terminator: ``<|call|>`` is the in-output token, but the engine stops # generation when it emits it (``<|call|>`` is part of the harmony EOS # set), so the token is consumed and never appears in ``output_text``. -# Empirically (gpt-oss-20b via /v1/chat/completions, 2026-05-22) the +# Empirically (gpt-oss-20b-mxfp4-q8 via /v1/chat/completions, 2026-05-22) the # commentary block ends with the JSON args and no terminator. Accept # end-of-string OR the next channel marker as alternative terminators # so a complete-but-unterminated tool call still parses. Same regression From 7a39e93a79b2f124656b899ac8278f43538055fe Mon Sep 17 00:00:00 2001 From: Raullen Chai Date: Tue, 9 Jun 2026 17:38:20 -0700 Subject: [PATCH 2/6] fix(aliases): codex round 1 + lint cleanup on PR #547 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex BLOCKING (round 1, pr_validate): - ``tests/test_aliases_contract.py:455-457`` had duplicate ``"nemotron-30b-4bit"`` keys after the sweep collapsed both ``nemotron-30b`` and ``nemotron-nano`` to the same canonical name. Python silently overwrote the first key with the second, so the test no longer pinned both pre-rename cases. Collapsed to a single entry (the post-rename registry now contains only one nemotron alias). - ``tests/test_model_profiles_ssot.py:90-91`` had the same duplicate-in-tuple-iteration pattern. Same fix. - ``test_reverse_lookup_for_shared_hf_path_is_deterministic`` and ``test_reverse_lookup_handles_deepseek_v4_flash_duplicate`` were pinning the tie-break for two now-removed duplicate-hf_path pairs (``nemotron-30b`` / ``nemotron-nano`` and ``deepseek-v4-flash`` / ``deepseek-v4-flash-8bit``). 
  After the rename, no pair of aliases shares an hf_path, so the tie-break is unreachable from the live registry. Removed both tests with a comment pointing at the remaining reverse-lookup mechanism test (``test_reverse_lookup_index_built_once_after_first_load``).

Lint (ruff check + ruff format):
- Auto-fixable F401 / F541 / I001 across the 4 PR-touched scripts (``bench_engine_parity.py``, ``bench_readme_refresh.py``, ``local_bench_vs_ollama.py``, ``mhi_eval.py``). These were pre-existing issues the sweep re-surfaced.
- Manual fixes inside ``scripts/mhi_eval.py``:
  * ``from tau_bench.types import EnvRunResult`` is an availability probe, annotated ``# noqa: F401``.
  * E741 single-letter ``l`` rebound to ``ch``.
- Ruff format applied to the 7 touched .py files.

Full-unit (e2e):
- ``test_weather_with_fallback`` + ``test_multi_step_tool_chain`` failed once on the initial run, both passed on local rerun against the same live qwen3.5-4b-4bit server. These are model-behaviour tests (which tool name the model picks for a given prompt) and are flaky by design, not caused by this PR.
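Illustration of the duplicate-key hazard from the Codex BLOCKING item above. This is a minimal standalone sketch, not code from this PR: ``SOURCE`` is a hypothetical stand-in for the expectation table, and ``duplicate_dict_keys`` is a hypothetical helper name. It shows that a dict literal with a repeated key loses an entry without raising, and one way such repeats could be detected from the source text.

```python
import ast

# Hypothetical expectation table; the alias names only mirror the commit message.
SOURCE = '''
expected = {
    "nemotron-30b-4bit": "qwen3",
    "nemotron-30b-4bit": "qwen3",
    "kimi-k2.5-3bit": "qwen3",
}
'''


def duplicate_dict_keys(source: str) -> list:
    """Return constant keys that occur more than once within a dict literal."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # ``key`` is None for ``**mapping`` entries; those are skipped here.
            if isinstance(key, ast.Constant):
                if key.value in seen:
                    dupes.append(key.value)
                seen.add(key.value)
    return dupes


namespace = {}
exec(SOURCE, namespace)

# Three entries are written in the literal, but the repeated key collapses.
surviving = len(namespace["expected"])
dupes = duplicate_dict_keys(SOURCE)
print(surviving, dupes)
```

The runtime dict cannot reveal the collapse on its own, which is why the sketch inspects the parsed source rather than the resulting object.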
Co-Authored-By: Claude Opus 4.7 (1M context) --- scripts/bench_engine_parity.py | 132 ++++++++++++++++++++----- scripts/bench_readme_refresh.py | 5 +- scripts/local_bench_vs_ollama.py | 4 +- scripts/mhi_eval.py | 156 ++++++++++++++++++++---------- tests/test_aliases_contract.py | 5 +- tests/test_cli_argcomplete.py | 6 +- tests/test_cli_chat.py | 8 +- tests/test_model_profiles_ssot.py | 47 ++------- 8 files changed, 240 insertions(+), 123 deletions(-) diff --git a/scripts/bench_engine_parity.py b/scripts/bench_engine_parity.py index f2d6404e..747e82b4 100644 --- a/scripts/bench_engine_parity.py +++ b/scripts/bench_engine_parity.py @@ -91,7 +91,11 @@ def bench_ttft_cold(base_url: str, model: str) -> dict: ttft = (first_token_time - t0) if first_token_time else elapsed tps = total_tokens / (elapsed - ttft) if elapsed > ttft and total_tokens > 0 else 0 - return {"ttft_ms": round(ttft * 1000, 1), "decode_tps": round(tps, 1), "tokens": total_tokens} + return { + "ttft_ms": round(ttft * 1000, 1), + "decode_tps": round(tps, 1), + "tokens": total_tokens, + } def bench_ttft_cached(base_url: str, model: str) -> dict: @@ -135,13 +139,19 @@ def bench_ttft_cached(base_url: str, model: str) -> dict: elapsed = time.perf_counter() - t0 ttft = (first_token_time - t0) if first_token_time else elapsed - tps = total_tokens / (elapsed - ttft) if elapsed > ttft and total_tokens > 0 else 0 + tps = ( + total_tokens / (elapsed - ttft) + if elapsed > ttft and total_tokens > 0 + else 0 + ) results.append({"ttft_ms": round(ttft * 1000, 1), "decode_tps": round(tps, 1)}) # First is cold, rest are cached return { "cold_ttft_ms": results[0]["ttft_ms"], - "cached_ttft_ms": round(sum(r["ttft_ms"] for r in results[1:]) / len(results[1:]), 1), + "cached_ttft_ms": round( + sum(r["ttft_ms"] for r in results[1:]) / len(results[1:]), 1 + ), "avg_tps": round(sum(r["decode_tps"] for r in results) / len(results), 1), } @@ -223,7 +233,12 @@ def bench_decode_long(base_url: str, model: str) -> dict: 
f"{base_url}/chat/completions", json={ "model": model, - "messages": [{"role": "user", "content": "Write a detailed essay about the history of computing. Be thorough."}], + "messages": [ + { + "role": "user", + "content": "Write a detailed essay about the history of computing. Be thorough.", + } + ], "max_tokens": 256, "stream": True, }, @@ -244,7 +259,11 @@ def bench_decode_long(base_url: str, model: str) -> dict: decode_time = elapsed - ttft tps = total_tokens / decode_time if decode_time > 0 and total_tokens > 0 else 0 - return {"ttft_ms": round(ttft * 1000, 1), "decode_tps": round(tps, 1), "tokens": total_tokens} + return { + "ttft_ms": round(ttft * 1000, 1), + "decode_tps": round(tps, 1), + "tokens": total_tokens, + } def run_suite(name: str, base_url: str, model: str) -> dict: @@ -258,11 +277,15 @@ def run_suite(name: str, base_url: str, model: str) -> dict: print("\n [1/5] Cold TTFT...") results["cold"] = bench_ttft_cold(base_url, model) - print(f" TTFT: {results['cold']['ttft_ms']}ms, {results['cold']['decode_tps']} tok/s") + print( + f" TTFT: {results['cold']['ttft_ms']}ms, {results['cold']['decode_tps']} tok/s" + ) print(" [2/5] Cached TTFT (3 turns, same system prompt)...") results["cached"] = bench_ttft_cached(base_url, model) - print(f" Cold: {results['cached']['cold_ttft_ms']}ms, Cached: {results['cached']['cached_ttft_ms']}ms, {results['cached']['avg_tps']} tok/s") + print( + f" Cold: {results['cached']['cold_ttft_ms']}ms, Cached: {results['cached']['cached_ttft_ms']}ms, {results['cached']['avg_tps']} tok/s" + ) print(" [3/5] Multi-turn (4 turns)...") results["multi_turn"] = bench_multi_turn(base_url, model) @@ -270,11 +293,15 @@ def run_suite(name: str, base_url: str, model: str) -> dict: print(" [4/5] Tool call (3 calls)...") results["tool_call"] = bench_tool_call(base_url, model) - print(f" Avg: {results['tool_call']['avg_latency_ms']}ms, {results['tool_call']['success_rate']:.0%} success") + print( + f" Avg: 
{results['tool_call']['avg_latency_ms']}ms, {results['tool_call']['success_rate']:.0%} success" + ) print(" [5/5] Long decode (256 tokens)...") results["long_decode"] = bench_decode_long(base_url, model) - print(f" {results['long_decode']['decode_tps']} tok/s, {results['long_decode']['tokens']} tokens") + print( + f" {results['long_decode']['decode_tps']} tok/s, {results['long_decode']['tokens']} tokens" + ) return results @@ -288,9 +315,11 @@ def main(): print(f" {name} ({url}): OK") except Exception as e: print(f" {name} ({url}): NOT AVAILABLE — {e}") - print(f"\nPlease start both servers:") - print(f" Terminal 1: rapid-mlx serve qwen3.5-4b-4bit --port 8000") - print(f" Terminal 2: rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching") + print("\nPlease start both servers:") + print(" Terminal 1: rapid-mlx serve qwen3.5-4b-4bit --port 8000") + print( + " Terminal 2: rapid-mlx serve qwen3.5-4b-4bit --port 8001 --continuous-batching" + ) sys.exit(1) model_simple = detect_model(SIMPLE_URL) @@ -302,38 +331,90 @@ def main(): # Compare print(f"\n{'=' * 60}") - print(f" COMPARISON: BatchedEngine vs SimpleEngine") + print(" COMPARISON: BatchedEngine vs SimpleEngine") print(f"{'=' * 60}") comparisons = [ - ("Cold TTFT", simple_results["cold"]["ttft_ms"], batched_results["cold"]["ttft_ms"], "ms", True), - ("Cold decode", simple_results["cold"]["decode_tps"], batched_results["cold"]["decode_tps"], "tok/s", False), - ("Cached TTFT", simple_results["cached"]["cached_ttft_ms"], batched_results["cached"]["cached_ttft_ms"], "ms", True), - ("Cached decode", simple_results["cached"]["avg_tps"], batched_results["cached"]["avg_tps"], "tok/s", False), - ("Multi-turn avg", simple_results["multi_turn"]["avg_turn_ms"], batched_results["multi_turn"]["avg_turn_ms"], "ms", True), - ("Tool call avg", simple_results["tool_call"]["avg_latency_ms"], batched_results["tool_call"]["avg_latency_ms"], "ms", True), - ("Long decode", simple_results["long_decode"]["decode_tps"], 
batched_results["long_decode"]["decode_tps"], "tok/s", False), + ( + "Cold TTFT", + simple_results["cold"]["ttft_ms"], + batched_results["cold"]["ttft_ms"], + "ms", + True, + ), + ( + "Cold decode", + simple_results["cold"]["decode_tps"], + batched_results["cold"]["decode_tps"], + "tok/s", + False, + ), + ( + "Cached TTFT", + simple_results["cached"]["cached_ttft_ms"], + batched_results["cached"]["cached_ttft_ms"], + "ms", + True, + ), + ( + "Cached decode", + simple_results["cached"]["avg_tps"], + batched_results["cached"]["avg_tps"], + "tok/s", + False, + ), + ( + "Multi-turn avg", + simple_results["multi_turn"]["avg_turn_ms"], + batched_results["multi_turn"]["avg_turn_ms"], + "ms", + True, + ), + ( + "Tool call avg", + simple_results["tool_call"]["avg_latency_ms"], + batched_results["tool_call"]["avg_latency_ms"], + "ms", + True, + ), + ( + "Long decode", + simple_results["long_decode"]["decode_tps"], + batched_results["long_decode"]["decode_tps"], + "tok/s", + False, + ), ] - print(f"\n {'Metric':<20s} {'Simple':>10s} {'Batched':>10s} {'Diff':>10s} {'Verdict':>10s}") + print( + f"\n {'Metric':<20s} {'Simple':>10s} {'Batched':>10s} {'Diff':>10s} {'Verdict':>10s}" + ) print(f" {'─' * 62}") all_pass = True for name, simple_val, batched_val, unit, lower_is_better in comparisons: if lower_is_better: - diff_pct = ((batched_val - simple_val) / simple_val * 100) if simple_val > 0 else 0 + diff_pct = ( + ((batched_val - simple_val) / simple_val * 100) if simple_val > 0 else 0 + ) verdict = "OK" if diff_pct < 5 else "WARN" if diff_pct < 10 else "FAIL" else: - diff_pct = ((simple_val - batched_val) / simple_val * 100) if simple_val > 0 else 0 + diff_pct = ( + ((simple_val - batched_val) / simple_val * 100) if simple_val > 0 else 0 + ) verdict = "OK" if diff_pct < 5 else "WARN" if diff_pct < 10 else "FAIL" if verdict != "OK": all_pass = False sign = "+" if diff_pct > 0 else "" - print(f" {name:<20s} {simple_val:>8.1f}{unit:>3s} {batched_val:>8.1f}{unit:>3s} 
{sign}{diff_pct:>+7.1f}% {verdict:>8s}") + print( + f" {name:<20s} {simple_val:>8.1f}{unit:>3s} {batched_val:>8.1f}{unit:>3s} {sign}{diff_pct:>+7.1f}% {verdict:>8s}" + ) - print(f"\n Overall: {'PASS — BatchedEngine within 5%' if all_pass else 'REVIEW NEEDED'}") + print( + f"\n Overall: {'PASS — BatchedEngine within 5%' if all_pass else 'REVIEW NEEDED'}" + ) # Save results output = { @@ -347,6 +428,7 @@ def main(): out_path = "reports/engine_parity_benchmark.json" import os + os.makedirs(os.path.dirname(out_path), exist_ok=True) with open(out_path, "w") as f: json.dump(output, f, indent=2) diff --git a/scripts/bench_readme_refresh.py b/scripts/bench_readme_refresh.py index 50a2fb32..a2fc1999 100644 --- a/scripts/bench_readme_refresh.py +++ b/scripts/bench_readme_refresh.py @@ -108,7 +108,10 @@ class ModelSpec: "Ollama Gemma 3 12B (Gemma 4 not yet on llama.cpp)", ), ModelSpec( - "gpt-oss-20b-mxfp4-q8", "mlx-community/gpt-oss-20b-MXFP4-Q8", "gpt-oss:20b", "Same arch" + "gpt-oss-20b-mxfp4-q8", + "mlx-community/gpt-oss-20b-MXFP4-Q8", + "gpt-oss:20b", + "Same arch", ), ModelSpec( "qwen3.6-35b-4bit", diff --git a/scripts/local_bench_vs_ollama.py b/scripts/local_bench_vs_ollama.py index e8dd23cf..a9a4c268 100644 --- a/scripts/local_bench_vs_ollama.py +++ b/scripts/local_bench_vs_ollama.py @@ -683,7 +683,9 @@ def summary_row(metric: str, ratio: float, desc: str) -> None: # ── Main ────────────────────────────────────────────────────────────────────── def main() -> int: parser = argparse.ArgumentParser(description="Benchmark Rapid-MLX vs Ollama") - parser.add_argument("--model", default="qwen3.5-4b-4bit", help="Rapid-MLX model name") + parser.add_argument( + "--model", default="qwen3.5-4b-4bit", help="Rapid-MLX model name" + ) parser.add_argument( "--ollama-model", default=None, diff --git a/scripts/mhi_eval.py b/scripts/mhi_eval.py index 77f7843f..d49ca9ee 100644 --- a/scripts/mhi_eval.py +++ b/scripts/mhi_eval.py @@ -25,18 +25,17 @@ import os import subprocess import sys 
-import tempfile -import textwrap import time from datetime import datetime -from pathlib import Path # --------------------------------------------------------------------------- # OpenAI client helper # --------------------------------------------------------------------------- + def get_client(base_url: str, api_key: str = "not-needed"): from openai import OpenAI + return OpenAI(base_url=base_url, api_key=api_key) @@ -51,14 +50,17 @@ def detect_model(client) -> str: TAU_TASK_IDS = [24, 10, 5, 17, 33, 14, 15, 20, 30, 4] + def run_tau_bench(base_url: str, model: str, api_key: str = "not-needed") -> dict: """Run 10 curated TAU-bench retail tasks.""" try: - from tau_bench.envs import get_env from tau_bench.agents.tool_calling_agent import ToolCallingAgent - from tau_bench.types import EnvRunResult + from tau_bench.envs import get_env + from tau_bench.types import EnvRunResult # noqa: F401 — availability probe except ImportError: - return {"error": "tau-bench not installed. pip install tau-bench @ git+https://github.com/sierra-research/tau-bench.git"} + return { + "error": "tau-bench not installed. 
pip install tau-bench @ git+https://github.com/sierra-research/tau-bench.git" + } os.environ["OPENAI_API_KEY"] = api_key os.environ["OPENAI_API_BASE"] = base_url @@ -102,12 +104,14 @@ def run_tau_bench(base_url: str, model: str, api_key: str = "not-needed") -> dic status = "PASS" if reward == 1.0 else "FAIL" print(f" [TAU] Task {idx:3d}: {status} ({elapsed:.1f}s)") - results.append({ - "task_id": idx, - "reward": reward, - "elapsed_s": round(elapsed, 1), - "error": error, - }) + results.append( + { + "task_id": idx, + "reward": reward, + "elapsed_s": round(elapsed, 1), + "error": error, + } + ) passed = sum(1 for r in results if r["reward"] == 1.0) score = passed / len(results) @@ -125,8 +129,16 @@ def run_tau_bench(base_url: str, model: str, api_key: str = "not-needed") -> dic # --------------------------------------------------------------------------- HUMANEVAL_IDS = [ - "HumanEval/0", "HumanEval/1", "HumanEval/2", "HumanEval/3", "HumanEval/4", - "HumanEval/5", "HumanEval/6", "HumanEval/7", "HumanEval/8", "HumanEval/9", + "HumanEval/0", + "HumanEval/1", + "HumanEval/2", + "HumanEval/3", + "HumanEval/4", + "HumanEval/5", + "HumanEval/6", + "HumanEval/7", + "HumanEval/8", + "HumanEval/9", ] @@ -160,7 +172,14 @@ def run_humaneval(base_url: str, model: str, api_key: str = "not-needed") -> dic ) completion = resp.choices[0].text or "" except Exception as e: - results.append({"task_id": task_id, "passed": False, "elapsed_s": round(time.time() - t0, 1), "error": str(e)}) + results.append( + { + "task_id": task_id, + "passed": False, + "elapsed_s": round(time.time() - t0, 1), + "error": str(e), + } + ) print(f" [HumanEval] {task_id}: FAIL (API error)") continue @@ -180,12 +199,14 @@ def run_humaneval(base_url: str, model: str, api_key: str = "not-needed") -> dic status = "PASS" if passed else "FAIL" print(f" [HumanEval] {task_id}: {status} ({elapsed:.1f}s)") - results.append({ - "task_id": task_id, - "passed": passed, - "elapsed_s": round(elapsed, 1), - "error": None, - 
}) + results.append( + { + "task_id": task_id, + "passed": passed, + "elapsed_s": round(elapsed, 1), + "error": None, + } + ) passed_count = sum(1 for r in results if r.get("passed")) score = passed_count / len(results) @@ -264,7 +285,9 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict: # Use the pre-formatted 5-shot prompt from tinyMMLU formatted = item.get("input_formatted", "") if not formatted: - choices_text = "\n".join(f"{chr(65+i)}. {c}" for i, c in enumerate(choices)) + choices_text = "\n".join( + f"{chr(65 + i)}. {c}" for i, c in enumerate(choices) + ) formatted = f"{question}\n{choices_text}\nAnswer:" t0 = time.time() @@ -278,7 +301,14 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict: ) answer = resp.choices[0].text or "" except Exception as e: - results.append({"idx": idx, "correct": False, "elapsed_s": round(time.time() - t0, 1), "error": str(e)}) + results.append( + { + "idx": idx, + "correct": False, + "elapsed_s": round(time.time() - t0, 1), + "error": str(e), + } + ) print(f" [MMLU] Q{idx} ({subject}): FAIL (API error)") continue @@ -287,17 +317,21 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict: correct = predicted == correct_letter elapsed = time.time() - t0 - status = "PASS" if correct else f"FAIL (got {predicted}, expected {correct_letter})" + status = ( + "PASS" if correct else f"FAIL (got {predicted}, expected {correct_letter})" + ) print(f" [MMLU] Q{idx} ({subject}): {status} ({elapsed:.1f}s)") - results.append({ - "idx": idx, - "subject": subject, - "correct": correct, - "predicted": predicted, - "expected": correct_letter, - "elapsed_s": round(elapsed, 1), - "error": None, - }) + results.append( + { + "idx": idx, + "subject": subject, + "correct": correct, + "predicted": predicted, + "expected": correct_letter, + "elapsed_s": round(elapsed, 1), + "error": None, + } + ) correct_count = sum(1 for r in results if r.get("correct")) score = correct_count 
/ len(results) @@ -313,22 +347,23 @@ def run_mmlu(base_url: str, model: str, api_key: str = "not-needed") -> dict: def _extract_letter(text: str) -> str: """Extract A/B/C/D from model response.""" import re + text = text.strip() # Direct single letter if len(text) == 1 and text.upper() in "ABCD": return text.upper() # "The answer is B" / "Answer: B" / "correct answer is C" - m = re.search(r'(?:answer|option)\s*(?:is|:)\s*([A-Da-d])', text, re.IGNORECASE) + m = re.search(r"(?:answer|option)\s*(?:is|:)\s*([A-Da-d])", text, re.IGNORECASE) if m: return m.group(1).upper() # "B." or "B)" at start of line - m = re.search(r'^([A-Da-d])[.\):]', text, re.MULTILINE) + m = re.search(r"^([A-Da-d])[.\):]", text, re.MULTILINE) if m: return m.group(1).upper() # Last single letter A-D in the text (models often explain then conclude) - letters = re.findall(r'\b([A-Da-d])\b', text) + letters = re.findall(r"\b([A-Da-d])\b", text) # Filter to only A-D - valid = [l.upper() for l in letters if l.upper() in "ABCD"] + valid = [ch.upper() for ch in letters if ch.upper() in "ABCD"] if valid: return valid[-1] # Take last mentioned letter return "?" 
@@ -339,9 +374,9 @@ def _extract_letter(text: str) -> str: # --------------------------------------------------------------------------- WEIGHTS = { - "tau_bench": 0.50, # Agent tool use — highest signal for model×harness - "humaneval": 0.30, # Code generation - "tinyMMLU": 0.20, # Knowledge baseline + "tau_bench": 0.50, # Agent tool use — highest signal for model×harness + "humaneval": 0.30, # Code generation + "tinyMMLU": 0.20, # Knowledge baseline } @@ -358,14 +393,28 @@ def compute_mhi(suite_results: dict) -> float: # Main # --------------------------------------------------------------------------- + def main(): parser = argparse.ArgumentParser(description="MHI Eval — Model-Harness Index") - parser.add_argument("--base-url", default="http://localhost:8000/v1", help="OpenAI-compatible API base URL") - parser.add_argument("--model", default=None, help="Model name (auto-detected if not set)") + parser.add_argument( + "--base-url", + default="http://localhost:8000/v1", + help="OpenAI-compatible API base URL", + ) + parser.add_argument( + "--model", default=None, help="Model name (auto-detected if not set)" + ) parser.add_argument("--api-key", default="not-needed", help="API key") - parser.add_argument("--suite", default="all", choices=["all", "tau", "humaneval", "mmlu"], help="Which suite to run") + parser.add_argument( + "--suite", + default="all", + choices=["all", "tau", "humaneval", "mmlu"], + help="Which suite to run", + ) parser.add_argument("--output", default=None, help="Output JSON path") - parser.add_argument("--label", default=None, help="Label for this run (e.g. 'qwopus27b+hermes')") + parser.add_argument( + "--label", default=None, help="Label for this run (e.g. 
'qwopus27b+hermes')" + ) args = parser.parse_args() # Detect model @@ -386,13 +435,13 @@ def main(): break label = name[:50] - print(f"\n{'='*60}") - print(f" MHI Eval — Model-Harness Index") + print(f"\n{'=' * 60}") + print(" MHI Eval — Model-Harness Index") print(f" Model: {model}") print(f" Label: {label}") print(f" Base URL: {args.base_url}") print(f" Suite: {args.suite}") - print(f"{'='*60}\n") + print(f"{'=' * 60}\n") results = {} t_start = time.time() @@ -421,7 +470,7 @@ def main(): mhi_score = compute_mhi(results) # Summary - print(f"\n{'='*60}") + print(f"\n{'=' * 60}") print(f" MHI Score: {mhi_score}") print(f" Label: {label}") print(f" Time: {total_time:.0f}s") @@ -429,8 +478,10 @@ def main(): for suite, weight in WEIGHTS.items(): if suite in results and "score" in results[suite]: r = results[suite] - print(f" {suite:12s}: {r['passed']}/{r['total']} ({r['score']:.0%}) × {weight:.0%} weight") - print(f"{'='*60}\n") + print( + f" {suite:12s}: {r['passed']}/{r['total']} ({r['score']:.0%}) × {weight:.0%} weight" + ) + print(f"{'=' * 60}\n") # Save results output = { @@ -444,7 +495,10 @@ def main(): "suites": results, } - out_path = args.output or f"reports/mhi/{label.replace('/', '_')}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json" + out_path = ( + args.output + or f"reports/mhi/{label.replace('/', '_')}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json" + ) os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True) with open(out_path, "w") as f: json.dump(output, f, indent=2) diff --git a/tests/test_aliases_contract.py b/tests/test_aliases_contract.py index c007dfe4..a5a2e159 100644 --- a/tests/test_aliases_contract.py +++ b/tests/test_aliases_contract.py @@ -453,7 +453,6 @@ def test_audit_batch_reasoning_parser_wirings() -> None: """ profiles = list_profiles() expected = { - "nemotron-30b-4bit": "qwen3", "nemotron-30b-4bit": "qwen3", "kimi-k2.5-3bit": "qwen3", "hermes4-70b-4bit": "glm4", @@ -574,7 +573,9 @@ def 
test_aliases_with_known_broken_hf_paths_stay_fixed() -> None: # gpt-oss-20b-mxfp4-q8 previously pointed at mlx-community/GPT-OSS-20B-4bit # which 404s; the canonical mlx-community release uses the # MXFP4-Q8 hybrid quantization. - assert profiles["gpt-oss-20b-mxfp4-q8"].hf_path != "mlx-community/GPT-OSS-20B-4bit", ( + assert ( + profiles["gpt-oss-20b-mxfp4-q8"].hf_path != "mlx-community/GPT-OSS-20B-4bit" + ), ( "gpt-oss-20b-mxfp4-q8 must not regress to the 404 path; current canonical " "upload is mlx-community/gpt-oss-20b-MXFP4-Q8." ) diff --git a/tests/test_cli_argcomplete.py b/tests/test_cli_argcomplete.py index b034c1cd..69a6413f 100644 --- a/tests/test_cli_argcomplete.py +++ b/tests/test_cli_argcomplete.py @@ -136,9 +136,9 @@ def test_alias_csv_completer_multiple_commas() -> None: carried through unchanged. Lock this in because rpartition vs partition is an easy-to-flip bug.""" result = alias_csv_completer("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") - assert all(m.startswith("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") for m in result), ( - "csv completer must preserve all prior csv tokens" - ) + assert all( + m.startswith("qwen3.5-4b-4bit,gemma-4-12b-4bit,qwen3.6-") for m in result + ), "csv completer must preserve all prior csv tokens" def test_aliases_path_resolves_to_real_file() -> None: diff --git a/tests/test_cli_chat.py b/tests/test_cli_chat.py index 3718449e..6aae57ec 100644 --- a/tests/test_cli_chat.py +++ b/tests/test_cli_chat.py @@ -1534,7 +1534,9 @@ def test_serve_accepts_no_think_as_alias_for_no_thinking(): ``no_thinking=True`` destination as ``serve --no-thinking``.""" captured: list = [] with ( - patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-think"]), + patch.object( + sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-think"] + ), patch.object(cli, "serve_command", side_effect=captured.append), ): cli.main() @@ -2247,7 +2249,9 @@ def test_serve_allow_abbrev_disabled_rejects_ambiguous_no_thi(capsys): """Same 
as the chat case — ``serve`` also got the hidden cross-alias and the same ambiguity must be reported, not silently resolved.""" with ( - patch.object(sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-thi"]), + patch.object( + sys, "argv", ["rapid-mlx", "serve", "qwen3.5-4b-4bit", "--no-thi"] + ), pytest.raises(SystemExit), ): cli.main() diff --git a/tests/test_model_profiles_ssot.py b/tests/test_model_profiles_ssot.py index 49169c92..65d22b86 100644 --- a/tests/test_model_profiles_ssot.py +++ b/tests/test_model_profiles_ssot.py @@ -88,7 +88,6 @@ def test_orphan_aliases_now_covered() -> None: "bonsai-8b-unpacked", "ministral-3b-4bit", "nemotron-30b-4bit", - "nemotron-30b-4bit", ): profile = resolve_profile(orphan) assert profile is not None, f"{orphan} regressed to orphan" @@ -339,43 +338,15 @@ def test_per_alias_schema_allows_independent_overrides() -> None: # ---- Reverse-lookup behaviour with shared hf_paths ----------------------- - - -def test_reverse_lookup_for_shared_hf_path_is_deterministic() -> None: - """Two aliases (``nemotron-30b-4bit`` and ``nemotron-30b-4bit``) point at the - same MLX repo. Reverse lookup by HF path should return the - JSON-insertion-order-first alias's profile, deterministically. - - The contract is "any profile valid for this path", but we lock in - the order so a future re-shuffle of aliases.json is forced to - explicitly update this test (which is the right place to think - about who's the canonical alias). - """ - profiles = list_profiles() - nemotron_30b = profiles["nemotron-30b-4bit"] - nemotron_nano = profiles["nemotron-30b-4bit"] - assert nemotron_30b.hf_path == nemotron_nano.hf_path - - # nemotron-30b-4bit appears first in aliases.json, so reverse lookup - # by the shared HF path returns nemotron-30b-4bit's profile object. 
- via_path = resolve_profile(nemotron_30b.hf_path) - assert via_path is not None - assert via_path is nemotron_30b - - -def test_reverse_lookup_handles_deepseek_v4_flash_duplicate() -> None: - """``deepseek-v4-flash-8bit`` and ``deepseek-v4-flash-8bit`` share - ``mlx-community/DeepSeek-V4-Flash-8bit`` — same regression guard - pattern as the nemotron pair, different family.""" - profiles = list_profiles() - flash = profiles["deepseek-v4-flash-8bit"] - flash_8bit = profiles["deepseek-v4-flash-8bit"] - assert flash.hf_path == flash_8bit.hf_path - via_path = resolve_profile(flash.hf_path) - assert via_path is not None - # Both profiles agree on capability flags (same model), so either - # would be correct semantically. Pin the JSON order winner. - assert via_path is flash +# +# The original two tests in this section pinned the duplicate-hf_path +# tie-break for ``(nemotron-30b, nemotron-nano)`` and +# ``(deepseek-v4-flash, deepseek-v4-flash-8bit)``. After the explicit-quant +# alias rename, those codename aliases are gone (see the PR description for +# ``feat/explicit-alias-naming``) and aliases.json no longer has any pair +# pointing at the same hf_path, so the tie-break is unreachable from the +# current registry. The reverse-lookup *mechanism* is still exercised by +# ``test_reverse_lookup_index_built_once_after_first_load`` below. def test_reverse_lookup_index_built_once_after_first_load() -> None: From bdb00818db3080c127c4b620fc17f04b51ce9fc8 Mon Sep 17 00:00:00 2001 From: Raullen Chai Date: Tue, 9 Jun 2026 18:01:57 -0700 Subject: [PATCH 3/6] fix(aliases): codex round 2 + sweep-collateral repair on PR #547 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex BLOCKING (round 2, pr_validate): - ``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:702`` — the sweep rewrote the canonical spoofing example ``evil-org/gpt-oss-20b`` into ``evil-org/gpt-oss-20b-mxfp4-q8``. 
The spoof shape (third-party org publishing under the same bare repo name OpenAI uses) is the exact case this matcher must reject; the alias-suffixed variant tests a strictly easier case. Restored the canonical form. (Adding the suffixed variant separately would be redundant — the matcher already covers the broader shape.) - Same file line 716 — the sweep also rewrote ``openai/gpt-oss-20b`` (OpenAI's real bare repo id on HuggingFace) into the rapid-mlx alias ``openai/gpt-oss-20b-mxfp4-q8``. The bare repo id is what the matcher actually sees from upstream tokenizers, so dropping the unsuffixed form would have left a gap. Restored the bare repo id and the bare-suffix variants of the local-path examples (``/models/gpt-oss-20b``, etc.) alongside the alias-suffixed cases the sweep wrote. Codex NIT (round 2, pr_validate): - ``harness/README.md:104`` previously said the ``full`` tier's two baselines are "both 8-bit". After the rename, ``qwen3.6-35b`` is the 4-bit variant. Reworded to call out each model's quant. - ``scripts/rename_aliases.py`` — the ``dropped`` counter always printed 0 because dropped codename aliases store their redirect target as a non-None string in ``rename_map``. Reworked the three counters (renamed / dropped / kept) to compute from the input data's perspective so they always sum back to the input alias count and ``dropped`` is the real number of MANUAL ``drop=True`` specs the script processed. Verified: against ``main``'s aliases.json the script prints ``48 renamed, 3 dropped, 23 kept`` (= 74). - ``scripts/sweep_alias_refs.py`` — the comment promised a "hand-written pass below" for ``gemma4`` that did not exist (the sweep deliberately leaves every ``gemma4`` occurrence alone because the literal is also the parser ID). Reworked the comment to make the no-op intent and reason explicit so a future maintainer doesn't hunt for a missing implementation. Tests: 4789 passed, 11 skipped, 7 xfailed. 
``test_simple_exec`` / ``test_multi_step_tool_chain`` flaked again (model-behaviour pick of tool name varies run-to-run); rerunning against the same live server passes both — same as round 1. Co-Authored-By: Claude Opus 4.7 (1M context) --- harness/README.md | 8 +- scripts/rename_aliases.py | 17 ++- scripts/rename_map.json | 100 +++++++++--------- scripts/sweep_alias_refs.py | 9 +- ...est_issue_513_harmony_streamable_parser.py | 20 +++- 5 files changed, 90 insertions(+), 64 deletions(-) diff --git a/harness/README.md b/harness/README.md index 75ebf703..801209c1 100644 --- a/harness/README.md +++ b/harness/README.md @@ -100,10 +100,10 @@ Override the model with `--model qwen3.6-35b-4bit` (will need its own baseline). ### `full` (~2-3 hr, 3 models × 11 agent profiles) -Loops the check tier across `qwen3.5-35b-8bit` and `qwen3.6-35b-4bit` -(real-capacity Qwen lines — both 8-bit, both go through the Hermes -parser path that most users hit). For each model, also runs all 11 -agent profiles' auto-generated test plans. +Loops the check tier across `qwen3.5-35b-8bit` (8-bit) and +`qwen3.6-35b-4bit` (4-bit) — real-capacity Qwen lines that both go +through the Hermes parser path that most users hit. For each model, +also runs all 11 agent profiles' auto-generated test plans. > Gemma 4 was previously in the default list for orthogonal coverage > but was dropped after PR #208 validation showed it fails multiple diff --git a/scripts/rename_aliases.py b/scripts/rename_aliases.py index a33fa8a5..ed71ec49 100644 --- a/scripts/rename_aliases.py +++ b/scripts/rename_aliases.py @@ -153,9 +153,20 @@ def main() -> None: json.dump(rename_map, fp, indent=2, sort_keys=True) fp.write("\n") - renamed = sum(1 for o, n in rename_map.items() if n and o != n) - dropped = sum(1 for n in rename_map.values() if n is None) - kept = sum(1 for o, n in rename_map.items() if n and o == n) + # Counters from the input data's perspective so the three lines add + # back up to the input alias count. 
Each input alias is exactly one + # of: dropped (MANUAL says ``drop=True``), renamed (name changed), + # or kept (name unchanged because the old key already carried the + # canonical quant suffix). Counting drops via ``rename_map[o] is None`` + # would always print 0 because dropped codename aliases store their + # redirect target — non-None — in the rename map. + def _is_drop(old: str) -> bool: + spec = MANUAL.get(old) + return isinstance(spec, dict) and bool(spec.get("drop")) + + dropped = sum(1 for old in data if _is_drop(old)) + renamed = sum(1 for old in data if not _is_drop(old) and rename_map[old] != old) + kept = sum(1 for old in data if not _is_drop(old) and rename_map[old] == old) print(f" renamed: {renamed}") print(f" dropped: {dropped}") print(f" kept (already explicit): {kept}") diff --git a/scripts/rename_map.json b/scripts/rename_map.json index 810983c0..8a9150b4 100644 --- a/scripts/rename_map.json +++ b/scripts/rename_map.json @@ -1,76 +1,74 @@ { - "bonsai-1.7b": "bonsai-1.7b-unpacked", - "bonsai-4b": "bonsai-4b-unpacked", - "bonsai-8b": "bonsai-8b-unpacked", - "deepseek-r1-32b": "deepseek-r1-32b-4bit", - "deepseek-r1-8b": "deepseek-r1-8b-4bit", - "deepseek-v4-flash": "deepseek-v4-flash-8bit", + "bonsai-1.7b-unpacked": "bonsai-1.7b-unpacked", + "bonsai-4b-unpacked": "bonsai-4b-unpacked", + "bonsai-8b-unpacked": "bonsai-8b-unpacked", + "deepseek-r1-32b-4bit": "deepseek-r1-32b-4bit", + "deepseek-r1-8b-4bit": "deepseek-r1-8b-4bit", "deepseek-v4-flash-2bit": "deepseek-v4-flash-2bit", "deepseek-v4-flash-4bit": "deepseek-v4-flash-4bit", "deepseek-v4-flash-8bit": "deepseek-v4-flash-8bit", - "devstral-24b": "devstral-24b-4bit", - "devstral-v2-24b": "devstral-v2-24b-4bit", - "gemma-3n-e4b": "gemma-3n-e4b-4bit", - "gemma-4-12b": "gemma-4-12b-4bit", + "devstral-24b-4bit": "devstral-24b-4bit", + "devstral-v2-24b-4bit": "devstral-v2-24b-4bit", + "gemma-3n-e4b-4bit": "gemma-3n-e4b-4bit", + "gemma-4-12b-4bit": "gemma-4-12b-4bit", "gemma-4-12b-8bit": 
"gemma-4-12b-8bit", - "gemma-4-12b-qat": "gemma-4-12b-qat-4bit", + "gemma-4-12b-qat-4bit": "gemma-4-12b-qat-4bit", "gemma-4-12b-qat-8bit": "gemma-4-12b-qat-8bit", - "gemma-4-26b": "gemma-4-26b-4bit", - "gemma-4-26b-qat": "gemma-4-26b-qat-4bit", - "gemma-4-31b": "gemma-4-31b-4bit", + "gemma-4-26b-4bit": "gemma-4-26b-4bit", + "gemma-4-26b-qat-4bit": "gemma-4-26b-qat-4bit", + "gemma-4-31b-4bit": "gemma-4-31b-4bit", "gemma-4-31b-8bit": "gemma-4-31b-8bit", - "gemma-4-31b-qat": "gemma-4-31b-qat-4bit", + "gemma-4-31b-qat-4bit": "gemma-4-31b-qat-4bit", "gemma-4-31b-qat-8bit": "gemma-4-31b-qat-8bit", - "gemma3-12b": "gemma3-12b-4bit", - "gemma3-1b": "gemma3-1b-4bit", - "gemma3-27b": "gemma3-27b-4bit", - "gemma4": "gemma-4-12b-qat-4bit", - "glm4.5-air": "glm4.5-air-4bit", - "glm4.7-9b": "glm4.7-9b-4bit", - "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8", - "granite4-tiny": "granite4-tiny-4bit", - "hermes3-8b": "hermes3-8b-4bit", - "hermes4-70b": "hermes4-70b-4bit", - "kimi-48b": "kimi-48b-4bit", - "kimi-k2.5": "kimi-k2.5-3bit", + "gemma3-12b-4bit": "gemma3-12b-4bit", + "gemma3-1b-4bit": "gemma3-1b-4bit", + "gemma3-27b-4bit": "gemma3-27b-4bit", + "glm4.5-air-4bit": "glm4.5-air-4bit", + "glm4.7-9b-4bit": "glm4.7-9b-4bit", + "gpt-oss-20b-mxfp4-q8": "gpt-oss-20b-mxfp4-q8", + "granite4-tiny-4bit": "granite4-tiny-4bit", + "hermes3-8b-4bit": "hermes3-8b-4bit", + "hermes4-70b-4bit": "hermes4-70b-4bit", + "kimi-48b-4bit": "kimi-48b-4bit", + "kimi-k2.5-3bit": "kimi-k2.5-3bit", "llama-3.1-8b-8bit": "llama-3.1-8b-8bit", - "llama3-1b": "llama3-1b-4bit", - "llama3-3b": "llama3-3b-4bit", - "minimax-m2.5": "minimax-m2.5-4bit", - "minimax-m2.7": "minimax-m2.7-mxfp4", - "ministral-3b": "ministral-3b-4bit", - "mistral-24b": "mistral-24b-4bit", - "nemotron-30b": "nemotron-30b-4bit", - "nemotron-nano": "nemotron-30b-4bit", - "phi4-14b": "phi-4-14b-4bit", + "llama3-1b-4bit": "llama3-1b-4bit", + "llama3-3b-4bit": "llama3-3b-4bit", + "minimax-m2.5-4bit": "minimax-m2.5-4bit", + "minimax-m2.7-mxfp4": 
"minimax-m2.7-mxfp4", + "ministral-3b-4bit": "ministral-3b-4bit", + "mistral-24b-4bit": "mistral-24b-4bit", + "nemotron-30b-4bit": "nemotron-30b-4bit", + "phi-4-14b-4bit": "phi-4-14b-4bit", + "phi-4-mini-4bit": "phi-4-mini-4bit", "qwen3-0.6b-8bit": "qwen3-0.6b-8bit", "qwen3-4b-8bit": "qwen3-4b-8bit", "qwen3-8b-8bit": "qwen3-8b-8bit", - "qwen3-coder": "qwen3-coder-4bit", - "qwen3-coder-30b": "qwen3-coder-30b-4bit", - "qwen3-vl-30b": "qwen3-vl-30b-4bit", - "qwen3-vl-4b": "qwen3-vl-4b-4bit", - "qwen3-vl-8b": "qwen3-vl-8b-4bit", - "qwen3.5-122b": "qwen3.5-122b-mxfp4", + "qwen3-coder-30b-4bit": "qwen3-coder-30b-4bit", + "qwen3-coder-4bit": "qwen3-coder-4bit", + "qwen3-vl-30b-4bit": "qwen3-vl-30b-4bit", + "qwen3-vl-4b-4bit": "qwen3-vl-4b-4bit", + "qwen3-vl-8b-4bit": "qwen3-vl-8b-4bit", "qwen3.5-122b-8bit": "qwen3.5-122b-8bit", - "qwen3.5-27b": "qwen3.5-27b-4bit", + "qwen3.5-122b-mxfp4": "qwen3.5-122b-mxfp4", + "qwen3.5-27b-4bit": "qwen3.5-27b-4bit", "qwen3.5-27b-8bit": "qwen3.5-27b-8bit", - "qwen3.5-35b": "qwen3.5-35b-8bit", "qwen3.5-35b-4bit": "qwen3.5-35b-4bit", - "qwen3.5-4b": "qwen3.5-4b-4bit", + "qwen3.5-35b-8bit": "qwen3.5-35b-8bit", + "qwen3.5-4b-4bit": "qwen3.5-4b-4bit", "qwen3.5-4b-8bit": "qwen3.5-4b-8bit", - "qwen3.5-9b": "qwen3.5-9b-4bit", + "qwen3.5-9b-4bit": "qwen3.5-9b-4bit", "qwen3.5-9b-8bit": "qwen3.5-9b-8bit", - "qwen3.6-27b": "qwen3.6-27b-4bit", + "qwen3.6-27b-4bit": "qwen3.6-27b-4bit", "qwen3.6-27b-8bit": "qwen3.6-27b-8bit", "qwen3.6-27b-ud": "qwen3.6-27b-ud", - "qwen3.6-35b": "qwen3.6-35b-4bit", + "qwen3.6-35b-4bit": "qwen3.6-35b-4bit", "qwen3.6-35b-6bit": "qwen3.6-35b-6bit", "qwen3.6-35b-8bit": "qwen3.6-35b-8bit", "qwen3.6-35b-dwq": "qwen3.6-35b-dwq", "qwen3.6-35b-ud": "qwen3.6-35b-ud", - "qwopus-27b": "qwopus-27b-4bit", + "qwopus-27b-4bit": "qwopus-27b-4bit", "qwopus-27b-8bit": "qwopus-27b-8bit", - "qwopus-9b": "qwopus-9b-4bit", - "smollm3-3b": "smollm3-3b-4bit" + "qwopus-9b-4bit": "qwopus-9b-4bit", + "smollm3-3b-4bit": "smollm3-3b-4bit" } diff 
--git a/scripts/sweep_alias_refs.py b/scripts/sweep_alias_refs.py index 918f2ca7..bb16affe 100644 --- a/scripts/sweep_alias_refs.py +++ b/scripts/sweep_alias_refs.py @@ -163,8 +163,13 @@ def main() -> int: # elsewhere in the codebase. ``gemma4`` is the parser ID # (registered in ``gemma4_tool_parser.py``, referenced from # ``model_auto_config.py``, ``output_router.py``, etc.) — auto- - # rewriting it would corrupt the parser registry. These are handled - # by a hand-written pass below. + # rewriting it would corrupt the parser registry, so the sweep + # leaves every occurrence of ``gemma4`` untouched. The matching + # codename alias was removed from ``aliases.json`` by + # ``rename_aliases.py``; any remaining alias-context usage (e.g. + # ``rapid-mlx serve gemma4``) is a manual edit, NOT something this + # script will rewrite. Rerunning this script on a fresh checkout is + # therefore intentionally a no-op for ``gemma4``. HAND_HANDLED = {"gemma4"} rename_map = {o: n for o, n in rename_map.items() if o not in HAND_HANDLED} diff --git a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py index 0657adab..57758014 100644 --- a/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py +++ b/tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py @@ -699,7 +699,13 @@ def encode(self, text, add_special_tokens=False): "my-not-gpt-oss-20b", "notgpt-oss-fake", "some-user/gpt-oss-remapped", - "evil-org/gpt-oss-20b-mxfp4-q8", + # ``evil-org/gpt-oss-20b`` is the canonical spoof case — a third-party + # org happens to use the same bare repo name as OpenAI's. A matcher + # that accepts any ``*/gpt-oss-20b`` would false-accept this. Kept + # verbatim (no alias suffix) so the spoof shape stays representative + # — adding the canonical alias suffix would only test a stricter + # variant that's already covered by the matcher. 
+ "evil-org/gpt-oss-20b", "anonymous/gpt-oss", ) for name in rejected_names: @@ -713,15 +719,21 @@ class _Fake(_CompatTokenizerBase): ) accepted_names = ( + # OpenAI's canonical bare repo id — kept verbatim so the matcher + # is tested against the real upstream shape, not just the rapid-mlx + # alias form. + "openai/gpt-oss-20b", + # rapid-mlx alias post-rename — separately covered so an alias + # match doesn't shadow the bare repo match above. "openai/gpt-oss-20b-mxfp4-q8", "mlx-community/gpt-oss-20b-MXFP4-Q8", "unsloth/gpt-oss-20b-MLX-8bit", "gpt-oss-20b-mxfp4-q8", "gpt-oss", - "/models/gpt-oss-20b-mxfp4-q8", - "~/lmstudio-models/gpt-oss-20b-mxfp4-q8", + "/models/gpt-oss-20b", + "~/lmstudio-models/gpt-oss-20b", "./gpt-oss-20b-quantized", - "../models/gpt-oss-20b-mxfp4-q8", + "../models/gpt-oss-20b", ) for name in accepted_names: From 59ce089a5420d931d396cf9b0d92b96d87c1efd6 Mon Sep 17 00:00:00 2001 From: Raullen Chai Date: Tue, 9 Jun 2026 18:15:17 -0700 Subject: [PATCH 4/6] fix(aliases): regenerate rename_map.json from pre-rename aliases.json MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex round-3 NIT: the checked-in ``rename_map.json`` was the result of an idempotent rerun against the already-renamed ``aliases.json``, so every entry was an identity mapping (e.g. ``qwen3.5-4b-4bit`` → ``qwen3.5-4b-4bit``). A maintainer running ``scripts/sweep_alias_refs.py`` from a pre-rename checkout to verify the operation is reproducible would see the sweep do nothing because no legacy name (``qwen3.5-4b``, ``gemma4``, ``nemotron-nano``, …) was in the map. Regenerated from ``main``'s ``vllm_mlx/aliases.json`` so the map now contains the real 74-entry legacy → canonical mapping, including the three dropped codename redirects (``deepseek-v4-flash`` → ``deepseek-v4-flash-8bit``, ``gemma4`` → ``gemma-4-12b-qat-4bit``, ``nemotron-nano`` → ``nemotron-30b-4bit``). 
The current ``aliases.json`` is untouched — only the auxiliary map file used by the sweep tool changes. Co-Authored-By: Claude Opus 4.7 (1M context) --- scripts/rename_map.json | 100 ++++++++++++++++++++-------------------- 1 file changed, 51 insertions(+), 49 deletions(-) diff --git a/scripts/rename_map.json b/scripts/rename_map.json index 8a9150b4..810983c0 100644 --- a/scripts/rename_map.json +++ b/scripts/rename_map.json @@ -1,74 +1,76 @@ { - "bonsai-1.7b-unpacked": "bonsai-1.7b-unpacked", - "bonsai-4b-unpacked": "bonsai-4b-unpacked", - "bonsai-8b-unpacked": "bonsai-8b-unpacked", - "deepseek-r1-32b-4bit": "deepseek-r1-32b-4bit", - "deepseek-r1-8b-4bit": "deepseek-r1-8b-4bit", + "bonsai-1.7b": "bonsai-1.7b-unpacked", + "bonsai-4b": "bonsai-4b-unpacked", + "bonsai-8b": "bonsai-8b-unpacked", + "deepseek-r1-32b": "deepseek-r1-32b-4bit", + "deepseek-r1-8b": "deepseek-r1-8b-4bit", + "deepseek-v4-flash": "deepseek-v4-flash-8bit", "deepseek-v4-flash-2bit": "deepseek-v4-flash-2bit", "deepseek-v4-flash-4bit": "deepseek-v4-flash-4bit", "deepseek-v4-flash-8bit": "deepseek-v4-flash-8bit", - "devstral-24b-4bit": "devstral-24b-4bit", - "devstral-v2-24b-4bit": "devstral-v2-24b-4bit", - "gemma-3n-e4b-4bit": "gemma-3n-e4b-4bit", - "gemma-4-12b-4bit": "gemma-4-12b-4bit", + "devstral-24b": "devstral-24b-4bit", + "devstral-v2-24b": "devstral-v2-24b-4bit", + "gemma-3n-e4b": "gemma-3n-e4b-4bit", + "gemma-4-12b": "gemma-4-12b-4bit", "gemma-4-12b-8bit": "gemma-4-12b-8bit", - "gemma-4-12b-qat-4bit": "gemma-4-12b-qat-4bit", + "gemma-4-12b-qat": "gemma-4-12b-qat-4bit", "gemma-4-12b-qat-8bit": "gemma-4-12b-qat-8bit", - "gemma-4-26b-4bit": "gemma-4-26b-4bit", - "gemma-4-26b-qat-4bit": "gemma-4-26b-qat-4bit", - "gemma-4-31b-4bit": "gemma-4-31b-4bit", + "gemma-4-26b": "gemma-4-26b-4bit", + "gemma-4-26b-qat": "gemma-4-26b-qat-4bit", + "gemma-4-31b": "gemma-4-31b-4bit", "gemma-4-31b-8bit": "gemma-4-31b-8bit", - "gemma-4-31b-qat-4bit": "gemma-4-31b-qat-4bit", + "gemma-4-31b-qat": 
"gemma-4-31b-qat-4bit", "gemma-4-31b-qat-8bit": "gemma-4-31b-qat-8bit", - "gemma3-12b-4bit": "gemma3-12b-4bit", - "gemma3-1b-4bit": "gemma3-1b-4bit", - "gemma3-27b-4bit": "gemma3-27b-4bit", - "glm4.5-air-4bit": "glm4.5-air-4bit", - "glm4.7-9b-4bit": "glm4.7-9b-4bit", - "gpt-oss-20b-mxfp4-q8": "gpt-oss-20b-mxfp4-q8", - "granite4-tiny-4bit": "granite4-tiny-4bit", - "hermes3-8b-4bit": "hermes3-8b-4bit", - "hermes4-70b-4bit": "hermes4-70b-4bit", - "kimi-48b-4bit": "kimi-48b-4bit", - "kimi-k2.5-3bit": "kimi-k2.5-3bit", + "gemma3-12b": "gemma3-12b-4bit", + "gemma3-1b": "gemma3-1b-4bit", + "gemma3-27b": "gemma3-27b-4bit", + "gemma4": "gemma-4-12b-qat-4bit", + "glm4.5-air": "glm4.5-air-4bit", + "glm4.7-9b": "glm4.7-9b-4bit", + "gpt-oss-20b": "gpt-oss-20b-mxfp4-q8", + "granite4-tiny": "granite4-tiny-4bit", + "hermes3-8b": "hermes3-8b-4bit", + "hermes4-70b": "hermes4-70b-4bit", + "kimi-48b": "kimi-48b-4bit", + "kimi-k2.5": "kimi-k2.5-3bit", "llama-3.1-8b-8bit": "llama-3.1-8b-8bit", - "llama3-1b-4bit": "llama3-1b-4bit", - "llama3-3b-4bit": "llama3-3b-4bit", - "minimax-m2.5-4bit": "minimax-m2.5-4bit", - "minimax-m2.7-mxfp4": "minimax-m2.7-mxfp4", - "ministral-3b-4bit": "ministral-3b-4bit", - "mistral-24b-4bit": "mistral-24b-4bit", - "nemotron-30b-4bit": "nemotron-30b-4bit", - "phi-4-14b-4bit": "phi-4-14b-4bit", - "phi-4-mini-4bit": "phi-4-mini-4bit", + "llama3-1b": "llama3-1b-4bit", + "llama3-3b": "llama3-3b-4bit", + "minimax-m2.5": "minimax-m2.5-4bit", + "minimax-m2.7": "minimax-m2.7-mxfp4", + "ministral-3b": "ministral-3b-4bit", + "mistral-24b": "mistral-24b-4bit", + "nemotron-30b": "nemotron-30b-4bit", + "nemotron-nano": "nemotron-30b-4bit", + "phi4-14b": "phi-4-14b-4bit", "qwen3-0.6b-8bit": "qwen3-0.6b-8bit", "qwen3-4b-8bit": "qwen3-4b-8bit", "qwen3-8b-8bit": "qwen3-8b-8bit", - "qwen3-coder-30b-4bit": "qwen3-coder-30b-4bit", - "qwen3-coder-4bit": "qwen3-coder-4bit", - "qwen3-vl-30b-4bit": "qwen3-vl-30b-4bit", - "qwen3-vl-4b-4bit": "qwen3-vl-4b-4bit", - "qwen3-vl-8b-4bit": 
"qwen3-vl-8b-4bit", + "qwen3-coder": "qwen3-coder-4bit", + "qwen3-coder-30b": "qwen3-coder-30b-4bit", + "qwen3-vl-30b": "qwen3-vl-30b-4bit", + "qwen3-vl-4b": "qwen3-vl-4b-4bit", + "qwen3-vl-8b": "qwen3-vl-8b-4bit", + "qwen3.5-122b": "qwen3.5-122b-mxfp4", "qwen3.5-122b-8bit": "qwen3.5-122b-8bit", - "qwen3.5-122b-mxfp4": "qwen3.5-122b-mxfp4", - "qwen3.5-27b-4bit": "qwen3.5-27b-4bit", + "qwen3.5-27b": "qwen3.5-27b-4bit", "qwen3.5-27b-8bit": "qwen3.5-27b-8bit", + "qwen3.5-35b": "qwen3.5-35b-8bit", "qwen3.5-35b-4bit": "qwen3.5-35b-4bit", - "qwen3.5-35b-8bit": "qwen3.5-35b-8bit", - "qwen3.5-4b-4bit": "qwen3.5-4b-4bit", + "qwen3.5-4b": "qwen3.5-4b-4bit", "qwen3.5-4b-8bit": "qwen3.5-4b-8bit", - "qwen3.5-9b-4bit": "qwen3.5-9b-4bit", + "qwen3.5-9b": "qwen3.5-9b-4bit", "qwen3.5-9b-8bit": "qwen3.5-9b-8bit", - "qwen3.6-27b-4bit": "qwen3.6-27b-4bit", + "qwen3.6-27b": "qwen3.6-27b-4bit", "qwen3.6-27b-8bit": "qwen3.6-27b-8bit", "qwen3.6-27b-ud": "qwen3.6-27b-ud", - "qwen3.6-35b-4bit": "qwen3.6-35b-4bit", + "qwen3.6-35b": "qwen3.6-35b-4bit", "qwen3.6-35b-6bit": "qwen3.6-35b-6bit", "qwen3.6-35b-8bit": "qwen3.6-35b-8bit", "qwen3.6-35b-dwq": "qwen3.6-35b-dwq", "qwen3.6-35b-ud": "qwen3.6-35b-ud", - "qwopus-27b-4bit": "qwopus-27b-4bit", + "qwopus-27b": "qwopus-27b-4bit", "qwopus-27b-8bit": "qwopus-27b-8bit", - "qwopus-9b-4bit": "qwopus-9b-4bit", - "smollm3-3b-4bit": "smollm3-3b-4bit" + "qwopus-9b": "qwopus-9b-4bit", + "smollm3-3b": "smollm3-3b-4bit" } From 93c649591ca43cb0dc778c7caff7ac3e4eb15718 Mon Sep 17 00:00:00 2001 From: Raullen Chai Date: Tue, 9 Jun 2026 18:53:49 -0700 Subject: [PATCH 5/6] fix(tests/integrations): raise PydanticAI max_tokens for multi-turn / multi-tool PydanticAI defaults max_tokens to ~1024. On verbose 4B-class models (qwen3.5-4b-4bit) the multi-turn and sequential-tool-call test paths spill past the cap and PydanticAI raises ``Model token limit (provider default) exceeded`` before any response is generated. 
That ceiling is a client-side default, not a rapid-mlx server contract, so the SDK integration test should bypass it: pass ``model_settings={"max_tokens": 2048}`` on tests 5 and 6. release-check-m3 G7 PydanticAI now 6/6 PASS on qwen3.5-4b-4bit. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/integrations/test_pydantic_ai_full.py | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/tests/integrations/test_pydantic_ai_full.py b/tests/integrations/test_pydantic_ai_full.py index 8abe3524..96d7a551 100644 --- a/tests/integrations/test_pydantic_ai_full.py +++ b/tests/integrations/test_pydantic_ai_full.py @@ -102,7 +102,10 @@ def get_weather(city: str) -> str: # === 5. Multi-turn conversation === print("\n=== Test 5: Multi-turn ===") try: - agent = Agent(model) + # PydanticAI defaults max_tokens to ~1024; on verbose 4B-class models the + # second turn can spill past that. Raise the cap so the SDK contract test + # checks rapid-mlx behaviour, not PydanticAI's default ceiling. + agent = Agent(model, model_settings={"max_tokens": 2048}) r1 = agent.run_sync("My name is Bob. Remember this.") r2 = agent.run_sync("What is my name?", message_history=r1.all_messages()) assert "bob" in r2.output.lower(), r2.output @@ -115,7 +118,10 @@ def get_weather(city: str) -> str: # === 6. Multiple tools, sequential === print("\n=== Test 6: Multiple tools ===") try: - agent = Agent(model) + # Sequential tool calls accumulate output across two tool-call round trips; + # PydanticAI's default max_tokens ceiling kicks in before the final answer + # on small models. Same fix as test 5. 
+ agent = Agent(model, model_settings={"max_tokens": 2048}) @agent.tool_plain def add(a: int, b: int) -> int: From 244761d8c8d375eb3028aed5de348bca005282cc Mon Sep 17 00:00:00 2001 From: Raullen Chai Date: Tue, 9 Jun 2026 18:53:59 -0700 Subject: [PATCH 6/6] chore: bump version to 0.7.0 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Major release — every alias in ``vllm_mlx/aliases.json`` now carries an explicit quantization suffix (``-4bit`` / ``-8bit`` / ``-mxfp4`` / ``-dwq`` / etc.); no implicit-quant short forms remain. The three legacy codename aliases (``deepseek-v4-flash``, ``gemma4``, ``nemotron-nano``) were dropped; the ``phi4-14b`` schema bug (name claimed 14B but hf_path pointed at phi-4-mini ~4B) was fixed by renaming to ``phi-4-14b-4bit`` AND swapping hf_path to the real Phi-4 14B; ``phi-4-mini-4bit`` was added to preserve the small-model entry. README now documents the 7-segment naming template ``-----`` and the canonical quant-suffix table. Total: 74 → 72 aliases. Old short names are not deprecated — they're just gone, per user direction ("not many users"). Co-Authored-By: Claude Opus 4.7 (1M context) --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index 64e12c99..82e6a10d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "rapid-mlx" -version = "0.6.83" +version = "0.7.0" description = "Rapid-MLX — AI inference for Apple Silicon. Drop-in OpenAI API, 2-4x faster than Ollama." readme = "README.md" license = {text = "Apache-2.0"}