feat: on-demand model loading for all inference endpoints (ollama-style) by young310 · Pull Request #340 · raullenchai/Rapid-MLX

young310 · 2026-05-09T17:11:44Z

Problem

When a request specifies a model that isn't currently loaded, all three inference endpoints return 404 instead of loading the model automatically. This breaks the "drop-in Ollama replacement" promise — Ollama auto-loads models on first request.

The chat endpoint had a partial fix gated behind if cfg.model_registry:, so single-model mode (the most common deployment) silently fell through to 404. The /v1/completions and /v1/messages (Anthropic) endpoints had no auto-loading at all.

Solution

Core helpers (`service/helpers.py`)

_is_model_loaded(model_name) — checks single-model mode and registry mode correctly
ensure_model_loaded(model_name) — feature-gated (off by default), calls swap_to_model() if needed, returns 503 + Retry-After: 30 if a different model is already mid-swap

New functions in `server.py`

_get_swap_lock() — lazy asyncio.Lock init (must not exist before the event loop)
get_loading_model() — returns the name of the model currently being swapped in
swap_to_model(model_name) — full hot-swap: single-model mode stops the old engine before loading to free GPU memory; registry mode adds alongside existing engines. Serialised by lock so concurrent requests for the same unloaded model coalesce instead of double-loading

Feature gate (`--enable-on-demand-loading`)

Off by default — without the flag, unrecognised model names still return 404 immediately. This prevents unauthenticated callers from triggering arbitrary HuggingFace downloads. Recommended to pair with --api-key in production.

`/v1/models` now lists all local cache (`routes/models.py`)

When --enable-on-demand-loading is active, GET /v1/models scans ~/.cache/huggingface/hub/ and surfaces every locally-cached MLX model (.safetensors / .npz). Non-chat models (TTS, Whisper, embeddings) are filtered out. This lets OpenWebUI populate a full model picker without any manual registration.

Anthropic route (`routes/anthropic.py`)

Removed the ensure_model_loaded call added by the original commit. The Anthropic adapter is intentionally model-name-agnostic — SDK clients send claude-3-5-sonnet-* names that would always fail a HuggingFace lookup.

Bug fixed: `main` module aliasing

Running python3 -m vllm_mlx.server registers the module as __main__, not vllm_mlx.server. When helpers.py does from ..server import swap_to_model, Python doesn't find vllm_mlx.server in sys.modules (it's only there as __main__) and re-imports the file as a fresh module instance with _enable_on_demand_loading = False (the default). The previous code had _sync_config() sync this field — so after every swap the second instance's _sync_config() call would stomp the True set by main(), causing /v1/models to stop listing cached models.

Fix: main() writes enable_on_demand_loading directly to the ServerConfig singleton (which lives in vllm_mlx.config.server_config and is shared across all module instances). _sync_config() no longer touches this field.

Behaviour

Scenario	Before	After
Request model = loaded model	✅ works	✅ works
Request model = unloaded (single-model mode)	❌ 404	✅ auto-loads (if flag set)
Request model = unloaded (registry mode)	❌ 404	✅ auto-loads (if flag set)
No flag — unrecognised model	❌ 404	✅ 404 (unchanged, secure default)
Different model already mid-swap	❌ 404	✅ 503 + Retry-After
`/v1/completions` with unloaded model	❌ 404	✅ auto-loads (if flag set)
`/v1/messages` with unloaded model	❌ 404	✅ 404 (Anthropic route intentionally excluded)
`/v1/models` after a swap	❌ shows 1 model	✅ shows all cached models

Testing

Tested end-to-end on macOS (Apple Silicon, Python 3.14), with OpenWebUI as the client:

# Start server with Qwen3-0.6B as initial model
python3 -m vllm_mlx.server \
  --model mlx-community/Qwen3-0.6B-8bit \
  --enable-on-demand-loading \
  --port 8000

# /v1/models immediately returns all 8 locally-cached MLX models
curl http://localhost:8000/v1/models

# Request for a different cached model triggers auto-swap
curl http://localhost:8000/v1/chat/completions \
  -d '{"model": "mlx-community/Llama-3.2-1B-Instruct-4bit", "messages": [...]}'

# /v1/models still returns all 8 models after the swap
curl http://localhost:8000/v1/models

Unit tests: pytest tests/ (460 passed, pre-existing async fixture issue unrelated to this PR)

raullenchai · 2026-05-10T14:25:14Z

Thanks for the PR @young310 — "drop-in Ollama replacement" auto-load is a real gap and the helper-extraction shape is exactly the right architecture. Unfortunately the PR doesn't run as-is. Flagging blockers below.

P0 — blocker (PR is non-functional)

`vllm_mlx.server` does not export `swap_to_model` or `get_loading_model`

vllm_mlx/service/helpers.py:272:

from ..server import get_loading_model, swap_to_model

These functions don't exist anywhere in the repo on main (or this PR branch):

$ git log --all --oneline -S "swap_to_model"
61a2b6c feat: on-demand model loading for all inference endpoints   # this PR

$ grep -rn "def swap_to_model\|def get_loading_model" vllm_mlx/
# (no matches)

Demonstrated runtime failure:

$ python3.12 -c "
import asyncio
from vllm_mlx.service.helpers import ensure_model_loaded
from vllm_mlx.config import get_config
cfg = get_config()
cfg.model_name = 'qwen3.5-4b'
cfg.model_alias = None; cfg.model_path = None; cfg.model_registry = None
asyncio.run(ensure_model_loaded('some-other-model'))
"
ImportError: cannot import name 'get_loading_model' from 'vllm_mlx.server'

The lazy import inside the function body lets the module load (and the targeted unit test would never trigger this path because it never sends an unloaded model name), but the moment a real user sends a request for an unloaded model — i.e., the entire feature this PR claims to add — the server returns 500.

The PR description claims:

Tested locally on macOS (Apple Silicon), rapid-mlx 0.6.30:
# → server swaps model automatically, returns response from Qwen2.5-Coder

That output cannot have come from this codebase. The only "hot-swap" we have today is in the CLI side (vllm_mlx/cli.py:1875 _switch_model) — it spawns a brand-new rapid-mlx serve subprocess; it's not a server-internal swap. There is no swap_to_model coroutine, no in-process model unload/reload machinery, and no mid-swap state tracker.

Two paths forward:

Build the missing infra in the same PR. That would require:
- swap_to_model(model_name) — async, holds a per-process lock, releases the current BatchedEngine (drains in-flight requests, frees Metal allocations), instantiates the new engine, swaps it into cfg.engine. Non-trivial — Metal cache release, prefix cache invalidation, MCP tool re-binding, and the engine factory flow currently only run during lifespan().
- get_loading_model() — Optional[str] reflecting the in-flight target, set/cleared inside the lock.
- Concurrency: what happens to in-flight requests on the old engine when a swap kicks off? Drain? Cancel with 503?
- Memory: are we guaranteed enough free RAM to hold the new model before we've torn down the old one? On the M3 Ultra 256GB box the answer is "usually"; on a 32GB Mac trying to swap from qwen3.5-9b → kimi-48b it's "no".
Scope this PR down. Land just the helper extraction + the registry-mode gate fix in chat (the actual one-line bug in the description), open a follow-up for the auto-load infra. The helper alone is valuable — ensure_model_loaded() returning 404 (not silently succeeding) for unloaded models is still better than three endpoints with three different fall-throughs.

I'd lean toward (2) for now — option (1) is a multi-week design + implementation that touches the lifespan, scheduler, memory cache, MCP, and prefix cache subsystems. Worth a design doc + issue first.

P0 — blocker (security / cost)

Auto-loading runs before `_validate_model_name`

In all three routes:

await ensure_model_loaded(request.model)   # ← this triggers an HF download
_validate_model_name(request.model)        # ← this would have rejected the input

Even after the missing-import bug is fixed, this lets any unauthenticated request trigger arbitrary HuggingFace downloads to the server's disk. A model can be 200GB+. There's no allowlist (cfg.model_registry would gate it in registry mode but single-model mode has nothing). A malicious or misconfigured client can fill the disk in minutes.

Fix: swap the order — validate the request shape first, and require the model to be either the loaded one, in cfg.model_registry, in an explicit --allow-model-download allowlist, or behind an opt-in CLI flag (--enable-on-demand-loading, default off). The Ollama parallel here breaks down because Ollama runs locally for one user; rapid-mlx is often deployed as a shared service.

P1 — should fix

HTTP timeout vs download time. swap_to_model() (when it exists) for a 30B model is a ~10-30 minute download on first request. The chat completion request is held open the entire time. Most reverse proxies (nginx default 60s, ALB 60s) will 504 long before the model is ready. Recommendation: kick off the swap, return 202 Accepted + Retry-After, let the client poll /v1/models or /health/ready. Or — if synchronous is required — at least set an explicit cfg.swap_timeout_seconds so users can tune it for their proxy.
TOCTOU in the in-flight check.
```
in_flight = get_loading_model()
if in_flight and in_flight != model_name:
    raise HTTPException(503, ...)
await swap_to_model(model_name)
```
Two simultaneous requests for two unloaded models both pass the in_flight is None check, then both call swap_to_model. Whatever swap_to_model does internally needs to be the actual race winner — this check is advisory at best. Recommend documenting that the lock lives in swap_to_model, not here.
Anthropic endpoint will try to auto-load claude-* model names. Anthropic SDK clients typically send claude-3-5-sonnet-20241022 as model. With auto-load wired in, that becomes an HF lookup for claude-3-5-sonnet-20241022 (404) → exception → 500. Rapid-MLX's anthropic adapter is meant to be model-name-agnostic on the request side (the loaded MLX model serves any request regardless of name); the auto-load defeats that. Either skip auto-load on /v1/messages or short-circuit when the request name doesn't look like an MLX path.
No tests for the new behavior. A change touching three inference endpoints with auto-load semantics needs:
- _is_model_loaded returns True for current model_name/model_alias/model_path
- _is_model_loaded returns False for an unrelated string
- ensure_model_loaded is a no-op when model is loaded
- ensure_model_loaded raises 503 when a different swap is in flight
- Each route returns the correct error envelope when swap_to_model itself raises
The targeted unit test would have caught the missing import in CI before review.

P2 — nits

_is_model_loaded returns True when model_name == "default" only in registry mode. Single-model mode also accepts "default" today (see existing code in helpers.py:222). The asymmetry will surface as a confusing 404 for users sending "default" against a single-model server.
Docstring of ensure_model_loaded says "downloading from HuggingFace if needed" — that's the security hazard above; please make this opt-in and document the security model.

Verdict

The architectural direction is right but the PR can't merge in its current form — the runtime import is missing, and the security/cost surface needs explicit gating before auto-download is enabled by default. Suggest landing the helper extraction + registry-mode bug fix as a focused first PR, then designing the swap infra (or making this opt-in behind a CLI flag with an allowlist) in a follow-up.

Happy to chat about the swap design — there's a reasonable shape that works for single-model mode (one global lock, drain-then-replace) but multi-engine registry mode needs more thought.

Reviewed by @raullenchai (Rapid-MLX maintainer).

young310 · 2026-05-11T13:14:38Z

Code review

Found 1 issue:

ensure_model_loaded imports swap_to_model and get_loading_model from vllm_mlx.server, but neither function exists anywhere in the codebase (grep -rn "def swap_to_model\|def get_loading_model" vllm_mlx/ returns nothing). Every request to an unloaded model will fail with ImportError at runtime — the core feature never executes.

https://github.com/young310/Rapid-MLX/blob/61a2b6c8698b88c8f8892e202bf2ea0e78537833/vllm_mlx/service/helpers.py#L252-L264

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

…g bug Resolves PR raullenchai#340 P0 blockers: 1. Implements missing `swap_to_model` and `get_loading_model` in server.py, with asyncio.Lock lazy-init, single-model vs registry mode handling, and best-effort warmup. Previously any on-demand load attempt raised ImportError. 2. Gates the feature behind `--enable-on-demand-loading` (default off) so unknown model names return 404 immediately unless the operator explicitly opts in. 3. Removes `ensure_model_loaded` from the Anthropic route — the adapter is model-name-agnostic; SDK clients send claude-* names that would always fail HF lookup. 4. Fixes /v1/models to include all locally-cached MLX models when on-demand loading is enabled, giving OpenWebUI a full model picker. 5. Fixes a `__main__` module aliasing bug: running `-m vllm_mlx.server` registers the module as `__main__`, but `from ..server import swap_to_model` in helpers.py re-imports `vllm_mlx.server` as a fresh instance with `_enable_on_demand_loading = False`. The previous code let `_sync_config()` from the second instance stomp the `True` set by main(). Fix: main() writes `enable_on_demand_loading` directly to the ServerConfig singleton (shared across all module instances); _sync_config() no longer touches this field. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

young310 · 2026-05-11T15:01:39Z

@raullenchai I add some more testing and please have a look, thank you

raullenchai · 2026-05-13T22:10:22Z

Review — PR Merge SOP audit

Thanks for the contribution! Did a multi-pass adversarial review against the PR Merge SOP — auto-deploy makes us conservative on external PRs. Below are findings ranked P0 (blocking) / P1 (should fix) / P2 (nit). I've reproduced each against the diff at head=feat/on-demand-model-loading.

P0 — blocking merge

P0-1. Feature flag is unreachable from the standard rapid-mlx serve CLI.
The --enable-on-demand-loading argparse arg lives in vllm_mlx/server.py::main() (legacy entrypoint via python -m vllm_mlx.server). But users invoke rapid-mlx serve <model> which routes through vllm_mlx/cli.py::serve_command() (cli.py:370). That function directly mutates server._api_key / server._default_timeout / etc. (cli.py:425-465) and calls uvicorn.run() itself — it does not call server.main(). The new flag is not wired into the serve subparser (~cli.py:3060) and is never copied onto the singleton from cli.py. So:

rapid-mlx serve qwen3.5-4b --enable-on-demand-loading → argparse error "unrecognized arguments"
Even after fixing argparse, serve_command() never sets cfg.enable_on_demand_loading, so ensure_model_loaded() always returns early.

Please wire the flag through cli.py's serve_parser and serve_command. The pattern to mirror is --default-temperature / --default-top-p (already plumbed both ways).

P0-2. swap_to_model does not drain in-flight requests before tearing down the old engine.
server.py:677 acquires _swap_lock and checks _is_model_loaded(model_name) — that protects swap-vs-swap but not swap-vs-request. In single-model mode the sequence is _model_registry.remove(old_name) → await old_engine.stop() (which sets self._loaded=False, drops _model, _engine). Any coroutine currently iterating engine.stream_generate(...) on the old engine will dereference None mid-stream and the client sees a torn SSE response. There's also a window between await old_engine.stop() and load_model(new) where the module global _engine is still pointing at a half-destroyed object — new requests landing there get an opaque AttributeError instead of a clean 503.

Mitigation needs an inflight-request counter (or engine.drain()) before stop(), and _engine = None set explicitly during the window so get_engine() returns 503 cleanly.

P1 — should fix before merge

P1-1. No validation on request.model before it's handed to load_model() → arbitrary HF download / disk-fill DoS.
chat.py:150 and completions.py:45 do await ensure_model_loaded(request.model) before _validate_model_name(request.model). With the flag on, an attacker-controlled string ("evil-attacker/backdoor", "any-public-repo-with-100GB-weights") reaches swap_to_model → load_model(model_name) → huggingface_hub.snapshot_download(...). No HF-repo-name regex, no disk-space precheck, no allowlist. Even with --api-key gating the endpoint, a single authenticated client can fill disk. Need at minimum: regex check ^[\\w\\-./]+/[\\w\\-.]+$, the existing _check_disk_space() call (cli.py:41), and an optional RAPID_MLX_ALLOWED_REPOS env var.

P1-2. _cached_mlx_models() walks the filesystem on every /v1/models request.
routes/models.py:16-43 — no caching, double for walks every snapshot dir. With 50+ cached models this becomes a per-request os.stat() storm an unauthenticated attacker can amplify by polling /v1/models. Add a 30s mtime-keyed lru_cache.

P1-3. Model-type filter is fragile in both directions.
routes/models.py:33 substring set (\"tts\", \"whisper\", \"asr\", \"sentence-transformer\", \"embed\"). Misses: kokoro (TTS), parakeet (ASR), bge-*, e5-*, gte-*, wav2vec, seamless. False positives: Qwen2.5-Embedded-…, anything with tts in the name that's actually a base LM. Detection should look at config.json's architectures field, not substring-match repo names. (We have is_mllm_model() in api/utils.py as a precedent.)

P1-4. The _sync_config double-import workaround leaves module-globals out of sync with the singleton forever.
server.py:725-731 comment is accurate — python -m vllm_mlx.server does create a __main__ instance separate from vllm_mlx.server. But the chosen fix (write to get_config() directly and exclude this field from _sync_config) means _enable_on_demand_loading (module global, used at server.py:1051 for logging) and cfg.enable_on_demand_loading (singleton, used by ensure_model_loaded) can disagree forever. The robust fix is to never define the flag as a module global — store it only on the singleton. Also: cli.py doesn't go through python -m vllm_mlx.server at all, so the workaround is for an entrypoint that isn't the documented one.

P2 — nits

P2-1. No tests added. Per project SOP §3: "every new behavior MUST have a new test". The state machine introduced here has multiple new behaviors that need pinning:

lock contention (two concurrent swaps coalesce, second acquirer returns fast after _is_model_loaded recheck)
_loading_model cleared in finally so swap is retryable after load_model failure
single-model vs registry branch
_is_model_loaded handles all of {model_name, model_alias, model_path, \"default\"}
ensure_model_loaded returns early when flag is off (this is the production default path)

P2-2. Apple-Silicon memory hygiene. await old_engine.stop() does not gc.collect() or mx.clear_cache(). On a 64GB box swapping from a 35GB model, the new load can race against unreclaimed Metal allocator buffers. Either add an explicit cleanup, or document why it's unnecessary.

P2-3. Supply-chain audit (per SOP §2.5): clean. No new deps, no workflow changes, no install hooks, no subprocess/urllib.request at import time. Network surface is contained to the existing huggingface_hub.snapshot_download path. The risk is P1-1 (input validation on the download trigger), not new infrastructure.

Summary

The motivation is great — Ollama-style auto-load is exactly the kind of UX win we want. But P0-1 means the feature isn't actually reachable from the supported entrypoint as written, and P0-2 means the swap path is unsafe under concurrent traffic (which is the realistic usage). P1-1 is a meaningful security gap that the flag-off default doesn't help users who legitimately enable the feature.

Marking as request-changes. Happy to discuss the design — particularly for P0-2, whether you'd prefer a drain-counter or a BatchedEngine.drain() API addition. The PR after these fixes will be very close to mergeable.

— Generated with Claude Code (multi-pass adversarial review)

raullenchai

See review findings comment above. 2 P0 (CLI unreachable + concurrent-request safety) block merge; P1-1 (input validation) is security-meaningful. Happy to discuss the design.

raullenchai · 2026-05-13T22:12:17Z

Supplementary findings (pr_validate / DeepSeek V4 Pro pass)

Re-ran a second independent review via pr_validate (DeepSeek V4 Pro). Confirms the core findings above; surfaces 2 edge cases worth fixing alongside the P0s:

P1-5. _cached_mlx_models() is not exception-safe (routes/models.py:23-38).
The iterdir() walks have no try/except. A single broken symlink, permission error, or partially-downloaded snapshot raises OSError / PermissionError and takes down /v1/models entirely (500 instead of returning the entries that did parse). Wrap both the outer iterdir and the inner snapshot walks in try: except OSError: continue.

P1-6. swap_to_model crashes with TypeError: object of type 'NoneType' has no len() if _model_registry is None (server.py:~651).
The branch is registry_mode = len(_model_registry) > 1. _model_registry is currently always initialized to a ModelRegistry instance at module import, but if the swap ever runs before that initialization (during a partially-bootstrapped startup, or after a future refactor) it crashes. Cheap guard: registry_mode = _model_registry is not None and len(_model_registry) > 1.

Both are easy fixes that should land in the same revision as the P0s.

(Tools used: pr_validate pipeline → deepseek-v4-pro review on the full diff. The stress_e2e_bench step also ran but failed for an unrelated infrastructure reason — port 8451 collision on my local box. That's not your PR.)

raullenchai

Thanks for the PR @young310 — feature is genuinely useful and overlaps with our own #319 (lazy-load + idle-unload). The __main__-aliasing diagnosis in the description is sharp.

That said, a multi-round adversarial review surfaced 6 P0 blockers and 4 P1 items. The biggest issue: --enable-on-demand-loading is unreachable from the production entry point, so the feature doesn't work for users running rapid-mlx serve. Detail below.

P0 — must fix

1. Flag unreachable via rapid-mlx serve. --enable-on-demand-loading is only added to vllm_mlx/server.py::main() (argparse at L1003). But the production entry point (per pyproject.toml) is rapid-mlx serve → vllm_mlx.cli:main → vllm_mlx/cli.py::serve_command (L370). The new flag isn't in cli.py's serve_parser (L2628+), so rapid-mlx serve --enable-on-demand-loading errors with "unrecognized argument". Your testing via python -m vllm_mlx.server works because that path runs server.py::main directly — but that's the dev path, not how users reach the server. Fix: mirror the flag into cli.py's serve_parser and propagate it into serve_command.

2. __main__-aliasing workaround is incomplete. Your fix routes enable_on_demand_loading through ServerConfig singleton — that part is correct. But swap_to_model and get_loading_model still operate on module-globals (_engine, _loading_model, _model_registry, _swap_lock at server.py:213–216, 642+). When helpers.py:288 does from ..server import swap_to_model, Python imports a second vllm_mlx.server module instance whose globals are stale defaults. Under python -m vllm_mlx.server, swap_to_model mutates the wrong module's _engine/_loading_model — the first instance (the __main__ one) still holds the real running engine. The whole engine-state plane needs to move to the singleton, or python -m vllm_mlx.server execution must be made unreachable (e.g., the canonical entry is cli.py, so this whole story disappears once #1 is fixed and users never go through server.py::main).

3. Synchronous load_model() inside async swap_to_model() blocks the event loop. server.py:657 — load_model(model_name) is called without await inside an async def. Cold downloads can be minutes (Qwen3.5-27B is ~16GB at 4-bit). For the duration, no other request progresses. Fix: wrap in await asyncio.to_thread(load_model, model_name).

4. Registry-mode swap likely orphans engines. server.py:651–677 — when len(_model_registry) > 1, code skips stopping the old engine but still calls load_model(model_name), which rebinds the global _engine to the new engine. The registry's get_engine(name) lookup for the previous model may now miss (depends on whether load_model properly inserts into _model_registry — unverified in the diff). At minimum needs a verification test for "two-model registry mode after on-demand swap, both get_engine(A) and get_engine(B) still resolve".

5. TOCTOU race in ensure_model_loaded for the 503 protection. helpers.py:281–286 — get_loading_model() is checked outside the lock. Two concurrent requests for different unloaded models both see in_flight == None, both proceed past the 503 check, both call swap_to_model. The lock serializes the swap, but the second request now silently re-swaps over the first instead of returning 503. The intended UX (503 + Retry-After when a different model is mid-swap) is defeated. Fix: move the in-flight check inside the lock, or expose the lock state via _swap_lock.locked() and gate on that.

6. Zero tests. ~250 LOC of async state-machine code (concurrency, module-aliasing, external I/O) needs tests covering: (a) flag propagation through both entry points; (b) successful swap → state correct; (c) failed swap → _loading_model cleared, retryable; (d) two concurrent requests for the same model → second is no-op, not double-load; (e) two concurrent requests for different models → second gets 503; (f) _cached_mlx_models filtering positive/negative cases. Per the PR Merge SOP "every new behavior MUST have a new test." The author's local pytest claim of "460 passed" is suspect — the repo has 2289+ tests; that count suggests Python 3.14 collection issues or a stale checkout (we target 3.10–3.13 per pyproject.toml).

P1 — should fix

7. _cached_mlx_models substring filter is fragile. routes/models.py:35–37 — .safetensors doesn't imply MLX-quantized; many PyTorch repos ship safetensors. The substring filter on ("tts", "whisper", "asr", "sentence-transformer", "embed") will also exclude legit models with embed or whisper in unrelated parts of the repo name. Better: read config.json and check for quantization block (MLX-specific marker) or model_type against a known-chat allowlist.

8. /v1/models does filesystem I/O on every call. routes/models.py:_cached_mlx_models walks ~/.cache/huggingface/hub/ and recurses into each snapshots/*/ directory. OpenWebUI hammers /v1/models on tab-switch; this should be cached and invalidated on swap.

9. Python 3.14 in test plan. PR description says "Tested … Python 3.14". We declare >=3.10 with classifiers through 3.13 (pyproject.toml:23–26); 3.14 isn't in our CI matrix. Please retest on 3.12 (the project default) before requesting re-review.

10. PR is CONFLICTING with main since 2026-05-13. Conflicts in chat.py, completions.py, server.py (the same files this PR touches). Needs rebase against raullenchai/main (1cbbb86) before CI can re-run.

Supply-chain audit (PR Merge SOP §2.5)

✅ Clean — no new deps in pyproject.toml, no workflow changes, no install hooks. Author account legitimate (GitHub since 2014, 45 repos).

Recommendation

Request changes. The feature direction is right, but the entry-point gap (P0/#1) means the feature is dead code for rapid-mlx serve users today, and the module-aliasing fix (P0/#2) only addresses one symptom. Closing P0/1–6 + P1/10 (rebase) is the path to merge.

Happy to pair on the entry-point integration if helpful — it's mostly mirroring --enable-on-demand-loading into cli.py:serve_parser and threading it through serve_command to ServerConfig. The hot-path fix is a 1-line asyncio.to_thread().

— Review run via DeepSeek V4 Pro + manual structural audit per the project's PR Merge SOP.

…oints Scope-down of raullenchai#340 v1, per maintainer review (option 2 from review raullenchai#1) and to avoid overlap with raullenchai#319 (lazy-load + idle-unload), which subsumes the swap-infra direction of the original PR. Changes: - helpers.py: extract _is_model_loaded() so single-model and registry modes share one rule (fixes the "default" asymmetry called out as P2-1 in review raullenchai#2 — _validate_model_name() previously accepted "default" only in registry mode). Add ensure_model_loaded() that strictly raises 404 for unknown models (no auto-load). - routes/chat.py + routes/completions.py: call ensure_model_loaded() before _validate_model_name() so the OpenAI-compatible endpoints share one fall-through. - routes/anthropic.py: unchanged (adapter is model-name-agnostic). - tests/test_ensure_model_loaded.py: 12 cases covering single-mode, registry-mode, "default" in both modes, and the strict-404 contract. ensure_model_loaded() is intentionally an extension point — when raullenchai#319 lands, the lifecycle/swap/drain logic fits inside it without touching route code. Removed from v1: swap_to_model + get_loading_model, --enable-on-demand-loading flag, /v1/models cache enumeration, __main__ aliasing workaround. Verified locally: - pytest tests/test_ensure_model_loaded.py → 12 passed - pytest tests/test_routes.py tests/test_chat_route_tool_tag_leak.py \\ tests/test_anthropic_route_auth.py → 65 passed - Branch rebased onto upstream/main at d171d18 (v0.6.50). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

young310 · 2026-05-16T07:23:32Z

@raullenchai Thanks for both review rounds and for surfacing #319 — that materially changes how I want to scope this PR.

Force-pushed a scoped-down revision (832d547). Took option (2) from review #1: dropped the swap infra, --enable-on-demand-loading flag, /v1/models cache enumeration, and the __main__ aliasing workaround. What remains is just the helper extraction and the unified 404 path. The swap/lifecycle work belongs in #319 — please treat this PR as a prep step #319 can build on top of, not a competing implementation.

One thing I want to acknowledge first. Review #1 caught that the "Tested locally … server swaps model automatically" line in the v1 description described behavior the code did not implement (swap_to_model did not exist). Review #3 caught the "Tested on Python 3.14" + "460 passed" mismatch against the supported 3.10–3.13 range / 2289+ test count. Both were description errors on my side and you were right to call them out. This revision only claims what I actually ran; I will be stricter about that going forward.

Diff (4 files, ~60 net lines excluding tests):

vllm_mlx/service/helpers.py: _is_model_loaded() + ensure_model_loaded() (strict 404, no auto-load). Also fixes the P2-1 "default" asymmetry from review feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2 by routing _validate_model_name() through _is_model_loaded(), so single-mode and registry-mode share one rule.
vllm_mlx/routes/chat.py + vllm_mlx/routes/completions.py: call ensure_model_loaded() before _validate_model_name().
vllm_mlx/routes/anthropic.py: unchanged (adapter is model-name-agnostic; SDK clients send claude-* names that shouldn't trigger an HF lookup).
tests/test_ensure_model_loaded.py: 12 cases — single-mode, registry-mode, "default" in both modes, strict-404 contract.

ensure_model_loaded() is intentionally an extension point: when #319 lands, the lifecycle/swap/drain logic fits inside that helper without changing route code.

Explicitly out of scope (deferred to #319 or a separate PR):

swap_to_model / get_loading_model / lifecycle (P0-2 review feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2)
--enable-on-demand-loading plumbed through cli.py:serve_parser (P0-1)
in-flight drain + clean _engine window (P0-2)
await asyncio.to_thread(load_model, ...) to keep the event loop responsive (P0-3 review feat: Prompt cache for SimpleEngine + tool logits safety #3)
registry-mode swap orphan handling (P0-4)
503 TOCTOU under concurrent unknown-model requests (P0-5)
HF-repo regex / disk-space precheck / allowlist (P1-1)
/v1/models cache enumeration with caching + arch-based detection + exception safety (P1-2, P1-3, P1-5)
__main__ aliasing workaround (P1-4)

Verification (only commands I actually ran):

pytest tests/test_ensure_model_loaded.py → 12 passed
pytest tests/test_routes.py tests/test_chat_route_tool_tag_leak.py tests/test_anthropic_route_auth.py → 65 passed
Rebased onto upstream/main at d171d18 (v0.6.50). The earlier conflict against 1cbbb86 is resolved.
Local env is Python 3.14, which is outside the supported 3.10–3.13 range per pyproject.toml. CI on a supported version is authoritative; I have not made any "full suite passes" claim.

Happy to close this entirely and let the helper live inside #319 if you'd rather not carry two PRs. Also happy to narrow further (e.g. only the P2-1 default-asymmetry fix, dropping the new ensure_model_loaded extension point) — just say the word.

raullenchai requested changes May 13, 2026

View reviewed changes

raullenchai requested changes May 14, 2026

View reviewed changes

young310 force-pushed the feat/on-demand-model-loading branch from dec3fed to 832d547 Compare May 16, 2026 07:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: on-demand model loading for all inference endpoints (ollama-style)#340

feat: on-demand model loading for all inference endpoints (ollama-style)#340
young310 wants to merge 1 commit into
raullenchai:mainfrom
young310:feat/on-demand-model-loading

young310 commented May 9, 2026 •

edited

Loading

Uh oh!

raullenchai commented May 10, 2026

Uh oh!

young310 commented May 11, 2026

Uh oh!

young310 commented May 11, 2026

Uh oh!

raullenchai commented May 13, 2026

Uh oh!

raullenchai left a comment

Uh oh!

raullenchai commented May 13, 2026

Uh oh!

raullenchai left a comment

Uh oh!

young310 commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

young310 commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Core helpers (service/helpers.py)

New functions in server.py

Feature gate (--enable-on-demand-loading)

/v1/models now lists all local cache (routes/models.py)

Anthropic route (routes/anthropic.py)

Bug fixed: __main__ module aliasing

Behaviour

Testing

Uh oh!

raullenchai commented May 10, 2026

P0 — blocker (PR is non-functional)

vllm_mlx.server does not export swap_to_model or get_loading_model

P0 — blocker (security / cost)

Auto-loading runs before _validate_model_name

P1 — should fix

P2 — nits

Verdict

Uh oh!

young310 commented May 11, 2026

Code review

Uh oh!

young310 commented May 11, 2026

Uh oh!

raullenchai commented May 13, 2026

Review — PR Merge SOP audit

P0 — blocking merge

P1 — should fix before merge

P2 — nits

Summary

Uh oh!

raullenchai left a comment

Choose a reason for hiding this comment

Uh oh!

raullenchai commented May 13, 2026

Supplementary findings (pr_validate / DeepSeek V4 Pro pass)

Uh oh!

raullenchai left a comment

Choose a reason for hiding this comment

P0 — must fix

P1 — should fix

Supply-chain audit (PR Merge SOP §2.5)

Recommendation

Uh oh!

young310 commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

young310 commented May 9, 2026 •

edited

Loading

Core helpers (`service/helpers.py`)

New functions in `server.py`

Feature gate (`--enable-on-demand-loading`)

`/v1/models` now lists all local cache (`routes/models.py`)

Anthropic route (`routes/anthropic.py`)

Bug fixed: `main` module aliasing

`vllm_mlx.server` does not export `swap_to_model` or `get_loading_model`

Auto-loading runs before `_validate_model_name`