Fix ConfigI metallib detection + add minimal local chat UI #6

Open

ljubomirj wants to merge 1 commit into TheTom:main from ljubomirj:pr/configi-gdn-metallib-chat


Conversation


ljubomirj commented Apr 26, 2026

Hello, thanks for sharing this project.

I wanted to test Qwen3.6-27B dense (ConfigI) tok/s on an M2 Max and hit a few setup issues while bringing the repo up locally. This PR contains the minimal fixes needed for a reliable install path and a quick local validation workflow.

What is included:

  • installer: robust gated_delta kernel detection in mlx.metallib (see the sketch after this list)
  • installer: fail-fast when required ConfigI kernels are missing (with an explicit override)
  • installer: clearer diagnostics for a missing Metal toolchain
  • docs: updated quick-start/troubleshooting for the ConfigI model flow
  • tool: chat.html, a single-file OpenAI-compatible local chat client for smoke tests
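Since the detection is the core of the change, here is a minimal Python sketch of the idea. The real check lives in scripts/install.sh; the kernel-name substring and library path below are placeholders, not the exact values the installer uses.

```python
# Sketch only: fail if the compiled mlx.metallib lacks the gated_delta kernels
# that ConfigI-class models need. Substrings and path are illustrative.
from pathlib import Path

REQUIRED_KERNEL_SUBSTRINGS = ["gated_delta"]

def missing_kernels(metallib_path: str) -> list[str]:
    data = Path(metallib_path).read_bytes()
    return [s for s in REQUIRED_KERNEL_SUBSTRINGS if s.encode() not in data]

if __name__ == "__main__":
    missing = missing_kernels("mlx.metallib")
    if missing:
        raise SystemExit(f"missing required kernels: {missing} "
                         "(check the Xcode Metal toolchain, or pass the explicit override)")
    print("gated_delta kernels present")
```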

Notes:

  • I observed ~8-10 tok/s chatting with Qwen3.6-27B-ConfigI-MLX on my machine.
  • Investigation and patch drafting were done with Codex while I provided direction and validation goals.

Thanks again for publishing this on GitHub — very interesting work.

The installer now validates gated_delta kernel presence, fails fast when required kernels are missing, and surfaces concrete Metal toolchain guidance. A single-file local chat UI is added for quick OpenAI-compatible smoke testing.
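For anyone who wants to reproduce the chat-completion smoke test without opening chat.html, a minimal Python equivalent is sketched below; the base URL, API key, and model name are assumptions about the local setup, not fixed values.

```python
# Sketch only: chat-completion smoke test against a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.6-27B-ConfigI-MLX",
    messages=[{"role": "user", "content": "say hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```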

Constraint: users with incomplete Xcode Metal components could previously complete the install and get a misleadingly successful result

Rejected: Keep warning-only behavior | allows silent runtime failure on ConfigI models

Rejected: Commit generated activate.sh | machine-specific artifact, not source

Confidence: high

Scope-risk: narrow

Reversibility: clean

Directive: Keep gated_delta detection tied to required kernel symbols for ConfigI-class models

Tested: scripts/install.sh syntax check and full install run with gated_delta verification

Tested: vllm serve + /ping + /v1/models + chat completion smoke test

Not-tested: Remote CI matrix and Homebrew bottle workflow
TheTom added a commit that referenced this pull request May 7, 2026
Field reports from the v0.5.1 alpha (Tom's buddy) surfaced 5 obvious
bugs and 2 non-obvious ones (Metal-side; tracked separately). This
release fixes the obvious ones and locks them down with regression
tests.

Bugs fixed:
- #1 vllm not declared as runtime dep. `pip install vllm-swift==0.5.1`
  left users with a ModuleNotFoundError on first `vllm-swift serve`.
  pyproject now declares vllm>=0.10. Side benefit: it narrows pip's
  resolver window and stops --pre from pulling rc/dev safetensors /
  tokenizers / transformers.
- #3 reasoning-budget bump clobbered explicit small max_tokens. A client
  sent max_tokens=64 and got completion_tokens=20480 because the bump
  fired unconditionally. Explicit values below 1024 are now respected
  (curl smokes, "say hello", token-count probes); the OpenCode/Hermes
  4K-8K starvation case still bumps as before (sketched after this
  list).
- #7 message.reasoning not normalized to message.reasoning_content.
  Some vLLM versions emit `reasoning` (their newer naming). Normalize
  to the OpenAI-standard `reasoning_content` so OpenAI clients (Hermes,
  openai-python) see the field they expect. The original `reasoning` is
  preserved for back-compat (sketch below).
- #6 longctx splice spammed 8 chunks regardless of relevance. A trivial
  "say hello" produced prompt_tokens=5423. Added a cosine-score >= 0.20
  floor (env-tunable via LONGCTX_RELEVANCE_FLOOR) that drops noise
  chunks before splicing (sketch below).
- #2 --max-model-len exceeding the model's max_position_embeddings.
  A pre-flight check reads the model's config.json and warns with
  actual numbers ("65536 exceeds 40960; recommend --max-model-len
  40960") instead of letting vLLM reject prompts later with a less
  specific error (sketch below).
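For readers skimming the changes, the #3 behavior reduces to something like this minimal Python sketch; the function name, the 20480 default, and the 1024 threshold are taken from the description above, not from the actual code.

```python
# Sketch only: respect small explicit budgets, still bump starvation cases.
def effective_max_tokens(client_max_tokens: int | None, bump_to: int = 20480) -> int:
    if client_max_tokens is not None and client_max_tokens < 1024:
        # curl smokes, "say hello", token-count probes keep their budget
        return client_max_tokens
    # OpenCode/Hermes 4K-8K starvation cases still get bumped as before
    return max(client_max_tokens or 0, bump_to)
```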
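The #7 normalization is equally small; a sketch assuming the message is a plain dict:

```python
# Sketch only: mirror vLLM's `reasoning` field into `reasoning_content`
# so OpenAI clients see the field they expect; keep the original for back-compat.
def normalize_reasoning(message: dict) -> dict:
    if "reasoning" in message and "reasoning_content" not in message:
        message["reasoning_content"] = message["reasoning"]
    return message
```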
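The #6 relevance floor amounts to a filter like the following sketch; only the env var name and the 0.20 default come from the description, the data shapes are assumptions.

```python
import os

# Sketch only: drop longctx chunks whose cosine score falls below the floor.
RELEVANCE_FLOOR = float(os.environ.get("LONGCTX_RELEVANCE_FLOOR", "0.20"))

def filter_chunks(scored_chunks: list[tuple[float, str]]) -> list[str]:
    return [chunk for score, chunk in scored_chunks if score >= RELEVANCE_FLOOR]
```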
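And the #2 pre-flight boils down to comparing the requested length against the model's config.json; a rough sketch, with paths and wording illustrative.

```python
import json
from pathlib import Path

# Sketch only: warn with concrete numbers when the requested --max-model-len
# exceeds the model's max_position_embeddings.
def check_max_model_len(model_dir: str, requested: int) -> None:
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    limit = cfg.get("max_position_embeddings")
    if limit is not None and requested > limit:
        print(f"warning: {requested} exceeds {limit}; recommend --max-model-len {limit}")
```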

Plus a CI-fixing pass: tests/test_longctx_endpoint.py had stale
imports flagged by ruff F811/F401 + I001 (the v0.5.1 commit's CI
failed on this). All ruff lint clean now.

8 new regression tests in tests/test_longctx_endpoint.py pin all five
behaviors. 505/505 tests pass total.

NOT fixed in this release (separate Metal-kernel investigation):
- #4 KV-cache corruption signature under turbo4v2 4-bit + sustained
  decode. Workaround: drop --additional-config or use kv_bits: 8
  (asymmetric K8/V4) for the same scheme.
- #5 4× decode throughput decay (128 → 30 tok/s monotonic) — likely
  same root cause as #4. Same workaround.

Versions caught up:
  pyproject.toml           0.5.1 → 0.5.2
  __init__.py              0.5.1 → 0.5.2
  homebrew formula         0.5.1 → 0.5.2; bottle SHAs cleared
  scripts/build_bottle.sh  0.5.1 → 0.5.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>