Fix ConfigI metallib detection + add minimal local chat UI #6
Open

ljubomirj wants to merge 1 commit
The installer now validates gated_delta kernel presence, fails fast when required kernels are missing, and surfaces concrete Metal toolchain guidance. A single-file local chat UI is added for quick OpenAI-compatible smoke testing.

- Constraint: users can run with incomplete Xcode Metal components and still produce a misleading successful install.
- Rejected: keep warning-only behavior | allows silent runtime failure on ConfigI models.
- Rejected: commit generated activate.sh | machine-specific artifact, not source.
- Confidence: high. Scope-risk: narrow. Reversibility: clean.
- Directive: keep gated_delta detection tied to required kernel symbols for ConfigI-class models.
- Tested: scripts/install.sh syntax check and a full install run with gated_delta verification.
- Tested: vllm serve + /ping + /v1/models + chat completion smoke test.
- Not-tested: remote CI matrix and Homebrew bottle workflow.
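Roughly, the check amounts to scanning the built metallib for the required kernel symbols and aborting with toolchain guidance when any are absent. A minimal Python sketch, assuming hypothetical kernel names and metallib path (the real list lives in scripts/install.sh):

```python
import sys
from pathlib import Path

# Hypothetical values for illustration; the actual symbol list and path
# are defined in scripts/install.sh.
METALLIB = Path("build/default.metallib")
REQUIRED_KERNELS = [b"gated_delta_prefill", b"gated_delta_decode"]

def missing_kernels(path: Path) -> list[bytes]:
    # A compiled .metallib carries its kernel function names as plain
    # strings, so a byte-level scan suffices for a presence check.
    blob = path.read_bytes()
    return [name for name in REQUIRED_KERNELS if name not in blob]

missing = missing_kernels(METALLIB)
if missing:
    names = ", ".join(m.decode() for m in missing)
    sys.exit(
        f"error: {METALLIB} is missing required gated_delta kernels: {names}\n"
        "hint: install the full Xcode Metal toolchain (Xcode > Settings > "
        "Components) and re-run scripts/install.sh"
    )
print("ok: all required gated_delta kernels present")
```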
TheTom added a commit that referenced this pull request on May 7, 2026:
Field reports from the v0.5.1 alpha (Tom's buddy) surfaced 5 obvious bugs and 2 non-obvious ones (Metal-side; tracked separately). This release fixes the obvious ones and locks them down with regression tests.

Bugs fixed:

- #1 vllm not declared as a runtime dep. `pip install vllm-swift==0.5.1` left users at ModuleNotFoundError on first `vllm-swift serve`. pyproject now declares vllm>=0.10. Side benefit: narrows pip's resolver window and stops --pre pulling rc/dev safetensors / tokenizers / transformers.
- #3 reasoning-budget bump clobbered explicit small max_tokens. A client sent max_tokens=64 and got completion_tokens=20480 because the bump fired unconditionally. The client value is now respected when set below 1024 (curl smokes, "say hello", token-count probes). The OpenCode/Hermes 4K-8K starvation case still bumps as before. (Sketch after this list.)
- #7 message.reasoning not normalized to message.reasoning_content. Some vLLM versions emit `reasoning` (their newer naming). Normalize to the OpenAI-standard `reasoning_content` so OpenAI clients (Hermes, openai-python) see the field they expect. The original `reasoning` is preserved for back-compat. (Sketch below.)
- #6 longctx splice spammed 8 chunks regardless of relevance. A trivial "say hello" produced prompt_tokens=5423. Added a cosine-score >= 0.20 floor (env-tunable via LONGCTX_RELEVANCE_FLOOR) that drops noise chunks before splicing. (Sketch below.)
- #2 --max-model-len exceeding the model's max_position_embeddings. A pre-flight check now reads the model's config.json and warns with actual numbers ("65536 exceeds 40960; recommend --max-model-len 40960") instead of letting vLLM reject prompts later with a less specific error. (Sketch below.)

Plus a CI-fixing pass: tests/test_longctx_endpoint.py had stale imports flagged by ruff F811/F401 + I001 (the v0.5.1 commit's CI failed on this). All ruff lints are clean now. 8 new regression tests in tests/test_longctx_endpoint.py pin all five behaviors. 505/505 tests pass total.

NOT fixed in this release (separate Metal-kernel investigation):

- #4 KV-cache corruption signature under turbo4v2 4-bit + sustained decode. Workaround: drop --additional-config or use kv_bits: 8 (asymmetric K8/V4) for the same scheme.
- #5 4× decode throughput decay (128 → 30 tok/s, monotonic); likely the same root cause as #4. Same workaround.

Versions caught up:

- pyproject.toml 0.5.1 → 0.5.2
- __init__.py 0.5.1 → 0.5.2
- homebrew formula 0.5.1 → 0.5.2; bottle SHAs cleared
- scripts/build_bottle.sh 0.5.1 → 0.5.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
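For the #3 fix, a minimal sketch of the clamp as described above; the function and constant names are hypothetical, and only the 1024 cutoff and the 20480 bump value come from the commit message:

```python
from typing import Optional

REASONING_BUDGET = 20480      # the unconditional bump value before the fix
EXPLICIT_SMALL_CUTOFF = 1024  # client values below this are now honored

def effective_max_tokens(client_max_tokens: Optional[int]) -> int:
    # A client that explicitly asked for a small completion gets it verbatim.
    if client_max_tokens is not None and client_max_tokens < EXPLICIT_SMALL_CUTOFF:
        return client_max_tokens
    # Otherwise keep the reasoning-budget bump, which covers the
    # OpenCode/Hermes 4K-8K starvation case the bump existed for.
    return max(client_max_tokens or 0, REASONING_BUDGET)

assert effective_max_tokens(64) == 64       # the reported regression
assert effective_max_tokens(4096) == 20480  # starvation case still bumps
```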
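The #7 normalization is a small field mirror; a sketch assuming the response message is a plain dict:

```python
def normalize_reasoning(message: dict) -> dict:
    # Mirror vLLM's newer `reasoning` key onto the OpenAI-standard
    # `reasoning_content` key; keep the original for back-compat.
    if "reasoning" in message and "reasoning_content" not in message:
        message["reasoning_content"] = message["reasoning"]
    return message

msg = normalize_reasoning({"role": "assistant", "content": "hi", "reasoning": "step 1..."})
assert msg["reasoning_content"] == msg["reasoning"]
```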
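The #6 relevance floor, sketched with a hypothetical `filter_chunks` helper; only the 0.20 default and the LONGCTX_RELEVANCE_FLOOR env var come from the commit message:

```python
import os

# Env-tunable floor; 0.20 is the default named in the commit message.
RELEVANCE_FLOOR = float(os.environ.get("LONGCTX_RELEVANCE_FLOOR", "0.20"))

def filter_chunks(scored_chunks: list[tuple[float, str]]) -> list[str]:
    # Drop chunks whose cosine score sits below the floor before splicing,
    # so irrelevant context no longer inflates prompt_tokens.
    return [chunk for score, chunk in scored_chunks if score >= RELEVANCE_FLOOR]

assert filter_chunks([(0.91, "relevant"), (0.05, "noise")]) == ["relevant"]
```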
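And the #2 pre-flight, sketched as a standalone function; the function name is hypothetical, while config.json and max_position_embeddings are the standard Hugging Face config fields:

```python
import json
import warnings
from pathlib import Path

def preflight_max_model_len(model_dir: str, max_model_len: int) -> None:
    # Warn with concrete numbers when --max-model-len exceeds the model's
    # positional limit, instead of deferring to vLLM's later, vaguer error.
    config = json.loads((Path(model_dir) / "config.json").read_text())
    limit = config.get("max_position_embeddings")
    if limit is not None and max_model_len > limit:
        warnings.warn(
            f"--max-model-len {max_model_len} exceeds the model's "
            f"max_position_embeddings ({limit}); recommend --max-model-len {limit}"
        )
```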
Hello, thanks for sharing this project.
I wanted to test Qwen3.6-27B dense (ConfigI) tok/s on an M2 Max and found a few setup issues while bringing the repo up locally. This PR contains the minimal fixes needed to get a reliable install path and a quick local validation workflow.
What is included:

- scripts/install.sh now verifies that the built metallib contains the required gated_delta kernel symbols, fails fast when any are missing, and points at the Metal toolchain components to install.
- A single-file local chat UI for quick OpenAI-compatible smoke testing against a running server.
Notes:

- The generated activate.sh is intentionally not committed; it is a machine-specific artifact, not source.
- Verified locally: scripts/install.sh syntax check, a full install run with gated_delta verification, and a vllm serve + /ping + /v1/models + chat completion smoke test (see the sketch below). The remote CI matrix and Homebrew bottle workflow were not exercised.
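For reference, the smoke test can be reproduced with nothing but the Python standard library; the port and payload shape here are assumptions about a default local vllm serve:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed default port for a local `vllm serve`

def get(path: str) -> str:
    with urllib.request.urlopen(BASE + path) as resp:
        return resp.read().decode()

print(get("/ping"))  # liveness check
model_id = json.loads(get("/v1/models"))["data"][0]["id"]

# OpenAI-compatible chat completion with a deliberately small max_tokens.
req = urllib.request.Request(
    BASE + "/v1/chat/completions",
    data=json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": "say hello"}],
        "max_tokens": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```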
Thanks again for publishing this on GitHub — very interesting work.