Fix ConfigI metallib detection + add minimal local chat UI #6

Open

ljubomirj wants to merge 1 commit into TheTom:main from ljubomirj:pr/configi-gdn-metallib-chat


Conversation


ljubomirj commented Apr 26, 2026

Hello, thanks for sharing this project.

I wanted to test Qwen3.6-27B dense (ConfigI) tok/s on an M2 Max and hit a few setup issues while bringing the repo up locally. This PR contains the minimal fixes needed for a reliable install path and a quick local validation workflow.

What is included:

  • installer: robust gated_delta kernel detection in mlx.metallib (see the sketch after this list)
  • installer: fail-fast when required ConfigI kernels are missing (with an explicit override)
  • installer: clearer diagnostics for a missing Metal toolchain
  • docs: updated quick-start/troubleshooting for the ConfigI model flow
  • tool: chat.html, a single-file OpenAI-compatible local chat client for smoke tests
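Since the detection is the core of the change, here is a minimal Python sketch of the idea. The real check lives in scripts/install.sh; the kernel-name substring and library path below are placeholders, not the exact values the installer uses.

```python
# Sketch only: fail if the compiled mlx.metallib lacks the gated_delta kernels
# that ConfigI-class models need. Substrings and path are illustrative.
from pathlib import Path

REQUIRED_KERNEL_SUBSTRINGS = ["gated_delta"]

def missing_kernels(metallib_path: str) -> list[str]:
    data = Path(metallib_path).read_bytes()
    return [s for s in REQUIRED_KERNEL_SUBSTRINGS if s.encode() not in data]

if __name__ == "__main__":
    missing = missing_kernels("mlx.metallib")
    if missing:
        raise SystemExit(f"missing required kernels: {missing} "
                         "(check the Xcode Metal toolchain, or pass the explicit override)")
    print("gated_delta kernels present")
```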

Notes:

  • I observed ~8-10 tok/s chatting with Qwen3.6-27B-ConfigI-MLX on my machine.
  • Investigation and patch drafting were done with Codex while I provided direction and validation goals.

Thanks again for publishing this on GitHub — very interesting work.

The installer now validates gated_delta kernel presence, fails fast when required kernels are missing, and surfaces concrete Metal toolchain guidance. A single-file local chat UI is added for quick OpenAI-compatible smoke testing.
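For anyone who wants to reproduce the chat-completion smoke test without opening chat.html, a minimal Python equivalent is sketched below; the base URL, API key, and model name are assumptions about the local setup, not fixed values.

```python
# Sketch only: chat-completion smoke test against a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3.6-27B-ConfigI-MLX",
    messages=[{"role": "user", "content": "say hello"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```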

Constraint: users with incomplete Xcode Metal components could previously complete the install and get a misleadingly successful result

Rejected: Keep warning-only behavior | allows silent runtime failure on ConfigI models

Rejected: Commit generated activate.sh | machine-specific artifact, not source

Confidence: high

Scope-risk: narrow

Reversibility: clean

Directive: Keep gated_delta detection tied to required kernel symbols for ConfigI-class models

Tested: scripts/install.sh syntax check and full install run with gated_delta verification

Tested: vllm serve + /ping + /v1/models + chat completion smoke test

Not-tested: Remote CI matrix and Homebrew bottle workflow
TheTom added a commit that referenced this pull request May 7, 2026
Field reports from the v0.5.1 alpha (Tom's buddy) surfaced 5 obvious
bugs and 2 non-obvious ones (Metal-side; tracked separately). This
release fixes the obvious ones and locks them down with regression
tests.

Bugs fixed:
- #1 vllm not declared as runtime dep. `pip install vllm-swift==0.5.1`
  left users with a ModuleNotFoundError on first `vllm-swift serve`.
  pyproject now declares vllm>=0.10. Side benefit: it narrows pip's
  resolver window and stops --pre from pulling rc/dev safetensors /
  tokenizers / transformers.
- #3 reasoning-budget bump clobbered explicit small max_tokens. A client
  sent max_tokens=64 and got completion_tokens=20480 because the bump
  fired unconditionally. Explicit values below 1024 are now respected
  (curl smokes, "say hello", token-count probes); the OpenCode/Hermes
  4K-8K starvation case still bumps as before (sketched after this
  list).
- #7 message.reasoning not normalized to message.reasoning_content.
  Some vLLM versions emit `reasoning` (their newer naming). Normalize
  to the OpenAI-standard `reasoning_content` so OpenAI clients (Hermes,
  openai-python) see the field they expect. The original `reasoning` is
  preserved for back-compat (sketch below).
- #6 longctx splice spammed 8 chunks regardless of relevance. A trivial
  "say hello" produced prompt_tokens=5423. Added a cosine-score >= 0.20
  floor (env-tunable via LONGCTX_RELEVANCE_FLOOR) that drops noise
  chunks before splicing (sketch below).
- #2 --max-model-len exceeding the model's max_position_embeddings.
  A pre-flight check reads the model's config.json and warns with
  actual numbers ("65536 exceeds 40960; recommend --max-model-len
  40960") instead of letting vLLM reject prompts later with a less
  specific error (sketch below).
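For readers skimming the changes, the #3 behavior reduces to something like this minimal Python sketch; the function name, the 20480 default, and the 1024 threshold are taken from the description above, not from the actual code.

```python
# Sketch only: respect small explicit budgets, still bump starvation cases.
def effective_max_tokens(client_max_tokens: int | None, bump_to: int = 20480) -> int:
    if client_max_tokens is not None and client_max_tokens < 1024:
        # curl smokes, "say hello", token-count probes keep their budget
        return client_max_tokens
    # OpenCode/Hermes 4K-8K starvation cases still get bumped as before
    return max(client_max_tokens or 0, bump_to)
```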
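The #7 normalization is equally small; a sketch assuming the message is a plain dict:

```python
# Sketch only: mirror vLLM's `reasoning` field into `reasoning_content`
# so OpenAI clients see the field they expect; keep the original for back-compat.
def normalize_reasoning(message: dict) -> dict:
    if "reasoning" in message and "reasoning_content" not in message:
        message["reasoning_content"] = message["reasoning"]
    return message
```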
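The #6 relevance floor amounts to a filter like the following sketch; only the env var name and the 0.20 default come from the description, the data shapes are assumptions.

```python
import os

# Sketch only: drop longctx chunks whose cosine score falls below the floor.
RELEVANCE_FLOOR = float(os.environ.get("LONGCTX_RELEVANCE_FLOOR", "0.20"))

def filter_chunks(scored_chunks: list[tuple[float, str]]) -> list[str]:
    return [chunk for score, chunk in scored_chunks if score >= RELEVANCE_FLOOR]
```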
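And the #2 pre-flight boils down to comparing the requested length against the model's config.json; a rough sketch, with paths and wording illustrative.

```python
import json
from pathlib import Path

# Sketch only: warn with concrete numbers when the requested --max-model-len
# exceeds the model's max_position_embeddings.
def check_max_model_len(model_dir: str, requested: int) -> None:
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    limit = cfg.get("max_position_embeddings")
    if limit is not None and requested > limit:
        print(f"warning: {requested} exceeds {limit}; recommend --max-model-len {limit}")
```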

Plus a CI-fixing pass: tests/test_longctx_endpoint.py had stale
imports flagged by ruff F811/F401 + I001 (the v0.5.1 commit's CI
failed on this). All ruff lint clean now.

8 new regression tests in tests/test_longctx_endpoint.py pin all five
behaviors. 505/505 tests pass total.

NOT fixed in this release (separate Metal-kernel investigation):
- #4 KV-cache corruption signature under turbo4v2 4-bit + sustained
  decode. Workaround: drop --additional-config or use kv_bits: 8
  (asymmetric K8/V4) for the same scheme.
- #5 4× decode throughput decay (128 → 30 tok/s monotonic) — likely
  same root cause as #4. Same workaround.

Versions caught up:
  pyproject.toml           0.5.1 → 0.5.2
  __init__.py              0.5.1 → 0.5.2
  homebrew formula         0.5.1 → 0.5.2; bottle SHAs cleared
  scripts/build_bottle.sh  0.5.1 → 0.5.2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>