A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.
Run vLLM workloads on Apple Silicon behind an OpenAI-compatible API, with up to 2.6× faster short-context decode.
Homebrew (recommended for Mac power users):

```bash
brew tap TheTom/tap && brew install vllm-swift
```

pip (everyone else, including dev containers and non-brew Macs):

```bash
pip install vllm-swift
```

The pip wheel bundles the prebuilt Swift bridge dylib and Metal kernel library, so no compile or brew step is required. Requires Apple Silicon, Python 3.10+, macOS 14+.
From source:
```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh   # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh     # sets DYLD_LIBRARY_PATH (generated by install.sh)
```

Download a model and serve it:

```bash
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096   # increase as needed, max 40960
```

Homebrew users don't need activate.sh; `vllm-swift serve` handles everything.
Server running at http://localhost:8000 (OpenAI-compatible API).
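Any OpenAI client can talk to it. A minimal sketch with the official `openai` Python package, assuming the server was started with `--served-model-name qwen3-4b` (otherwise the model name defaults to the path you passed):

```python
# Minimal client sketch; assumes the server runs on the default port 8000
# and was started with --served-model-name qwen3-4b.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # any non-empty key works

resp = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    temperature=0.7,
    max_tokens=64,
)
print(resp.choices[0].message.content)
```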
Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
Decode throughput in tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy sampling (temperature 0). Both engines were measured with an offline benchmark (no HTTP overhead): vllm-swift uses the Swift/Metal engine via ctypes; vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |
Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
TurboQuant+ KV Cache Compression
TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.
Qwen3.5 2B (4-bit weights)
| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|---|---|---|---|---|---|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
The entire forward pass runs in Swift/Metal. Python is used only for orchestration.
```
Python (vLLM API, tokenization, scheduling)   ← github.com/vllm-project/vllm
        ↓ ctypes FFI
C bridge (bridge.h)
        ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
        ↓
Metal GPU
```
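For orientation, here is a hedged sketch of what that ctypes boundary can look like. The export name and signature below (`vllm_bridge_decode_batch`) are invented for illustration; the real @_cdecl exports are declared in bridge.h and will differ.

```python
# Illustrative only: symbol name and signature are placeholders, not the real bridge.h API.
import ctypes

# The plugin loads the Swift bridge dylib shipped by the wheel / built by install.sh.
lib = ctypes.CDLL("libVLLMBridge.dylib")  # found via DYLD_LIBRARY_PATH

# Hypothetical @_cdecl export: run one batched decode step and return a status code,
# writing the next token id for each sequence into an output buffer.
lib.vllm_bridge_decode_batch.argtypes = [
    ctypes.POINTER(ctypes.c_int32),  # last token id per sequence, [batch]
    ctypes.c_int32,                  # batch size
    ctypes.POINTER(ctypes.c_int32),  # out: next token id per sequence, [batch]
]
lib.vllm_bridge_decode_batch.restype = ctypes.c_int32

batch = (ctypes.c_int32 * 2)(42, 7)
next_tokens = (ctypes.c_int32 * 2)()
status = lib.vllm_bridge_decode_batch(batch, 2, next_tokens)
```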
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses (example after this list)
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in batched path
- Auto model download from HuggingFace Hub
- TurboQuant+ KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm
- longctx code-aware retrieval companion (`--enable-longctx`, experimental)
- TriAttention V3 query-aware KV-cache eviction (env-gated, experimental; pair with longctx, see Effectively-unbounded context)
- Decode and prompt logprobs
- Greedy and temperature sampling
- EOS / stop token detection (vLLM scheduler)
- VLM (vision-language model) support (experimental)
- Works with Hermes, OpenCode, and any OpenAI-compatible client
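Streaming uses the standard OpenAI SSE protocol; a sketch with the `openai` Python client (model name assumes `--served-model-name qwen3-4b`):

```python
# Streaming sketch; tokens arrive as SSE chunks and are printed as they stream in.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about Metal shaders."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```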
```bash
# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes
```

Then point your tool at it:
```bash
# Hermes (set in ~/.hermes/config.yaml):
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'
```
vllm-swift serve is a thin wrapper around `vllm serve`; all standard vLLM flags work. Here are the common setups.

Basic:

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960
```

Tool calling:

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes
```

Reasoning (chain-of-thought) parsing:

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1
```

Long context with TurboQuant+
Compress KV cache 3-5× to fit longer context with modest throughput cost:
```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
```

| Scheme | Compression | Best for |
|---|---|---|
| `turbo4v2` | ~3× | Recommended: best quality/compression balance |
| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |
Effectively-unbounded context

Both pieces are off by default. The wiring is verified on Qwen3.5-2B-4bit (M5 Max and M2 Mac mini) up to 256K. Other model families load fine but are less tested. APIs and defaults may change.
Two independent features that compose:

- longctx: a retrieval companion that adds a `## Retrieved code context` block to chat completions, sourced from the user's repo. See longctx.
- TriAttention V3: a query-aware KV-cache eviction policy. Independent Swift port and hybrid extension of Mao et al., "TriAttention: Efficient Long Reasoning with Trigonometric KV Compression" (arXiv:2604.04921). Drops low-salience cells once the cache passes a budget. Design doc.
| Workload | Use this |
|---|---|
| Coding assistant on a local repo | longctx alone |
| Long single-shot prompt that fits the model's context window | TurboQuant+ KV codec (turbo4v2) |
| Long multi-turn chat that would otherwise outgrow GPU memory | V3 + longctx |
| Retrieval-heavy workloads (NIAH-style) at 32K+ | V3 + longctx |
V3 alone is not recommended. Eviction is one-way; without a recovery layer, evicted facts are gone. On Qwen3.5-2B-4bit at 32K → 256K, V3-only misses recall at every rung; V3+longctx passes at every rung (table at the bottom of this section).
`--enable-longctx` and the V3 rescue path both need the longctx-svc Python service. Install it once into the bundled vllm-swift venv:

```bash
~/.vllm-swift/venv/bin/pip install longctx-svc
```

Or install it globally if you'd rather:

```bash
pip install longctx-svc
```

Then serve with longctx enabled:

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --enable-longctx
```

The sidecar boots automatically when `--enable-longctx` is set. Each chat completion's prompt is scanned for absolute file paths; the first path's project root is detected (.git, package.json, etc.), the repo is indexed, and the top-K relevant chunks are spliced in as a system message. Works alongside `--enable-auto-tool-choice`. See longctx for scope tuning, watch mode, and tester notes.
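Retrieval is triggered purely by the prompt contents: include an absolute path to a file in the repo you want indexed and the server does the rest. A sketch (the path below is made up; use one from your own project):

```python
# longctx trigger sketch; the absolute path is a placeholder for a file in your own repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{
        "role": "user",
        "content": "In /Users/me/projects/myapp/src/retry.ts, why does the retry loop never back off?",
    }],
)
# Server side: /Users/me/projects/myapp is detected as the project root, indexed,
# and the top-K chunks are spliced in as a "## Retrieved code context" system message.
print(resp.choices[0].message.content)
```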
To run V3 together with longctx:

```bash
# Start longctx-svc separately (the auto-spawn from --enable-longctx
# is enough for the code-aware path; the V3 rescue path benefits
# from a long-running shared instance)
~/.vllm-swift/venv/bin/longctx-svc serve --host 127.0.0.1 --port 5054 &

# Serve with V3 + longctx wired together
VLLM_TRIATT_ENABLED=1 \
VLLM_TRIATT_BUDGET=230400 \
VLLM_TRIATT_WINDOW=128 \
VLLM_TRIATT_PREFIX=32 \
VLLM_TRIATT_WARMUP=256 \
VLLM_TRIATT_HYBRID=2 \
LONGCTX_ENDPOINT=http://127.0.0.1:5054 \
vllm-swift serve ~/models/Qwen3.5-2B-4bit \
  --served-model-name qwen35-2b \
  --enable-longctx \
  --max-model-len 262144
```

Set `VLLM_TRIATT_BUDGET` to ~90% of `--max-model-len` for a 10% eviction headroom. The auto Tier-3 rehydrate hook fires before each turn's prefill, queries longctx with the user's question, and prepends the recovered chunks as a system message. The model sees a normal multi-turn chat with the recovered context up top.
| Var | Default | Purpose |
|---|---|---|
| `VLLM_TRIATT_ENABLED` | unset (off) | Master switch |
| `VLLM_TRIATT_BUDGET` | required | KV cells to keep (set to ~90% of `--max-model-len` for 10% eviction headroom) |
| `VLLM_TRIATT_WINDOW` | 128 | Always-keep recent window |
| `VLLM_TRIATT_PREFIX` | 32 | Always-keep prompt prefix |
| `VLLM_TRIATT_WARMUP` | 256 | Tokens before first eviction round |
| `VLLM_TRIATT_HYBRID` | 2 | Eviction policy mode |
| `LONGCTX_ENDPOINT` | unset | URL of longctx-svc; required for the rescue path |
- V3 cache (`TriAttentionKVCache`) is FP16 only. Stacking V3 with TurboQuant codecs (`turbo4v2`, `turbo8v4`) is not yet supported. Track progress at mlx-swift-lm task #187.
- V3 hooks are wired on Qwen3 / Qwen3.5 / Qwen3-MoE / Llama / Mistral3 / Phi / Phi3 / Gemma3 / GLM4. Other model families fall back to non-V3 caches.
- Tier-3 rehydrate auto-fires only through the chat-completions multi-turn path (`ChatSession` in mlx-swift-lm). Single-shot completions skip it; document or NIAH-style workloads need to be structured as a 2+ turn chat (see the sketch after this list).
- longctx-svc is an alpha companion service with its own caveats; see its README.
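In practice that means sending the long material as turn one and the question as turn two, so the rehydrate hook has a prior turn to recover from. A sketch (document and question are placeholders; model name matches the V3 example above):

```python
# Two-turn sketch so the Tier-3 rehydrate hook fires before the second turn's prefill.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
long_document = open("big_report.txt").read()  # placeholder long context

# Turn 1: feed the long context; V3 may evict low-salience cells as the cache fills.
turn1 = [{"role": "user", "content": f"Here is the document:\n\n{long_document}\n\nAcknowledge receipt."}]
first = client.chat.completions.create(model="qwen35-2b", messages=turn1)

# Turn 2: ask the question; the rehydrate hook queries longctx with it and prepends
# recovered chunks as a system message before prefill.
turn2 = turn1 + [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "What deadline does section 7 set for the audit?"},
]
second = client.chat.completions.create(model="qwen35-2b", messages=turn2)
print(second.choices[0].message.content)
```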
12-cell ramp on Qwen3.5-2B-4bit (M5 Max), 32K → 256K planted-fact NIAH:
| ctx | baseline turbo8v4 | V3 only | V3 + longctx |
|---|---|---|---|
| 32K | ✓HIT | ✗miss | ✓HIT |
| 64K | ✓HIT | ✗miss | ✓HIT |
| 128K | ✓HIT | ✗miss | ✓HIT |
| 256K | ✓HIT | ✗miss | ✓HIT |
Sources: longctx and the TriAttention paper.
Everything together:

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
```

```
vllm-swift serve <model> [options]
--served-model-name NAME Clean model name for API clients (recommended)
--max-model-len N Max sequence length (default: model config)
--port PORT API server port (default: 8000)
--gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
--dtype float16 Model dtype (default: float16)
--enable-auto-tool-choice Enable tool/function calling
--tool-call-parser NAME Tool call format (hermes, llama3, mistral, etc.)
--enable-reasoning Enable chain-of-thought parsing
--reasoning-parser NAME Reasoning format (deepseek_r1, etc.)
--additional-config JSON      Extra config (kv_scheme, kv_bits)
```

All standard vLLM flags work; these are just the most common ones.
| Doc | What's in it |
|---|---|
| docs/PERFORMANCE.md | Full perf matrix vs vllm-metal, methodology, long-context cells |
| docs/MODEL_COMPATIBILITY.md | Empirical pass / soft-fail / hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing) |
| docs/TROUBLESHOOTING.md | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |
| CHANGELOG.md | Release history |
See CHANGELOG.md for release history.
- LoRA not supported (Swift engine limitation)
- Chunked prefill disabled (Swift engine handles full sequences)
- `top_p` sampling not supported in the batched decode path (temperature works)
- Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
- Requires macOS on Apple Silicon (no Linux/CUDA)
```bash
brew tap TheTom/tap && brew install vllm-swift
```

Prebuilt bottle; no Swift toolchain needed. First run of `vllm-swift serve` sets up a managed Python environment automatically.
To update to the latest version:
```bash
vllm-swift update

# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift
```

From source:

```bash
git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh # builds Swift, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```

Or build and install manually:

```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```

Homebrew checksum error on reinstall:

```bash
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift
```

"No module named vllm" or plugin not loading after brew install:

```bash
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup
```

vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:

```bash
# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup
# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllmactivate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.
Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:
```bash
cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
   $(dirname $(echo $DYLD_LIBRARY_PATH | cut -d: -f1))/
```

Downloading models:

```bash
vllm-swift download mlx-community/Qwen3-4B-4bit
# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit
# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest
```

Project layout:

```
vllm_swift/                Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/      C bridge (@_cdecl exports)
    bridge.h               C API (prefill, decode, batched decode)
scripts/
  install.sh               One-step build + install
  build_bottle.sh          Build + upload Homebrew bottle
  integration_test.sh      End-to-end smoke test
homebrew/
  vllm-swift.rb            Homebrew formula
tests/                     84 tests, 97% coverage
```
- macOS 14+ on Apple Silicon
- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
- Python 3.10+
- vLLM 0.19+
- mlx-swift-lm (pulled automatically by Swift Package Manager)
Apache-2.0
