OpenAI-compatible inference server powered by transformers.js. Run ONNX models locally with WebGPU acceleration or CPU — no Python, no CUDA required.
Think vLLM or llama.cpp, but for the TypeScript crowd.

    npx wandler --llm onnx-community/gemma-4-E4B-it-ONNX:q4

    # custom model, precision, device, port
    npx wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX:fp16 --device cpu --port 3000

    # with embeddings and STT
    npx wandler --llm onnx-community/gemma-4-E4B-it-ONNX:q4 \
      --embedding Xenova/all-MiniLM-L6-v2:q8 \
      --stt onnx-community/whisper-tiny:q4

Use it with the OpenAI SDK:

    import OpenAI from "openai";

    // Point the SDK at the local wandler server (default port 8000).
    const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "-" });

    const response = await client.chat.completions.create({
      model: "onnx-community/gemma-4-E4B-it-ONNX",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true,
    });

    // Print each content delta as it arrives.
    for await (const chunk of response) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }

wandler — transformers.js inference server

    Usage:
      wandler --llm org/repo[:precision] [options]
      wandler model ls [--type <type>]

    Commands:
      models                         List available models from the catalog

    Model:
      -l, --llm <id>                 LLM model
      --backend <name>               LLM backend: wandler, transformersjs (default: wandler)
      -e, --embedding <id>           Embedding model
      -s, --stt <id>                 STT model
      -d, --device <type>            Device: auto, cpu, cuda, coreml, dml, webgpu, wasm (default: auto)
      --hf-token <token>             HuggingFace token for gated models
      --cache-dir <path>             Model cache directory

    Server:
      -p, --port <number>            Port (default: 8000)
      --host <addr>                  Bind address (default: 127.0.0.1)
      -k, --api-key <key>            API key for auth (or WANDLER_API_KEY)
      --cors-origin <origin>         Allowed CORS origin (default: *)
      --max-tokens <n>               Max tokens per request (default: 2048)
      --max-concurrent <n>           Max concurrent requests (default: 1)
      --timeout <ms>                 Request timeout in ms (default: 120000)
      --log-level <level>            debug, info, warn, error (default: info)
      --quiet                        Suppress non-error startup/profile logs
      --prefill-chunk-size <n>       Chunk size for long-prompt prefill; auto uses a 640MB GPU attention budget; auto:<mb> customizes it; 0/off disables it
      --decode-loop <mode>           Wandler decode loop: auto/on/off (default: auto; on is experimental)
      --prefix-cache <mode>          Enable prefix KV cache: true/false (default: true)
      --prefix-cache-entries <n>     Prefix KV cache entries (default: 2)
      --prefix-cache-min-tokens <n>  Minimum prefix tokens to cache (default: 512)
      --warmup-tokens <n>            Approximate prompt tokens to run once before serving
      --warmup-max-new-tokens <n>    Max new tokens for startup warmup

    Info:
      -v, --version                  Show version
      -h, --help                     Show this help

    Precision suffixes: q4, q8, fp16, fp32 (default: q4)

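
The `org/repo[:precision]` form accepted by `--llm`, `--embedding`, and `--stt` can be split in a few lines of TypeScript. This is only a sketch, not wandler's own parser: `parseModelSpec` and `ModelSpec` are hypothetical names, and the only facts taken from the help text are the four precision suffixes and the `q4` default.

```typescript
// Hypothetical helper for illustration; not part of wandler's API.
const PRECISIONS = ["q4", "q8", "fp16", "fp32"] as const;
type Precision = (typeof PRECISIONS)[number];

interface ModelSpec {
  repo: string;
  precision: Precision;
}

function parseModelSpec(spec: string): ModelSpec {
  const idx = spec.lastIndexOf(":");
  // No suffix: fall back to the documented default precision.
  if (idx === -1) return { repo: spec, precision: "q4" };

  const repo = spec.slice(0, idx);
  const suffix = spec.slice(idx + 1);
  if (repo === "" || !(PRECISIONS as readonly string[]).includes(suffix)) {
    throw new Error(`Invalid model spec: ${spec}`);
  }
  return { repo, precision: suffix as Precision };
}
```

For example, `parseModelSpec("LiquidAI/LFM2.5-1.2B-Instruct-ONNX:fp16")` yields the repo `LiquidAI/LFM2.5-1.2B-Instruct-ONNX` with precision `fp16`, while a spec without a suffix resolves to `q4`.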
Every CLI flag has a corresponding environment variable:

| Variable | Default | Description |
|---|---|---|
| `WANDLER_LLM` | `onnx-community/gemma-4-E4B-it-ONNX:q4` | LLM model with precision |
| `WANDLER_BACKEND` | `wandler` | LLM backend: `wandler` for Wandler's serving layer, `transformersjs` for the direct baseline |
| `WANDLER_STT` | `onnx-community/whisper-tiny:q4` | Speech-to-text model |
| `WANDLER_EMBEDDING` | — | Embedding model (disabled by default) |
| `WANDLER_DEVICE` | `webgpu` | Device: `webgpu`, `cpu`, `wasm` |
| `WANDLER_PORT` | `8000` | Server port |
| `WANDLER_HOST` | `127.0.0.1` | Bind address |
| `WANDLER_API_KEY` | — | API key for auth |
| `WANDLER_CORS_ORIGIN` | `*` | Allowed CORS origin |
| `WANDLER_MAX_TOKENS` | `2048` | Max tokens per request |
| `WANDLER_MAX_CONCURRENT` | `1` | Max concurrent requests |
| `WANDLER_TIMEOUT` | `120000` | Request timeout (ms) |
| `WANDLER_LOG_LEVEL` | `info` | Log level |
| `WANDLER_QUIET` | `false` | Suppress non-error startup/profile logs |
| `WANDLER_CACHE_DIR` | `~/.cache/huggingface` | Model cache directory (also respects `HF_HOME`) |
| `WANDLER_PREFILL_CHUNK_SIZE` | `auto` | Chunk size for long-prompt prefill; `auto` uses the fastest GPU path that fits a 640MB attention budget, `auto:<mb>` customizes it; set `0`/`off` to disable |
| `WANDLER_DECODE_LOOP` | `auto` | Wandler-owned decode loop for supported text generation; `auto` uses the safe transformers.js `generate()` path, `on` opts into the experimental Wandler loop, `off` disables it |
| `WANDLER_PREFIX_CACHE` | `true` | Enable in-memory prefix KV caching for repeated system/tool prefixes |
| `WANDLER_PREFIX_CACHE_ENTRIES` | `2` | Prefix KV cache entry count |
| `WANDLER_PREFIX_CACHE_MIN_TOKENS` | `512` | Minimum prefix size before caching |
| `WANDLER_WARMUP_TOKENS` | `0` | Approximate prompt tokens to run once before serving |
| `WANDLER_WARMUP_MAX_NEW_TOKENS` | `8` | Max new tokens for startup warmup |
| `HF_TOKEN` | — | HuggingFace token for gated models |

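
To make the table concrete, here is a sketch of how a handful of these variables could be read into a typed config with the documented defaults. `loadConfig`, `ServerConfig`, and `intFrom` are hypothetical names, only a subset of variables is covered, and how wandler itself resolves a flag against its environment variable is not stated here, so the sketch reads the environment only.

```typescript
// Hypothetical config loader; wandler's real resolution logic may differ.
interface ServerConfig {
  llm: string;
  device: string;
  port: number;
  host: string;
  maxTokens: number;
  prefixCache: boolean;
  apiKey?: string;
}

type Env = Record<string, string | undefined>;

// Parse an integer variable, falling back to the documented default when unset.
function intFrom(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n)) throw new Error(`${name} must be an integer, got "${raw}"`);
  return n;
}

function loadConfig(env: Env): ServerConfig {
  return {
    llm: env.WANDLER_LLM ?? "onnx-community/gemma-4-E4B-it-ONNX:q4",
    device: env.WANDLER_DEVICE ?? "webgpu",
    port: intFrom(env, "WANDLER_PORT", 8000),
    host: env.WANDLER_HOST ?? "127.0.0.1",
    maxTokens: intFrom(env, "WANDLER_MAX_TOKENS", 2048),
    // Assumption: any value other than the literal "false" leaves the cache enabled.
    prefixCache: (env.WANDLER_PREFIX_CACHE ?? "true") !== "false",
    apiKey: env.WANDLER_API_KEY,
  };
}
```

In Node, `loadConfig(process.env)` would return the defaults from the table for every variable that is not set.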

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (streaming + non-streaming) |
| `/v1/completions` | POST | Text completion (legacy) |
| `/v1/models` | GET | List loaded models |
| `/v1/models/{id}` | GET | Get model details |

| Endpoint | Method | Description |
|---|---|---|
| `/v1/embeddings` | POST | Text embeddings |

| Endpoint | Method | Description |
|---|---|---|
| `/v1/audio/transcriptions` | POST | Speech-to-text (Whisper) |

| Endpoint | Method | Description |
|---|---|---|
| `/tokenize` | POST | Text to token IDs |
| `/detokenize` | POST | Token IDs to text |
| `/health` | GET | Server status |
| `/admin/metrics` | GET | Request metrics |

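
A script that starts wandler may want to wait until the server is up before sending requests. The sketch below polls `GET /health` until it answers with a 2xx status. `waitForHealthy` and `FetchLike` are hypothetical names, and it is an assumption that a 2xx response means the server is ready; the response body is not documented here, so only the status is checked.

```typescript
// Minimal shape of the fetch function this sketch needs; the global `fetch` satisfies it.
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

async function waitForHealthy(
  baseUrl: string,
  fetchFn: FetchLike,
  attempts = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchFn(`${baseUrl}/health`);
      if (res.ok) return true;
    } catch {
      // Connection refused or similar: the server is not listening yet.
    }
    // Pause between attempts, but not after the last one.
    if (i < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

Calling `await waitForHealthy("http://localhost:8000", fetch)` would then resolve to `true` once the endpoint responds, or `false` after 30 failed attempts.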
Standard OpenAI parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `messages` / `prompt` | array / string | required | Input |
| `temperature` | float | 0.7 | Sampling temperature (0 = greedy) |
| `top_p` | float | 0.95 | Nucleus sampling |
| `max_tokens` | int | 2048 | Max tokens to generate |
| `stream` | bool | false | Enable SSE streaming |
| `stop` | string \| string[] | — | Stop sequences |
| `presence_penalty` | float | 0 | Penalize token presence |
| `frequency_penalty` | float | 0 | Penalize token frequency |
| `response_format` | object | — | `{"type": "json_object"}` for JSON mode |
| `tools` | array | — | Function calling definitions |
| `stream_options` | object | — | `{"include_usage": true}` |

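
With `stream: true`, the response arrives as server-sent events. Assuming wandler follows the OpenAI wire format (one JSON chunk per `data:` line, a final `data: [DONE]`, and a usage-only chunk when `stream_options.include_usage` is set), the events can be collected without the SDK roughly as follows. `collectStream` is a hypothetical helper, and for brevity it expects the whole response body as one string rather than reading it incrementally.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface StreamChunk {
  choices: { delta?: { content?: string | null } }[];
  usage?: Usage | null;
}

// Concatenate the content deltas and pick up the usage chunk, if any.
function collectStream(sseText: string): { text: string; usage?: Usage } {
  let text = "";
  let usage: Usage | undefined;
  for (const line of sseText.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank separator lines
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload) as StreamChunk;
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) usage = chunk.usage;
  }
  return { text, usage };
}
```

A real client would feed this from a streaming reader and buffer partial lines between reads; the OpenAI SDK shown earlier already does that.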
Extended parameters (vLLM/llama.cpp compatible):

| Parameter | Type | Default | Description |
|---|---|---|---|
| `top_k` | int | — | Top-k sampling |
| `min_p` | float | — | Minimum probability threshold |
| `typical_p` | float | — | Locally typical sampling |
| `repetition_penalty` | float | — | Direct repetition penalty (> 1.0) |
| `no_repeat_ngram_size` | int | — | Prevent N-gram repetition |

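
The extended parameters are not part of the OpenAI SDK's request types, so one option is to build the JSON body by hand. The sketch below assumes they sit at the top level of the body next to the standard fields, as they do in vLLM; `buildChatBody`, `ChatMessage`, and `ExtendedSampling` are hypothetical names, and the sampling values are arbitrary illustrations.

```typescript
// Hypothetical types for illustration; not exported by wandler or the OpenAI SDK.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ExtendedSampling {
  top_k?: number;
  min_p?: number;
  typical_p?: number;
  repetition_penalty?: number;
  no_repeat_ngram_size?: number;
}

// Assumption: extended parameters are top-level fields of the request body.
function buildChatBody(
  model: string,
  messages: ChatMessage[],
  sampling: ExtendedSampling = {},
): Record<string, unknown> {
  return { model, messages, ...sampling };
}

const body = buildChatBody(
  "onnx-community/gemma-4-E4B-it-ONNX",
  [{ role: "user", content: "Hello!" }],
  { top_k: 40, min_p: 0.05, repetition_penalty: 1.1 },
);

// Send with any HTTP client, for example:
// await fetch("http://localhost:8000/v1/chat/completions", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```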

| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string \| string[] | required | Text to embed |
| `encoding_format` | string | `"float"` | `"float"` or `"base64"` |

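
With `encoding_format: "base64"`, the OpenAI API returns each embedding as a base64 string of little-endian float32 values instead of a JSON number array. Assuming wandler mirrors that layout, a response could be decoded as sketched here; `decodeBase64Embedding` is a hypothetical helper name.

```typescript
// Decode a base64 string of little-endian float32 values into a Float32Array.
function decodeBase64Embedding(b64: string): Float32Array {
  const binary = atob(b64);
  if (binary.length % 4 !== 0) {
    throw new Error("Embedding byte length is not a multiple of 4");
  }
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  // Read explicitly as little-endian so the result does not depend on host byte order.
  const view = new DataView(bytes.buffer);
  const out = new Float32Array(bytes.length / 4);
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getFloat32(i * 4, true);
  }
  return out;
}
```

The base64 form is mainly useful for large batches, since it is more compact than the equivalent JSON array of floats.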
List all verified models with their capabilities:

    wandler model ls

    type      | size | prec | capabilities       | repo:precision                           | name
    ------------------------------------------------------------------------------------------------------------------------
    llm       | 2B   | q4   | chat, tool-calling | onnx-community/gemma-4-E4B-it-ONNX:q4    | Gemma 4 E4B
    llm       | 1.2B | q4   | chat, tool-calling | LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q4    | LFM 2.5 1.2B
    llm       | 350M | q4   | chat, tool-calling | LiquidAI/LFM2.5-350M-ONNX:q4             | LFM 2.5 350M
    llm       | 0.8B | q4   | chat, tool-calling | onnx-community/Qwen3.5-0.8B-Text-ONNX:q4 | Qwen 3.5 0.8B
    llm       | 1.7B | q4   | chat               | HuggingFaceTB/SmolLM2-1.7B-Instruct:q4   | SmolLM2 1.7B
    embedding | 22M  | q8   | embedding          | Xenova/all-MiniLM-L6-v2:q8               | all-MiniLM-L6-v2
    embedding | 33M  | q8   | embedding          | Xenova/bge-small-en-v1.5:q8              | BGE Small EN v1.5
    embedding | 137M | q8   | embedding          | nomic-ai/nomic-embed-text-v1.5:q8        | Nomic Embed Text v1.5
    stt       | 39M  | q4   | transcription      | onnx-community/whisper-tiny:q4           | Whisper Tiny
    stt       | 74M  | q4   | transcription      | onnx-community/whisper-base:q4           | Whisper Base
    stt       | 244M | q4   | transcription      | onnx-community/whisper-small:q4          | Whisper Small

Filter by type:

    wandler model ls --type llm
    wandler model ls --type embedding
    wandler model ls --type stt

Use the `repo:precision` value directly with `--llm`, `--embedding`, or `--stt`.

Beyond the verified catalog, any ONNX model from onnx-community, or any other transformers.js-compatible model, should work.

wandler parses tool calls from multiple model output formats:

- LFM: `[func_name(arg="val")]` and `[tool_calls [{...}]]`
- Qwen: `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`
- OpenAI JSON: `{"tool_calls": [...]}`

Thinking blocks (`<think>...</think>`) are automatically stripped before parsing.

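
As an illustration of the Qwen format and the thinking-block stripping described above, a minimal parser might look like the following. This is not wandler's implementation: `stripThinking`, `parseQwenToolCalls`, and `ToolCall` are hypothetical names, and the LFM and OpenAI JSON formats are left out.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Remove <think>...</think> blocks so text inside them is never parsed as a call.
function stripThinking(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Extract every <tool_call>{...}</tool_call> block whose body is valid JSON with a name.
function parseQwenToolCalls(output: string): ToolCall[] {
  const text = stripThinking(output);
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  const calls: ToolCall[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    try {
      const parsed = JSON.parse(match[1] ?? "") as Partial<ToolCall>;
      if (typeof parsed.name === "string") {
        calls.push({ name: parsed.name, arguments: parsed.arguments ?? {} });
      }
    } catch {
      // Malformed JSON inside the tag: this sketch simply skips the block.
    }
  }
  return calls;
}
```

Stripping the thinking block first matters because a model may draft a call inside `<think>` that it does not intend to make.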
MIT