Skip to content

runpod-labs/wandler

Repository files navigation

wandler

OpenAI-compatible inference server powered by transformers.js. Run ONNX models locally with WebGPU acceleration or CPU — no Python, no CUDA required.

Think vLLM or llama.cpp, but for the ts crowd.

Quickstart

npx wandler --llm onnx-community/gemma-4-E4B-it-ONNX:q4
# custom model, precision, device, port
npx wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX:fp16 --device cpu --port 3000
# with embeddings and STT
npx wandler --llm onnx-community/gemma-4-E4B-it-ONNX:q4 \
  --embedding Xenova/all-MiniLM-L6-v2:q8 \
  --stt onnx-community/whisper-tiny:q4

Use it with the OpenAI SDK:

import OpenAI from "openai";

const client = new OpenAI({ baseURL: "http://localhost:8000/v1", apiKey: "-" });

const response = await client.chat.completions.create({
  model: "onnx-community/gemma-4-E4B-it-ONNX",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

CLI

wandler — transformers.js inference server

Usage:
  wandler --llm org/repo[:precision] [options]
  wandler model ls [--type <type>]

Commands:
  models                    List available models from the catalog

Model:
  -l, --llm <id>              LLM model
      --backend <name>        LLM backend: wandler, transformersjs (default: wandler)
  -e, --embedding <id>        Embedding model
  -s, --stt <id>              STT model
  -d, --device <type>         Device: auto, cpu, cuda, coreml, dml, webgpu, wasm (default: auto)
      --hf-token <token>      HuggingFace token for gated models
      --cache-dir <path>      Model cache directory

Server:
  -p, --port <number>         Port (default: 8000)
      --host <addr>           Bind address (default: 127.0.0.1)
  -k, --api-key <key>         API key for auth (or WANDLER_API_KEY)
      --cors-origin <origin>  Allowed CORS origin (default: *)
      --max-tokens <n>        Max tokens per request (default: 2048)
      --max-concurrent <n>    Max concurrent requests (default: 1)
      --timeout <ms>          Request timeout in ms (default: 120000)
      --log-level <level>     debug, info, warn, error (default: info)
      --quiet                 Suppress non-error startup/profile logs
      --prefill-chunk-size <n>
                              Chunk size for long-prompt prefill; auto uses a 640MB GPU attention budget; auto:<mb> customizes it; 0/off disables it
      --decode-loop <mode>     Wandler decode loop: auto/on/off (default: auto; on is experimental)
      --prefix-cache <mode>   Enable prefix KV cache: true/false (default: true)
      --prefix-cache-entries <n>
                              Prefix KV cache entries (default: 2)
      --prefix-cache-min-tokens <n>
                              Minimum prefix tokens to cache (default: 512)
      --warmup-tokens <n>     Approximate prompt tokens to run once before serving
      --warmup-max-new-tokens <n>
                              Max new tokens for startup warmup

Info:
  -v, --version               Show version
  -h, --help                  Show this help

Precision suffixes: q4, q8, fp16, fp32 (default: q4)

Environment Variables

Every CLI flag has a corresponding environment variable:

Variable Default Description
WANDLER_LLM onnx-community/gemma-4-E4B-it-ONNX:q4 LLM model with precision
WANDLER_BACKEND wandler LLM backend: wandler for Wandler's serving layer, transformersjs for the direct baseline
WANDLER_STT onnx-community/whisper-tiny:q4 Speech-to-text model
WANDLER_EMBEDDING Embedding model (disabled by default)
WANDLER_DEVICE webgpu Device: webgpu, cpu, wasm
WANDLER_PORT 8000 Server port
WANDLER_HOST 127.0.0.1 Bind address
WANDLER_API_KEY API key for auth
WANDLER_CORS_ORIGIN * Allowed CORS origin
WANDLER_MAX_TOKENS 2048 Max tokens per request
WANDLER_MAX_CONCURRENT 1 Max concurrent requests
WANDLER_TIMEOUT 120000 Request timeout (ms)
WANDLER_LOG_LEVEL info Log level
WANDLER_QUIET false Suppress non-error startup/profile logs
WANDLER_CACHE_DIR ~/.cache/huggingface Model cache directory (also respects HF_HOME)
WANDLER_PREFILL_CHUNK_SIZE auto Chunk size for long-prompt prefill; auto uses the fastest GPU path that fits a 640MB attention budget, auto:<mb> customizes it; set 0/off to disable
WANDLER_DECODE_LOOP auto Wandler-owned decode loop for supported text generation; auto uses the safe transformers.js generate() path, on opts into the experimental Wandler loop, off disables it
WANDLER_PREFIX_CACHE true Enable in-memory prefix KV caching for repeated system/tool prefixes
WANDLER_PREFIX_CACHE_ENTRIES 2 Prefix KV cache entry count
WANDLER_PREFIX_CACHE_MIN_TOKENS 512 Minimum prefix size before caching
WANDLER_WARMUP_TOKENS 0 Approximate prompt tokens to run once before serving
WANDLER_WARMUP_MAX_NEW_TOKENS 8 Max new tokens for startup warmup
HF_TOKEN HuggingFace token for gated models

Endpoints

LLM

Endpoint Method Description
/v1/chat/completions POST Chat completion (streaming + non-streaming)
/v1/completions POST Text completion (legacy)
/v1/models GET List loaded models
/v1/models/{id} GET Get model details

Embeddings

Endpoint Method Description
/v1/embeddings POST Text embeddings

Audio

Endpoint Method Description
/v1/audio/transcriptions POST Speech-to-text (Whisper)

Utilities

Endpoint Method Description
/tokenize POST Text to token IDs
/detokenize POST Token IDs to text
/health GET Server status
/admin/metrics GET Request metrics

Parameters

Chat & Text Completions

Standard OpenAI parameters:

Parameter Type Default Description
messages / prompt array / string required Input
temperature float 0.7 Sampling temperature (0 = greedy)
top_p float 0.95 Nucleus sampling
max_tokens int 2048 Max tokens to generate
stream bool false Enable SSE streaming
stop string | string[] Stop sequences
presence_penalty float 0 Penalize token presence
frequency_penalty float 0 Penalize token frequency
response_format object {"type": "json_object"} for JSON mode
tools array Function calling definitions
stream_options object {"include_usage": true}

Extended parameters (vLLM/llama.cpp compatible):

Parameter Type Default Description
top_k int Top-k sampling
min_p float Minimum probability threshold
typical_p float Locally typical sampling
repetition_penalty float Direct repetition penalty (> 1.0)
no_repeat_ngram_size int Prevent N-gram repetition

Embeddings

Parameter Type Default Description
input string | string[] required Text to embed
encoding_format string "float" "float" or "base64"

Compatible Models

List all verified models with their capabilities:

wandler model ls
type      | size  | prec | capabilities             | repo:precision                                   | name
------------------------------------------------------------------------------------------------------------------------
llm       | 2B    | q4   | chat, tool-calling       | onnx-community/gemma-4-E4B-it-ONNX:q4            | Gemma 4 E4B
llm       | 1.2B  | q4   | chat, tool-calling       | LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q4            | LFM 2.5 1.2B
llm       | 350M  | q4   | chat, tool-calling       | LiquidAI/LFM2.5-350M-ONNX:q4                     | LFM 2.5 350M
llm       | 0.8B  | q4   | chat, tool-calling       | onnx-community/Qwen3.5-0.8B-Text-ONNX:q4         | Qwen 3.5 0.8B
llm       | 1.7B  | q4   | chat                     | HuggingFaceTB/SmolLM2-1.7B-Instruct:q4           | SmolLM2 1.7B
embedding | 22M   | q8   | embedding                | Xenova/all-MiniLM-L6-v2:q8                       | all-MiniLM-L6-v2
embedding | 33M   | q8   | embedding                | Xenova/bge-small-en-v1.5:q8                      | BGE Small EN v1.5
embedding | 137M  | q8   | embedding                | nomic-ai/nomic-embed-text-v1.5:q8                | Nomic Embed Text v1.5
stt       | 39M   | q4   | transcription            | onnx-community/whisper-tiny:q4                   | Whisper Tiny
stt       | 74M   | q4   | transcription            | onnx-community/whisper-base:q4                   | Whisper Base
stt       | 244M  | q4   | transcription            | onnx-community/whisper-small:q4                  | Whisper Small

Filter by type:

wandler model ls --type llm
wandler model ls --type embedding
wandler model ls --type stt

Use the repo:precision value directly with --llm, --embedding, or --stt.

Any ONNX model from onnx-community or transformers.js compatible models should work beyond the verified catalog.

Tool Calling

wandler parses tool calls from multiple model output formats:

  • LFM: [func_name(arg="val")] and [tool_calls [{...}]]
  • Qwen: <tool_call>{"name": "...", "arguments": {...}}</tool_call>
  • OpenAI JSON: {"tool_calls": [...]}

Thinking blocks (<think>...</think>) are automatically stripped before parsing.

License

MIT