feat: remote model server (qmd serve) for shared inference across clients#509

Closed
jaylfc wants to merge 403 commits into tobi:main from jaylfc:feat/remote-llm-provider

Conversation


@jaylfc jaylfc commented Apr 5, 2026

Summary

Adds qmd serve — a lightweight HTTP server that exposes embedding, reranking, and query expansion via a JSON API. Designed for multi-client setups where multiple QMD instances (e.g. agents in LXC containers) share loaded models over the network.

Supports two backends:

  • local (default) — loads GGUF models via node-llama-cpp (CPU/Vulkan)
  • rkllama — proxies to an rkllama NPU server (RK3588/RK3576)

Problem

QMD requires node-llama-cpp for all embedding/reranking operations, which:

  • Fails on ARM64 (no Vulkan SDK, CMake compilation fails)
  • Can't share loaded models across multiple agents
  • Loads models per-process (wasteful on memory-constrained devices)

This is a superset of the functionality requested in #489 and attempted in #490, #480, and #116.

Solution

Server (qmd serve)

qmd serve --port 7832 --bind 0.0.0.0
qmd serve --backend rkllama --rkllama-url http://localhost:8080

Endpoints:

Endpoint      Method  Description
/embed        POST    Embed a single text
/embed-batch  POST    Batch embed multiple texts
/rerank       POST    Rerank documents by relevance
/expand       POST    Expand a query (lex/vec/hyde)
/tokenize     POST    Count tokens in text
/health       GET     Server status + loaded models
/status       GET     Index health (doc counts, embedding status)
/collections  GET     List collections with doc counts
/search?q=X   GET     FTS5 keyword search
/browse       GET     Paginated chunk listing

Client (RemoteLLM)

Drop-in replacement for LlamaCpp that forwards all calls to a remote qmd serve instance:

export QMD_SERVER=http://192.168.6.123:7832
qmd embed          # uses remote server, no local model loading
qmd query "search" # full hybrid search via remote

Auto-detected: if QMD_SERVER is set, skips local LLM initialization entirely. Zero compilation, instant startup.
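The auto-detection described above could be sketched like this — a minimal TypeScript sketch where the function and type names are illustrative assumptions, not qmd's actual source:

```typescript
// Hypothetical sketch of QMD_SERVER auto-detection: when the env var is set,
// select the remote backend and never touch node-llama-cpp (so no
// Vulkan/CMake build step runs).
interface LLMBackendChoice {
  kind: "remote" | "local";
  baseUrl?: string;
}

function resolveLLMBackend(env: Record<string, string | undefined>): LLMBackendChoice {
  const server = env.QMD_SERVER?.trim();
  if (server) {
    // Normalize trailing slashes so endpoint paths concatenate cleanly.
    return { kind: "remote", baseUrl: server.replace(/\/+$/, "") };
  }
  return { kind: "local" };
}
```

With `QMD_SERVER=http://192.168.6.123:7832` this selects the remote backend; unset, it falls back to local model loading.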

Testing

Tested extensively on Orange Pi 5 Plus (RK3588, 16GB) running 3 OpenClaw agents in LXC containers sharing one qmd serve instance:

  • Embed: 0.3s per chunk via NPU (was 3-5s on CPU)
  • Rerank: 1.8-2.2s via NPU (was 10-15s on CPU)
  • Batch embed: All chunks in single HTTP call, reduces overhead
  • 100% embedding completion on 900KB+ transcripts with retry logic
  • Rankings verified identical to standard QMD on x86+RTX 3060

Key fixes included

  • Batch embedding — sends all chunks in one rkllama call (reduces HTTP overhead)
  • Error rate threshold — 99% with 4x minimum sample for large docs
  • Increased timeouts — 5 min for batch operations on ARM CPU
  • KV cache workaround — documents the rkllama KV cache issue that causes embed degradation

Backwards compatible

  • No changes to existing CLI behavior when QMD_SERVER is not set
  • qmd serve is a new command, doesn't affect existing commands
  • RemoteLLM implements the same LLM interface as LlamaCpp

Related: #489, #490, #480, #116

🤖 Generated with Claude Code

jaylfc and others added 30 commits January 29, 2026 18:27
Adds a session layer that prevents LLM contexts from being disposed
mid-operation during long-running tasks like batch embedding or
multi-step search workflows (expand → embed → rerank).

Key changes:
- Add LLMSessionManager with reference counting for active sessions
- Add LLMSession class for scoped access with automatic acquire/release
- Add withLLMSession() API for multi-step workflows
- Update idle timer to check canUnloadLLM() before disposing
- Wrap querySearch, vectorSearch, and embed command in sessions
- Add optional session parameter to searchVec and getEmbedding

Co-Authored-By: Claude Opus 4.5 <[email protected]>
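The reference-counted session layer described in this commit can be sketched roughly as follows — class and method names are assumptions modeled on the commit message, not qmd's actual API:

```typescript
// Sketch: sessions pin the LLM context; the idle timer asks canUnloadLLM()
// before disposing, so contexts survive long multi-step workflows.
class LLMSessionManager {
  private active = 0;

  // Returns a release function; releasing twice is a safe no-op.
  acquire(): () => void {
    this.active++;
    let released = false;
    return () => {
      if (!released) { released = true; this.active--; }
    };
  }

  // The idle timer checks this before disposing the LLM context.
  canUnloadLLM(): boolean {
    return this.active === 0;
  }

  // Scoped access for multi-step workflows (expand → embed → rerank).
  async withSession<T>(fn: () => Promise<T>): Promise<T> {
    const release = this.acquire();
    try { return await fn(); } finally { release(); }
  }
}
```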
- generate_only_variants.py: Creates training data where queries end with
  'only: lex', 'only: vec', or 'only: hyde' and output contains ONLY that type
- reward.py: Updated scorer to handle 'only:' mode separately
  - Penalizes presence of unwanted types
  - Type-specific quality checks
  - Filters templated low-quality hyde outputs
- 4,444 high-quality 'only:' variants from v2 + handcrafted data
Move the hyde (hypothetical document) line to the beginning of the
output format, before lex and vec lines. This better reflects the
logical flow where the hypothetical document is generated first and
then informs the keyword/semantic expansions.

Also adds auto-download of eval_common.py in training scripts for
standalone HuggingFace Jobs execution.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Brings in:
- /only: variants for single-type expansions
- LLM session management for lifecycle safety
- skills.sh integration for AI agent discovery
- Various bug fixes for vector search and embeddings

Merge conflicts resolved by keeping hyde-first format ordering
from finetune branch while accepting expanded templates and
new features from main.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add finetune/CLAUDE.md documenting the training pipeline
- Update configs to output to local outputs/ directory (gitignored)
- Document that all data/*.jsonl files are training data
- Document local CUDA training vs HuggingFace Jobs cloud training
- Enforce eval requirement before any model upload
- Single model repo (no -v1, -v2, -v4 versioning)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove versioned files (sft_v4.yaml, prepare_v4_dataset.py, train_v2/)
- Update configs to use local data/train/ directory
- Add glob pattern support to prepare_data.py and train.py
- Update .gitignore to properly ignore outputs/ and data/train*/
- Document data preparation step in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- List all HuggingFace repos in CLAUDE.md (model, gguf, sft, grpo, train)
- Update jobs scripts to use tobil/qmd-query-expansion-train (no -v2)
- Clarify rules: no versioned repos, update in place

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Changed temperature from 0/0.1 to 0.7 (Qwen3 non-thinking mode default)
- Added topK=20, topP=0.8 per Qwen3 docs
- Added repeatPenalty with presencePenalty=0.5 for query expansion
- Fixes infinite loop on acronyms like DHH, BFCM

Qwen3 docs explicitly warn: 'DO NOT use greedy decoding, as it can
lead to performance degradation and endless repetitions'
* Fix: Add missing --index option to argument parser

The --index flag was documented and used in code but not defined
in parseArgs options, causing it to be ignored. Now properly handles
custom index names like: qmd --index test status

* Feature: Use index name for config files too

Now --index <name> loads ~/.config/qmd/<name>.yml instead of index.yml.
This allows completely separate indexes with their own collections.

Example:
  qmd --index hackage status
  → Uses ~/.config/qmd/hackage.yml + ~/.cache/qmd/hackage.sqlite

Moved hackage collection to hackage.yml for separation.
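The name-to-path mapping from the example above amounts to a small pure function — a sketch following the paths shown in the commit message (no helper with this name exists in qmd):

```typescript
// --index <name> maps to a per-index config file and SQLite database.
function indexPaths(name: string, home: string): { config: string; db: string } {
  return {
    config: `${home}/.config/qmd/${name}.yml`,
    db: `${home}/.cache/qmd/${name}.sqlite`,
  };
}
```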
Replace Bun.file() async calls with Node.js fs sync methods to work
around a Bun bug that corrupts UTF-8 file paths containing non-ASCII
characters.

Bug: Bun.file(filepath).stat() and Bun.file(filepath).text() internally
mangle UTF-8 encoding, causing ENOENT errors with mojibake paths when
accessing files in iCloud Drive and other locations.

Changes:
- src/qmd.ts: Use readFileSync instead of Bun.file().text()
- src/qmd.ts: Use statSync instead of Bun.file().stat() for file metadata
- src/store.ts: Use statSync for SQLite custom path detection
…i#76)

BM25 scores in SQLite FTS5 are negative (lower = better match).
The previous code used Math.max(0, score) which clamped all negative
scores to 0, resulting in all results showing 100% (score = 1.0).

Fix: Use Math.abs(score) to properly convert negative BM25 scores
to positive values for the normalization formula.

Before: All results show Score: 100%
After:  Scores vary based on actual BM25 relevance (e.g., 16%, 5%, 6%)

Fixes tobi#74
- Add marketplace.json for Claude Code plugin installation
- Simplify skill status check to inline `qmd status` (portable across agents)
- Update SKILL.md MCP section, reference mcp-setup.md for manual config
- Clean up mcp-setup.md (remove redundant prerequisites)
- Rename MCP-SETUP.md to mcp-setup.md

Co-authored-by: Claude Opus 4.5 <[email protected]>
* feat: MCP HTTP transport with daemon lifecycle

  Add streaming HTTP transport as an alternative to stdio for the MCP
  server. A long-lived HTTP server avoids reloading 3 GGUF models (~2GB)
  on every client connection, reducing warm query latency from ~16s (CLI)
  to ~10s.

  New CLI surface:
    qmd mcp --http [--port N]   # foreground, default port 3000
    qmd mcp --http --daemon     # background, PID in ~/.cache/qmd/mcp.pid
    qmd mcp stop                # stop daemon via PID file
    qmd status                  # now shows MCP daemon liveness

  Server implementation (mcp.ts):
  - Extract createMcpServer(store) shared by stdio and HTTP transports
  - HTTP transport uses WebStandardStreamableHTTPServerTransport with
    JSON responses (stateless, no SSE)
  - /health endpoint with uptime, /mcp for MCP protocol, 404 otherwise
  - Request logging to stderr with timestamps, tool names, query args

  Daemon lifecycle (qmd.ts):
  - PID file + log file management with stale PID detection
  - Absolute paths in Bun.spawn (process.execPath + import.meta.path)
    so daemon works regardless of cwd
  - mkdirSync for cache dir on fresh installs
  - Removes top-level SIGTERM/SIGINT handlers before starting HTTP
    server so async cleanup in mcp.ts actually runs

  Move hybridQuery() and vectorSearchQuery() into store.ts as standalone
  functions that take a Store as first argument. Both CLI and MCP now
  call the identical pipeline, eliminating the class of bugs where one
  copy drifts from the other.

  Shared pipeline (store.ts):
  - hybridQuery(): BM25 probe → expand → FTS+vec search → RRF →
    chunk → rerank (chunks only) → position-aware blending → dedup
  - vectorSearchQuery(): expand → vec search → dedup → sort
  - SearchHooks interface for optional progress callbacks
  - Constants: STRONG_SIGNAL_MIN_SCORE, STRONG_SIGNAL_MIN_GAP,
    RERANK_CANDIDATE_LIMIT (40), addLineNumbers()

  Bugs fixed by unification:
  - MCP now gets strong-signal short-circuit (was CLI-only)
  - Reranker candidate limit unified at 40 (MCP had 30)
  - File dedup added to hybrid query (MCP was missing it)
  - Collection filter pushed into searchVec DB query
  - Filter-then-slice ordering fixed (MCP was slice-then-filter)
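The RRF step in the shared pipeline above fuses the per-backend rankings. A generic reciprocal rank fusion sketch (standard formula, score = Σ 1/(k + rank); not qmd's exact implementation or constants):

```typescript
// Fuse several ranked result lists: items ranked highly by multiple
// backends accumulate the largest scores.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // rank is 1-based: first result contributes 1/(k+1).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return scores;
}
```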

* feat: type-routed query expansion — lex→FTS, vec/hyde→vector

  expandQuery() now returns typed ExpandedQuery[] instead of string[],
  preserving the lex/vec/hyde type info from the LLM's GBNF-structured
  output. hybridQuery() and vectorSearchQuery() route searches by type:
  lex queries go to FTS only, vec/hyde go to vector only.

  Previously, every expanded query ran through BOTH backends — keyword
  variants wasted embedding forward passes, semantic paraphrases wasted
  BM25 lookups. Type routing eliminates ~4 calls/query with zero quality
  loss (cross-backend noise actually hurt RRF fusion).

  Cache format changed from newline-separated text to JSON (preserves
  types). Old cache entries gracefully re-expand on first access.

  CLI expansion tree now shows query types:
    ├─ original query
    ├─ lex: keyword variant
    ├─ vec: semantic meaning
    └─ hyde: hypothetical document...

  Benchmark (5 queries, 1756-doc index, warm LLM, Apple Silicon):

    Metric              Old (untyped)  New (typed)  Delta
    Avg backend calls   10.0           6.0          -40%
    Total wall time     1278ms         549ms        -57%
    Avg saved/query     —              —            146ms

    "authentication setup"          12 → 7 calls   511 → 112ms
    "database migration strategy"   10 → 6 calls   182 → 106ms
    "how to handle errors in API"   10 → 6 calls   216 → 121ms
    "meeting notes from last week"  10 → 6 calls   228 → 110ms
    "performance optimization"       8 → 5 calls   141 → 100ms

  Savings come from skipped embed() calls (~30-80ms each). FTS is
  synchronous SQLite (~0ms), so lex→FTS routing is free while
  vec/hyde→vector-only avoids wasted embedding passes.
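The type routing described above is essentially a partition over the typed expansions — a sketch where the type shape is an assumption based on the commit text:

```typescript
// Route typed expansions: lex → FTS only, vec/hyde → vector only.
type ExpandedQuery = { type: "lex" | "vec" | "hyde"; text: string };

function routeQueries(queries: ExpandedQuery[]): { fts: string[]; vector: string[] } {
  const fts: string[] = [];
  const vector: string[] = [];
  for (const q of queries) {
    (q.type === "lex" ? fts : vector).push(q.text);
  }
  return { fts, vector };
}
```

Only the `vector` bucket needs embedding passes, which is where the ~40% call reduction comes from.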

* fix: MCP query snippets now use reranker's best chunk, not full body

  extractSnippet() was scanning the entire document body for keyword
  matches to build the snippet. But hybridQuery() already identified
  the most relevant chunk via cross-attention reranking — rescanning
  the full body is redundant and can land on a less relevant section
  if the query terms appear elsewhere in the document.

  CLI was already using bestChunk (set during the refactor). MCP was
  still using body — a pre-existing inconsistency, not a regression.

* feat: dynamic MCP instructions + tool annotations

  The MCP server now generates instructions at startup from actual index
  state and injects them into the initialize response. LLMs see collection
  names, document counts, content descriptions, and search strategy
  guidance in their system prompt — zero tool calls needed for orientation.

  Previously, the only guidance was generic static tool descriptions and
  a user-invocable "query" prompt that no LLM would discover on its own.
  An LLM connecting to QMD had no idea what collections existed, what they
  contained, or how to scope searches effectively.

* change default port to 8181

* fix: BM25 score normalization was inverted

  The normalization formula `1 / (1 + |bm25|)` is a decreasing function of
  match strength. FTS5 BM25 scores are negative where more negative = better
  match (e.g., -10 is strong, -0.5 is weak). The formula mapped:

    strong match (raw -10) → 1/(1+10) =  9%   ← should be highest
    weak match   (raw -0.5) → 1/(1+0.5) = 67%  ← should be lowest

  Three downstream effects:
  1. `--min-score 0.5` (or MCP minScore: 0.5) filtered OUT strong matches
     and kept only weak ones. The MCP instructions recommend this threshold.
  2. CLI `formatScore()` color bands never showed green for BM25 results
     (best matches scored ~9%, green threshold is 70%).
  3. The strong signal optimization in hybridQuery (skip ~2s LLM expansion
     when BM25 already has a clear winner) was dead code — strong matches
     scored ~0.09, never reaching the 0.85 threshold.

  Fix: `|x| / (1 + |x|)` — same (0,1) range, monotonic, no per-query
  normalization needed, but now correctly maps strong → high, weak → low.

  The normalization was born broken (Math.max(0, x) clamped all
  negative BM25 to 0 → every score = 1.0), then PR tobi#76 changed to
  Math.abs which made scores vary but inverted the direction. Neither
  state was ever correct.
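The two normalizations compared in this commit, side by side (FTS5 BM25 scores are negative; more negative = stronger match):

```typescript
// Old formula: decreasing in |bm25| — inverted direction.
const oldNorm = (bm25: number) => 1 / (1 + Math.abs(bm25));
// Fixed formula: |x| / (1 + |x|) — same (0,1) range, monotonic increasing.
const newNorm = (bm25: number) => Math.abs(bm25) / (1 + Math.abs(bm25));

// strong match (-10):  oldNorm ≈ 0.09, newNorm ≈ 0.91
// weak match  (-0.5):  oldNorm ≈ 0.67, newNorm ≈ 0.33
```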

* fix: rerank cache key ignores chunk content

  The rerank cache key was (query, file, model) but the actual text sent
  to the reranker is a keyword-selected chunk that varies by query terms.
  Two different queries hitting the same file can select different chunks,
  but the second query gets a stale cached score from the first chunk.

  Example:
    Query "auth flow" → selects chunk about authentication → score 0.92
    Query "auth tokens" → same file, selects chunk about tokens
      → cache HIT on (query, file, model) → returns 0.92 from wrong chunk

  Fix: include full chunk text in cache key. getCacheKey() already
  SHA-256 hashes its inputs, so this adds no key bloat — just
  disambiguation. Old cache entries become natural misses (different key
  shape) and re-warm on next query.
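A sketch of a chunk-aware cache key as described — hashing keeps the key fixed-size regardless of chunk length (the real getCacheKey signature in qmd may differ):

```typescript
import { createHash } from "node:crypto";

// Include the chunk text so two queries selecting different chunks from
// the same file never collide on one cache entry.
function rerankCacheKey(query: string, file: string, model: string, chunk: string): string {
  return createHash("sha256")
    .update([query, file, model, chunk].join("\u0000"))
    .digest("hex");
}
```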

* rename MCP tools for clarity, rewrite descriptions for LLM tool selection

  Rename MCP tools: vsearch → vector_search, query → deep_search.
  LLMs see these names — self-documenting names reduce reliance on
  descriptions for tool selection. CLI commands stay unchanged
  (qmd vsearch, qmd query) — different namespace, users type those.

  Rewrite all search tool descriptions to be action-oriented:
    - search: "Search by keyword. Finds documents containing exact
      words and phrases in the query."
    - vector_search: "Search by meaning. Finds relevant documents even
      when they use different words than the query — handles synonyms,
      paraphrases, and related concepts."
    - deep_search: "Deep search. Auto-expands the query into variations,
      searches each by keyword and meaning, and reranks for top hits
      across all results."

  Rewrite instructions ladder — each tool says what it does, no
  "start here" / "escalate as needed" strategy language.

  Delete the "query" prompt (registerPrompt) — it restated what
  descriptions + instructions already cover. No LLM proactively
  calls prompts/get to learn how to use tools.

* suppress HTTP server logs during tests
searchResultsToMarkdown and searchResultsToXml in formatter.ts were
silently dropping the context field. Added formatter.test.ts covering
context visibility across all output formats.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
List query first in --help as the recommended search method. Add
vector-search and deep-search as undocumented CLI aliases matching
MCP tool names.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Generated with PaperBanana (Gemini 3 Pro). Shows query expansion
fanning HyDE+Vec into vector searches, Lex into BM25, merged via
reciprocal rank fusion and LLM reranking.
Three improvements to hybridQuery:

1. Collection filter pushed into SQL: searchFTS and searchVec now
   accept collectionName directly instead of filtering post-hoc.
   Reduces noise in FTS probe and all expanded-query FTS calls.
   Also fixes MCP server's FTS search to use SQL-level filtering.

2. Batch embed for vector searches: instead of embedding each
   vec/hyde query sequentially (one embed call per query), we now
   collect all texts that need vector search and embed them in a
   single embedBatch() call. The sqlite-vec lookups still run
   sequentially (they're fast), but the expensive LLM embed step
   is batched.

3. FTS-first ordering: all lex expansions run immediately (sync,
   no LLM needed) before the vector embedding batch. This means
   FTS results are ready while embeddings compute.

Also cleans up legacy collectionId parameter naming (was number,
now properly string collectionName throughout).
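The batch-embed pattern above — collect every text that needs a vector, then make a single embedBatch() call — can be sketched as follows; the LLM interface here is an assumption modeled on the PR description:

```typescript
// One expensive LLM round-trip instead of one per query; the cheap
// sqlite-vec lookups can then run sequentially over the returned vectors.
interface EmbeddingLLM {
  embedBatch(texts: string[]): Promise<number[][]>;
}

async function embedAllOnce(
  llm: EmbeddingLLM,
  texts: string[],
): Promise<Map<string, number[]>> {
  const vectors = await llm.embedBatch(texts); // single call, not texts.length calls
  return new Map(texts.map((t, i) => [t, vectors[i]]));
}
```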
jaylfc and others added 20 commits March 28, 2026 19:56
Fix BM25 field weights to include all 3 FTS columns
Resolve conflict: use CTE approach from tobi#455 with updated BM25
weights (1.5, 4.0, 1.0) from tobi#462.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…g-vec0-replace

fix(embed): handle vec0 OR REPLACE limitation in insertEmbedding
fix: increase RERANK_CONTEXT_SIZE default 2048→4096, configurable via env var, fix template overhead underestimate
fix: prevent qmd embed from running indefinitely
Fix hyphenated tokens in FTS5 lex queries
Resolve conflicts: combine AST chunking args (filepath, chunkStrategy)
with abort signal parameter from tobi#458.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add `qmd serve` command that runs a lightweight HTTP server exposing
embedding, reranking, and query expansion endpoints. Multiple QMD clients
can share a single set of loaded models over the network instead of each
loading their own into RAM.

Changes:
- New `src/serve.ts`: HTTP server wrapping LlamaCpp (embed/rerank/expand/tokenize)
- New `src/llm-remote.ts`: RemoteLLM class implementing LLM interface via HTTP
- Updated LLM interface: added embedBatch, tokenize, intent option
- Updated store.ts: use LLM interface instead of concrete LlamaCpp type
- CLI: added `serve` command, `--server` flag, and QMD_SERVER env var
- README: documented remote model server usage and multi-agent setup

Addresses: tobi#489 tobi#490 tobi#502 tobi#480

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
startServer() now returns a Promise that stays pending until SIGINT/SIGTERM,
preventing the CLI from falling through to process.exit(0) immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Batch embedding of large document sets via the remote server can take
significantly longer than 30s on ARM CPU, especially under concurrent load.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
CPU-only embedding on ARM SBCs (RK3588 etc) can take over 2 minutes per
large chunk. 120s was still causing failures. 300s gives generous headroom
for batch operations without GPU acceleration.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add --backend rkllama flag to qmd serve that proxies embed/rerank/expand
requests to an rkllama NPU server instead of loading models locally via
node-llama-cpp. Supports all three model types on the RK3588 NPU.

Benchmarks: embedding ~1.25s, reranking ~2.2s, query expansion ~3.4s
(3-18x faster than CPU on ARM).

New CLI flags:
  --backend rkllama              Use NPU backend
  --rkllama-url http://host:8080 Custom rkllama URL
  QMD_SERVE_BACKEND=rkllama      Env var alternative
  RKLLAMA_URL=http://host:8080   Env var alternative

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace the text-generation workaround in RKLlamaBackend.rerank() with
a direct call to rkllama's native /api/rerank endpoint. This uses
logit-based cross-encoder scoring (softmax over yes/no tokens) instead
of parsing generated text, producing accurate relevance scores.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
When QMD_SERVER is set, getDefaultLLM() now auto-creates a RemoteLLM
without falling through to getDefaultLlamaCpp() which triggers
node-llama-cpp Vulkan builds and crashes in containers without GPU.

Also skip device info display in status when using remote server.

This fixes qmd embed failing silently in LXC containers.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Session maxDuration: 10 min → 60 min (large batch embeds on ARM NPU)
- Error rate abort threshold: 80% → 95% with doubled minimum sample
- Remote request timeout already at 300s (set earlier)

SBC/NPU backends have higher per-chunk latency and occasional failures.
The old 10-minute session limit caused "session expired" on workspaces
with 100+ documents. The 80% error threshold was too aggressive for
NPU backends where intermittent failures are normal during model
hot-swapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Large documents (900KB+ chat transcripts) generate 250+ chunks.
Intermittent NPU failures during batch embedding caused the 95%
threshold to trigger premature abort. With per-chunk 3x retry already
in place, only abort if the server is truly down (99% failure).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
RKLlamaBackend.embedBatch() now sends all texts in one /api/embed
call instead of individual HTTP requests per chunk.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
New GET endpoints for TinyAgentOS integration:
  /status      - index health (doc counts, embedding status)
  /collections - list collections with doc counts
  /search?q=X  - FTS5 keyword search with optional collection filter
  /browse      - paginated chunk listing, most recent first

These enable remote memory browsing without direct SQLite access,
supporting the TinyAgentOS architecture where agent data stays
in the agent's LXC container.
@paralizeer

Great PR @jaylfc — we've been running this on our ARM64 VPS (Oracle Cloud Ampere) since you posted in #489 and it's solid. The RemoteLLM drop-in is clean, and qmd serve solves a real problem for multi-agent deployments.

We did a thorough review and found a few things worth addressing:

Security: default bind should be 127.0.0.1 (not 0.0.0.0)

Both serve.ts (line 345) and qmd.ts (line 3096) default to binding on all interfaces. This exposes the model server to the entire network out of the box. Since most users will run this on localhost or private networks, defaulting to 127.0.0.1 is safer — users who need network access can explicitly pass --bind 0.0.0.0.

- const bind = options.bind ?? "0.0.0.0";
+ const bind = options.bind ?? "127.0.0.1";

Request body size limit

readBody() accumulates unlimited data, which could OOM the server. A simple 50MB cap:

import type { IncomingMessage } from "node:http";

const MAX_BODY_BYTES = 50 * 1024 * 1024;

function readBody(req: IncomingMessage): Promise<string> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    let totalBytes = 0;
    req.on("data", (chunk: Buffer) => {
      totalBytes += chunk.length;
      if (totalBytes > MAX_BODY_BYTES) {
        req.destroy();
        reject(new Error(`Request body exceeds ${MAX_BODY_BYTES / 1024 / 1024}MB limit`));
        return;
      }
      chunks.push(chunk);
    });
    req.on("end", () => resolve(Buffer.concat(chunks).toString("utf-8")));
    req.on("error", reject);
  });
}

Input type validation

The POST endpoints check for existence (!text) but not types. This means {"text": 42} passes validation but fails with a confusing stack trace. Using typeof text !== "string" catches bad input cleanly:

- if (!text) {
-   json(res, 400, { error: "text is required" });
+ if (typeof text !== "string" || text.length === 0) {
+   json(res, 400, { error: "text must be a non-empty string" });

Same pattern for /embed-batch (validate array elements are strings), /rerank (validate non-empty documents), and /expand.
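A minimal sketch of what that strict validation could look like as a reusable helper — field names and error messages follow the examples above, not qmd's actual code:

```typescript
// Validate a POST /embed body: reject non-string or empty `text` with a
// clear 400 message instead of a confusing downstream stack trace.
type Validation = { ok: true; text: string } | { ok: false; error: string };

function validateEmbedBody(body: unknown): Validation {
  const text = (body as { text?: unknown } | null)?.text;
  if (typeof text !== "string" || text.length === 0) {
    return { ok: false, error: "text must be a non-empty string" };
  }
  return { ok: true, text };
}
```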

README cleanup for upstream

The PR still has fork-specific framing ("Fork note", OpenClaw references in the README). For upstream, the "Remote Model Server" section should be presented as a first-class feature, not a fork addition. The OpenClaw-specific systemd example could be generalized.


Everything else looks good: parameterized SQL in /browse ✅, retry logic with backoff for SBC/NPU ✅, LLM interface abstraction ✅, graceful shutdown ✅, setDefaultLLM global swap ✅.

We've already applied the security fixes locally and confirmed they work. Happy to test any further changes you push.

- Default bind to 127.0.0.1 instead of 0.0.0.0 (users opt-in to network exposure)
- Add 50MB request body size limit to prevent OOM
- Strict input type validation on all POST endpoints (typeof string checks)
- Clean up README: remove fork-specific framing, generalize for upstream
- Document rkllama NPU backend option in examples

Addresses review feedback from @paralizeer on tobi#509
jaylfc pushed a commit to jaylfc/qmd that referenced this pull request Apr 5, 2026
- Default bind to 127.0.0.1 instead of 0.0.0.0 (users opt-in to network exposure)
- Add 50MB request body size limit to prevent OOM
- Strict input type validation on all POST endpoints (typeof string checks)
- Clean up README: remove fork-specific framing, generalize for upstream
- Document rkllama NPU backend option in examples

Addresses review feedback from @paralizeer on tobi#509

jaylfc commented Apr 5, 2026

Thanks for the thorough review @paralizeer — all four points addressed in 69df17d:

  1. Default bind → 127.0.0.1 ✅ Both serve.ts and the CLI default changed. Users explicitly opt-in to network exposure with --bind 0.0.0.0.

  2. Request body size limit ✅ Added 50MB cap to readBody() — destroys the request if exceeded instead of accumulating to OOM.

  3. Input type validation ✅ All POST endpoints now use typeof text !== "string" || text.length === 0 checks. /embed-batch validates array elements are strings. /rerank validates non-empty documents array. /tokenize and /expand also tightened.

  4. README cleaned up ✅ Removed fork-specific framing, generalized the OpenClaw systemd example to a generic agent/container integration section, added rkllama NPU backend to the examples.

Appreciate you testing this on your ARM64 setup — good to know it's solid on Oracle Cloud Ampere too.

@jaylfc jaylfc closed this Apr 5, 2026
@jaylfc jaylfc force-pushed the feat/remote-llm-provider branch from 69df17d to 80eb824 on April 5, 2026 18:52
jaylfc added a commit to jaylfc/qmd that referenced this pull request Apr 5, 2026
- Default bind to 127.0.0.1 instead of 0.0.0.0 (users opt-in to network exposure)
- Add 50MB request body size limit to prevent OOM
- Strict input type validation on all POST endpoints (typeof string checks)
- Clean up README: remove fork-specific framing, generalize for upstream
- Document rkllama NPU backend option in examples

Addresses review feedback from @paralizeer on tobi#509