feat: remote model server (qmd serve) for shared inference across clients#509

Closed
jaylfc wants to merge 403 commits into tobi:main from jaylfc:feat/remote-llm-provider

Conversation


@jaylfc jaylfc commented Apr 5, 2026

Summary

Adds qmd serve — a lightweight HTTP server that exposes embedding, reranking, and query expansion via a JSON API. Designed for multi-client setups where multiple QMD instances (e.g. agents in LXC containers) share loaded models over the network.

Supports two backends:

  • local (default) — loads GGUF models via node-llama-cpp (CPU/Vulkan)
  • rkllama — proxies to an rkllama NPU server (RK3588/RK3576)

Problem

QMD requires node-llama-cpp for all embedding/reranking operations, which:

  • Fails on ARM64 (no Vulkan SDK, CMake compilation fails)
  • Can't share loaded models across multiple agents
  • Loads models per-process (wasteful on memory-constrained devices)

This is a superset of the functionality requested in #489 and attempted in #490, #480, and #116.

Solution

Server (qmd serve)

qmd serve --port 7832 --bind 0.0.0.0
qmd serve --backend rkllama --rkllama-url http://localhost:8080

Endpoints:

Endpoint      Method  Description
/embed        POST    Embed a single text
/embed-batch  POST    Batch embed multiple texts
/rerank       POST    Rerank documents by relevance
/expand       POST    Expand a query (lex/vec/hyde)
/tokenize     POST    Count tokens in text
/health       GET     Server status + loaded models
/status       GET     Index health (doc counts, embedding status)
/collections  GET     List collections with doc counts
/search?q=X   GET     FTS5 keyword search
/browse       GET     Paginated chunk listing

Client (RemoteLLM)

Drop-in replacement for LlamaCpp that forwards all calls to a remote qmd serve instance:

export QMD_SERVER=http://192.168.6.123:7832
qmd embed          # uses remote server, no local model loading
qmd query "search" # full hybrid search via remote

Auto-detected: if QMD_SERVER is set, skips local LLM initialization entirely. Zero compilation, instant startup.
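The auto-detection described above could be sketched like this — a minimal TypeScript sketch where the function and type names are illustrative assumptions, not qmd's actual source:

```typescript
// Hypothetical sketch of QMD_SERVER auto-detection: when the env var is set,
// select the remote backend and never touch node-llama-cpp (so no
// Vulkan/CMake build step runs).
interface LLMBackendChoice {
  kind: "remote" | "local";
  baseUrl?: string;
}

function resolveLLMBackend(env: Record<string, string | undefined>): LLMBackendChoice {
  const server = env.QMD_SERVER?.trim();
  if (server) {
    // Normalize trailing slashes so endpoint paths concatenate cleanly.
    return { kind: "remote", baseUrl: server.replace(/\/+$/, "") };
  }
  return { kind: "local" };
}
```

With `QMD_SERVER=http://192.168.6.123:7832` this selects the remote backend; unset, it falls back to local model loading.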

Testing

Tested extensively on Orange Pi 5 Plus (RK3588, 16GB) running 3 OpenClaw agents in LXC containers sharing one qmd serve instance:

  • Embed: 0.3s per chunk via NPU (was 3-5s on CPU)
  • Rerank: 1.8-2.2s via NPU (was 10-15s on CPU)
  • Batch embed: All chunks in single HTTP call, reduces overhead
  • 100% embedding completion on 900KB+ transcripts with retry logic
  • Rankings verified identical to standard QMD on x86+RTX 3060

Key fixes included

  • Batch embedding — sends all chunks in one rkllama call (reduces HTTP overhead)
  • Error rate threshold — 99% with 4x minimum sample for large docs
  • Increased timeouts — 5 min for batch operations on ARM CPU
  • KV cache workaround — documents the rkllama KV cache issue that causes embed degradation

Backwards compatible

  • No changes to existing CLI behavior when QMD_SERVER is not set
  • qmd serve is a new command, doesn't affect existing commands
  • RemoteLLM implements the same LLM interface as LlamaCpp

Related: #489, #490, #480, #116

🤖 Generated with Claude Code

jaylfc and others added 30 commits January 29, 2026 18:27
Adds a session layer that prevents LLM contexts from being disposed
mid-operation during long-running tasks like batch embedding or
multi-step search workflows (expand → embed → rerank).

Key changes:
- Add LLMSessionManager with reference counting for active sessions
- Add LLMSession class for scoped access with automatic acquire/release
- Add withLLMSession() API for multi-step workflows
- Update idle timer to check canUnloadLLM() before disposing
- Wrap querySearch, vectorSearch, and embed command in sessions
- Add optional session parameter to searchVec and getEmbedding

Co-Authored-By: Claude Opus 4.5 <[email protected]>
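The reference-counted session layer described in this commit can be sketched roughly as follows — class and method names are assumptions modeled on the commit message, not qmd's actual API:

```typescript
// Sketch: sessions pin the LLM context; the idle timer asks canUnloadLLM()
// before disposing, so contexts survive long multi-step workflows.
class LLMSessionManager {
  private active = 0;

  // Returns a release function; releasing twice is a safe no-op.
  acquire(): () => void {
    this.active++;
    let released = false;
    return () => {
      if (!released) { released = true; this.active--; }
    };
  }

  // The idle timer checks this before disposing the LLM context.
  canUnloadLLM(): boolean {
    return this.active === 0;
  }

  // Scoped access for multi-step workflows (expand → embed → rerank).
  async withSession<T>(fn: () => Promise<T>): Promise<T> {
    const release = this.acquire();
    try { return await fn(); } finally { release(); }
  }
}
```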
- generate_only_variants.py: Creates training data where queries end with
  'only: lex', 'only: vec', or 'only: hyde' and output contains ONLY that type
- reward.py: Updated scorer to handle 'only:' mode separately
  - Penalizes presence of unwanted types
  - Type-specific quality checks
  - Filters templated low-quality hyde outputs
- 4,444 high-quality 'only:' variants from v2 + handcrafted data
Move the hyde (hypothetical document) line to the beginning of the
output format, before lex and vec lines. This better reflects the
logical flow where the hypothetical document is generated first and
then informs the keyword/semantic expansions.

Also adds auto-download of eval_common.py in training scripts for
standalone HuggingFace Jobs execution.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Brings in:
- /only: variants for single-type expansions
- LLM session management for lifecycle safety
- skills.sh integration for AI agent discovery
- Various bug fixes for vector search and embeddings

Merge conflicts resolved by keeping hyde-first format ordering
from finetune branch while accepting expanded templates and
new features from main.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add finetune/CLAUDE.md documenting the training pipeline
- Update configs to output to local outputs/ directory (gitignored)
- Document that all data/*.jsonl files are training data
- Document local CUDA training vs HuggingFace Jobs cloud training
- Enforce eval requirement before any model upload
- Single model repo (no -v1, -v2, -v4 versioning)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove versioned files (sft_v4.yaml, prepare_v4_dataset.py, train_v2/)
- Update configs to use local data/train/ directory
- Add glob pattern support to prepare_data.py and train.py
- Update .gitignore to properly ignore outputs/ and data/train*/
- Document data preparation step in CLAUDE.md

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- List all HuggingFace repos in CLAUDE.md (model, gguf, sft, grpo, train)
- Update jobs scripts to use tobil/qmd-query-expansion-train (no -v2)
- Clarify rules: no versioned repos, update in place

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Changed temperature from 0/0.1 to 0.7 (Qwen3 non-thinking mode default)
- Added topK=20, topP=0.8 per Qwen3 docs
- Added repeatPenalty with presencePenalty=0.5 for query expansion
- Fixes infinite loop on acronyms like DHH, BFCM

Qwen3 docs explicitly warn: 'DO NOT use greedy decoding, as it can
lead to performance degradation and endless repetitions'
* Fix: Add missing --index option to argument parser

The --index flag was documented and used in code but not defined
in parseArgs options, causing it to be ignored. Now properly handles
custom index names like: qmd --index test status

* Feature: Use index name for config files too

Now --index <name> loads ~/.config/qmd/<name>.yml instead of index.yml.
This allows completely separate indexes with their own collections.

Example:
  qmd --index hackage status
  → Uses ~/.config/qmd/hackage.yml + ~/.cache/qmd/hackage.sqlite

Moved hackage collection to hackage.yml for separation.
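The name-to-path mapping from the example above amounts to a small pure function — a sketch following the paths shown in the commit message (no helper with this name exists in qmd):

```typescript
// --index <name> maps to a per-index config file and SQLite database.
function indexPaths(name: string, home: string): { config: string; db: string } {
  return {
    config: `${home}/.config/qmd/${name}.yml`,
    db: `${home}/.cache/qmd/${name}.sqlite`,
  };
}
```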
Replace Bun.file() async calls with Node.js fs sync methods to work
around a Bun bug that corrupts UTF-8 file paths containing non-ASCII
characters.

Bug: Bun.file(filepath).stat() and Bun.file(filepath).text() internally
mangle UTF-8 encoding, causing ENOENT errors with mojibake paths when
accessing files in iCloud Drive and other locations.

Changes:
- src/qmd.ts: Use readFileSync instead of Bun.file().text()
- src/qmd.ts: Use statSync instead of Bun.file().stat() for file metadata
- src/store.ts: Use statSync for SQLite custom path detection
…i#76)

BM25 scores in SQLite FTS5 are negative (lower = better match).
The previous code used Math.max(0, score) which clamped all negative
scores to 0, resulting in all results showing 100% (score = 1.0).

Fix: Use Math.abs(score) to properly convert negative BM25 scores
to positive values for the normalization formula.

Before: All results show Score: 100%
After:  Scores vary based on actual BM25 relevance (e.g., 16%, 5%, 6%)

Fixes tobi#74
- Add marketplace.json for Claude Code plugin installation
- Simplify skill status check to inline `qmd status` (portable across agents)
- Update SKILL.md MCP section, reference mcp-setup.md for manual config
- Clean up mcp-setup.md (remove redundant prerequisites)
- Rename MCP-SETUP.md to mcp-setup.md

Co-authored-by: Claude Opus 4.5 <[email protected]>
* feat: MCP HTTP transport with daemon lifecycle

  Add streaming HTTP transport as an alternative to stdio for the MCP
  server. A long-lived HTTP server avoids reloading 3 GGUF models (~2GB)
  on every client connection, reducing warm query latency from ~16s (CLI)
  to ~10s.

  New CLI surface:
    qmd mcp --http [--port N]   # foreground, default port 3000
    qmd mcp --http --daemon     # background, PID in ~/.cache/qmd/mcp.pid
    qmd mcp stop                # stop daemon via PID file
    qmd status                  # now shows MCP daemon liveness

  Server implementation (mcp.ts):
  - Extract createMcpServer(store) shared by stdio and HTTP transports
  - HTTP transport uses WebStandardStreamableHTTPServerTransport with
    JSON responses (stateless, no SSE)
  - /health endpoint with uptime, /mcp for MCP protocol, 404 otherwise
  - Request logging to stderr with timestamps, tool names, query args

  Daemon lifecycle (qmd.ts):
  - PID file + log file management with stale PID detection
  - Absolute paths in Bun.spawn (process.execPath + import.meta.path)
    so daemon works regardless of cwd
  - mkdirSync for cache dir on fresh installs
  - Removes top-level SIGTERM/SIGINT handlers before starting HTTP
    server so async cleanup in mcp.ts actually runs

  Move hybridQuery() and vectorSearchQuery() into store.ts as standalone
  functions that take a Store as first argument. Both CLI and MCP now
  call the identical pipeline, eliminating the class of bugs where one
  copy drifts from the other.

  Shared pipeline (store.ts):
  - hybridQuery(): BM25 probe → expand → FTS+vec search → RRF →
    chunk → rerank (chunks only) → position-aware blending → dedup
  - vectorSearchQuery(): expand → vec search → dedup → sort
  - SearchHooks interface for optional progress callbacks
  - Constants: STRONG_SIGNAL_MIN_SCORE, STRONG_SIGNAL_MIN_GAP,
    RERANK_CANDIDATE_LIMIT (40), addLineNumbers()

  Bugs fixed by unification:
  - MCP now gets strong-signal short-circuit (was CLI-only)
  - Reranker candidate limit unified at 40 (MCP had 30)
  - File dedup added to hybrid query (MCP was missing it)
  - Collection filter pushed into searchVec DB query
  - Filter-then-slice ordering fixed (MCP was slice-then-filter)
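The RRF step in the shared pipeline above fuses the per-backend rankings. A generic reciprocal rank fusion sketch (standard formula, score = Σ 1/(k + rank); not qmd's exact implementation or constants):

```typescript
// Fuse several ranked result lists: items ranked highly by multiple
// backends accumulate the largest scores.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // rank is 1-based: first result contributes 1/(k+1).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return scores;
}
```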

* feat: type-routed query expansion — lex→FTS, vec/hyde→vector

  expandQuery() now returns typed ExpandedQuery[] instead of string[],
  preserving the lex/vec/hyde type info from the LLM's GBNF-structured
  output. hybridQuery() and vectorSearchQuery() route searches by type:
  lex queries go to FTS only, vec/hyde go to vector only.

  Previously, every expanded query ran through BOTH backends — keyword
  variants wasted embedding forward passes, semantic paraphrases wasted
  BM25 lookups. Type routing eliminates ~4 calls/query with zero quality
  loss (cross-backend noise actually hurt RRF fusion).

  Cache format changed from newline-separated text to JSON (preserves
  types). Old cache entries gracefully re-expand on first access.

  CLI expansion tree now shows query types:
    ├─ original query
    ├─ lex: keyword variant
    ├─ vec: semantic meaning
    └─ hyde: hypothetical document...

  Benchmark (5 queries, 1756-doc index, warm LLM, Apple Silicon):

    Metric              Old (untyped)  New (typed)  Delta
    Avg backend calls   10.0           6.0          -40%
    Total wall time     1278ms         549ms        -57%
    Avg saved/query     —              —            146ms

    "authentication setup"          12 → 7 calls   511 → 112ms
    "database migration strategy"   10 → 6 calls   182 → 106ms
    "how to handle errors in API"   10 → 6 calls   216 → 121ms
    "meeting notes from last week"  10 → 6 calls   228 → 110ms
    "performance optimization"       8 → 5 calls   141 → 100ms

  Savings come from skipped embed() calls (~30-80ms each). FTS is
  synchronous SQLite (~0ms), so lex→FTS routing is free while
  vec/hyde→vector-only avoids wasted embedding passes.
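The type routing described above is essentially a partition over the typed expansions — a sketch where the type shape is an assumption based on the commit text:

```typescript
// Route typed expansions: lex → FTS only, vec/hyde → vector only.
type ExpandedQuery = { type: "lex" | "vec" | "hyde"; text: string };

function routeQueries(queries: ExpandedQuery[]): { fts: string[]; vector: string[] } {
  const fts: string[] = [];
  const vector: string[] = [];
  for (const q of queries) {
    (q.type === "lex" ? fts : vector).push(q.text);
  }
  return { fts, vector };
}
```

Only the `vector` bucket needs embedding passes, which is where the ~40% call reduction comes from.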

* fix: MCP query snippets now use reranker's best chunk, not full body

  extractSnippet() was scanning the entire document body for keyword
  matches to build the snippet. But hybridQuery() already identified
  the most relevant chunk via cross-attention reranking — rescanning
  the full body is redundant and can land on a less relevant section
  if the query terms appear elsewhere in the document.

  CLI was already using bestChunk (set during the refactor). MCP was
  still using body — a pre-existing inconsistency, not a regression.

* feat: dynamic MCP instructions + tool annotations

  The MCP server now generates instructions at startup from actual index
  state and injects them into the initialize response. LLMs see collection
  names, document counts, content descriptions, and search strategy
  guidance in their system prompt — zero tool calls needed for orientation.

  Previously, the only guidance was generic static tool descriptions and
  a user-invocable "query" prompt that no LLM would discover on its own.
  An LLM connecting to QMD had no idea what collections existed, what they
  contained, or how to scope searches effectively.

* change default port to 8181

* fix: BM25 score normalization was inverted

  The normalization formula `1 / (1 + |bm25|)` is a decreasing function of
  match strength. FTS5 BM25 scores are negative where more negative = better
  match (e.g., -10 is strong, -0.5 is weak). The formula mapped:

    strong match (raw -10) → 1/(1+10) =  9%   ← should be highest
    weak match   (raw -0.5) → 1/(1+0.5) = 67%  ← should be lowest

  Three downstream effects:
  1. `--min-score 0.5` (or MCP minScore: 0.5) filtered OUT strong matches
     and kept only weak ones. The MCP instructions recommend this threshold.
  2. CLI `formatScore()` color bands never showed green for BM25 results
     (best matches scored ~9%, green threshold is 70%).
  3. The strong signal optimization in hybridQuery (skip ~2s LLM expansion
     when BM25 already has a clear winner) was dead code — strong matches
     scored ~0.09, never reaching the 0.85 threshold.

  Fix: `|x| / (1 + |x|)` — same (0,1) range, monotonic, no per-query
  normalization needed, but now correctly maps strong → high, weak → low.

  The normalization was born broken (Math.max(0, x) clamped all
  negative BM25 to 0 → every score = 1.0), then PR tobi#76 changed to
  Math.abs which made scores vary but inverted the direction. Neither
  state was ever correct.
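The two normalizations compared in this commit, side by side (FTS5 BM25 scores are negative; more negative = stronger match):

```typescript
// Old formula: decreasing in |bm25| — inverted direction.
const oldNorm = (bm25: number) => 1 / (1 + Math.abs(bm25));
// Fixed formula: |x| / (1 + |x|) — same (0,1) range, monotonic increasing.
const newNorm = (bm25: number) => Math.abs(bm25) / (1 + Math.abs(bm25));

// strong match (-10):  oldNorm ≈ 0.09, newNorm ≈ 0.91
// weak match  (-0.5):  oldNorm ≈ 0.67, newNorm ≈ 0.33
```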

* fix: rerank cache key ignores chunk content

  The rerank cache key was (query, file, model) but the actual text sent
  to the reranker is a keyword-selected chunk that varies by query terms.
  Two different queries hitting the same file can select different chunks,
  but the second query gets a stale cached score from the first chunk.

  Example:
    Query "auth flow" → selects chunk about authentication → score 0.92
    Query "auth tokens" → same file, selects chunk about tokens
      → cache HIT on (query, file, model) → returns 0.92 from wrong chunk

  Fix: include full chunk text in cache key. getCacheKey() already
  SHA-256 hashes its inputs, so this adds no key bloat — just
  disambiguation. Old cache entries become natural misses (different key
  shape) and re-warm on next query.
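A sketch of a chunk-aware cache key as described — hashing keeps the key fixed-size regardless of chunk length (the real getCacheKey signature in qmd may differ):

```typescript
import { createHash } from "node:crypto";

// Include the chunk text so two queries selecting different chunks from
// the same file never collide on one cache entry.
function rerankCacheKey(query: string, file: string, model: string, chunk: string): string {
  return createHash("sha256")
    .update([query, file, model, chunk].join("\u0000"))
    .digest("hex");
}
```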

* rename MCP tools for clarity, rewrite descriptions for LLM tool selection

  Rename MCP tools: vsearch → vector_search, query → deep_search.
  LLMs see these names — self-documenting names reduce reliance on
  descriptions for tool selection. CLI commands stay unchanged
  (qmd vsearch, qmd query) — different namespace, users type those.

  Rewrite all search tool descriptions to be action-oriented:
    - search: "Search by keyword. Finds documents containing exact
      words and phrases in the query."
    - vector_search: "Search by meaning. Finds relevant documents even
      when they use different words than the query — handles synonyms,
      paraphrases, and related concepts."
    - deep_search: "Deep search. Auto-expands the query into variations,
      searches each by keyword and meaning, and reranks for top hits
      across all results."

  Rewrite instructions ladder — each tool says what it does, no
  "start here" / "escalate as needed" strategy language.

  Delete the "query" prompt (registerPrompt) — it restated what
  descriptions + instructions already cover. No LLM proactively
  calls prompts/get to learn how to use tools.

* suppress HTTP server logs during tests
searchResultsToMarkdown and searchResultsToXml in formatter.ts were
silently dropping the context field. Added formatter.test.ts covering
context visibility across all output formats.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
List query first in --help as the recommended search method. Add
vector-search and deep-search as undocumented CLI aliases matching
MCP tool names.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Generated with PaperBanana (Gemini 3 Pro). Shows query expansion
fanning HyDE+Vec into vector searches, Lex into BM25, merged via
reciprocal rank fusion and LLM reranking.
Three improvements to hybridQuery:

1. Collection filter pushed into SQL: searchFTS and searchVec now
   accept collectionName directly instead of filtering post-hoc.
   Reduces noise in FTS probe and all expanded-query FTS calls.
   Also fixes MCP server's FTS search to use SQL-level filtering.

2. Batch embed for vector searches: instead of embedding each
   vec/hyde query sequentially (one embed call per query), we now
   collect all texts that need vector search and embed them in a
   single embedBatch() call. The sqlite-vec lookups still run
   sequentially (they're fast), but the expensive LLM embed step
   is batched.

3. FTS-first ordering: all lex expansions run immediately (sync,
   no LLM needed) before the vector embedding batch. This means
   FTS results are ready while embeddings compute.

Also cleans up legacy collectionId parameter naming (was number,
now properly string collectionName throughout).
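The batch-embed pattern above — collect every text that needs a vector, then make a single embedBatch() call — can be sketched as follows; the LLM interface here is an assumption modeled on the PR description:

```typescript
// One expensive LLM round-trip instead of one per query; the cheap
// sqlite-vec lookups can then run sequentially over the returned vectors.
interface EmbeddingLLM {
  embedBatch(texts: string[]): Promise<number[][]>;
}

async function embedAllOnce(
  llm: EmbeddingLLM,
  texts: string[],
): Promise<Map<string, number[]>> {
  const vectors = await llm.embedBatch(texts); // single call, not texts.length calls
  return new Map(texts.map((t, i) => [t, vectors[i]]));
}
```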
jaylfc and others added 20 commits March 28, 2026 19:56
Fix BM25 field weights to include all 3 FTS columns
Resolve conflict: use CTE approach from tobi#455 with updated BM25
weights (1.5, 4.0, 1.0) from tobi#462.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…g-vec0-replace

fix(embed): handle vec0 OR REPLACE limitation in insertEmbedding
fix: increase RERANK_CONTEXT_SIZE default 2048→4096, configurable via env var, fix template overhead underestimate
fix: prevent qmd embed from running indefinitely
Fix hyphenated tokens in FTS5 lex queries
Resolve conflicts: combine AST chunking args (filepath, chunkStrategy)
with abort signal parameter from tobi#458.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add `qmd serve` command that runs a lightweight HTTP server exposing
embedding, reranking, and query expansion endpoints. Multiple QMD clients
can share a single set of loaded models over the network instead of each
loading their own into RAM.

Changes:
- New `src/serve.ts`: HTTP server wrapping LlamaCpp (embed/rerank/expand/tokenize)
- New `src/llm-remote.ts`: RemoteLLM class implementing LLM interface via HTTP
- Updated LLM interface: added embedBatch, tokenize, intent option
- Updated store.ts: use LLM interface instead of concrete LlamaCpp type
- CLI: added `serve` command, `--server` flag, and QMD_SERVER env var
- README: documented remote model server usage and multi-agent setup

Addresses: tobi#489 tobi#490 tobi#502 tobi#480

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
startServer() now returns a Promise that stays pending until SIGINT/SIGTERM,
preventing the CLI from falling through to process.exit(0) immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Batch embedding of large document sets via the remote server can take
significantly longer than 30s on ARM CPU, especially under concurrent load.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
CPU-only embedding on ARM SBCs (RK3588 etc) can take over 2 minutes per
large chunk. 120s was still causing failures. 300s gives generous headroom
for batch operations without GPU acceleration.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add --backend rkllama flag to qmd serve that proxies embed/rerank/expand
requests to an rkllama NPU server instead of loading models locally via
node-llama-cpp. Supports all three model types on the RK3588 NPU.

Benchmarks: embedding ~1.25s, reranking ~2.2s, query expansion ~3.4s
(3-18x faster than CPU on ARM).

New CLI flags:
  --backend rkllama              Use NPU backend
  --rkllama-url http://host:8080 Custom rkllama URL
  QMD_SERVE_BACKEND=rkllama      Env var alternative
  RKLLAMA_URL=http://host:8080   Env var alternative

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace the text-generation workaround in RKLlamaBackend.rerank() with
a direct call to rkllama's native /api/rerank endpoint. This uses
logit-based cross-encoder scoring (softmax over yes/no tokens) instead
of parsing generated text, producing accurate relevance scores.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
When QMD_SERVER is set, getDefaultLLM() now auto-creates a RemoteLLM
without falling through to getDefaultLlamaCpp() which triggers
node-llama-cpp Vulkan builds and crashes in containers without GPU.

Also skip device info display in status when using remote server.

This fixes qmd embed failing silently in LXC containers.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Session maxDuration: 10 min → 60 min (large batch embeds on ARM NPU)
- Error rate abort threshold: 80% → 95% with doubled minimum sample
- Remote request timeout already at 300s (set earlier)

SBC/NPU backends have higher per-chunk latency and occasional failures.
The old 10-minute session limit caused "session expired" on workspaces
with 100+ documents. The 80% error threshold was too aggressive for
NPU backends where intermittent failures are normal during model
hot-swapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Large documents (900KB+ chat transcripts) generate 250+ chunks.
Intermittent NPU failures during batch embedding caused the 95%
threshold to trigger premature abort. With per-chunk 3x retry already
in place, only abort if the server is truly down (99% failure).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
RKLlamaBackend.embedBatch() now sends all texts in one /api/embed
call instead of individual HTTP requests per chunk.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
New GET endpoints for TinyAgentOS integration:
  /status      - index health (doc counts, embedding status)
  /collections - list collections with doc counts
  /search?q=X  - FTS5 keyword search with optional collection filter
  /browse      - paginated chunk listing, most recent first

These enable remote memory browsing without direct SQLite access,
supporting the TinyAgentOS architecture where agent data stays
in the agent's LXC container.
@paralizeer

Great PR @jaylfc — we've been running this on our ARM64 VPS (Oracle Cloud Ampere) since you posted in #489 and it's solid. The RemoteLLM drop-in is clean, and qmd serve solves a real problem for multi-agent deployments.

We did a thorough review and found a few things worth addressing:

Security: default bind should be 127.0.0.1 (not 0.0.0.0)

Both serve.ts (line 345) and qmd.ts (line 3096) default to binding on all interfaces. This exposes the model server to the entire network out of the box. Since most users will run this on localhost or private networks, defaulting to 127.0.0.1 is safer — users who need network access can explicitly pass --bind 0.0.0.0.

- const bind = options.bind ?? "0.0.0.0";
+ const bind = options.bind ?? "127.0.0.1";

Request body size limit

readBody() accumulates unlimited data, which could OOM the server. A simple 50MB cap:

import type { IncomingMessage } from "node:http";

const MAX_BODY_BYTES = 50 * 1024 * 1024;

function readBody(req: IncomingMessage): Promise<string> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    let totalBytes = 0;
    req.on("data", (chunk: Buffer) => {
      totalBytes += chunk.length;
      if (totalBytes > MAX_BODY_BYTES) {
        req.destroy();
        reject(new Error(`Request body exceeds ${MAX_BODY_BYTES / 1024 / 1024}MB limit`));
        return;
      }
      chunks.push(chunk);
    });
    req.on("end", () => resolve(Buffer.concat(chunks).toString("utf-8")));
    req.on("error", reject);
  });
}

Input type validation

The POST endpoints check for existence (!text) but not types. This means {"text": 42} passes validation but fails with a confusing stack trace. Using typeof text !== "string" catches bad input cleanly:

- if (!text) {
-   json(res, 400, { error: "text is required" });
+ if (typeof text !== "string" || text.length === 0) {
+   json(res, 400, { error: "text must be a non-empty string" });

Same pattern for /embed-batch (validate array elements are strings), /rerank (validate non-empty documents), and /expand.
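A minimal sketch of what that strict validation could look like as a reusable helper — field names and error messages follow the examples above, not qmd's actual code:

```typescript
// Validate a POST /embed body: reject non-string or empty `text` with a
// clear 400 message instead of a confusing downstream stack trace.
type Validation = { ok: true; text: string } | { ok: false; error: string };

function validateEmbedBody(body: unknown): Validation {
  const text = (body as { text?: unknown } | null)?.text;
  if (typeof text !== "string" || text.length === 0) {
    return { ok: false, error: "text must be a non-empty string" };
  }
  return { ok: true, text };
}
```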

README cleanup for upstream

The PR still has fork-specific framing ("Fork note", OpenClaw references in the README). For upstream, the "Remote Model Server" section should be presented as a first-class feature, not a fork addition. The OpenClaw-specific systemd example could be generalized.


Everything else looks good: parameterized SQL in /browse ✅, retry logic with backoff for SBC/NPU ✅, LLM interface abstraction ✅, graceful shutdown ✅, setDefaultLLM global swap ✅.

We've already applied the security fixes locally and confirmed they work. Happy to test any further changes you push.

- Default bind to 127.0.0.1 instead of 0.0.0.0 (users opt-in to network exposure)
- Add 50MB request body size limit to prevent OOM
- Strict input type validation on all POST endpoints (typeof string checks)
- Clean up README: remove fork-specific framing, generalize for upstream
- Document rkllama NPU backend option in examples

Addresses review feedback from @paralizeer on tobi#509
jaylfc pushed a commit to jaylfc/qmd that referenced this pull request Apr 5, 2026
- Default bind to 127.0.0.1 instead of 0.0.0.0 (users opt-in to network exposure)
- Add 50MB request body size limit to prevent OOM
- Strict input type validation on all POST endpoints (typeof string checks)
- Clean up README: remove fork-specific framing, generalize for upstream
- Document rkllama NPU backend option in examples

Addresses review feedback from @paralizeer on tobi#509

jaylfc commented Apr 5, 2026

Thanks for the thorough review @paralizeer — all four points addressed in 69df17d:

  1. Default bind → 127.0.0.1 ✅ Both serve.ts and the CLI default changed. Users explicitly opt-in to network exposure with --bind 0.0.0.0.

  2. Request body size limit ✅ Added 50MB cap to readBody() — destroys the request if exceeded instead of accumulating to OOM.

  3. Input type validation ✅ All POST endpoints now use typeof text !== "string" || text.length === 0 checks. /embed-batch validates array elements are strings. /rerank validates non-empty documents array. /tokenize and /expand also tightened.

  4. README cleaned up ✅ Removed fork-specific framing, generalized the OpenClaw systemd example to a generic agent/container integration section, added rkllama NPU backend to the examples.

Appreciate you testing this on your ARM64 setup — good to know it's solid on Oracle Cloud Ampere too.

@jaylfc jaylfc closed this Apr 5, 2026
@jaylfc jaylfc force-pushed the feat/remote-llm-provider branch from 69df17d to 80eb824 on April 5, 2026 18:52
jaylfc added a commit to jaylfc/qmd that referenced this pull request Apr 5, 2026
- Default bind to 127.0.0.1 instead of 0.0.0.0 (users opt-in to network exposure)
- Add 50MB request body size limit to prevent OOM
- Strict input type validation on all POST endpoints (typeof string checks)
- Clean up README: remove fork-specific framing, generalize for upstream
- Document rkllama NPU backend option in examples

Addresses review feedback from @paralizeer on tobi#509