feat: remote model server (qmd serve) for shared inference across clients #511
Adds `qmd serve` — an HTTP server for embedding, reranking, and query expansion. Supports local (node-llama-cpp) and rkllama (RK3588 NPU) backends. A `RemoteLLM` client auto-activates via the `QMD_SERVER` env var. Includes:

- Batch embedding (single HTTP call for all chunks)
- NPU timeout/retry tuning for ARM SBCs
- rkllama rerank via logit-based scoring
- Index endpoints: /search, /browse, /collections, /status
- Security: default bind 127.0.0.1, 50MB body limit, type validation
- Updated README and CHANGELOG
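The batch-embedding item above can be sketched from the client side. This is a minimal sketch, not the PR's actual code: the endpoint path and the JSON shapes (`{ texts }` in, `{ embeddings }` out) are assumptions about the wire format.

```typescript
// Sketch: send all chunks in one request instead of one HTTP call per chunk.
// Endpoint path and payload shapes are assumptions, not QMD's documented API.
async function embedBatch(baseUrl: string, texts: string[]): Promise<number[][]> {
  const res = await fetch(`${baseUrl}/embed-batch`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ texts }), // all chunks in a single round trip
  });
  if (!res.ok) throw new Error(`embed-batch failed: HTTP ${res.status}`);
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}
```

The single round trip is what makes indexing over the network practical on slow links: per-chunk calls would pay HTTP overhead thousands of times on a large collection.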
This replaces #509, which had the same changes but was broken by a branch history cleanup (a force push orphaned the common ancestor with main). The discussion and review from #509 are still relevant — @paralizeer provided a thorough security review there, which has been incorporated into this version:
See the full review thread: #509
Embeds the query via the configured backend (rkllama/local), then runs sqlite-vec nearest-neighbour search against stored vectors. Returns ranked results with cosine similarity scores. Enables TinyAgentOS to offer semantic memory search over HTTP.
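A minimal illustration of that ranking step, using plain cosine similarity over in-memory vectors. In QMD the nearest-neighbour search itself runs inside sqlite-vec; the helper names here are illustrative only.

```typescript
// Illustrative sketch of /vsearch ranking: score stored vectors against the
// query embedding by cosine similarity, highest first. Not QMD's actual code;
// the real search is delegated to sqlite-vec.
type Scored = { id: string; score: number };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rank(query: number[], stored: Map<string, number[]>, k: number): Scored[] {
  const scored: Scored[] = [];
  for (const [id, vec] of stored) scored.push({ id, score: cosine(query, vec) });
  scored.sort((x, y) => y.score - x.score); // descending similarity
  return scored.slice(0, k);
}
```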
Hi @jaylfc, thanks for this. I just had a flick through your source code changes to see what this brings to the table. I think this is a great idea: running a single qmd server instance shared across clients.

However, I'm not convinced this is the same functionality requested and provided in the other issues and PRs. A centralised QMD server is a different feature entirely; the other PRs are about connecting to remote OpenAI API-compatible hosted models.

Further, I think naming variables as specific to rkllama reduces the apparent utility of this PR. You mention rkllama and Rockchip NPUs (both of which I'd never heard of before) several times. According to rkllama's README, though:
Similar to a comment I had on #133 (comment), I'd suggest naming after the API specification, not the specific product.
Thanks @alexleach — both fair points.

On the scope distinction: you're right that this is a different (complementary) feature to the remote OpenAI-compatible model support in #116/#480. Those PRs let QMD connect to any OpenAI-compatible endpoint for embeddings; this PR adds a centralised QMD-native server that wraps the full embed+rerank+expand pipeline. They solve different problems:
Both should land — they're not competing. I'll update the PR description to make this clearer and remove the "Closes #489" since that issue is better addressed by #116.

On the rkllama naming: agreed — I'll rename the rkllama-specific variables and backend type. The backend is really "ollama-compatible" since rkllama speaks the same API. I'll change:
Will push these changes shortly.
…API naming

- Backend type: 'rkllama' → 'ollama' (Ollama-compatible, works with rkllama/ollama/etc)
- CLI: --backend-url replaces --rkllama-url (old flag kept as deprecated alias)
- Class: RKLlamaBackend → OllamaCompatBackend
- Default URL: localhost:11434 (standard Ollama port)
- All internal comments genericised
- --rkllama-url and RKLLAMA_URL env var still work for backwards compat
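The deprecated-alias behaviour in this commit can be sketched as follows. The `resolveBackendUrl` helper and the flat flags record are hypothetical; only the flag and env-var names come from the commit.

```typescript
// Sketch of deprecated-alias resolution: the new --backend-url flag wins,
// the old --rkllama-url flag and RKLLAMA_URL env var still resolve with a
// warning, and the standard Ollama port is the fallback. Illustrative only.
function resolveBackendUrl(
  flags: Record<string, string | undefined>,
  env: Record<string, string | undefined> = {},
): string {
  if (flags["backend-url"]) return flags["backend-url"];
  if (flags["rkllama-url"]) {
    console.warn("warning: --rkllama-url is deprecated, use --backend-url");
    return flags["rkllama-url"];
  }
  if (env["RKLLAMA_URL"]) return env["RKLLAMA_URL"];
  return "http://localhost:11434"; // standard Ollama port
}
```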
Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
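Two of the RemoteLLM items in this commit message, batch splitting and dimension validation, can be sketched like this. Both helper names are hypothetical, not the commit's actual code.

```typescript
// Sketch of batch splitting: cap each remote embed request at maxBatch
// inputs so large collections don't produce oversized HTTP bodies.
function splitBatch<T>(items: T[], maxBatch: number): T[][] {
  if (maxBatch < 1) throw new Error("maxBatch must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxBatch) {
    batches.push(items.slice(i, i + maxBatch));
  }
  return batches;
}

// Sketch of dimension validation: reject embeddings whose length does not
// match the configured model, catching a server pointed at the wrong model.
function validateDims(embeddings: number[][], expected: number): number[][] {
  for (const e of embeddings) {
    if (e.length !== expected) {
      throw new Error(`expected ${expected}-dim embedding, got ${e.length}`);
    }
  }
  return embeddings;
}
```

Dimension validation matters because vectors of the wrong size would otherwise be written into the index silently and corrupt similarity search.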
Just a heads up that the rename is pushed (commit b6c1019). All the rkllama-specific naming has been replaced with generic ollama-compat as suggested. Let me know if there's anything else needed for this to move forward.
Great work @jaylfc — I cherry-picked this onto current main (resolved conflicts in llm.ts, qmd.ts, store.ts) and have been running it on an RTX 3090 Ti alongside Ollama Gemma 4 26B (~20.3 GB VRAM). The local backend works well, but I hit a VRAM budget problem: all three models resident = ~5.4 GB, which exceeds the ~3.7 GB left after Ollama. Same failure mode as #275 — so I built a sequential backend that keeps only one model loaded at a time.

This required two small additions to serve.ts and llm.ts. Results on a 22K-file collection:
This also addresses #275 at the serve level — GPUs with limited free VRAM can use sequential mode instead of crashing on rerank context creation. Happy to open a follow-up PR against your branch with the sequential backend if you're interested. The changes are minimal (~60 lines in serve.ts + ~20 lines in llm.ts). Repo with the working code: brettdavies/qmd@feat/ollama-backend
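The sequential idea described above can be sketched as follows. The loader and stage shapes here are hypothetical; the real changes live in serve.ts and llm.ts on the linked branch.

```typescript
// Sketch of sequential mode: run embed, rerank, and expand one after another,
// loading each model only for its stage and disposing it before the next, so
// at most one model is resident in VRAM at a time. Illustrative only.
interface ModelHandle { name: string; dispose(): void }
type Loader = (name: string) => ModelHandle;

function runSequential<T>(
  load: Loader,
  stages: { model: string; run: (m: ModelHandle) => T }[],
): T[] {
  const results: T[] = [];
  for (const stage of stages) {
    const handle = load(stage.model);   // load this stage's model
    try {
      results.push(stage.run(handle));  // do the stage's work
    } finally {
      handle.dispose();                 // free memory before the next stage
    }
  }
  return results;
}
```

The trade is latency for peak memory: each query pays the load/unload cost, but the resident footprint drops from the sum of all models to the largest single one.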
Thanks @brettdavies — this is a great addition and the numbers are compelling. The sequential pipeline model fits the serve design well.

Worth flagging: this is directly useful for us too. On Orange Pi 5 Plus (RK3588 NPU) we run the same three-model shape — qwen3-embedding-0.6b + qwen3-reranker-0.6b + qmd-query-expansion — and hit the same budget pressure when image generation tries to share the NPU. The 0.6B quantised models load in ~1–2 s on NPU, which puts us in the same latency-vs-memory tradeoff as your 3 s → 5.6 s GPU numbers, and the win for us is that the NPU cores get freed for SD inference instead of forcing users to pick "chat OR image gen". Happy to run your branch on aarch64 and post the numbers once it's in a follow-up PR — a second hardware class validating the pattern probably helps it upstream.

On where this should land: I'd like to keep #511 narrowly scoped to "centralised serve behind HTTP" so it can clear review and merge cleanly — the rename to generic ollama-compat naming is the last open feedback. If you open your sequential mode as a separate PR (either immediately after #511 merges, or stacked against this branch if the maintainers prefer), I think it has a much stronger standalone case:
Two small things worth considering while you're writing it up:
Either way, ping me on the follow-up PR when it's up and I'll review + post RK3588 numbers.
Summary
Adds `qmd serve` — a centralised model server that loads QMD's embedding, reranking, and query expansion models once and serves them over HTTP. Multiple QMD clients (e.g. agents in LXC containers) share the same loaded models instead of each loading their own copy.

This is a different feature from remote OpenAI-compatible model support (#116, #480). Those PRs connect QMD to external embedding servers. This PR creates a QMD-native server that wraps the full embed+rerank+expand pipeline.
Problem
Multiple QMD instances each load their own models into RAM. On a 16GB device running 3 agents, that's 3x the model memory. `qmd serve` loads models once and serves all clients.

Solution
Server
Endpoints: `/embed`, `/embed-batch`, `/rerank`, `/expand`, `/tokenize`, `/health`, `/status`, `/collections`, `/search?q=X`, `/browse`, `/vsearch`

Two backends: `local` (node-llama-cpp) and `ollama` (any Ollama-compatible API — ollama, rkllama, etc).

Client
Drop-in `RemoteLLM` — auto-activates via the `QMD_SERVER` env var.

Security

Default bind `127.0.0.1`; 50MB request body limit; type validation on inputs.

Testing
Tested with 3 agents sharing one server. Rankings identical to standard QMD.
Backwards compatible
No changes when `QMD_SERVER` is not set.
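The opt-in switch can be sketched like this. `QMD_SERVER` is the real env var from this PR; the `chooseBackend` helper and its return shape are hypothetical.

```typescript
// Sketch of the backwards-compatibility guarantee: the remote client only
// activates when QMD_SERVER is set; otherwise behaviour is unchanged local
// inference. Helper name and return shape are illustrative only.
function chooseBackend(
  env: Record<string, string | undefined>,
): { kind: "remote"; url: string } | { kind: "local" } {
  const url = env["QMD_SERVER"];
  if (url) return { kind: "remote", url }; // RemoteLLM over HTTP to qmd serve
  return { kind: "local" };                // in-process node-llama-cpp, as before
}
```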