feat: remote model server (qmd serve) for shared inference across clients #511
Adds `qmd serve` — an HTTP server for embedding, reranking, and query expansion. Supports local (node-llama-cpp) and rkllama (RK3588 NPU) backends. A `RemoteLLM` client auto-activates via the `QMD_SERVER` env var. Includes:

- Batch embedding (single HTTP call for all chunks)
- NPU timeout/retry tuning for ARM SBCs
- rkllama rerank via logit-based scoring
- Index endpoints: /search, /browse, /collections, /status
- Security: default bind 127.0.0.1, 50MB body limit, type validation
- Updated README and CHANGELOG
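The batch-embedding item above can be sketched from the client side. This is a minimal sketch, not the PR's actual code: the endpoint path and the JSON shapes (`{ texts }` in, `{ embeddings }` out) are assumptions about the wire format.

```typescript
// Sketch: send all chunks in one request instead of one HTTP call per chunk.
// Endpoint path and payload shapes are assumptions, not QMD's documented API.
async function embedBatch(baseUrl: string, texts: string[]): Promise<number[][]> {
  const res = await fetch(`${baseUrl}/embed-batch`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ texts }), // all chunks in a single round trip
  });
  if (!res.ok) throw new Error(`embed-batch failed: HTTP ${res.status}`);
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}
```

The single round trip is what makes indexing over the network practical on slow links: per-chunk calls would pay HTTP overhead thousands of times on a large collection.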
This replaces #509, which had the same changes but was broken by a branch history cleanup (a force push orphaned the common ancestor with main). The discussion and review from #509 are still relevant — @paralizeer provided a thorough security review there, which has been incorporated into this version:
See the full review thread: #509
Embeds the query via the configured backend (rkllama/local), then runs sqlite-vec nearest-neighbour search against stored vectors. Returns ranked results with cosine similarity scores. Enables TinyAgentOS to offer semantic memory search over HTTP.
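A minimal illustration of that ranking step, using plain cosine similarity over in-memory vectors. In QMD the nearest-neighbour search itself runs inside sqlite-vec; the helper names here are illustrative only.

```typescript
// Illustrative sketch of /vsearch ranking: score stored vectors against the
// query embedding by cosine similarity, highest first. Not QMD's actual code;
// the real search is delegated to sqlite-vec.
type Scored = { id: string; score: number };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rank(query: number[], stored: Map<string, number[]>, k: number): Scored[] {
  const scored: Scored[] = [];
  for (const [id, vec] of stored) scored.push({ id, score: cosine(query, vec) });
  scored.sort((x, y) => y.score - x.score); // descending similarity
  return scored.slice(0, k);
}
```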
Hi @jaylfc, thanks for this. I just had a flick through your source code changes to see what this brings to the table. I think this is a great idea: running a single qmd server instance shared across clients.

However, I'm not convinced this is the same functionality requested and provided in the other issues and PRs. A centralised QMD server is a different feature entirely; the other PRs are about connecting to remote OpenAI API-compatible hosted models.

Further, I think naming variables as specific to rkllama reduces the apparent utility of this PR. You mention rkllama and Rockchip NPUs (both of which I'd never heard of before) several times. According to rkllama's README, though:
Similar to a comment I had on #133 (comment), I'd suggest naming after the API specification, not the specific product.
Thanks @alexleach — both fair points.

On the scope distinction: you're right that this is a different (complementary) feature to the remote OpenAI-compatible model support in #116/#480. Those PRs let QMD connect to any OpenAI-compatible endpoint for embeddings; this PR adds a centralised QMD-native server that wraps the full embed+rerank+expand pipeline. They solve different problems:
Both should land — they're not competing. I'll update the PR description to make this clearer and remove the "Closes #489" since that issue is better addressed by #116.

On the rkllama naming: agreed — I'll rename the rkllama-specific variables and backend type. The backend is really "ollama-compatible" since rkllama speaks the same API. I'll change:
Will push these changes shortly.
…API naming

- Backend type: 'rkllama' → 'ollama' (Ollama-compatible, works with rkllama/ollama/etc)
- CLI: --backend-url replaces --rkllama-url (old flag kept as deprecated alias)
- Class: RKLlamaBackend → OllamaCompatBackend
- Default URL: localhost:11434 (standard Ollama port)
- All internal comments genericised
- --rkllama-url and RKLLAMA_URL env var still work for backwards compat
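The deprecated-alias behaviour in this commit can be sketched as follows. The `resolveBackendUrl` helper and the flat flags record are hypothetical; only the flag and env-var names come from the commit.

```typescript
// Sketch of deprecated-alias resolution: the new --backend-url flag wins,
// the old --rkllama-url flag and RKLLAMA_URL env var still resolve with a
// warning, and the standard Ollama port is the fallback. Illustrative only.
function resolveBackendUrl(
  flags: Record<string, string | undefined>,
  env: Record<string, string | undefined> = {},
): string {
  if (flags["backend-url"]) return flags["backend-url"];
  if (flags["rkllama-url"]) {
    console.warn("warning: --rkllama-url is deprecated, use --backend-url");
    return flags["rkllama-url"];
  }
  if (env["RKLLAMA_URL"]) return env["RKLLAMA_URL"];
  return "http://localhost:11434"; // standard Ollama port
}
```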
Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
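Two of the RemoteLLM items in this commit message, batch splitting and dimension validation, can be sketched like this. Both helper names are hypothetical, not the commit's actual code.

```typescript
// Sketch of batch splitting: cap each remote embed request at maxBatch
// inputs so large collections don't produce oversized HTTP bodies.
function splitBatch<T>(items: T[], maxBatch: number): T[][] {
  if (maxBatch < 1) throw new Error("maxBatch must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxBatch) {
    batches.push(items.slice(i, i + maxBatch));
  }
  return batches;
}

// Sketch of dimension validation: reject embeddings whose length does not
// match the configured model, catching a server pointed at the wrong model.
function validateDims(embeddings: number[][], expected: number): number[][] {
  for (const e of embeddings) {
    if (e.length !== expected) {
      throw new Error(`expected ${expected}-dim embedding, got ${e.length}`);
    }
  }
  return embeddings;
}
```

Dimension validation matters because vectors of the wrong size would otherwise be written into the index silently and corrupt similarity search.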
Just a heads up that the rename is pushed (commit b6c1019). All the rkllama-specific naming has been replaced with generic ollama-compat as suggested. Let me know if there's anything else needed for this to move forward.
Great work @jaylfc — I cherry-picked this onto current main (resolved conflicts in llm.ts, qmd.ts, store.ts) and have been running it on an RTX 3090 Ti alongside Ollama Gemma 4 26B (~20.3 GB VRAM). The local backend works well, but I hit a VRAM budget problem: all three models resident = ~5.4 GB, which exceeds the ~3.7 GB left after Ollama. Same failure mode as #275 — so I built a sequential backend that keeps only one model loaded at a time.

This required two small additions to serve.ts and llm.ts. Results on a 22K-file collection:
This also addresses #275 at the serve level — GPUs with limited free VRAM can use sequential mode instead of crashing on rerank context creation. Happy to open a follow-up PR against your branch with the sequential backend if you're interested. The changes are minimal (~60 lines in serve.ts + ~20 lines in llm.ts). Repo with the working code: brettdavies/qmd@feat/ollama-backend
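The sequential idea described above can be sketched as follows. The loader and stage shapes here are hypothetical; the real changes live in serve.ts and llm.ts on the linked branch.

```typescript
// Sketch of sequential mode: run embed, rerank, and expand one after another,
// loading each model only for its stage and disposing it before the next, so
// at most one model is resident in VRAM at a time. Illustrative only.
interface ModelHandle { name: string; dispose(): void }
type Loader = (name: string) => ModelHandle;

function runSequential<T>(
  load: Loader,
  stages: { model: string; run: (m: ModelHandle) => T }[],
): T[] {
  const results: T[] = [];
  for (const stage of stages) {
    const handle = load(stage.model);   // load this stage's model
    try {
      results.push(stage.run(handle));  // do the stage's work
    } finally {
      handle.dispose();                 // free memory before the next stage
    }
  }
  return results;
}
```

The trade is latency for peak memory: each query pays the load/unload cost, but the resident footprint drops from the sum of all models to the largest single one.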
Thanks @brettdavies — this is a great addition and the numbers are compelling. The sequential pipeline model fits the serve design well.

Worth flagging: this is directly useful for us too. On Orange Pi 5 Plus (RK3588 NPU) we run the same three-model shape — qwen3-embedding-0.6b + qwen3-reranker-0.6b + qmd-query-expansion — and hit the same budget pressure when image generation tries to share the NPU. The 0.6B quantised models load in ~1–2 s on NPU, which puts us in the same latency-vs-memory tradeoff as your 3 s → 5.6 s GPU numbers, and the win for us is that the NPU cores get freed for SD inference instead of forcing users to pick "chat OR image gen". Happy to run your branch on aarch64 and post the numbers once it's in a follow-up PR — a second hardware class validating the pattern probably helps it upstream.

On where this should land: I'd like to keep #511 narrowly scoped to "centralised serve behind HTTP" so it can clear review and merge cleanly — the rename to generic ollama-compat naming is the last open feedback. If you open your sequential mode as a separate PR (either immediately after #511 merges, or stacked against this branch if the maintainers prefer), I think it has a much stronger standalone case:
Two small things worth considering while you're writing it up:
Either way, ping me on the follow-up PR when it's up and I'll review + post RK3588 numbers.
Summary
Adds `qmd serve` — a centralised model server that loads QMD's embedding, reranking, and query expansion models once and serves them over HTTP. Multiple QMD clients (e.g. agents in LXC containers) share the same loaded models instead of each loading their own copy.

This is a different feature from remote OpenAI-compatible model support (#116, #480). Those PRs connect QMD to external embedding servers. This PR creates a QMD-native server that wraps the full embed+rerank+expand pipeline.
Problem
Multiple QMD instances each load their own models into RAM. On a 16GB device running 3 agents, that's 3x the model memory. `qmd serve` loads models once and serves all clients.

Solution
Server
Endpoints: `/embed`, `/embed-batch`, `/rerank`, `/expand`, `/tokenize`, `/health`, `/status`, `/collections`, `/search?q=X`, `/browse`, `/vsearch`

Two backends: `local` (node-llama-cpp) and `ollama` (any Ollama-compatible API — ollama, rkllama, etc).

Client
Drop-in `RemoteLLM` — auto-activates via the `QMD_SERVER` env var.

Security

Default bind `127.0.0.1`; 50MB request body limit; type validation on inputs.

Testing
Tested with 3 agents sharing one server. Rankings identical to standard QMD.
Backwards compatible
No changes when `QMD_SERVER` is not set.
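The opt-in switch can be sketched like this. `QMD_SERVER` is the real env var from this PR; the `chooseBackend` helper and its return shape are hypothetical.

```typescript
// Sketch of the backwards-compatibility guarantee: the remote client only
// activates when QMD_SERVER is set; otherwise behaviour is unchanged local
// inference. Helper name and return shape are illustrative only.
function chooseBackend(
  env: Record<string, string | undefined>,
): { kind: "remote"; url: string } | { kind: "local" } {
  const url = env["QMD_SERVER"];
  if (url) return { kind: "remote", url }; // RemoteLLM over HTTP to qmd serve
  return { kind: "local" };                // in-process node-llama-cpp, as before
}
```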