
feat: remote model server (qmd serve) for shared inference across clients #511

Open
jaylfc wants to merge 3 commits into tobi:main from jaylfc:feat/remote-llm-provider-clean

Conversation

@jaylfc jaylfc commented Apr 5, 2026

Summary

Adds qmd serve — a centralised model server that loads QMD's embedding, reranking, and query expansion models once and serves them over HTTP. Multiple QMD clients (e.g. agents in LXC containers) share the same loaded models instead of each loading their own copy.

This is a different feature from remote OpenAI-compatible model support (#116, #480). Those PRs connect QMD to external embedding servers. This PR creates a QMD-native server that wraps the full embed+rerank+expand pipeline.

Problem

Multiple QMD instances each load their own models into RAM. On a 16GB device running 3 agents, that's 3x the model memory. qmd serve loads models once and serves all clients.

Solution

Server

qmd serve --port 7832
qmd serve --backend ollama --backend-url http://localhost:11434
| Endpoint     | Method | Description                   |
|--------------|--------|-------------------------------|
| /embed       | POST   | Embed a single text           |
| /embed-batch | POST   | Batch embed multiple texts    |
| /rerank      | POST   | Rerank documents by relevance |
| /expand      | POST   | Expand a query (lex/vec/hyde) |
| /tokenize    | POST   | Count tokens in text          |
| /health      | GET    | Server status + loaded models |
| /status      | GET    | Index health                  |
| /collections | GET    | List collections              |
| /search?q=X  | GET    | FTS5 keyword search           |
| /browse      | GET    | Paginated chunk listing       |
| /vsearch     | POST   | Semantic vector search        |

Two backends: local (node-llama-cpp) and ollama (any Ollama-compatible API — ollama, rkllama, etc).
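A minimal client call against /embed might look like the sketch below; the JSON field names (`text`, `embedding`) are assumptions for illustration, not taken from the PR diff:

```typescript
// Hypothetical sketch of calling the /embed endpoint. The request and
// response field names are assumed, not confirmed from the PR.
interface EmbedRequest { text: string }
interface EmbedResponse { embedding: number[] }

function buildEmbedRequest(text: string): EmbedRequest {
  // Mirrors the server's strict validation: reject anything but a
  // non-empty string before it goes over the wire.
  if (typeof text !== "string" || text.length === 0) {
    throw new Error("text must be a non-empty string");
  }
  return { text };
}

async function embed(baseUrl: string, text: string): Promise<number[]> {
  const res = await fetch(`${baseUrl}/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbedRequest(text)),
  });
  if (!res.ok) throw new Error(`embed failed: HTTP ${res.status}`);
  const body = (await res.json()) as EmbedResponse;
  return body.embedding;
}
```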

Client

Drop-in RemoteLLM — auto-activates via QMD_SERVER env var:

export QMD_SERVER=http://your-host:7832
qmd query "search terms"   # no local models needed
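The auto-activation check can be sketched as a small helper; `resolveServerUrl` is a hypothetical name for illustration, not the PR's actual code:

```typescript
// Hypothetical sketch of QMD_SERVER auto-activation: if the env var is
// set, the client talks to the remote server; otherwise it falls back
// to local models. Function name is illustrative only.
function resolveServerUrl(env: Record<string, string | undefined>): string | null {
  const url = env["QMD_SERVER"];
  if (!url) return null;           // not set: use local models as before
  return url.replace(/\/+$/, "");  // normalise trailing slashes
}
```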

Security

  • Default bind 127.0.0.1
  • 50MB request body limit
  • Strict type validation on all endpoints
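As one illustration of strict type validation, a /rerank body check might look like this (the exact field names and checks in the PR may differ):

```typescript
// Illustrative strict validation for a POST /rerank body, per the
// security bullets above; field names are assumptions.
function validateRerankBody(body: unknown): { query: string; documents: string[] } {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const b = body as Record<string, unknown>;
  if (typeof b.query !== "string") throw new Error("query must be a string");
  if (!Array.isArray(b.documents) || !b.documents.every((d) => typeof d === "string")) {
    throw new Error("documents must be an array of strings");
  }
  return { query: b.query, documents: b.documents as string[] };
}
```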

Testing

Tested with 3 agents sharing one server. Rankings identical to standard QMD.

Backwards compatible

No changes when QMD_SERVER is not set.

Adds qmd serve — HTTP server for embedding, reranking, query expansion.
Supports local (node-llama-cpp) and rkllama (RK3588 NPU) backends.
RemoteLLM client auto-activates via QMD_SERVER env var.

Includes:
- Batch embedding (single HTTP call for all chunks)
- NPU timeout/retry tuning for ARM SBCs
- rkllama rerank via logit-based scoring
- Index endpoints: /search, /browse, /collections, /status
- Security: default bind 127.0.0.1, 50MB body limit, type validation
- Updated README and CHANGELOG
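The "logit-based scoring" bullet likely refers to the yes/no-token scheme commonly used with rerankers such as qwen3-reranker: the relevance score is the softmax probability of the "yes" token. A sketch assuming that scheme (not confirmed from the diff):

```typescript
// Assumed logit-based rerank scoring: softmax over {"yes","no"} token
// logits, yielding a relevance score in (0, 1).
function rerankScore(yesLogit: number, noLogit: number): number {
  const m = Math.max(yesLogit, noLogit); // subtract max for stability
  const ey = Math.exp(yesLogit - m);
  const en = Math.exp(noLogit - m);
  return ey / (ey + en);
}
```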

jaylfc commented Apr 5, 2026

This replaces #509 which had the same changes but was broken by a branch history cleanup (force push orphaned the common ancestor with main). The discussion and review from #509 is still relevant — @paralizeer provided a thorough security review there which has been incorporated into this version:

  • Default bind 127.0.0.1 (was 0.0.0.0)
  • 50MB request body limit
  • Strict input type validation on all POST endpoints
  • README cleaned up for upstream (removed fork-specific framing)

See the full review thread: #509

Embeds the query via the configured backend (rkllama/local), then
runs sqlite-vec nearest-neighbour search against stored vectors.
Returns ranked results with cosine similarity scores.

Enables TinyAgentOS to offer semantic memory search over HTTP.
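The cosine similarity scores mentioned above follow the standard formula; a small reference implementation:

```typescript
// Standard cosine similarity, as used to rank /vsearch results:
// dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```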
@alexleach

Hi @jaylfc, thanks for this. I just had a flick through your source code changes to see what this brings to the table. I think this is a great idea: running a single qmd server instance through qmd serve and letting multiple containers connect to it. That's great!

However, I'm not convinced this is the same functionality requested and provided in the other issues and PRs. A centralised QMD server is a different feature entirely. The other PRs are about connecting to remote OpenAI-API-compatible hosted models.

Further, I think naming variables after rkllama specifically reduces the apparent utility of this PR. You mention rkllama and Rockchip NPUs (neither of which I'd heard of before) several times. According to rkllama's README, though:

Main Features

  • Ollama API compatibility - Support for:
    ...
  • Partial OpenAI API compatibility - Support for:

Similar to a comment I made on #133 (comment):

> I feel like making this specific to [rkllama] reduces the usefulness of this PR

I'd suggest naming after the API specification, not the specific product.

@jaylfc

jaylfc commented Apr 6, 2026

Thanks @alexleach — both fair points.

On the scope distinction: You're right that this is a different (complementary) feature to the remote OpenAI-compatible model support in #116/#480. Those PRs let QMD connect to any OpenAI-compatible endpoint for embeddings. Our PR adds a centralised qmd serve that wraps QMD's own model pipeline (embed + rerank + query expansion) behind HTTP, so multiple QMD clients can share one set of loaded models.

They solve different problems and are not competing; both should land. I'll update the PR description to make this clearer and remove the "Closes #489" reference, since that issue is better addressed by #116.

On the rkllama naming: Agreed — I'll rename the rkllama-specific variables and backend type. The backend is really "ollama-compatible" since rkllama speaks the same API. I'll change:

  • --rkllama-url → keep as an alias but add --backend-url as the primary flag
  • rkllamaUrl config → backendUrl
  • The rkllama backend type stays as one option alongside local, but the API-facing names will be generic

Will push these changes shortly.

…I naming

- Backend type: 'rkllama' → 'ollama' (Ollama-compatible, works with rkllama/ollama/etc)
- CLI: --backend-url replaces --rkllama-url (old flag kept as deprecated alias)
- Class: RKLlamaBackend → OllamaCompatBackend
- Default URL: localhost:11434 (standard Ollama port)
- All internal comments genericised
- --rkllama-url and RKLLAMA_URL env var still work for backwards compat
jhsmith409 pushed a commit to jhsmith409/qmd that referenced this pull request Apr 6, 2026
Support offloading embedding and reranking to remote OpenAI-compatible
servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query
expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation,
  batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton
  and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@jaylfc

jaylfc commented Apr 9, 2026

Just a heads up that the rename is pushed (commit b6c1019). All the rkllama-specific naming has been replaced with generic ollama-compat as suggested. Let me know if there's anything else needed for this to move forward.

@brettdavies

Great work @jaylfc — I cherry-picked this onto current main (resolved conflicts in llm.ts, qmd.ts, store.ts) and have been running it on an RTX 3090 Ti alongside Ollama Gemma 4 26B (~20.3 GB VRAM).

The local backend works well but I hit a VRAM budget problem: all three models resident = ~5.4 GB, which exceeds the ~3.7 GB left after Ollama. Same failure mode as #275: `Failed to create any rerank context`.

I built a SequentialLocalBackend on top of this PR that solves it. Since the three models are used in sequential pipeline stages (expand → embed → search → rerank), only one heavy model needs to be loaded at a time:

  • Embed model stays resident (320 MiB — tiny)
  • Generate model loads for query expansion, then disposed
  • Rerank model loads for reranking, then disposed

This required two small additions to LlamaCpp: disposeGenerateModel() and disposeRerankModel() methods that dispose individual models and reset their load promises.
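The sequential lifecycle described above can be sketched as follows; `StageModel` and the function shape are hypothetical stand-ins, not the branch's actual LlamaCpp API:

```typescript
// Hypothetical sketch of the sequential pipeline: only one heavy model
// is resident at a time, because the stages run strictly in order
// (expand -> embed -> search -> rerank). Interfaces are illustrative.
interface StageModel { load(): Promise<void>; dispose(): void }

async function runQuery(
  generate: StageModel, rerank: StageModel,
  expand: () => Promise<string[]>,
  embedAndSearch: () => Promise<string[]>,
  rerankDocs: (docs: string[]) => Promise<string[]>,
) {
  await generate.load();                      // heavy generate model resident
  const expansions = await expand();
  generate.dispose();                         // free VRAM before next heavy stage

  const candidates = await embedAndSearch();  // tiny embed model stays resident

  await rerank.load();                        // heavy rerank model resident
  const ranked = await rerankDocs(candidates);
  rerank.dispose();
  return { expansions, ranked };
}
```

The cost is pure load time between stages; correctness is unaffected because no two heavy models are ever needed concurrently within one query.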

Results on 22K-file collection:

| Mode                     | Query time | Peak qmd VRAM    | Coexists with 20 GB Ollama? |
|--------------------------|------------|------------------|-----------------------------|
| Bare qmd query           | 33s        | ~3 GB (churning) | Barely, with failures       |
| qmd serve (all resident) | 3s         | 5.4 GB           | No (25.7 GB total)          |
| qmd serve --sequential   | 5.6s       | 2.3 GB           | Yes (22.6 GB total)         |

This also addresses #275 at the serve level — GPUs with limited free VRAM can use sequential mode instead of crashing on rerank context creation.

Happy to open a follow-up PR against your branch with the sequential backend if you're interested. The changes are minimal (~60 lines in serve.ts + ~20 lines in llm.ts).

Repo with the working code: brettdavies/qmd@feat/ollama-backend

@jaylfc

jaylfc commented Apr 11, 2026

Thanks @brettdavies — this is a great addition and the numbers are compelling. The sequential pipeline model fits qmd serve's design neatly because the stages genuinely are sequential (expand → embed → search → rerank), so disposing between stages only costs you load time, not correctness.

Worth flagging: this is directly useful for us too. On Orange Pi 5 Plus (RK3588 NPU) we run the same three-model shape — qwen3-embedding-0.6b + qwen3-reranker-0.6b + qmd-query-expansion — and hit the same budget pressure when image generation tries to share the NPU. The 0.6B quantised models load in ~1–2 s on NPU which puts us in the same latency-vs-memory tradeoff as your 3 s → 5.6 s GPU numbers, and the win for us is that the NPU cores get freed for SD inference instead of forcing users to pick "chat OR image gen". Happy to run your branch on aarch64 and post the numbers once it's in a follow-up PR — a second hardware class validating the pattern probably helps it upstream.

On where this should land: I'd like to keep #511 narrowly scoped to "centralised serve behind HTTP" so it can clear review and merge cleanly — the rename to generic ollama-compat naming is the last open feedback. If you open your sequential mode as a separate PR (either immediately after #511 merges, or stacked against this branch if the maintainers prefer), I think it has a much stronger standalone case:

  • Demonstrably fixes #275 ("Low-VRAM GPUs: evict idle models before loading reranker") at the serve level on realistic VRAM budgets
  • Your 22K-file numbers show real-world coexistence with a 20 GB Ollama
  • The disposal API change (disposeGenerateModel / disposeRerankModel + load-promise reset) lives on its own so reviewers can focus on concurrency and lifecycle questions there rather than reviewing two features at once

Two small things worth considering while you're writing it up:

  1. Concurrency safety of the disposal API. qmd serve can handle overlapping requests — if a new query arrives mid-disposal on the rerank model, what's the expected behaviour? I'd guess the second request waits on a re-load, but a quick note or test in the follow-up would head off reviewer questions.
  2. Flag naming. --sequential describes the mechanism; something like --low-vram or --shared-gpu might read better to users who don't know the pipeline structure. Minor — your call.

Either way, ping me on the follow-up PR when it's up and I'll review + post RK3588 numbers.

jhsmith409 pushed a commit to jhsmith409/qmd that referenced this pull request Apr 12, 2026