
feat(llm): add remote embedding/reranking via OpenAI-compatible endpoints#575

Open
alexei-led wants to merge 2 commits into tobi:main from alexei-led:pr/remote-llm-slim

Conversation

@alexei-led
Contributor

Problem

QMD is designed as an on-device tool, but its value — BM25 + vector search + query expansion + LLM reranking in a single pipeline — makes it equally useful in cloud and multi-agent environments. The current hard dependency on local GGUF models via node-llama-cpp creates two problems in those settings:

  1. Performance. Running 300M–0.6B GGUF embedding models on CPU is slow, and cloud environments often have no GPU at all.
  2. Concurrency. node-llama-cpp serialises all inference calls through a single context. Running multiple agents or workers against the same index causes queuing and timeouts — each waits for the previous embed/rerank call to finish before starting.

Both problems go away when the embedding and reranking workload moves to a dedicated remote server (vLLM, TEI, Ollama, LiteLLM, or any OpenAI-compatible endpoint).

This closes #521 and addresses the use case in #229.


What this adds

Two new classes, plus the plumbing to activate them via environment variables.

RemoteLLM

A drop-in LLM implementation that calls an OpenAI-compatible HTTP server:

| Capability | Endpoint |
|---|---|
| Embedding | `POST /v1/embeddings` |
| Reranking | `POST /v1/rerank` (cross-encoder format) |
| Query expansion / generation | `POST /v1/chat/completions` |

Features:

  • Independent circuit breakers per endpoint (embed, rerank, chat) — an embed outage does not affect reranking or chat
  • Configurable connect/read timeouts and breaker thresholds
  • Bearer auth via QMD_REMOTE_API_KEY
  • qmd status shows the remote server URLs instead of local model paths
  • RemoteLLM.modelExists() logs a warning before returning the optimistic fail-open result when the server can't be reached, so operators know the check was skipped
  • Character-based token approximation (~4 chars/token) keeps chunking and overflow protection working without a local tokenizer
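The ~4 chars/token approximation could look roughly like this (an illustrative sketch with hypothetical helper names, not the PR's actual code in `src/remote-llm.ts`):

```typescript
// Sketch of a character-based token approximation (~4 chars/token).
// Helper names are hypothetical; the real code lives in src/remote-llm.ts.
const CHARS_PER_TOKEN = 4;

function approxCountTokens(text: string): number {
  // Round up so a non-empty string never reports zero tokens.
  return text.length === 0 ? 0 : Math.ceil(text.length / CHARS_PER_TOKEN);
}

function approxTokenize(text: string): string[] {
  // Pseudo-"tokens": fixed-width character slices. Good enough for length
  // budgeting in chunking/overflow checks, useless for real decoding.
  const tokens: string[] = [];
  for (let i = 0; i < text.length; i += CHARS_PER_TOKEN) {
    tokens.push(text.slice(i, i + CHARS_PER_TOKEN));
  }
  return tokens;
}
```

In a scheme like this, detokenize is just `tokens.join("")`, so chunk boundaries stay reversible even without a local tokenizer.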

HybridLLM

A thin composite that delegates all operations to RemoteLLM. Designed for future extension (e.g. local generate + remote embed), but in the current implementation runs fully remote when the local backend is null.
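The delegation pattern can be sketched roughly as follows (interface and names heavily simplified for illustration; the real `LLM` interface in `src/llm.ts` is much larger):

```typescript
// Simplified composite sketch: route everything to the remote backend when no
// local backend is configured. Interfaces are illustrative, not src/llm.ts.
interface Backend {
  embed(texts: string[]): Promise<number[][]>;
}

class HybridBackend implements Backend {
  constructor(
    private readonly local: Backend | null, // null => fully remote operation
    private readonly remote: Backend,
  ) {}

  embed(texts: string[]): Promise<number[][]> {
    // Future extension point: e.g. local generate + remote embed.
    return (this.local ?? this.remote).embed(texts);
  }
}
```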

Changes to existing code

src/llm.ts

  • LLM interface: adds embedBatch(), tokenize(), countTokens(), detokenize(), isRemote?, embedModelName?
  • LLMSessionManager and withLLMSessionForLlm accept LLM instead of LlamaCpp only
  • New exports: getDefaultLLM() / setDefaultLLM() — sit alongside getDefaultLlamaCpp() / setDefaultLlamaCpp() for backward compatibility
  • LlamaCpp.tokenize / detokenize use unknown[] to satisfy the interface while still casting internally to LlamaToken[]

src/store.ts (~45 lines changed)

  • getLlm() returns LLM and falls back to getDefaultLLM()
  • generateEmbeddings: uses formatDoc() — skips the Qwen3 task-prefix for remote backends, which don't need it
  • chunkDocumentByTokens: accepts an optional llm? parameter so internal callers pass the store-scoped LLM rather than always pulling from the global singleton

src/cli/qmd.ts (~52 lines changed)

  • Reads QMD_REMOTE_EMBED_URL / QMD_REMOTE_RERANK_URL / QMD_REMOTE_GEN_URL at startup and builds a HybridLLM if set
  • Guards the YAML models: override in getStore() — skips setDefaultLlamaCpp when remote mode is already active, so a models: block in the config cannot silently revert to local

src/index.ts (~46 lines changed)

  • StoreOptions.llm? — inject a backend directly (useful for testing or custom integrations)
  • createStore() checks QMD_REMOTE_* env vars automatically when no llm option is provided, so SDK users get the same remote mode as the CLI

Usage

```sh
# Minimal: TEI for embeddings and reranking
export QMD_REMOTE_EMBED_URL=http://localhost:8080
export QMD_REMOTE_RERANK_URL=http://localhost:8081
qmd embed && qmd query "how does auth work?"

# Separate chat server for query expansion
export QMD_REMOTE_GEN_URL=http://localhost:8082

# Authentication (sent as Bearer token to all endpoints)
export QMD_REMOTE_API_KEY=sk-...
```

All QMD_REMOTE_* variables are optional — when unset, qmd falls back to the existing local GGUF behaviour with no behavioural change.

| Variable | Default |
|---|---|
| QMD_REMOTE_EMBED_URL | — (required to enable remote mode) |
| QMD_REMOTE_RERANK_URL | — (required to enable remote mode) |
| QMD_REMOTE_GEN_URL | QMD_REMOTE_EMBED_URL |
| QMD_REMOTE_API_KEY | — |
| QMD_REMOTE_EMBED_MODEL | remote-embedding |
| QMD_REMOTE_RERANK_MODEL | remote-reranker |
| QMD_REMOTE_GEN_MODEL | gpt-4o-mini |
| QMD_REMOTE_CONNECT_TIMEOUT | 5000 ms |
| QMD_REMOTE_READ_TIMEOUT | 30000 ms |
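The defaulting rules in the table above amount to something like this (a hedged sketch; field names are made up for illustration, and the real parsing lives in src/cli/qmd.ts and src/index.ts):

```typescript
// Illustrative resolution of the QMD_REMOTE_* variables. Only the env var
// names and default values come from the PR; the rest is a sketch.
type Env = Record<string, string | undefined>;

function remoteConfigFromEnv(env: Env) {
  const embedUrl = env.QMD_REMOTE_EMBED_URL;
  const rerankUrl = env.QMD_REMOTE_RERANK_URL;
  if (!embedUrl && !rerankUrl) return null; // remote mode off: local GGUF path
  return {
    embedUrl,
    rerankUrl,
    genUrl: env.QMD_REMOTE_GEN_URL ?? embedUrl, // chat defaults to embed server
    apiKey: env.QMD_REMOTE_API_KEY,             // optional Bearer token
    embedModel: env.QMD_REMOTE_EMBED_MODEL ?? "remote-embedding",
    rerankModel: env.QMD_REMOTE_RERANK_MODEL ?? "remote-reranker",
    genModel: env.QMD_REMOTE_GEN_MODEL ?? "gpt-4o-mini",
    connectTimeoutMs: Number(env.QMD_REMOTE_CONNECT_TIMEOUT ?? 5000),
    readTimeoutMs: Number(env.QMD_REMOTE_READ_TIMEOUT ?? 30000),
  };
}
```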

Testing

test/remote-llm.test.ts — 19 tests, all in-process with real HTTP servers (no mocks):

  • Embed and rerank routing to separate URLs
  • Qwen3 instruct prefix added for query embeddings; legacy title: | text: prefixes stripped on receipt
  • Embedding dimension lock and error on mismatch
  • Circuit breaker opens after N failures; embed, rerank, and chat circuits are independent
  • Half-open state retries after cooldown
  • Bearer auth header
  • Connect and read timeouts
  • generate calls chat completions and returns text
  • expandQuery parses typed lines; falls back to lex+vec+hyde on error
  • tokenize / countTokens / detokenize character-approximation
  • HybridLLM routes all operations to remote without as any casts
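The breaker behaviour those tests exercise (open after N failures, half-open retry after cooldown, one independent breaker per endpoint) can be sketched as:

```typescript
// Minimal circuit-breaker sketch matching the tested behaviour: opens after
// `threshold` consecutive failures, allows one probe after `cooldownMs`
// (half-open). Hypothetical code, not the PR's implementation.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold: number,
    private readonly cooldownMs: number,
  ) {}

  canRequest(now: number): boolean {
    if (this.openedAt === null) return true; // closed
    return now - this.openedAt >= this.cooldownMs; // half-open probe allowed
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // a successful probe closes the breaker again
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}

// One breaker per endpoint, so an embed outage cannot trip rerank or chat:
const breakers = {
  embed: new CircuitBreaker(3, 10_000),
  rerank: new CircuitBreaker(3, 10_000),
  chat: new CircuitBreaker(3, 10_000),
};
```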

Backward compatibility

No breaking changes. All new behaviour is opt-in via environment variables. Existing local-only setups are unaffected.

Copilot AI review requested due to automatic review settings April 15, 2026 10:24

Copilot AI left a comment


Pull request overview

Adds an opt-in remote LLM backend so QMD can run embeddings, reranking, and generation against OpenAI-compatible HTTP endpoints (improving performance and concurrency in cloud/multi-agent setups), while keeping local GGUF behavior as the default.

Changes:

  • Introduces RemoteLLM (OpenAI-compatible /v1/embeddings, /v1/rerank, /v1/chat/completions) plus HybridLLM wiring.
  • Plumbs remote-mode activation via QMD_REMOTE_* env vars through CLI and SDK (createStore()), and updates store logic to use the generalized LLM interface.
  • Adds in-process HTTP tests for Remote/Hybrid behavior and updates dist artifacts + changelog.

Reviewed changes

Copilot reviewed 8 out of 41 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/remote-llm.ts New remote OpenAI-compatible implementation with timeouts + circuit breakers.
src/hybrid-llm.ts New composite LLM wrapper to route operations to remote (future hybridization).
src/store.ts Switches store to LLM, adds remote-specific formatting paths, and passes store LLM into chunking.
src/cli/qmd.ts CLI env-var activation of remote mode + status output adjustments.
src/index.ts SDK createStore() supports llm injection and auto-remote via env vars.
test/remote-llm.test.ts Comprehensive in-process tests for routing, breakers, timeouts, auth, token approximation.
CHANGELOG.md Documents new remote mode and related interface changes.
.gitignore Stops ignoring dist/ to allow GitHub install without a build step.
dist/* Updated build outputs to match source changes.


Comment thread src/store.ts Outdated
…ints

Adds RemoteLLM and HybridLLM backends so qmd can embed and rerank using
any OpenAI-compatible server (vLLM, TEI, LiteLLM, Ollama, etc.) without
loading local GGUF models.

## RemoteLLM

A new LLM implementation backed by HTTP calls to OpenAI-compatible APIs.

- Embedding: POST /v1/embeddings
- Reranking: POST /v1/rerank (cross-encoder format)
- Chat completions: POST /v1/chat/completions for query expansion and
  generation. Defaults to embedUrl; set genUrl/QMD_REMOTE_GEN_URL to
  route to a separate chat server (e.g. TEI for embed + vLLM for chat)
- Independent circuit breakers per endpoint (embed, rerank, chat)
- Configurable timeouts and breaker thresholds via config or env vars
- Bearer auth via QMD_REMOTE_API_KEY
- Character-based token approximation (~4 chars/token) for chunking
- qmd status shows remote server URLs instead of local model paths

## HybridLLM

Composite backend delegating embed/rerank/generate/expandQuery to RemoteLLM.
Accepts a null local LlamaCpp for fully remote operation.

## Interface changes (LLM)

- embedBatch(), tokenize(), countTokens(), detokenize() added to LLM interface
- isRemote?: boolean and embedModelName?: string added to LLM interface
- LLMSessionManager, withLLMSessionForLlm accept LLM (not LlamaCpp only)
- Store.llm typed as LLM; expandQuery/rerank accept LLM override
- getDefaultLLM() and setDefaultLLM() exported alongside existing
  getDefaultLlamaCpp() / setDefaultLlamaCpp()

## Store / CLI / SDK

- generateEmbeddings uses formatDoc() — skips task-prefixes for remote backends
- chunkDocumentByTokens accepts optional llm? parameter
- QMD_REMOTE_EMBED_URL + QMD_REMOTE_RERANK_URL activate remote mode at startup
- QMD_REMOTE_GEN_URL routes chat completions to a separate endpoint
- YAML models: block is skipped when remote mode is already active
- createStore() respects the same env vars; accepts llm? option for injection

## Build fix

package.json build script: replaced printf (subject to histexpand escaping \!)
with echo in braces, and changed && to ; so the shebang step runs even when
tsc exits non-zero due to pre-existing upstream type errors.

## Environment variables

| Variable                   | Description                              | Default              |
|----------------------------|------------------------------------------|----------------------|
| QMD_REMOTE_EMBED_URL       | Embedding server base URL (required)     | —                    |
| QMD_REMOTE_RERANK_URL      | Reranking server base URL (required)     | —                    |
| QMD_REMOTE_GEN_URL         | Chat completions server URL              | QMD_REMOTE_EMBED_URL |
| QMD_REMOTE_API_KEY         | Bearer token                             | —                    |
| QMD_REMOTE_EMBED_MODEL     | Model name in embed requests             | remote-embedding     |
| QMD_REMOTE_RERANK_MODEL    | Model name in rerank requests            | remote-reranker      |
| QMD_REMOTE_GEN_MODEL       | Model name in chat requests              | gpt-4o-mini          |
| QMD_REMOTE_CONNECT_TIMEOUT | Connect timeout in ms                    | 5000                 |
| QMD_REMOTE_READ_TIMEOUT    | Read timeout in ms                       | 30000                |
Contributor Author

@alexei-led alexei-led left a comment


Fixed. Changed getDefaultLlamaCpp() to getDefaultLLM() in chunkDocumentByTokens — this ensures the remote backend is used for tokenization when no explicit override is provided, rather than falling back to load local GGUF models.

Comment thread src/store.ts
@alexei-led force-pushed the pr/remote-llm-slim branch 2 times, most recently from 026dc18 to e6972eb on April 15, 2026 at 12:37
- vi.unstubAllGlobals/vi.stubGlobal not available in Bun test runner:
  guard with optional chaining, replace vi.stubGlobal with direct
  globalThis.fetch assignment and manual restore

- llm.tokenize missing from fake LLM mocks in embedding batching tests:
  add tokenize() to createFakeEmbedLlm in store.test.ts and sdk.test.ts

- internal.llm?.dispose not a function when mock LLM lacks dispose:
  use internal.llm?.dispose?.() in index.ts

- chunkDocumentByTokens falls back to getDefaultLlamaCpp() bypassing
  remote backend: use getDefaultLLM() (Copilot review suggestion)
@jonesj38

see #116

@lukeboyett

FYI — we've been running a much narrower variant of this in prod (lukeboyett/qmd:feat/openai-embed-backend): OpenAI embeddings only, activated by OPENAI_API_KEY alone, no new files, no rerank/expansion changes. ~300 net lines. It's a strict subset of what this PR covers, so no reason to open it upstream if #575 or #517 lands. Happy to share it as prior art if useful for comparing approaches.

@alexei-led
Contributor Author

Whatever approach works is fine for me. I just need to be able to use a remote embedding engine.



Development

Successfully merging this pull request may close these issues.

Feature Request: Support external API embedding providers (OpenAI/Ollama compatible)

4 participants