
feat(llm): add remote embedding/reranking via OpenAI-compatible endpoints#575

Open
alexei-led wants to merge 2 commits into tobi:main from alexei-led:pr/remote-llm-slim

Conversation

@alexei-led
Contributor

Problem

QMD is designed as an on-device tool, but its value — BM25 + vector search + query expansion + LLM reranking in a single pipeline — makes it equally useful in cloud and multi-agent environments. The current hard dependency on local GGUF models via node-llama-cpp creates two problems in those settings:

  1. Performance. Running 300M–0.6B GGUF embedding models on CPU is slow, and cloud environments often have no GPU at all.
  2. Concurrency. node-llama-cpp serialises all inference calls through a single context. Running multiple agents or workers against the same index causes queuing and timeouts — each waits for the previous embed/rerank call to finish before starting.

Both problems go away when the embedding and reranking workload moves to a dedicated remote server (vLLM, TEI, Ollama, LiteLLM, or any OpenAI-compatible endpoint).

This closes #521 and addresses the use case in #229.


What this adds

Two new classes, plus the plumbing to activate them via environment variables.

RemoteLLM

A drop-in LLM implementation that calls an OpenAI-compatible HTTP server:

| Capability | Endpoint |
|---|---|
| Embedding | `POST /v1/embeddings` |
| Reranking | `POST /v1/rerank` (cross-encoder format) |
| Query expansion / generation | `POST /v1/chat/completions` |

Features:

  • Independent circuit breakers per endpoint (embed, rerank, chat) — an embed outage does not affect reranking or chat
  • Configurable connect/read timeouts and breaker thresholds
  • Bearer auth via QMD_REMOTE_API_KEY
  • qmd status shows the remote server URLs instead of local model paths
  • RemoteLLM.modelExists() logs a warning before returning the optimistic fail-open result when the server can't be reached, so operators know the check was skipped
  • Character-based token approximation (~4 chars/token) keeps chunking and overflow protection working without a local tokenizer
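The ~4 chars/token approximation could look roughly like this (an illustrative sketch with hypothetical helper names, not the PR's actual code in `src/remote-llm.ts`):

```typescript
// Sketch of a character-based token approximation (~4 chars/token).
// Helper names are hypothetical; the real code lives in src/remote-llm.ts.
const CHARS_PER_TOKEN = 4;

function approxCountTokens(text: string): number {
  // Round up so a non-empty string never reports zero tokens.
  return text.length === 0 ? 0 : Math.ceil(text.length / CHARS_PER_TOKEN);
}

function approxTokenize(text: string): string[] {
  // Pseudo-"tokens": fixed-width character slices. Good enough for length
  // budgeting in chunking/overflow checks, useless for real decoding.
  const tokens: string[] = [];
  for (let i = 0; i < text.length; i += CHARS_PER_TOKEN) {
    tokens.push(text.slice(i, i + CHARS_PER_TOKEN));
  }
  return tokens;
}
```

In a scheme like this, detokenize is just `tokens.join("")`, so chunk boundaries stay reversible even without a local tokenizer.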

HybridLLM

A thin composite that delegates all operations to RemoteLLM. Designed for future extension (e.g. local generate + remote embed), but in the current implementation runs fully remote when the local backend is null.
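The delegation pattern can be sketched roughly as follows (interface and names heavily simplified for illustration; the real `LLM` interface in `src/llm.ts` is much larger):

```typescript
// Simplified composite sketch: route everything to the remote backend when no
// local backend is configured. Interfaces are illustrative, not src/llm.ts.
interface Backend {
  embed(texts: string[]): Promise<number[][]>;
}

class HybridBackend implements Backend {
  constructor(
    private readonly local: Backend | null, // null => fully remote operation
    private readonly remote: Backend,
  ) {}

  embed(texts: string[]): Promise<number[][]> {
    // Future extension point: e.g. local generate + remote embed.
    return (this.local ?? this.remote).embed(texts);
  }
}
```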

Changes to existing code

src/llm.ts

  • LLM interface: adds embedBatch(), tokenize(), countTokens(), detokenize(), isRemote?, embedModelName?
  • LLMSessionManager and withLLMSessionForLlm accept LLM instead of LlamaCpp only
  • New exports: getDefaultLLM() / setDefaultLLM() — sit alongside getDefaultLlamaCpp() / setDefaultLlamaCpp() for backward compatibility
  • LlamaCpp.tokenize / detokenize use unknown[] to satisfy the interface while still casting internally to LlamaToken[]

src/store.ts (~45 lines changed)

  • getLlm() returns LLM and falls back to getDefaultLLM()
  • generateEmbeddings: uses formatDoc() — skips the Qwen3 task-prefix for remote backends, which don't need it
  • chunkDocumentByTokens: accepts an optional llm? parameter so internal callers pass the store-scoped LLM rather than always pulling from the global singleton

src/cli/qmd.ts (~52 lines changed)

  • Reads QMD_REMOTE_EMBED_URL / QMD_REMOTE_RERANK_URL / QMD_REMOTE_GEN_URL at startup and builds a HybridLLM if set
  • Guards the YAML models: override in getStore() — skips setDefaultLlamaCpp when remote mode is already active, so a models: block in the config cannot silently revert to local

src/index.ts (~46 lines changed)

  • StoreOptions.llm? — inject a backend directly (useful for testing or custom integrations)
  • createStore() checks QMD_REMOTE_* env vars automatically when no llm option is provided, so SDK users get the same remote mode as the CLI

Usage

```sh
# Minimal: TEI for embeddings and reranking
export QMD_REMOTE_EMBED_URL=http://localhost:8080
export QMD_REMOTE_RERANK_URL=http://localhost:8081
qmd embed && qmd query "how does auth work?"

# Separate chat server for query expansion
export QMD_REMOTE_GEN_URL=http://localhost:8082

# Authentication (sent as Bearer token to all endpoints)
export QMD_REMOTE_API_KEY=sk-...
```

All QMD_REMOTE_* variables are optional — when unset, qmd falls back to the existing local GGUF behaviour with no behavioural change.

| Variable | Default |
|---|---|
| QMD_REMOTE_EMBED_URL | — (required to enable remote mode) |
| QMD_REMOTE_RERANK_URL | — (required to enable remote mode) |
| QMD_REMOTE_GEN_URL | QMD_REMOTE_EMBED_URL |
| QMD_REMOTE_API_KEY | — |
| QMD_REMOTE_EMBED_MODEL | remote-embedding |
| QMD_REMOTE_RERANK_MODEL | remote-reranker |
| QMD_REMOTE_GEN_MODEL | gpt-4o-mini |
| QMD_REMOTE_CONNECT_TIMEOUT | 5000 ms |
| QMD_REMOTE_READ_TIMEOUT | 30000 ms |
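The defaulting rules in the table above amount to something like this (a hedged sketch; field names are made up for illustration, and the real parsing lives in src/cli/qmd.ts and src/index.ts):

```typescript
// Illustrative resolution of the QMD_REMOTE_* variables. Only the env var
// names and default values come from the PR; the rest is a sketch.
type Env = Record<string, string | undefined>;

function remoteConfigFromEnv(env: Env) {
  const embedUrl = env.QMD_REMOTE_EMBED_URL;
  const rerankUrl = env.QMD_REMOTE_RERANK_URL;
  if (!embedUrl && !rerankUrl) return null; // remote mode off: local GGUF path
  return {
    embedUrl,
    rerankUrl,
    genUrl: env.QMD_REMOTE_GEN_URL ?? embedUrl, // chat defaults to embed server
    apiKey: env.QMD_REMOTE_API_KEY,             // optional Bearer token
    embedModel: env.QMD_REMOTE_EMBED_MODEL ?? "remote-embedding",
    rerankModel: env.QMD_REMOTE_RERANK_MODEL ?? "remote-reranker",
    genModel: env.QMD_REMOTE_GEN_MODEL ?? "gpt-4o-mini",
    connectTimeoutMs: Number(env.QMD_REMOTE_CONNECT_TIMEOUT ?? 5000),
    readTimeoutMs: Number(env.QMD_REMOTE_READ_TIMEOUT ?? 30000),
  };
}
```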

Testing

test/remote-llm.test.ts — 19 tests, all in-process with real HTTP servers (no mocks):

  • Embed and rerank routing to separate URLs
  • Qwen3 instruct prefix added for query embeddings; legacy title: | text: prefixes stripped on receipt
  • Embedding dimension lock and error on mismatch
  • Circuit breaker opens after N failures; embed, rerank, and chat circuits are independent
  • Half-open state retries after cooldown
  • Bearer auth header
  • Connect and read timeouts
  • generate calls chat completions and returns text
  • expandQuery parses typed lines; falls back to lex+vec+hyde on error
  • tokenize / countTokens / detokenize character-approximation
  • HybridLLM routes all operations to remote without as any casts
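The breaker behaviour those tests exercise (open after N failures, half-open retry after cooldown, one independent breaker per endpoint) can be sketched as:

```typescript
// Minimal circuit-breaker sketch matching the tested behaviour: opens after
// `threshold` consecutive failures, allows one probe after `cooldownMs`
// (half-open). Hypothetical code, not the PR's implementation.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold: number,
    private readonly cooldownMs: number,
  ) {}

  canRequest(now: number): boolean {
    if (this.openedAt === null) return true; // closed
    return now - this.openedAt >= this.cooldownMs; // half-open probe allowed
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // a successful probe closes the breaker again
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}

// One breaker per endpoint, so an embed outage cannot trip rerank or chat:
const breakers = {
  embed: new CircuitBreaker(3, 10_000),
  rerank: new CircuitBreaker(3, 10_000),
  chat: new CircuitBreaker(3, 10_000),
};
```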

Backward compatibility

No breaking changes. All new behaviour is opt-in via environment variables. Existing local-only setups are unaffected.

Copilot AI review requested due to automatic review settings April 15, 2026 10:24

Copilot AI left a comment


Pull request overview

Adds an opt-in remote LLM backend so QMD can run embeddings, reranking, and generation against OpenAI-compatible HTTP endpoints (improving performance and concurrency in cloud/multi-agent setups), while keeping local GGUF behavior as the default.

Changes:

  • Introduces RemoteLLM (OpenAI-compatible /v1/embeddings, /v1/rerank, /v1/chat/completions) plus HybridLLM wiring.
  • Plumbs remote-mode activation via QMD_REMOTE_* env vars through CLI and SDK (createStore()), and updates store logic to use the generalized LLM interface.
  • Adds in-process HTTP tests for Remote/Hybrid behavior and updates dist artifacts + changelog.

Reviewed changes

Copilot reviewed 8 out of 41 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/remote-llm.ts New remote OpenAI-compatible implementation with timeouts + circuit breakers.
src/hybrid-llm.ts New composite LLM wrapper to route operations to remote (future hybridization).
src/store.ts Switches store to LLM, adds remote-specific formatting paths, and passes store LLM into chunking.
src/cli/qmd.ts CLI env-var activation of remote mode + status output adjustments.
src/index.ts SDK createStore() supports llm injection and auto-remote via env vars.
test/remote-llm.test.ts Comprehensive in-process tests for routing, breakers, timeouts, auth, token approximation.
CHANGELOG.md Documents new remote mode and related interface changes.
.gitignore Stops ignoring dist/ to allow GitHub install without a build step.
dist/* Updated build outputs to match source changes.


Comment thread src/store.ts Outdated
…ints

Adds RemoteLLM and HybridLLM backends so qmd can embed and rerank using
any OpenAI-compatible server (vLLM, TEI, LiteLLM, Ollama, etc.) without
loading local GGUF models.

## RemoteLLM

A new LLM implementation backed by HTTP calls to OpenAI-compatible APIs.

- Embedding: POST /v1/embeddings
- Reranking: POST /v1/rerank (cross-encoder format)
- Chat completions: POST /v1/chat/completions for query expansion and
  generation. Defaults to embedUrl; set genUrl/QMD_REMOTE_GEN_URL to
  route to a separate chat server (e.g. TEI for embed + vLLM for chat)
- Independent circuit breakers per endpoint (embed, rerank, chat)
- Configurable timeouts and breaker thresholds via config or env vars
- Bearer auth via QMD_REMOTE_API_KEY
- Character-based token approximation (~4 chars/token) for chunking
- qmd status shows remote server URLs instead of local model paths

## HybridLLM

Composite backend delegating embed/rerank/generate/expandQuery to RemoteLLM.
Accepts a null local LlamaCpp for fully remote operation.

## Interface changes (LLM)

- embedBatch(), tokenize(), countTokens(), detokenize() added to LLM interface
- isRemote?: boolean and embedModelName?: string added to LLM interface
- LLMSessionManager, withLLMSessionForLlm accept LLM (not LlamaCpp only)
- Store.llm typed as LLM; expandQuery/rerank accept LLM override
- getDefaultLLM() and setDefaultLLM() exported alongside existing
  getDefaultLlamaCpp() / setDefaultLlamaCpp()

## Store / CLI / SDK

- generateEmbeddings uses formatDoc() — skips task-prefixes for remote backends
- chunkDocumentByTokens accepts optional llm? parameter
- QMD_REMOTE_EMBED_URL + QMD_REMOTE_RERANK_URL activate remote mode at startup
- QMD_REMOTE_GEN_URL routes chat completions to a separate endpoint
- YAML models: block is skipped when remote mode is already active
- createStore() respects the same env vars; accepts llm? option for injection

## Build fix

package.json build script: replaced printf (subject to histexpand escaping \!)
with echo in braces, and changed && to ; so the shebang step runs even when
tsc exits non-zero due to pre-existing upstream type errors.

## Environment variables

| Variable                   | Description                              | Default              |
|----------------------------|------------------------------------------|----------------------|
| QMD_REMOTE_EMBED_URL       | Embedding server base URL (required)     | —                    |
| QMD_REMOTE_RERANK_URL      | Reranking server base URL (required)     | —                    |
| QMD_REMOTE_GEN_URL         | Chat completions server URL              | QMD_REMOTE_EMBED_URL |
| QMD_REMOTE_API_KEY         | Bearer token                             | —                    |
| QMD_REMOTE_EMBED_MODEL     | Model name in embed requests             | remote-embedding     |
| QMD_REMOTE_RERANK_MODEL    | Model name in rerank requests            | remote-reranker      |
| QMD_REMOTE_GEN_MODEL       | Model name in chat requests              | gpt-4o-mini          |
| QMD_REMOTE_CONNECT_TIMEOUT | Connect timeout in ms                    | 5000                 |
| QMD_REMOTE_READ_TIMEOUT    | Read timeout in ms                       | 30000                |
Contributor Author

@alexei-led alexei-led left a comment


Fixed. Changed getDefaultLlamaCpp() to getDefaultLLM() in chunkDocumentByTokens — this ensures the remote backend is used for tokenization when no explicit override is provided, rather than falling back to load local GGUF models.

Comment thread src/store.ts
@alexei-led force-pushed the pr/remote-llm-slim branch 2 times, most recently from 026dc18 to e6972eb on April 15, 2026 at 12:37
- vi.unstubAllGlobals/vi.stubGlobal not available in Bun test runner:
  guard with optional chaining, replace vi.stubGlobal with direct
  globalThis.fetch assignment and manual restore

- llm.tokenize missing from fake LLM mocks in embedding batching tests:
  add tokenize() to createFakeEmbedLlm in store.test.ts and sdk.test.ts

- internal.llm?.dispose not a function when mock LLM lacks dispose:
  use internal.llm?.dispose?.() in index.ts

- chunkDocumentByTokens falls back to getDefaultLlamaCpp() bypassing
  remote backend: use getDefaultLLM() (Copilot review suggestion)
@jonesj38

see #116

@lukeboyett

FYI — we've been running a much narrower variant of this in prod (lukeboyett/qmd:feat/openai-embed-backend): OpenAI embeddings only, activated by OPENAI_API_KEY alone, no new files, no rerank/expansion changes. ~300 net lines. It's a strict subset of what this PR covers, so no reason to open it upstream if #575 or #517 lands. Happy to share it as prior art if useful for comparing approaches.

@alexei-led
Contributor Author

Whatever approach works is fine for me. I just need to be able to use a remote embedding engine.



Development

Successfully merging this pull request may close these issues.

Feature Request: Support external API embedding providers (OpenAI/Ollama compatible)

4 participants