# feat(llm): add remote embedding/reranking via OpenAI-compatible endpoints #575

alexei-led wants to merge 2 commits into tobi:main
Conversation
Pull request overview
Adds an opt-in remote LLM backend so QMD can run embeddings, reranking, and generation against OpenAI-compatible HTTP endpoints (improving performance and concurrency in cloud/multi-agent setups), while keeping local GGUF behavior as the default.
Changes:
- Introduces `RemoteLLM` (OpenAI-compatible `/v1/embeddings`, `/v1/rerank`, `/v1/chat/completions`) plus `HybridLLM` wiring.
- Plumbs remote-mode activation via `QMD_REMOTE_*` env vars through the CLI and SDK (`createStore()`), and updates store logic to use the generalized `LLM` interface.
- Adds in-process HTTP tests for Remote/Hybrid behavior and updates dist artifacts + changelog.
Reviewed changes
Copilot reviewed 8 out of 41 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| `src/remote-llm.ts` | New remote OpenAI-compatible implementation with timeouts + circuit breakers. |
| `src/hybrid-llm.ts` | New composite LLM wrapper to route operations to remote (future hybridization). |
| `src/store.ts` | Switches the store to `LLM`, adds remote-specific formatting paths, and passes the store LLM into chunking. |
| `src/cli/qmd.ts` | CLI env-var activation of remote mode + status output adjustments. |
| `src/index.ts` | SDK `createStore()` supports `llm` injection and auto-remote via env vars. |
| `test/remote-llm.test.ts` | Comprehensive in-process tests for routing, breakers, timeouts, auth, token approximation. |
| `CHANGELOG.md` | Documents the new remote mode and related interface changes. |
| `.gitignore` | Stops ignoring `dist/` to allow GitHub install without a build step. |
| `dist/*` | Updated build outputs to match source changes. |
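The delegation described for `src/hybrid-llm.ts` can be sketched roughly as below. This is an illustrative reduction, not the actual qmd code: the interface is trimmed to two methods, and `MiniLLM`/`MiniHybridLLM` are hypothetical names standing in for the PR's `LLM` and `HybridLLM`.

```typescript
// Simplified sketch of a composite backend that forwards all
// operations to the remote backend; `local` may be null for
// fully remote operation, matching the PR description.
interface MiniLLM {
  embed(text: string): Promise<number[]>;
  rerank(query: string, docs: string[]): Promise<number[]>;
}

class MiniHybridLLM implements MiniLLM {
  constructor(
    private remote: MiniLLM,
    private local: MiniLLM | null = null, // reserved for future hybridization
  ) {}
  embed(text: string): Promise<number[]> {
    return this.remote.embed(text); // always routed remote for now
  }
  rerank(query: string, docs: string[]): Promise<number[]> {
    return this.remote.rerank(query, docs);
  }
}
```

Keeping the composite this thin is what makes the "future hybridization" note plausible: routing decisions per operation can later be added without changing callers.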
…ints

Adds `RemoteLLM` and `HybridLLM` backends so qmd can embed and rerank using any OpenAI-compatible server (vLLM, TEI, LiteLLM, Ollama, etc.) without loading local GGUF models.

## RemoteLLM

A new LLM implementation backed by HTTP calls to OpenAI-compatible APIs.

- Embedding: `POST /v1/embeddings`
- Reranking: `POST /v1/rerank` (cross-encoder format)
- Chat completions: `POST /v1/chat/completions` for query expansion and generation. Defaults to `embedUrl`; set `genUrl`/`QMD_REMOTE_GEN_URL` to route to a separate chat server (e.g. TEI for embed + vLLM for chat)
- Independent circuit breakers per endpoint (embed, rerank, chat)
- Configurable timeouts and breaker thresholds via config or env vars
- Bearer auth via `QMD_REMOTE_API_KEY`
- Character-based token approximation (~4 chars/token) for chunking
- `qmd status` shows remote server URLs instead of local model paths

## HybridLLM

Composite backend delegating embed/rerank/generate/expandQuery to `RemoteLLM`. Accepts a null local `LlamaCpp` for fully remote operation.

## Interface changes (LLM)

- `embedBatch()`, `tokenize()`, `countTokens()`, `detokenize()` added to the `LLM` interface
- `isRemote?: boolean` and `embedModelName?: string` added to the `LLM` interface
- `LLMSessionManager`, `withLLMSessionForLlm` accept `LLM` (not `LlamaCpp` only)
- `Store.llm` typed as `LLM`; `expandQuery`/`rerank` accept an `LLM` override
- `getDefaultLLM()` and `setDefaultLLM()` exported alongside the existing `getDefaultLlamaCpp()` / `setDefaultLlamaCpp()`

## Store / CLI / SDK

- `generateEmbeddings` uses `formatDoc()` — skips task-prefixes for remote backends
- `chunkDocumentByTokens` accepts an optional `llm?` parameter
- `QMD_REMOTE_EMBED_URL` + `QMD_REMOTE_RERANK_URL` activate remote mode at startup
- `QMD_REMOTE_GEN_URL` routes chat completions to a separate endpoint
- The YAML `models:` block is skipped when remote mode is already active
- `createStore()` respects the same env vars; accepts an `llm?` option for injection

## Build fix

package.json build script: replaced `printf` (subject to histexpand escaping `\!`) with `echo` in braces, and changed `&&` to `;` so the shebang step runs even when `tsc` exits non-zero due to pre-existing upstream type errors.

## Environment variables

| Variable | Description | Default |
|---|---|---|
| `QMD_REMOTE_EMBED_URL` | Embedding server base URL (required) | — |
| `QMD_REMOTE_RERANK_URL` | Reranking server base URL (required) | — |
| `QMD_REMOTE_GEN_URL` | Chat completions server URL | `QMD_REMOTE_EMBED_URL` |
| `QMD_REMOTE_API_KEY` | Bearer token | — |
| `QMD_REMOTE_EMBED_MODEL` | Model name in embed requests | `remote-embedding` |
| `QMD_REMOTE_RERANK_MODEL` | Model name in rerank requests | `remote-reranker` |
| `QMD_REMOTE_GEN_MODEL` | Model name in chat requests | `gpt-4o-mini` |
| `QMD_REMOTE_CONNECT_TIMEOUT` | Connect timeout in ms | 5000 |
| `QMD_REMOTE_READ_TIMEOUT` | Read timeout in ms | 30000 |
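The character-based token approximation mentioned above (~4 chars/token) is straightforward to picture. The following is an illustrative sketch, not qmd's actual implementation; the function names are hypothetical.

```typescript
// Sketch of a ~4 chars/token approximation: remote backends have no
// local tokenizer loaded, so "tokens" are fixed-width character slices.
const CHARS_PER_TOKEN = 4;

function approxCountTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function approxTokenize(text: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += CHARS_PER_TOKEN) {
    out.push(text.slice(i, i + CHARS_PER_TOKEN));
  }
  return out;
}

function approxDetokenize(tokens: string[]): string {
  return tokens.join(""); // slices concatenate back to the original text
}
```

The approximation is deliberately crude but round-trips exactly, which is all that chunking by token budget needs.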
Force-pushed 77f1bb5 to 871613e.
alexei-led left a comment:
Fixed. Changed `getDefaultLlamaCpp()` to `getDefaultLLM()` in `chunkDocumentByTokens` — this ensures the remote backend is used for tokenization when no explicit override is provided, rather than falling back to loading local GGUF models.
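The fallback change described in this comment can be sketched as follows. This is a simplified illustration with stubbed bodies; `chunkLen` is a hypothetical stand-in for the token-counting step inside `chunkDocumentByTokens`.

```typescript
// Chunking resolves its tokenizer from the generalized default-LLM
// slot (which may hold a remote backend) instead of the
// LlamaCpp-specific one.
interface TokenizingLLM {
  countTokens(text: string): number;
}

let defaultLLM: TokenizingLLM | null = null;

function setDefaultLLM(llm: TokenizingLLM): void {
  defaultLLM = llm;
}

function getDefaultLLM(): TokenizingLLM {
  if (!defaultLLM) throw new Error("no default LLM configured");
  return defaultLLM;
}

// An explicit llm override wins; otherwise use the generalized default.
function chunkLen(text: string, llm?: TokenizingLLM): number {
  const backend = llm ?? getDefaultLLM();
  return backend.countTokens(text);
}
```

The point of the fix is the `?? getDefaultLLM()` line: before, the equivalent fallback went through `getDefaultLlamaCpp()` and would load local GGUF models even when a remote backend was configured.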
Force-pushed 026dc18 to e6972eb.
- `vi.unstubAllGlobals`/`vi.stubGlobal` not available in the Bun test runner: guard with optional chaining; replace `vi.stubGlobal` with a direct `globalThis.fetch` assignment and manual restore
- `llm.tokenize` missing from fake LLM mocks in embedding batching tests: add `tokenize()` to `createFakeEmbedLlm` in store.test.ts and sdk.test.ts
- `internal.llm?.dispose` not a function when a mock LLM lacks `dispose`: use `internal.llm?.dispose?.()` in index.ts
- `chunkDocumentByTokens` falls back to `getDefaultLlamaCpp()`, bypassing the remote backend: use `getDefaultLLM()` (Copilot review suggestion)
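The first fix above (direct `globalThis.fetch` assignment with manual restore) can be sketched like this; `withStubbedFetch` is a hypothetical helper, not something from the PR's test files.

```typescript
// Manual stub/restore of globalThis.fetch, usable under Bun's test
// runner where vi.stubGlobal / vi.unstubAllGlobals are unavailable.
const realFetch = globalThis.fetch;

async function withStubbedFetch<T>(
  stub: typeof fetch,
  body: () => Promise<T>,
): Promise<T> {
  globalThis.fetch = stub;
  try {
    return await body();
  } finally {
    globalThis.fetch = realFetch; // always restore, even on test failure
  }
}
```

The `finally` block is the important part: it replicates the cleanup guarantee that `vi.unstubAllGlobals` would otherwise provide.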
Force-pushed e6972eb to 0e9f574.
see #116
FYI — we've been running a much narrower variant of this in prod (

Whatever approach works for me; I just need to be able to use a remote embedding engine.
## Problem
QMD is designed as an on-device tool, but its value — BM25 + vector search + query expansion + LLM reranking in a single pipeline — makes it equally useful in cloud and multi-agent environments. The current hard dependency on local GGUF models via node-llama-cpp creates two problems in those settings:
Both problems go away when the embedding and reranking workload moves to a dedicated remote server (vLLM, TEI, Ollama, LiteLLM, or any OpenAI-compatible endpoint).
This closes #521 and addresses the use case in #229.
## What this adds
Two new classes, plus the plumbing to activate them via environment variables.
### RemoteLLM

A drop-in `LLM` implementation that calls an OpenAI-compatible HTTP server:

- `POST /v1/embeddings`
- `POST /v1/rerank` (cross-encoder format)
- `POST /v1/chat/completions`

Features:

- Independent circuit breakers per endpoint (`embed`, `rerank`, `chat`) — an embed outage does not affect reranking or chat
- Bearer auth via `QMD_REMOTE_API_KEY`
- `qmd status` shows the remote server URLs instead of local model paths
- `RemoteLLM.modelExists()` logs a warning before returning the optimistic fail-open result when the server can't be reached, so operators know the check was skipped

### HybridLLM

A thin composite that delegates all operations to
`RemoteLLM`. Designed for future extension (e.g. local generate + remote embed), but in the current implementation runs fully remote when the local backend is `null`.

## Changes to existing code
### src/llm.ts

- `LLM` interface: adds `embedBatch()`, `tokenize()`, `countTokens()`, `detokenize()`, `isRemote?`, `embedModelName?`
- `LLMSessionManager` and `withLLMSessionForLlm` accept `LLM` instead of `LlamaCpp` only
- `getDefaultLLM()` / `setDefaultLLM()` — sit alongside `getDefaultLlamaCpp()` / `setDefaultLlamaCpp()` for backward compatibility
- `LlamaCpp.tokenize`/`detokenize` use `unknown[]` to satisfy the interface while still casting internally to `LlamaToken[]`

### src/store.ts (~45 lines changed)

- `getLlm()` returns `LLM` and falls back to `getDefaultLLM()`
- `generateEmbeddings`: uses `formatDoc()` — skips the Qwen3 task-prefix for remote backends, which don't need it
- `chunkDocumentByTokens`: accepts an optional `llm?` parameter so internal callers pass the store-scoped LLM rather than always pulling from the global singleton

### src/cli/qmd.ts (~52 lines changed)

- Reads `QMD_REMOTE_EMBED_URL` / `QMD_REMOTE_RERANK_URL` / `QMD_REMOTE_GEN_URL` at startup and builds a `HybridLLM` if set
- `models:` override in `getStore()` — skips `setDefaultLlamaCpp` when remote mode is already active, so a `models:` block in the config cannot silently revert to local

### src/index.ts (~46 lines changed)

- `StoreOptions.llm?` — inject a backend directly (useful for testing or custom integrations)
- `createStore()` checks `QMD_REMOTE_*` env vars automatically when no `llm` option is provided, so SDK users get the same remote mode as the CLI

## Usage
All `QMD_REMOTE_*` variables are optional — when unset, qmd falls back to the existing local GGUF behaviour with no behavioural change.

| Variable | Description | Default |
|---|---|---|
| `QMD_REMOTE_EMBED_URL` | Embedding server base URL (required) | — |
| `QMD_REMOTE_RERANK_URL` | Reranking server base URL (required) | — |
| `QMD_REMOTE_GEN_URL` | Chat completions server URL | `QMD_REMOTE_EMBED_URL` |
| `QMD_REMOTE_API_KEY` | Bearer token | — |
| `QMD_REMOTE_EMBED_MODEL` | Model name in embed requests | `remote-embedding` |
| `QMD_REMOTE_RERANK_MODEL` | Model name in rerank requests | `remote-reranker` |
| `QMD_REMOTE_GEN_MODEL` | Model name in chat requests | `gpt-4o-mini` |
| `QMD_REMOTE_CONNECT_TIMEOUT` | Connect timeout in ms | 5000 |
| `QMD_REMOTE_READ_TIMEOUT` | Read timeout in ms | 30000 |

## Testing
`test/remote-llm.test.ts` — 19 tests, all in-process with real HTTP servers (no mocks):

- `title: | text:` prefixes stripped on receipt
- `embed`, `rerank`, and `chat` circuits are independent
- `generate` calls chat completions and returns text
- `expandQuery` parses typed lines; falls back to lex+vec+hyde on error
- `tokenize`/`countTokens`/`detokenize` character-approximation
- `HybridLLM` routes all operations to remote without `as any` casts

## Backward compatibility
No breaking changes. All new behaviour is opt-in via environment variables. Existing local-only setups are unaffected.
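The env-var activation rules described in the PR (remote mode requires both the embed and rerank URLs; chat completions fall back to the embed URL when no gen URL is set) can be sketched as below. The helper names are illustrative, not qmd's actual internals.

```typescript
// Sketch of the QMD_REMOTE_* activation logic: remote mode turns on
// only when both required URLs are present, and the gen URL defaults
// to the embed URL.
type Env = Record<string, string | undefined>;

function remoteModeActive(env: Env): boolean {
  return Boolean(env.QMD_REMOTE_EMBED_URL && env.QMD_REMOTE_RERANK_URL);
}

function resolveGenUrl(env: Env): string | undefined {
  return env.QMD_REMOTE_GEN_URL ?? env.QMD_REMOTE_EMBED_URL;
}
```

Requiring both URLs before activating keeps a half-configured environment (say, only `QMD_REMOTE_EMBED_URL` set) on the existing local GGUF path rather than in a broken remote mode.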