Add OpenAI-compatible remote embedding and reranking #517
jhsmith409 wants to merge 3 commits into tobi:main
Conversation
jhsmith409 force-pushed from 8e26a6c to f640303
**Test results from a live vLLM deployment**

Ran the full test suite.

- Remote LLM tests (the tests added by this PR): ✅ 66/66 pass
- Full suite: 699 pass, 48 fail

The 48 failures are all in the LlamaCpp/local-model path (token chunking, query expansion, local reranking), which is expected on a machine without a local GGUF model downloaded. No failures in any remote, BM25, AST, SDK, collections, or MCP tests.

This feature works well in practice: the remote OpenAI-compatible embedding is significantly faster than CPU GGUF inference for bulk indexing. Happy to help test anything else if useful.
**Two fixes found while deploying on a live system**

While integrating this branch into production, I hit two issues and fixed them. Hopefully I've incorporated them correctly into the PR.
Both fixes above have been pushed to the branch: jhsmith409@6596448
All comments have been addressed and two production fixes have been pushed (see above). Tested against live vLLM servers: 699/747 tests passing; all 48 failures are pre-existing LlamaCpp-path issues unrelated to this PR. Ready for review.
great work! would you consider adding support for remote query expansion as well?
Let's close out this PR and get it merged. Then open an issue for remote query expansion and I'll try to address it.
Let's get query expansion in there. Add unit tests to the remote calls (maybe do a VCR pattern).
I'll tackle the first part (query expansion) and the unit tests, but I'll leave the `qmd models serve` option for someone else to implement. Does that work for you, tobi?
Remote query expansion is now implemented. Here's what was added (commit f8c6030):

**Changes**
Support offloading embedding and reranking to remote OpenAI-compatible servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation, batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
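The routing described in the commit message can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the `LLM` interface shape, method signatures, and constructor are assumptions based on the description above.

```typescript
// Illustrative sketch of the hybrid routing layer: embed/rerank go to
// the remote OpenAI-compatible server, query expansion stays local.
// Interface shape and signatures are assumptions, not the PR's code.
interface LLM {
  embed(text: string): Promise<number[]>;
  rerank(query: string, docs: string[]): Promise<number[]>;
  expandQuery(query: string): Promise<string[]>;
}

class HybridLLM implements LLM {
  constructor(private remote: LLM, private local: LLM) {}

  // Bulk/vector operations route to the remote server
  embed(text: string) { return this.remote.embed(text); }
  rerank(query: string, docs: string[]) { return this.remote.rerank(query, docs); }

  // Query expansion keeps using the local fine-tuned model
  expandQuery(query: string) { return this.local.expandQuery(query); }
}
```

Because `HybridLLM` implements the same `LLM` interface it delegates to, callers don't need to know whether a remote server is configured.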
- Add intent? to LLM interface and ILLMSession expandQuery signature
(store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
getStore() so content_vectors.model reflects the actual LLM in use
(previously always stored DEFAULT_EMBED_MODEL_URI even with remote)
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
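The first fix above amounts to widening the `expandQuery` signature so the `{ intent }` argument that store.ts already passes is declared on the interface. Roughly (other members of `ILLMSession` and the exact option shape beyond `intent` are assumptions):

```typescript
// Sketch of the widened signature: store.ts calls expandQuery(query,
// { intent }), so the interface must declare the optional field.
// Other ILLMSession members are omitted; this is not the PR's full code.
interface ILLMSession {
  expandQuery(query: string, options?: { intent?: string }): Promise<string[]>;
}

// A minimal implementation showing the call compiles with or without intent
const session: ILLMSession = {
  async expandQuery(query, options) {
    return options?.intent ? [`${query} (${options.intent})`] : [query];
  },
};
```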
- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that don't share a word with the original query, falls back gracefully on bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand, otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header fallback, lex/vec/hyde parsing, includeLexical=false filtering, fallback on bad output, query-term filtering, circuit breaker, HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned, includeLexical=false, intent incorporation, HybridLLM routing verified via LOCAL_SENTINEL (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars, skipped when absent)
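Based on the description above, the parse-and-filter step might look something like this. The function name matches the commit message, but the exact line format, the `Expansion` type, and the decision not to word-filter hyde passages are assumptions:

```typescript
// Sketch of parseExpandResponse: parse "lex:"/"vec:"/"hyde:" lines from
// the chat-completion text, drop lex/vec terms that share no word with
// the original query (guards against off-topic model output), and fall
// through silently on malformed lines. Line format is an assumption.
type Expansion = { lex: string[]; vec: string[]; hyde: string[] };

function parseExpandResponse(raw: string, query: string): Expansion {
  const queryWords = new Set(query.toLowerCase().split(/\s+/).filter(Boolean));
  const sharesWord = (term: string) =>
    term.toLowerCase().split(/\s+/).some((w) => queryWords.has(w));

  const out: Expansion = { lex: [], vec: [], hyde: [] };
  for (const line of raw.split("\n")) {
    const m = line.match(/^(lex|vec|hyde):\s*(.+)$/);
    if (!m) continue; // graceful fallback: ignore bad model output
    const [, kind, text] = m;
    if (kind === "hyde") {
      out.hyde.push(text.trim()); // assumed: hyde passages aren't word-filtered
    } else if (sharesWord(text)) {
      out[kind as "lex" | "vec"].push(text.trim());
    }
  }
  return out;
}
```

With this shape, a completely garbled completion simply yields empty arrays rather than throwing, which matches the "falls back gracefully" behavior the commit describes.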
jhsmith409 force-pushed from f8c6030 to f2fd64e
Summary
- RemoteLLM class that calls OpenAI-compatible HTTP endpoints (POST /v1/embeddings, POST /v1/rerank) for embedding and reranking, with circuit breaker, dimension validation, batch splitting, auth headers, and configurable timeouts
- HybridLLM compositor that routes embed/rerank to a remote server while keeping query expansion and tokenization local via LlamaCpp
- Extends the LLM interface with embedBatch and embedModelName, and updates the singleton/session management to accept any LLM implementation (backward-compatible)
- Configured via QMD_EMBED_API_URL + QMD_EMBED_API_MODEL env vars, or embed_api_url / embed_api_model in the YAML models: section
- HTTP via fetch()

Motivation

Allows using a GPU server (e.g. vLLM with BAAI/bge-m3 or Qwen/Qwen3-Embedding-0.6B) for embedding and reranking while keeping QMD's fine-tuned local query expansion model. Useful when the indexing machine doesn't have a GPU, or when you want to use larger/better embedding models than what fits in local VRAM.

Related: #489, #427, #446, #511
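Under the assumption that configuration resolution works as the summary describes (env vars taking precedence over the YAML models section), the config helper might be sketched like this. The `RemoteConfig` shape is an assumption, and `QMD_EMBED_API_KEY` is a guess mirroring the documented `QMD_EXPAND_API_KEY`:

```typescript
// Sketch of remoteConfigFromEnv(): env var names QMD_EMBED_API_URL /
// QMD_EMBED_API_MODEL and YAML keys embed_api_url / embed_api_model come
// from the PR description; the return shape, the env-over-YAML precedence,
// and the QMD_EMBED_API_KEY name are assumptions.
type RemoteConfig = { apiUrl: string; apiModel: string; apiKey?: string };

function remoteConfigFromEnv(
  env: Record<string, string | undefined>,
  yaml?: { embed_api_url?: string; embed_api_model?: string },
): RemoteConfig | undefined {
  const apiUrl = env.QMD_EMBED_API_URL ?? yaml?.embed_api_url;
  const apiModel = env.QMD_EMBED_API_MODEL ?? yaml?.embed_api_model;
  if (!apiUrl || !apiModel) return undefined; // remote embedding not configured
  return { apiUrl, apiModel, apiKey: env.QMD_EMBED_API_KEY };
}
```

Returning `undefined` when either value is missing lets the caller fall back to the local LlamaCpp path, which matches the backward-compatible behavior described in the summary.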
Files changed
- src/remote-llm.ts: RemoteLLM class + remoteConfigFromEnv()
- src/hybrid-llm.ts: HybridLLM routing compositor
- src/llm.ts: add embedBatch / embedModelName to the LLM interface, isRemoteModel(), generalize singleton to LLM
- src/store.ts: LlamaCpp type refs → LLM interface, graceful tokenize() fallback
- src/collections.ts: ModelsConfig
- src/cli/qmd.ts: use HybridLLM when configured
- CHANGELOG.md, README.md

Test plan
- Unit tests (test/remote-llm.test.ts): mock HTTP server covering embed, batch, auth, dimension validation, circuit breaker, rerank, HybridLLM routing, config parsing, local-only path
- Integration tests (test/remote-llm-integration.test.ts): live vLLM servers (Qwen3-Embedding-0.6B + Qwen3-Reranker-4B) covering single embed, batch, dimension consistency, normalization, semantic similarity, reranking relevance, edge cases, end-to-end search simulation
- Verified getDefaultLLM() returns LlamaCpp when no remote config, all interface methods present, tokenize() duck-typing works
- bun test (existing tests unaffected; only type-level changes to store.ts)

🤖 Generated with Claude Code