Add OpenAI-compatible remote embedding and reranking#517

Open
jhsmith409 wants to merge 3 commits into tobi:main from jhsmith409:feature/remote-openai-embed

Conversation


@jhsmith409 jhsmith409 commented Apr 6, 2026

Summary

  • Adds RemoteLLM class that calls OpenAI-compatible HTTP endpoints (POST /v1/embeddings, POST /v1/rerank) for embedding and reranking, with circuit breaker, dimension validation, batch splitting, auth headers, and configurable timeouts
  • Adds HybridLLM compositor that routes embed/rerank to a remote server while keeping query expansion and tokenization local via LlamaCpp
  • Generalizes the LLM interface with embedBatch and embedModelName, and updates the singleton/session management to accept any LLM implementation (backward-compatible)
  • Configured via QMD_EMBED_API_URL + QMD_EMBED_API_MODEL env vars, or embed_api_url/embed_api_model in the YAML models: section
  • Skips nomic/Qwen3 text formatting prefixes for remote models (they handle their own prompt formatting)
  • Zero new dependencies — uses Node.js built-in fetch()
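The embedding path described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the function names, the 64-item default batch size, and the index-based re-sorting step are assumptions; only the POST /v1/embeddings request/response shape follows the OpenAI wire format.

```typescript
// Hypothetical sketch of remote embedding over an OpenAI-compatible
// /v1/embeddings endpoint using Node's built-in fetch (no new deps).
// Names and the batch size are illustrative assumptions.

interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Pure helper: split inputs so no single request exceeds the server's
// maximum batch size.
function splitBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

async function embedBatch(
  texts: string[],
  apiUrl: string, // e.g. the value of QMD_EMBED_API_URL
  model: string, // e.g. the value of QMD_EMBED_API_MODEL
  apiKey?: string,
  batchSize = 64, // assumed default, not the PR's actual value
): Promise<number[][]> {
  const out: number[][] = [];
  for (const batch of splitBatches(texts, batchSize)) {
    const res = await fetch(`${apiUrl}/embeddings`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        ...(apiKey ? { Authorization: `Bearer ${apiKey}` } : {}),
      },
      body: JSON.stringify({ model, input: batch }),
    });
    if (!res.ok) throw new Error(`embeddings request failed: ${res.status}`);
    const json = (await res.json()) as EmbeddingResponse;
    // The response items carry an index; restore input order explicitly.
    const sorted = [...json.data].sort((a, b) => a.index - b.index);
    out.push(...sorted.map((d) => d.embedding));
  }
  return out;
}
```

Dimension validation and the circuit breaker mentioned above would wrap this call; they are omitted here to keep the sketch short.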

Motivation

Allows using a GPU server (e.g. vLLM with BAAI/bge-m3 or Qwen/Qwen3-Embedding-0.6B) for embedding and reranking while keeping QMD's fine-tuned local query expansion model. Useful when the indexing machine doesn't have a GPU, or when you want to use larger/better embedding models than what fits in local VRAM.

Related: #489, #427, #446, #511

Files changed

| File | Change |
| --- | --- |
| src/remote-llm.ts | New RemoteLLM class + remoteConfigFromEnv() |
| src/hybrid-llm.ts | New HybridLLM routing compositor |
| src/llm.ts | Add embedBatch/embedModelName to the LLM interface, isRemoteModel(), generalize singleton to LLM |
| src/store.ts | Change LlamaCpp type refs to the LLM interface, graceful tokenize() fallback |
| src/collections.ts | Add remote fields to ModelsConfig |
| src/cli/qmd.ts | Auto-detect remote config, create HybridLLM when configured |
| CHANGELOG.md | Unreleased entry |
| README.md | Configuration docs + vLLM example |

Test plan

  • 36 unit tests (test/remote-llm.test.ts) — mock HTTP server covering embed, batch, auth, dimension validation, circuit breaker, rerank, HybridLLM routing, config parsing, local-only path
  • 30 integration tests (test/remote-llm-integration.test.ts) — live vLLM servers (Qwen3-Embedding-0.6B + Qwen3-Reranker-4B) covering single embed, batch, dimension consistency, normalization, semantic similarity, reranking relevance, edge cases, end-to-end search simulation
  • Local-only path verified: getDefaultLLM() returns LlamaCpp when no remote config, all interface methods present, tokenize() duck-typing works
  • Full test suite with bun test (existing tests unaffected — only type-level changes to store.ts)

🤖 Generated with Claude Code

@jhsmith409 force-pushed the feature/remote-openai-embed branch from 8e26a6c to f640303 on April 6, 2026 at 14:23
jhsmith409 commented Apr 6, 2026

Test results from a live vLLM deployment

Ran the full test suite on feature/remote-openai-embed against real vLLM servers (no mocks):

| Endpoint | Model |
| --- | --- |
| Embedding (http://192.168.x.x:x/v1) | Qwen/Qwen3-Embedding-0.6B |
| Reranking (http://192.168.x.x:x/v1) | qwen3-reranker-4b |

Remote LLM tests (the tests added by this PR): ✅ 66/66 pass

```
bun test v1.3.11 (af24e281)

test/remote-llm-integration.test.ts:
  Embedding dimension: 1024
  Rerank scores: {
    "cookies.md": 0.37974488735198975,
    "quantum.md": 0.35188794136047363,
    "space.md": 0.35012370347976685,
    "baking.md": 0.12711383402347565,
  }
  Similarity ranking:
    git.md: 0.7216
    typescript.md: 0.4591
    docker.md: 0.4050
    cooking.md: 0.3419
    gardening.md: 0.3270

 66 pass
 0 fail
 1214 expect() calls
Ran 66 tests across 2 files. [1149.00ms]
```

Full suite: 699 pass, 48 fail

The 48 failures are all in the LlamaCpp/local-model path (token chunking, query expansion, local reranking) — expected on a machine without a local GGUF model downloaded. No failures in any remote, BM25, AST, SDK, collections, or MCP tests.


This feature works well in practice. The remote OpenAI-compatible embedding is significantly faster than CPU GGUF inference for bulk indexing — happy to help test anything else if useful.

@jhsmith409

Full test suite — bun test

```
bun test v1.3.11 (af24e281)

 699 pass
 48 fail
 2720 expect() calls
Ran 747 tests across 20 files. [555.28s]
```

All 48 failures are in the local LlamaCpp path (token-based chunking, query expansion, local reranking, hybrid pipeline) — these require a local GGUF model to be downloaded, which isn't present on this machine. They are pre-existing failures unrelated to this PR.

Failures breakdown:

  • Token-based Chunking — node-llama-cpp compile/load timeout (no local model)
  • LlamaCpp Integration — expandQuery/rerank timeout (no local model)
  • LLM Session Management — withLLMSession timeout (no local model)
  • MCP Server > hybridQuery — LLM query expansion timeout (no local model)
  • search > with LLM query expansion — same
  • MCP HTTP Transport — depends on query expansion

Zero failures in: AST chunking, BM25, collections config, store paths, SDK, formatter, intent, multi-collection filter, RRF trace, structured search, store helpers, remote LLM — i.e., everything that doesn't require a local GGUF model passes cleanly. Existing tests are unaffected.


jhsmith409 commented Apr 6, 2026

Two fixes found while deploying on a live system

While integrating this branch into production, I hit two issues and fixed them. Both fixes are now incorporated into the PR:

1. intent missing from LLM interface / ILLMSession (src/llm.ts)

store.ts calls llm.expandQuery(query, { intent }) but the interface only declared { context?, includeLexical? }, so tsc failed to build:

```
src/store.ts(3191,50): error TS2353: Object literal may only specify known properties,
and 'intent' does not exist in type '{ context?: string | undefined; includeLexical?: boolean | undefined; }'
```

Fix — add intent? to both interface declarations:

```diff
-  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean }): Promise<Queryable[]>;
+  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean; intent?: string }): Promise<Queryable[]>;
```

(Same change needed on both ILLMSession at line ~172 and LLM at line ~359.)

2. vectorIndex logs and stores the default GGUF URI as the model label, even when using remote embedding (src/cli/qmd.ts)

vectorIndex has model: string = DEFAULT_EMBED_MODEL_URI as a default parameter and passes it straight to generateEmbeddings. When remote embedding is active, getStore() sets up a HybridLLM — but model still shows/stores the GGUF string, not Qwen/Qwen3-Embedding-0.6B.

Fix — derive the label from the actual configured LLM after getStore():

```diff
   const storeInstance = getStore();
   const db = storeInstance.db;

+  // Use the actual model name from the configured LLM (may be remote, not the default GGUF URI)
+  model = getDefaultLLM().embedModelName;
+
   if (force) {
```

After this fix, qmd embed -f correctly shows and stores Model: Qwen/Qwen3-Embedding-0.6B in content_vectors.model.

@jhsmith409

Both fixes above have been pushed to the branch: jhsmith409@6596448

@jhsmith409

All comments have been addressed and two production fixes have been pushed (see above). Tested against live vLLM servers — 699/747 tests passing, all 48 failures are pre-existing LlamaCpp-path issues unrelated to this PR. Ready for review.

@viniciushsantana

great work! would you consider adding support for remote query expansion as well?

@jhsmith409

> great work! would you consider adding support for remote query expansion as well?

Let's close out this PR and get it merged. Then open an issue for remote query expansion and I'll try to address it.

tobi commented Apr 9, 2026

Let's get query expansion in there. Add unit tests to the remote calls (maybe do a VCR pattern).
And what about a qmd models serve that can serve the three models on a remote via a simple OpenAI-compatible protocol?

@jhsmith409

> Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern) And what about a qmd models serve that can serve the three models on a remote via simple OpenAI compatible protocol?

I'll tackle the first part (query expansion) and unit tests, but I'll leave the qmd models serve option for someone else to implement. Does that work for you, tobi?

@jhsmith409

Remote query expansion is now implemented. Here's what was added (commit f8c6030):

Changes

src/remote-llm.ts

  • expandQuery() now calls /chat/completions when expandApiModel is configured (throws "expandApiModel not configured" otherwise — no behavior change for existing users)
  • supportsExpand getter for routing decisions
  • Independent circuit breaker for the expand endpoint
  • parseExpandResponse() helper: parses lex:/vec:/hyde: lines, filters out variants with no term overlap with the original query, falls back gracefully on bad model output
  • remoteConfigFromEnv() now reads QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
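The lex:/vec:/hyde: parsing and overlap filtering described above can be sketched like this. It is a hedged illustration only: the Queryable shape, the exact overlap rule, and the fallback are assumptions based on the bullet points, not the PR's actual parseExpandResponse.

```typescript
// Illustrative sketch of parsing chat-completion output of the form
// "lex: ...", "vec: ...", "hyde: ..." into query variants, with the
// term-overlap guard and graceful fallback described in the PR notes.
type Queryable = { type: "lex" | "vec" | "hyde"; text: string };

function parseExpandResponse(raw: string, originalQuery: string): Queryable[] {
  const queryTerms = new Set(
    originalQuery.toLowerCase().split(/\s+/).filter(Boolean),
  );
  const results: Queryable[] = [];
  for (const line of raw.split("\n")) {
    const match = line.trim().match(/^(lex|vec|hyde):\s*(.+)$/);
    if (!match) continue; // ignore lines the model formatted badly
    const [, type, text] = match;
    // Drop variants that share no term with the original query —
    // a cheap guard against off-topic expansions.
    const terms = text.toLowerCase().split(/\s+/);
    if (!terms.some((t) => queryTerms.has(t))) continue;
    results.push({ type: type as Queryable["type"], text });
  }
  // Fall back gracefully when nothing parseable survived.
  if (results.length === 0) return [{ type: "lex", text: originalQuery }];
  return results;
}
```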

src/hybrid-llm.ts

  • expandQuery() routes to remote when remote instanceof RemoteLLM && remote.supportsExpand, otherwise falls back to local LlamaCpp — no interface changes
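The routing rule above amounts to something like the following sketch; the class skeletons are illustrative stand-ins, not the PR's actual LlamaCpp/RemoteLLM classes.

```typescript
// Illustrative stubs standing in for the real local and remote clients.
class LocalLLM {
  async expandQuery(q: string): Promise<string[]> {
    return [`local:${q}`];
  }
}

class RemoteLLM {
  constructor(private expandApiModel?: string) {}
  // Remote expansion is only supported when an expand model is configured.
  get supportsExpand(): boolean {
    return this.expandApiModel !== undefined;
  }
  async expandQuery(q: string): Promise<string[]> {
    return [`remote:${q}`];
  }
}

class HybridLLM {
  constructor(
    private local: LocalLLM,
    private remote: RemoteLLM | null,
  ) {}

  // Route to the remote only when it can expand; otherwise fall back
  // to the local model, so local-only behavior is unchanged.
  async expandQuery(query: string): Promise<string[]> {
    if (this.remote instanceof RemoteLLM && this.remote.supportsExpand) {
      return this.remote.expandQuery(query);
    }
    return this.local.expandQuery(query);
  }
}
```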

test/remote-llm.test.ts (unit tests, mock HTTP server)

New expandQuery describe block covering:

  • supportsExpand flag behavior
  • /chat/completions payload shape (model, message roles, intent inclusion)
  • Auth header: expandApiKey → falls back to embedApiKey
  • lex/vec/hyde parsing, includeLexical: false filtering
  • Fallback Queryable[] when model output is unparseable
  • Query-term filtering (variants with no overlap are dropped)
  • Circuit breaker trips after 3 failures
  • HybridLLM routing: remote when expandApiModel set, local when not
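For context, a breaker that trips after 3 consecutive failures (the behavior the unit tests above check) can be sketched as below. The threshold, cooldown, and method names are assumptions for illustration, not the PR's implementation.

```typescript
// Minimal circuit-breaker sketch: open after N consecutive failures,
// half-open again after a cooldown so a trial request can reset it.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 3, // assumed trip count
    private readonly cooldownMs = 30_000, // assumed cooldown
  ) {}

  get isOpen(): boolean {
    if (this.openedAt === null) return false;
    // After the cooldown, report closed so one trial request goes through.
    return Date.now() - this.openedAt < this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = Date.now();
  }
}
```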

test/remote-llm-integration.test.ts (live server)

  • New VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars (tests skipped when absent)
  • All three types returned, includeLexical: false, intent incorporation
  • HybridLLM routing verified via LOCAL_SENTINEL sentinel value

Test results

Unit tests: 52/52 pass
Integration tests (live vLLM): 37/37 pass
Full suite: 773 pass, 48 fail (same pre-existing LlamaCpp failures as before — no regressions)

Jim Smith and others added 3 commits on April 12, 2026 at 18:26
Support offloading embedding and reranking to remote OpenAI-compatible
servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query
expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation,
  batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton
  and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add intent? to LLM interface and ILLMSession expandQuery signature
  (store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
  getStore() so content_vectors.model reflects the actual LLM in use
  (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is
  configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that
  don't share a word with the original query, falls back gracefully on
  bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand,
  otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL /
  QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header
  fallback, lex/vec/hyde parsing, includeLexical=false filtering,
  fallback on bad output, query-term filtering, circuit breaker,
  HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned,
  includeLexical=false, intent incorporation, HybridLLM routing verified
  via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL
  env vars, skipped when absent)