Add OpenAI-compatible remote embedding and reranking#517

Open
jhsmith409 wants to merge 3 commits into tobi:main from jhsmith409:feature/remote-openai-embed

Conversation


@jhsmith409 jhsmith409 commented Apr 6, 2026

Summary

  • Adds RemoteLLM class that calls OpenAI-compatible HTTP endpoints (POST /v1/embeddings, POST /v1/rerank) for embedding and reranking, with circuit breaker, dimension validation, batch splitting, auth headers, and configurable timeouts
  • Adds HybridLLM compositor that routes embed/rerank to a remote server while keeping query expansion and tokenization local via LlamaCpp
  • Generalizes the LLM interface with embedBatch and embedModelName, and updates the singleton/session management to accept any LLM implementation (backward-compatible)
  • Configured via QMD_EMBED_API_URL + QMD_EMBED_API_MODEL env vars, or embed_api_url/embed_api_model in the YAML models: section
  • Skips nomic/Qwen3 text formatting prefixes for remote models (they handle their own prompt formatting)
  • Zero new dependencies — uses Node.js built-in fetch()
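The embedding path described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the function names, the 64-item default batch size, and the index-based re-sorting step are assumptions; only the POST /v1/embeddings request/response shape follows the OpenAI wire format.

```typescript
// Hypothetical sketch of remote embedding over an OpenAI-compatible
// /v1/embeddings endpoint using Node's built-in fetch (no new deps).
// Names and the batch size are illustrative assumptions.

interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Pure helper: split inputs so no single request exceeds the server's
// maximum batch size.
function splitBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

async function embedBatch(
  texts: string[],
  apiUrl: string, // e.g. the value of QMD_EMBED_API_URL
  model: string, // e.g. the value of QMD_EMBED_API_MODEL
  apiKey?: string,
  batchSize = 64, // assumed default, not the PR's actual value
): Promise<number[][]> {
  const out: number[][] = [];
  for (const batch of splitBatches(texts, batchSize)) {
    const res = await fetch(`${apiUrl}/embeddings`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        ...(apiKey ? { Authorization: `Bearer ${apiKey}` } : {}),
      },
      body: JSON.stringify({ model, input: batch }),
    });
    if (!res.ok) throw new Error(`embeddings request failed: ${res.status}`);
    const json = (await res.json()) as EmbeddingResponse;
    // The response items carry an index; restore input order explicitly.
    const sorted = [...json.data].sort((a, b) => a.index - b.index);
    out.push(...sorted.map((d) => d.embedding));
  }
  return out;
}
```

Dimension validation and the circuit breaker mentioned above would wrap this call; they are omitted here to keep the sketch short.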

Motivation

Allows using a GPU server (e.g. vLLM with BAAI/bge-m3 or Qwen/Qwen3-Embedding-0.6B) for embedding and reranking while keeping QMD's fine-tuned local query expansion model. Useful when the indexing machine doesn't have a GPU, or when you want to use larger/better embedding models than what fits in local VRAM.

Related: #489, #427, #446, #511

Files changed

| File | Change |
| --- | --- |
| src/remote-llm.ts | New RemoteLLM class + remoteConfigFromEnv() |
| src/hybrid-llm.ts | New HybridLLM routing compositor |
| src/llm.ts | Add embedBatch/embedModelName to the LLM interface, isRemoteModel(), generalize singleton to LLM |
| src/store.ts | Change LlamaCpp type refs to the LLM interface, graceful tokenize() fallback |
| src/collections.ts | Add remote fields to ModelsConfig |
| src/cli/qmd.ts | Auto-detect remote config, create HybridLLM when configured |
| CHANGELOG.md | Unreleased entry |
| README.md | Configuration docs + vLLM example |

Test plan

  • 36 unit tests (test/remote-llm.test.ts) — mock HTTP server covering embed, batch, auth, dimension validation, circuit breaker, rerank, HybridLLM routing, config parsing, local-only path
  • 30 integration tests (test/remote-llm-integration.test.ts) — live vLLM servers (Qwen3-Embedding-0.6B + Qwen3-Reranker-4B) covering single embed, batch, dimension consistency, normalization, semantic similarity, reranking relevance, edge cases, end-to-end search simulation
  • Local-only path verified: getDefaultLLM() returns LlamaCpp when no remote config, all interface methods present, tokenize() duck-typing works
  • Full test suite with bun test (existing tests unaffected — only type-level changes to store.ts)

🤖 Generated with Claude Code

@jhsmith409 force-pushed the feature/remote-openai-embed branch from 8e26a6c to f640303 on April 6, 2026 at 14:23
jhsmith409 commented Apr 6, 2026

Test results from a live vLLM deployment

Ran the full test suite on feature/remote-openai-embed against real vLLM servers (no mocks):

| Endpoint | Model |
| --- | --- |
| Embedding (http://192.168.x.x:x/v1) | Qwen/Qwen3-Embedding-0.6B |
| Reranking (http://192.168.x.x:x/v1) | qwen3-reranker-4b |

Remote LLM tests (the tests added by this PR): ✅ 66/66 pass

```
bun test v1.3.11 (af24e281)

test/remote-llm-integration.test.ts:
  Embedding dimension: 1024
  Rerank scores: {
    "cookies.md": 0.37974488735198975,
    "quantum.md": 0.35188794136047363,
    "space.md": 0.35012370347976685,
    "baking.md": 0.12711383402347565,
  }
  Similarity ranking:
    git.md: 0.7216
    typescript.md: 0.4591
    docker.md: 0.4050
    cooking.md: 0.3419
    gardening.md: 0.3270

 66 pass
 0 fail
 1214 expect() calls
Ran 66 tests across 2 files. [1149.00ms]
```

Full suite: 699 pass, 48 fail

The 48 failures are all in the LlamaCpp/local-model path (token chunking, query expansion, local reranking) — expected on a machine without a local GGUF model downloaded. No failures in any remote, BM25, AST, SDK, collections, or MCP tests.


This feature works well in practice. The remote OpenAI-compatible embedding is significantly faster than CPU GGUF inference for bulk indexing — happy to help test anything else if useful.

@jhsmith409

Full test suite — bun test

```
bun test v1.3.11 (af24e281)

 699 pass
 48 fail
 2720 expect() calls
Ran 747 tests across 20 files. [555.28s]
```

All 48 failures are in the local LlamaCpp path (token-based chunking, query expansion, local reranking, hybrid pipeline) — these require a local GGUF model to be downloaded, which isn't present on this machine. They are pre-existing failures unrelated to this PR.

Failures breakdown:

  • Token-based Chunking — node-llama-cpp compile/load timeout (no local model)
  • LlamaCpp Integration — expandQuery/rerank timeout (no local model)
  • LLM Session Management — withLLMSession timeout (no local model)
  • MCP Server > hybridQuery — LLM query expansion timeout (no local model)
  • search > with LLM query expansion — same
  • MCP HTTP Transport — depends on query expansion

Zero failures in: AST chunking, BM25, collections config, store paths, SDK, formatter, intent, multi-collection filter, RRF trace, structured search, store helpers, remote LLM — i.e., everything that doesn't require a local GGUF model passes cleanly. Existing tests are unaffected.


jhsmith409 commented Apr 6, 2026

Two fixes found while deploying on a live system

While integrating this branch into production, I hit two issues and fixed them. Both fixes are now incorporated into the PR:

1. intent missing from LLM interface / ILLMSession (src/llm.ts)

store.ts calls llm.expandQuery(query, { intent }) but the interface only declared { context?, includeLexical? }, so tsc failed to build:

```
src/store.ts(3191,50): error TS2353: Object literal may only specify known properties,
and 'intent' does not exist in type '{ context?: string | undefined; includeLexical?: boolean | undefined; }'
```

Fix — add intent? to both interface declarations:

```diff
-  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean }): Promise<Queryable[]>;
+  expandQuery(query: string, options?: { context?: string; includeLexical?: boolean; intent?: string }): Promise<Queryable[]>;
```

(Same change needed on both ILLMSession at line ~172 and LLM at line ~359.)

2. vectorIndex logs and stores the default GGUF URI as the model label, even when using remote embedding (src/cli/qmd.ts)

vectorIndex has model: string = DEFAULT_EMBED_MODEL_URI as a default parameter and passes it straight to generateEmbeddings. When remote embedding is active, getStore() sets up a HybridLLM — but model still shows/stores the GGUF string, not Qwen/Qwen3-Embedding-0.6B.

Fix — derive the label from the actual configured LLM after getStore():

```diff
   const storeInstance = getStore();
   const db = storeInstance.db;

+  // Use the actual model name from the configured LLM (may be remote, not the default GGUF URI)
+  model = getDefaultLLM().embedModelName;
+
   if (force) {
```

After this fix, qmd embed -f correctly shows and stores Model: Qwen/Qwen3-Embedding-0.6B in content_vectors.model.

@jhsmith409

Both fixes above have been pushed to the branch: jhsmith409@6596448

@jhsmith409

All comments have been addressed and two production fixes have been pushed (see above). Tested against live vLLM servers — 699/747 tests passing, all 48 failures are pre-existing LlamaCpp-path issues unrelated to this PR. Ready for review.

@viniciushsantana

great work! would you consider adding support for remote query expansion as well?

@jhsmith409

> great work! would you consider adding support for remote query expansion as well?

Let's close out this PR and get it merged. Then open an issue for remote query expansion and I'll try to address it.

tobi commented Apr 9, 2026

Let's get query expansion in there. Add unit tests to the remote calls (maybe do a VCR pattern).
And what about a qmd models serve that can serve the three models on a remote via a simple OpenAI-compatible protocol?

@jhsmith409

> Let's get query expansion in there. Add unit tests to the remote calls (maybe do a vcr pattern) And what about a qmd models serve that can serve the three models on a remote via simple OpenAI compatible protocol?

I'll tackle the first part (query expansion) and unit tests, but I'll leave the qmd models serve option for someone else to implement. Does that work for you, tobi?

@jhsmith409

Remote query expansion is now implemented. Here's what was added (commit f8c6030):

Changes

src/remote-llm.ts

  • expandQuery() now calls /chat/completions when expandApiModel is configured (throws "expandApiModel not configured" otherwise — no behavior change for existing users)
  • supportsExpand getter for routing decisions
  • Independent circuit breaker for the expand endpoint
  • parseExpandResponse() helper: parses lex:/vec:/hyde: lines, filters out variants with no term overlap with the original query, falls back gracefully on bad model output
  • remoteConfigFromEnv() now reads QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL / QMD_EXPAND_API_KEY and YAML expand_api_* fields
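The lex:/vec:/hyde: parsing and overlap filtering described above can be sketched like this. It is a hedged illustration only: the Queryable shape, the exact overlap rule, and the fallback are assumptions based on the bullet points, not the PR's actual parseExpandResponse.

```typescript
// Illustrative sketch of parsing chat-completion output of the form
// "lex: ...", "vec: ...", "hyde: ..." into query variants, with the
// term-overlap guard and graceful fallback described in the PR notes.
type Queryable = { type: "lex" | "vec" | "hyde"; text: string };

function parseExpandResponse(raw: string, originalQuery: string): Queryable[] {
  const queryTerms = new Set(
    originalQuery.toLowerCase().split(/\s+/).filter(Boolean),
  );
  const results: Queryable[] = [];
  for (const line of raw.split("\n")) {
    const match = line.trim().match(/^(lex|vec|hyde):\s*(.+)$/);
    if (!match) continue; // ignore lines the model formatted badly
    const [, type, text] = match;
    // Drop variants that share no term with the original query —
    // a cheap guard against off-topic expansions.
    const terms = text.toLowerCase().split(/\s+/);
    if (!terms.some((t) => queryTerms.has(t))) continue;
    results.push({ type: type as Queryable["type"], text });
  }
  // Fall back gracefully when nothing parseable survived.
  if (results.length === 0) return [{ type: "lex", text: originalQuery }];
  return results;
}
```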

src/hybrid-llm.ts

  • expandQuery() routes to remote when remote instanceof RemoteLLM && remote.supportsExpand, otherwise falls back to local LlamaCpp — no interface changes
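The routing rule above amounts to something like the following sketch; the class skeletons are illustrative stand-ins, not the PR's actual LlamaCpp/RemoteLLM classes.

```typescript
// Illustrative stubs standing in for the real local and remote clients.
class LocalLLM {
  async expandQuery(q: string): Promise<string[]> {
    return [`local:${q}`];
  }
}

class RemoteLLM {
  constructor(private expandApiModel?: string) {}
  // Remote expansion is only supported when an expand model is configured.
  get supportsExpand(): boolean {
    return this.expandApiModel !== undefined;
  }
  async expandQuery(q: string): Promise<string[]> {
    return [`remote:${q}`];
  }
}

class HybridLLM {
  constructor(
    private local: LocalLLM,
    private remote: RemoteLLM | null,
  ) {}

  // Route to the remote only when it can expand; otherwise fall back
  // to the local model, so local-only behavior is unchanged.
  async expandQuery(query: string): Promise<string[]> {
    if (this.remote instanceof RemoteLLM && this.remote.supportsExpand) {
      return this.remote.expandQuery(query);
    }
    return this.local.expandQuery(query);
  }
}
```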

test/remote-llm.test.ts (unit tests, mock HTTP server)

New expandQuery describe block covering:

  • supportsExpand flag behavior
  • /chat/completions payload shape (model, message roles, intent inclusion)
  • Auth header: expandApiKey → falls back to embedApiKey
  • lex/vec/hyde parsing, includeLexical: false filtering
  • Fallback Queryable[] when model output is unparseable
  • Query-term filtering (variants with no overlap are dropped)
  • Circuit breaker trips after 3 failures
  • HybridLLM routing: remote when expandApiModel set, local when not
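For context, a breaker that trips after 3 consecutive failures (the behavior the unit tests above check) can be sketched as below. The threshold, cooldown, and method names are assumptions for illustration, not the PR's implementation.

```typescript
// Minimal circuit-breaker sketch: open after N consecutive failures,
// half-open again after a cooldown so a trial request can reset it.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly threshold = 3, // assumed trip count
    private readonly cooldownMs = 30_000, // assumed cooldown
  ) {}

  get isOpen(): boolean {
    if (this.openedAt === null) return false;
    // After the cooldown, report closed so one trial request goes through.
    return Date.now() - this.openedAt < this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = Date.now();
  }
}
```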

test/remote-llm-integration.test.ts (live server)

  • New VLLM_EXPAND_URL / VLLM_EXPAND_MODEL env vars (tests skipped when absent)
  • All three types returned, includeLexical: false, intent incorporation
  • HybridLLM routing verified via LOCAL_SENTINEL sentinel value

Test results

Unit tests: 52/52 pass
Integration tests (live vLLM): 37/37 pass
Full suite: 773 pass, 48 fail (same pre-existing LlamaCpp failures as before — no regressions)

Jim Smith and others added 3 commits on April 12, 2026 at 18:26
Support offloading embedding and reranking to remote OpenAI-compatible
servers (vLLM, Ollama, LM Studio, OpenAI) while preserving local query
expansion and tokenization via a hybrid routing layer.

- RemoteLLM: HTTP client with circuit breaker, dimension validation,
  batch splitting, auth headers, configurable timeouts
- HybridLLM: routes embed/rerank → remote, generate/expand → local
- LLM interface: add embedBatch, embedModelName; generalize singleton
  and session management from LlamaCpp to LLM
- Config: QMD_EMBED_API_URL/MODEL env vars or YAML models section
- Skip nomic/Qwen3 text formatting prefixes for remote models
- 36 unit tests + 30 integration tests against live vLLM

Related: tobi#489, tobi#427, tobi#446, tobi#511

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add intent? to LLM interface and ILLMSession expandQuery signature
  (store.ts passes { intent } but interface didn't declare it — tsc error)
- Derive embed model label from getDefaultLLM().embedModelName after
  getStore() so content_vectors.model reflects the actual LLM in use
  (previously always stored DEFAULT_EMBED_MODEL_URI even with remote)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- RemoteLLM.expandQuery() calls /chat/completions when expandApiModel is
  configured; throws "expandApiModel not configured" otherwise
- Independent circuit breaker for the expand endpoint
- parseExpandResponse() parses lex/vec/hyde lines, filters terms that
  don't share a word with the original query, falls back gracefully on
  bad model output
- RemoteLLM.supportsExpand getter for routing decisions
- HybridLLM routes expandQuery to remote when remote.supportsExpand,
  otherwise falls back to local LlamaCpp (no interface changes)
- remoteConfigFromEnv() handles QMD_EXPAND_API_URL / QMD_EXPAND_API_MODEL /
  QMD_EXPAND_API_KEY and YAML expand_api_* fields
- Unit tests (mock HTTP server, VCR-style): payload shape, auth header
  fallback, lex/vec/hyde parsing, includeLexical=false filtering,
  fallback on bad output, query-term filtering, circuit breaker,
  HybridLLM routing (remote vs local), config env vars
- Integration tests: live server connectivity, all three types returned,
  includeLexical=false, intent incorporation, HybridLLM routing verified
  via LOCAL_SENTINEL sentinel (new VLLM_EXPAND_URL / VLLM_EXPAND_MODEL
  env vars, skipped when absent)